Data Quality - Notes

Author

Nils Rechberger

Published

February 9, 2026

Lecture 01: Introduction

Relevance of Data Quality

A key advance in data quality emerged in the 1920s through R. A. Fisher’s work in experimental design, which introduced randomization and replication to estimate error, bias, and precision.

Definitions

Data

Data are abstract representations of selected features of real-world entities, events, and concepts, expressed and understood through clearly definable conventions (Sebastian-Coleman, 2013).

Data Quality

  • Contextual data quality: How well the data meet the expectations of data consumers, i.e., how well they serve the purposes of their intended use.
  • Intrinsic data quality: How well the data represent the objects, events, and concepts they are created to represent.

Data Management

DQ as part of Data Management

Data Governance sits at the center (the hub) because it provides oversight, direction, policies, and coordination. Data quality is but one element of effective data management.

Data Management

Role of Data Quality Managers

  • Develop a governed approach to make data fit for purpose based on data consumers' requirements.
  • Define standards, requirements, and specifications for data quality controls as part of the data lifecycle.
  • Define and implement processes to measure, monitor, and report on data quality levels.
  • Identify and advocate for opportunities to improve the quality of data, through process and system improvements.

Data Quality Dimensions

  • Correctness/accuracy: The data accurately describe the entity in question
  • Completeness: No missing records/field values
  • Conformity/validity: Correct types, value ranges, etc.
  • Consistency: No contradictions between different data sets / tables

Note: Breaking the issue down into dimensions helps to quantify issues.

Note

Most sources agree that the concept of data quality has several dimensions. But they don’t agree on what these dimensions are.
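Dimensions such as completeness and validity can be quantified directly. A minimal pure-Python sketch, with made-up records and a made-up plausibility rule for age:

```python
# Hypothetical toy records; field names and values are invented for illustration.
records = [
    {"id": 1, "age": 34,   "country": "AT"},
    {"id": 2, "age": None, "country": "DE"},   # missing age
    {"id": 3, "age": 212,  "country": "XX"},   # implausible age
]

def completeness(records, field):
    """Share of records with a non-missing value in `field`."""
    return sum(r[field] is not None for r in records) / len(records)

def validity(records, field, rule):
    """Share of non-missing values in `field` that satisfy a validity rule."""
    values = [r[field] for r in records if r[field] is not None]
    return sum(rule(v) for v in values) / len(values)

print(completeness(records, "age"))                       # 2 of 3 non-missing
print(validity(records, "age", lambda a: 0 <= a <= 120))  # 1 of 2 plausible
```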

Lecture 02: Data Quality Dimensions

In order to measure a broad concept such as data quality, it needs to be broken down into measurable, actionable dimensions. Each dimension captures one measurable aspect of data quality. Measuring data quality serves to analyze/understand/resolve or minimize data quality problems.

Note: Some dimensions are objective/quantitative, others subjective/qualitative.

Classification of Dimensions

Objective vs. subjective measurement

  • Objective: Verifiable by rules, counts, constraints, timestamps, or comparisons. Independent of user perception.
  • Subjective: Depends on human judgment, expectations, or context of use.

Data value (intrinsic) vs. usage-dependent (contextual)

  • Data value: Intrinsic quality of the dataset itself. Affects correctness, analytical validity, and downstream results even if usage is unknown. E.g. completeness, consistency
  • Usage: Quality emerges only in interaction with users, systems, or tasks. Context-dependent.

Role of Representation

Besides conformance/validity, another prerequisite of correctness/accuracy is correct representation.

Caution
  • December 13, 1941 -> invalid (does not conform to the YYYY-MM-DD format)
  • 1941-13-12 -> valid entry syntactically, but wrong representation (month and day swapped)
  • 1941-12-13 -> valid and correctly represented, hence correct/accurate (provided the information is intrinsically accurate)

Role of Metadata

Metadata are data about data. Without metadata, it can be difficult to make sense of data. High quality metadata is important for representational effectiveness. Standard database metadata includes:

  • Table/ column names and definitions
  • Data types and domains of values
  • Whether columns can have NULL values
  • Rules regarding data relationships
  • Cardinality (Number of unique values compared to the total rows, e.g. gender, boolean flags)

Lecture 03: Exploratory Data Analysis

Data Preparation

The process of manipulating data before analysis is referred to as data preparation. The meaning of the term depends on point of view, role, and intended data use.

Point of view

The meaning of data preparation varies depending on one’s position in the hierarchy.

Note: We take a statistical view and focus on the EXPLORE/TRANSFORM level.

Role and intended data use

From Data to Understanding

The process from data to understanding is often described as a cycle.

“Data Wrangling is the ability to take a messy, unrefined source of data and wrangle it into something useful. It’s the art of using computer programming to extract raw data and creating clear and actionable bits of information for your analysis. Data wrangling is the entire front end of the analytic process and requires numerous tasks that can be categorized within the get, clean, and transform components.” (Bradley Boehmke)

Data understanding cycle

Exploratory Data Analysis (EDA)

Tip

Always plot your data first.

Raw Approach:

  • Examine the data before applying a specific probability model
  • Focus on descriptive analysis over hypothesis testing
  • Methods of graphical data analysis

Iterative cycle:

  • Generate questions about your data.
  • Search for answers by transforming, visualizing, and modeling your data.

Steps in the EDA Process

Identify data attributes. Determine measurement scales

Hierarchy of Scales

Univariate data analysis. Recognize basic properties of the data (e.g. boxplot)

Goal: Recognize basic properties of the data.

Description of central tendency:
  • Arithmetic mean → Very sensitive to outliers
  • Median → Splits data in half (50% smaller, 50% larger) → More robust to outliers than mean
Description of variability/ dispersion:
  • Empirical variance
  • Empirical standard deviation
  • 1st quartile/ 25th percentile → 25% smaller, 75% larger.
  • 3rd quartile/ 75th percentile → 75% smaller, 25% larger.
  • Interquartile range (IQR) = 3rd quartile – 1st quartile → More robust than standard deviation
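These summary statistics are all available in Python's standard library. A sketch with a made-up sample that includes one outlier, to show the mean/median contrast:

```python
import statistics

data = [2, 3, 3, 4, 5, 7, 9, 11, 40]  # invented sample; 40 is an outlier

mean = statistics.mean(data)                  # pulled upward by the outlier
median = statistics.median(data)              # robust: stays at 5
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles (exclusive method)
iqr = q3 - q1                                 # robust measure of dispersion
sd = statistics.stdev(data)                   # empirical standard deviation

print(mean, median, iqr)  # mean >> median because of the outlier
```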

Bivariate & Multivariate data analysis. Recognize interactions in the data

Goal: Using bivariate & multivariate data analysis to recognize interactions in the data by using Plots and hypothesis tests.

Typical graphical procedures:
  • Scatterplot
  • Mosaic plot

Typical tests:
  • Pearson chi-square test
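The Pearson chi-square statistic for a contingency table can be computed directly from observed and expected counts; the counts below are invented for illustration:

```python
# Pearson chi-square statistic for a contingency table (pure-Python sketch).
# The 2x2 counts are made up for illustration.
table = [[30, 10],   # group A: outcome yes / no
         [20, 40]]   # group B: outcome yes / no

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# chi2 = sum over cells of (observed - expected)^2 / expected,
# with expected = row_total * col_total / n under independence.
chi2 = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(len(table))
    for j in range(len(table[0]))
)
print(chi2)
```

For a 2x2 table (one degree of freedom), a statistic above the 5% critical value of about 3.84 rejects independence.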

Following EDA steps

  • Detect aberrant & missing values. Analysis & adjustment
  • Outlier detection. Analysis & cleaning
  • Create derived variables (index, etc.) & Variable transformation
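One common outlier-detection step uses Tukey's fence rule based on the IQR; a sketch with a made-up sample:

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fence rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

print(iqr_outliers([2, 3, 3, 4, 5, 7, 9, 11, 40]))  # [40]
```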

Lecture 04: Missing Values

Why do we care about missing data?

Naively, one might think that missing data just means there is less data to analyze, so long as there are enough data to begin with. Unfortunately, this is only true in rare circumstances. In many cases, missing data may actually bias the results (give systematically wrong answers).

Note: Missing data is the rule, not the exception.

Reasons for missing data

At the source

  • Survey respondents refuse to answer certain questions
  • Answers are invalid or cannot be decoded – for example, in paper questionnaires.
  • In longitudinal studies, parts of the sample do not participate at all measurement time points
  • Sensor failure

Causes during data processing

  • Typing errors during coding
  • Reading errors in digitalizing (for example, scanning of paper questionnaires)

Note: The more technical reasons are often the most benign, because they may occur randomly and therefore don’t lead to bias.

Forms of non-response

  • Unit non-response: subjects refuse to participate or are systematically unreachable
  • Item non-response: subjects refuse to answer specific questions

Forms of item non-response (Little and Rubin, 2020)

Missing data mechanism

The missing data mechanism describes the underlying process that led to the missing data. It is crucial for determining which methods are adequate for dealing with the problem, if any.

Missing Completely at Random (MCAR)

The probability of a value being missing does not depend on any observed or missing data. I.e., missing values are completely randomly distributed across all cases (persons, etc.). Cases with missing values do not differ systematically from cases without missing values.

Missing at Random (MAR)

The probability of a value being missing may depend on observed data, but not on missing data. The occurrence of a missing value occurs conditionally at random and can be explained by the values in other variables. Cases with missing values may differ systematically from cases without missing values, but in a way that can be modeled.

Missing not at Random (MNAR)

The probability of a value being missing depends on missing data. Values are systematically missing but no information is available to model their absence. There is no adequate statistical procedure to avoid bias.
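The three mechanisms can be made concrete by simulating how values go missing; the variables (income, age) and all probabilities below are arbitrary illustrations:

```python
import random
random.seed(0)

# Made-up data: incomes and ages for 1000 cases.
incomes = [random.gauss(50_000, 15_000) for _ in range(1000)]
ages = [random.randint(20, 70) for _ in range(1000)]

# MCAR: every value has the same 20% chance of being missing,
# independent of everything.
mcar = [None if random.random() < 0.2 else inc for inc in incomes]

# MAR: missingness depends only on an observed variable (age),
# not on the income value itself.
mar = [None if (age > 60 and random.random() < 0.5) else inc
       for inc, age in zip(incomes, ages)]

# MNAR: missingness depends on the unobserved value itself,
# e.g. high earners refuse to report their income.
mnar = [None if (inc > 70_000 and random.random() < 0.7) else inc
        for inc in incomes]
```

Under MNAR, the observed mean income is biased downward, and nothing in the observed data reveals this.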

Dealing with Missing Data

Listwise deletion / Complete case analysis

Delete all rows that have a missing value. Advantage: simple. Disadvantages: Wasteful. Imagine a dataset with 10 attributes, each with 10% missings. Then \(P\text{(complete row)} = 0.9^{10} \approx 35\%\).
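Both the deletion step and the arithmetic can be checked directly:

```python
# Listwise deletion: keep only fully observed rows (toy data).
rows = [[1, 2, 3], [4, None, 6], [7, 8, 9]]
complete = [r for r in rows if all(v is not None for v in r)]
print(complete)  # [[1, 2, 3], [7, 8, 9]]

# Probability that a row survives when each of 10 attributes is
# independently missing with probability 10%:
p_complete = 0.9 ** 10
print(round(p_complete, 2))  # 0.35 -> ~65% of rows are discarded
```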

Pairwise deletion / Available case analysis

For each computation (e.g., each pairwise correlation), use all rows that are complete on the variables involved, instead of deleting every row with any missing value. Advantages: simple & less wasteful. Disadvantage: results in different sample sizes for different parts of the model.

Single Imputation

Generally, imputation refers to the practice of replacing missing values with values constructed in some way. Specifically, single imputation means that each missing value is replaced by one value.

Mean Imputation

NAs are replaced by the mean of each variable. Advantages: Simple. Disadvantages: Not unbiased for regression or correlation.

Regression Imputation

For each target variable to be imputed, run a regression on some or all other variables and use it to predict the missing values. Advantages: Under MCAR and MAR, unbiased for regression. Disadvantages: Not unbiased for correlation.

Stochastic Imputation

For each target variable to be imputed, randomly draw a value from some distribution. Advantages: Under MCAR and MAR, unbiased even for correlation. Disadvantages: Standard errors too small.
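A sketch of mean and stochastic imputation in pure Python; the values are made up, and fitting a normal distribution for the stochastic draws is an illustrative assumption:

```python
import random
import statistics

random.seed(1)

values = [12.0, None, 15.0, None, 14.0, 13.0, None, 16.0]  # toy data
observed = [v for v in values if v is not None]
mu = statistics.mean(observed)
sd = statistics.stdev(observed)

# Mean imputation: every NA becomes the observed mean.
# This shrinks the variance of the imputed variable.
mean_imputed = [mu if v is None else v for v in values]

# Stochastic imputation: draw each replacement from a distribution
# fitted to the observed values (here: normal with observed mu, sd).
stoch_imputed = [random.gauss(mu, sd) if v is None else v for v in values]
```

Comparing `statistics.stdev(mean_imputed)` with `sd` shows the variance shrinkage that makes mean imputation biased for correlations.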

Overview of Imputation Methods

Overview of Simple Methods

Multiple Imputation

In multiple imputation, instead of applying stochastic imputation once, it is applied m times (e.g., m = 5), creating m complete datasets. The datasets are analyzed separately, and the results are combined (pooled) according to certain rules. Advantages: The only method that can yield correct standard errors, even under MAR. Disadvantages: Still cannot handle MNAR (nothing can, only more data).

Note: In ML, we’re often not interested in standard errors.
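The standard pooling rules are Rubin's rules: the pooled estimate is the mean of the m estimates, and the total variance combines the within- and between-imputation variance. A sketch with made-up per-imputation results:

```python
import statistics

# Made-up results from analyzing m = 5 imputed datasets:
# one estimate (e.g. a regression coefficient) and its variance per dataset.
estimates = [2.1, 2.3, 1.9, 2.2, 2.0]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]  # squared standard errors

m = len(estimates)
pooled_estimate = statistics.mean(estimates)
within = statistics.mean(variances)            # average within-imputation variance
between = statistics.variance(estimates)       # between-imputation variance
total_variance = within + (1 + 1 / m) * between  # Rubin's total variance
pooled_se = total_variance ** 0.5

print(pooled_estimate, pooled_se)
```

The between-imputation term is what single imputation ignores, which is why its standard errors come out too small.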

Little’s MCAR Test

It essentially tests if the means / covariances are the same under each missingness pattern.

  • \(H_0\): The data are MCAR
  • \(H_A\): They are not

Important

The test has low power: even if the test doesn’t reject, MCAR is not guaranteed.