Data Quality - Notes

Author

Nils Rechberger

Published

February 23, 2026

Tasks for Exercise 01: Introduction

Task 01

Data quality describes data characteristics at the meta-level. High-quality data provides a reliable and useful structure for information. Conversely, poor data quality can lead to mistrust and inaccurate outcomes.

Note: Personal definition.

Task 02

Scenario 2: E-Commerce – Inventory Management

How would incomplete data affect decision-making in your scenario?

Incomplete data can lead to a misunderstanding of consumer behavior and purchasing patterns. Without a full picture, businesses may fail to identify emerging trends or customer needs.

What could go wrong if the data in your scenario is not accurate?

Inaccurate data can result in supply chain bottlenecks or an oversupply of products. This leads to either lost sales due to stockouts or increased storage costs due to excess inventory.

How could the fact that data is not available when required affect your scenario?

A lack of real-time data availability undermines trust in data-driven processes. If stakeholders cannot access information when needed, they may revert to “gut-feeling” decisions, which are prone to error.

In what ways does the data need to be reliable and relevant?

To support effective inventory management, data must meet specific quality standards, such as:

Availability: Data must be accessible to decision-makers at all times.
Velocity: Data must be processed and updated at the speed of the business.
Completeness: No critical data points (like regional demand) should be missing.
Usability: Data must be in a format that is easy to interpret and act upon.

Note: The answers are not mutually exclusive.

Tasks for Exercise 02: Data Quality Dimensions

Task 01

id	last_name	first_name	age	department	function	salary	commision_rate
1	Smith	Bill	56	Sales	Head of Sales	120.000	15%
2	Muller	John	25	Social Media	Creative Director	100.000	N/A
3	Grey	Anna	37	SEO	Google Expert	90.000	N/A
4	Berger	Lia	22	NULL	Freelancer		NULL
5	?	Mike	46	Facility	Team Manager	75.000	N/A
6	Doe	Jane	30	IT	Dev	N/A	NaN

Scenario 1: The value exists but is not known (that is, known unknown): last_name of Mike (ID 5). Every person has a last_name, but it is currently missing from the dataset.
Scenario 2: The value does not exist at all: department of Lia (ID 4). As an external freelancer, she is not part of the internal organizational structure.
Scenario 3: The existence of the value is not known (that is, unknown unknown): salary of Jane (ID 6). It is unclear if she is a paid employee or an unpaid volunteer/intern; thus, the existence of the attribute itself is in question.
Scenario 4: The attribute is not applicable: commision_rate for non-sales employees. This metric is only defined for sales roles and is fundamentally inapplicable to other functions.
Scenario 5: The value is only populated under specific conditions: salary of Lia (ID 4). Because she is a freelancer, the field remains empty until an hourly invoice or a contract condition triggers the entry.
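The table above uses several different markers for missing values ("?", "NULL", "N/A", "NaN", and an empty cell). A minimal sketch of how these markers could be counted per column in R follows. The data frame is retyped from the table with a subset of its columns, and the list of markers treated as missing is an assumption made for this sketch.

```r
# Retype part of the example table; all values are kept as text so that
# the different missing-value markers stay visible.
employees <- data.frame(
  id = 1:6,
  last_name = c("Smith", "Muller", "Grey", "Berger", "?", "Doe"),
  department = c("Sales", "Social Media", "SEO", "NULL", "Facility", "IT"),
  salary = c("120.000", "100.000", "90.000", "", "75.000", "N/A"),
  commision_rate = c("15%", "N/A", "N/A", "NULL", "N/A", "NaN"),
  stringsAsFactors = FALSE
)

# Markers treated as "missing" (an assumption for this sketch)
missing_markers <- c("", "?", "NULL", "N/A", "NaN")

# Count how often any marker appears in each column (id is skipped)
missing_counts <- sapply(
  employees[-1],
  function(col) sum(col %in% missing_markers)
)
print(missing_counts)
```

Collapsing all markers into a single NA would hide the distinction between the five scenarios, so keeping the original marker alongside any cleaned column may be worth considering.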

Tasks for Exercise 03: Exploratory Data Analysis

Task 01

Using R, generate a boxplot broken down by gender (variable sex). How can the boxplot be interpreted?

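library(readxl)  # provides read_excel()

# Load the HSE dataset from the local Excel file
data <- read_excel("/home/nils/dev/mscids-notes/fs26/dq/data/HSE.xlsx")

# Draw one box of wtval per level of sex
boxplot(wtval ~ sex, data = data)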

We can clearly see a difference in the distribution of wtval between the two sex groups.
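To back up a visual reading of a boxplot, the median and interquartile range per group can be computed directly, since these are the quantities the boxes display. The sketch below uses a small made-up data frame (toy) in place of the HSE file; only the column names wtval and sex come from the task, and the values are invented for illustration.

```r
# Toy stand-in for the HSE data (values are invented, not real measurements)
toy <- data.frame(
  sex = c("F", "F", "F", "F", "M", "M", "M", "M"),
  wtval = c(58, 62, 65, 70, 72, 78, 81, 90)
)

# Median per group: the thick line inside each box
medians <- tapply(toy$wtval, toy$sex, median)

# Interquartile range per group: the height of each box
iqrs <- tapply(toy$wtval, toy$sex, IQR)

print(medians)
print(iqrs)
```

With the real dataset, the same two tapply() calls could be applied to data$wtval and data$sex.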

Task 02

What are the distributions of the following boxplots?

Answer

Variable A: Normal
Variable B: Right skewed
Variable C: Right skewed
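As a reference for reading such boxplots, the following sketch compares a roughly normal sample with a right-skewed one. Both samples are simulated and are not the variables from the exercise; the helper skew_gap is a hypothetical name introduced here. In a right-skewed sample the long upper tail pulls the mean above the median, and the boxplot shows a longer upper whisker with outliers on the high side.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Simulated samples (not the exercise data)
symmetric <- rnorm(1000, mean = 50, sd = 10)  # roughly normal
right_skewed <- rexp(1000, rate = 1 / 10)     # long tail to the right

# Difference between mean and median: near zero for symmetric data,
# clearly positive for right-skewed data
skew_gap <- function(x) mean(x) - median(x)
print(c(symmetric = skew_gap(symmetric), right_skewed = skew_gap(right_skewed)))

# Side-by-side boxplots for visual comparison
boxplot(
  list(symmetric = symmetric, right_skewed = right_skewed),
  horizontal = TRUE
)
```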