---
name: Data Quality Monitoring
description: Techniques and tools for ensuring the accuracy, completeness, and reliability of data across the pipeline.
---

# Data Quality Monitoring

## Overview

Data Quality (DQ) Monitoring is the continuous process of validating data against predefined rules and expectations. In a modern data stack, monitoring must happen at every stage: **Ingestion**, **Transformation**, and **Serving**.

**Core Principle**: "Garbage in, garbage out. If the data is wrong, the analytics, AI, and decisions will be wrong too."

---

## 1. The Six Dimensions of Data Quality

Common industry standards for measuring data health:

| Dimension | Definition | Example Metric |
| :--- | :--- | :--- |
| **Accuracy** | Does the data reflect reality? | % of records matching the source system. |
| **Completeness**| Are there missing values? | % of non-null values in mandatory fields. |
| **Consistency** | Does data match across systems? | Do user names match in both SQL and Redis? |
| **Timeliness** | Is data fresh enough? | Data latency (current time - event time). |
| **Uniqueness** | Are there duplicate records? | Count of duplicate primary keys. |
| **Validity** | Does it follow specified formats?| % of emails following the regex pattern. |

---

## 2. Automated DQ Testing (Frameworks)

### A. dbt (Data Build Tool) Tests
Ideal for SQL-based transformations.
```yaml
# schema.yml
models:
  - name: active_users
    columns:
      - name: user_id
        tests:
          - unique
          - not_null
      - name: age
        tests:
          - accepted_values:
              values: [18, 120]
```

### B. Great Expectations (Python)
The most flexible tool for complex Python-based data pipelines.
```python
import great_expectations as ge

df = ge.read_csv("data.csv")
df.expect_column_values_to_be_between("age", 18, 120)
df.expect_column_values_to_not_be_null("email")
df.expect_column_values_to_match_regex("email", r"[\w.-]+@[\w.-]+")
```

---

## 3. Real-Time DQ with SQL Assertions

You can run "Check" queries alongside your production workload to catch silent failures.

```sql
-- Detect "Price Drift" anomaly
SELECT 
    product_id, 
    AVG(price) OVER(PARTITION BY category) as avg_cat_price,
    price
FROM current_inventory
WHERE price > (avg_cat_price * 5) -- Alert if price is 5x the average
```

---

## 4. Monitoring Tools Landscape

| Tool | Focus | Best For |
| :--- | :--- | :--- |
| **Great Expectations**| Validation | Python pipelines, Airflow integration. |
| **Monte Carlo** | Data Observability | ML-driven anomaly detection, lineage. |
| **Soda** | Data Contracts / Monitoring | Collaborative DQ for Data & Business teams. |
| **Anodot** | Streaming Anomaly | Catching spikes/dips in real-time metrics. |

---

## 5. Data Profiling and Drift Detection

**Data Drift** occurs when the statistical properties of the data change over time, even if the schema stays the same.

*   **Schema Drift**: Adding/Removing columns.
*   **Concept Drift**: The meaning of a value changes (e.g., "Active" now means something different).
*   **Predictive Drift**: Distribution of data shifts (e.g., average purchase value drops by 50%).

---

## 6. DQ Incident Management

When a monitor fails, it should trigger an incident workflow:
1.  **Detect**: Alert fires (Slack/PagerDuty).
2.  **Triage**: Is the data just late, or is it incorrect?
3.  **Isolation**: Block downstream pipelines to prevent "polluting" the data warehouse.
4.  **Remediation**: Re-run the pipeline or fix the source code.
5.  **Validation**: Verify the fix.

---

## 7. The Data Quality Dashboard

A high-level view for stakeholders:
*   **DQ Score**: (0-100) Aggregated health of all datasets.
*   **Freshness SLA**: % of pipelines meeting their timeliness target.
*   **Top Failing Tests**: Which columns are most problematic?
*   **Time to Resolve**: Average time to fix DQ incidents.

---

## 8. Real-World Scenario: E-commerce Ghost Orders
*   **Scenario**: A promotional campaign led to a surge in orders, but a bug caused the `tax_amount` to be 0 for all of them.
*   **Detection**: A DQ monitor was in place checking that `tax_amount > 0` for all non-exempt regions.
*   **Response**: The monitor failed at 2:00 AM. The data pipeline was automatically paused.
*   **Outcome**: The finance team was notified before the morning report was generated. The tax glitch was fixed, preventing thousands of incorrect invoices.

---

## 9. DQ Monitoring Checklist

* [ ] **Completeness**: Are there checks for `NULL` values in primary keys?
* [ ] **Freshness**: Do we have alerts for data that hasn't arrived in X hours?
* [ ] **Volume**: Sudden drop or spike in row count (e.g., < 50% of typical volume)?
* [ ] **Distribution**: Has the average or median value shifted significantly today?
* [ ] **Schema**: Have any columns been renamed or dropped?
* [ ] **Downstream blocking**: Does a failure stop downstream tasks automatically?

---

## Related Skills
* `43-data-reliability/data-contracts`
* `43-data-reliability/data-lineage`
* `42-cost-engineering/cost-observability`