Data Quality & Reliability — from "tests" to a DQ framework.
The round where they ask "how do you know your pipeline is healthy?". Schema, volume, distribution, referential checks. Lineage. Freshness SLAs. The on-call playbook. Seven sections, every check articulable in one sentence, every alert actionable.
Contents
- The four check categories — schema, volume, distribution, referential
- Schema checks — the contract layer
- Volume checks — anomaly detection on row counts
- Distribution checks — null %, uniqueness, value drift
- Referential checks — cross-table integrity
- Lineage, freshness SLAs & the on-call playbook
- The 90-second articulation script
The four check categories — schema, volume, distribution, referential.
"Data quality" is too broad to be useful. The interview-grade framing breaks it into four orthogonal categories. Every alert maps to one of these; every category has a different remediation pattern; every team can reason about coverage gaps because the buckets don't overlap.
The four buckets
| Category | Question it answers | Example check |
|---|---|---|
| Schema | Is the structure correct? | Column email exists, type VARCHAR, NOT NULL |
| Volume | Did we get the right amount? | Row count yesterday within 80-120% of the 7-day average |
| Distribution | Are the values plausible? | Null % < 0.5%; uniqueness on user_id; value range 0-100 |
| Referential | Does it join cleanly to other tables? | Every order.customer_id exists in customers |
Why this taxonomy works
Different categories fail differently and need different responses:
- Schema failures = pipeline broken, page on-call immediately. Loud failure, fast fix (revert / hotfix the producer).
- Volume failures = upstream incident OR downstream consumer issue. Investigation needed before action.
- Distribution failures = data source quality issue OR business reality changed. Often a slow drift, not a sudden break.
- Referential failures = cross-team coordination problem. The two tables drift out of sync because they have different freshness SLAs.
The check-per-table coverage matrix
Every important table should have at least one check in each category:
fact_orders:
schema: column types match contract; PK uniqueness
volume: row count today within 75-125% of L7 avg
distribution: null(customer_id) = 0; null(amount) < 0.1%; amount range $0-$10K
referential: customer_id ∈ dim_customer; product_id ∈ dim_product
dim_customer:
schema: email format; phone format
volume: new rows ≥ 0 (never shrinks)
distribution: distinct(country) ≤ 250; null(email) < 1%
referential: region_id ∈ dim_region
The four common DQ tools
| Tool | Strength | Weakness |
|---|---|---|
| dbt tests | Built-in to dbt (not_null, unique, accepted_values, relationships); free | Limited to schema + simple distribution; no anomaly detection |
| Great Expectations | Rich expectation library; detailed result docs | Heavy to operate; alerting weak out-of-box |
| Soda | Lightweight YAML-based; integrates with Airflow / dbt | Less mature than GE for complex checks |
| Monte Carlo / Datafold (commercial observability) | ML-based anomaly detection on volume + distribution; auto-discover lineage | $$$; vendor lock-in |
Schema checks — the contract layer.
Schema is the easiest category to enforce and the most damaging when it fails. The good news: it's mostly automatable. The bad news: most teams treat schema as drift to detect at runtime instead of a contract to enforce at build time.
The four schema check types
| Check | Implementation |
|---|---|
| Column existence | Does the column appear in the table at all? |
| Column type | Is it INT vs VARCHAR vs DATE as expected? |
| Nullability | Is the NOT NULL constraint declared and enforced? |
| Constraint enforcement | PK, UNIQUE, FK declared and active? |
dbt model contracts (1.5+)
models:
- name: dim_user
config:
contract:
enforced: true
columns:
- name: user_id
data_type: bigint
constraints: [{ type: not_null }, { type: primary_key }]
- name: email
data_type: varchar
constraints: [{ type: not_null }]
- name: created_at
data_type: timestamp
constraints: [{ type: not_null }]
The model fails to build if the SELECT output schema drifts. This is build-time enforcement — the warehouse equivalent of a schema registry.
Schema checks for raw / source tables (no contract)
Source tables are owned by other teams and don't have dbt contracts. Detect drift via information_schema:
-- Daily check job, comparing today's columns to expected
SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'raw' AND table_name = 'orders'
EXCEPT
SELECT table_name, column_name, data_type, is_nullable
FROM dq_expected_schema
WHERE table_name = 'orders';
-- Any row returned = drift; alert
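Note the direction: EXCEPT in this order catches added and changed columns, but a dropped column only shows up in the reverse comparison. Run both; a sketch of the companion query:
-- Companion check: expected columns missing from the live table (drops / type changes)
SELECT table_name, column_name, data_type, is_nullable
FROM dq_expected_schema
WHERE table_name = 'orders'
EXCEPT
SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'raw' AND table_name = 'orders';
-- Any row returned = an expected column disappeared or changed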
The "added column" trap
Schema evolution rules — what's safe vs not
| Change | Safe? | How to ship |
|---|---|---|
| Add nullable column | Yes | Producer-first; consumer ignores until needed |
| Add NOT NULL column with default | Yes | Same as above |
| Add NOT NULL column without default | No | Add as nullable first; backfill; then enforce NOT NULL |
| Drop column | No | Deprecation flag → 90 days → drop |
| Rename column | No | Add new + dual-populate → migrate consumers → drop old |
| Change column type | No | Add new typed column; migrate; drop old |
| Tighten NOT NULL on existing column | No | Backfill existing nulls → then add constraint |
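To make the rename row concrete, a minimal expand-and-contract sketch (table and column names are illustrative):
-- 1. Add the new column alongside the old one
ALTER TABLE dim_user ADD COLUMN email_address VARCHAR;
-- 2. Backfill from the old column
UPDATE dim_user SET email_address = email WHERE email_address IS NULL;
-- 3. Producer dual-writes both columns while consumers migrate
-- 4. Only after every consumer reads email_address:
ALTER TABLE dim_user DROP COLUMN email;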
Volume checks — anomaly detection on row counts.
The simplest check that catches the most production issues: "did we get the right number of rows today?". Implemented naively, it pages constantly; implemented well, it catches incidents 30 minutes earlier than any other signal.
The fixed-threshold trap
A static rule ("alert if fewer than 10,000 rows") is a treadmill of tuning: set tight, it pages on every growth spurt and marketing campaign; set loose, it sleeps through real drops. Every volume check below is therefore relative to a rolling baseline, not an absolute number.
The simplest anomaly check
-- Compare yesterday's count to the prior 7 days' average and std-dev
WITH baseline AS (
  SELECT AVG(daily_count) AS mean, STDDEV(daily_count) AS sd
  FROM (
    SELECT DATE(created_at) AS day, COUNT(*) AS daily_count
    FROM orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '8 days'
      AND created_at < CURRENT_DATE - INTERVAL '1 day'  -- exclude the day under test
    GROUP BY day
  ) daily                                               -- subquery alias required
),
latest AS (
  -- yesterday = the most recent complete day
  SELECT COUNT(*) AS cnt FROM orders WHERE DATE(created_at) = CURRENT_DATE - 1
)
SELECT
  latest.cnt,
  baseline.mean,
  baseline.sd,
  ABS(latest.cnt - baseline.mean) / NULLIF(baseline.sd, 0) AS z_score
FROM latest, baseline;
-- Alert if |z_score| > 3
Seasonality is the gotcha
"7-day average" works for stable workloads; fails for weekly seasonality (B2B traffic drops 80% on weekends). The fix: compare to same-day-of-week last 4 weeks:
WHERE DATE(created_at) IN (
  CURRENT_DATE - INTERVAL '8 days',    -- same weekday as yesterday, 1 week back
  CURRENT_DATE - INTERVAL '15 days',   -- 2 weeks back
  CURRENT_DATE - INTERVAL '22 days',   -- 3 weeks back
  CURRENT_DATE - INTERVAL '29 days'    -- 4 weeks back
)
-- Compare yesterday vs the same weekday in prior weeks, not vs the day before
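Putting the same-DoW baseline together with the z-score logic above (a sketch; same orders table and Postgres-style date arithmetic assumed):
WITH baseline AS (
  SELECT AVG(daily_count) AS mean, STDDEV(daily_count) AS sd
  FROM (
    SELECT DATE(created_at) AS day, COUNT(*) AS daily_count
    FROM orders
    WHERE DATE(created_at) IN (CURRENT_DATE - 8, CURRENT_DATE - 15,
                               CURRENT_DATE - 22, CURRENT_DATE - 29)
    GROUP BY day
  ) same_dow
),
latest AS (
  SELECT COUNT(*) AS cnt FROM orders WHERE DATE(created_at) = CURRENT_DATE - 1
)
SELECT latest.cnt, baseline.mean,
       ABS(latest.cnt - baseline.mean) / NULLIF(baseline.sd, 0) AS z_score
FROM latest, baseline;
-- Same alert rule (|z_score| > 3), now seasonality-aware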
Production-grade volume monitoring
Real volume monitors do three things at once:
- Day-over-day — quick spike / dip alarm
- Week-over-week (same DoW) — seasonality-aware baseline
- Monthly drift — slow trend detection (catches gradual leaks)
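The third leg is the one teams skip. A sketch of the slow-trend comparison (the -15% threshold is illustrative):
WITH recent AS (
  SELECT AVG(daily_count) AS avg_now
  FROM (SELECT DATE(created_at) AS day, COUNT(*) AS daily_count
        FROM orders
        WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
        GROUP BY day) d
),
prior AS (
  SELECT AVG(daily_count) AS avg_then
  FROM (SELECT DATE(created_at) AS day, COUNT(*) AS daily_count
        FROM orders
        WHERE created_at >= CURRENT_DATE - INTERVAL '35 days'
          AND created_at < CURRENT_DATE - INTERVAL '28 days'
        GROUP BY day) d
)
SELECT recent.avg_now, prior.avg_then,
       (recent.avg_now - prior.avg_then) / NULLIF(prior.avg_then, 0) AS trend
FROM recent, prior;
-- Alert if trend < -0.15: a 15% month-over-month decline nobody noticed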
When zero rows is normal vs alarming
| Table type | Zero rows = ? |
|---|---|
| Event stream (orders, clicks) | Almost always alarming — pipeline broken |
| Error log | Best case (no errors); only alarming if logging itself broke |
| Daily snapshot | Always alarming — snapshot job must run |
| Anomaly / fraud-flag table | Often expected — most days are clean |
Encode the expectation explicitly: min_rows_per_day, max_rows_per_day, expected_zero_rows: false.
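One way to encode it, assuming a hand-maintained expectations table (table and column names are illustrative):
-- dq_volume_expectations(table_name, min_rows_per_day, max_rows_per_day)
-- expected_zero_rows: false is simply min_rows_per_day = 1
SELECT e.table_name, y.cnt, e.min_rows_per_day, e.max_rows_per_day
FROM dq_volume_expectations e
JOIN (
  SELECT 'orders' AS table_name, COUNT(*) AS cnt
  FROM orders
  WHERE DATE(created_at) = CURRENT_DATE - 1
) y USING (table_name)
WHERE y.cnt < e.min_rows_per_day OR y.cnt > e.max_rows_per_day;
-- Any row returned = volume outside that table's declared bounds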
The freshness-volume cousin
Volume = how many rows. Freshness = how recent. Both matter, and both run on the same anomaly logic. Freshness gets full coverage in §06; for now, note that the detection patterns mirror each other.
Distribution checks — null %, uniqueness, value drift.
Schema says "the column exists". Volume says "rows are present". Distribution says "the values are plausible". This is where data quality gets subtle — and where most production data corruption lives undetected.
The five distribution check families
| Family | Question | Example |
|---|---|---|
| Null rate | How often is this column NULL? | null(customer_email) < 0.5% |
| Uniqueness | Are unique columns actually unique? | distinct(order_id) = count(*) |
| Value range | Are values within plausible bounds? | amount BETWEEN 0 AND 10000 |
| Cardinality | Is the distinct-value count reasonable? | distinct(country) ≤ 250 |
| Categorical drift | Is the distribution of values stable? | % of orders with status='cancelled' stays within 3-7% |
Null-rate checks
# Soda check
checks for orders:
- missing_count(customer_id) = 0 # PK never NULL
- missing_percent(email) < 1.0
- missing_percent(phone) < 5.0 # phone is optional, so tolerate more nulls
Uniqueness — the silent killer
Duplicate rows in a "unique" column don't break joins immediately — they just make every downstream aggregation wrong. The check:
# dbt schema test
columns:
- name: order_id
tests:
- unique
- not_null
For composite keys, use the dbt_utils.unique_combination_of_columns generic test, applied at the model level (model name illustrative):
models:
  - name: fct_sessions
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - user_id
            - session_id
Value range — the most-skipped check
# Plausibility checks save you from data-corruption nightmares
checks for transactions:
- invalid_count(amount) = 0:
valid min: 0
valid max: 100000 # > $100K is alarming
- invalid_count(country_code) = 0:
    valid regex: "^[A-Z]{2}$" # ISO 3166-1 alpha-2 codes
- invalid_count(age) = 0:
valid min: 0
valid max: 120
Categorical drift — the slow leak
"Cancelled rate is normally 5%; today it's 22%" is the early signal of an upstream issue (payment processor down, fraud spike, deploy bug). Distribution drift catches this.
-- % distribution of order status yesterday vs the prior-7-day baseline
WITH latest AS (
  SELECT status, COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS share
  FROM orders WHERE DATE(created_at) = CURRENT_DATE - 1
  GROUP BY status
),
baseline AS (
  SELECT status, AVG(daily_share) AS expected_share
  FROM (
    SELECT DATE(created_at) AS day, status,
           COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (PARTITION BY DATE(created_at)) AS daily_share
    FROM orders WHERE created_at >= CURRENT_DATE - INTERVAL '8 days'
                  AND created_at < CURRENT_DATE - INTERVAL '1 day'  -- exclude the day under test
    GROUP BY day, status
  ) daily GROUP BY status                                           -- subquery alias required
)
SELECT l.status, l.share AS latest_share, b.expected_share,
       ABS(l.share - b.expected_share) AS drift
FROM latest l JOIN baseline b USING (status)
-- (a LEFT JOIN plus COALESCE(b.expected_share, 0) also flags brand-new status values)
WHERE ABS(l.share - b.expected_share) > 0.05;
-- > 5pp drift on any status = alert
Cardinality — the schema-rot detector
distinct(country) = 250 is fine; = 50000 means someone's writing free-text where there should be ISO codes. Cardinality bounds catch upstream-validation rot.
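The check itself is one aggregate (the 250 bound comes from the example above):
-- Cardinality bound check; any row returned = bound breached
SELECT COUNT(DISTINCT country) AS distinct_countries
FROM dim_customer
HAVING COUNT(DISTINCT country) > 250;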
Referential checks — cross-table integrity.
Cross-table referential integrity is what makes a warehouse trustworthy. "Every order has a customer" sounds obvious; in practice, it breaks every week because the customer dim has a different freshness SLA than the orders fact.
The two failure modes
| Mode | Cause |
|---|---|
| Orphan facts | Fact row references a dim row that doesn't exist (yet — late-arriving dim) or never existed (data corruption) |
| Dim with no fact | Dim row exists but no fact rows reference it (legitimate for new entities; alarming if old) |
The orphan-fact check
# dbt relationships test (YAML)
columns:
- name: customer_id
tests:
- relationships:
to: ref('dim_customer')
field: customer_id
-- Equivalent SQL
SELECT COUNT(*) AS orphan_count
FROM fact_orders f
LEFT JOIN dim_customer d ON d.customer_id = f.customer_id
WHERE d.customer_id IS NULL
AND f.customer_id IS NOT NULL;
The freshness-mismatch trap
The fact refreshes hourly; the dim refreshes daily. For a few hours, every order from a brand-new customer looks like an orphan, and a naive check pages on-call for data that is merely late. The fix is to classify orphans instead of just counting them:
The LAD-aware referential check
-- "Orphans" that are actually late-arriving dims (LAD)
WITH classified AS (
SELECT f.order_id, f.customer_id,
CASE
WHEN d.customer_id IS NOT NULL THEN 'matched'
WHEN f.created_at > CURRENT_TIMESTAMP - INTERVAL '24 hours' THEN 'lad_pending'
ELSE 'true_orphan'
END AS status
FROM fact_orders f
LEFT JOIN dim_customer d ON d.customer_id = f.customer_id
WHERE DATE(f.created_at) = CURRENT_DATE - 1
)
SELECT status, COUNT(*) AS row_count  -- 'rows' is a reserved word in several engines
FROM classified GROUP BY status;
-- Alert only on true_orphan, not lad_pending
The cross-table count reconciliation
Beyond row-level integrity, aggregate-level reconciliation catches subtle bugs:
-- Order line items should sum to the order total
SELECT o.order_id,
o.total AS order_total,
SUM(li.line_amount) AS computed_from_lines,
ABS(o.total - SUM(li.line_amount)) AS discrepancy
FROM orders o
JOIN order_line_items li ON li.order_id = o.order_id
GROUP BY o.order_id, o.total
HAVING ABS(o.total - SUM(li.line_amount)) > 0.01;
-- Any row = order total disagrees with line sum
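-- Caveat: the inner join skips orders with no line items at all;
-- a LEFT JOIN plus COALESCE(SUM(li.line_amount), 0) catches those too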
Cross-pipeline reconciliation
Two systems should agree on the same number. "Stripe says we processed $X yesterday; warehouse says $Y." The reconciliation:
| Source A | Source B | Acceptable diff |
|---|---|---|
| Stripe API total | Warehouse fact_payments SUM | < 0.01% (rounding only) |
| Application DAU | Mixpanel / Amplitude DAU | < 5% (different definitions of "active") |
| Salesforce ARR | Finance close ARR | 0% — must match for audit |
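A sketch of the first row, assuming Stripe's daily totals are synced into a raw table (stripe_daily_totals and its columns are made-up names):
SELECT s.day, s.gross_amount AS stripe_total, w.wh_total,
       ABS(s.gross_amount - w.wh_total) / NULLIF(s.gross_amount, 0) AS pct_diff
FROM stripe_daily_totals s
JOIN (
  SELECT DATE(paid_at) AS day, SUM(amount) AS wh_total
  FROM fact_payments
  GROUP BY DATE(paid_at)
) w USING (day)
WHERE s.day = CURRENT_DATE - 1
  AND ABS(s.gross_amount - w.wh_total) / NULLIF(s.gross_amount, 0) > 0.0001;
-- Any row = warehouse and Stripe disagree beyond the 0.01% tolerance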
Lineage, freshness SLAs & the on-call playbook.
Checks that fire are useful only if someone responds. The operational layer turns DQ from a build-time thing into a continuous-reliability thing.
Lineage — what depends on what
Lineage is the DAG of which tables are derived from which sources. When a check fires upstream, lineage tells you which 40 downstream dashboards are about to break.
| Source | What it generates |
|---|---|
| dbt | Auto-generated DAG via ref() calls; dbt docs renders it |
| OpenLineage | Cross-tool lineage events (Airflow + dbt + Spark + Flink) |
| DataHub / Atlan / OpenMetadata | Centralized catalog with lineage UI + impact analysis |
| Column-level lineage | Tracks which downstream column came from which upstream column — required for GDPR + sensitive-data tracking |
Freshness SLAs — the contract with consumers
Every important table publishes a freshness SLA: how recent the data must be at any given moment.
# dbt source freshness — the simplest form
sources:
- name: raw
tables:
- name: orders
loaded_at_field: ingested_at
freshness:
warn_after: { count: 1, period: hour }
error_after: { count: 4, period: hour }
Translation: warn if no new rows for 1 hour; error (and page) if none for 4. Each table has a different criticality:
| Table type | Typical freshness SLA |
|---|---|
| Real-time event stream (clickstream) | < 5 minutes |
| Hourly fact (orders, payments) | < 1 hour warn / < 4 hours error |
| Daily dim (customers, products) | < 24 hours |
| Monthly snapshot (subscription state) | < 7 days from month-end |
The on-call playbook — what fires, what to do
| Alert | First diagnostic step | Common root cause |
|---|---|---|
| Schema drift | Check information_schema diff vs expected | Producer deploy added/removed column without notice |
| Volume drop | Check upstream Kafka lag / source DB query volume | Producer / connector outage; rarely warehouse-side |
| Volume spike | Check for double-write / replay events | Pipeline restart replayed events; bot traffic; deploy bug |
| Null-rate spike | Trace specific column to producer; check producer's recent commits | Refactor dropped a populating call |
| Distribution drift | Compare value mix today vs yesterday vs same-DoW-last-week | Upstream data source change; A/B experiment ramped; seasonal effect |
| Referential orphans | Check dim freshness vs fact freshness | Dim refresh job stalled; LAD that didn't catch up |
| Freshness breach | Check ingestion job logs / Airflow DAG state | Source connector down; orchestrator stalled; permission issue |
The "circuit breaker" pattern
When upstream data quality fails, propagating it to consumers is worse than not refreshing. Circuit-break the pipeline:
# Airflow DAG with a quality gate
[ extract → load_to_bronze → DQ_GATE → transform_silver → ... ]
                                │
                                └─▶ if it fails: STOP propagation;
                                    page on-call;
                                    consumers see yesterday's data
                                    until the issue is fixed.
Better to serve slightly stale data than wrong data.
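One hedged sketch of the gate itself: a query the orchestrator runs between bronze and silver, failing the task on any returned row (dq_check_results and its columns are illustrative):
-- DQ gate: any row returned = a blocking check failed; halt the DAG here
SELECT check_name
FROM dq_check_results
WHERE table_name = 'bronze_orders'
  AND run_date = CURRENT_DATE
  AND passed = FALSE
  AND severity = 'blocking';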
Quality scoring — the "is the table healthy?" rollup
For executive dashboards, roll up dozens of checks into one number:
quality_score = (1 - failed_checks / total_checks) × freshness_score × 100
where freshness_score = max(0, 1 - (hours_stale / SLA_hours))
Each table gets a score 0-100; published in the data catalog. Consumers see "fact_orders: 98 / 100" before they trust the dashboard built from it.
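A minimal rollup, assuming check results and staleness land in two metadata tables (all names illustrative, consistent with the gate sketch above):
-- dq_check_results(table_name, check_name, run_date, passed, severity)
-- dq_freshness(table_name, hours_stale, sla_hours)
SELECT r.table_name,
       ROUND((1.0 - AVG(CASE WHEN r.passed THEN 0 ELSE 1 END))
             * GREATEST(0, 1 - f.hours_stale * 1.0 / f.sla_hours) * 100) AS quality_score
FROM dq_check_results r
JOIN dq_freshness f USING (table_name)
WHERE r.run_date = CURRENT_DATE
GROUP BY r.table_name, f.hours_stale, f.sla_hours;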
The DQ post-mortem template
Every Sev-1 DQ incident gets a postmortem with:
- Detection time — when did the alert fire?
- Impact — which dashboards / consumers / decisions were affected?
- Root cause — schema / volume / distribution / referential / freshness?
- Why didn't existing checks catch it? Coverage gap.
- New check added — concrete addition to prevent recurrence.
The 90-second articulation script.
▸ THE 90-SECOND SCRIPT
"My DQ framework has four orthogonal categories: schema, volume, distribution, referential. Every check maps to one; every category has its own remediation pattern. Coverage matrix: every important table has at least one check in each."
"Schema is build-time enforced via dbt contracts for our models, runtime via information_schema diff for source tables. Renames and type changes go through 90-day deprecation windows, never in-place."
"Volume is anomaly-based, not fixed-threshold. Day-over-day catches sudden breaks; same-DoW-last-4-weeks handles seasonality; Z-score > 3 is the alert. Fixed thresholds are a treadmill of tuning."
"Distribution covers null %, uniqueness, value range, cardinality, and categorical drift. Categorical drift is the slow-leak detector — 'cancellation rate jumped from 5% to 22%' is how you catch the payment-processor incident before billing's dashboard does."
"Referential classifies orphans into LAD-pending (legitimate, late-arriving dim) vs true orphans (page). Cross-pipeline reconciliation is its own check class — finance facts must match Stripe to the cent; engagement metrics can have 5% definitional drift."
"Operations: lineage from OpenLineage, freshness SLAs per-table in the catalog, on-call playbook keyed to alert type. Circuit-breaker pattern halts propagation when upstream fails — yesterday's good data beats today's broken data. Every Sev-1 gets a postmortem with a new check; the framework grows with every failure."
Three sentences that signal seniority — in any DQ round
- "Fixed thresholds don't survive seasonality — every volume check needs to be anomaly-based with same-DoW comparison."
- "Most production data corruption isn't a schema break — it's distribution drift. The 5%-to-22% cancellation rate is the canary in the coalmine."
- "Better to serve yesterday's good data than today's broken data. The circuit-breaker pattern is non-negotiable."