Data Quality & Reliability — from "tests" to a DQ framework.
The round where they ask "how do you know your pipeline is healthy?". Schema, volume, distribution, referential checks. Lineage. Freshness SLAs. The on-call playbook. Seven sections, every check articulable in one sentence, every alert actionable.
Contents
- The four check categories — schema, volume, distribution, referential
- Schema checks — the contract layer
- Volume checks — anomaly detection on row counts
- Distribution checks — null %, uniqueness, value drift
- Referential checks — cross-table integrity
- Lineage, freshness SLAs & the on-call playbook
- The 90-second articulation script
The four check categories — schema, volume, distribution, referential.
"Data quality" is too broad to be useful. The interview-grade framing breaks it into four orthogonal categories. Every alert maps to one of these; every category has a different remediation pattern; every team can reason about coverage gaps because the buckets don't overlap.
The four buckets
| Category | Question it answers | Example check |
|---|---|---|
| Schema | Is the structure correct? | Column email exists, type VARCHAR, NOT NULL |
| Volume | Did we get the right amount? | Row count yesterday within 80-120% of the 7-day average |
| Distribution | Are the values plausible? | Null % < 0.5%; uniqueness on user_id; value range 0-100 |
| Referential | Does it join cleanly to other tables? | Every order.customer_id exists in customers |
Why this taxonomy works
Different categories fail differently and need different responses:
- Schema failures = pipeline broken, page on-call immediately. Loud failure, fast fix (revert / hotfix the producer).
- Volume failures = upstream incident OR downstream consumer issue. Investigation needed before action.
- Distribution failures = data source quality issue OR business reality changed. Often a slow drift, not a sudden break.
- Referential failures = cross-team coordination problem. The two tables drift out of sync because they have different freshness SLAs.
The check-per-table coverage matrix
Every important table should have at least one check in each category:
fact_orders:
schema: column types match contract; PK uniqueness
volume: row count today within 75-125% of L7 avg
distribution: null(customer_id) = 0; null(amount) < 0.1%; amount range $0-$10K
referential: customer_id ∈ dim_customer; product_id ∈ dim_product
dim_customer:
schema: email format; phone format
volume: new rows ≥ 0 (never shrinks)
distribution: distinct(country) ≤ 250; null(email) < 1%
referential: region_id ∈ dim_region
The four common DQ tools
| Tool | Strength | Weakness |
|---|---|---|
| dbt tests | Built-in to dbt (not_null, unique, accepted_values, relationships); free | Limited to schema + simple distribution; no anomaly detection |
| Great Expectations | Rich expectation library; detailed result docs | Heavy to operate; alerting weak out-of-box |
| Soda | Lightweight YAML-based; integrates with Airflow / dbt | Less mature than GE for complex checks |
| Monte Carlo / Datafold (commercial observability) | ML-based anomaly detection on volume + distribution; auto-discover lineage | $$$; vendor lock-in |
Schema checks — the contract layer.
Schema is the easiest category to enforce and the most damaging when it fails. The good news: it's mostly automatable. The bad news: most teams treat schema as drift to detect at runtime instead of a contract to enforce at build time.
The four schema check types
| Check | Implementation |
|---|---|
| Column existence | Does the column appear in the table at all? |
| Column type | Is it INT vs VARCHAR vs DATE as expected? |
| Nullability | Is the NOT NULL constraint declared and enforced? |
| Constraint enforcement | PK, UNIQUE, FK declared and active? |
dbt model contracts (1.5+)
models:
- name: dim_user
config:
contract:
enforced: true
columns:
- name: user_id
data_type: bigint
constraints: [{ type: not_null }, { type: primary_key }]
- name: email
data_type: varchar
constraints: [{ type: not_null }]
- name: created_at
data_type: timestamp
constraints: [{ type: not_null }]
The model fails to build if the SELECT output schema drifts. This is build-time enforcement — the warehouse equivalent of a schema registry.
Schema checks for raw / source tables (no contract)
Source tables are owned by other teams and don't have dbt contracts. Detect drift via information_schema:
-- Daily check job, comparing today's columns to expected
SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'raw' AND table_name = 'orders'
EXCEPT
SELECT table_name, column_name, data_type, is_nullable
FROM dq_expected_schema
WHERE table_name = 'orders';
-- Any row returned = drift; alert
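Note the direction: EXCEPT in this order catches added and changed columns, but a dropped column only shows up in the reverse comparison. Run both; a sketch of the companion query:
-- Companion check: expected columns missing from the live table (drops / type changes)
SELECT table_name, column_name, data_type, is_nullable
FROM dq_expected_schema
WHERE table_name = 'orders'
EXCEPT
SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'raw' AND table_name = 'orders';
-- Any row returned = an expected column disappeared or changed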
The "added column" trap
Schema evolution rules — what's safe vs not
| Change | Safe? | How to ship |
|---|---|---|
| Add nullable column | Yes | Producer-first; consumer ignores until needed |
| Add NOT NULL column with default | Yes | Same as above |
| Add NOT NULL column without default | No | Add as nullable first; backfill; then enforce NOT NULL |
| Drop column | No | Deprecation flag → 90 days → drop |
| Rename column | No | Add new + dual-populate → migrate consumers → drop old |
| Change column type | No | Add new typed column; migrate; drop old |
| Tighten NOT NULL on existing column | No | Backfill existing nulls → then add constraint |
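To make the rename row concrete, a minimal expand-and-contract sketch (table and column names are illustrative):
-- 1. Add the new column alongside the old one
ALTER TABLE dim_user ADD COLUMN email_address VARCHAR;
-- 2. Backfill from the old column
UPDATE dim_user SET email_address = email WHERE email_address IS NULL;
-- 3. Producer dual-writes both columns while consumers migrate
-- 4. Only after every consumer reads email_address:
ALTER TABLE dim_user DROP COLUMN email;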
Volume checks — anomaly detection on row counts.
The simplest check that catches the most production issues: "did we get the right number of rows today?". Implemented naively, it pages constantly; implemented well, it catches incidents 30 minutes earlier than any other signal.
The fixed-threshold trap
A static rule ("alert if fewer than 10,000 rows") is a treadmill of tuning: set tight, it pages on every growth spurt and marketing campaign; set loose, it sleeps through real drops. Every volume check below is therefore relative to a rolling baseline, not an absolute number.
The simplest anomaly check
-- Compare yesterday's count to the prior 7 days' average and std-dev
WITH baseline AS (
  SELECT AVG(daily_count) AS mean, STDDEV(daily_count) AS sd
  FROM (
    SELECT DATE(created_at) AS day, COUNT(*) AS daily_count
    FROM orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '8 days'
      AND created_at < CURRENT_DATE - INTERVAL '1 day'  -- exclude the day under test
    GROUP BY day
  ) daily                                               -- subquery alias required
),
latest AS (
  -- yesterday = the most recent complete day
  SELECT COUNT(*) AS cnt FROM orders WHERE DATE(created_at) = CURRENT_DATE - 1
)
SELECT
  latest.cnt,
  baseline.mean,
  baseline.sd,
  ABS(latest.cnt - baseline.mean) / NULLIF(baseline.sd, 0) AS z_score
FROM latest, baseline;
-- Alert if |z_score| > 3
Seasonality is the gotcha
"7-day average" works for stable workloads; fails for weekly seasonality (B2B traffic drops 80% on weekends). The fix: compare to same-day-of-week last 4 weeks:
WHERE DATE(created_at) IN (
  CURRENT_DATE - INTERVAL '8 days',    -- same weekday as yesterday, 1 week back
  CURRENT_DATE - INTERVAL '15 days',   -- 2 weeks back
  CURRENT_DATE - INTERVAL '22 days',   -- 3 weeks back
  CURRENT_DATE - INTERVAL '29 days'    -- 4 weeks back
)
-- Compare yesterday vs the same weekday in prior weeks, not vs the day before
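Putting the same-DoW baseline together with the z-score logic above (a sketch; same orders table and Postgres-style date arithmetic assumed):
WITH baseline AS (
  SELECT AVG(daily_count) AS mean, STDDEV(daily_count) AS sd
  FROM (
    SELECT DATE(created_at) AS day, COUNT(*) AS daily_count
    FROM orders
    WHERE DATE(created_at) IN (CURRENT_DATE - 8, CURRENT_DATE - 15,
                               CURRENT_DATE - 22, CURRENT_DATE - 29)
    GROUP BY day
  ) same_dow
),
latest AS (
  SELECT COUNT(*) AS cnt FROM orders WHERE DATE(created_at) = CURRENT_DATE - 1
)
SELECT latest.cnt, baseline.mean,
       ABS(latest.cnt - baseline.mean) / NULLIF(baseline.sd, 0) AS z_score
FROM latest, baseline;
-- Same alert rule (|z_score| > 3), now seasonality-aware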
Production-grade volume monitoring
Real volume monitors do three things at once:
- Day-over-day — quick spike / dip alarm
- Week-over-week (same DoW) — seasonality-aware baseline
- Monthly drift — slow trend detection (catches gradual leaks)
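The third leg is the one teams skip. A sketch of the slow-trend comparison (the -15% threshold is illustrative):
WITH recent AS (
  SELECT AVG(daily_count) AS avg_now
  FROM (SELECT DATE(created_at) AS day, COUNT(*) AS daily_count
        FROM orders
        WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
        GROUP BY day) d
),
prior AS (
  SELECT AVG(daily_count) AS avg_then
  FROM (SELECT DATE(created_at) AS day, COUNT(*) AS daily_count
        FROM orders
        WHERE created_at >= CURRENT_DATE - INTERVAL '35 days'
          AND created_at < CURRENT_DATE - INTERVAL '28 days'
        GROUP BY day) d
)
SELECT recent.avg_now, prior.avg_then,
       (recent.avg_now - prior.avg_then) / NULLIF(prior.avg_then, 0) AS trend
FROM recent, prior;
-- Alert if trend < -0.15: a 15% month-over-month decline nobody noticed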
When zero rows is normal vs alarming
| Table type | Zero rows = ? |
|---|---|
| Event stream (orders, clicks) | Almost always alarming — pipeline broken |
| Error log | Best case (no errors); only alarming if logging itself broke |
| Daily snapshot | Always alarming — snapshot job must run |
| Anomaly / fraud-flag table | Often expected — most days are clean |
Encode the expectation explicitly: min_rows_per_day, max_rows_per_day, expected_zero_rows: false.
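One way to encode it, assuming a hand-maintained expectations table (table and column names are illustrative):
-- dq_volume_expectations(table_name, min_rows_per_day, max_rows_per_day)
-- expected_zero_rows: false is simply min_rows_per_day = 1
SELECT e.table_name, y.cnt, e.min_rows_per_day, e.max_rows_per_day
FROM dq_volume_expectations e
JOIN (
  SELECT 'orders' AS table_name, COUNT(*) AS cnt
  FROM orders
  WHERE DATE(created_at) = CURRENT_DATE - 1
) y USING (table_name)
WHERE y.cnt < e.min_rows_per_day OR y.cnt > e.max_rows_per_day;
-- Any row returned = volume outside that table's declared bounds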
The freshness-volume cousin
Volume = how many rows. Freshness = how recent. Both matter, and both run on the same anomaly logic. Freshness gets full coverage in §06; for now, note that the detection patterns mirror each other.
Distribution checks — null %, uniqueness, value drift.
Schema says "the column exists". Volume says "rows are present". Distribution says "the values are plausible". This is where data quality gets subtle — and where most production data corruption lives undetected.
The five distribution check families
| Family | Question | Example |
|---|---|---|
| Null rate | How often is this column NULL? | null(customer_email) < 0.5% |
| Uniqueness | Are unique columns actually unique? | distinct(order_id) = count(*) |
| Value range | Are values within plausible bounds? | amount BETWEEN 0 AND 10000 |
| Cardinality | Is the distinct-value count reasonable? | distinct(country) ≤ 250 |
| Categorical drift | Is the distribution of values stable? | % of orders with status='cancelled' stays within 3-7% |
Null-rate checks
# Soda check
checks for orders:
- missing_count(customer_id) = 0 # PK never NULL
- missing_percent(email) < 1.0
- missing_percent(phone) < 5.0 # phone is optional, so tolerate more nulls
Uniqueness — the silent killer
Duplicate rows in a "unique" column don't break joins immediately — they just make every downstream aggregation wrong. The check:
# dbt schema test
columns:
- name: order_id
tests:
- unique
- not_null
For composite keys, use the dbt_utils.unique_combination_of_columns generic test, applied at the model level (model name illustrative):
models:
  - name: fct_sessions
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - user_id
            - session_id
Value range — the most-skipped check
# Plausibility checks save you from data-corruption nightmares
checks for transactions:
- invalid_count(amount) = 0:
valid min: 0
valid max: 100000 # > $100K is alarming
- invalid_count(country_code) = 0:
    valid regex: "^[A-Z]{2}$" # ISO 3166-1 alpha-2 codes
- invalid_count(age) = 0:
valid min: 0
valid max: 120
Categorical drift — the slow leak
"Cancelled rate is normally 5%; today it's 22%" is the early signal of an upstream issue (payment processor down, fraud spike, deploy bug). Distribution drift catches this.
-- % distribution of order status yesterday vs the prior-7-day baseline
WITH latest AS (
  SELECT status, COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS share
  FROM orders WHERE DATE(created_at) = CURRENT_DATE - 1
  GROUP BY status
),
baseline AS (
  SELECT status, AVG(daily_share) AS expected_share
  FROM (
    SELECT DATE(created_at) AS day, status,
           COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (PARTITION BY DATE(created_at)) AS daily_share
    FROM orders WHERE created_at >= CURRENT_DATE - INTERVAL '8 days'
                  AND created_at < CURRENT_DATE - INTERVAL '1 day'  -- exclude the day under test
    GROUP BY day, status
  ) daily GROUP BY status                                           -- subquery alias required
)
SELECT l.status, l.share AS latest_share, b.expected_share,
       ABS(l.share - b.expected_share) AS drift
FROM latest l JOIN baseline b USING (status)
-- (a LEFT JOIN plus COALESCE(b.expected_share, 0) also flags brand-new status values)
WHERE ABS(l.share - b.expected_share) > 0.05;
-- > 5pp drift on any status = alert
Cardinality — the schema-rot detector
distinct(country) = 250 is fine; = 50000 means someone's writing free-text where there should be ISO codes. Cardinality bounds catch upstream-validation rot.
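The check itself is one aggregate (the 250 bound comes from the example above):
-- Cardinality bound check; any row returned = bound breached
SELECT COUNT(DISTINCT country) AS distinct_countries
FROM dim_customer
HAVING COUNT(DISTINCT country) > 250;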
Referential checks — cross-table integrity.
Cross-table referential integrity is what makes a warehouse trustworthy. "Every order has a customer" sounds obvious; in practice, it breaks every week because the customer dim has a different freshness SLA than the orders fact.
The two failure modes
| Mode | Cause |
|---|---|
| Orphan facts | Fact row references a dim row that doesn't exist (yet — late-arriving dim) or never existed (data corruption) |
| Dim with no fact | Dim row exists but no fact rows reference it (legitimate for new entities; alarming if old) |
The orphan-fact check
# dbt relationships test (YAML)
columns:
- name: customer_id
tests:
- relationships:
to: ref('dim_customer')
field: customer_id
-- Equivalent SQL
SELECT COUNT(*) AS orphan_count
FROM fact_orders f
LEFT JOIN dim_customer d ON d.customer_id = f.customer_id
WHERE d.customer_id IS NULL
AND f.customer_id IS NOT NULL;
The freshness-mismatch trap
The fact refreshes hourly; the dim refreshes daily. For a few hours, every order from a brand-new customer looks like an orphan, and a naive check pages on-call for data that is merely late. The fix is to classify orphans instead of just counting them:
The LAD-aware referential check
-- "Orphans" that are actually late-arriving dims (LAD)
WITH classified AS (
SELECT f.order_id, f.customer_id,
CASE
WHEN d.customer_id IS NOT NULL THEN 'matched'
WHEN f.created_at > CURRENT_TIMESTAMP - INTERVAL '24 hours' THEN 'lad_pending'
ELSE 'true_orphan'
END AS status
FROM fact_orders f
LEFT JOIN dim_customer d ON d.customer_id = f.customer_id
WHERE DATE(f.created_at) = CURRENT_DATE - 1
)
SELECT status, COUNT(*) AS row_count  -- 'rows' is a reserved word in several engines
FROM classified GROUP BY status;
-- Alert only on true_orphan, not lad_pending
The cross-table count reconciliation
Beyond row-level integrity, aggregate-level reconciliation catches subtle bugs:
-- Order line items should sum to the order total
SELECT o.order_id,
o.total AS order_total,
SUM(li.line_amount) AS computed_from_lines,
ABS(o.total - SUM(li.line_amount)) AS discrepancy
FROM orders o
JOIN order_line_items li ON li.order_id = o.order_id
GROUP BY o.order_id, o.total
HAVING ABS(o.total - SUM(li.line_amount)) > 0.01;
-- Any row = order total disagrees with line sum
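-- Caveat: the inner join skips orders with no line items at all;
-- a LEFT JOIN plus COALESCE(SUM(li.line_amount), 0) catches those too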
Cross-pipeline reconciliation
Two systems should agree on the same number. "Stripe says we processed $X yesterday; warehouse says $Y." The reconciliation:
| Source A | Source B | Acceptable diff |
|---|---|---|
| Stripe API total | Warehouse fact_payments SUM | < 0.01% (rounding only) |
| Application DAU | Mixpanel / Amplitude DAU | < 5% (different definitions of "active") |
| Salesforce ARR | Finance close ARR | 0% — must match for audit |
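A sketch of the first row, assuming Stripe's daily totals are synced into a raw table (stripe_daily_totals and its columns are made-up names):
SELECT s.day, s.gross_amount AS stripe_total, w.wh_total,
       ABS(s.gross_amount - w.wh_total) / NULLIF(s.gross_amount, 0) AS pct_diff
FROM stripe_daily_totals s
JOIN (
  SELECT DATE(paid_at) AS day, SUM(amount) AS wh_total
  FROM fact_payments
  GROUP BY DATE(paid_at)
) w USING (day)
WHERE s.day = CURRENT_DATE - 1
  AND ABS(s.gross_amount - w.wh_total) / NULLIF(s.gross_amount, 0) > 0.0001;
-- Any row = warehouse and Stripe disagree beyond the 0.01% tolerance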
Lineage, freshness SLAs & the on-call playbook.
Checks that fire are useful only if someone responds. The operational layer turns DQ from a build-time thing into a continuous-reliability thing.
Lineage — what depends on what
Lineage is the DAG of which tables are derived from which sources. When a check fires upstream, lineage tells you which 40 downstream dashboards are about to break.
| Source | What it generates |
|---|---|
| dbt | Auto-generated DAG via ref() calls; dbt docs renders it |
| OpenLineage | Cross-tool lineage events (Airflow + dbt + Spark + Flink) |
| DataHub / Atlan / OpenMetadata | Centralized catalog with lineage UI + impact analysis |
| Column-level lineage | Tracks which downstream column came from which upstream column — required for GDPR + sensitive-data tracking |
Freshness SLAs — the contract with consumers
Every important table publishes a freshness SLA: how recent the data must be at any given moment.
# dbt source freshness — the simplest form
sources:
- name: raw
tables:
- name: orders
loaded_at_field: ingested_at
freshness:
warn_after: { count: 1, period: hour }
error_after: { count: 4, period: hour }
Translation: warn if no new rows for 1 hour; error (and page) if none for 4. Each table has a different criticality:
| Table type | Typical freshness SLA |
|---|---|
| Real-time event stream (clickstream) | < 5 minutes |
| Hourly fact (orders, payments) | < 1 hour warn / < 4 hours error |
| Daily dim (customers, products) | < 24 hours |
| Monthly snapshot (subscription state) | < 7 days from month-end |
The on-call playbook — what fires, what to do
| Alert | First diagnostic step | Common root cause |
|---|---|---|
| Schema drift | Check information_schema diff vs expected | Producer deploy added/removed column without notice |
| Volume drop | Check upstream Kafka lag / source DB query volume | Producer / connector outage; rarely warehouse-side |
| Volume spike | Check for double-write / replay events | Pipeline restart replayed events; bot traffic; deploy bug |
| Null-rate spike | Trace specific column to producer; check producer's recent commits | Refactor dropped a populating call |
| Distribution drift | Compare value mix today vs yesterday vs same-DoW-last-week | Upstream data source change; A/B experiment ramped; seasonal effect |
| Referential orphans | Check dim freshness vs fact freshness | Dim refresh job stalled; LAD that didn't catch up |
| Freshness breach | Check ingestion job logs / Airflow DAG state | Source connector down; orchestrator stalled; permission issue |
The "circuit breaker" pattern
When upstream data quality fails, propagating it to consumers is worse than not refreshing. Circuit-break the pipeline:
# Airflow DAG with a quality gate
[ extract → load_to_bronze → DQ_GATE → transform_silver → ... ]
                                │
                                └─▶ if it fails: STOP propagation;
                                    page on-call;
                                    consumers see yesterday's data
                                    until the issue is fixed.
Better to serve slightly stale data than wrong data.
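One hedged sketch of the gate itself: a query the orchestrator runs between bronze and silver, failing the task on any returned row (dq_check_results and its columns are illustrative):
-- DQ gate: any row returned = a blocking check failed; halt the DAG here
SELECT check_name
FROM dq_check_results
WHERE table_name = 'bronze_orders'
  AND run_date = CURRENT_DATE
  AND passed = FALSE
  AND severity = 'blocking';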
Quality scoring — the "is the table healthy?" rollup
For executive dashboards, roll up dozens of checks into one number:
quality_score = (1 - failed_checks / total_checks) × freshness_score × 100
where freshness_score = max(0, 1 - (hours_stale / SLA_hours))
Each table gets a score 0-100; published in the data catalog. Consumers see "fact_orders: 98 / 100" before they trust the dashboard built from it.
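A minimal rollup, assuming check results and staleness land in two metadata tables (all names illustrative, consistent with the gate sketch above):
-- dq_check_results(table_name, check_name, run_date, passed, severity)
-- dq_freshness(table_name, hours_stale, sla_hours)
SELECT r.table_name,
       ROUND((1.0 - AVG(CASE WHEN r.passed THEN 0 ELSE 1 END))
             * GREATEST(0, 1 - f.hours_stale * 1.0 / f.sla_hours) * 100) AS quality_score
FROM dq_check_results r
JOIN dq_freshness f USING (table_name)
WHERE r.run_date = CURRENT_DATE
GROUP BY r.table_name, f.hours_stale, f.sla_hours;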
The DQ post-mortem template
Every Sev-1 DQ incident gets a postmortem with:
- Detection time — when did the alert fire?
- Impact — which dashboards / consumers / decisions were affected?
- Root cause — schema / volume / distribution / referential / freshness?
- Why didn't existing checks catch it? Coverage gap.
- New check added — concrete addition to prevent recurrence.
The 90-second articulation script.
▸ THE 90-SECOND SCRIPT
"My DQ framework has four orthogonal categories: schema, volume, distribution, referential. Every check maps to one; every category has its own remediation pattern. Coverage matrix: every important table has at least one check in each."
"Schema is build-time enforced via dbt contracts for our models, runtime via information_schema diff for source tables. Renames and type changes go through 90-day deprecation windows, never in-place."
"Volume is anomaly-based, not fixed-threshold. Day-over-day catches sudden breaks; same-DoW-last-4-weeks handles seasonality; Z-score > 3 is the alert. Fixed thresholds are a treadmill of tuning."
"Distribution covers null %, uniqueness, value range, cardinality, and categorical drift. Categorical drift is the slow-leak detector — 'cancellation rate jumped from 5% to 22%' is how you catch the payment-processor incident before billing's dashboard does."
"Referential classifies orphans into LAD-pending (legitimate, late-arriving dim) vs true orphans (page). Cross-pipeline reconciliation is its own check class — finance facts must match Stripe to the cent; engagement metrics can have 5% definitional drift."
"Operations: lineage from OpenLineage, freshness SLAs per-table in the catalog, on-call playbook keyed to alert type. Circuit-breaker pattern halts propagation when upstream fails — yesterday's good data beats today's broken data. Every Sev-1 gets a postmortem with a new check; the framework grows with every failure."
Three sentences that signal seniority — in any DQ round
- "Fixed thresholds don't survive seasonality — every volume check needs to be anomaly-based with same-DoW comparison."
- "Most production data corruption isn't a schema break — it's distribution drift. The 5%-to-22% cancellation rate is the canary in the coalmine."
- "Better to serve yesterday's good data than today's broken data. The circuit-breaker pattern is non-negotiable."