
Data Quality & Reliability — from "tests" to a DQ framework.

The round where they ask "how do you know your pipeline is healthy?". Schema, volume, distribution, referential checks. Lineage. Freshness SLAs. The on-call playbook. Seven sections, every check articulable in one sentence, every alert actionable.

§ 01 — The framework

The four check categories — schema, volume, distribution, referential.

"Data quality" is too broad to be useful. The interview-grade framing breaks it into four orthogonal categories. Every alert maps to one of these; every category has a different remediation pattern; every team can reason about coverage gaps because the buckets don't overlap.

The four buckets

Category | Question it answers | Example check
Schema | Is the structure correct? | Column email exists, type VARCHAR, NOT NULL
Volume | Did we get the right amount? | Row count yesterday between 80-120% of the 7-day average
Distribution | Are the values plausible? | Null % < 0.5%; uniqueness on user_id; value range 0-100
Referential | Does it join cleanly to other tables? | Every order.customer_id exists in customers

Why this taxonomy works

Different categories fail differently and need different responses:

  • Schema failures = pipeline broken, page on-call immediately. Loud failure, fast fix (revert / hotfix the producer).
  • Volume failures = upstream incident OR downstream consumer issue. Investigation needed before action.
  • Distribution failures = data source quality issue OR business reality changed. Often a slow drift, not a sudden break.
  • Referential failures = cross-team coordination problem. The two tables drift out of sync because they have different freshness SLAs.

The check-per-table coverage matrix

Every important table should have at least one check in each category:

fact_orders:
  schema:       column types match contract; PK uniqueness
  volume:       row count today within 75-125% of L7 avg
  distribution: null(customer_id) = 0; null(amount) < 0.1%; amount range $0-$10K
  referential:  customer_id ∈ dim_customer; product_id ∈ dim_product

dim_customer:
  schema:       email format; phone format
  volume:       new rows ≥ 0 (never shrinks)
  distribution: distinct(country) ≤ 250; null(email) < 1%
  referential:  region_id ∈ dim_region
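
The matrix above, expressed as plain SQL — a minimal sketch with illustrative column names and thresholds; each returned row is one failing check. The volume check needs the anomaly baseline from § 03, so it's omitted here.

SELECT 'schema: duplicate order_id (PK)' AS failed_check
FROM fact_orders GROUP BY order_id HAVING COUNT(*) > 1
UNION ALL
SELECT 'distribution: NULL customer_id'
FROM fact_orders WHERE customer_id IS NULL
UNION ALL
SELECT 'distribution: amount outside $0-$10K'
FROM fact_orders WHERE amount NOT BETWEEN 0 AND 10000
UNION ALL
SELECT 'referential: customer_id not in dim_customer'
FROM fact_orders f
LEFT JOIN dim_customer d ON d.customer_id = f.customer_id
WHERE d.customer_id IS NULL AND f.customer_id IS NOT NULL;
-- Any row returned = one failing check on fact_orders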

The four common DQ tools

Tool | Strength | Weakness
dbt tests | Built into dbt (not_null, unique, accepted_values, relationships); free | Limited to schema + simple distribution; no anomaly detection
Great Expectations | Rich expectation library; detailed result docs | Heavy to operate; alerting weak out of the box
Soda | Lightweight, YAML-based; integrates with Airflow / dbt | Less mature than GE for complex checks
Monte Carlo / Datafold (commercial observability) | ML-based anomaly detection on volume + distribution; auto-discovered lineage | $$$; vendor lock-in
Senior signal. The framework matters more than the tool. "I'd cover schema and referential with dbt tests because they're versioned with the model. Volume + distribution drift gets a vendor — Monte Carlo or Datafold — because anomaly detection needs ML and we don't want to build that. Lineage comes from OpenLineage emitted by dbt + Airflow."
§ 02 — Schema

Schema checks — the contract layer.

Schema is the easiest category to enforce and the most damaging when it fails. The good news: it's mostly automatable. The bad news: most teams only detect schema drift at runtime instead of enforcing a contract up front.

The four schema check types

Check | What it verifies
Column existence | Does the column appear in the table at all?
Column type | Is it INT vs VARCHAR vs DATE as expected?
Nullability | Is the NOT NULL constraint declared and enforced?
Constraint enforcement | Are PK, UNIQUE, FK declared and active?

dbt model contracts (1.5+)

models:
  - name: dim_user
    config:
      contract:
        enforced: true
    columns:
      - name: user_id
        data_type: bigint
        constraints: [{ type: not_null }, { type: primary_key }]
      - name: email
        data_type: varchar
        constraints: [{ type: not_null }]
      - name: created_at
        data_type: timestamp
        constraints: [{ type: not_null }]

The model fails to build if the SELECT output schema drifts. This is build-time enforcement — the warehouse equivalent of a schema registry.

Schema checks for raw / source tables (no contract)

Source tables are owned by other teams and don't have dbt contracts. Detect drift via information_schema:

-- Daily check job, comparing today's columns to expected
SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'raw' AND table_name = 'orders'
EXCEPT
SELECT table_name, column_name, data_type, is_nullable
FROM dq_expected_schema
WHERE table_name = 'orders';
-- Any row returned = drift; alert

The "added column" trap

Trap. Producer adds a column. Consumer ignores it (no break). Six months later, an analyst finds the column and expects historical data — but it's only populated forward. Always alert on new columns even when they don't break anything; document them in the schema registry; backfill if business-relevant.
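
A sketch of that alert, reusing the dq_expected_schema table from the diff above — one branch for columns that appeared, one for columns that vanished:

-- Columns present today but not registered (added / retyped) — alert, don't fail
(SELECT 'added_or_changed' AS drift_kind, column_name, data_type
 FROM information_schema.columns
 WHERE table_schema = 'raw' AND table_name = 'orders'
 EXCEPT
 SELECT 'added_or_changed', column_name, data_type
 FROM dq_expected_schema
 WHERE table_name = 'orders')
UNION ALL
-- Columns registered but gone from the table (dropped) — fail loudly
(SELECT 'dropped', column_name, data_type
 FROM dq_expected_schema
 WHERE table_name = 'orders'
 EXCEPT
 SELECT 'dropped', column_name, data_type
 FROM information_schema.columns
 WHERE table_schema = 'raw' AND table_name = 'orders');
-- Added columns aren't failures, but they still get an alert and a registry entry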

Schema evolution rules — what's safe vs not

Change | Safe? | How to ship
Add nullable column | Yes | Producer-first; consumer ignores it until needed
Add NOT NULL column with default | Yes | Same as above
Add NOT NULL column without default | No | Add as nullable first; backfill; then enforce NOT NULL
Drop column | No | Deprecation flag → 90 days → drop
Rename column | No | Add new + dual-populate → migrate consumers → drop old
Change column type | No | Add new typed column; migrate; drop old
Tighten NOT NULL on existing column | No | Backfill existing nulls → then add constraint
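
The rename row is the one that trips people up. A minimal DDL sketch of add-new + dual-populate, with illustrative names (email → email_address on dim_customer); the drop only happens once consumers have migrated:

-- Phase 1: add the new column and dual-populate (producer writes both)
ALTER TABLE dim_customer ADD COLUMN email_address VARCHAR;
UPDATE dim_customer SET email_address = email WHERE email_address IS NULL;  -- backfill

-- Phase 2, after the 90-day window and lineage shows zero readers of the old name:
-- ALTER TABLE dim_customer DROP COLUMN email;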
Senior signal. "Schema enforcement is build-time for our models (dbt contracts) and runtime for source tables (information_schema diff). Every change is classified backward-compatible or not; non-compatible ones go through a migration window. Anything that drops or renames goes through a 90-day deprecation window — add the replacement, migrate consumers, then remove the old column."
§ 03 — Volume

Volume checks — anomaly detection on row counts.

The simplest check that catches the most production issues: "did we get the right number of rows today?". Implemented naively, it pages constantly; implemented well, it catches incidents 30 minutes earlier than any other signal.

The fixed-threshold trap

Trap. "Alert if row count < 1M" is the wrong shape. On a Tuesday it's right; on Black Friday it's noise; on Christmas Day it's a false negative. Fixed thresholds are a treadmill of tuning. Use anomaly detection.

The simplest anomaly check

-- Compare the latest complete day ("today" = yesterday) to the prior 7 days' average + std-dev
WITH baseline AS (
  SELECT AVG(daily_count) AS mean, STDDEV(daily_count) AS sd
  FROM (
    SELECT DATE(created_at) AS day, COUNT(*) AS daily_count
    FROM orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '8 days'
      AND created_at <  CURRENT_DATE - INTERVAL '1 day'   -- exclude the day under test
    GROUP BY day
  ) daily
),
today AS (
  SELECT COUNT(*) AS cnt FROM orders WHERE DATE(created_at) = CURRENT_DATE - 1
)
SELECT
  today.cnt,
  baseline.mean,
  baseline.sd,
  ABS(today.cnt - baseline.mean) / NULLIF(baseline.sd, 0) AS z_score
FROM today, baseline;
-- Alert if |z_score| > 3

Seasonality is the gotcha

"7-day average" works for stable workloads; fails for weekly seasonality (B2B traffic drops 80% on weekends). The fix: compare to same-day-of-week last 4 weeks:

WHERE DATE(created_at) IN (
  CURRENT_DATE - INTERVAL '7 days',
  CURRENT_DATE - INTERVAL '14 days',
  CURRENT_DATE - INTERVAL '21 days',
  CURRENT_DATE - INTERVAL '28 days'
)
-- Compare against the same weekday over the last 4 weeks, not against yesterday

Production-grade volume monitoring

Real volume monitors do three things at once — a combined sketch follows the list:

  1. Day-over-day — quick spike / dip alarm
  2. Week-over-week (same DoW) — seasonality-aware baseline
  3. Monthly drift — slow trend detection (catches gradual leaks)
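
A minimal sketch of that monitor, assuming Postgres syntax and illustrative thresholds — one query, three baselines for yesterday's row count:

WITH daily AS (
  SELECT DATE(created_at) AS day, COUNT(*) AS cnt
  FROM orders
  WHERE created_at >= CURRENT_DATE - INTERVAL '35 days'
    AND created_at <  CURRENT_DATE
  GROUP BY day
)
SELECT
  (SELECT cnt FROM daily WHERE day = CURRENT_DATE - 1)      AS yesterday,
  (SELECT cnt FROM daily WHERE day = CURRENT_DATE - 2)      AS day_before,          -- 1. day-over-day
  (SELECT AVG(cnt) FROM daily
     WHERE EXTRACT(DOW FROM day) = EXTRACT(DOW FROM CURRENT_DATE - 1)
       AND day < CURRENT_DATE - 1)                          AS same_dow_baseline,   -- 2. same-DoW last 4 weeks
  (SELECT AVG(cnt) FROM daily WHERE day < CURRENT_DATE - 1) AS monthly_baseline;    -- 3. ~30-day trend
-- Alert logic lives in the orchestrator: page on a large day-over-day or same-DoW
-- deviation; warn on slow divergence from the monthly baseline.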

When zero rows is normal vs alarming

Table type | Zero rows means…
Event stream (orders, clicks) | Almost always alarming — pipeline broken
Error log | Best case (no errors); only alarming if logging itself broke
Daily snapshot | Always alarming — the snapshot job must run
Anomaly / fraud-flag table | Often expected — most days are clean

Encode the expectation explicitly: min_rows_per_day, max_rows_per_day, expected_zero_rows: false.
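
One way to encode that explicitly, as a sketch — both tables here are hypothetical (dq_volume_expectations hand-maintained per table, daily_row_counts written by a counting job):

SELECT e.table_name, c.cnt,
       CASE
         WHEN c.cnt = 0 AND NOT e.expected_zero_rows THEN 'alert: unexpected zero rows'
         WHEN c.cnt < e.min_rows_per_day             THEN 'alert: below minimum'
         WHEN c.cnt > e.max_rows_per_day             THEN 'alert: above maximum'
         ELSE 'ok'
       END AS verdict
FROM dq_volume_expectations e
JOIN daily_row_counts c
  ON c.table_name = e.table_name
 AND c.day = CURRENT_DATE - 1;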

The freshness-volume cousin

Volume = how many rows. Freshness = how recent. Both matter; both use similar anomaly logic. Freshness is covered in § 06; just note that the patterns mirror each other.

Senior signal. "Volume checks are anomaly-based, not fixed-threshold. Day-over-day catches sudden breaks; same-DoW-last-4-weeks handles weekly seasonality; monthly trend catches slow drift. Z-score > 3 is the alert; auto-resolve at < 2 to prevent flapping. We never page on zero rows except for tables we've explicitly tagged as 'must produce'."
§ 04 — Distribution

Distribution checks — null %, uniqueness, value drift.

Schema says "the column exists". Volume says "rows are present". Distribution says "the values are plausible". This is where data quality gets subtle — and where most production data corruption lives undetected.

The five distribution check families

Family | Question | Example
Null rate | How often is this column NULL? | null(customer_email) < 0.5%
Uniqueness | Are unique columns actually unique? | distinct(order_id) = count(*)
Value range | Are values within plausible bounds? | amount BETWEEN 0 AND 10000
Cardinality | Is the distinct-value count reasonable? | distinct(country) ≤ 250
Categorical drift | Is the distribution of values stable? | % of orders with status='cancelled' between 3-7%

Null-rate checks

# Soda check
checks for orders:
  - missing_count(customer_id) = 0          # PK never NULL
  - missing_percent(email) < 1.0
  - missing_percent(phone) < 5.0            # phone is optional but rare

Uniqueness — the silent killer

Duplicate rows in a "unique" column don't break joins immediately — they just make every downstream aggregation wrong. The check:

# dbt schema test
columns:
  - name: order_id
    tests:
      - unique
      - not_null

For composite keys, use the dbt_utils.unique_combination_of_columns test, configured in YAML with combination_of_columns: [user_id, session_id].

Value range — the most-skipped check

# Plausibility checks save you from data-corruption nightmares
checks for transactions:
  - invalid_count(amount) = 0:
      valid min: 0
      valid max: 100000           # > $100K is alarming
  - invalid_count(country_code) = 0:
      valid regex: '^[A-Z]{2}$'   # ISO 3166-1 alpha-2
  - invalid_count(age) = 0:
      valid min: 0
      valid max: 120

Categorical drift — the slow leak

"Cancelled rate is normally 5%; today it's 22%" is the early signal of an upstream issue (payment processor down, fraud spike, deploy bug). Distribution drift catches this.

-- % distribution of order.status today vs L7 baseline
WITH today AS (
  SELECT status, COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS share
  FROM orders WHERE DATE(created_at) = CURRENT_DATE - 1
  GROUP BY status
),
baseline AS (
  SELECT status, AVG(daily_share) AS expected_share
  FROM (
    SELECT DATE(created_at) AS day, status,
      COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (PARTITION BY DATE(created_at)) AS daily_share
    FROM orders WHERE created_at >= CURRENT_DATE - INTERVAL '8 days'
                  AND created_at < CURRENT_DATE - INTERVAL '1 day'   -- exclude the day under test
    GROUP BY day, status
  ) daily GROUP BY status
)
SELECT t.status, t.share AS today_share, b.expected_share,
       ABS(t.share - b.expected_share) AS drift
FROM today t JOIN baseline b USING (status)
WHERE ABS(t.share - b.expected_share) > 0.05;
-- > 5pp drift on any status = alert

Cardinality — the schema-rot detector

distinct(country) = 250 is fine; = 50000 means someone's writing free-text where there should be ISO codes. Cardinality bounds catch upstream-validation rot.
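
As a check it's one query — threshold from the example above, table name illustrative:

SELECT COUNT(DISTINCT country) AS country_cardinality
FROM dim_customer
HAVING COUNT(DISTINCT country) > 250;
-- Any row returned = cardinality blown; upstream validation has rotted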

The 95-percent rule. A distribution check should pass well over 95% of the time — ideally > 99%. If a check fires on 5% of runs, the threshold is wrong — re-tune it. Persistently flaky checks erode trust in the alert system.
Senior signal. "Distribution checks are five families: null %, uniqueness, value range, cardinality, categorical drift. Categorical drift is the slow-leak detector — 'cancelled rate jumped from 5% to 22%' is how you find the payment-processor incident 30 minutes before billing's dashboard tells you. Threshold is anomaly-based with same-DoW seasonality."
§ 05 — Referential

Referential checks — cross-table integrity.

Cross-table referential integrity is what makes a warehouse trustworthy. "Every order has a customer" sounds obvious; in practice, it breaks every week because the customer dim has a different freshness SLA than the orders fact.

The two failure modes

Mode | Cause
Orphan facts | Fact row references a dim row that doesn't exist yet (late-arriving dim) or never existed (data corruption)
Dim with no fact | Dim row exists but no fact rows reference it (legitimate for new entities; alarming if old)

The orphan-fact check

# dbt relationships test
columns:
  - name: customer_id
    tests:
      - relationships:
          to: ref('dim_customer')
          field: customer_id

-- Equivalent SQL
SELECT COUNT(*) AS orphan_count
FROM fact_orders f
LEFT JOIN dim_customer d ON d.customer_id = f.customer_id
WHERE d.customer_id IS NULL
  AND f.customer_id IS NOT NULL;

The freshness-mismatch trap

Trap. Fact updates hourly; dim updates nightly. Between midnight (when dim refreshes) and 1am (when first fact arrives with new entities from the day), every fact row referencing a brand-new entity is technically an orphan. Tolerance window: allow N% orphans for entities < 24 hours old; alert only on persistent orphans.

The LAD-aware referential check

-- "Orphans" that are actually late-arriving dims (LAD)
WITH classified AS (
  SELECT f.order_id, f.customer_id,
    CASE
      WHEN d.customer_id IS NOT NULL              THEN 'matched'
      WHEN f.created_at > CURRENT_TIMESTAMP - INTERVAL '24 hours' THEN 'lad_pending'
      ELSE 'true_orphan'
    END AS status
  FROM fact_orders f
  LEFT JOIN dim_customer d ON d.customer_id = f.customer_id
  WHERE DATE(f.created_at) = CURRENT_DATE - 1
)
SELECT status, COUNT(*) AS rows
FROM classified GROUP BY status;
-- Alert only on true_orphan, not lad_pending

The cross-table count reconciliation

Beyond row-level integrity, aggregate-level reconciliation catches subtle bugs:

-- Order line items should sum to the order total
SELECT o.order_id,
  o.total                           AS order_total,
  SUM(li.line_amount)               AS computed_from_lines,
  ABS(o.total - SUM(li.line_amount)) AS discrepancy
FROM orders o
JOIN order_line_items li ON li.order_id = o.order_id
GROUP BY o.order_id, o.total
HAVING ABS(o.total - SUM(li.line_amount)) > 0.01;
-- Any row = order total disagrees with line sum

Cross-pipeline reconciliation

Two systems should agree on the same number. "Stripe says we processed $X yesterday; warehouse says $Y." The reconciliation:

Source A | Source B | Acceptable diff
Stripe API total | Warehouse fact_payments SUM | < 0.01% (rounding only)
Application DAU | Mixpanel / Amplitude DAU | < 5% (different definitions of "active")
Salesforce ARR | Finance close ARR | 0% — must match for audit
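
A minimal sketch of the first row — stripe_daily_totals is a hypothetical staging table loaded from the Stripe API (day, gross_amount); tolerance as in the table above:

SELECT s.day,
       s.gross_amount                        AS stripe_total,
       SUM(p.amount)                         AS warehouse_total,
       ABS(s.gross_amount - SUM(p.amount))
         / NULLIF(s.gross_amount, 0)         AS relative_diff
FROM stripe_daily_totals s
JOIN fact_payments p ON DATE(p.paid_at) = s.day
WHERE s.day = CURRENT_DATE - 1
GROUP BY s.day, s.gross_amount
HAVING ABS(s.gross_amount - SUM(p.amount)) / NULLIF(s.gross_amount, 0) > 0.0001;
-- Any row = Stripe and the warehouse disagree by more than 0.01%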
Senior signal. "Referential checks classify orphans into LAD-pending vs true. Only true orphans page; LAD-pending is logged for trend analysis. Aggregate reconciliation across pipelines (Stripe vs warehouse) is its own check class with explicit tolerance — finance facts must match exactly, engagement metrics can have small definitional drift."
§ 06 — Operations

Lineage, freshness SLAs & the on-call playbook.

Checks that fire are useful only if someone responds. The operational layer turns DQ from a build-time thing into a continuous-reliability thing.

Lineage — what depends on what

Lineage is the DAG of which tables are derived from which sources. When a check fires upstream, lineage tells you which 40 downstream dashboards are about to break.

Layer / tool | What it provides
dbt | Auto-generated DAG via ref() calls; dbt docs renders it
OpenLineage | Cross-tool lineage events (Airflow + dbt + Spark + Flink)
DataHub / Atlan / OpenMetadata | Centralized catalog with lineage UI + impact analysis
Column-level lineage | Tracks which downstream column came from which upstream column — required for GDPR and sensitive-data tracking

Freshness SLAs — the contract with consumers

Every important table publishes a freshness SLA: how recent the data must be at any given moment.

# dbt source freshness — the simplest form
sources:
  - name: raw
    tables:
      - name: orders
        loaded_at_field: ingested_at
        freshness:
          warn_after: { count: 1, period: hour }
          error_after: { count: 4, period: hour }

Translation: warn if no new rows for 1 hour; error — and page — after 4 hours. Each table type has a different criticality:

Table type | Typical freshness SLA
Real-time event stream (clickstream) | < 5 minutes
Hourly fact (orders, payments) | < 1 hour warn / < 4 hours error
Daily dim (customers, products) | < 24 hours
Monthly snapshot (subscription state) | < 7 days from month-end

The on-call playbook — what fires, what to do

Alert | First diagnostic step | Common root cause
Schema drift | Check information_schema diff vs expected | Producer deploy added/removed a column without notice
Volume drop | Check upstream Kafka lag / source DB query volume | Producer / connector outage; rarely warehouse-side
Volume spike | Check for double-write / replay events | Pipeline restart replayed events; bot traffic; deploy bug
Null-rate spike | Trace the specific column to its producer; check the producer's recent commits | Refactor dropped a populating call
Distribution drift | Compare value mix today vs yesterday vs same DoW last week | Upstream data source change; A/B experiment ramped; seasonal effect
Referential orphans | Check dim freshness vs fact freshness | Dim refresh job stalled; LAD that didn't catch up
Freshness breach | Check ingestion job logs / Airflow DAG state | Source connector down; orchestrator stalled; permission issue

The "circuit breaker" pattern

When upstream data quality fails, propagating it to consumers is worse than not refreshing. Circuit-break the pipeline:

# Airflow DAG with a quality gate
[ extract → load_to_bronze → DQ_GATE → transform_silver → ... ]
                              │
                              └─▶ if fails:  STOP propagation;
                                            page on-call;
                                            consumers see yesterday's data
                                            until the issue is fixed.

Better to serve slightly stale data than wrong data.
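
A sketch of what the DQ_GATE task actually runs — dq_check_results is a hypothetical results table written by the check jobs; any row returned opens the circuit:

SELECT check_name, severity
FROM dq_check_results
WHERE table_name = 'bronze.orders'
  AND run_at >= CURRENT_DATE          -- this run only
  AND severity = 'blocking'
  AND NOT passed;
-- Rows returned → fail the gate task; downstream models keep serving yesterday's data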

Quality scoring — the "is the table healthy?" rollup

For executive dashboards, roll up dozens of checks into one number:

quality_score = (1 - failed_checks / total_checks) × freshness_score × 100
where freshness_score = max(0, 1 - (hours_stale / SLA_hours))

Each table gets a score 0-100; published in the data catalog. Consumers see "fact_orders: 98 / 100" before they trust the dashboard built from it.
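
A minimal sketch of the rollup, assuming hypothetical dq_check_results (table_name, check_name, passed, run_at) and dq_freshness (table_name, hours_stale, sla_hours) tables populated by the check jobs:

SELECT r.table_name,
       ROUND(
         (1.0 - AVG(CASE WHEN r.passed THEN 0 ELSE 1 END))                  -- 1 - failed/total
         * GREATEST(0, 1.0 - f.hours_stale * 1.0 / NULLIF(f.sla_hours, 0))  -- freshness_score
         * 100
       ) AS quality_score
FROM dq_check_results r
JOIN dq_freshness f ON f.table_name = r.table_name
WHERE r.run_at >= CURRENT_DATE
GROUP BY r.table_name, f.hours_stale, f.sla_hours;
-- Published per table in the catalog: "fact_orders: 98 / 100"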

The DQ post-mortem template

Every Sev-1 DQ incident gets a postmortem with:

  1. Detection time — when did the alert fire?
  2. Impact — which dashboards / consumers / decisions were affected?
  3. Root cause — schema / volume / distribution / referential / freshness?
  4. Why didn't existing checks catch it? Coverage gap.
  5. New check added — concrete addition to prevent recurrence.
Senior signal. "Lineage from OpenLineage / dbt; freshness SLAs published per-table in the catalog; on-call playbook keyed to alert type. The circuit-breaker pattern stops bad data from propagating — better to serve yesterday's good data than today's broken data. Every Sev-1 incident gets a postmortem with a new check added — the framework grows with every failure."
§ 07 — Articulation

The 90-second articulation script.

▸ THE 90-SECOND SCRIPT

"My DQ framework has four orthogonal categories: schema, volume, distribution, referential. Every check maps to one; every category has its own remediation pattern. Coverage matrix: every important table has at least one check in each."

"Schema is build-time enforced via dbt contracts for our models, runtime via information_schema diff for source tables. Renames and type changes go through 90-day deprecation windows, never in-place."

"Volume is anomaly-based, not fixed-threshold. Day-over-day catches sudden breaks; same-DoW-last-4-weeks handles seasonality; Z-score > 3 is the alert. Fixed thresholds are a treadmill of tuning."

"Distribution covers null %, uniqueness, value range, cardinality, and categorical drift. Categorical drift is the slow-leak detector — 'cancellation rate jumped from 5% to 22%' is how you catch the payment-processor incident before billing's dashboard does."

"Referential classifies orphans into LAD-pending (legitimate, late-arriving dim) vs true orphans (page). Cross-pipeline reconciliation is its own check class — finance facts must match Stripe to the cent; engagement metrics can have 5% definitional drift."

"Operations: lineage from OpenLineage, freshness SLAs per-table in the catalog, on-call playbook keyed to alert type. Circuit-breaker pattern halts propagation when upstream fails — yesterday's good data beats today's broken data. Every Sev-1 gets a postmortem with a new check; the framework grows with every failure."

Three sentences that signal seniority — in any DQ round

  1. "Fixed thresholds don't survive seasonality — every volume check needs to be anomaly-based with same-DoW comparison."
  2. "Most production data corruption isn't a schema break — it's distribution drift. The 5%-to-22% cancellation rate is the canary in the coalmine."
  3. "Better to serve yesterday's good data than today's broken data. The circuit-breaker pattern is non-negotiable."
· · ·
▸ Seven sections · the patterns hold across domains · go well.