System Design — Data Platforms.
The whiteboard round when they say "design a data platform for X". Lakehouse vs warehouse, batch vs streaming, dbt + Airflow patterns, schema registry, Kappa. Six sections, every diagram drawable in 30 seconds, every trade-off articulable in two sentences.
Contents
- The 5-minute opening — frame the system in one diagram
- Lakehouse vs warehouse vs lake — pick the right primitive
- Batch vs streaming vs Lambda vs Kappa — the architecture quadrant
- dbt + Airflow patterns — orchestration that survives audit
- Schema registry & contract-first pipelines
- The 90-second articulation script — works on any prompt
The 5-minute opening — frame the system in one diagram.
Most candidates fail the data-platform whiteboard in the first 90 seconds — they jump to "use Spark" before establishing the shape of the system. The opening isn't tools; it's the data flow with grain hints. Draw this, then layer in technology choices.
The canonical 5-box diagram
Five rectangles. Five steady-state arrows. Three control-plane services underneath (registry, catalog, lineage). That's the entire opening — repeated for any data-platform prompt: ride-share, ads, payments, e-commerce. The contents of each box change; the shape doesn't.
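One way to draw it (stage names reused in the closing script):

[ sources ] → [ ingest ] → [ storage ] → [ transform ] → [ serve ] → consumers
└────────── control plane: schema registry · catalog · lineage ──────────┘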
The five questions to ask BEFORE drawing
| Question | Why it matters |
|---|---|
| What's the latency SLA — minutes, hours, daily? | Picks streaming vs batch in box 4 |
| What's the data volume / event rate? | Drives storage tier (warehouse vs lakehouse) and ingest tooling |
| Who consumes the data and how? | Determines the serve box — BI / API / reverse-ETL / ML |
| What's the regulatory environment? | GDPR/HIPAA/SOX shifts toward immutable lakehouse + lineage |
| What's the team size + maturity? | 3-engineer team can't run Flink; 100-engineer team can |
The grain-first articulation
After the diagram, say one sentence per box that pins the grain — the level of detail at each stage:
- Sources — "Each source emits events at its own grain. Mobile SDK is per-action; OLTP CDC is per-row-change."
- Ingest — "Kafka topics are partitioned by entity_id; one event = one Kafka message."
- Storage — "Raw zone keeps every event forever; curated zone is per-entity-per-hour rollups."
- Transform — "Star schema in the warehouse — dim_user, fact_event at user-event grain."
- Serve — "BI queries hit per-user-per-day aggregates; ML serving hits per-user-realtime feature store."
Junior engineers say "Spark transforms data". Senior engineers say "the transform layer aggregates from per-event to per-user-per-day grain, with both grains co-existing for different consumer needs."
Lakehouse vs warehouse vs lake — pick the right primitive.
The 2024 reality: most companies are mid-migration from warehouse-only or lake-only to lakehouse. Knowing why each exists and where each wins is the difference between a recommendation and a religious war.
The four flavors
| Flavor | Examples | Strength | Weakness |
|---|---|---|---|
| Warehouse | Snowflake, BigQuery, Redshift | Fastest SQL, strongest governance, easy for analysts | Compute coupled to vendor; expensive at PB scale; weak for ML / images |
| Lake | S3, GCS + Spark / Presto | Cheapest storage; supports any file type; full vendor neutrality | Slow on small queries; no transactions; "data swamp" risk |
| Lakehouse | Databricks Delta, Iceberg + any engine, Snowflake Unistore | Lake's economics + warehouse's transactions; vendor-portable file format | Younger tooling; more moving parts; team must understand both worlds |
| Streaming store | Pinot, Druid, ClickHouse | Sub-second analytics on event data | Limited SQL surface; one-shape-of-question per cluster |
Lakehouse — what makes it different
The lakehouse isn't a product — it's a file format with transactional metadata layered on top of cheap object storage. Three open-source contenders: Apache Iceberg, Delta Lake, Apache Hudi. All three solve the same problems (a time-travel sketch follows the table):
| Lake limitation | How lakehouse fixes it |
|---|---|
| No ACID — concurrent writes corrupt | Transaction log records every commit; readers see consistent snapshot |
| No schema enforcement — junk data accumulates | Schema validation at write; evolution rules tracked |
| No DELETE / UPDATE — only file overwrite | Row-level operations via merge-on-read or copy-on-write |
| No time-travel — can't see "yesterday's view" | Snapshot history; query as of any prior commit |
| Slow file listing at scale | Manifest files index the data files; planning is O(metadata) not O(data) |
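Time travel is the easiest of these to demo live. A minimal PySpark sketch — assuming a SparkSession already wired to an Iceberg catalog named lake and a hypothetical table lake.db.events — that lists snapshots and queries prior ones:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Every commit in the transaction log is a queryable snapshot
spark.sql("SELECT snapshot_id, committed_at FROM lake.db.events.snapshots").show()

# "Yesterday's view" — query the table as of a prior point in time
spark.sql(
    "SELECT count(*) FROM lake.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()

# Or pin to an exact snapshot id taken from the history above
spark.sql("SELECT count(*) FROM lake.db.events VERSION AS OF 1234567890").show()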
Iceberg vs Delta vs Hudi — when to pick which
| | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Sponsor | Netflix, Apache | Databricks, now Linux Foundation | Uber, Apache |
| Engine support | Broadest — Trino, Spark, Flink, Snowflake, BigQuery | Spark-first; Flink + Trino added | Spark-first; Flink + Presto added |
| Update strategy | Copy-on-write or merge-on-read | Copy-on-write (default), MoR added | Native MoR — best for streaming upserts |
| Pick when | Multi-engine reads required, vendor-neutral future | All-in on Databricks; deepest tooling integration | Heavy CDC / streaming-upsert workload |
The 2024 momentum: Iceberg is winning the multi-engine race; Snowflake, BigQuery, and Databricks all support it. Delta is sticky if you're on Databricks. Hudi remains the choice for high-volume streaming upserts.
When NOT to use a lakehouse
Skip it when a simpler primitive wins outright:
- Small team, analyst-only consumers, sub-TB data — a plain warehouse is faster to run and easier to govern
- Sub-second interactive analytics on events — a streaming store (Pinot / Druid / ClickHouse) beats lakehouse query latency
- No ML or unstructured-data workloads and no PB-scale cost pressure — the lakehouse's extra moving parts buy you nothing yet
The medallion architecture (bronze / silver / gold)
Most lakehouse implementations follow a three-tier "medallion" layering:
| Tier | Contents | SLA | Schema enforcement |
|---|---|---|---|
| Bronze | Raw ingested events, exactly as received | Best-effort, no SLA | None — schema-on-read |
| Silver | Cleaned, deduplicated, conformed dims | Hours | Strict — schema registry enforced |
| Gold | Business-ready aggregates / cubes | Daily or near-real-time | Strict — versioned + dbt-tested |
Drop the term in the interview: "raw events land in bronze, dbt cleans them into silver with conformed dims, gold is the per-team aggregation marts."
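What the bronze→silver hop looks like in code — a hypothetical PySpark sketch, assuming Iceberg tables lake.bronze.events and lake.silver.events already exist:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

silver = (
    spark.read.table("lake.bronze.events")
    .dropDuplicates(["event_id"])                     # dedupe at the event grain
    .filter(F.col("event_at").isNotNull())            # basic conformance rule
    .withColumn("event_date", F.to_date("event_at"))  # partition-friendly column
)

# Append into silver via DataFrameWriterV2 (Spark 3+)
silver.writeTo("lake.silver.events").append()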
Batch vs streaming vs Lambda vs Kappa — the quadrant.
The interview's hardest moment: "how would you architect this — batch or streaming?". The honest answer is both, with a clear seam. Lambda and Kappa are the two ways to draw the seam.
The four cases
| Case | Example | Latency tolerance |
|---|---|---|
| Pure batch | Monthly close, regulatory reports | Hours-to-days |
| Pure streaming | Fraud detection, live dashboards | Seconds |
| Lambda (both paths) | Recommendations: stream for real-time, batch for daily refresh | Mixed |
| Kappa (single stream) | Modern event-sourced platform — stream is the source of truth, batch is replay | Mixed |
Lambda — the dual-path architecture
| Lambda pros | Lambda cons |
|---|---|
| Real-time + accurate; never blocked on accuracy | Two codebases for the same logic — drift inevitable |
| Batch can fix streaming approximation errors | Reconciliation step is its own problem |
| Migrate-incrementally pattern (start batch, add stream) | Operational burden — two systems, two on-calls |
Kappa — the single-stream architecture
Kappa eliminates the dual codebase. The stream is the source of truth; batch is just the same job replayed against historical Kafka offsets. Three preconditions to make Kappa work (replay sketch after this list):
- Durable, long-retention event log — Kafka with infinite retention OR tiered storage (Confluent Cloud, Pulsar)
- Idempotent / deterministic stream jobs — replay must produce identical output
- Replay-friendly state backend — RocksDB checkpoints in Flink; Kafka Streams state stores
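What "replay against historical offsets" looks like in practice — a minimal sketch assuming the confluent-kafka Python client, a 3-partition events topic, and a hypothetical process() handler:

import time
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumption: broker address
    "group.id": "replay-2024q1",            # fresh group — no committed offsets to fight
    "enable.auto.commit": False,
})

# Translate "30 days ago" into a per-partition starting offset
start_ms = int((time.time() - 30 * 86400) * 1000)
lookup = [TopicPartition("events", p, start_ms) for p in range(3)]
offsets = consumer.offsets_for_times(lookup, timeout=10.0)

consumer.assign(offsets)  # begin consuming from the historical offsets
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        raise RuntimeError(msg.error())
    process(msg)  # hypothetical handler — must be deterministic for Kappa to hold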
When to pick which
| Pick Lambda when | Pick Kappa when |
|---|---|
| Migrating from a legacy batch warehouse | Greenfield platform with strong streaming culture |
| Accuracy SLA is regulator-grade (banking close) | Eventual consistency is acceptable |
| Team can't yet replay confidently | Team can replay 30+ days of events deterministically |
| Heterogeneous source latencies (mix of CDC + nightly SFTP) | All sources can be modeled as event streams |
The 2024+ default — Kappa-leaning hybrid
Most modern platforms aren't pure either. The shape that's winning:
- Kafka as the spine (event-source-of-truth)
- Flink jobs writing to Iceberg in near-real-time (sub-minute)
- "Batch" jobs are dbt incremental models that read the same Iceberg tables
- One transform logic; two query paths (Flink for low-latency views, dbt for analyst SQL)
dbt + Airflow patterns — orchestration that survives audit.
Airflow runs the schedule. dbt builds the SQL. Together they're the most common transformation stack in 2024. The interview question is rarely "what tool" — it's "how do they integrate so a failed step doesn't corrupt the warehouse".
The division of labor
| Concern | Owner |
|---|---|
| Schedule (cron / sensor / event) | Airflow |
| Inter-DAG dependencies | Airflow (TriggerDagRunOperator, ExternalTaskSensor) |
| Within-DAG transformation logic | dbt (models, tests, docs) |
| Idempotency (same input → same output) | dbt (materializations + incremental keys) |
| Data quality assertions | dbt (schema.yml tests) + Great Expectations / Soda for richer checks |
| Lineage | dbt's auto-generated DAG; OpenLineage for cross-tool |
| Backfill | Airflow's date-range backfill + dbt's --vars '{"target_date": ...}' |
The canonical dbt model lifecycle
One dbt model goes through this lifecycle on every run; a minimal Airflow wiring follows the list:
1. Source check ─▶ dbt source freshness (is upstream data fresh?)
2. Build ─▶ dbt run --models +my_model (run model + upstream deps)
3. Test ─▶ dbt test --models my_model (assertions: not_null, unique, accepted_values, custom)
4. Docs / lineage ─▶ dbt docs generate (refresh DAG + descriptions)
5. Notify ─▶ Slack / PagerDuty on failure (Airflow on_failure_callback)
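A sketch of that wiring — assuming Airflow 2.4+ scheduling style, dbt invoked via the CLI on the worker, and a hypothetical notify_pagerduty callback:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_pagerduty(context):
    """Hypothetical failure hook — post to PagerDuty / Slack here."""

with DAG(
    dag_id="dbt_daily",
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",
    catchup=False,
    default_args={"retries": 2, "on_failure_callback": notify_pagerduty},
) as dag:
    freshness = BashOperator(task_id="source_freshness", bash_command="dbt source freshness")
    build = BashOperator(task_id="build", bash_command="dbt run --models +my_model")
    test = BashOperator(task_id="test", bash_command="dbt test --models my_model")
    docs = BashOperator(task_id="docs", bash_command="dbt docs generate")
    freshness >> build >> test >> docs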
Materializations — pick the right one
| Materialization | What it does | Use when |
|---|---|---|
| view | Just CREATE VIEW; no data stored | Cheap transforms over small upstream |
| table | Full rebuild every run | Small dim tables; logic changes frequently |
| incremental | Insert new rows; merge on key | Large fact tables; only recent data changes |
| snapshot | SCD2 — track history of changes | Slowly-changing dims you must audit |
| ephemeral | Inlined as a CTE; never materialized | Reusable transformation logic, no separate object |
The incremental pattern that survives backfill
{{ config(
    materialized='incremental',
    unique_key='event_id',
    on_schema_change='append_new_columns',
    incremental_strategy='merge'
) }}
SELECT event_id, user_id, event_type, event_at
FROM {{ source('raw', 'events') }}
{% if is_incremental() %}
WHERE event_at > (SELECT MAX(event_at) FROM {{ this }})
{% endif %}
Two failure modes to anticipate:
- Late-arriving rows — events with event_at earlier than the current MAX get skipped. Fix: add a lookback window (event_at > MAX(event_at) - INTERVAL '24 hours') and let MERGE deduplicate.
- Reprocessing — the --full-refresh flag rebuilds from scratch. Required when business logic changes; protected by branch-CI gating.
Airflow patterns — the four every senior knows
| Pattern | When |
|---|---|
| Sensor + downstream DAG (ExternalTaskSensor) | Wait for upstream team's DAG to finish before starting |
| Dynamic task mapping (.expand) | One task per partition / customer / date — fan-out at runtime (sketch after this table) |
| TaskGroup | Visual grouping of related tasks; reusable patterns across DAGs |
| Branching (BranchPythonOperator) | Different paths based on data-driven decision (only run gold-layer if silver passed quality) |
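Dynamic task mapping is the one pattern worth sketching concretely — assuming the TaskFlow API (Airflow 2.3+) and a hypothetical partition lister:

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def per_partition_load():
    @task
    def list_partitions() -> list[str]:
        return ["2024-01-01", "2024-01-02", "2024-01-03"]  # assumption: discovered at runtime

    @task
    def load(partition: str) -> None:
        print(f"loading partition {partition}")

    # .expand creates one mapped task instance per list element, at runtime
    load.expand(partition=list_partitions())

per_partition_load()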
The "audit-safe" pattern
Compliance-grade pipelines need to prove what ran, when, with what version, and what the input was. The pattern:
- Versioned dbt manifest — every run captures the dbt project version + git SHA
- Run logs in a database — Airflow's metadata DB plus dbt's run_results.json uploaded to S3 per run (upload sketch after this list)
- Lineage snapshot — OpenLineage events per task to a central catalog
- Reproducibility test — quarterly job that rebuilds last quarter's gold table from raw + git SHA, asserts byte-equality
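The artifact upload is a few lines — a sketch assuming boto3 and a hypothetical dbt-audit-artifacts bucket:

import subprocess
import boto3

s3 = boto3.client("s3")

# Key every run's artifacts by the git SHA that produced them
sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

for artifact in ("manifest.json", "run_results.json"):
    s3.upload_file(f"target/{artifact}", "dbt-audit-artifacts", f"dbt-runs/{sha}/{artifact}")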
The interview articulation: "Gold-layer builds are dbt run --models +tag:gold with retries=2 and an on_failure_callback posting to PagerDuty. Backfill is two flags away — Airflow's --start-date + dbt's --vars. And the dbt manifest goes to S3 every run for audit replay."
Schema registry & contract-first pipelines.
Three things break a data platform: schema changes, schema changes, and schema changes. A field renamed upstream silently breaks 40 downstream dashboards two weeks later. Schema registry is the contract layer that prevents this.
What a schema registry actually is
A schema registry stores the structure (Avro / Protobuf / JSON Schema) of every Kafka topic. Producers register their schema; consumers fetch it; the registry enforces compatibility rules on every change.
| Component | Examples |
|---|---|
| Schema registry | Confluent Schema Registry, Apicurio, AWS Glue Schema Registry |
| Schema format | Avro (most common, compact binary), Protobuf (standard in gRPC shops), JSON Schema (least efficient) |
| Compatibility check | Backward / Forward / Full / None — enforced at publish-time |
The four compatibility modes
| Mode | Producer change OK if | Consumer impact |
|---|---|---|
| Backward | New consumer can read OLD data with new schema | Safe to upgrade consumer first |
| Forward | OLD consumer can read NEW data with old schema | Safe to upgrade producer first |
| Full | Both directions | Anyone can upgrade in any order |
| None | Anything goes | Production fire |
Default to backward for fact streams (consumers are usually dashboards / analytics that lag producers). Default to full for multi-team event APIs.
Allowed schema changes (Avro example)
| Change | Backward | Forward | Full |
|---|---|---|---|
| Add field with default | ✓ | ✓ | ✓ |
| Remove field (without default) | ✓ | ✗ | ✗ |
| Add field without default | ✗ | ✓ | ✗ |
| Rename field | ✗ | ✗ | ✗ |
| Change field type | ✗ | ✗ | ✗ |
Renames and type changes are never compatible. The senior pattern: "add the new field, dual-write for a deprecation window, switch consumers, drop the old field."
Contract-first development — the workflow
- Producer team writes Avro schema in a shared repo (typically a "schemas" monorepo)
- CI validates compatibility against the registered schema — the PR fails on a breaking change (see the CI sketch after this list)
- Schema published on merge to schema registry; producer code generated from it
- Consumer teams discover schemas via registry browser; generate their own client code
- Runtime enforcement — serializers validate against the registry on produce; broker-side schema validation exists only as a Confluent Server feature, not in vanilla Kafka
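The CI gate in step 2, concretely — a sketch against Confluent Schema Registry's REST compatibility endpoint, assuming the requests library and a hypothetical subject user-events-value:

import sys
import requests

REGISTRY = "http://localhost:8081"  # assumption: registry URL
SUBJECT = "user-events-value"       # assumption: subject for the topic

with open("schemas/user.avsc") as f:
    candidate = f.read()

# Test the candidate schema against the latest registered version
resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": candidate},
)
resp.raise_for_status()
if not resp.json().get("is_compatible", False):
    sys.exit("Incompatible schema change — failing the PR")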
The tombstone / deprecation pattern
// Schema v1
{ "type": "record", "name": "User", "fields": [
{ "name": "user_id", "type": "long" },
{ "name": "name", "type": "string" }
]}
// Schema v2 — add email + split name fields, deprecate name
{ "type": "record", "name": "User", "fields": [
{ "name": "user_id", "type": "long" },
{ "name": "name", "type": "string", "deprecated": true },
{ "name": "email", "type": ["null", "string"], "default": null },
{ "name": "first_name", "type": ["null", "string"], "default": null },
{ "name": "last_name", "type": ["null", "string"], "default": null }
]}
// 90 days later, schema v3 — drop name
{ "type": "record", "name": "User", "fields": [
{ "name": "user_id", "type": "long" },
{ "name": "first_name", "type": ["null", "string"], "default": null },
{ "name": "last_name", "type": ["null", "string"], "default": null }
]}
Schema in the warehouse — dbt contracts
dbt 1.5+ added contract enforcement at the model level — runtime validation that the output table matches a declared schema:
models:
  - name: dim_user
    config:
      contract:
        enforced: true
    columns:
      - name: user_id
        data_type: bigint
        constraints: [{ type: not_null }, { type: primary_key }]
      - name: email
        data_type: varchar
        constraints: [{ type: not_null }]
The model fails to build if the output schema drifts. Combined with downstream tests, this is the warehouse-side equivalent of schema registry.
The 90-second articulation script — works on any prompt.
You've drawn the diagram, named the technologies, called out the trade-offs. The closing 90 seconds bring it together as one coherent design. Memorize this template; substitute the specific tools and grain.
▸ THE 90-SECOND SCRIPT
"For a platform with [latency requirement] and [volume hint], I'd shape this as five stages: sources → ingest → storage → transform → serve, with schema registry, metadata catalog, and lineage as cross-cutting control planes."
"On storage, I'd default to a lakehouse pattern — Iceberg files on S3 — with a medallion layering: bronze for raw, silver for cleaned with conformed dims, gold for the per-team aggregation marts. That gives me lake economics with warehouse transactional guarantees, and Iceberg's broad engine support means I'm not locked into one query engine."
"On the batch-vs-streaming question, I'd go Kappa-leaning: Kafka as the event spine, Flink jobs writing into Iceberg in near-real-time, dbt models on top reading the same tables. One transform logic, two read paths — Flink for sub-minute views, dbt for analyst SQL. If a regulatory fact like revenue needs reconciliation, I'd add a true Lambda batch path just for that fact, not the whole platform."
"Orchestration is Airflow over dbt — Airflow owns scheduling and cross-DAG dependencies, dbt owns within-DAG SQL transformations and data-quality tests. Materializations are mostly incremental with merge strategy for late-arriving data. Audit replay works because every dbt run uploads its manifest + run_results to S3 keyed by git SHA."
"Schema registry with backward compatibility is non-negotiable — every Kafka topic registered, schema PRs CI-validated, renames are tombstone-then-add over 90 days. Warehouse-side, dbt contracts enforce output schema at build time. The contract layer is what keeps the platform from rotting."
"The three risks I'd flag and mitigate first: schema drift (registry + contracts), late-arriving data (incremental lookback windows), and skew at the heaviest entity (salted aggregation in Spark / Flink). The platform without these three is a ticking clock."
Three sentences that signal seniority — in any round
- "I'd start with the grain question — what's the smallest unit of fact, and what's the consumer grain?"
- "The contract layer (schema registry + dbt contracts) does more for reliability than the orchestration choice."
- "Lambda is what you choose when the team can't yet replay confidently; Kappa is the default when they can."