System Design — Data Platforms.
The whiteboard round when they say "design a data platform for X". Lakehouse vs warehouse, batch vs streaming, dbt + Airflow patterns, schema registry, Kappa. Six sections, every diagram drawable in 30 seconds, every trade-off articulable in two sentences.
Contents
- The 5-minute opening — frame the system in one diagram
- Lakehouse vs warehouse vs lake — pick the right primitive
- Batch vs streaming vs Lambda vs Kappa — the architecture quadrant
- dbt + Airflow patterns — orchestration that survives audit
- Schema registry & contract-first pipelines
- The 90-second articulation script — works on any prompt
The 5-minute opening — frame the system in one diagram.
Most candidates fail the data-platform whiteboard in the first 90 seconds — they jump to "use Spark" before establishing the shape of the system. The opening isn't tools; it's the data flow with grain hints. Draw this, then layer in technology choices.
The canonical 5-box diagram
Five rectangles. Five steady-state arrows. Three control-plane services underneath (registry, catalog, lineage). That's the entire opening — repeated for any data-platform prompt: ride-share, ads, payments, e-commerce. The contents of each box change; the shape doesn't.
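One way to draw it (stage names reused in the closing script):

[ sources ] → [ ingest ] → [ storage ] → [ transform ] → [ serve ] → consumers
└────────── control plane: schema registry · catalog · lineage ──────────┘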
The five questions to ask BEFORE drawing
| Question | Why it matters |
|---|---|
| What's the latency SLA — minutes, hours, daily? | Picks streaming vs batch in box 4 |
| What's the data volume / event rate? | Drives storage tier (warehouse vs lakehouse) and ingest tooling |
| Who consumes the data and how? | Determines the serve box — BI / API / reverse-ETL / ML |
| What's the regulatory environment? | GDPR/HIPAA/SOX shifts toward immutable lakehouse + lineage |
| What's the team size + maturity? | 3-engineer team can't run Flink; 100-engineer team can |
The grain-first articulation
After the diagram, say one sentence per box that pins the grain — the level of detail at each stage:
- Sources — "Each source emits events at its own grain. Mobile SDK is per-action; OLTP CDC is per-row-change."
- Ingest — "Kafka topics are partitioned by entity_id; one event = one Kafka message."
- Storage — "Raw zone keeps every event forever; curated zone is per-entity-per-hour rollups."
- Transform — "Star schema in the warehouse — dim_user, fact_event at user-event grain."
- Serve — "BI queries hit per-user-per-day aggregates; ML serving hits per-user-realtime feature store."
Junior engineers say "Spark transforms data". Senior engineers say "the transform layer aggregates from per-event to per-user-per-day grain, with both grains co-existing for different consumer needs."
Lakehouse vs warehouse vs lake — pick the right primitive.
The 2024 reality: most companies are mid-migration from warehouse-only or lake-only to lakehouse. Knowing why each exists and where each wins is the difference between a recommendation and a religious war.
The four flavors
| Flavor | Examples | Strength | Weakness |
|---|---|---|---|
| Warehouse | Snowflake, BigQuery, Redshift | Fastest SQL, strongest governance, easy for analysts | Compute coupled to vendor; expensive at PB scale; weak for ML / images |
| Lake | S3, GCS + Spark / Presto | Cheapest storage; supports any file type; full vendor neutrality | Slow on small queries; no transactions; "data swamp" risk |
| Lakehouse | Databricks Delta, Iceberg + any engine, Snowflake Unistore | Lake's economics + warehouse's transactions; vendor-portable file format | Younger tooling; more moving parts; team must understand both worlds |
| Streaming store | Pinot, Druid, ClickHouse | Sub-second analytics on event data | Limited SQL surface; one-shape-of-question per cluster |
Lakehouse — what makes it different
The lakehouse isn't a product — it's a file format with transactional metadata layered on top of cheap object storage. Three open-source contenders: Apache Iceberg, Delta Lake, Apache Hudi. All three solve the same problems (a time-travel sketch follows the table):
| Lake limitation | How lakehouse fixes it |
|---|---|
| No ACID — concurrent writes corrupt | Transaction log records every commit; readers see consistent snapshot |
| No schema enforcement — junk data accumulates | Schema validation at write; evolution rules tracked |
| No DELETE / UPDATE — only file overwrite | Row-level operations via merge-on-read or copy-on-write |
| No time-travel — can't see "yesterday's view" | Snapshot history; query as of any prior commit |
| Slow file listing at scale | Manifest files index the data files; planning is O(metadata) not O(data) |
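Time travel is the easiest of these to demo live. A minimal PySpark sketch — assuming a SparkSession already wired to an Iceberg catalog named lake and a hypothetical table lake.db.events — that lists snapshots and queries prior ones:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Every commit in the transaction log is a queryable snapshot
spark.sql("SELECT snapshot_id, committed_at FROM lake.db.events.snapshots").show()

# "Yesterday's view" — query the table as of a prior point in time
spark.sql(
    "SELECT count(*) FROM lake.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()

# Or pin to an exact snapshot id taken from the history above
spark.sql("SELECT count(*) FROM lake.db.events VERSION AS OF 1234567890").show()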
Iceberg vs Delta vs Hudi — when to pick which
| | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Sponsor | Netflix, Apache | Databricks, now Linux Foundation | Uber, Apache |
| Engine support | Broadest — Trino, Spark, Flink, Snowflake, BigQuery | Spark-first; Flink + Trino added | Spark-first; Flink + Presto added |
| Update strategy | Copy-on-write or merge-on-read | Copy-on-write (default), MoR added | Native MoR — best for streaming upserts |
| Pick when | Multi-engine reads required, vendor-neutral future | All-in on Databricks; deepest tooling integration | Heavy CDC / streaming-upsert workload |
The 2024 momentum: Iceberg is winning the multi-engine race; Snowflake, BigQuery, and Databricks all support it. Delta is sticky if you're on Databricks. Hudi remains the choice for high-volume streaming upserts.
When NOT to use a lakehouse
Skip it when a simpler primitive wins outright:
- Small team, analyst-only consumers, sub-TB data — a plain warehouse is faster to run and easier to govern
- Sub-second interactive analytics on events — a streaming store (Pinot / Druid / ClickHouse) beats lakehouse query latency
- No ML or unstructured-data workloads and no PB-scale cost pressure — the lakehouse's extra moving parts buy you nothing yet
The medallion architecture (bronze / silver / gold)
Most lakehouse implementations follow a three-tier "medallion" layering:
| Tier | Contents | SLA | Schema enforcement |
|---|---|---|---|
| Bronze | Raw ingested events, exactly as received | Best-effort, no SLA | None — schema-on-read |
| Silver | Cleaned, deduplicated, conformed dims | Hours | Strict — schema registry enforced |
| Gold | Business-ready aggregates / cubes | Daily or near-real-time | Strict — versioned + dbt-tested |
Drop the term in the interview: "raw events land in bronze, dbt cleans them into silver with conformed dims, gold is the per-team aggregation marts."
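What the bronze→silver hop looks like in code — a hypothetical PySpark sketch, assuming Iceberg tables lake.bronze.events and lake.silver.events already exist:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

silver = (
    spark.read.table("lake.bronze.events")
    .dropDuplicates(["event_id"])                     # dedupe at the event grain
    .filter(F.col("event_at").isNotNull())            # basic conformance rule
    .withColumn("event_date", F.to_date("event_at"))  # partition-friendly column
)

# Append into silver via DataFrameWriterV2 (Spark 3+)
silver.writeTo("lake.silver.events").append()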
Batch vs streaming vs Lambda vs Kappa — the quadrant.
The interview's hardest moment: "how would you architect this — batch or streaming?". The honest answer is both, with a clear seam. Lambda and Kappa are the two ways to draw the seam.
The four cases
| Case | Example | Latency tolerance |
|---|---|---|
| Pure batch | Monthly close, regulatory reports | Hours-to-days |
| Pure streaming | Fraud detection, live dashboards | Seconds |
| Lambda (both paths) | Recommendations: stream for real-time, batch for daily refresh | Mixed |
| Kappa (single stream) | Modern event-sourced platform — stream is the source of truth, batch is replay | Mixed |
Lambda — the dual-path architecture
| Lambda pros | Lambda cons |
|---|---|
| Real-time + accurate; never blocked on accuracy | Two codebases for the same logic — drift inevitable |
| Batch can fix streaming approximation errors | Reconciliation step is its own problem |
| Migrate-incrementally pattern (start batch, add stream) | Operational burden — two systems, two on-calls |
Kappa — the single-stream architecture
Kappa eliminates the dual codebase. The stream is the source of truth; batch is just the same job replayed against historical Kafka offsets. Three preconditions to make Kappa work (replay sketch after this list):
- Durable, long-retention event log — Kafka with infinite retention OR tiered storage (Confluent Cloud, Pulsar)
- Idempotent / deterministic stream jobs — replay must produce identical output
- Replay-friendly state backend — RocksDB checkpoints in Flink; Kafka Streams state stores
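What "replay against historical offsets" looks like in practice — a minimal sketch assuming the confluent-kafka Python client, a 3-partition events topic, and a hypothetical process() handler:

import time
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumption: broker address
    "group.id": "replay-2024q1",            # fresh group — no committed offsets to fight
    "enable.auto.commit": False,
})

# Translate "30 days ago" into a per-partition starting offset
start_ms = int((time.time() - 30 * 86400) * 1000)
lookup = [TopicPartition("events", p, start_ms) for p in range(3)]
offsets = consumer.offsets_for_times(lookup, timeout=10.0)

consumer.assign(offsets)  # begin consuming from the historical offsets
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        raise RuntimeError(msg.error())
    process(msg)  # hypothetical handler — must be deterministic for Kappa to hold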
When to pick which
| Pick Lambda when | Pick Kappa when |
|---|---|
| Migrating from a legacy batch warehouse | Greenfield platform with strong streaming culture |
| Accuracy SLA is regulator-grade (banking close) | Eventual consistency is acceptable |
| Team can't yet replay confidently | Team can replay 30+ days of events deterministically |
| Heterogeneous source latencies (mix of CDC + nightly SFTP) | All sources can be modeled as event streams |
The 2024+ default — Kappa-leaning hybrid
Most modern platforms aren't pure either. The shape that's winning:
- Kafka as the spine (event-source-of-truth)
- Flink jobs writing to Iceberg in near-real-time (sub-minute)
- "Batch" jobs are dbt incremental models that read the same Iceberg tables
- One transform logic; two query paths (Flink for low-latency views, dbt for analyst SQL)
dbt + Airflow patterns — orchestration that survives audit.
Airflow runs the schedule. dbt builds the SQL. Together they're the most common transformation stack in 2024. The interview question is rarely "what tool" — it's "how do they integrate so a failed step doesn't corrupt the warehouse".
The division of labor
| Concern | Owner |
|---|---|
| Schedule (cron / sensor / event) | Airflow |
| Inter-DAG dependencies | Airflow (TriggerDagRunOperator, ExternalTaskSensor) |
| Within-DAG transformation logic | dbt (models, tests, docs) |
| Idempotency (same input → same output) | dbt (materializations + incremental keys) |
| Data quality assertions | dbt (schema.yml tests) + Great Expectations / Soda for richer checks |
| Lineage | dbt's auto-generated DAG; OpenLineage for cross-tool |
| Backfill | Airflow's date-range backfill + dbt's --vars '{"target_date": ...}' |
The canonical dbt model lifecycle
One dbt model goes through this lifecycle on every run; a minimal Airflow wiring follows the list:
1. Source check ─▶ dbt source freshness (is upstream data fresh?)
2. Build ─▶ dbt run --models +my_model (run model + upstream deps)
3. Test ─▶ dbt test --models my_model (assertions: not_null, unique, accepted_values, custom)
4. Docs / lineage ─▶ dbt docs generate (refresh DAG + descriptions)
5. Notify ─▶ Slack / PagerDuty on failure (Airflow on_failure_callback)
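A sketch of that wiring — assuming Airflow 2.4+ scheduling style, dbt invoked via the CLI on the worker, and a hypothetical notify_pagerduty callback:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_pagerduty(context):
    """Hypothetical failure hook — post to PagerDuty / Slack here."""

with DAG(
    dag_id="dbt_daily",
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",
    catchup=False,
    default_args={"retries": 2, "on_failure_callback": notify_pagerduty},
) as dag:
    freshness = BashOperator(task_id="source_freshness", bash_command="dbt source freshness")
    build = BashOperator(task_id="build", bash_command="dbt run --models +my_model")
    test = BashOperator(task_id="test", bash_command="dbt test --models my_model")
    docs = BashOperator(task_id="docs", bash_command="dbt docs generate")
    freshness >> build >> test >> docs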
Materializations — pick the right one
| Materialization | What it does | Use when |
|---|---|---|
| view | Just CREATE VIEW; no data stored | Cheap transforms over small upstream |
| table | Full rebuild every run | Small dim tables; logic changes frequently |
| incremental | Insert new rows; merge on key | Large fact tables; only recent data changes |
| snapshot | SCD2 — track history of changes | Slowly-changing dims you must audit |
| ephemeral | Inlined as a CTE; never materialized | Reusable transformation logic, no separate object |
The incremental pattern that survives backfill
{{ config(
    materialized='incremental',
    unique_key='event_id',
    on_schema_change='append_new_columns',
    incremental_strategy='merge'
) }}
SELECT event_id, user_id, event_type, event_at
FROM {{ source('raw', 'events') }}
{% if is_incremental() %}
WHERE event_at > (SELECT MAX(event_at) FROM {{ this }})
{% endif %}
Two failure modes to anticipate:
- Late-arriving rows — events with event_at earlier than the current MAX get skipped. Fix: add a lookback window (event_at > MAX(event_at) - INTERVAL '24 hours') and let MERGE deduplicate.
- Reprocessing — the --full-refresh flag rebuilds from scratch. Required when business logic changes; protected by branch-CI gating.
Airflow patterns — the four every senior knows
| Pattern | When |
|---|---|
| Sensor + downstream DAG (ExternalTaskSensor) | Wait for upstream team's DAG to finish before starting |
| Dynamic task mapping (.expand) | One task per partition / customer / date — fan-out at runtime (sketch after this table) |
| TaskGroup | Visual grouping of related tasks; reusable patterns across DAGs |
| Branching (BranchPythonOperator) | Different paths based on data-driven decision (only run gold-layer if silver passed quality) |
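Dynamic task mapping is the one pattern worth sketching concretely — assuming the TaskFlow API (Airflow 2.3+) and a hypothetical partition lister:

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def per_partition_load():
    @task
    def list_partitions() -> list[str]:
        return ["2024-01-01", "2024-01-02", "2024-01-03"]  # assumption: discovered at runtime

    @task
    def load(partition: str) -> None:
        print(f"loading partition {partition}")

    # .expand creates one mapped task instance per list element, at runtime
    load.expand(partition=list_partitions())

per_partition_load()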
The "audit-safe" pattern
Compliance-grade pipelines need to prove what ran, when, with what version, and what the input was. The pattern:
- Versioned dbt manifest — every run captures the dbt project version + git SHA
- Run logs in a database — Airflow's metadata DB plus dbt's run_results.json uploaded to S3 per run (upload sketch after this list)
- Lineage snapshot — OpenLineage events per task to a central catalog
- Reproducibility test — quarterly job that rebuilds last quarter's gold table from raw + git SHA, asserts byte-equality
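The artifact upload is a few lines — a sketch assuming boto3 and a hypothetical dbt-audit-artifacts bucket:

import subprocess
import boto3

s3 = boto3.client("s3")

# Key every run's artifacts by the git SHA that produced them
sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

for artifact in ("manifest.json", "run_results.json"):
    s3.upload_file(f"target/{artifact}", "dbt-audit-artifacts", f"dbt-runs/{sha}/{artifact}")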
The interview articulation: "Gold-layer builds are dbt run --models +tag:gold with retries=2 and an on_failure_callback posting to PagerDuty. Backfill is two flags away — Airflow's --start-date + dbt's --vars. And the dbt manifest goes to S3 every run for audit replay."
Schema registry & contract-first pipelines.
Three things break a data platform: schema changes, schema changes, and schema changes. A field renamed upstream silently breaks 40 downstream dashboards two weeks later. Schema registry is the contract layer that prevents this.
What a schema registry actually is
A schema registry stores the structure (Avro / Protobuf / JSON Schema) of every Kafka topic. Producers register their schema; consumers fetch it; the registry enforces compatibility rules on every change.
| Component | Examples |
|---|---|
| Schema registry | Confluent Schema Registry, Apicurio, AWS Glue Schema Registry |
| Schema format | Avro (most common, compact binary), Protobuf (standard in gRPC shops), JSON Schema (least efficient) |
| Compatibility check | Backward / Forward / Full / None — enforced at publish-time |
The four compatibility modes
| Mode | Producer change OK if | Consumer impact |
|---|---|---|
| Backward | New consumer can read OLD data with new schema | Safe to upgrade consumer first |
| Forward | OLD consumer can read NEW data with old schema | Safe to upgrade producer first |
| Full | Both directions | Anyone can upgrade in any order |
| None | Anything goes | Production fire |
Default to backward for fact streams (consumers are usually dashboards / analytics that lag producers). Default to full for multi-team event APIs.
Allowed schema changes (Avro example)
| Change | Backward | Forward | Full |
|---|---|---|---|
| Add field with default | ✓ | ✓ | ✓ |
| Remove field (without default) | ✓ | ✗ | ✗ |
| Add field without default | ✗ | ✓ | ✗ |
| Rename field | ✗ | ✗ | ✗ |
| Change field type | ✗ | ✗ | ✗ |
Renames and type changes are never compatible. The senior pattern: "add the new field, dual-write for a deprecation window, switch consumers, drop the old field."
Contract-first development — the workflow
- Producer team writes Avro schema in a shared repo (typically a "schemas" monorepo)
- CI validates compatibility against the registered schema — the PR fails on a breaking change (see the CI sketch after this list)
- Schema published on merge to schema registry; producer code generated from it
- Consumer teams discover schemas via registry browser; generate their own client code
- Runtime enforcement — serializers validate against the registry on produce; broker-side schema validation exists only as a Confluent Server feature, not in vanilla Kafka
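The CI gate in step 2, concretely — a sketch against Confluent Schema Registry's REST compatibility endpoint, assuming the requests library and a hypothetical subject user-events-value:

import sys
import requests

REGISTRY = "http://localhost:8081"  # assumption: registry URL
SUBJECT = "user-events-value"       # assumption: subject for the topic

with open("schemas/user.avsc") as f:
    candidate = f.read()

# Test the candidate schema against the latest registered version
resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": candidate},
)
resp.raise_for_status()
if not resp.json().get("is_compatible", False):
    sys.exit("Incompatible schema change — failing the PR")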
The tombstone / deprecation pattern
// Schema v1
{ "type": "record", "name": "User", "fields": [
{ "name": "user_id", "type": "long" },
{ "name": "name", "type": "string" }
]}
// Schema v2 — add email + split name fields, deprecate name
{ "type": "record", "name": "User", "fields": [
{ "name": "user_id", "type": "long" },
{ "name": "name", "type": "string", "deprecated": true },
{ "name": "email", "type": ["null", "string"], "default": null },
{ "name": "first_name", "type": ["null", "string"], "default": null },
{ "name": "last_name", "type": ["null", "string"], "default": null }
]}
// 90 days later, schema v3 — drop name
{ "type": "record", "name": "User", "fields": [
{ "name": "user_id", "type": "long" },
{ "name": "first_name", "type": ["null", "string"], "default": null },
{ "name": "last_name", "type": ["null", "string"], "default": null }
]}
Schema in the warehouse — dbt contracts
dbt 1.5+ added contract enforcement at the model level — runtime validation that the output table matches a declared schema:
models:
  - name: dim_user
    config:
      contract:
        enforced: true
    columns:
      - name: user_id
        data_type: bigint
        constraints: [{ type: not_null }, { type: primary_key }]
      - name: email
        data_type: varchar
        constraints: [{ type: not_null }]
The model fails to build if the output schema drifts. Combined with downstream tests, this is the warehouse-side equivalent of schema registry.
The 90-second articulation script — works on any prompt.
You've drawn the diagram, named the technologies, called out the trade-offs. The closing 90 seconds bring it together as one coherent design. Memorize this template; substitute the specific tools and grain.
▸ THE 90-SECOND SCRIPT
"For a platform with [latency requirement] and [volume hint], I'd shape this as five stages: sources → ingest → storage → transform → serve, with schema registry, metadata catalog, and lineage as cross-cutting control planes."
"On storage, I'd default to a lakehouse pattern — Iceberg files on S3 — with a medallion layering: bronze for raw, silver for cleaned with conformed dims, gold for the per-team aggregation marts. That gives me lake economics with warehouse transactional guarantees, and Iceberg's broad engine support means I'm not locked into one query engine."
"On the batch-vs-streaming question, I'd go Kappa-leaning: Kafka as the event spine, Flink jobs writing into Iceberg in near-real-time, dbt models on top reading the same tables. One transform logic, two read paths — Flink for sub-minute views, dbt for analyst SQL. If a regulatory fact like revenue needs reconciliation, I'd add a true Lambda batch path just for that fact, not the whole platform."
"Orchestration is Airflow over dbt — Airflow owns scheduling and cross-DAG dependencies, dbt owns within-DAG SQL transformations and data-quality tests. Materializations are mostly incremental with merge strategy for late-arriving data. Audit replay works because every dbt run uploads its manifest + run_results to S3 keyed by git SHA."
"Schema registry with backward compatibility is non-negotiable — every Kafka topic registered, schema PRs CI-validated, renames are tombstone-then-add over 90 days. Warehouse-side, dbt contracts enforce output schema at build time. The contract layer is what keeps the platform from rotting."
"The three risks I'd flag and mitigate first: schema drift (registry + contracts), late-arriving data (incremental lookback windows), and skew at the heaviest entity (salted aggregation in Spark / Flink). The platform without these three is a ticking clock."
Three sentences that signal seniority — in any round
- "I'd start with the grain question — what's the smallest unit of fact, and what's the consumer grain?"
- "The contract layer (schema registry + dbt contracts) does more for reliability than the orchestration choice."
- "Lambda is what you choose when the team can't yet replay confidently; Kappa is the default when they can."