
System Design — Data Platforms.

The whiteboard round when they say "design a data platform for X". Lakehouse vs warehouse, batch vs streaming, dbt + Airflow patterns, schema registry, Kappa. Six sections, every diagram drawable in 30 seconds, every trade-off articulable in two sentences.

§ 01 — The opening move

The 5-minute opening — frame the system in one diagram.

Most candidates fail the data-platform whiteboard in the first 90 seconds — they jump to "use Spark" before establishing the shape of the system. The opening isn't tools; it's the data flow with grain hints. Draw this, then layer in technology choices.

The canonical 5-box diagram

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   SOURCES    │───▶│    INGEST    │───▶│   STORAGE    │───▶│  TRANSFORM   │───▶│    SERVE     │
│              │    │              │    │              │    │              │    │              │
│ • OLTP DBs   │    │ • Kafka      │    │ • S3 / GCS   │    │ • dbt models │    │ • BI dash    │
│ • Mobile SDK │    │ • Kinesis    │    │ • Iceberg /  │    │ • Spark jobs │    │ • REST API   │
│ • Web events │    │ • CDC stream │    │   Delta Lake │    │ • Flink      │    │ • Reverse    │
│ • SaaS APIs  │    │ • Batch SFTP │    │ • Warehouse  │    │   stream jobs│    │   ETL → CRM  │
│              │    │              │    │   (Snowflake)│    │              │    │ • ML serving │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
                            ▲                   ▲                   ▲
                            │                   │                   │
                     schema registry     metadata catalog     lineage tracker
                      (Confluent /        (Datahub /          (OpenLineage /
                       Apicurio)           Atlas / Unity)       Marquez)

Five rectangles. Four steady-state arrows. Three control-plane services underneath (registry, catalog, lineage). That's the entire opening — repeated for any data-platform prompt: ride-share, ads, payments, e-commerce. The contents of each box change; the shape doesn't.

The five questions to ask BEFORE drawing

Question | Why it matters
What's the latency SLA — minutes, hours, daily? | Picks streaming vs batch in box 4
What's the data volume / event rate? | Drives storage tier (warehouse vs lakehouse) and ingest tooling
Who consumes the data and how? | Determines the serve box — BI / API / reverse-ETL / ML
What's the regulatory environment? | GDPR/HIPAA/SOX shifts toward immutable lakehouse + lineage
What's the team size + maturity? | A 3-engineer team can't run Flink; a 100-engineer team can
Senior signal. Don't ask all five — ask the two that obviously aren't pinned. "Before I dive in: are we serving sub-second analytics or daily reports? And is the team running Spark today or starting from scratch?" Two well-chosen questions show you're prioritizing.

The grain-first articulation

After the diagram, say one sentence per box that pins the grain — the level of detail at each stage:

  • Sources — "Each source emits events at its own grain. Mobile SDK is per-action; OLTP CDC is per-row-change."
  • Ingest — "Kafka topics are partitioned by entity_id; one event = one Kafka message."
  • Storage — "Raw zone keeps every event forever; curated zone is per-entity-per-hour rollups."
  • Transform — "Star schema in the warehouse — dim_user, fact_event at user-event grain."
  • Serve — "BI queries hit per-user-per-day aggregates; ML serving hits per-user-realtime feature store."

Junior engineers say "Spark transforms data". Senior engineers say "the transform layer aggregates from per-event to per-user-per-day grain, with both grains co-existing for different consumer needs."
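To make the grain language concrete, here is a minimal PySpark sketch of that last hop — rolling a per-event table up to per-user-per-day. Table and column names (silver.events, gold.user_daily_activity, event_at, session_id) are illustrative assumptions, not part of the canonical diagram.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.table("silver.events")               # grain: one row per event

daily = (
    events
    .withColumn("event_date", F.to_date("event_at"))
    .groupBy("user_id", "event_date")               # grain: one row per user per day
    .agg(
        F.count("*").alias("event_count"),
        F.countDistinct("session_id").alias("sessions"),
    )
)

daily.writeTo("gold.user_daily_activity").createOrReplace()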

§ 02 — Storage primitives

Lakehouse vs warehouse vs lake — pick the right primitive.

The 2024 reality: most companies are mid-migration from warehouse-only or lake-only to lakehouse. Knowing why each exists and where each wins is the difference between a recommendation and a religious war.

The four flavors

Flavor | Examples | Strength | Weakness
Warehouse | Snowflake, BigQuery, Redshift | Fastest SQL, strongest governance, easy for analysts | Compute coupled to vendor; expensive at PB scale; weak for ML / images
Lake | S3 / GCS + Spark / Presto | Cheapest storage; supports any file type; full vendor neutrality | Slow on small queries; no transactions; "data swamp" risk
Lakehouse | Databricks Delta, Iceberg + any engine, Snowflake Unistore | Lake's economics + warehouse's transactions; vendor-portable file format | Younger tooling; more moving parts; team must understand both worlds
Streaming store | Pinot, Druid, ClickHouse | Sub-second analytics on event data | Limited SQL surface; one shape of question per cluster

Lakehouse — what makes it different

The lakehouse isn't a product — it's a file format with transactional metadata layered on top of cheap object storage. Three open-source contenders: Apache Iceberg, Delta Lake, Apache Hudi. All three solve the same problems:

Lake limitation | How the lakehouse fixes it
No ACID — concurrent writes corrupt | Transaction log records every commit; readers see a consistent snapshot
No schema enforcement — junk data accumulates | Schema validation at write; evolution rules tracked
No DELETE / UPDATE — only file overwrite | Row-level operations via merge-on-read or copy-on-write
No time travel — can't see "yesterday's view" | Snapshot history; query as of any prior commit
Slow file listing at scale | Manifest files index the data files; planning is O(metadata), not O(data)
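Two of those fixes made concrete — a minimal PySpark sketch, assuming a Spark session already configured with an Iceberg catalog named lake and a staged updates view; table and catalog names are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # assumes the Iceberg catalog "lake" is configured

# Row-level upsert — the transaction log commits atomically; concurrent readers keep a consistent snapshot
spark.sql("""
    MERGE INTO lake.db.users AS t
    USING updates AS s
      ON t.user_id = s.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel — query the table as of an earlier point
spark.sql("SELECT * FROM lake.db.users TIMESTAMP AS OF '2024-01-01 00:00:00'")

# Snapshot history is exposed as a metadata table
spark.sql("SELECT committed_at, snapshot_id FROM lake.db.users.snapshots").show()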

Iceberg vs Delta vs Hudi — when to pick which

 | Iceberg | Delta Lake | Hudi
Sponsor | Netflix, Apache | Databricks, now Linux Foundation | Uber, Apache
Engine support | Broadest — Trino, Spark, Flink, Snowflake, BigQuery | Spark-first; Flink + Trino added | Spark-first; Flink + Presto added
Update strategy | Copy-on-write or merge-on-read | Copy-on-write (default), MoR added | Native MoR — best for streaming upserts
Pick when | Multi-engine reads required; vendor-neutral future | All-in on Databricks; deepest tooling integration | Heavy CDC / streaming-upsert workload

The 2024 momentum: Iceberg is winning the multi-engine race; Snowflake, BigQuery, and Databricks all support it. Delta is sticky if you're on Databricks. Hudi remains the choice for high-volume streaming upserts.

When NOT to use a lakehouse

Anti-pattern. Small data (< 1 TB), single team, simple analytics: a lakehouse is overkill. Snowflake or BigQuery alone is faster to ship and cheaper to operate. Lakehouse pays off at multi-PB scale, multi-engine consumers, or when ML training needs raw files alongside curated tables.

The medallion architecture (bronze / silver / gold)

Most lakehouse implementations follow a three-tier "medallion" layering:

Tier | Contents | SLA | Schema enforcement
Bronze | Raw ingested events, exactly as received | Best-effort, no SLA | None — schema-on-read
Silver | Cleaned, deduplicated, conformed dims | Hours | Strict — schema registry enforced
Gold | Business-ready aggregates / cubes | Daily or near-real-time | Strict — versioned + dbt-tested

Drop the term in the interview: "raw events land in bronze, dbt cleans them into silver with conformed dims, gold is the per-team aggregation marts."
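A minimal bronze-to-silver sketch in PySpark, assuming raw JSON events in object storage and an Iceberg/Delta silver table; the path, table, and column names are illustrative.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: raw events exactly as received — schema-on-read, nothing enforced
bronze = spark.read.json("s3://lake/bronze/events/")

# Silver: deduplicated, typed, null-checked — the contract layer
silver = (
    bronze
    .dropDuplicates(["event_id"])
    .withColumn("event_at", F.to_timestamp("event_at"))
    .filter(F.col("user_id").isNotNull())
)

silver.writeTo("lake.silver.events").append()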

Senior signal. "I'd default to Iceberg-on-S3 with Snowflake as the primary query engine — that gives me Snowflake's SQL polish without locking transformation jobs to Snowflake compute. The medallion layering means bronze is replayable forever; silver is the contract layer; gold is the per-team mart."
§ 03 — Architectural quadrant

Batch vs streaming vs Lambda vs Kappa — the quadrant.

The interview's hardest moment: "How would you architect this — batch or streaming?" The honest answer is both, with a clear seam. Lambda and Kappa are the two ways to draw the seam.

The four cases

Case | Example | Latency tolerance
Pure batch | Monthly close, regulatory reports | Hours to days
Pure streaming | Fraud detection, live dashboards | Seconds
Lambda (both paths) | Recommendations: stream for real-time, batch for daily refresh | Mixed
Kappa (single stream) | Modern event-sourced platform — the stream is the source of truth, batch is replay | Mixed

Lambda — the dual-path architecture

                     ┌─────────────────┐
                ┌───▶│   Speed Layer   │──▶ approximate ────┐
                │    │  (Flink/Storm)  │    real-time view  │
┌──────────┐    │    └─────────────────┘                    │    ┌──────────┐
│  Source  │────┤                                           ├───▶│  Serve   │
│  events  │    │    ┌─────────────────┐                    │    │  layer   │
└──────────┘    └───▶│   Batch Layer   │──▶ accurate ───────┘    └──────────┘
                     │   (Spark/dbt)   │    delayed view
                     └─────────────────┘

Lambda pros | Lambda cons
Real-time + accurate; never blocked on accuracy | Two codebases for the same logic — drift inevitable
Batch can fix streaming approximation errors | The reconciliation step is its own problem
Migrate-incrementally pattern (start batch, add stream) | Operational burden — two systems, two on-calls

Kappa — the single-stream architecture

┌──────────┐    ┌──────────────────┐    streaming     ┌──────────┐
│  Source  │───▶│    Kafka log     │───(Flink job)───▶│  Serve   │
│  events  │    │    (durable,     │                  │  layer   │
└──────────┘    │   replayable)    │                  └──────────┘
                └──────────────────┘
                         │
                         │ replay
                         ▼
                ┌──────────────┐
                │  Replay      │
                │  batch job   │
                │  (rebuilds   │
                │  any state)  │
                └──────────────┘

Kappa eliminates the dual codebase. The stream is the source of truth; batch is just the same job replayed against historical Kafka offsets. Three preconditions make Kappa work (a minimal replay sketch follows the list):

  1. Durable event log with long retention — Kafka with infinite retention OR tiered storage (Confluent Cloud, Pulsar)
  2. Idempotent / deterministic stream jobs — replay must produce identical output
  3. Replay-friendly state backend — RocksDB checkpoints in Flink; Kafka Streams state stores
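Preconditions 2 and 3 in miniature — a sketch of a replay job that deterministically rebuilds state from the start of the log, using the confluent_kafka client. The broker address, topic name, and partition count are assumptions.

from confluent_kafka import Consumer, TopicPartition, OFFSET_BEGINNING

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",        # illustrative broker
    "group.id": "replay-user-state",
    "enable.auto.commit": False,
})
# Seek every partition back to the beginning — replay the whole retained log
consumer.assign([TopicPartition("user_events", p, OFFSET_BEGINNING) for p in range(12)])

state = {}  # key -> latest value; last-write-wins keyed upsert, so replaying twice is harmless
while True:
    msg = consumer.poll(timeout=5.0)
    if msg is None:
        break                                  # caught up, for the purposes of this sketch
    if msg.error():
        continue
    state[msg.key()] = msg.value()

consumer.close()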

When to pick which

Pick Lambda when | Pick Kappa when
Migrating from a legacy batch warehouse | Greenfield platform with strong streaming culture
Accuracy SLA is regulator-grade (banking close) | Eventual consistency is acceptable
Team can't yet replay confidently | Team can replay 30+ days of events deterministically
Heterogeneous source latencies (mix of CDC + nightly SFTP) | All sources can be modeled as event streams

The 2024+ default — Kappa-leaning hybrid

Most modern platforms are neither purely Lambda nor purely Kappa. The shape that's winning:

  • Kafka as the spine (event-source-of-truth)
  • Flink jobs writing to Iceberg in near-real-time (sub-minute)
  • "Batch" jobs are dbt incremental models that read the same Iceberg tables
  • One transform logic; two query paths (Flink for low-latency views, dbt for analyst SQL)
Senior signal. "I'd start Kappa-leaning — Kafka spine, Flink jobs writing to Iceberg, dbt models reading from there. If accuracy regulators force a true Lambda dual-path, I'd add the batch reconciliation only for the regulated facts (revenue, settlement) — not the whole platform. The cost of dual code is the cost of the audit; Kappa for everything else."
§ 04 — Orchestration

dbt + Airflow patterns — orchestration that survives audit.

Airflow runs the schedule. dbt builds the SQL. Together they're the most common transformation stack in 2024. The interview question is rarely "what tool" — it's "how do they integrate so a failed step doesn't corrupt the warehouse".

The division of labor

Concern | Owner
Schedule (cron / sensor / event) | Airflow
Inter-DAG dependencies | Airflow (TriggerDagRunOperator, ExternalTaskSensor)
Within-DAG transformation logic | dbt (models, tests, docs)
Idempotency (same input → same output) | dbt (materializations + incremental keys)
Data quality assertions | dbt (schema.yml tests) + Great Expectations / Soda for richer checks
Lineage | dbt's auto-generated DAG; OpenLineage for cross-tool
Backfill | Airflow's date-range backfill + dbt's --vars '{"target_date": ...}'

The canonical dbt model lifecycle

One dbt model goes through this lifecycle on every run:

1. Source check     ─▶  dbt source freshness   (is upstream data fresh?)
2. Build            ─▶  dbt run --models +my_model  (run model + upstream deps)
3. Test             ─▶  dbt test --models my_model  (assertions: not_null, unique, accepted_values, custom)
4. Docs / lineage   ─▶  dbt docs generate              (refresh DAG + descriptions)
5. Notify           ─▶  Slack / PagerDuty on failure   (Airflow on_failure_callback)

Materializations — pick the right one

Materialization | What it does | Use when
view | Just CREATE VIEW; no data stored | Cheap transforms over small upstream
table | Full rebuild every run | Small dim tables; logic changes frequently
incremental | Insert new rows; merge on key | Large fact tables; only recent data changes
snapshot | SCD2 — track history of changes | Slowly-changing dims you must audit
ephemeral | Inlined as a CTE; never materialized | Reusable transformation logic, no separate object

The incremental pattern that survives backfill

{{ config(
    materialized='incremental',
    unique_key='event_id',
    on_schema_change='append_new_columns',
    incremental_strategy='merge'
) }}

SELECT event_id, user_id, event_type, event_at
FROM {{ source('raw', 'events') }}

{% if is_incremental() %}
  WHERE event_at > (SELECT MAX(event_at) FROM {{ this }})
{% endif %}

Two failure modes to anticipate:

  • Late-arriving rows — events with event_at earlier than current MAX get skipped. Fix: add a lookback (event_at > MAX(event_at) - INTERVAL '24 hours') and let MERGE deduplicate.
  • Reprocessing — the --full-refresh flag rebuilds from scratch. Required when business logic changes; protected by branch-CI gating.

Airflow patterns — the four every senior knows

Pattern | When
Sensor + downstream DAG (ExternalTaskSensor) | Wait for an upstream team's DAG to finish before starting
Dynamic task mapping (.expand) | One task per partition / customer / date — fan-out at runtime
TaskGroup | Visual grouping of related tasks; reusable patterns across DAGs
Branching (BranchPythonOperator) | Different paths based on a data-driven decision (only run the gold layer if silver passed quality)
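Dynamic task mapping is the one of the four that reads best as code. A minimal sketch, assuming Airflow 2.4+; the DAG id and the hard-coded partition list are illustrative (a real DAG would compute the list at runtime).

from datetime import datetime

from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="per_partition_rebuild",            # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    @task
    def list_partitions() -> list[str]:
        return ["2024-01-01", "2024-01-02", "2024-01-03"]   # resolved at runtime in practice

    @task
    def process(partition: str) -> None:
        print(f"processing {partition}")

    # Fan-out: one mapped task instance per element of the returned list
    process.expand(partition=list_partitions())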

The "audit-safe" pattern

Compliance-grade pipelines need to prove what ran, when, with what version, and what the input was. The pattern:

  1. Versioned dbt manifest — every run captures the dbt project version + git SHA
  2. Run logs in a database — Airflow's metadata DB plus dbt's run_results.json uploaded to S3 per run (a minimal upload sketch follows this list)
  3. Lineage snapshot — OpenLineage events per task to a central catalog
  4. Reproducibility test — quarterly job that rebuilds last quarter's gold table from raw + git SHA, asserts byte-equality
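Step 2 as a sketch — shipping the dbt artifacts to S3 keyed by the git SHA that produced them. The bucket name and key layout are assumptions; dbt writes its artifacts to ./target by default.

import subprocess

import boto3

sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

s3 = boto3.client("s3")
for artifact in ("manifest.json", "run_results.json"):
    s3.upload_file(
        Filename=f"target/{artifact}",          # dbt's default artifact directory
        Bucket="audit-dbt-artifacts",           # illustrative bucket
        Key=f"dbt-runs/{sha}/{artifact}",
    )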
Trap. "We use dbt + Airflow" is not an answer. The senior version names the materializations, the incremental lookback, the audit trail, and the failure-callback wiring. Memorize that stack.
Senior signal. "Airflow owns scheduling and inter-DAG dependencies; dbt owns within-DAG SQL and data quality. The integration point is one BashOperator running dbt run --models +tag:gold with a retries=2 and an on_failure_callback posting to PagerDuty. Backfill is two flags away — Airflow's --start-date + dbt's --vars. And the dbt manifest goes to S3 every run for audit replay."
§ 05 — Contracts

Schema registry & contract-first pipelines.

Three things break a data platform: schema changes, schema changes, and schema changes. A field renamed upstream silently breaks 40 downstream dashboards two weeks later. Schema registry is the contract layer that prevents this.

What a schema registry actually is

A schema registry stores the structure (Avro / Protobuf / JSON Schema) of every Kafka topic. Producers register their schema; consumers fetch it; the registry enforces compatibility rules on every change.

Component | Examples
Schema registry | Confluent Schema Registry, Apicurio, AWS Glue Schema Registry
Schema format | Avro (most common, binary), Protobuf (Google-preferred), JSON Schema (least efficient)
Compatibility check | Backward / Forward / Full / None — enforced at publish time

The four compatibility modes

Mode | What it guarantees | Safe upgrade order
Backward | Consumers on the new schema can read data written with the old schema | Upgrade consumers first
Forward | Consumers on the old schema can read data written with the new schema | Upgrade producers first
Full | Both directions | Upgrade in any order
None | No guarantee — anything goes | Production fire

Default to backward for fact streams (consumers are usually dashboards / analytics that lag producers). Default to full for multi-team event APIs.

Allowed schema changes (Avro example)

Change | Backward | Forward | Full
Add field with default | ✓ | ✓ | ✓
Remove field | ✓ | ✗ (✓ only if the field had a default) | ✗ (✓ only if it had a default)
Add field without default | ✗ | ✓ | ✗
Rename field | ✗ | ✗ | ✗
Change field type | ✗ | ✗ | ✗

Renames and type changes are never compatible. The senior pattern: "add the new field, dual-write for a deprecation window, switch consumers, drop the old field."

Contract-first development — the workflow

  1. Producer team writes Avro schema in a shared repo (typically a "schemas" monorepo)
  2. CI validates compatibility against the registered schema — the PR fails on a breaking change (a minimal check sketch follows this list)
  3. Schema published on merge to schema registry; producer code generated from it
  4. Consumer teams discover schemas via registry browser; generate their own client code
  5. Runtime enforcement — the producer's serializer refuses to publish messages that don't match the registered schema (Confluent Server can additionally enforce schema validation broker-side)
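Step 2 as a sketch — a CI script that posts the candidate schema to the registry's compatibility endpoint and fails the build on a break. The registry URL, subject name, and file path are assumptions; the endpoint itself is part of the standard Schema Registry REST API.

import sys

import requests

REGISTRY = "http://schema-registry:8081"       # illustrative
SUBJECT = "user_events-value"                  # illustrative subject

with open("schemas/user_events.avsc") as f:
    candidate = f.read()

resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": candidate},
    timeout=10,
)
resp.raise_for_status()

if not resp.json().get("is_compatible", False):
    print(f"Schema change to {SUBJECT} breaks compatibility — failing the build")
    sys.exit(1)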

The tombstone / deprecation pattern

// Schema v1
{ "type": "record", "name": "User", "fields": [
  { "name": "user_id", "type": "long" },
  { "name": "name",    "type": "string" }
]}

// Schema v2 — add email and the split name fields, deprecate name
{ "type": "record", "name": "User", "fields": [
  { "name": "user_id", "type": "long" },
  { "name": "name",    "type": "string", "deprecated": true },
  { "name": "email",   "type": ["null", "string"], "default": null },
  { "name": "first_name", "type": ["null", "string"], "default": null },
  { "name": "last_name",  "type": ["null", "string"], "default": null }
]}

// 90 days later, schema v3 — drop name (email and the split name fields remain)
{ "type": "record", "name": "User", "fields": [
  { "name": "user_id", "type": "long" },
  { "name": "email",   "type": ["null", "string"], "default": null },
  { "name": "first_name", "type": ["null", "string"], "default": null },
  { "name": "last_name",  "type": ["null", "string"], "default": null }
]}

Schema in the warehouse — dbt contracts

dbt 1.5+ added contract enforcement at the model level — runtime validation that the output table matches a declared schema:

models:
  - name: dim_user
    config:
      contract:
        enforced: true
    columns:
      - name: user_id
        data_type: bigint
        constraints: [{ type: not_null }, { type: primary_key }]
      - name: email
        data_type: varchar
        constraints: [{ type: not_null }]

The model fails to build if the output schema drifts. Combined with downstream tests, this is the warehouse-side equivalent of schema registry.

Senior signal. "Every Kafka topic has an Avro schema in Confluent Registry, defaulted to backward compatibility. Schema changes go through a PR-CI loop that runs the registry's compatibility check. Renames are tombstone-then-add — a 90-day deprecation window with both fields populated. In the warehouse we layer dbt contracts so the model build fails on accidental schema drift. The contract is the entire architectural backbone — without it, the platform rots in 6 months."
§ 06 — Articulation

The 90-second articulation script — works on any prompt.

You've drawn the diagram, named the technologies, called out the trade-offs. The closing 90 seconds bring it together as one coherent design. Memorize this template; substitute the specific tools and grain.

▸ THE 90-SECOND SCRIPT

"For a platform with [latency requirement] and [volume hint], I'd shape this as five stages: sources → ingest → storage → transform → serve, with schema registry, metadata catalog, and lineage as cross-cutting control planes."

"On storage, I'd default to a lakehouse pattern — Iceberg files on S3 — with a medallion layering: bronze for raw, silver for cleaned with conformed dims, gold for the per-team aggregation marts. That gives me lake economics with warehouse transactional guarantees, and Iceberg's broad engine support means I'm not locked into one query engine."

"On the batch-vs-streaming question, I'd go Kappa-leaning: Kafka as the event spine, Flink jobs writing into Iceberg in near-real-time, dbt models on top reading the same tables. One transform logic, two read paths — Flink for sub-minute views, dbt for analyst SQL. If a regulatory fact like revenue needs reconciliation, I'd add a true Lambda batch path just for that fact, not the whole platform."

"Orchestration is Airflow over dbt — Airflow owns scheduling and cross-DAG dependencies, dbt owns within-DAG SQL transformations and data-quality tests. Materializations are mostly incremental with merge strategy for late-arriving data. Audit replay works because every dbt run uploads its manifest + run_results to S3 keyed by git SHA."

"Schema registry with backward compatibility is non-negotiable — every Kafka topic registered, schema PRs CI-validated, renames are tombstone-then-add over 90 days. Warehouse-side, dbt contracts enforce output schema at build time. The contract layer is what keeps the platform from rotting."

"The three risks I'd flag and mitigate first: schema drift (registry + contracts), late-arriving data (incremental lookback windows), and skew at the heaviest entity (salted aggregation in Spark / Flink). The platform without these three is a ticking clock."

Three sentences that signal seniority — in any round

  1. "I'd start with the grain question — what's the smallest unit of fact, and what's the consumer grain?"
  2. "The contract layer (schema registry + dbt contracts) does more for reliability than the orchestration choice."
  3. "Lambda is what you choose when the team can't yet replay confidently; Kappa is the default when they can."
· · ·
▸ Six sections · the patterns hold across domains · go well.