Overview
A multi-file knowledge base for deep, production-grade data engineering. Each file goes far beyond surface treatment — internals, math, code samples, war stories, and the things that get asked in senior interview loops.
How to use this
Read in order if you're studying. Jump to a file if you're cramming for a specific area. The final file is interview Q&A built from real scenarios — page-at-3am incidents, design rounds, debugging walkthroughs — not leetcode.
Each file is self-contained. Code samples are runnable (Spark/PySpark, Flink, SQL on Snowflake/BigQuery/Postgres, Python). All examples assume Python 3.11+, Spark 3.5+, Flink 1.18+ unless stated otherwise.
Files
| # | File | What's in it |
|---|---|---|
| 01 | 01-data-modeling.md | Bounded vs unbounded data shapes, Kimball/Inmon/Vault, SCD 0–7 with code, fact grain, OBT vs star, lakehouse modeling, NULL semantics, contract design |
| 02 | 02-batch-processing.md | Bounded data theory, idempotency proofs, incremental + backfill patterns, MERGE patterns, file-format internals, partition design math |
| 03 | 03-streaming-processing.md | Unbounded data, event time vs processing time, watermark math, window mechanics, Flink internals, Kafka protocol, exactly-once proofs |
| 04 | 04-spark-internals.md | Catalyst rules, AQE behavior, shuffle service, broadcast vs SMJ vs SHJ math, skew detection algorithms, memory model, Tungsten |
| 05 | 05-sql-deep-dive.md | Logical → physical plan translation, join algorithms with complexity, indexes & zone maps, window function internals, advanced patterns (gaps, sessionize, PIT, sketches) |
| 06 | 06-python-de.md | GIL at the bytecode level, Pandas vs Polars vs PySpark trade-off math, pandas/Arrow boundary, generators, async patterns, testing strategy, packaging |
| 07 | 07-lakehouse-iceberg.md | Iceberg & Delta on-disk layout, snapshot isolation, MERGE internals, compaction, hidden partitioning, time travel, schema evolution semantics |
| 08 | 08-interview-qa-scenarios.md | 40+ real interview scenarios with full answer skeletons — incident response, system design, deep internals, judgment calls. Not leetcode. |
Reading paths
Cramming for an interview in a week: 08 first to know what you're aiming for, then 02/03 (processing), then 05 (SQL), then 01 (modeling), then 04 (Spark) and 07 (lakehouse) for depth on stack-specific questions.
Building a real system: 01 → 02 → 03 → 07 → 04. SQL and Python depth as needed.
Refresher / lookup: jump to whichever file has the topic. Each file's TOC is browsable.
What "deep" means here
Where the previous deep dive said "watermarks are a promise that no events with event_time ≤ W will arrive," this version derives the watermark formula, walks through how Flink computes it across operators with different latencies, shows the actual math for tuning bounded-out-of-orderness, and gives the code.
Where the previous version said "AQE handles skew automatically," this version shows the algorithm — how Spark detects a skewed partition, how it splits it, what the trade-offs are with broadcast vs shuffle, and the configs that govern each step.
That's the bar.
Data Modeling
The shape of your data at rest outlives every other choice. Code gets rewritten; schemas get migrated but never quite leave. This file goes deeper than "use a star schema" — it derives the decisions from first principles, shows the math that justifies them, and gives full DDL and MERGE patterns you can actually run.
1. Bounded vs Unbounded Data at Rest
Before storage models, the underlying question: is the dataset bounded (finite, the end is known) or unbounded (arriving forever)?
- Bounded at rest: a daily snapshot, a historical archive, a reference table. You can scan it, sort it, compute global aggregates, and re-derive it from source on demand.
- Unbounded at rest: an append-only log of events where you're modeling "everything that ever happened." Storage grows forever. Queries must always constrain time.
This distinction drives physical choices:
| Aspect | Bounded tables | Unbounded tables |
|---|---|---|
| Partitioning | Optional, by category | Required, by time |
| Sort key | By query predicate | By time, always |
| Vacuum policy | Rare, small savings | Aggressive — old partitions dropped or cold-tiered |
| Query pattern | Full scans tolerable | Must always include time bound |
| Compaction | Manual, when files accumulate | Continuous or scheduled |
The engineering mistake is treating an unbounded stream-origin table as bounded — e.g., SELECT COUNT(DISTINCT user_id) FROM events without a dt filter. Over time this becomes unrunnable. Build guards into your table design: required partition columns and partition-prune-or-fail settings like BigQuery's require_partition_filter=TRUE. (Snowflake has no DDL-level equivalent; there, enforcement is operational.)
-- BigQuery: reject queries that don't filter on the partition column
CREATE TABLE playback.events_daily (
event_ts TIMESTAMP,
user_id STRING,
title_id STRING,
watch_ms INT64,
dt DATE
)
PARTITION BY dt
OPTIONS (
require_partition_filter = TRUE,
partition_expiration_days = 730 -- auto-drop after 2 years
);
-- Snowflake equivalent: rely on clustering + session param
ALTER SESSION SET QUERY_TAG = 'enforce_partition_filter';
-- Enforcement is done via resource monitors + query profiling, not DDL.

The unbounded nature also affects backfills. For a bounded dimension, "regenerate from source" is a few GB; for an unbounded fact, it's years of events.
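Since Snowflake won't enforce this in DDL, the guard can live in pipeline code instead. A minimal sketch of the idea, using a naive regex check rather than a real SQL parser (the helper name is invented; a production version would inspect the parsed plan):

```python
import re

def has_partition_filter(sql: str, partition_col: str = "dt") -> bool:
    """Crude guard: does the query's WHERE clause mention the partition
    column at all? (A real implementation would walk the parsed AST.)"""
    where = re.search(r"\bWHERE\b(.*)", sql, re.IGNORECASE | re.DOTALL)
    return bool(where and re.search(rf"\b{partition_col}\b", where.group(1)))

has_partition_filter("SELECT COUNT(DISTINCT user_id) FROM events")           # → False
has_partition_filter("SELECT COUNT(*) FROM events WHERE dt = '2026-04-16'")  # → True
```

Wire a check like this into the submission path for known unbounded tables and the "accidental full scan" class of incident mostly disappears.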
2. OLTP vs OLAP — The Physical Reality
The choice between row-oriented and column-oriented storage isn't philosophical — it's dictated by what the hardware does with your queries.
Row-oriented storage (Postgres, MySQL, Oracle)
A row is laid out contiguously on disk. Reading row 42 = one random I/O + parse the row. Fast for "give me all columns of one row by primary key." Slow for "give me the average of column X across 100M rows" because every row must be loaded even though only one column is needed.
The storage page (typically 8 KB in Postgres, 16 KB in MySQL) is the unit of I/O. Postgres reads whole pages into shared_buffers. TID is (page_number, slot_number). B-tree indexes point at TIDs; looking up row 42 = B-tree walk (log N pages) + heap fetch.
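To put numbers on the log N claim, a quick estimate of B-tree height (the fanout of ~400 entries per 8 KB page is an assumed round number, not a Postgres constant):

```python
import math

def btree_height(n_rows: int, fanout: int = 400) -> int:
    """Approximate B-tree levels touched by a point lookup: ceil(log_fanout(N))."""
    return math.ceil(math.log(n_rows, fanout))

btree_height(100_000_000)  # → 4: a lookup in 100M rows walks ~4 index pages
```

The tree stays shallow even at huge row counts, which is exactly why point lookups are cheap in a row store while full-column aggregates are not.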
Column-oriented storage (Parquet, ORC, Snowflake FDN, Redshift, BigQuery Capacitor)
Each column stored separately, with its own compression. Reading column X = read just X's bytes, skipping everything else. Compression ratios of 5–20× are routine because a single column has low cardinality locally.
A Parquet file is organized as:
- File footer: schema, row-group locations, column stats (min/max/null_count/distinct_count).
- Row groups (typically 128 MB): a horizontal slice of rows.
- Column chunks: within a row group, one per column.
- Pages (typically 1 MB): within a chunk, the unit of decoding.
The footer stats are the basis of row-group skipping — a WHERE filter can exclude row groups whose min/max don't match. This is why WHERE dt = '2026-04-16' AND country = 'US' is fast on a Parquet table partitioned by dt and sorted by country: both conditions prune.
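The skipping decision itself is simple enough to sketch in a few lines. Below, the stats dicts stand in for what a reader pulls from the Parquet footer (the layout is invented for illustration, not pyarrow's API):

```python
def prune_row_groups(row_groups, predicates):
    """Return indexes of row groups whose min/max stats could satisfy
    every equality predicate. row_groups: list of {col: (min, max)}."""
    def may_match(stats, col, value):
        lo, hi = stats[col]
        return lo <= value <= hi
    return [
        i for i, stats in enumerate(row_groups)
        if all(may_match(stats, col, v) for col, v in predicates.items())
    ]

groups = [
    {"dt": ("2026-04-15", "2026-04-15"), "country": ("AR", "FR")},
    {"dt": ("2026-04-16", "2026-04-16"), "country": ("AA", "MX")},
    {"dt": ("2026-04-16", "2026-04-16"), "country": ("NL", "ZA")},
]
prune_row_groups(groups, {"dt": "2026-04-16", "country": "US"})  # → [2]
```

Only the third row group survives; the other two are excluded by footer stats alone, before any column data is fetched. Sorting within partitions is what makes the country min/max ranges narrow enough to prune.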
Compression schemes within a column:
- RLE (Run-Length Encoding): AAAA BBBB CCCC → (A,4)(B,4)(C,4). Great on sorted columns with runs.
- Dictionary encoding: map values to ints; store ints + dictionary. Great on low-cardinality strings.
- Bit-packing: encode small integers in <8 bits each. Often combined with RLE.
- Delta encoding: store differences from previous value. Great on timestamps, sorted sequences.
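As a toy illustration of the first and last schemes (conceptual only, not Parquet's actual bit-level formats):

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode: AAAABBB -> [('A', 4), ('B', 3)]."""
    return [(v, len(list(run))) for v, run in groupby(values)]

def delta_encode(values):
    """Store the first value plus successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

rle_encode("AAAABBBBCCCC")        # → [('A', 4), ('B', 4), ('C', 4)]
delta_encode([1000, 1002, 1005])  # → [1000, 2, 3]
```

Note how both schemes reward sorted data: sorting maximizes run lengths for RLE and minimizes the deltas, which then bit-pack into fewer bits. This is why sort keys matter for compression, not just pruning.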
The column-oriented nature has a direct modeling implication: adding a column is nearly free at read time (you only pay for it if you select it). Wide tables (OBT) aren't the disaster they'd be in a row store.
OLTP modeling implications
- Normalize: each update touches one row, so redundancy is expensive.
- B-tree indexes on everything queried. Covering indexes for hot queries.
- FK constraints enforced at write.
- Partitioning is a tool for manageability (vacuum, detach old partitions) more than performance.
OLAP modeling implications
- Denormalize within reason. Joins are expensive in distributed systems; repeated string values cost almost nothing with dictionary encoding.
- Sort / cluster by the most common filter key (usually date).
- No FK constraints; integrity is pipeline-enforced.
- Partitioning is the #1 physical layout decision. Get it wrong and queries scan entire years.
HTAP blur
New engines (TiDB, CockroachDB, SingleStore, Snowflake Hybrid Tables) offer row + column storage of the same logical table — writes go to the row store, reads to the column store with asynchronous replication. Nice for mid-size workloads; still limited in scale or consistency compared to pure OLTP/OLAP.
3. Normalization Derived From Functional Dependencies
Normalization is taught as a set of rules. It's actually a consequence of functional dependencies — statements of the form "column set X determines column set Y" (written X → Y).
Given dependencies:
order_id → customer_id, order_date, status
customer_id → customer_name, country
The relation Orders(order_id, customer_id, order_date, status, customer_name, country) violates 3NF because customer_id (non-key) determines customer_name, country — a transitive dependency. Decomposition:
Orders(order_id, customer_id, order_date, status)
Customers(customer_id, customer_name, country)
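A functional dependency is a testable property of the data: X → Y holds iff no two rows agree on X but disagree on Y. A small checker, useful for validating a proposed decomposition against real rows (the helper and sample data are illustrative):

```python
def fd_holds(rows, lhs, rhs):
    """Check X -> Y: every distinct lhs value maps to exactly one rhs value."""
    seen = {}
    for r in rows:
        x = tuple(r[c] for c in lhs)
        y = tuple(r[c] for c in rhs)
        if seen.setdefault(x, y) != y:
            return False  # same X, different Y: dependency violated
    return True

rows = [
    {"order_id": 1, "customer_id": 7, "customer_name": "Ada"},
    {"order_id": 2, "customer_id": 7, "customer_name": "Ada"},
    {"order_id": 3, "customer_id": 8, "customer_name": "Bo"},
]
fd_holds(rows, ["customer_id"], ["customer_name"])  # → True
```

If this returns False for a dependency your decomposition assumes, the split tables will silently disagree after the migration. Cheap to run on a sample before committing to a schema.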
1NF — atomic values
No arrays, no comma-separated strings, no nested structures that the DB can't query. tags = "red,blue,green" forces LIKE '%blue%' queries that can't use indexes. Decompose into a child table order_tags(order_id, tag).
Modern engines allow arrays (Postgres text[], BigQuery ARRAY<STRING>) with GIN/ARRAY-function support. That's a departure from strict 1NF but acceptable when the array operations are well-supported.
2NF — full functional dependency on the key
Only relevant for composite keys. If PK is (order_id, line_number) and order_date depends on order_id alone, order_date doesn't belong in the line-item table. Move it.
3NF — no transitive dependencies
Shown above. In practice, 3NF is the ceiling for OLTP — further decomposition rarely helps.
BCNF (3.5NF) — every determinant is a candidate key
The textbook failure case: ClassRoom(course_id, instructor, room) where (course_id, instructor) → room and instructor → room. BCNF requires instructor to be a candidate key; it isn't, so decompose. These situations are rare in real data.
4NF & 5NF — multi-valued & join dependencies
Multi-valued dependencies arise when two independent multi-valued facts coexist: a teacher teaches multiple courses AND speaks multiple languages, and all combinations appear. 4NF says split these. 5NF handles even more arcane join dependencies. In 20 years of modeling work I've applied these consciously twice.
Practical heuristic
Normalize to 3NF. Denormalize deliberately, with a comment explaining why. A typical OLTP schema has 90% 3NF + 10% intentional denormalization (caching displayed values, storing denormalized totals for performance).
4. Dimensional Modeling — Kimball in Full
Kimball's dimensional model is a theory of analytic queries. It says: most analytics look like "a measure, broken down by attributes, filtered by other attributes." If your storage matches that shape, queries are trivial and fast.
The four-step process (and what each step actually means)
Step 1 — Business process. Not "the whole business" — a single process: orders, shipments, returns, playback sessions. Business processes are where events get generated; each is a candidate fact table. Get them from business users, not from source-system tables.
Step 2 — Grain. Declare exactly what one fact row represents. "One row per order line item per order" is a grain. "One row per playback session per user" is another. Grain must be atomic — prefer finer grain; you can always aggregate up.
The test: pick any two rows; they must represent genuinely distinct events. If two rows could be the same event counted twice, your grain is wrong.
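That two-row test turns into a pipeline assertion: group by the declared grain columns and flag anything appearing twice. A sketch over plain dicts (in SQL this is GROUP BY grain HAVING COUNT(*) > 1):

```python
from collections import Counter

def grain_violations(rows, grain_cols):
    """Return grain-key tuples that appear more than once."""
    counts = Counter(tuple(r[c] for c in grain_cols) for r in rows)
    return {k: n for k, n in counts.items() if n > 1}

rows = [
    {"session_id": "s1", "user_id": "u1", "watch_ms": 100},
    {"session_id": "s2", "user_id": "u1", "watch_ms": 200},
    {"session_id": "s2", "user_id": "u1", "watch_ms": 200},  # duplicate event
]
grain_violations(rows, ["session_id"])  # → {('s2',): 2}
```

Run the SQL equivalent after every load. A non-empty result means either the grain declaration is wrong or the pipeline is emitting duplicates; both corrupt every downstream SUM.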
Step 3 — Dimensions. The descriptors: who, what, where, when, why, how. For each dimension, identify the grain and attributes. Dimensions are denormalized (all attributes in one table) — you want category, subcategory, brand, department all present in dim_product even though they form a hierarchy.
Step 4 — Facts. The numeric measures at the declared grain. Prefer additive measures (quantity, extended_price). Document semi-additive (balance — additive across accounts but not time) and non-additive (ratio — don't sum ever).
Full star schema — playback analytics
-- -----------------------------------------------------
-- Date dimension (discussed more in §8)
-- -----------------------------------------------------
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- 20260419
full_date DATE NOT NULL,
day_of_week SMALLINT NOT NULL, -- 1=Monday
day_name VARCHAR(10) NOT NULL,
day_of_month SMALLINT NOT NULL,
day_of_year SMALLINT NOT NULL,
week_of_year SMALLINT NOT NULL, -- ISO week
month_num SMALLINT NOT NULL,
month_name VARCHAR(10) NOT NULL,
quarter SMALLINT NOT NULL,
year SMALLINT NOT NULL,
fiscal_year SMALLINT NOT NULL,
fiscal_quarter SMALLINT NOT NULL,
fiscal_period SMALLINT NOT NULL,
is_weekend BOOLEAN NOT NULL,
is_holiday BOOLEAN NOT NULL,
holiday_name VARCHAR(50),
is_business_day BOOLEAN NOT NULL,
days_from_today INT -- maintained via view
);
-- -----------------------------------------------------
-- User dimension — SCD Type 2
-- -----------------------------------------------------
CREATE TABLE dim_user (
user_key BIGINT PRIMARY KEY, -- surrogate
user_id VARCHAR(36) NOT NULL, -- natural/durable
email VARCHAR(320),
country_code CHAR(2),
subscription_tier VARCHAR(20), -- basic/standard/premium
household_size SMALLINT,
profile_language VARCHAR(10),
signup_date DATE,
churn_date DATE,
valid_from TIMESTAMPTZ NOT NULL,
valid_to TIMESTAMPTZ,
is_current BOOLEAN NOT NULL,
hash_diff CHAR(64) NOT NULL -- SHA-256 of tracked attrs
);
CREATE UNIQUE INDEX dim_user_natural ON dim_user(user_id, valid_from);
CREATE INDEX dim_user_current ON dim_user(user_id) WHERE is_current;
-- -----------------------------------------------------
-- Title dimension — mostly Type 1, some Type 2 (rating)
-- -----------------------------------------------------
CREATE TABLE dim_title (
title_key BIGINT PRIMARY KEY,
title_id VARCHAR(36) NOT NULL,
title VARCHAR(500),
title_type VARCHAR(20), -- movie/series/documentary
genre_primary VARCHAR(50),
genre_secondary VARCHAR(50),
runtime_minutes INT,
release_year SMALLINT,
language_primary VARCHAR(10),
content_rating VARCHAR(20), -- MA, PG-13, etc.
current_rating NUMERIC(3,2), -- aggregated, Type 1
season_number SMALLINT,
episode_number SMALLINT,
is_original BOOLEAN,
valid_from TIMESTAMPTZ NOT NULL,
valid_to TIMESTAMPTZ,
is_current BOOLEAN NOT NULL
);
-- -----------------------------------------------------
-- Device dimension — Type 1 (device facts don't "change" per user)
-- -----------------------------------------------------
CREATE TABLE dim_device (
device_key BIGINT PRIMARY KEY,
device_id VARCHAR(64) NOT NULL,
device_type VARCHAR(20), -- tv/mobile/tablet/web
manufacturer VARCHAR(50),
model VARCHAR(100),
os VARCHAR(50),
os_version VARCHAR(20),
app_version VARCHAR(20)
);
-- -----------------------------------------------------
-- Playback session fact — transaction grain
-- One row per playback session (session starts and ends)
-- -----------------------------------------------------
CREATE TABLE fact_playback_session (
session_key BIGINT PRIMARY KEY, -- surrogate per session
session_id VARCHAR(64) NOT NULL, -- natural, from client
user_key BIGINT NOT NULL REFERENCES dim_user,
title_key BIGINT NOT NULL REFERENCES dim_title,
device_key BIGINT NOT NULL REFERENCES dim_device,
start_date_key INT NOT NULL REFERENCES dim_date,
end_date_key INT NOT NULL REFERENCES dim_date,
start_ts TIMESTAMPTZ NOT NULL,
end_ts TIMESTAMPTZ NOT NULL,
-- Measures (all numeric, additive where possible)
watch_ms BIGINT NOT NULL, -- additive
buffer_ms BIGINT NOT NULL, -- additive
seeks INT NOT NULL, -- additive
pauses INT NOT NULL, -- additive
max_bitrate_kbps INT, -- non-additive (use avg)
avg_bitrate_kbps INT, -- non-additive (use avg)
completion_ratio NUMERIC(4,3), -- non-additive ratio
-- Degenerate dimensions
ended_reason VARCHAR(20), -- user/credits/error
qoe_score NUMERIC(3,2),
-- Partition
dt DATE NOT NULL
)
PARTITION BY RANGE (dt);

Key details people miss:
- Every FK is a surrogate _key, not a natural _id. The natural IDs are attributes, preserved for lookup.
- start_date_key and end_date_key both reference dim_date — same table, role-playing.
- session_id is stored as a degenerate dimension (an attribute on the fact with no corresponding dim table) because there's no other data to put in a dim_session.
- max_bitrate_kbps and avg_bitrate_kbps are non-additive measures. You need both if you want to compute weighted averages at higher grains.
The query this shape makes easy
-- Total watch hours by country and content rating, last 7 days, weekdays only
SELECT
u.country_code,
t.content_rating,
SUM(f.watch_ms) / 3600.0 / 1000 AS watch_hours
FROM fact_playback_session f
JOIN dim_user u ON u.user_key = f.user_key
JOIN dim_title t ON t.title_key = f.title_key
JOIN dim_date d ON d.date_key = f.start_date_key
WHERE d.full_date >= CURRENT_DATE - 7
AND d.is_weekend = FALSE
GROUP BY u.country_code, t.content_rating
ORDER BY watch_hours DESC;

No CASE statements, no date arithmetic, no EXTRACT — just filter and group. This is why Kimball wins.
Why surrogate keys
Four reasons, in increasing order of importance:
- Natural keys change. Source systems renumber, reassign IDs during migrations. Your warehouse shouldn't break.
- SCD2 requires multiple rows per natural ID — the surrogate disambiguates.
- Integer surrogates are small and fast in joins. A BIGINT is 8 bytes; a VARCHAR(36) UUID is 36 bytes + overhead and comparison is slower.
- Surrogates decouple the warehouse's keyspace from source systems. You can integrate two sources whose customer_id namespaces collide.
Generate surrogates deterministically with SHA-256 of the natural key (for idempotent re-runs) or from a sequence (for insertion-order keys). The deterministic approach is safer in modern pipelines.
-- Deterministic surrogate from natural key
SELECT
HASH(user_id, valid_from) AS user_key, -- Snowflake HASH(x1,...) -> BIGINT
user_id, valid_from, ...
FROM staging_users;
-- Spark / PySpark
from pyspark.sql.functions import sha2, concat_ws, col
df = df.withColumn(
"user_key",
sha2(concat_ws("||", col("user_id"), col("valid_from")), 256)
)

Use a hashing function that's stable across restarts; Spark's built-in hash() isn't guaranteed stable across versions, but sha2 and md5 are.
5. Fact Table Grain (The Most Important Decision)
Grain is where every modeling interview should start. The failure mode is a fact table where some rows are at one grain and others at another. The bug is invisible until a sum is 2× too large.
Three transactional fact patterns
Transaction fact — one row per event. Most common. Fully additive. Example: fact_playback_session above.
Periodic snapshot fact — one row per entity per period. Measures describe the state at period end. Semi-additive.
CREATE TABLE fact_subscription_daily (
snapshot_date_key INT,
user_key BIGINT,
-- Semi-additive: summable across users, NOT across days
mrr_cents BIGINT, -- monthly recurring revenue
status VARCHAR(20),
days_until_renewal SMALLINT,
-- Additive within the day
payments_made INT,
refunds_issued INT,
PRIMARY KEY (snapshot_date_key, user_key)
);

The MRR column is the classic semi-additive trap. Users sum MRR across two days and get 2× the real revenue. Solutions: label the column in the metadata catalog; always aggregate via AVG across time and SUM across entity; make the column name end in _point_in_time.
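The safe-aggregation rule in miniature: SUM across users within a day, then AVG across days, never SUM across days. With invented numbers:

```python
# (user, snapshot_date, mrr_cents) rows for two daily snapshots
rows = [
    ("u1", "2026-04-01", 1000), ("u2", "2026-04-01", 2000),
    ("u1", "2026-04-02", 1000), ("u2", "2026-04-02", 2000),
]

by_day = {}
for user, day, mrr in rows:
    by_day[day] = by_day.get(day, 0) + mrr        # SUM across users: fine

naive_total = sum(m for _, _, m in rows)          # also summed across days
correct_mrr = sum(by_day.values()) / len(by_day)  # AVG across time

assert naive_total == 6000  # wrong: 2x the real revenue
assert correct_mrr == 3000  # right: the actual MRR
```

The wrong answer is plausible-looking, which is what makes the trap dangerous: nothing errors, the dashboard just reports double.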
Accumulating snapshot fact — one row per process instance, updated as it progresses through milestones.
CREATE TABLE fact_order_fulfillment (
order_key BIGINT PRIMARY KEY,
customer_key BIGINT,
-- Milestones (date keys; NULL until reached)
order_date_key INT NOT NULL,
paid_date_key INT,
packed_date_key INT,
shipped_date_key INT,
delivered_date_key INT,
returned_date_key INT,
-- Durations, derived at each update
hours_to_pay INT,
hours_to_pack INT,
hours_to_ship INT,
hours_to_deliver INT,
-- Measures
order_total_cents BIGINT,
items_count INT
);

Pros: a single row per order gives "lead time analysis" trivially. Cons: heavy on updates (updates are the enemy of columnar engines). Use when the process has bounded, well-defined milestones.
Additivity
Additivity is the single most-often-wrong attribute in a fact table. Get it wrong and every dashboard off that table is silently lying. The taxonomy is three-way, but the real skill is recognizing which bucket any given measure falls into across domains. The table below expands the usual starter examples into a cross-industry reference you can pattern-match against.
| Additivity | Sum across any dim? | Representative measures | Warehouse pattern |
|---|---|---|---|
| Additive | Yes | revenue, quantity_sold, watch_ms, clicks, impressions, seeks, bytes_transferred, kWh_consumed, calls_handled, miles_driven, tickets_sold, pageviews, messages_sent, COGS, gross_margin_dollars | SUM() everywhere |
| Semi-additive | Some dims only | account_balance, loan_principal_outstanding, inventory_on_hand, headcount, open_positions, active_users_point_in_time, queue_depth, subscriber_count, MRR/ARR snapshot, portfolio_market_value, occupancy_count, beds_available, in-flight_aircraft, open_tickets, WIP_units, cache_entry_count | SUM() across non-time dims; AVG(), LAST_VALUE(), or MAX(snapshot_date) across time |
| Non-additive | Never | unit_price, margin_pct, conversion_rate, bounce_rate, CTR, CPC, CPM, ROAS, churn_rate, NPS, CSAT, avg_session_duration, LTV, CAC_ratio, z_score, cohort_retention_pct, exchange_rate, utilization_pct, SLA_pct, fill_rate_pct, stock_price, cap_rate, interest_rate | Store numerator + denominator; compute ratio in SELECT. Never pre-aggregate the ratio itself. |
Domain-specific additivity cheat sheet
Use this to pattern-match new domains against known ones. The same measure shape shows up again and again.
| Domain | Additive | Semi-additive (snapshot) | Non-additive (ratios) |
|---|---|---|---|
| SaaS / subscription | signups, cancellations, payments_received, refunds, feature_events | MRR, ARR, active_subscribers, seats_in_use, storage_GB_used | churn_rate, gross_retention_pct, NRR, ARPU, LTV/CAC |
| E-commerce / retail | orders_placed, units_sold, revenue, returns, shipping_cost, gift_cards_issued | inventory_on_hand, cart_value_total, wishlist_size, open_orders | conversion_rate, AOV, return_rate, margin_pct, sell_through_rate |
| Fintech / banking | deposits, withdrawals, transactions_count, interest_accrued, fees_charged | account_balance, loan_outstanding, credit_exposure, AUM | APR, NPL_ratio, NIM, capital_ratio, Sharpe_ratio |
| Payments / fraud | transactions, approved_count, declined_count, chargebacks, refunds_dollar | active_card_count, in_review_queue_depth, fraud_blocklist_size | approval_rate, fraud_rate_bps, false_positive_rate, chargeback_ratio |
| Advertising / adtech | impressions, clicks, conversions, spend_dollars, video_starts, video_completes | active_campaigns, remaining_budget, inventory_available | CTR, CPC, CPM, CPA, ROAS, viewability_pct, VTR |
| Streaming media | watch_ms, starts, completes, seeks, downloads, new_titles_added | concurrent_streams, active_accounts, catalog_size | completion_rate, stream_start_success_rate, rebuffer_ratio |
| Gaming | sessions, purchases, XP_earned, deaths, matches_played, currency_spent | active_players, open_lobbies, inventory_items_held, mmr_snapshot | D1/D7/D30_retention, ARPDAU, win_rate, K/D_ratio, match_fill_rate |
| Ride-share / logistics | rides_completed, miles_driven, fares_collected, driver_hours_online, tips | drivers_online, riders_in_queue, open_requests, fleet_size | acceptance_rate, cancellation_rate, utilization_pct, ETA_accuracy, $/mile |
| IoT / industrial | readings_count, alerts_raised, kWh, bytes_telemetered, maintenance_events | devices_online, queue_depth, battery_level_snapshot, firmware_version_count | uptime_pct, packet_loss_rate, alert_false_positive_rate, MTBF |
| Healthcare / payer | claims_submitted, visits, prescriptions_filled, procedures, payments | members_enrolled, authorizations_open, beds_occupied, PMPM_exposure | denial_rate, readmission_rate, MLR, PMPM_cost_trend, claim_cycle_time |
| Ops / on-call | incidents_opened, pages_sent, deploys, rollbacks, PRs_merged | open_incidents, queue_depth, on_call_count | MTTR, change_failure_rate, deploy_freq_norm, SLO_pct, error_budget_remaining |
| Marketplace (two-sided) | listings_created, bookings, transactions, disputes_opened | active_supply_nodes, active_demand_nodes, open_listings | match_rate, take_rate, supply/demand_ratio, repeat_use_rate |
Measures that look additive but aren't
The following look like counts — they are not safely sum-able without context. Each has a specific remediation.
| Misleading measure | Why it breaks | Fix |
|---|---|---|
| distinct_users per day | Summing two days double-counts users active in both | Store users_active_hll (HyperLogLog sketch); merge sketches, then cardinality |
| unique_sessions | Sessions can span day boundaries; naive sum over-counts or under-counts | Sessionize with a fixed boundary (e.g., always assign to session-start day) |
| avg_session_duration per day | Average of averages weights each day equally regardless of traffic | Store sum_duration_ms + session_count; divide at query time |
| median_order_value | Medians don't aggregate linearly | Store a t-digest or KLL sketch per day; merge sketches, then query percentiles |
| max_concurrent_streams | Summing daily maxes ≠ global max; daily max can coincide or not | Store interval-level counts; compute max over intersection in SELECT |
| returning_visitor_count | "Returning" is relative to a baseline that shifts per query window | Reclassify per query; store first_seen_ts per visitor, derive returning flag on demand |
| churn_rate (pre-computed) | Churn % cannot be averaged across cohorts of different sizes | Store churned_count + cohort_size; compute ratio in SELECT |
| weighted_avg_price | Re-weighted averages don't compose | Store sum_price_times_qty + sum_qty; divide in SELECT |
| rank_position from LIMIT queries | Rank is relative to the surrounding rowset, not intrinsic | Never persist rank as a column; compute via window function on read |
| latency_p99 per minute | Percentiles don't average — p99 of (p99s) is not the global p99 | Store histogram buckets or a sketch; query-time merge |
The canonical non-additive mistake: you have fact_product_daily(product_key, date_key, avg_price). A BI user computes "average price this week = AVG(avg_price) across 7 days." That's an average of averages — not the weighted average they want. Fix: store sum_of_prices and price_count, compute the ratio in the SELECT. Same rule for everything in the third table above — the shape is always "store the components, derive the ratio on read."
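The component fix in code, with a week's worth of the hypothetical fact reduced to (day, sum_of_prices, price_count) tuples:

```python
# Store the components, never the pre-computed ratio
days = [("mon", 100.0, 10), ("tue", 900.0, 30)]

# Average of averages: each day weighted equally regardless of volume
avg_of_avgs = sum(s / n for _, s, n in days) / len(days)

# Weighted average: divide summed components at query time
weighted = sum(s for _, s, _ in days) / sum(n for _, _, n in days)

assert avg_of_avgs == 20.0  # (10 + 30) / 2, wrong
assert weighted == 25.0     # 1000 / 40, right
```

Monday's 10 sales and Tuesday's 30 get equal weight in the naive version; the gap widens with traffic variance, which is exactly when someone finally notices the dashboard is off.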
Factless fact tables
A fact table with only FKs — no measures. Models events or eligibility:
-- "Which users had access to which titles on which day"
CREATE TABLE fact_title_availability_daily (
date_key INT,
user_key BIGINT,
title_key BIGINT,
-- no measures
PRIMARY KEY (date_key, user_key, title_key)
);

The "measure" is COUNT(*) — "how many users had access to Stranger Things in France on 2026-04-01."
These get huge fast (N_users × N_titles × N_days). Sparse-encode where possible (don't emit rows for absent combinations).
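A back-of-envelope sizing shows why sparse encoding is non-optional here (all counts invented):

```python
n_users, n_titles, n_days = 10_000_000, 20_000, 365

dense = n_users * n_titles * n_days   # a row for every combination
assert dense == 73_000_000_000_000    # 73 trillion rows

# Sparse-encode: emit rows only for combinations that actually exist,
# e.g. if roughly 1% of (user, title) pairs are eligible on a given day
sparse = dense // 100
assert sparse == 730_000_000_000      # still large, but 100x smaller
```

When even the sparse count is untenable, the next step is changing grain: model availability as (title, region, day) plus a user-to-region mapping, and derive the per-user view at query time.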
Multi-grain fact: how to avoid it
The canonical multi-grain bug: an orders fact table where some rows are "one per order" and some are "one per line item." Summing order_total sums header rows plus line rows — each order's total appears many times.
Rules:
- One fact table per declared grain.
- If you need both grains, build two fact tables: fact_order_header and fact_order_line. They roll up via shared dimensions.
- Never introduce an "aggregation type" column to disambiguate rows of different grains. That's a band-aid on a severed arm.
6. Dimensions — Conformed, Junk, Degenerate, Mini, Role-Playing
Conformed dimensions
A dimension is conformed when multiple fact tables share it. This is how you answer "return rate by customer segment" — fact_returns.customer_key and fact_orders.customer_key must reference the same dim_customer.
Conformed dimensions are an organizational commitment as much as a technical one. A dim_customer owned by marketing with fields marketing cares about, and another owned by billing with billing fields, is a failure. You need a single dim_customer with all stakeholders as consumers.
The enterprise bus matrix (Kimball's term) is a table of business processes × dimensions, with an X where they intersect. Use it to plan which dimensions must be conformed:
| Process | dim_user | dim_title | dim_device | dim_date | dim_geo | dim_promo |
|---|---|---|---|---|---|---|
| playback_session | X | X | X | X | X | - |
| billing_charge | X | - | - | X | X | X |
| search_query | X | - | X | X | X | - |
| signup | X | - | X | X | X | X |
Junk dimensions
When you have a handful of low-cardinality flags and indicators that don't belong together but don't warrant their own dimensions: combine them into a junk dimension.
-- Instead of: is_trial BOOLEAN, is_promo BOOLEAN, device_is_new BOOLEAN on every fact row...
CREATE TABLE dim_session_flags (
flags_key INT PRIMARY KEY,
is_trial BOOLEAN,
is_promo BOOLEAN,
is_new_device BOOLEAN,
signal_source VARCHAR(20)
);
-- Only 2 × 2 × 2 × (distinct sources) = ~16 rows.

Then the fact has one flags_key FK and you get clean reporting.
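The whole junk dimension can be pre-materialized as a cross product of the flag domains, with the surrogate key derived arithmetically from the combination; a sketch (the signal_source values are invented):

```python
from itertools import product

sources = ["app", "partner"]

# Materialize the entire dimension up front: 2 x 2 x 2 x 2 = 16 rows
rows = [
    {"flags_key": i, "is_trial": t, "is_promo": p,
     "is_new_device": d, "signal_source": s}
    for i, (t, p, d, s) in enumerate(
        product([False, True], [False, True], [False, True], sources)
    )
]

def flags_key(is_trial, is_promo, is_new_device, signal_source):
    """Derive the FK for a fact row's flag combination without a lookup join."""
    return (is_trial * 8 + is_promo * 4 + is_new_device * 2
            + sources.index(signal_source))
```

Because the dimension is exhaustive and the key is derivable, the ETL never needs a lookup join against the junk dim; it computes the FK inline while building fact rows.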
Degenerate dimensions
An attribute on the fact with no corresponding dim table — like session_id above. Reason: there's no other data about a session beyond what's already in the fact row. Creating dim_session would just be a table with one column.
Mini-dimensions
When a dimension has a subset of frequently-changing attributes that would inflate SCD2 storage: split them off.
-- dim_user (stable attrs)
user_key, user_id, email, country_code, signup_date, ...
-- dim_user_demographics (mini-dim; snapshots of volatile attrs)
demographics_key, age_bracket, income_bracket, household_size, interest_segment, snapshot_ts

The fact has FKs to both: user_key (as-current) and demographics_key (as-at-event-time). Saves SCD2 storage when the volatile attributes change often and the stable ones don't.
Role-playing dimensions
One physical dimension, multiple roles in the fact. dim_date plays order_date, ship_date, delivered_date. Implement as views aliasing the base dim so query joins read naturally:
CREATE VIEW dim_order_date AS SELECT * FROM dim_date;
CREATE VIEW dim_ship_date AS SELECT * FROM dim_date;
CREATE VIEW dim_delivered_date AS SELECT * FROM dim_date;
SELECT
od.year, od.month_num,
sd.day_name AS ship_day,
dd.week_of_year AS delivered_week,
SUM(f.order_total_cents)
FROM fact_order_fulfillment f
JOIN dim_order_date od ON od.date_key = f.order_date_key
JOIN dim_ship_date sd ON sd.date_key = f.shipped_date_key
JOIN dim_delivered_date dd ON dd.date_key = f.delivered_date_key
GROUP BY od.year, od.month_num, sd.day_name, dd.week_of_year;

The physical storage is one table; the logical model has three.
Swappable / outrigger
When a dimension attribute has its own attributes worth rolling up: a small normalized "outrigger" joined to the main dim. dim_product.category_id → dim_category.name, parent_category. Violates pure star; sometimes necessary.
7. Slowly Changing Dimensions — All Seven Types, with Code
Type 0 — Retain original
Never update. Fields like date_of_birth, signup_date, birth_country. No code needed beyond rejecting updates.
Type 1 — Overwrite
Just update. No history. Use for typo corrections, GDPR erasure of PII.
-- Stage has latest values; merge overwrites
MERGE INTO dim_user AS tgt
USING stg_user AS src
ON tgt.user_id = src.user_id
WHEN MATCHED THEN UPDATE SET
email = src.email,
country_code = src.country_code,
updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (...) VALUES (...);

Type 2 — Add new row, track history
The standard. Full history. Every change creates a new row; old row closed with valid_to.
The MERGE pattern (dialect-independent)
Implemented as two statements because a single MERGE can't both update and insert the same logical record.
-- Step 1: close out rows whose tracked attributes have changed
UPDATE dim_user
SET valid_to = CURRENT_TIMESTAMP,
is_current = FALSE
WHERE is_current = TRUE
AND user_id IN (SELECT user_id FROM stg_user)
AND (
-- Use a hash_diff for concise comparison
hash_diff <> (
SELECT SHA2(CONCAT_WS('||',
COALESCE(s.email, ''),
COALESCE(s.country_code, ''),
COALESCE(s.subscription_tier, ''),
COALESCE(s.household_size::VARCHAR, ''),
COALESCE(s.profile_language, '')
), 256)
FROM stg_user s WHERE s.user_id = dim_user.user_id
)
);
-- Step 2: insert a row for every natural key with no current row
INSERT INTO dim_user (
user_key, user_id, email, country_code, subscription_tier,
household_size, profile_language, signup_date, churn_date,
valid_from, valid_to, is_current, hash_diff
)
SELECT
nextval('dim_user_seq'), -- surrogate
s.user_id,
s.email, s.country_code, s.subscription_tier,
s.household_size, s.profile_language,
s.signup_date, s.churn_date,
CURRENT_TIMESTAMP, NULL, TRUE,
SHA2(CONCAT_WS('||',
COALESCE(s.email, ''),
COALESCE(s.country_code, ''),
COALESCE(s.subscription_tier, ''),
COALESCE(s.household_size::VARCHAR, ''),
COALESCE(s.profile_language, '')
), 256)
FROM stg_user s
LEFT JOIN dim_user d
ON d.user_id = s.user_id AND d.is_current = TRUE
WHERE d.user_id IS NULL;Why hash_diff: comparing every column individually is verbose, error-prone, and slow. Hashing the concatenated tracked columns lets you compare two states with a single =. Use SHA-256 (not MD5; collision-free enough for dim-change detection).
Why two statements: MERGE's WHEN MATCHED runs once per matched row. You can't both close the old row and insert a new one in the same MERGE without duplication. Separate steps are clearer anyway.
PySpark / Delta Lake version
from delta.tables import DeltaTable
from pyspark.sql.functions import sha2, concat_ws, coalesce, lit, col, current_timestamp, monotonically_increasing_id
dim = DeltaTable.forName(spark, "warehouse.dim_user")
stg = spark.read.table("stg_user")
# Add hash_diff to the stage
stg_h = stg.withColumn(
"hash_diff",
sha2(concat_ws("||",
coalesce(col("email"), lit("")),
coalesce(col("country_code"), lit("")),
coalesce(col("subscription_tier"), lit("")),
coalesce(col("household_size").cast("string"), lit("")),
coalesce(col("profile_language"), lit(""))
), 256)
)
# Step 1: close rows whose hash changed
(dim.alias("t")
.merge(stg_h.alias("s"),
"t.user_id = s.user_id AND t.is_current = TRUE AND t.hash_diff <> s.hash_diff")
.whenMatchedUpdate(set={
"valid_to": "current_timestamp()",
"is_current": "FALSE"
})
.execute())
# Step 2: insert new versions
changed = (stg_h.alias("s")
.join(dim.toDF().alias("t"),
(col("s.user_id") == col("t.user_id")) & (col("t.is_current") == True),
"left_anti")) # keep s rows not present as current
changed_with_keys = changed.withColumn("user_key", monotonically_increasing_id() + <max_key>)
changed_with_keys.write.format("delta").mode("append").saveAsTable("warehouse.dim_user")

monotonically_increasing_id() per partition + offset is one pattern for surrogate generation. Deterministic hashing is safer across re-runs.
dbt snapshot (the declarative way)
-- snapshots/dim_user.sql
{% snapshot dim_user_snapshot %}
{{ config(
target_schema='snapshots',
unique_key='user_id',
strategy='check',
check_cols=['email', 'country_code', 'subscription_tier',
'household_size', 'profile_language']
) }}
SELECT user_id, email, country_code, subscription_tier,
household_size, profile_language, signup_date, churn_date
FROM {{ source('raw', 'users') }}
{% endsnapshot %}

dbt handles the entire Type 2 lifecycle — adds dbt_valid_from, dbt_valid_to, compares columns, inserts new versions. Production-grade with one file.
Type 3 — Add new column (previous + current)
Tracks exactly one prior value. Niche — useful only when business cares about "previous region" specifically.
ALTER TABLE dim_user
ADD COLUMN previous_country_code CHAR(2),
ADD COLUMN country_changed_at TIMESTAMPTZ;

On update: UPDATE ... SET previous_country_code = country_code, country_code = new, country_changed_at = now().
Type 4 — History in a side table
Current row in the main dim (Type 1 semantics for ease of query); history in a separate table.
-- Main dim (current-state only)
CREATE TABLE dim_user_current (
user_key BIGINT PRIMARY KEY,
user_id VARCHAR(36) UNIQUE,
email, country_code, ...
);
-- History (append-only, every change)
CREATE TABLE dim_user_history (
user_id VARCHAR(36),
email, country_code, ...,
valid_from TIMESTAMPTZ,
valid_to TIMESTAMPTZ,
change_reason VARCHAR(50)
);

Good when the main dim is hit constantly by dashboards and you want a narrow table. Point-in-time joins use the history table.
Type 6 — 1 + 2 + 3
A Type 2 dim where every historical row also carries the current value of select attributes. Lets you report "sales by the customer's current segment" even on old facts.
CREATE TABLE dim_user (
user_key BIGINT PRIMARY KEY,
user_id VARCHAR(36) NOT NULL,
-- Historical values (Type 2 — different per version)
email, country_code, subscription_tier,
-- Current values (Type 1 — same across all versions of same user_id)
current_email, current_country_code, current_subscription_tier,
-- History metadata
valid_from, valid_to, is_current,
hash_diff
);

The Type 1 columns are updated across all rows with the same user_id whenever the current state changes — an expensive broadcast update, but query-time simplicity worth it.
Query patterns:
- "Watch hours by historical segment": join on user_key, group by subscription_tier.
- "Watch hours by current segment": join on user_key, group by current_subscription_tier.
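The two query patterns, side by side on a toy Type 6 dimension (SQLite via Python's sqlite3; the data is invented):

```python
import sqlite3

# Toy Type 6 dimension: user u1 upgraded from 'basic' to 'premium'.
# Both versions carry current_subscription_tier = 'premium' after the
# Type 1 broadcast update described above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_user (
  user_key INTEGER PRIMARY KEY, user_id TEXT,
  subscription_tier TEXT,          -- Type 2: differs per version
  current_subscription_tier TEXT,  -- Type 1: same across versions
  is_current INTEGER);
INSERT INTO dim_user VALUES
  (1, 'u1', 'basic',   'premium', 0),
  (2, 'u1', 'premium', 'premium', 1);
CREATE TABLE fact_watch (user_key INTEGER, watch_hours REAL);
INSERT INTO fact_watch VALUES (1, 10.0), (2, 5.0);  -- 10h while basic, 5h after
""")

historical = dict(con.execute("""
  SELECT d.subscription_tier, SUM(f.watch_hours)
  FROM fact_watch f JOIN dim_user d ON d.user_key = f.user_key
  GROUP BY d.subscription_tier"""))
current = dict(con.execute("""
  SELECT d.current_subscription_tier, SUM(f.watch_hours)
  FROM fact_watch f JOIN dim_user d ON d.user_key = f.user_key
  GROUP BY d.current_subscription_tier"""))

assert historical == {"basic": 10.0, "premium": 5.0}  # tier at event time
assert current == {"premium": 15.0}                   # all hours under current tier
```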
Type 7 — Dual keys
The fact carries both the surrogate (user_key — version at event time) and the durable natural key (user_id). Consumers pick which to join on. Rare but powerful for complex reporting platforms.
ALTER TABLE fact_playback_session ADD COLUMN user_id VARCHAR(36); -- durable
-- Historical view
JOIN dim_user d ON d.user_key = f.user_key
-- Current view (resolve through natural key)
JOIN dim_user_current d ON d.user_id = f.user_id

Picking a type per attribute
Attributes within one dimension almost always use different types. The skill is choosing correctly per attribute. The expanded table below covers the attributes you'll see in most real dimensions, with the reasoning.
| Attribute | Type | Why |
|---|---|---|
| Customer / user dimension | | |
| date_of_birth | 0 | Immutable; a correction is rare and reported as data-entry fix |
| signup_date | 0 | Facts of origin; never changes |
| birth_country | 0 | Cannot change retroactively |
| email (typo-fix) | 1 | Corrections should replace silently, not create a history row |
| gdpr_erased_flag | 1 | Privacy operations overwrite by design |
| country_code_current | 2 | Relocations must not retro-attribute old orders |
| subscription_tier | 2 | Revenue attribution by tier depends on tier at event time |
| loyalty_level | 2 | Historical level drove the discount historically given |
| referred_by_user_id | 3 (prev + curr) | Short history matters; full history rarely queried |
| display_name | 6 | UI needs current name everywhere; analytics may need point-in-time |
| full_name (post-marriage) | 6 | Same — current in app, historical for compliance |
| last_churn_date | 1 + audit log | Most-recent value is the useful one; audit trail elsewhere |
| Product / SKU dimension | | |
| product_id (natural) | 0 | Primary identity, must never mutate |
| product_name_display | 1 | Rebrand should update cleanly; history of names is noise |
| category_hierarchy | 2 | Reorganizations break historical category rollups if overwritten |
| list_price | 2 | Revenue by price-tier depends on price-at-order-time |
| cost_of_goods | 2 | Margin math requires the historical COGS at order time |
| is_active flag | 2 | Window of availability matters for supply analyses |
| current_inventory | — (not a dim attr) | Move to a periodic-snapshot fact, not a dimension |
| Store / location dimension | | |
| store_open_date | 0 | Facts of origin |
| manager_name | 2 | Attribution by manager-of-record at event time |
| store_format (big-box vs mall) | 2 | Format reshapes affect historical comps |
| physical_address | 2 | Tax jurisdiction is historical |
| branding_name | 1 | Marketing rename; no analytical impact |
| square_footage | 2 | Renovations affect per-sqft metrics historically |
| Employee / HR dimension | | |
| hire_date | 0 | Origin fact |
| legal_name | 6 | Legal name history retained for compliance; display uses current |
| employment_status | 2 | Employed-at-event-time drives most HR analytics |
| reporting_manager | 2 | Org-chart snapshots matter historically |
| level_band / grade | 2 | Promotion-aware analyses need the level at event time |
| salary_band | 2 | Compensation history drives budget rollups |
| personal_email | 1 | Pure contact info update |
| Marketing campaign dimension | | |
| campaign_id | 0 | Identity |
| launch_date | 0 | Origin |
| budget_usd | 2 | Budget revisions attribute to the window they applied |
| internal_name | 1 | Naming hygiene only |
| channel | 2 | Channel reclassification affects attribution history |
| owning_team | 2 | Ownership transfers need historical accuracy for credit |
| Account / subscription dimension (B2B) | | |
| account_id | 0 | Identity |
| contract_start_date | 0 | Origin |
| plan_name | 2 | Historical plan drives historical entitlement |
| seats_purchased | 2 | Contract-size history is revenue math |
| renewal_date | 1 | Current renewal date is the useful one |
| owner_account_exec | 2 | Historical ownership for commission attribution |
| industry_classification | 2 | Reclassification should not rewrite historical industry rollups |
Document every per-attribute decision in the data catalog. A newcomer should be able to read the catalog and predict whether an update to country_code produces a new row in dim_user. If the answer is ambiguous, the catalog is incomplete.
The red-flag shapes
Three attribute categories warrant extra suspicion when choosing an SCD type:
- Anything regulatory. Legal name, address, tax ID, consent flags — treat as Type 2 minimum, plus a separate audit trail, regardless of analytical need. Compliance trumps modeling elegance.
- Anything that drives financial attribution. Tier, plan, price, territory assignment, account executive ownership — always Type 2. A future auditor will ask "who owned this account when the deal closed." Type 1 cannot answer that.
- Anything computed. Lifetime value, segment classification, engagement score — don't SCD these; move them to a periodic snapshot fact. Dimensions should hold identity and attributes; computed aggregates belong with the facts.
8. Date & Time Dimensions Done Right
A first-class date dimension makes fiscal/holiday/weekend/quarter queries trivial. Generate it once, populate a few decades of rows (the script below covers 2010–2040), and never touch it again.
Generating dim_date (Postgres)
INSERT INTO dim_date (
date_key, full_date, day_of_week, day_name,
day_of_month, day_of_year, week_of_year,
month_num, month_name, quarter, year,
fiscal_year, fiscal_quarter, fiscal_period,
is_weekend, is_holiday, holiday_name, is_business_day
)
SELECT
TO_CHAR(d, 'YYYYMMDD')::INT AS date_key,
d AS full_date,
EXTRACT(ISODOW FROM d) AS day_of_week,
TO_CHAR(d, 'FMDay') AS day_name,
EXTRACT(DAY FROM d) AS day_of_month,
EXTRACT(DOY FROM d) AS day_of_year,
EXTRACT(WEEK FROM d) AS week_of_year,
EXTRACT(MONTH FROM d) AS month_num,
TO_CHAR(d, 'FMMonth') AS month_name,
EXTRACT(QUARTER FROM d) AS quarter,
EXTRACT(YEAR FROM d) AS year,
-- Netflix fiscal year happens to be calendar; substitute your rules
EXTRACT(YEAR FROM d) AS fiscal_year,
EXTRACT(QUARTER FROM d) AS fiscal_quarter,
EXTRACT(MONTH FROM d) AS fiscal_period,
(EXTRACT(ISODOW FROM d) IN (6,7)) AS is_weekend,
FALSE AS is_holiday,
NULL AS holiday_name,
(EXTRACT(ISODOW FROM d) BETWEEN 1 AND 5) AS is_business_day
FROM generate_series(DATE '2010-01-01', DATE '2040-12-31', INTERVAL '1 day') d;
-- Holidays (join a separate table, mark is_holiday)
UPDATE dim_date d SET is_holiday = TRUE, holiday_name = h.name, is_business_day = FALSE
FROM holidays h WHERE h.observed_date = d.full_date;

Time-of-day dimension (separate)
Never combine date + time into one dimension — cardinality explodes (30 years of dates × 86,400 seconds per day ≈ 1 billion rows).
CREATE TABLE dim_time (
time_key SMALLINT PRIMARY KEY, -- 0..1439 (minute-of-day)
hour SMALLINT,
minute SMALLINT,
period VARCHAR(2), -- AM/PM
hour_12 SMALLINT,
time_of_day_segment VARCHAR(20), -- 'morning', 'evening'
is_peak_hours BOOLEAN
);
-- 1440 rows total.

Facts reference both: start_date_key + start_time_key, derived from start_ts.
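Populating dim_time is a one-time script; a minimal Python sketch (the segment boundaries and peak-hour window are assumptions, not from the DDL above):

```python
def build_dim_time():
    """Generate the 1,440-row minute-of-day dimension sketched above.
    Segment boundaries and peak hours are illustrative choices."""
    rows = []
    for minute_of_day in range(1440):
        hour, minute = divmod(minute_of_day, 60)
        segment = ("night" if hour < 6 else
                   "morning" if hour < 12 else
                   "afternoon" if hour < 18 else "evening")
        rows.append({
            "time_key": minute_of_day,
            "hour": hour,
            "minute": minute,
            "period": "AM" if hour < 12 else "PM",
            "hour_12": (hour % 12) or 12,
            "time_of_day_segment": segment,
            "is_peak_hours": 19 <= hour <= 22,  # assumed 7pm-11pm peak
        })
    return rows

dim_time = build_dim_time()
assert len(dim_time) == 1440
assert dim_time[0] == {"time_key": 0, "hour": 0, "minute": 0, "period": "AM",
                       "hour_12": 12, "time_of_day_segment": "night",
                       "is_peak_hours": False}
assert dim_time[1230]["hour"] == 20 and dim_time[1230]["is_peak_hours"]
```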
Timezone handling — the ironclad rule
Store everything in UTC. Convert for display at presentation.
- Source systems emit timestamps; normalize to UTC in the first transform (bronze → silver).
- Use TIMESTAMP WITH TIME ZONE (TIMESTAMPTZ) columns in Postgres/Redshift. Beware dialect differences: Snowflake's plain TIMESTAMP defaults to the naive TIMESTAMP_NTZ, while BigQuery's TIMESTAMP is UTC-anchored and its DATETIME is naive — pick one convention and wrap it consistently.
- For local-time analytics ("plays during 8pm local per user"), precompute the local date/time on ingest using the user's known timezone. This avoids runtime AT TIME ZONE conversion over millions of rows.
-- Compute user-local ts at ingest
SELECT
event_ts AT TIME ZONE 'UTC' AT TIME ZONE u.timezone AS local_ts,
(event_ts AT TIME ZONE 'UTC' AT TIME ZONE u.timezone)::DATE AS local_date
FROM raw_events e JOIN dim_user u ON u.user_id = e.user_id;

DST is the silent bug. Whenever you convert, be prepared for 2026-03-08 02:30 US/Eastern — a wall-clock time that doesn't exist. Use library support that surfaces the gap (pytz's localize(..., is_dst=None) raises; java.time resolves gaps by a documented rule) rather than ad-hoc offset math that silently shifts.
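With the stdlib zoneinfo (which, unlike pytz, does not raise on gap times), you can detect a nonexistent wall-clock time by round-tripping through UTC; exists_on_wall_clock is a hypothetical helper:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

def exists_on_wall_clock(dt):
    """Detect DST-gap times: round-trip through UTC and compare wall clocks.
    zoneinfo silently maps nonexistent local times onto a real instant, so
    the round trip comes back with a different wall-clock reading."""
    back = dt.astimezone(timezone.utc).astimezone(dt.tzinfo)
    return back.replace(tzinfo=None) == dt.replace(tzinfo=None)

eastern = ZoneInfo("America/New_York")

# 2026-03-08 02:30 falls inside the spring-forward gap: it never happened
assert not exists_on_wall_clock(datetime(2026, 3, 8, 2, 30, tzinfo=eastern))
# One hour later is a perfectly valid EDT time
assert exists_on_wall_clock(datetime(2026, 3, 8, 3, 30, tzinfo=eastern))
```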
9. Data Vault 2.0 — When and Why
Data Vault is a modeling methodology for enterprises with many source systems, strong auditability requirements, and the need to evolve quickly without reshaping downstream.
Core structures
Hub — unique list of business keys.
CREATE TABLE hub_customer (
customer_hk CHAR(64) PRIMARY KEY, -- hash of business key
customer_bk VARCHAR(100) NOT NULL, -- natural / business key
load_dts TIMESTAMPTZ NOT NULL,
record_source VARCHAR(100) NOT NULL
);

Link — relationships between hubs.
CREATE TABLE link_order (
order_hk CHAR(64) PRIMARY KEY, -- hash(customer_bk, product_bk, order_bk)
customer_hk CHAR(64) NOT NULL REFERENCES hub_customer,
product_hk CHAR(64) NOT NULL REFERENCES hub_product,
order_bk VARCHAR(100) NOT NULL,
load_dts TIMESTAMPTZ NOT NULL,
record_source VARCHAR(100) NOT NULL
);

Satellite — descriptive attributes, with full history built in.
CREATE TABLE sat_customer_details (
customer_hk CHAR(64) NOT NULL,
load_dts TIMESTAMPTZ NOT NULL,
load_end_dts TIMESTAMPTZ,
hash_diff CHAR(64) NOT NULL,
-- Attributes
email VARCHAR(320),
country_code CHAR(2),
subscription_tier VARCHAR(20),
record_source VARCHAR(100) NOT NULL,
PRIMARY KEY (customer_hk, load_dts)
);

Loading pattern
Highly parallel — each hub, link, and sat can be loaded independently as long as hubs are loaded before their sats. No dependencies between sources.
-- Load pattern (per source, per hub)
INSERT INTO hub_customer (customer_hk, customer_bk, load_dts, record_source)
SELECT DISTINCT
SHA2(s.customer_id, 256) AS customer_hk,
s.customer_id,
CURRENT_TIMESTAMP,
'billing_system'
FROM stage_billing s
LEFT JOIN hub_customer h ON h.customer_hk = SHA2(s.customer_id, 256)
WHERE h.customer_hk IS NULL;

Pros
- Source-system-agnostic: the vault survives upstream changes with zero refactor.
- Auditable: every fact has load_dts + record_source; lineage is built in.
- Parallel loading: no cross-source dependencies.
- Adding new sources is an additive operation.
Cons
- Unqueryable directly. You build a Kimball-style "information mart" on top for BI. Two layers = more moving parts.
- Many joins: hubs + sats + links to get a single business view.
- Overkill for small/single-source warehouses.
- Team needs training; the discipline is easy to violate.
When to use it
- 50+ source systems to integrate.
- Regulatory compliance (insurance, banking, healthcare) requires full audit trail.
- Large data engineering team that can maintain both vault + marts.
When NOT to use it: small warehouses (<10 sources), teams without DV expertise, or when business moves too fast for a two-layer architecture.
10. One Big Table & the Columnar Revolution
Columnar compression changes the math. With dictionary encoding, 100M rows of the string "premium" shrink to 100M 1-byte dictionary codes plus a single dictionary entry for "premium" — roughly 100 MB, versus ~700 MB for the raw 7-byte strings. With RLE on sorted data, runs of "premium" collapse further, to a few bytes per run.

This means denormalizing a dimension into a fact table adds almost no storage. The join-elimination benefit is real. Hence OBT.
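The arithmetic is worth making explicit; a back-of-envelope sketch with assumed encoding constants (1-byte codes, ~6 bytes per RLE run — illustrative, not any specific format's on-disk layout):

```python
def encoded_sizes(n_rows, value="premium", n_runs=1000):
    """Back-of-envelope sizes (bytes) for one low-cardinality string column.
    Assumes 1-byte dictionary codes and ~6 bytes per RLE (value, run-length)
    pair - illustrative constants only."""
    raw = n_rows * len(value)              # plain string storage
    dictionary = n_rows * 1 + len(value)   # one code per row + the dict entry
    rle = n_runs * 6                       # one entry per run of equal values
    return raw, dictionary, rle

raw, dictionary, rle = encoded_sizes(100_000_000)
assert raw == 700_000_000          # ~700 MB of raw 7-byte strings
assert dictionary == 100_000_007   # ~100 MB of 1-byte codes
assert rle == 6_000                # a few KB if the column is run-friendly
```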
What OBT looks like
-- A wide, denormalized table: 200 columns, billions of rows, no joins required
CREATE TABLE gold.playback_sessions_enriched (
-- Primary identifiers
session_id VARCHAR(64),
session_key BIGINT,
-- User attributes (denormalized from dim_user)
user_id VARCHAR(36),
user_country_code CHAR(2),
user_subscription_tier VARCHAR(20),
user_household_size SMALLINT,
user_profile_language VARCHAR(10),
user_signup_date DATE,
user_segment VARCHAR(50),
-- Title attributes
title_id VARCHAR(36),
title VARCHAR(500),
title_type VARCHAR(20),
title_genre_primary VARCHAR(50),
title_runtime_min INT,
title_is_original BOOLEAN,
-- Device attributes
device_type VARCHAR(20),
device_os VARCHAR(50),
-- Session measures
watch_ms BIGINT,
buffer_ms BIGINT,
seeks INT,
pauses INT,
max_bitrate_kbps INT,
completion_ratio NUMERIC(4,3),
-- Derived (not in star)
watched_in_peak_hours BOOLEAN,
was_binge_session BOOLEAN,
qoe_score NUMERIC(3,2),
-- Time
start_ts TIMESTAMPTZ,
end_ts TIMESTAMPTZ,
dt DATE
)
PARTITIONED BY (dt)
CLUSTER BY (user_country_code, title_genre_primary);

Consumer queries have zero joins:
SELECT
user_country_code, title_genre_primary,
SUM(watch_ms) / 3600e3 AS watch_hours
FROM gold.playback_sessions_enriched
WHERE dt BETWEEN '2026-04-12' AND '2026-04-18'
GROUP BY user_country_code, title_genre_primary;

Zone maps + clustering make this scan a tiny fraction of the table.
OBT trade-offs
Pros:
- One join-less table.
- Dashboards fly.
- Simple mental model for analysts.
- Compressed storage cost is often lower than the original star due to better dictionary encoding across a wider table.
Cons:
- SCD2 becomes painful. Do you carry every historical attribute on every fact row, or pin user attributes as-of event time and give up "current" reporting? Either way, make the choice explicitly.
- Backfilling dimension changes is expensive. If user_segment logic changes, every historical row must be rewritten.
- Data contracts grow. 200 columns × many downstream consumers = a lot of coordination on changes.
- Governance nightmare. Sensitive fields proliferate — you now have email on billions of fact rows; GDPR deletes touch all of them.
Hybrid: Kimball silver + OBT gold
The pragmatic pattern most serious teams end up with:
- Silver = conformed star schema (dims + facts), Kimball-style.
- Gold = one OBT per consumption pattern, materialized from silver.
Regenerate gold cheaply when dimensions change. Silver provides the single source of truth. BI reads gold for speed.
11. Medallion Architecture (Bronze/Silver/Gold)
The lakehouse convention. Each layer has a distinct purpose:
Bronze — raw, append-only, preserved
Never business-logic-applied. Schema-on-read preserved (or schema promoted but all fields retained). Idempotent ingest by source event ID.
# Ingest Kafka events to Bronze Iceberg
# `stream` is a streaming DataFrame, e.g. built via spark.readStream.format("kafka")
(stream
.writeStream
.format("iceberg")
.outputMode("append")
.option("path", "s3://lake/bronze/events/")
.option("checkpointLocation", "s3://lake/checkpoints/bronze_events/")
.partitionBy("ingest_date")
.trigger(processingTime="5 minutes")
.start())

Bronze is the "replayability" layer. If silver/gold ever get corrupted, you can always rebuild from bronze.
Silver — cleaned, conformed, typed, deduped
Business entities emerge. Dimensions and facts in Kimball style. SCD handling lives here.
Gold — consumption-ready, OBT per use case
One gold table per dashboard domain. Materialized from silver. Rebuilt cheaply.
The implicit contract
- Bronze is owned by the ingestion team. Contract: "we will deliver every source event exactly once, preserve schema, and never backfill by overwrite."
- Silver is owned by the platform team. Contract: "conformed dimensions, typed, documented, stable schema."
- Gold is owned by the consumer team. Contract: "you build it, you own it; break it quietly and it's on you."
Medallion + streaming
Works the same way. Bronze = append stream into Iceberg / Delta. Silver = streaming sessionization / dedup / enrichment, written as upsert into Iceberg. Gold = periodic or streaming rollup. Iceberg's row-level DELETE/MERGE (v2) makes this feasible at real latencies.
12. NULL Semantics — The Silent Source of Bugs
NULL in SQL is three-valued logic. NULL = NULL is NULL (not TRUE). x = NULL always yields NULL. This is the source of countless quiet bugs.
NULL in joins
a.x = b.x never matches when either side is NULL. If you want NULL to equal NULL, use:
- a.x IS NOT DISTINCT FROM b.x (Postgres, Spark)
- COALESCE(a.x, '') = COALESCE(b.x, '') (universal but ugly; only works if '' isn't a valid value)
- equal_null(a.x, b.x) (Snowflake)
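SQLite (whose IS operator is its spelling of null-safe equality) makes the join mismatch concrete:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (x TEXT);  CREATE TABLE b (x TEXT);
INSERT INTO a VALUES ('US'), (NULL);
INSERT INTO b VALUES ('US'), (NULL);
""")

# Plain equality: the NULL rows never match each other
plain = con.execute(
    "SELECT COUNT(*) FROM a JOIN b ON a.x = b.x").fetchone()[0]
# Null-safe equality (SQLite's IS behaves like IS NOT DISTINCT FROM)
nullsafe = con.execute(
    "SELECT COUNT(*) FROM a JOIN b ON a.x IS b.x").fetchone()[0]

assert plain == 1      # only 'US' = 'US'
assert nullsafe == 2   # 'US' = 'US' plus NULL-matches-NULL
```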
NULL in aggregations
- COUNT(*) counts all rows. COUNT(x) counts non-NULL x.
- SUM, AVG, MIN, MAX ignore NULLs.
- AVG(x) = SUM(x) / COUNT(x) — it divides by the non-NULL count. Surprising when you expect "nulls treated as zero."
- COUNT(DISTINCT x) counts distinct non-NULL values — a column with values {1, NULL, NULL} has COUNT DISTINCT 1.
NULL in filters
WHERE x <> 5 excludes rows where x is 5 AND rows where x is NULL. If you want nulls too: WHERE x IS NULL OR x <> 5. Forgetting this is a dominant source of silent data loss.
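Both the aggregation and filter traps fit in a few lines of SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (5,), (9,), (None,)])

cnt_star, cnt_x, avg_x = con.execute(
    "SELECT COUNT(*), COUNT(x), AVG(x) FROM t").fetchone()
assert cnt_star == 4   # all rows, including the NULL
assert cnt_x == 3      # non-NULL values only
assert avg_x == 5.0    # (1+5+9)/3, NOT (1+5+9)/4

# x <> 5 silently drops the NULL row as well as the 5
assert con.execute("SELECT COUNT(*) FROM t WHERE x <> 5").fetchone()[0] == 2
# Keeping NULLs requires asking for them explicitly
assert con.execute(
    "SELECT COUNT(*) FROM t WHERE x IS NULL OR x <> 5").fetchone()[0] == 3
```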
NULL ordering
- Postgres: NULLs sort LAST by default in ASC, FIRST in DESC. Override with NULLS FIRST / NULLS LAST.
- MySQL: NULLs sort FIRST in ASC.
- Be explicit. Always.
NULL in GROUP BY
NULLs form their own group. GROUP BY country returns one row with country = NULL for all rows missing country. Surface this explicitly in reports or filter before aggregation.
NULLs in window functions
LAG(x) returns NULL before the first row. Use LAG(x, 1, default_value) or COALESCE. IGNORE NULLS on navigation functions (Snowflake, BigQuery, and others — exact syntax placement varies by dialect) skips NULLs and returns the last non-NULL value — powerful for "last known state" patterns.
-- Last non-null country (for users with intermittent country updates)
SELECT user_id, event_ts,
LAST_VALUE(country IGNORE NULLS) OVER (
PARTITION BY user_id ORDER BY event_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS country_known
FROM events;

Rule: every NULL should be documented
Either NULL means "not yet known" (e.g., churn_date) or "not applicable" (e.g., return_reason on non-returned orders). Document which. Consider sentinel values (unknown codes instead of NULL) when the latter — they force explicit handling downstream.
-- Instead of NULL for "unknown country":
country_code CHAR(2) NOT NULL DEFAULT 'ZZ', -- ZZ = unknown
-- Reports naturally include "ZZ" as a group rather than silently dropping.

13. Data Contracts & Schema Evolution
A data contract is a guarantee between a producer and consumers about a dataset's schema, semantics, freshness, and quality. Without it, consumers build on sand.
What a contract specifies
# data_contracts/playback_sessions.yaml
id: bronze.playback_sessions
version: 3.1.0
owner: playback-platform@co
schema:
event_id: { type: string, required: true, unique: true }
session_id: { type: string, required: true }
user_id: { type: string, required: true, pii: true }
title_id: { type: string, required: true }
event_ts: { type: timestamp, required: true, timezone: UTC }
event_type: { type: string, required: true, enum: [start, heartbeat, end, error] }
watch_ms: { type: long, required: false, constraints: "value >= 0" }
sla:
freshness: "p99 < 15 minutes after event_ts"
completeness: ">= 99.5% of upstream client emissions"
uniqueness: "100% on event_id within 7 days"
tests:
- name: no_negative_watch_ms
sql: "SELECT COUNT(*) FROM {{ this }} WHERE watch_ms < 0"
threshold: 0
breaking_changes_require:
- major_version_bump
- 30_day_deprecation_notice
- consumer_acknowledgment_from: [analytics, ml-platform, product-insights]

Schema evolution rules
| Change | Compatibility | Action |
|---|---|---|
| Add optional column | Backward | Minor version |
| Add required column | Breaking | Major version, default-fill historical rows |
| Drop column | Breaking | Deprecate (write-null for N versions, then drop) |
| Rename column | Breaking | Add new, deprecate old, drop after transition |
| Widen type (INT → BIGINT) | Backward | Minor version |
| Narrow type | Breaking | Major version, data validation |
| Change semantics without renaming | Catastrophic | Never. Rename + deprecate. |
Enforcing contracts
Automatic checks in CI:
# contract_check.py
import yaml, pyarrow.parquet as pq
spec = yaml.safe_load(open("data_contracts/playback_sessions.yaml"))
sample_file = "data/sample.parquet"  # illustrative path: any recent file from the dataset
schema = pq.ParquetFile(sample_file).schema_arrow
observed = {f.name: str(f.type) for f in schema}
for field_name, field_spec in spec["schema"].items():
if field_spec["required"] and field_name not in observed:
raise ValueError(f"Required field {field_name} missing")
# Type, enum, constraint checks...

Pair with runtime data quality tests (Great Expectations, Soda, dbt tests) that run on each pipeline execution and block promotion on failure.
Lakehouse-native schema evolution
Iceberg and Delta both support safe evolution:
- Iceberg tracks columns by unique ID, not name. Rename is metadata-only, no rewrite.
- Add column: metadata-only; existing rows treated as NULL.
- Drop column: metadata-only (writes stop emitting it; old files untouched).
- Reorder: metadata-only.
- Type promotion (INT → LONG, FLOAT → DOUBLE): metadata-only, allowed.
- Type narrowing: not allowed.
-- Iceberg
ALTER TABLE silver.playback_sessions ADD COLUMN engagement_score DOUBLE;
ALTER TABLE silver.playback_sessions RENAME COLUMN qoe_score TO quality_score;

14. Modeling Checklist & Anti-Patterns
Before shipping a model:
- Grain declared and documented? If two engineers could disagree about what a row represents, the model is broken.
- Are all keys surrogate? Natural keys are attributes, not PKs on dims.
- Is every measure's additivity documented? Labels in the data catalog; column naming conventions (_point_in_time, _ratio).
- Is history handled deliberately? SCD type per attribute, not per dim.
- Are dimensions conformed across facts? Single dim_customer, multiple FKs.
- Is the date dimension a real table? Not a function call.
- Are slowly-changing fact attributes modeled as mini-dimensions? Not VARCHAR columns.
- Are all NULLs meaningful and documented? Sentinel values where applicable.
- Are timestamps UTC with explicit timezone storage? Local times precomputed for user-local analytics.
- Does the table have a partition/cluster strategy that prunes typical queries? Default: partition by date, cluster by most-filtered high-cardinality column.
- Are contracts published and enforced? Schema + SLA + tests in version control, blocked by CI.
- Can the table be rebuilt from bronze in a reasonable window? If not, you have single-source fragility.
Anti-patterns
- Mixed-grain fact tables. Invisible bug, impossible to audit.
- Natural keys as PKs. Breaks the first time source-system renumbering happens.
- FLOAT for money. Accumulation errors. Use BIGINT cents or DECIMAL(18,2).
- No date dimension. Every report ends up with CASE WHEN EXTRACT(MONTH ...) scattered everywhere.
- TIMESTAMP without timezone. Analytics broken after the next DST change.
- "Latest" table that's actually mutable. A view over SCD2 with WHERE is_current is safe; an actual table that gets UPDATE'd is a mutation-ordering bug farm.
- Soft-delete without a partial index. WHERE deleted = FALSE becomes a full scan on billions of rows.
- EAV (entity-attribute-value) inside a warehouse. (entity_id, attr_name, attr_value) looks flexible, makes every query a pivot, kills indexability.
- One column per language (name_en, name_fr, name_de, ...). Unbounded schema drift. Normalize to a translations child table.
- Copying source tables verbatim into the warehouse as "the model." Source schema serves source transactions; warehouse schema should serve analytics.
- Logic embedded in views with no materialization. When the dashboard query gets slow, caching becomes a retrofit.
- Dimensions that aren't dimensions. If a column has 8 distinct values, it's a mini-dim candidate, not a VARCHAR fact column.
Next: 02-batch-processing.md — the theory and mechanics of bounded-data processing, idempotency, MERGE patterns, file formats, and partition design.
15. Surrogate Key Strategies in Depth
A surrogate key is an engineered stand-in for a natural key. Choosing the wrong surrogate strategy is a recurring source of production bugs, late-arriving data hell, and expensive re-keying projects two years in.
The three strategies
Sequence (identity column). Monotonic integer issued by the database. Tiny, fast, great for indexing. Fatal weakness: you cannot generate it independently in two systems. If you load facts in Spark and dimensions in Snowflake, you need a round-trip to assign keys. That round-trip is slow, fragile, and forbids truly distributed pipelines.
Hash (deterministic on business key). Apply a collision-resistant hash (SHA-256, xxhash) to the natural key tuple and store the hash. Any system can produce the same key without coordination. Works beautifully for Data Vault and lakehouse pipelines. Trade-off: keys are larger (16–32 bytes) and not human-readable. Birthday-paradox collisions are theoretically possible but practically never occur at data-warehouse scale with 128+ bit hashes.
UUID (random). Nice for unordered distributed generation but terrible for range scans and clustering. Never use random UUIDs as a primary key in a columnar warehouse — they destroy compression and min/max pruning. UUIDv7 (time-sortable) is acceptable if you really want UUID semantics.
Decision table
| Requirement | Sequence | Hash | UUID |
|---|---|---|---|
| Cross-system generation | No | Yes | Yes |
| Clustered storage efficiency | Best | Good | Worst |
| Human-readable | Yes | No | No |
| Idempotent re-run | No (re-assigns) | Yes | No |
| Size | 4–8 bytes | 16–32 bytes | 16 bytes |
Hashed surrogate — worked example
-- Deterministic surrogate from the natural-key tuple; use a 256-bit hash
-- (Spark/Snowflake: SHA2(x, 256); BigQuery: TO_HEX(SHA256(x)))
SELECT
SHA2(CONCAT_WS('|',
LOWER(TRIM(source_system)),
CAST(source_id AS STRING),
CAST(effective_date AS STRING)
), 256) AS customer_sk,
source_system,
source_id,
effective_date,
...
FROM staging_customer;
Key properties: lowercase + trim on text columns (defensive against inconsistent casing), an explicit delimiter that cannot appear in the payload, and inclusion of the temporal component for SCD Type 2. The same row re-ingested produces the same key — idempotency for free.
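The same pattern in plain Python — handy when a Spark job and a warehouse load must mint identical keys independently (surrogate_key is an illustrative helper):

```python
import hashlib

def surrogate_key(source_system, source_id, effective_date):
    """Deterministic surrogate: normalize text, join with a delimiter the
    payload cannot contain, hash. Any system computing this independently
    gets the same key - no sequence round-trip needed."""
    parts = (source_system.strip().lower(), str(source_id), str(effective_date))
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

# Idempotent: re-ingesting the same row yields the same key, and
# casing/whitespace drift in the source doesn't fork identities
k1 = surrogate_key("Billing ", 42, "2026-01-01")
k2 = surrogate_key("billing", "42", "2026-01-01")
assert k1 == k2

# Different effective_date (a new SCD2 version) -> different key
assert k1 != surrogate_key("billing", 42, "2026-02-01")
```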
16. Bridge Tables and Many-to-Many Dimensions
Not every dimension attaches to a fact with a clean foreign key. A policy can have multiple beneficiaries. An insurance claim can have multiple diagnosis codes. A marketing touch can belong to multiple campaigns. These require bridge tables — and bridge tables are where naive data modelers get caught.
The problem with denormalizing
Tempting shortcut: flatten the many-to-many into a wide fact with diagnosis_code_1, diagnosis_code_2, diagnosis_code_3, …. This works until the day you get a claim with 12 codes. Now you either drop data, create a 20-column wasteland, or reshape the fact — which breaks every downstream consumer. Avoid this pattern.
The bridge table pattern
Three tables: the fact (fct_claim), the dimension (dim_diagnosis), and the bridge (bridge_claim_diagnosis). The bridge stores one row per (claim_sk, diagnosis_sk) with an optional weight column.
CREATE TABLE bridge_claim_diagnosis (
claim_sk BIGINT NOT NULL,
diagnosis_sk BIGINT NOT NULL,
is_primary BOOLEAN NOT NULL,
weight DECIMAL(6,4), -- optional: allocate claim value across codes
PRIMARY KEY (claim_sk, diagnosis_sk)
);
The weight column is optional in the schema but not in practice: without it, you cannot aggregate claim dollars by diagnosis without double-counting. Ten codes on a claim would cause the same dollars to be attributed ten times when joined naively. Interviewers probe for this.
Aggregation safely across a bridge
-- Correct: weighted allocation
SELECT
d.diagnosis_category,
SUM(f.claim_amount * b.weight) AS allocated_amount
FROM fct_claim f
JOIN bridge_claim_diagnosis b ON b.claim_sk = f.claim_sk
JOIN dim_diagnosis d ON d.diagnosis_sk = b.diagnosis_sk
GROUP BY d.diagnosis_category;
-- Wrong: naive join double-counts
SELECT d.diagnosis_category, SUM(f.claim_amount) FROM ... -- DO NOT
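The double-counting failure is easy to reproduce in plain Python with toy rows (the names are illustrative): the naive join attributes the full claim amount once per bridge row, while the weighted join conserves the total.

```python
# One claim worth $1,000 carrying three diagnosis codes.
claim = {"claim_sk": 1, "claim_amount": 1000.0}
bridge = [
    {"claim_sk": 1, "diagnosis_sk": 10, "weight": 0.5},
    {"claim_sk": 1, "diagnosis_sk": 11, "weight": 0.3},
    {"claim_sk": 1, "diagnosis_sk": 12, "weight": 0.2},
]

# Naive join: every bridge row drags in the full amount -> 3x the real dollars.
naive_total = sum(claim["claim_amount"] for b in bridge if b["claim_sk"] == claim["claim_sk"])

# Weighted join: dollars are allocated, so the grand total is conserved.
weighted_total = sum(claim["claim_amount"] * b["weight"] for b in bridge)

assert naive_total == 3000.0              # triple-counted
assert abs(weighted_total - 1000.0) < 1e-6
```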
16. Late-Arriving Facts and Dimensions
"Late-arriving" data is the term of art for rows whose business timestamp is in the past but whose arrival timestamp is now. Two distinct cases, each with its own playbook.
Late-arriving facts
A mobile app emits an event on Tuesday but the SDK holds it in local cache due to offline mode, and uploads it Thursday. The fact has a business date of Tuesday. You must decide: does Tuesday's daily aggregate get corrected retroactively, or is the late record attributed to Thursday?
The senior answer is: both tables exist. A "by business date" partition table recomputes Tuesday's totals when the late row arrives. A "by arrival date" partition table leaves Tuesday alone and attributes the row to Thursday. Each serves different consumers — finance cares about business date, operations cares about arrival date — and conflating them is a classic bug.
Late-arriving dimensions
A fact arrives referencing a new customer_id that doesn't exist in dim_customer yet. Three strategies:
- Inferred member. Insert a placeholder row into dim_customer with the known natural key and NULL attributes. The fact joins successfully. When the full dim row arrives later, update in place (SCD Type 1) or issue a new version (SCD Type 2). Best default.
- Quarantine. Hold the fact in a quarantine table until the dim row appears. Replay when it does. Use when joining with missing context would be meaningless.
- Orphan facts. Let the fact land with a NULL dim key. Never recommended — downstream SQL silently drops rows.
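The inferred-member default can be sketched in plain Python, with dicts standing in for dim_customer; the attribute names are illustrative. The fact load inserts a placeholder so the join never drops rows, and the real dim row later overwrites it Type 1-style.

```python
dim_customer = {}  # natural_key -> attribute dict

def ensure_dim_row(customer_id: str) -> None:
    """Insert an inferred member if the fact references an unknown customer."""
    if customer_id not in dim_customer:
        dim_customer[customer_id] = {"name": None, "segment": None, "inferred": True}

def load_fact(fact: dict) -> dict:
    ensure_dim_row(fact["customer_id"])  # the join can no longer orphan the fact
    return {**fact, "customer_attrs": dim_customer[fact["customer_id"]]}

def upsert_dim(customer_id: str, attrs: dict) -> None:
    """Full dim row arrives later: overwrite the placeholder (SCD Type 1)."""
    dim_customer[customer_id] = {**attrs, "inferred": False}

f = load_fact({"customer_id": "c-9", "amount": 25.0})
assert dim_customer["c-9"]["inferred"] is True   # placeholder created
upsert_dim("c-9", {"name": "Ada", "segment": "premium"})
assert dim_customer["c-9"]["inferred"] is False
```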
17. Accumulating Snapshot Fact Tables
Most facts are transactional (one row per event) or periodic (one row per entity per period). The third, less-known kind is the accumulating snapshot — one row per business process instance, updated in place as the process progresses. The canonical example is an order lifecycle.
Structure
One row per order. Columns for every milestone: placed_ts, paid_ts, shipped_ts, delivered_ts, returned_ts. Plus computed lag columns: days_to_ship, days_to_deliver. The row is created at order placement and UPDATEd at every milestone. NULL columns mean "hasn't happened yet."
CREATE TABLE fct_order_lifecycle (
order_sk BIGINT NOT NULL,
customer_sk BIGINT NOT NULL,
placed_ts TIMESTAMP NOT NULL,
paid_ts TIMESTAMP,
shipped_ts TIMESTAMP,
delivered_ts TIMESTAMP,
returned_ts TIMESTAMP,
days_to_pay INT,
days_to_ship INT,
days_to_deliver INT,
order_amount DECIMAL(10,2) NOT NULL,
current_status VARCHAR(20) NOT NULL
);
Why this fact type exists
Without it, "how long does it take to ship a paid order" requires a self-join on a transactional fact across millions of rows. With accumulating snapshot, it's AVG(days_to_ship). The cost is maintenance complexity: every milestone update has to find and update the right row, ideally via a MERGE on the natural key.
Lakehouse implementation
On Iceberg or Delta, accumulating snapshots are implemented via MERGE INTO. The row is inserted on the first event and updated on every subsequent event. Use a current_status column to track where in the lifecycle the row currently sits — this makes recovery after a failed merge straightforward.
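The insert-then-update flow can be sketched in plain Python, with a dict standing in for the MERGE target; the column names follow the DDL above, and the event shape is an assumption for illustration.

```python
from datetime import datetime

orders: dict[int, dict] = {}  # order_sk -> lifecycle row

MILESTONES = {"placed": "placed_ts", "paid": "paid_ts", "shipped": "shipped_ts",
              "delivered": "delivered_ts", "returned": "returned_ts"}

def apply_event(order_sk: int, milestone: str, ts: datetime) -> None:
    """MERGE-like upsert: insert the row on the first event, update in place afterwards."""
    row = orders.setdefault(order_sk, {col: None for col in MILESTONES.values()})
    row[MILESTONES[milestone]] = ts
    row["current_status"] = milestone
    # Recompute lag columns from whatever milestones exist so far.
    if row["paid_ts"] and row["shipped_ts"]:
        row["days_to_ship"] = (row["shipped_ts"] - row["paid_ts"]).days

apply_event(1, "placed", datetime(2026, 4, 1))
apply_event(1, "paid", datetime(2026, 4, 2))
apply_event(1, "shipped", datetime(2026, 4, 5))
assert orders[1]["days_to_ship"] == 3
assert orders[1]["delivered_ts"] is None  # "hasn't happened yet"
```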
Batch Processing
Batch processing is the workhorse of the data warehouse. It's also where the subtle bugs live — the ones that take three weeks to notice and two days to find. This file goes into the theory of bounded-data processing, the mathematics of idempotency, the anatomy of MERGE, the internals of Parquet/ORC, partition design with real math, and how to build pipelines that survive backfills, reruns, and bad data.
1. Mental Model: Bounded vs Unbounded Data (Deep)
A batch job processes bounded data — a finite set of rows with a defined beginning and end. You can scan it, sort it globally, compute the total, and wait for everything before you emit. These are luxuries a streaming job doesn't have.
But "bounded" is a simplification. Most production batch jobs operate on windows of an unbounded stream: "process yesterday's events" is a bounded view of the event firehose. The sealed-ness of that window depends entirely on how late data can arrive.
The three dimensions of boundedness
- Temporal boundedness — is there a cutoff time after which no more data for the window can arrive?
- Cardinality boundedness — is the count of rows bounded? (It always is in retrospect, but the job needs to know when it's seen them all.)
- Completeness boundedness — can you prove all upstream producers have emitted?
Real-world jobs rarely have all three cleanly:
- A daily job on mobile app events: temporally bounded (sealing at midnight + safety window), cardinality uncertain (user activity varies), completeness probabilistic (some clients reconnect days later).
- An S3 daily drop: temporally bounded + cardinality bounded once the drop completes, with a completeness signal from a `_SUCCESS` file or a manifest.
- A CDC stream dump for the prior day: cardinality bounded by rows in the source table; temporal boundedness depends on the CDC tool's lag.
Implications for pipeline design
A truly bounded source (a sealed S3 drop) admits:
- Strict correctness checks: row counts, checksums, column distributions against expected ranges.
- Full refresh fallback: if anything looks off, reprocess from source.
A temporally-bounded-but-late-arriving source admits:
- Safety windows: process `event_time ≥ T - 7 days` even on the "today" job, so late arrivals are included.
- Periodic reconciliation: a separate weekly job re-processes the last N days to catch late data.
- Grace periods: "seal" day D only after 48h of additional quiet.
Watermark-as-promise, applied to batch
Even batch has watermarks — implicit ones. "Process yesterday" with a 1-hour safety buffer means: I trust that by now() - 1 hour, no events with event_time in [yesterday_start, yesterday_end] will still be arriving. Make this explicit:
from datetime import date, datetime, time, timedelta

def run_daily(dt: date, safety_hours: int = 2):
    """Process dt's events; assume all events for dt have arrived by dt+1day+safety_hours."""
    assert datetime.utcnow() >= (datetime.combine(dt + timedelta(days=1), time.min)
                                 + timedelta(hours=safety_hours)), \
        f"Not safe to process {dt} yet"
    ...

The assert prevents silent correctness loss when backfilling to "today."
Bounded storage, unbounded log
Even a bounded table has an unbounded "log" of changes — the sequence of insertions, updates, deletions over time. Modern lakehouse formats (Iceberg, Delta) expose this log: you can read the table as-of any past snapshot (time travel) or stream the log of changes (Change Data Feed in Delta, incremental reads in Iceberg).
This lets batch jobs consume batch sources as if they were streams:
# Iceberg incremental read — only snapshots since the last run
df = (spark.read
.format("iceberg")
.option("start-snapshot-id", last_processed_snapshot)
.option("end-snapshot-id", current_snapshot)
.load("warehouse.bronze.events"))

The line between batch and streaming blurs. Modern advice: unify the code path; only the trigger differs.
2. Anatomy of a Batch Job
A production batch job has these stages, in order. Each can fail; each must be recoverable.
- Parameter binding — job receives `logical_date` (the date being processed), not `current_date`. This is the single most important contract; without it, the job is not idempotent.
- Source existence check — upstream data exists and is complete. Fail fast if missing (don't produce an empty output that looks successful).
- Source read — pull from bronze / staging / raw.
- Deduplication — producers retry, sources double-emit.
- Conforming — type casting, UTC normalization, null handling, enum validation.
- Enrichment — joins to dimensions to hydrate descriptors.
- Transformation — the actual business logic.
- Validation — row counts, distribution checks, uniqueness assertions. This is the airbag.
- Write — atomic commit to the target.
- Publish — signal to downstream consumers (dataset events, XComs, manifest writes).
- Audit log — row counts in/out, source snapshot IDs, emission latency, job version.
Minimal Python/Spark skeleton
from dataclasses import dataclass
from datetime import date, datetime, timedelta
import logging
log = logging.getLogger(__name__)
@dataclass
class RunContext:
logical_date: date
job_version: str
run_id: str
spark: "SparkSession"
def run(ctx: RunContext) -> None:
log.info("job_start", extra={"extra": ctx.__dict__})
source_exists(ctx) # fail fast if upstream missing
raw = read_source(ctx) # (1)
deduped = dedupe(raw) # (2)
conformed = conform_types(deduped) # (3)
enriched = enrich_dimensions(ctx, conformed) # (4)
output = transform(enriched) # (5)
audit = validate(output, ctx) # (6) — raises on hard-fail
if audit.soft_warnings:
notify(audit.soft_warnings)
write_atomic(output, ctx) # (7)
publish_dataset_event(ctx, audit) # (8)
log.info("job_done", extra={"extra": {"rows": audit.row_count}})

Each helper should be independently unit-testable; the orchestrator is thin.
The validation gate
This is the most important step and the most often skipped.
from pyspark.sql import functions as F

@dataclass
class Audit:
row_count: int
unique_rate: float
null_rates: dict[str, float]
sum_checks: dict[str, int]
soft_warnings: list[str]
def validate(df, ctx) -> Audit:
# Hard assertions (raise)
row_count = df.count()
if row_count == 0:
raise AssertionError("Empty output — upstream issue or bug")
if row_count < ctx.expected_min_rows:
raise AssertionError(f"Row count {row_count} below min {ctx.expected_min_rows}")
dup_rate = 1 - df.select("event_id").distinct().count() / row_count
if dup_rate > 0.001:
raise AssertionError(f"Uniqueness violated: {dup_rate:.4%} duplicates")
# Soft warnings (log, notify)
soft = []
null_rates = {}
for col_name in ["user_id", "title_id", "watch_ms"]:
nr = df.filter(F.col(col_name).isNull()).count() / row_count
null_rates[col_name] = nr
if nr > 0.005: # 0.5% threshold
soft.append(f"High null rate in {col_name}: {nr:.2%}")
return Audit(row_count, 1 - dup_rate, null_rates, {}, soft)

Store audit records in a permanent table — historical audit is the tool for diagnosing "what changed overnight."
3. Idempotency — Proofs and Patterns
Definition: running the job twice with the same parameters produces the same output as running it once.
Formally: if J_p is the job with parameters p, viewed as a function from warehouse state to warehouse state, then J_p(J_p(S_0)) = J_p(S_0) for any starting state S_0. The second run is a no-op relative to the first.
Why it matters
- Reruns on failure: if your job crashes mid-way, you need to retry without producing duplicates or partial writes.
- Backfills: reprocessing a historical date must overwrite cleanly, not accumulate.
- Human error: "I accidentally ran the job twice" shouldn't cause an incident.
- Distributed retries: orchestrators retry automatically; idempotency prevents them causing damage.
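The definition is easy to see with two toy write modes in plain Python, where a dict of partitions stands in for the table: partition overwrite satisfies f(f(S)) = f(S); blind append does not.

```python
def overwrite_partition(state: dict, dt: str, rows: list) -> dict:
    """Idempotent: the partition's contents depend only on the inputs."""
    return {**state, dt: list(rows)}

def append_partition(state: dict, dt: str, rows: list) -> dict:
    """Not idempotent: every run grows the partition."""
    return {**state, dt: state.get(dt, []) + list(rows)}

s0 = {}
rows = [{"id": 1}, {"id": 2}]

once = overwrite_partition(s0, "2026-04-19", rows)
twice = overwrite_partition(once, "2026-04-19", rows)
assert once == twice                      # f(f(S)) == f(S)

once_a = append_partition(s0, "2026-04-19", rows)
twice_a = append_partition(once_a, "2026-04-19", rows)
assert len(twice_a["2026-04-19"]) == 4    # rerun doubled the data
```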
Pattern 1 — Partition overwrite
For append-only outputs partitioned by the job parameter (usually date):
(df.writeTo("warehouse.silver.playback_sessions")
    .overwrite(F.col("dt") == F.lit(str(ctx.logical_date))))

The overwrite condition scopes the write to that one partition (Spark's DataFrameWriterV2 API; on Delta the v1 writer's `.option("replaceWhere", ...)` achieves the same). Rerun = same partition overwritten with same data = no-op. Iceberg's atomic commit guarantees no reader sees a half-written state.
Safety check: verify df only contains rows for that partition.
assert df.filter(F.col("dt") != ctx.logical_date).count() == 0, \
"Output contains rows outside target partition — partitioning bug"

Pattern 2 — MERGE by key
For outputs with compound keys and potential updates:
MERGE INTO silver.dim_user AS t
USING stg_user AS s
ON t.user_id = s.user_id
WHEN MATCHED AND t.updated_at < s.updated_at THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...) VALUES (...);

Idempotent because a second run against the same stg_user finds no rows where t.updated_at < s.updated_at (they're equal). Requires a proper version column (updated_at); otherwise you might silently overwrite a newer value with an older one.
Pattern 3 — Hash + upsert
For CDC-like feeds:
MERGE INTO silver.events AS t
USING (
SELECT *, SHA2(CONCAT_WS('||', event_id, event_ts, payload), 256) AS row_hash
FROM stg_events
) AS s
ON t.event_id = s.event_id
WHEN MATCHED AND t.row_hash <> s.row_hash THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...) VALUES (...);

Second run: same hashes match, no updates. Idempotent.
Anti-pattern: INSERT ... SELECT without dedup
-- BAD: second run doubles the data
INSERT INTO silver.events SELECT * FROM stg_events;

This is the single most common idempotency violation. Fix:
- Replace with `MERGE` or partition-overwrite.
- Or: truncate the target partition first (but now two statements must be atomic — use Iceberg/Delta semantics).
Anti-pattern: now() in transform
-- BAD: second run produces different output because now() changes
SELECT *, NOW() AS processed_at FROM stg_events;

Fix: use ctx.logical_date or a per-run constant. If you need a "processed_at" column, set it once at the start of the run and pass it in.
Anti-pattern: monotonically increasing keys that differ across runs
# BAD: Spark's monotonically_increasing_id() isn't stable across runs
df.withColumn("key", F.monotonically_increasing_id())

Fix: deterministic hash of natural keys.
df.withColumn("key", F.sha2(F.concat_ws("||", *natural_key_cols), 256))

Testing idempotency
Write a test that runs the job twice and diffs the output:
def test_idempotent_run(tmp_table, test_data):
run(RunContext(logical_date=date(2026,4,10), ...))
snapshot1 = spark.table(tmp_table).collect()
run(RunContext(logical_date=date(2026,4,10), ...))
snapshot2 = spark.table(tmp_table).collect()
assert snapshot1 == snapshot2, "Output differs between runs — not idempotent"

This catches now() leaks, non-deterministic ordering, and accidental row duplication.
4. MERGE Under the Hood
MERGE (aka UPSERT) looks simple; it's anything but. Understanding its execution is essential to reasoning about performance and correctness.
The logical model
MERGE INTO target t
USING source s
ON t.key = s.key
WHEN MATCHED AND <cond> THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
WHEN MATCHED AND <cond2> THEN DELETE
WHEN NOT MATCHED BY SOURCE THEN DELETE; -- supported by some engines

Conceptually: for every (t, s) pair where t.key = s.key, apply the first matching WHEN MATCHED clause. For every s with no match in t, apply WHEN NOT MATCHED. For every t with no match in s (when WHEN NOT MATCHED BY SOURCE is present), apply that clause.
The multi-match rule
If a single target row matches multiple source rows, behavior is undefined and most engines error:
ERROR: Cannot perform MERGE as multiple source rows matched and attempted to modify
the same target row.

Fix: deduplicate the source to a single row per join key before the MERGE.
MERGE INTO target t
USING (
SELECT DISTINCT ON (user_id) * -- Postgres
FROM stg_user
ORDER BY user_id, updated_at DESC
) s
ON t.user_id = s.user_id
...

Or using window functions (portable):
USING (
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) rn
FROM stg_user
) WHERE rn = 1
) s

Execution on lakehouse (Iceberg/Delta)
MERGE on a lakehouse is not an in-place update. Files are immutable. The steps:
- Scan target to find files touched by matched rows.
- Scan source and join against target on key.
- For each target file touched, read it fully, apply the merge, write a new file with updated/inserted rows.
- Commit: swap the manifest to reference new files instead of old.
- Old files remain (for time travel) until vacuum.
Implications:
- Write amplification: updating 1% of rows in a 10 GB file still rewrites the whole 10 GB file. This is why small, well-sized files matter — bigger files = more write amplification per update.
- Merge-on-read vs copy-on-write: Iceberg v2 and Delta support "merge-on-read" mode where deletes/updates are stored as separate delete files (positional or equality deletes), merged at read time. Cheaper writes, more expensive reads until compaction.
- Compaction is essential. Without periodic `OPTIMIZE` / `rewrite_data_files`, merge-on-read tables slow down over time.
Copy-on-write vs merge-on-read
Copy-on-write (default in Iceberg v1, Delta default):
Update → rewrite touched files entirely.
Pros: fast reads, no merge at query time.
Cons: slow writes, high write amplification.
Use: update-light workloads (SCD2, daily overwrites).
Merge-on-read (Iceberg v2, Delta with deletion vectors):
Update → write delete file + new rows file.
Pros: fast writes.
Cons: reads must apply delete vectors; slower until compacted.
Use: update-heavy workloads (CDC, streaming upserts).
Configure per table based on expected update pattern.
Partition-aware MERGE
If the target is partitioned and the merge key lets the planner prune partitions, only those are touched. Make sure the partition column appears in the ON clause:
-- Good: planner prunes partitions where dt doesn't appear in source
MERGE INTO target t
USING source s
ON t.user_id = s.user_id AND t.dt = s.dt -- includes partition col
...

Without the partition predicate in the ON clause, the engine scans all partitions.
Spark / Delta MERGE performance
from delta.tables import DeltaTable
# Enable deletion vectors (merge-on-read) for this table
spark.sql("""
ALTER TABLE silver.events SET TBLPROPERTIES (
'delta.enableDeletionVectors' = 'true'
)
""")
t = DeltaTable.forName(spark, "silver.events")
(t.alias("t")
.merge(
source=stg.alias("s"),
condition="t.event_id = s.event_id AND t.dt = s.dt"
)
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute())

Tips:
- Narrow the MERGE with a partition predicate if possible.
- Use `spark.databricks.delta.merge.repartitionBeforeWrite.enabled = true` to redistribute writes for better parallelism.
- Monitor `numOutputRows` vs `numTargetRowsUpdated` — if you're rewriting 100% for a 1% update, reconsider file sizing.
5. Incremental Processing Patterns
Full refresh is simple but expensive above ~100 GB. Incremental processing is cheap but introduces correctness traps.
Pattern 1 — High watermark incremental
Track the max processed timestamp per source; next run pulls everything since.
def get_high_watermark() -> datetime:
return spark.sql("""
SELECT MAX(event_ts) FROM silver.events
""").collect()[0][0] or datetime.min
def run_incremental():
hwm = get_high_watermark()
new_rows = spark.read.table("bronze.events") \
.filter(F.col("event_ts") > hwm)
# process...
new_rows.write.mode("append").saveAsTable("silver.events")

Correctness traps:
- Late-arriving data with `event_ts < hwm` never gets processed. Fix: widen the window (process `event_ts > hwm - safety_window`, then dedupe).
- Clock skew across producers can push the HWM forward too fast: one future-dated event_ts skips everything that arrives behind it. Fix: cap the HWM at the current time, or derive it from an ingestion timestamp the pipeline assigns rather than trusting producer clocks.
- A failed run that half-updated the target leaves the HWM in a wrong state. Fix: watermark is derived from the committed target, not stored separately.
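The widen-and-dedupe fix from the first trap, sketched in plain Python; the row shape, safety window, and tracking of already-landed event IDs are illustrative.

```python
from datetime import datetime, timedelta

def incremental_pull(source, target_ids, hwm, safety=timedelta(hours=24)):
    """Pull everything after (hwm - safety window), then drop rows already landed."""
    candidates = [r for r in source if r["event_ts"] > hwm - safety]
    return [r for r in candidates if r["event_id"] not in target_ids]

hwm = datetime(2026, 4, 19, 0, 0)
source = [
    {"event_id": "a", "event_ts": datetime(2026, 4, 18, 22, 0)},  # already processed
    {"event_id": "b", "event_ts": datetime(2026, 4, 18, 23, 0)},  # late arrival, ts < hwm
    {"event_id": "c", "event_ts": datetime(2026, 4, 19, 1, 0)},   # genuinely new
]
already_landed = {"a"}

new_rows = incremental_pull(source, already_landed, hwm)
# A naive `event_ts > hwm` pull would have silently lost "b".
assert {r["event_id"] for r in new_rows} == {"b", "c"}
```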
Pattern 2 — Partition-based incremental
The target has daily partitions; each run processes one partition's worth of source data.
def run_incremental(dt: date):
source_for_day = (spark.read.table("bronze.events")
.filter(F.col("dt") == dt)) # safety: always one partition
processed = transform(source_for_day)
(processed.writeTo("silver.events")
    .overwrite(F.col("dt") == F.lit(str(dt))))

Cleaner than HWM — each partition is an atomic unit. Reruns are safe. Backfills loop over dates.
Pattern 3 — Snapshot-diff incremental (Iceberg / Delta)
Read only the changes to a source table since the last processed snapshot.
# Iceberg
df_changes = (spark.read
.format("iceberg")
.option("start-snapshot-id", last_snapshot)
.option("end-snapshot-id", current_snapshot)
.load("warehouse.bronze.events"))
# Delta: Change Data Feed (CDF)
df_changes = (spark.read
.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", last_version)
.option("endingVersion", current_version)
.load("/path/to/bronze/events"))
# Rows include _change_type = 'insert' | 'update_preimage' | 'update_postimage' | 'delete'

Much cheaper than scanning the whole table when most data is unchanged. Used for streaming-from-batch and silver → gold refresh.
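Applying a change feed downstream amounts to replaying _change_type rows against the target, which is easy to model in plain Python; the change-type strings follow Delta's CDF, and the row shape is illustrative.

```python
def apply_changes(target: dict, changes: list) -> dict:
    """Replay a change feed: inserts/postimages upsert, deletes remove, preimages are skipped."""
    for c in changes:
        kind, key = c["_change_type"], c["id"]
        if kind in ("insert", "update_postimage"):
            target[key] = c["value"]
        elif kind == "delete":
            target.pop(key, None)
        # 'update_preimage' carries the old value; no action needed on apply
    return target

t = {1: "a", 2: "b"}
changes = [
    {"_change_type": "update_preimage",  "id": 1, "value": "a"},
    {"_change_type": "update_postimage", "id": 1, "value": "A"},
    {"_change_type": "delete",           "id": 2, "value": "b"},
    {"_change_type": "insert",           "id": 3, "value": "c"},
]
assert apply_changes(t, changes) == {1: "A", 3: "c"}
```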
Pattern 4 — Merge-on-arrival (streaming batch)
For sources that arrive as a stream but are processed in micro-batches (Spark Structured Streaming with Trigger.AvailableNow):
(spark.readStream
.format("iceberg")
.load("warehouse.bronze.events")
.writeStream
.format("iceberg")
.option("path", "warehouse.silver.events")
.option("checkpointLocation", "s3://.../ckpt/silver_events/")
.trigger(availableNow=True) # process whatever's available, then stop
.start()
.awaitTermination())

Best of both: streaming semantics (state, checkpoints, exactly-once), batch cadence (scheduled run, then stop).
6. Backfills — Design, Safety, and Throttling
A backfill is how pipelines prove themselves. The healthy pattern:
- Parameterized job (takes `logical_date`).
- Idempotent writes (partition overwrite or MERGE).
- Separate orchestrator (dedicated DAG or job) that doesn't block the daily schedule.
- Rate limiting to avoid flooding upstream or spiking cost.
- Monitoring: backfill progress, failure rate, error logs.
Backfill driver script
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def backfill(start: date, end: date, parallelism: int = 4, throttle_s: int = 30):
dates = [start + timedelta(days=i) for i in range((end - start).days + 1)]
with ThreadPoolExecutor(max_workers=parallelism) as pool:
futures = {}
for i, d in enumerate(dates):
futures[pool.submit(run_daily, d)] = d
time.sleep(throttle_s / parallelism) # stagger starts
for f in as_completed(futures):
d = futures[f]
try:
f.result()
log.info(f"backfilled {d}")
except Exception as e:
log.error(f"backfill failed for {d}: {e}")
# Decide: halt entire backfill, or continue and report?

Key knobs:
- Parallelism: run N days simultaneously. Too high = overwhelms source / cluster.
- Throttle: pause between submissions. Prevents thundering herd.
- Error policy: fail fast vs continue. Depends on whether days are independent.
Airflow dynamic task mapping for backfills
from airflow.decorators import dag, task
from datetime import datetime, timedelta
@dag(
schedule=None, # triggered manually
start_date=datetime(2026, 1, 1),
catchup=False,
max_active_tasks=8, # concurrency limit
)
def backfill_playback():
@task
def list_dates(start: str, end: str) -> list[str]:
s = datetime.fromisoformat(start).date()
e = datetime.fromisoformat(end).date()
return [(s + timedelta(days=i)).isoformat() for i in range((e-s).days+1)]
@task(retries=2)
def run_one(dt: str):
run_daily_for(dt)
run_one.expand(dt=list_dates("{{ params.start }}", "{{ params.end }}"))
backfill_playback()

Trigger with airflow dags trigger backfill_playback --conf '{"start":"2026-01-01","end":"2026-02-01"}'.
Backfill correctness
- If the current pipeline version differs from the one that produced the historical data, backfill produces different results. Document this; version-tag the output.
- Upstream sources may have had different schemas historically; backfill code must handle all historical shapes OR start from a normalized bronze.
- Dimensions are SCD2: a fact backfilled today joins to dim versions that were current on the backfill date, not today.
Cost-aware backfills
A naive "backfill one year of events" can cost $50k on a warehouse. Tactics:
- Sampling: backfill 10% first; verify output shape; extrapolate cost; decide.
- Coarse-grained backfill: backfill weekly rollups first; fine-grained only where needed.
- Use a dedicated cluster/warehouse: don't let the backfill contend with production.
- Cold-path data: older data goes to cheaper storage tiers (S3 Glacier, Snowflake external tables).
7. File Format Internals — Parquet, ORC, Avro
Parquet (columnar, analytic)
Layout:
File
├── Magic bytes "PAR1"
├── Row Group 1
│ ├── Column Chunk: col_a
│ │ ├── Page 1 (data + RLE/dict encoding)
│ │ ├── Page 2
│ │ └── ...
│ ├── Column Chunk: col_b
│ └── ...
├── Row Group 2
├── ...
├── File Metadata (schema, row group locations, column stats)
└── Magic bytes "PAR1"
- Row group size: 128 MB default. Smaller = finer pruning but more overhead per file. Larger = more efficient scans but coarser filtering.
- Page size: 1 MB default. The unit of decompression.
- Dictionary page: encode low-cardinality columns as ints pointing to a dictionary. Spark writes dictionary pages up to 1 MB; then falls back to plain encoding.
Statistics available:
- Per row group, per column: min, max, null_count, distinct_count (optional).
- Bloom filters (Parquet 2.5+, optional): probabilistic set membership per column chunk.
Encodings:
- PLAIN — raw values, no compression.
- RLE_DICTIONARY — RLE of dictionary IDs. Default for most columns.
- DELTA_BINARY_PACKED — delta encoding. Good for sorted integers / timestamps.
- DELTA_BYTE_ARRAY — delta encoding for strings with common prefixes.
Compression:
- SNAPPY — fast, moderate compression. Default.
- ZSTD — slower, better compression (10-20% smaller than SNAPPY). Preferred in 2026.
- GZIP — very slow, best compression. Rarely used.
- LZ4 — fastest decompression.
Reading pattern (what the engine does):
- Open file, seek to footer, read metadata.
- Apply filters against per-row-group stats; skip non-matching row groups.
- For each surviving row group, for each needed column, seek to column chunk.
- Decode pages; apply predicates again at page level (if dictionary-filter eligible).
- Assemble rows.
This is why selecting only needed columns and filtering on stat-having columns is so fast in Parquet.
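Steps 2 and 3, the stats-based skip, can be simulated in a few lines of plain Python; the row-group stats values are made up for illustration.

```python
# Each row group carries per-column min/max stats in the footer.
row_groups = [
    {"id": 0, "stats": {"watch_ms": (0, 5_000)}},
    {"id": 1, "stats": {"watch_ms": (4_000, 90_000)}},
    {"id": 2, "stats": {"watch_ms": (100_000, 400_000)}},
]

def surviving_groups(groups, column, lo, hi):
    """Keep a row group only if its [min, max] overlaps the predicate range [lo, hi]."""
    out = []
    for g in groups:
        gmin, gmax = g["stats"][column]
        if gmax >= lo and gmin <= hi:
            out.append(g["id"])
    return out

# Predicate: WHERE watch_ms BETWEEN 50_000 AND 95_000 -> only row group 1 is read.
assert surviving_groups(row_groups, "watch_ms", 50_000, 95_000) == [1]
```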
ORC (columnar, Hive-era)
Similar to Parquet but:
- Stripe = ORC's row group (64 MB default).
- Richer lightweight indexes (per-10k-row bloom filters, not just per-stripe).
- Better complex-type handling historically (structs, maps).
- Compresses slightly better (ZSTD/ZLIB) due to more aggressive type-aware encoding.
Most modern lakehouses use Parquet for ecosystem reasons. ORC dominant in traditional Hadoop.
Avro (row-oriented, streaming)
Row-oriented format. Bad for analytics (must read full rows); good for:
- Streaming source data: Kafka topics often serialize records as Avro with a schema registry.
- Backup / archival: faster to produce, less CPU to write.
- Schema evolution: Avro has a rich schema resolution model (reader schema vs writer schema, nullable union types).
# Writing Avro
df.write.format("avro").save("s3://backup/events/")
# Reading with explicit schema
import avro.schema
schema = avro.schema.parse(open("events.avsc").read())

Format choice
- Analytic tables: Parquet (lakehouse default) or ORC (Hive legacy).
- Streaming bronze (Kafka → lake): Avro in-flight (schema registry), Parquet on landing.
- Archival / backup: Avro or gzipped JSON.
- Row-level state: neither — use a row store (Postgres, RocksDB).
8. Partition Design Math
Partition choice is the #1 physical layout decision in a lakehouse. Bad partitioning wastes more compute than any other mistake.
The cardinality sweet spot
Rule: each partition should hold hundreds of MB to a few GB of uncompressed data, with at least a few hundred partitions total for a large table and not more than ~tens of thousands.
Math:
- Row group size target: 128 MB.
- Files per partition: 1–10 for good parallelism without small-file overhead.
- → Partition size target: 128 MB – 1.3 GB.
For a table with 10 TB of data:
- 10 TB / 500 MB per partition ≈ 20,000 partitions. Upper edge.
- 10 TB / 5 GB per partition ≈ 2,000 partitions. Nicer.
If the natural partition column (e.g., dt) gives you 10,000 days — that's 10,000 partitions, ~1 GB each.
If the natural partition column gives you 10M users — that's 10M partitions of tiny data each. Disaster. Repartition by a coarser key (day + country, or user_id % 1000).
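That arithmetic as a tiny calculator, a plain-Python sketch whose thresholds mirror this section's rule of thumb:

```python
def partition_report(table_bytes: float, partition_count: int) -> str:
    """Classify average partition size against the ~128 MB to ~5 GB sweet spot."""
    mb = 1024 ** 2
    avg = table_bytes / partition_count
    if avg < 128 * mb:
        return f"too small (~{avg / mb:.0f} MB avg): small-file overhead dominates"
    if avg > 5 * 1024 * mb:
        return f"too big (~{avg / mb / 1024:.1f} GB avg): coarse pruning, giant tasks"
    return f"ok (~{avg / mb:.0f} MB avg)"

TB = 1024 ** 4
assert partition_report(10 * TB, 20_000).startswith("ok")          # ~512 MB each
assert partition_report(10 * TB, 10_000_000).startswith("too small")
assert partition_report(10 * TB, 100).startswith("too big")
```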
Partition evolution
Iceberg supports partition evolution: change the partition spec without rewriting data. Old data keeps its partitioning; new writes use the new spec.
-- Started with PARTITIONED BY (days(event_ts)); add a bucket dimension without rewriting
ALTER TABLE silver.events ADD PARTITION FIELD bucket(16, user_id);
-- Old daily partitions survive; new writes use day + user_bucket

This is huge. In traditional Hive, changing partitioning meant rewriting the entire table.
Partition pruning — what actually happens
Query: SELECT * FROM silver.events WHERE dt = '2026-04-19' AND country = 'US'.
- Planner reads table metadata → sees partition spec `days(event_ts)` → partition column dt.
- Planner matches filter `dt = '2026-04-19'` → determines the relevant partition(s).
- Planner reads manifest files for matching partitions only.
- Manifests list data files with per-column stats (min/max of `country`).
- Planner prunes files where `country` stats exclude 'US'.
- Spark launches tasks only against surviving files.
What breaks pruning:
- Function on partition column: `WHERE DATE(event_ts) = '2026-04-19'` — the planner can't prove this matches partition `days(event_ts) = 2026-04-19`. Rewrite: `WHERE event_ts >= '2026-04-19' AND event_ts < '2026-04-20'`.
- Iceberg's hidden partitioning fixes this — `days(event_ts)` is an implicit partition derivation, and filters on `event_ts` directly are pruned.
- Implicit type casts: `WHERE dt = '20260419'` (string) vs partition column is DATE — some engines won't prune. Match types.
- UDFs in predicates: opaque; never pruned.
Hidden partitioning (Iceberg's killer feature)
Hive-style: the partition column is user-managed. You must always write WHERE dt = '...' matching the partition value exactly.
Iceberg: partition derived from source column. User writes WHERE event_ts BETWEEN ...; Iceberg transparently computes days(event_ts) and prunes.
-- Iceberg
CREATE TABLE silver.events (
event_id STRING,
event_ts TIMESTAMP,
user_id STRING,
...
)
USING iceberg
PARTITIONED BY (days(event_ts), bucket(16, user_id));
-- Query naturally uses event_ts (no dt column visible)
SELECT COUNT(*) FROM silver.events
WHERE event_ts BETWEEN '2026-04-19' AND '2026-04-20'
AND user_id = 'abc';
-- Prunes: days(event_ts) matches 2026-04-19 partition
-- Prunes: bucket(16, 'abc') matches one of 16 buckets

Clustering (sort within partition)
Even with good partitioning, within a partition the data order matters. Clustering (Snowflake), z-ordering (Delta), or sort-order (Iceberg) physically sorts rows so filters prune at the file / row-group level.
-- Iceberg: write new data sorted by user_id within each partition
ALTER TABLE silver.events WRITE ORDERED BY user_id;
-- Delta: z-order after writes (multi-column space-filling curve)
OPTIMIZE silver.events ZORDER BY (user_id, country);
-- Snowflake: clustering key (automatic maintenance)
ALTER TABLE silver.events CLUSTER BY (user_id);

When to cluster:
- You frequently filter on a specific high-cardinality column (e.g., `user_id`).
- That column is too high-cardinality to partition on directly.
- Query patterns are read-heavy; the sort amortizes.
Z-order packs rows with similar values across multiple columns. Best when queries filter on a pair of columns unpredictably — e.g., sometimes user_id, sometimes country, sometimes both. Single-column sort is better when there's one dominant filter.
Bucket partitioning
Divide rows into N buckets by hash(key) % N. Useful when:
- Key is high-cardinality (can't partition directly).
- You want predictable file sizes (each bucket has roughly 1/N of data).
- Joins on that key can use bucketed joins (skip shuffle if both sides are bucketed identically).
CREATE TABLE silver.events ( ... )
USING iceberg
PARTITIONED BY (days(event_ts), bucket(32, user_id));

32 buckets × 365 daily partitions = 11,680 partitions for a year. Each partition covers 1 day × (1/32 of users). A query filtering on user_id = 'abc' prunes to 1 day × 1 bucket = small.
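The pruning fraction is easy to check in plain Python. Here md5 stands in for Iceberg's actual bucket transform (which is Murmur3-based), so the bucket numbers won't match Iceberg's, but the determinism and the partition arithmetic carry over:

```python
import hashlib

DAYS, BUCKETS = 365, 32

def bucket(user_id: str, n: int = BUCKETS) -> int:
    """Stable hash -> bucket number. (Python's built-in hash() is salted per process; md5 isn't.)"""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n

total_partitions = DAYS * BUCKETS
assert total_partitions == 11_680
# A query pinned to one day and one user touches 1 of 11,680 partitions.
assert bucket("abc") == bucket("abc")   # deterministic across runs and machines
assert 0 <= bucket("abc") < BUCKETS
```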
9. Small Files and Compaction
The small files problem: tiny files kill read performance because each file incurs fixed overhead (open, footer read, metadata parse). A table with millions of tiny files is slower than one with thousands of right-sized files, despite holding the same data.
Causes of small files
- High write parallelism + low data volume. 200 Spark partitions × 200 hourly runs = 40,000 files even if each run is small.
- Streaming micro-batches with short triggers. 1-minute triggers for a low-volume stream: ~525,000 files per year per topic.
- Skewed partitioning. A partition with lots of data gets sub-partitioned; one with little gets one tiny file.
Prevention at write time
- df.coalesce(N) before write: reduces file count without shuffle. Use N = target_data_size / target_file_size.
- df.repartition(N, ...) before write: shuffles to exactly N partitions. Costs a shuffle but guarantees even file sizes.
- Spark AQE: spark.sql.adaptive.advisoryPartitionSizeInBytes = 128MB — AQE coalesces output partitions to hit this target.
- Iceberg write.target-file-size-bytes table property.
# Target 128 MB files (134217728 bytes). Note: a trailing comment after a
# line-continuation backslash is a syntax error, so it lives up here.
df.write \
.format("iceberg") \
.option("target-file-size-bytes", 134217728) \
.mode("append") \
.saveAsTable("silver.events")
Compaction (post-write)
Running OPTIMIZE or rewrite_data_files periodically rewrites many small files into fewer large ones.
-- Iceberg: compact files smaller than 100 MB, combining to ~512 MB
CALL system.rewrite_data_files(
table => 'silver.events',
options => map(
'target-file-size-bytes', '536870912',
'min-file-size-bytes', '104857600'
)
);
-- Delta
OPTIMIZE silver.events;
Schedule: daily for streaming-ingested tables, weekly for batch-ingested.
Garbage collection (expiring snapshots)
Iceberg/Delta keep old snapshots for time travel. Expire them to reclaim storage:
-- Iceberg: expire snapshots older than 7 days, keep at least 5 most recent
CALL system.expire_snapshots(
table => 'silver.events',
older_than => TIMESTAMP '2026-04-12 00:00:00',
retain_last => 5
);
-- Delta
VACUUM silver.events RETAIN 168 HOURS; -- 7 days
Caution: vacuuming kills time-travel for data older than the retention window. If compliance requires 7-year retention, configure accordingly.
10. CDC (Change Data Capture) Patterns
Source OLTP database → analytics warehouse. The canonical patterns:
Full snapshot (baseline)
Daily full dump of the source table. Simple, correct, expensive. Viable up to ~tens of GB.
# JDBC to Iceberg (full reload)
(spark.read
.format("jdbc")
.option("url", "jdbc:postgresql://...")
.option("dbtable", "public.orders")
.option("partitionColumn", "order_id")
.option("numPartitions", 20)
.option("lowerBound", "1")
.option("upperBound", "100000000")
.load()
.write
.format("iceberg")
.mode("overwrite")
.saveAsTable("bronze.orders"))
Pros: guaranteed correctness, no state. Cons: expensive at scale, lag = 1 day.
Incremental snapshot (changed-rows-only)
Source table has updated_at; pull rows where updated_at > last_pull.
hwm = get_high_watermark("bronze.orders", "updated_at")
(spark.read
.format("jdbc")
.option("dbtable", f"(SELECT * FROM orders WHERE updated_at > '{hwm}') AS t")
...
.write
.format("delta")
.mode("append")
.saveAsTable("bronze.orders_changes"))
# Upsert into bronze.orders using MERGE
Pros: cheaper than full reload. Cons: misses hard deletes (row gone from source with no marker); relies on accurate updated_at.
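The snippet above calls a `get_high_watermark` helper without defining it. A minimal sketch of the bookkeeping it implies — hypothetical names throughout, with a dict standing in for what would be a small control table (Delta/Postgres) in production:

```python
EPOCH = "1970-01-01 00:00:00"

class WatermarkStore:
    """Tracks the max updated_at pulled per (table, column). In production
    this state lives in a control table; a dict stands in here."""
    def __init__(self):
        self._state = {}

    def get(self, table: str, column: str) -> str:
        # First run: no watermark yet, pull everything since the epoch.
        return self._state.get((table, column), EPOCH)

    def advance(self, table: str, column: str, new_hwm: str) -> str:
        # Monotonic: never move the watermark backwards, so a replayed or
        # out-of-order run can only widen the next pull, never skip rows.
        # ISO-8601 strings compare lexicographically == chronologically.
        current = self.get(table, column)
        self._state[(table, column)] = max(current, new_hwm)
        return self._state[(table, column)]

store = WatermarkStore()
store.advance("bronze.orders", "updated_at", "2026-04-19 10:00:00")
store.advance("bronze.orders", "updated_at", "2026-04-18 00:00:00")  # replay: ignored
hwm = store.get("bronze.orders", "updated_at")  # still 2026-04-19 10:00:00
```

The key design choice is advancing the watermark only after the pull *and* the MERGE succeed; advancing it before the write completes is how incremental pipelines silently lose rows.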
Log-based CDC (Debezium, Fivetran, AWS DMS)
Read from the database's write-ahead log (Postgres logical replication, MySQL binlog, SQL Server CDC). Emit Kafka messages per row change, with operation type:
{
"before": {"order_id": 42, "status": "paid", "total": 100},
"after": {"order_id": 42, "status": "shipped", "total": 100},
"op": "u",
"ts_ms": 1713567890000,
"source": {"schema": "public", "table": "orders", "lsn": "0/30A3B8"}
}
Ops: c (create/insert), u (update), d (delete), r (snapshot read).
Pros:
- Sub-second lag.
- Hard deletes captured.
- Exact change history, not just "current state."
- Works without touching the source.
Cons:
- Operationally complex: Kafka, Debezium, schema registry.
- Requires database-side logging (WAL) configured correctly.
- Schema drift at source breaks pipelines.
Upserting CDC into a lakehouse
cdc_stream = (spark.readStream
.format("kafka")
.option("subscribe", "cdc.public.orders")
...
.load())
# Parse Debezium envelope
parsed = cdc_stream.select(
F.from_json(F.col("value").cast("string"), debezium_schema).alias("env")
).select(
"env.op", "env.after", "env.before", "env.ts_ms"
)
# Apply to target with forEachBatch + MERGE
def merge_batch(batch_df, batch_id):
from delta.tables import DeltaTable
t = DeltaTable.forName(spark, "silver.orders")
(t.alias("t")
.merge(
batch_df.alias("s"),
"t.order_id = s.after.order_id OR (s.op = 'd' AND t.order_id = s.before.order_id)"
)
.whenMatchedDelete(condition="s.op = 'd'")
.whenMatchedUpdateAll(condition="s.op = 'u' AND s.ts_ms > t.cdc_ts_ms")
.whenNotMatchedInsert(
condition="s.op IN ('c','r')",
values={
"order_id": "s.after.order_id",
"status": "s.after.status",
...
"cdc_ts_ms": "s.ts_ms"
}
)
.execute())
(parsed.writeStream
.foreachBatch(merge_batch)
.option("checkpointLocation", "s3://.../ckpt/orders_cdc/")
.trigger(processingTime="30 seconds")
.start())
The cdc_ts_ms > t.cdc_ts_ms guard prevents out-of-order CDC events from overwriting a newer state with an older one.
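A minimal in-memory model of that MERGE (illustrative, not the Delta API) makes the guard's behavior easy to verify: apply Debezium-style events to a keyed table, with the ts_ms check rejecting stale updates exactly like the `whenMatchedUpdateAll` condition above:

```python
def apply_cdc(table: dict, event: dict) -> None:
    """Apply one Debezium-style envelope to an in-memory keyed table."""
    op, ts = event["op"], event["ts_ms"]
    if op in ("c", "r"):                      # insert / snapshot read
        row = dict(event["after"], cdc_ts_ms=ts)
        table[row["order_id"]] = row
    elif op == "u":                           # update, guarded by ts_ms
        key = event["after"]["order_id"]
        current = table.get(key)
        if current is not None and ts > current["cdc_ts_ms"]:
            table[key] = dict(event["after"], cdc_ts_ms=ts)
    elif op == "d":                           # hard delete
        table.pop(event["before"]["order_id"], None)

orders = {}
apply_cdc(orders, {"op": "c", "ts_ms": 100, "after": {"order_id": 42, "status": "paid"}})
apply_cdc(orders, {"op": "u", "ts_ms": 300, "after": {"order_id": 42, "status": "shipped"}})
# A delayed event with an older ts_ms arrives last — the guard drops it:
apply_cdc(orders, {"op": "u", "ts_ms": 200, "after": {"order_id": 42, "status": "paid"}})
```

Note that, like the MERGE it mirrors, an update for a key that was never inserted is silently dropped — acceptable only if the initial snapshot (`r` events) is guaranteed to precede updates.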
CDC + SCD2
For downstream dim tables: the stream of CDC events drives SCD2 insertion of new rows. Every update event closes the current row and inserts a new one; every delete marks the current row as invalidated.
11. Data Quality in Batch Pipelines
Quality checks live alongside logic — not "if we have time later."
Taxonomy of checks
| Check type | Example | When to run |
|---|---|---|
| Schema | columns exist with expected types | Every run, fail fast |
| Row count | count > 0, count within ±20% of 7-day avg | Every run, warn on drift |
| Uniqueness | PK is unique | Every run, hard fail on violation |
| Referential | all FKs resolve | Every run, warn or fail |
| Null rate | non-key columns < 1% null | Every run, warn |
| Distribution | numeric columns within expected p99 | Periodic, drift detection |
| Freshness | max(event_ts) within SLO | Every run, SLA alert |
| Volume | same volume as upstream, ±5% | Every run, warn |
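The "count within ±20% of 7-day avg" row-count check from the table above can be sketched in a few lines (hypothetical helper, thresholds as stated):

```python
def row_count_drift(today: int, history: list[int], tolerance: float = 0.20):
    """Warn-on-drift check: is today's row count within ±tolerance of the
    trailing-window average? history = daily counts for the last 7 days.
    Returns (ok, baseline) so the caller can log the baseline it compared to."""
    baseline = sum(history) / len(history)
    ok = abs(today - baseline) <= tolerance * baseline
    return ok, baseline

ok, baseline = row_count_drift(80_000, [100_000] * 7)
# 80,000 is exactly -20% of the 100,000 baseline: passes at the boundary
```

In practice this fires as a warning, not a hard fail — legitimate traffic dips (holidays, incidents upstream) would otherwise page you for correct data.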
dbt tests (declarative)
# models/silver/playback_sessions.yml
version: 2
models:
- name: playback_sessions
description: "Sessionized playback events"
columns:
- name: session_id
tests:
- unique
- not_null
- name: user_id
tests:
- not_null
- relationships:
to: ref('dim_user')
field: user_id
- name: watch_ms
tests:
- dbt_utils.accepted_range:
min_value: 0
max_value: 86400000 # 24h in ms
tests:
- dbt_utils.recency:
datepart: hour
field: max_end_ts
interval: 2
Great Expectations (imperative + metadata)
import great_expectations as gx
context = gx.get_context()
ds = context.sources.add_pandas("pd")
asset = ds.add_parquet_asset(
"playback_sessions",
filepath_or_buffer="s3://lake/silver/playback_sessions/dt=2026-04-19/*.parquet"
)
batch = asset.build_batch_request()
validator = context.get_validator(batch_request=batch)
validator.expect_column_values_to_not_be_null("session_id")
validator.expect_column_values_to_be_unique("session_id")
validator.expect_column_values_to_be_between("watch_ms", min_value=0, max_value=86_400_000)
validator.expect_column_mean_to_be_between("watch_ms", min_value=60_000, max_value=3_600_000)
validator.save_expectation_suite("playback_sessions.suite")
result = validator.validate()
if not result.success:
raise ValueError(f"DQ failed: {result}")
Soda Core (YAML-first)
checks for silver.playback_sessions:
- row_count > 1000000
- missing_count(session_id) = 0
- duplicate_count(session_id) = 0
- values in (device_type) must be in ['tv','mobile','tablet','web','other']
- freshness(end_ts) < 1h
- anomaly score for row_count < 3
Designing "what to check"
- Every PK column: not null, unique.
- Every FK: relationships to parent (or tolerate N% unresolved with a warning).
- Every measure: physically plausible range (no negatives on watch time, etc.).
- Every timestamp: within expected date range (catch 1970-01-01 defaults).
- Every enum column: values in allowed set.
- Daily volumes: within ±20% of 7-day rolling average; alert on anomalies.
Over-checking is cheap; under-checking is expensive.
12. Orchestration Patterns for Batch
The simple scheduled DAG
One task per stage, linear dependencies, cron schedule.
extract → dedupe → transform → validate → write → publish
Best for: small, well-understood pipelines.
Asset-based orchestration (Dagster)
Instead of tasks-with-dependencies, declare the datasets (assets) and their producers. Dagster figures out the dependency graph.
from dagster import asset
@asset
def bronze_events(context):
return pull_from_kafka(context)
@asset
def silver_events(bronze_events):
return transform(bronze_events)
@asset
def gold_daily_plays(silver_events):
return aggregate(silver_events)
Materializing gold_daily_plays rebuilds upstream as needed.
Sensor-based triggering
Instead of cron, trigger when upstream data arrives.
# Airflow
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
wait = S3KeySensor(
task_id="wait_for_upstream",
bucket_key="s3://upstream/events/{{ ds }}/_SUCCESS",
timeout=60 * 60 * 4,
poke_interval=300,
)
For lakehouse tables: dataset-based scheduling (Airflow 2.4+) reacts when the upstream table is updated.
Backfill separation
Never use the same DAG for daily and backfill. Daily runs should not contend with backfills. Backfill DAG:
- Dedicated compute pool.
- Lower priority.
- Separate alerting (backfill failures rarely need 3am pages).
SLA management
Every table should have:
- Freshness SLO — "updated within 1h of upstream seal."
- Completeness SLO — "≥99% of expected row count."
- Correctness SLO — "DQ tests green on every run."
Expose as metrics; alert on violations with an on-call rotation.
# Airflow SLA miss callback
def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
notify_on_call(...)
default_args = {
"sla": timedelta(hours=1),
"sla_miss_callback": sla_miss_callback,
}
Next: 03-streaming-processing.md — unbounded data, event-time processing, watermark math, Flink and Kafka internals, exactly-once proofs.
13. Dependency Graphs and Critical Path Analysis
A batch "pipeline" at scale is never a chain — it's a directed acyclic graph of hundreds of tasks with cross-dataset dependencies. Two questions dominate: when will the whole graph finish, and which node, if it slips, slips the whole graph? These are critical-path questions.
Modeling the graph
Each node is a task with a duration. Each edge expresses "task B cannot start until task A succeeds." Total graph completion time is the longest-path sum from any source to any sink. That longest path is the critical path.
Nodes off the critical path have slack — the amount they can slip without affecting total completion time. Nodes on the critical path have zero slack. A single extra minute on any critical-path task adds a minute to the whole pipeline.
Why this matters for interviews
"Our daily pipeline finishes at 6am but the business wants it by 4am — how do you speed it up?" Wrong answer: "I'd optimize the slowest job." Right answer: "I'd find the critical path and optimize the longest task on it. Optimizing off-path tasks won't move completion time at all."
The Airflow / Dagster reality
Most orchestrators expose task duration logs. Extract them, build the adjacency matrix, compute longest-path via topological sort + dynamic programming. O(V+E). For a 500-task graph, runs in sub-second and tells you exactly which five tasks you must optimize.
# Simplified: compute critical path duration
def critical_path(nodes, edges, duration):
# topological sort
indeg = {n: 0 for n in nodes}
for a, b in edges: indeg[b] += 1
order, q = [], [n for n,d in indeg.items() if d == 0]
while q:
n = q.pop(0); order.append(n)
for a, b in edges:
if a == n:
indeg[b] -= 1
if indeg[b] == 0: q.append(b)
# DP: earliest finish time
finish = {n: duration[n] for n in nodes}
for n in order:
for a, b in edges:
if a == n:
finish[b] = max(finish[b], finish[n] + duration[b])
return max(finish.values())
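The function above returns only the total makespan. Finding *slack* — how much each node can slip — needs a second, backward pass. A sketch under the same DAG model (adjacency lists and a deque keep it O(V+E)):

```python
from collections import deque

def slack_per_node(nodes, edges, duration):
    """Earliest-start / latest-start DP over a task DAG.
    Nodes with zero slack form the critical path; a node with positive
    slack can slip by that much without delaying total completion."""
    succ = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    order, q = [], deque(n for n in nodes if indeg[n] == 0)
    while q:                                   # Kahn's topological sort
        n = q.popleft(); order.append(n)
        for b in succ[n]:
            indeg[b] -= 1
            if indeg[b] == 0: q.append(b)
    earliest = {n: 0 for n in nodes}           # forward pass: earliest start
    for n in order:
        for b in succ[n]:
            earliest[b] = max(earliest[b], earliest[n] + duration[n])
    makespan = max(earliest[n] + duration[n] for n in nodes)
    latest = {n: makespan - duration[n] for n in nodes}
    for n in reversed(order):                  # backward pass: latest start
        for b in succ[n]:
            latest[n] = min(latest[n], latest[b] - duration[n])
    return {n: latest[n] - earliest[n] for n in nodes}

# a(1h) feeds b(5h) and c(2h): a→b is the critical path; c has 3h of slack
slack = slack_per_node(["a", "b", "c"], [("a", "b"), ("a", "c")],
                       {"a": 1, "b": 5, "c": 2})
```

Reporting slack per task is what turns the interview answer from "find the critical path" into a concrete optimization list: sort ascending by slack and start at zero.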
14. The Batch Cost Model
Senior candidates should be able to estimate batch cost without invoking the cloud-provider pricing page. The model has four terms.
The four costs
- Storage. GB-months × $/GB-month. Flat. For S3 Standard, ~$0.023/GB-month. For Glacier, ~1/10 of that.
- Compute. Core-hours × $/core-hour. EC2 spot typically $0.01–0.02 per vCPU-hour; on-demand 3x that.
- Shuffle / intermediate I/O. Bytes written to local disk or the shuffle service. On ephemeral disk, free but capacity-limited. On remote shuffle (cloud managed), charged per GB.
- Egress. Cross-region or internet egress. The silent killer. $0.02–0.09 per GB depending on direction. Can dwarf compute cost for lift-and-shift workloads.
Worked estimate — 10 TB daily batch
Input 10 TB. One Spark job with a 3-way join producing 2 TB output. Shuffle amplification typically 2–4x input — say 30 TB of shuffle. Runtime 40 min on 100 vCPUs.
- Compute: 100 vCPU × 0.67 hour × $0.02 = $1.33
- Storage (same-region S3, input rereadable): negligible incremental
- Shuffle: 30 TB local disk, free
- Egress: zero if same region, ~$900 if cross-region output
The answer "this job costs roughly $1.50 same-region, $900+ cross-region" is the senior-level answer. Notice the 600x swing just on egress — that's why "lift-and-shift to another region" is so often the wrong design move.
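The worked estimate above reduces to a back-of-envelope function. The prices are the document's illustrative figures, not live quotes — treat the function as a sketch of the model, not a billing tool:

```python
def batch_cost(vcpus: int, hours: float, price_per_vcpu_hour: float = 0.02,
               egress_tb: float = 0.0, egress_per_gb: float = 0.09) -> float:
    """Two of the four cost terms dominate most same-account batch jobs:
    compute (core-hours) and egress (the silent killer). Storage and local
    shuffle are treated as negligible here, per the worked example."""
    compute = vcpus * hours * price_per_vcpu_hour
    egress = egress_tb * 1024 * egress_per_gb   # TB → GB at the per-GB rate
    return round(compute + egress, 2)

same_region = batch_cost(100, 0.67)                  # ≈ $1.34: compute only
cross_region = batch_cost(100, 0.67, egress_tb=10)   # ≈ $923: egress dominates
```

The ~600x swing between the two calls is the whole point of the model: the job didn't change, only where its output landed.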
15. Parallelism Math and the Skew Tax
Perfectly parallel work distributes across N workers with speedup ≈ N. Real work rarely achieves this. Two sources of loss: sequential fraction (Amdahl's law) and skew (one partition is fatter than the rest).
The skew tax, quantified
If you partition 1 TB of work into 100 evenly-sized pieces of 10 GB each, total wall-clock time is ≈ (10 GB work / per-worker throughput). If instead one partition is 500 GB and the other 99 average 5 GB, total wall-clock is bounded below by 500 GB / throughput — 50x slower than the balanced case, despite the same total work. That's the skew tax.
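The bound in the paragraph above — wall-clock time is at least the largest single partition, no matter how many workers you add — can be demonstrated with a toy greedy scheduler (an illustration of the principle, not how Spark actually assigns tasks):

```python
def makespan_minutes(partition_sizes_gb, workers, throughput_gb_per_min=1.0):
    """Greedy longest-first assignment: each partition goes to the currently
    least-loaded worker. Wall-clock = the busiest worker's total load."""
    loads = [0.0] * workers
    for size in sorted(partition_sizes_gb, reverse=True):
        loads[loads.index(min(loads))] += size
    return max(loads) / throughput_gb_per_min

balanced = makespan_minutes([10] * 100, workers=100)       # 10 min: 1 TB, even
skewed = makespan_minutes([500] + [5] * 99, workers=100)   # 500 min: same ~1 TB
```

Same total data, same worker count, 50x the wall-clock — the skew tax, quantified. Adding workers does nothing here; only splitting the 500 GB partition helps.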
Detecting skew ahead of time
- Compute the partition size distribution: SELECT part_key, COUNT(*), SUM(bytes) FROM … GROUP BY 1 ORDER BY 2 DESC LIMIT 20. If the top partition is 10x the median, you have skew.
- Coefficient of variation (stddev / mean) on partition sizes. Above 0.5 = significant skew.
Mitigations
- Salting. Suffix the skewed key with a small random integer (e.g., key || '_' || CAST(FLOOR(RAND() * 10) AS INT)), join, then aggregate up. Fixes the shuffle but doubles join complexity.
- Broadcast. If the other side is small, broadcast it and avoid shuffling the skewed key entirely.
- Isolate the hotspot. Split the skewed keys into a separate job with different parallelism.
- AQE. Spark's adaptive skew-join detects skewed partitions at runtime and splits them. Configure spark.sql.adaptive.skewJoin.enabled=true, and know what skewedPartitionThresholdInBytes controls.
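To see what salting actually does to the shuffle, here is a pure-Python sketch (hypothetical helper names; in a real job these would be column expressions on the fact and dimension DataFrames):

```python
import random

N_SALTS = 10  # number of sub-keys the hot key is spread across

def salt_fact_key(key: str) -> str:
    """Fact side: append a random salt, so the hot key's rows spread
    across N_SALTS shuffle partitions instead of one."""
    return f"{key}_{random.randrange(N_SALTS)}"

def explode_dim_key(key: str) -> list[str]:
    """Dimension side: replicate each row once per salt so every salted
    fact key still finds its match in the join."""
    return [f"{key}_{i}" for i in range(N_SALTS)]

random.seed(0)
salted = [salt_fact_key("hot_user") for _ in range(10_000)]
per_salt = {k: salted.count(k) for k in set(salted)}  # ~1,000 rows per sub-key
```

The cost is visible in `explode_dim_key`: the dimension side is replicated N_SALTS times, which is why salting only pays off when the skewed side dwarfs the other.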
16. Checkpoint and Restart Semantics
Batch jobs are not point-in-time atomic. A 6-hour job that fails at hour 4 either resumes from a checkpoint (if designed for it) or restarts from zero. The senior design decision is which.
Three restart strategies
- Fully idempotent restart from start. Every job is safe to rerun end-to-end. Requires deterministic inputs and idempotent sinks (MERGE on business key). Simple; operationally resilient. Cost: wasted compute on every failure.
- Task-level checkpointing. Orchestrator records task completion; restart resumes at the failed task. Works for DAG-structured jobs. Requires each task's output to be written atomically (typically via temp-then-rename).
- Intra-task checkpointing. A long task persists progress state (e.g., partition cursor) and resumes from it. Necessary for multi-hour jobs. Expensive to implement right; usually worth it only for jobs that exceed ~2 hours.
The atomic-write discipline
Every job output must be all-or-nothing. The pattern: write to a temp location, validate, then atomic rename or commit. Never "write in place and pray." Lakehouse formats (Iceberg, Delta) get you atomic commits for free. Hive partitioned tables do not — use _SUCCESS markers and write to a staging path first.
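For outputs outside a lakehouse format, the temp-then-rename pattern looks like this — a minimal single-file sketch (JSON for illustration; the same discipline applies to staging directories and `_SUCCESS` markers):

```python
import json
import os
import tempfile

def atomic_write_json(path: str, rows: list[dict]) -> None:
    """Write to a temp file in the same directory, fsync, then atomically
    rename into place. A reader sees either the old file or the complete
    new one, never a partial write. os.replace is atomic on POSIX when
    source and destination are on the same filesystem."""
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(rows, f)
            f.flush()
            os.fsync(f.fileno())   # durable on disk before it becomes visible
        os.replace(tmp, path)      # the atomic commit point
    except BaseException:
        os.unlink(tmp)             # crash cleanup: never leave half-written temps
        raise
```

The temp file living in the *same directory* matters: rename is only atomic within a filesystem, which is also why object stores (no atomic rename) need the lakehouse commit protocols instead.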
The "restart at noon" drill
Interviewers love to ask: "Yesterday's job failed at 11pm. You rerun at noon today. What happens downstream?" The senior answer enumerates: which tables get rewritten, which downstream caches invalidate, which dashboards see a stale→fresh transition, which SLAs trip. If you can't answer this for your own pipelines, you don't own them operationally — no matter what the org chart says.
17. Partition Key Decisions by Domain
Choosing a partition key is the single highest-leverage physical-design decision in a lakehouse or warehouse table. A good choice makes common queries prune 99% of the data; a bad one produces billion-row scans for every dashboard. Below is a pattern-match table of typical domains, the natural partition key, the access patterns that justify it, and the most common wrong choice candidates reach for first.
| Domain / table | Recommended partition key | Secondary (clustering/sort) | Why | Common wrong pick |
|---|---|---|---|---|
| Events (app, web, mobile) | event_date | user_id, event_name | Nearly all queries bound by a date range | user_id — cardinality explosion, files per user explode |
| Orders / transactions | order_date | customer_id, region | Daily batches, MoM reports, finance closes — all date-bounded | order_id — gives one row per partition |
| Clickstream / impressions | event_date + hour (hourly) | session_id, campaign_id | High volume; hourly grain keeps partitions under ~1 TB | campaign_id — skew on hero campaigns |
| Payments / ledger | posting_date | merchant_id, account_id | Financial close runs on posting date | merchant_id — power-law skew |
| Subscription billing | invoice_month | account_id | Billing cycles are monthly | account_id — tenant skew; too few files for small tenants |
| CDC / change log | change_date | source_table, pk_hash | Replay and backfill are date-bounded | source_table — one huge partition per high-volume table |
| IoT telemetry | ingest_date + hour | device_id bucket (16–64 buckets) | Continuous high-volume stream; hourly prevents partition bloat | device_id raw — cardinality blowup |
| Logs (systems) | log_date + hour | service, severity | Investigations bounded by time; joining across services is rare | host_id — millions of tiny partitions |
| Media catalog | — (no partition; small) | title_id cluster | A catalog of 10–100 K items is too small to benefit from partitioning | partitioning by genre — rarely prunes |
| Playback / watch events | event_date | title_id, region | Engagement reporting is date-bounded; title-level rollups are frequent enough to cluster on | title_id — one-hit wonders create skew |
| Ad impressions | event_date + hour | advertiser_id, placement_id | Volume is extreme; hourly is the right grain | advertiser_id — top advertiser dominates |
| Inventory snapshots | snapshot_date | warehouse_id, sku | Snapshots are by date; comparisons are date-vs-date | sku — SKU count is in the millions |
| Clinical claims | service_date | payer_id, provider_id | Service date anchors claims processing and audit | patient_id — cardinality and privacy issues |
| Ride / trip data | trip_end_date | city_id, driver_id | Operational reporting is daily per city | driver_id — long tail of one-trip drivers |
| Support tickets | opened_date | product_area, priority | SLA and backlog queries are date-bounded | customer_id — sparse per customer |
| CRM activity | activity_date | owner_id, account_id | Rep productivity reports run by week/month | account_id — uneven size |
| Search queries | query_date | query_hash bucket | Investigations and model training are both date-bounded | query_text — one partition per unique query |
| Model training features | feature_date | entity_id bucket | Training sets are date-bounded; entity_id is bucketed to avoid skew | entity_id raw — skew + small partitions |
Sizing heuristic
Aim for data files of ~128 MB – 1 GB after compaction, and keep any single partition roughly under ~1 TB. Below ~100 MB per partition, metadata overhead dominates; far above ~1 TB, queries that touch one partition still scan enormous amounts of data. For a 100 GB/day fact, daily partitioning is comfortable. For a 1 TB/day fact, daily is at the upper edge but workable. For a 10 TB/day fact, move to hourly.
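The heuristic implied by those examples can be written down as a tiny grain chooser — thresholds are the ones sketched above, not universal constants:

```python
def pick_grain(daily_gb: float) -> str:
    """Choose a time-partition grain from daily data volume, keeping one
    partition's data roughly under ~1 TB after compaction."""
    if daily_gb < 1:        # tiny table: metadata overhead would dominate
        return "none"
    if daily_gb <= 1_000:   # up to ~1 TB/day: one partition per day
        return "daily"
    return "hourly"         # beyond that, split each day into 24 partitions
```

Usage: `pick_grain(100)` returns `"daily"`, `pick_grain(10_000)` returns `"hourly"` — matching the 100 GB/day and 10 TB/day examples above.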
The bucketed-id trick
When you have a high-cardinality key (user_id, device_id) that users filter on but can't partition on directly, hash it into a fixed number of buckets and use that as a secondary partition. The query can push the bucket predicate down (WHERE user_bucket = hash(user_id) % 32), pruning ~97% of partitions while keeping each partition reasonable size. This is how Iceberg's hidden partitioning works under the hood for bucket transforms.
-- Iceberg: hidden bucketing — no user code changes
CREATE TABLE events (
event_date DATE,
user_id BIGINT,
event_name STRING,
...
)
USING ICEBERG
PARTITIONED BY (event_date, bucket(32, user_id));
-- Queries that filter on user_id automatically prune 31/32 buckets
SELECT * FROM events
WHERE event_date = '2026-04-20' AND user_id = 12345;
18. NULL Semantics — Examples by Domain
NULL is never a single thing. It carries different meanings across columns and domains, and conflating them is a root cause of subtle downstream bugs. The rule: every nullable column in a contract must document which meaning applies. Examples below.
| NULL meaning | Column examples | Recommended handling |
|---|---|---|
| Unknown (value exists but not captured) | customer_dob, visit_purpose, referrer_url | Leave NULL; document that absence ≠ absence-of-fact |
| Not applicable (value cannot exist for this row) | spouse_name for single customers, return_reason for non-returns, churn_date for active users | Leave NULL; contract explicitly enumerates which combinations preclude the column |
| Pending (value will arrive later) | shipped_ts, paid_ts, resolved_ts in an accumulating snapshot | NULL is the signal; downstream code tests IS NULL explicitly |
| Sentinel for "no match" | fraud_model_score on transactions the model declined to score, segment_id for un-enrollable users | Prefer a sentinel value over NULL: -1 for scores, 'UNASSIGNED' for enums |
| Soft-deleted | deleted_at populated = row is logically deleted; NULL = active | NULL means "not deleted"; every consumer must filter on deleted_at IS NULL |
| End-date of open interval | valid_to, employment_end_date, lease_end_date | NULL means "still valid"; OR set to 9999-12-31 for range-join friendliness |
| Privacy-suppressed | customer_ip, email post-GDPR | Prefer an explicit status column (pii_status='erased') alongside NULL |
| Default-not-set | preferred_language, notification_opt_in | Fill with a system default ('en-US', FALSE); NULL here is usually a bug |
| Zero vs NULL (the classic) | refund_amount, discount_amount, gift_card_value | Zero when we know no refund occurred; NULL only if we don't know. The two are different rows in finance. |
The aggregation trap
SUM(column_with_nulls) ignores NULLs, but SUM(column_with_nulls) / COUNT(*) does not — COUNT(*) includes NULL rows. If you want the average over non-null rows, use AVG(col) or SUM(col) / COUNT(col). Mixing SUM and COUNT(*) is the single most common "the numbers are off" bug in dashboards that mix nullable measures.
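The trap is easy to demonstrate with SQL semantics mimicked in Python (`None` standing in for NULL):

```python
rows = [100, None, 300, None]   # a nullable measure column, 4 rows

sql_sum = sum(v for v in rows if v is not None)     # SUM(col)   skips NULLs: 400
count_star = len(rows)                              # COUNT(*)   counts all rows: 4
count_col = sum(1 for v in rows if v is not None)   # COUNT(col) skips NULLs: 2

wrong_avg = sql_sum / count_star   # 100.0 — silently divides by NULL rows too
right_avg = sql_sum / count_col    # 200.0 — what AVG(col) actually returns
```

The two "averages" differ by 2x on the same data — exactly the silent dashboard discrepancy the paragraph above describes.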
The NULL-vs-zero revenue bug
Finance often wants "revenue = 0 for days with no sales." If your fact table uses NULL to mean "no sales," your MoM chart will produce NULL MoM for zero-revenue-days, not "down 100%." Generate a dense calendar grid (see Q4 in Part 09's SQL bank), COALESCE NULLs to 0, and you're safe.
Streaming Processing
Streaming is where data engineering gets hard — not because the APIs are complex, but because the physics of distributed time makes correctness subtle. This file goes into watermark math, window mechanics, Flink's checkpoint algorithm, Kafka's internal protocols, and the proofs behind exactly-once semantics.
1. Unbounded Data — What Actually Changes
An unbounded source has four properties that break batch intuitions:
- No sealed state. You never "have all the data." Every computation is "as of now," subject to revision.
- Arbitrary lateness. Events can arrive minutes, hours, or days after they happened (mobile reconnects, buffering clients).
- Out-of-order arrival. An event timestamped 10:05 may arrive after an event timestamped 10:07, because it took a different route through the system.
- Continuous state. The processor must maintain state (windows, joins, aggregations) that grows without bound unless explicitly expired.
Consequence 1: "correct" becomes "correct as of watermark W"
A batch SQL SELECT SUM(watch_ms) GROUP BY dt has one answer. A streaming equivalent has an answer that keeps updating as late data arrives — and at any moment, the answer is "the sum of all data observed up to the current watermark, with the agreement that events older than W are considered closed."
Consequence 2: computations have four dimensions, not one
Beam's influential model:
- What are you computing? (the aggregation function)
- Where in event time? (the windowing scheme)
- When in processing time? (the trigger — when to emit a result)
- How do refinements relate? (accumulating, discarding, accumulating-with-retractions)
Naming them explicitly clarifies every streaming design decision.
Consequence 3: state is load-bearing
A batch job's state is transient (shuffle files, intermediate RDDs). A streaming job's state is the product. Lose it, lose correctness. Managing state durably (checkpoints, savepoints, RocksDB), bounding its growth (TTL, key expiration), and recovering it on failure (from checkpoint) dominates streaming engineering.
Consequence 4: time is a first-class citizen
In batch, "what time is it?" is a parameter. In streaming, time is the dataflow itself. Watermarks advance; triggers fire; windows close; state expires. Every operation is time-indexed.
2. The Three Times (Event, Processing, Ingestion)
Define precisely:
- Event time (ET) — when the event actually occurred in the real world. Stamped by the producer (the client, the sensor, the database transaction). This is what users care about. "Watch hours on 2026-04-18" means "hours of watching whose event_time falls on 2026-04-18."
- Ingestion time (IT) — when the event was received by the messaging system (e.g., the time Kafka assigned on append).
- Processing time (PT) — when the stream processor is evaluating the event.
Relationships:
- ET ≤ IT ≤ PT (assuming correct clocks — events can't be received before they happened, and processing can't finish before receipt; producer clock skew can violate the first inequality in practice).
- IT - ET is the producer lateness — how long the event took to reach the messaging system.
- PT - IT is the processing lag — how behind the consumer is.
- PT - ET is the end-to-end latency experienced by consumers.
Distribution of ET → IT lag
Real production distributions are heavily right-skewed:
- p50: milliseconds to seconds.
- p99: tens of seconds to minutes.
- p99.9: hours (mobile reconnects, app-in-background).
- p99.99: days (offline clients syncing weeks later).
Lesson: watermarks must be tuned to the distribution, not the average. Setting a 1-minute lateness tolerance drops ~1% of events in most mobile datasets.
When does processing time ever make sense?
- Ad-hoc system health ("events/sec arriving right now").
- Simple at-least-once transforms with no time semantics.
- Systems that explicitly want "wall clock" behavior (alarms, heartbeats).
For any business metric, event time is correct. Always start with event time.
3. Watermarks — Math and Mechanics
A watermark is an assertion: "I assert that no more events with event_time ≤ W will arrive." The stream processor uses this to decide when windows can close.
Perfect vs heuristic watermarks
- Perfect watermark: actually correct. Only achievable when you have metadata (e.g., Kafka producer timestamps with bounded clock skew and source-aware committed offsets). Rare.
- Heuristic watermark: a guess, usually max(event_time_seen_so_far) - safety_margin. Wrong sometimes; that's what late-data handling is for.
Bounded out-of-orderness watermark
The common practical watermark:
W(t) = max(event_time seen before t) - B
where B is the bounded out-of-orderness (expected max lateness).
Example with B = 5 seconds:
- Stream sees events {ET: 10:00:00, ET: 10:00:02, ET: 10:00:01, ET: 10:00:04, ET: 10:00:03, ...}.
- After ET: 10:00:04 is observed, watermark = 10:00:04 - 5s = 09:59:59.
- Event ET: 10:00:03 arrives next: does it fall below the watermark? 10:00:03 > 09:59:59, so not late. Accepted into windows.
- After ET: 10:00:10 is observed, watermark = 10:00:05. Events arriving with ET < 10:00:05 are now late.
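That walkthrough can be replayed with a small tracker implementing W = max(event_time seen) − B — a sketch of the mechanism, using seconds after 10:00:00 as event times:

```python
class BoundedOutOfOrdernessWatermark:
    """W = max(event_time seen so far) - B.
    An event is late iff its event_time is strictly below W."""
    def __init__(self, b_seconds: float):
        self.b = b_seconds
        self.max_et = float("-inf")

    def watermark(self) -> float:
        return self.max_et - self.b

    def observe(self, event_time: float) -> bool:
        """Returns True if the event is on time, False if late.
        Advances the max-event-time high-water mark either way."""
        on_time = event_time >= self.watermark()
        self.max_et = max(self.max_et, event_time)
        return on_time

wm = BoundedOutOfOrdernessWatermark(5)   # B = 5 s, as in the example
for et in [0, 2, 1, 4]:                  # 10:00:00, :02, :01, :04
    wm.observe(et)
assert wm.watermark() == -1              # 10:00:04 - 5s = 09:59:59
assert wm.observe(3) is True             # ET 10:00:03 > W: accepted
wm.observe(10)                           # W advances to 10:00:05
assert wm.observe(3) is False            # same ET is late now
```

A multi-input operator would hold one such tracker per input and take `min` of their watermarks — which is exactly why one slow source stalls the whole graph.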
Choosing B (bounded out-of-orderness)
Empirical approach:
- Instrument production: log processing_time - event_time for every event.
- Compute percentiles: p50, p95, p99, p99.9 of lateness.
- Pick B as p99 or p99.9 — the tail you're willing to drop vs block.
Math: if B = p99 lateness, ~1% of events are late (dropped or routed to side output).
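The empirical approach above, sketched end-to-end (a naive nearest-rank percentile for illustration; production code would use a proper quantile sketch over the lateness stream):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: sort and index. Fine for a sketch;
    use t-digest/HDR histograms on real, unbounded lateness streams."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(p * len(s)))
    return s[idx]

# Toy right-skewed lateness distribution (seconds), like the one described:
# most events arrive fast, a thin tail arrives hours or days late.
lateness = [0.5] * 900 + [20.0] * 90 + [300.0] * 9 + [86_400.0]

B = percentile(lateness, 0.99)   # pick p99: ~1% of events will be late
```

With this toy distribution B lands at 300 s (5 minutes) — within the "web/mobile analytics B = 30s–5m" range quoted below, and it drops only the one-in-a-thousand day-late stragglers.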
Higher B:
- Fewer late events.
- Longer window emission delays (latency ↑).
- Larger active state (windows held open longer).
Lower B:
- Shorter latency.
- More late events (correctness ↓).
- Smaller state.
Typical values: web/mobile analytics B = 30s–5m; sensor/IoT with good clocks B = 1–10s; batch-like (daily S3 drops) B = hours.
Watermark propagation in a dataflow graph
In Flink, each operator has its own watermark. Upstream operators propagate watermarks downstream. When an operator has multiple inputs, it takes the minimum watermark of its inputs:
source A: watermark advancing at 10:05
source B: watermark stuck at 10:02 (slow source)
joined operator: watermark = min(10:05, 10:02) = 10:02
This is correct: the downstream operator can't claim "no events before 10:05" because source B might still emit events with event time between 10:02 and 10:05.
Practical implication: a single slow source stalls the entire pipeline. Diagnose by watching per-operator watermarks in Flink UI. Solutions:
- Ensure all sources produce events regularly (heartbeat events if idle).
- Use WatermarkStrategy.forMonotonousTimestamps() where applicable (no out-of-orderness expected).
- Configure idle source detection — a source is marked idle after inactivity and its watermark contribution is ignored.
WatermarkStrategy<Event> strategy = WatermarkStrategy
.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withIdleness(Duration.ofMinutes(1))
.withTimestampAssigner((ev, ts) -> ev.eventTime);
Watermarks across Kafka partitions
A single Kafka topic has multiple partitions. Each is an independent log. Event times within a partition are monotonic only if the producer writes in event-time order (usually not).
Flink's Kafka source computes per-partition watermarks, then combines via min. This is correct but introduces subtle issues:
- A partition with slow producers drags the watermark.
- An empty partition (no events) can cause the overall watermark to freeze unless idleness is configured.
Set watermarks per partition, with idleness detection, and ensure your partitioning key doesn't create empty partitions for prolonged times.
The "punctuated" variant
Instead of heuristics, the producer emits explicit watermark marker records: "I'm at event time T, no more events before T from me." Flink supports this via WatermarkGenerator with onEvent advancing and onPeriodicEmit emitting. Ideal when the source is a database CDC stream with explicit sequencing.
4. Windows — Types, State, and Trade-offs
A window groups events by time (or session). Each window maintains state (the aggregating values) and emits a result when it's ready.
Tumbling window
Fixed size, non-overlapping. Each event belongs to exactly one window.
|----W1----|----W2----|----W3----|
^ ^ ^
events fall in one window
State: O(1) per key per active window. Simplest.
```python
# PySpark Structured Streaming — 1-minute tumbling window
(events
 .withWatermark("event_ts", "5 seconds")
 .groupBy(F.window("event_ts", "1 minute"), "title_id")
 .agg(F.sum("watch_ms").alias("watch_ms")))
```

```java
// Flink
events
    .keyBy(e -> e.titleId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new SumAggregator());
```

Sliding (hopping) window
Fixed size, slides forward by a step smaller than the size. Each event belongs to size/slide windows.
size=10m, slide=1m: every event is in 10 overlapping windows.
State: O(size/slide) × keys. Quickly becomes expensive.
Use when: you need rolling metrics ("watch hours in the trailing 10 minutes, updated every minute").
Tip: sliding windows can often be emulated by a tumbling window at the fine granularity + downstream rollup, which is cheaper.
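The size/slide membership count is easy to verify with a sketch (illustrative helper, integer timestamps; windows that would start before time 0 are clipped):

```python
# Sliding-window assignment math: an event at time ts belongs to every
# window [start, start+size) whose start is a multiple of slide and
# covers ts — i.e., size/slide windows.

def sliding_window_starts(ts, size, slide):
    last_start = ts - ts % slide                 # latest window containing ts
    return [s for s in range(last_start, ts - size, -slide) if s >= 0][::-1]

# size=10, slide=2 -> each event is in 10/2 = 5 windows
starts = sliding_window_starts(ts=11, size=10, slide=2)
print(starts)       # [2, 4, 6, 8, 10]
print(len(starts))  # 5 == size // slide
```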
Session window
Dynamic size, closes when there's a gap of inactivity longer than the session timeout.
```
events:  * * * *    * *      *     (gap >= timeout means new session)
windows: |------|   |---|   |-|
```
State: O(events per active session) × active sessions. Can be unbounded for users with continuous activity.
```java
events
    .keyBy(e -> e.userId)
    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
    .aggregate(new SessionBuilder());
```

Challenge: a single "always-active" user (a bot, a scraper, a stuck client) holds a session open forever, preventing emission. Solutions:
- Maximum session length (cap at e.g. 4 hours).
- Bot filtering at ingestion.
- Late "ejection" if max latency exceeded.
Global window
One window for the entire stream, with custom triggers deciding when to emit. Rarely correct unless you know exactly why you want it (e.g., a running counter that never resets).
Custom windows (e.g., calendar-aligned)
Windows aligned to calendar boundaries ("this day in user's local timezone", "this business week"). Implement with custom window assigners in Flink:
```java
public class CalendarDayWindowAssigner extends WindowAssigner<Event, TimeWindow> {
    @Override
    public Collection<TimeWindow> assignWindows(Event ev, long ts, WindowAssignerContext ctx) {
        ZoneId zone = ZoneId.of(ev.userTimezone);
        ZonedDateTime zdt = Instant.ofEpochMilli(ev.eventTime).atZone(zone);
        ZonedDateTime dayStart = zdt.toLocalDate().atStartOfDay(zone);
        ZonedDateTime dayEnd = dayStart.plusDays(1);
        return Collections.singleton(new TimeWindow(
            dayStart.toInstant().toEpochMilli(),
            dayEnd.toInstant().toEpochMilli()
        ));
    }
    // ... getDefaultTrigger, getWindowSerializer, etc.
}
```

Window state size
For `N_keys` keys and `W_active` simultaneously active windows per key:
- Tumbling: `W_active = 1`; state = `O(N_keys)`.
- Sliding (size `S`, slide `St`): `W_active = S/St`; state = `O(N_keys × S/St)`.
- Session: `W_active = active_sessions`; state = `O(active_sessions × avg_events_per_session)`.

For high cardinality + long windows + many sliding steps, state explodes. Mitigations:
- RocksDB backend (disk-backed state, not memory-bound).
- Pre-aggregate (use `aggregate`, not `apply` — maintains incremental state rather than buffering raw events).
- TTL on state (Flink's `StateTtlConfig`).
5. Triggers and Emission Policy
A trigger decides when a window's current result is emitted. Default: on watermark (once per window, when watermark passes end-of-window). But you can do much more.
When to use custom triggers
- Speculative emission: emit partial results early, even before the window closes. Trade accuracy for latency.
- Per-row triggering: emit every N events (useful for low-latency monitoring).
- Processing-time-based: emit every 10 seconds regardless of watermark (for real-time dashboards).
- Data-driven: emit when some condition in the data is met (e.g., "emit when window's count exceeds threshold").
Example: early + on-watermark + late
```java
events
    .keyBy(e -> e.titleId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .trigger(
        EventTimeTrigger.create()
        // speculative: emit every 10s while window is open (processing time)
        // Flink's built-in trigger doesn't combine these directly;
        // implement a custom Trigger extending EventTimeTrigger
    )
    .aggregate(new ViewerCountAgg());
```

Accumulation mode
When you emit multiple times for the same window, what do the downstream consumers see?
- Accumulating: each emission is the latest running total. Downstream must be an upsert sink that replaces the previous value. Example: `(title_id=X, window=[10:00,10:01), count=42)` replaces the earlier `count=37`.
- Discarding: each emission contains only the new events since the last emission. Downstream must sum across emissions to get the total. Harder downstream, but smaller messages.
- Accumulating + retracting: each emission includes a retraction of the previous value plus the new value. Allows downstream to handle both "live" updates and "final" snapshots. Most expressive, most expensive.
Choose based on downstream semantics. If sink is Kafka with compacted topic keyed by (title_id, window_start): accumulating + upsert. If sink is event log: retracting. If sink is an idempotent append-only store: discarding.
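A toy simulation makes the three modes concrete. Suppose one window's true running count goes 37 → 42 → 50 across three trigger firings (values illustrative):

```python
# What downstream sees under each accumulation mode, for three firings
# of the same window whose running total goes 37 -> 42 -> 50.

running_totals = [37, 42, 50]

accumulating = running_totals                              # latest total each time
discarding = [b - a for a, b in zip([0] + running_totals, running_totals)]
retracting = []                                            # (retract_old, emit_new)
prev = None
for t in running_totals:
    retracting.append((prev, t))
    prev = t

print(discarding)       # [37, 5, 8] -> downstream must sum
print(sum(discarding))  # 50         -> recovers the total
print(retracting)       # [(None, 37), (37, 42), (42, 50)]
```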
6. Late Data Strategies
An event is "late" if it arrives after its window's watermark has passed.
Strategy 1: Drop (default in Spark)
Late events are silently discarded. Fast, simple, incorrect for audit-sensitive workloads.
Strategy 2: Allowed lateness (extend window life)
Hold the window open past watermark by some duration; late events that arrive within this grace period update the window and trigger re-emission.
```java
events
    .keyBy(e -> e.userId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .allowedLateness(Time.minutes(10)) // 10 extra minutes
    .aggregate(new Counter());
```

State cost: windows are held for `window_size + allowed_lateness`. For a 1-minute window with 10-minute lateness, state is held 11 minutes per window.
Downstream must handle re-emissions — sink must be upsert-capable.
Strategy 3: Side output (recommended)
Route late events to a separate "late" stream for reconciliation downstream.
```java
OutputTag<Event> lateTag = new OutputTag<>("late"){};

SingleOutputStreamOperator<Result> result = events
    .keyBy(e -> e.userId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .sideOutputLateData(lateTag)
    .aggregate(new Counter());

DataStream<Event> lateEvents = result.getSideOutput(lateTag);
lateEvents.addSink(new LateEventSink()); // write to separate topic/table
```

Dual-path reconciliation:
- Streaming pipeline handles on-time events.
- A batch job reads the late-event table daily and re-processes affected windows.
- Gold tables are rebuilt from the reconciled silver.
This is the production pattern. It's a Kappa + audit architecture.
Strategy 4: Retractions
Emit a retraction for the previous result and a new result reflecting the late event. Downstream must support retractions (a log of changes, not a snapshot).
Supported natively in Flink Table API with Retract mode. Upsert sinks handle retractions by deletion+insertion.
Measuring lateness in production
Emit a metric per event: `lateness_ms = processing_time - event_time`. Plot the distribution and use it to tune the watermark's out-of-orderness bound `B`.

```python
# Spark Structured Streaming: compute a lateness column
events_with_lateness = events.withColumn(
    "lateness_ms",
    F.unix_millis(F.current_timestamp()) - F.unix_millis(F.col("event_ts"))
)
```

7. Delivery Semantics Proved
"Exactly-once" is often said without precision. Let's nail it down.
Definitions
Let M be a message in a stream. Let effect(M) be the externally-observable effect of processing M (a row written, a counter incremented, a record published).
- At-most-once: for each `M`, `effect(M)` happens 0 or 1 times.
- At-least-once: for each `M`, `effect(M)` happens 1 or more times.
- Exactly-once: for each `M`, `effect(M)` happens exactly 1 time.
The impossibility footnote
In a distributed system with possible crashes, the physical delivery of a message cannot be guaranteed exactly-once: any single network request may or may not have arrived despite no response. But the effect can be, by coordinating retries with deduplication or idempotent sinks.
Hence the industry term "effectively exactly-once": the physical operation may happen multiple times, but the side effect happens exactly once.
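A minimal sketch of that distinction, assuming a sink that dedups on an idempotency key (all names illustrative, not a real client API):

```python
# "Effectively exactly-once": physical delivery happens many times (retries),
# but the sink deduplicates on an idempotency key, so the effect happens once.

class DedupSink:
    def __init__(self):
        self.rows = {}    # idempotency_key -> payload (the observable effect)
        self.writes = 0   # physical write attempts observed

    def write(self, key, payload):
        self.writes += 1
        self.rows.setdefault(key, payload)  # duplicate keys are no-ops

sink = DedupSink()
for _attempt in range(3):                   # producer retries the same message
    sink.write("evt-123", {"amount": 10})

print(sink.writes)     # 3 physical deliveries
print(len(sink.rows))  # 1 effect
```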
Three necessary conditions
For effectively-exactly-once:
- Replayable source. On restart after failure, the system can re-consume from a precisely defined offset. (Kafka offsets satisfy this; naive HTTP push sources do not.)
- Deterministic or idempotent processing. Re-processing the same input produces the same output, OR the sink absorbs duplicates idempotently.
- Atomic commit coordinating source offsets and sink output. Either both the new offsets and the new output are committed, or neither.
Any one of these missing, and "exactly-once" fails.
Spark Structured Streaming's exactly-once
Spark provides effectively-exactly-once for specific combinations:
Mechanism:
- Source tracks offsets per micro-batch (stored in `offsets/` in the checkpoint directory).
- Each micro-batch's computation is deterministic given the offsets.
- Sink commits atomically:
  - File sinks (Parquet, Iceberg, Delta): use a two-phase commit with a `commits/` log. After the data files are written, a commit marker is appended atomically. On restart, uncommitted micro-batches are re-executed.
  - Kafka sink: uses Kafka transactions (`transactional.id`). The micro-batch produces within a transaction; the transaction commits together with the offset commit.
- On failure, restart reads `offsets/` and `commits/` to find the last committed micro-batch, then re-executes the next one deterministically.
Where it breaks:
- Non-deterministic query (`rand()`, `current_timestamp()`, reading from a mutable source).
- Non-atomic sinks (HTTP POST, naive JDBC without a dedup key).
- Kafka transactions disabled or misconfigured.
- Checkpoint directory corruption.
```python
(stream
 .writeStream
 .format("iceberg")
 .option("checkpointLocation", "s3://ckpt/events/")
 .trigger(processingTime="30 seconds")
 .start())
```

Spark guarantees EO when using checkpoints + Iceberg/Delta sinks + deterministic transforms.
Flink's exactly-once via 2PC and checkpoints
Flink uses a variant of the Chandy-Lamport distributed-snapshot algorithm (asynchronous barrier snapshotting) for checkpoints, combined with two-phase commit at sinks.
Barrier-based checkpointing:
- The JobManager injects a barrier with checkpoint ID `C` into each source partition.
- As barriers flow through the dataflow graph, each operator aligns: when it has received the barrier from all its inputs, it snapshots its state to durable storage.
- The state snapshot includes operator state, keyed state, and source offsets.
- The operator forwards the barrier to all its outputs.
- Sinks acknowledge completion. When all sinks ack, checkpoint `C` is globally complete.
```
[Source] ──barrier──▶ [Map] ──barrier──▶ [Window] ──barrier──▶ [Sink]
    ↓                   ↓                    ↓                    ↓
 snapshot            snapshot             snapshot             commit
```
Aligned vs unaligned barriers:
- Aligned: operator waits for barriers from all inputs before snapshotting. Clean semantics but blocks processing of inputs whose barrier arrived first.
- Unaligned (Flink 1.11+): the operator snapshots immediately when the first barrier arrives, including in-flight records. Less backpressure sensitivity; larger snapshots. Enabled via `env.getCheckpointConfig().enableUnalignedCheckpoints()`.
Two-phase commit at sinks:
- On barrier: sink does a pre-commit (write to staging, reserve transaction ID).
- When JobManager confirms the checkpoint globally complete: sink does commit (make data visible).
- On restart: replay uncommitted pre-commits as needed.
```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000); // every 60s
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
env.getCheckpointConfig().setCheckpointTimeout(600_000);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
```

What EO does NOT guarantee
- Ordering across keys: events may still be emitted to the sink in an order different from their arrival.
- No retractions: updates to already-committed results are still visible downstream as new records.
- Cross-system atomicity: EO within one Flink job + one Kafka cluster. Across two Kafka clusters, or Kafka + a different sink, EO requires explicit 2PC support end-to-end.
- Effective-exactly-once for the system, not the business: if the input data is duplicated (upstream producer retried without dedup), EO still processes both copies.
Idempotent sink as an alternative to 2PC
Often simpler than 2PC is to make the sink idempotent by key:
- Write each record with a unique idempotency key.
- The sink's `INSERT IGNORE` / `ON CONFLICT DO NOTHING` / upsert-by-key handles duplicates.
- Combined with at-least-once semantics, this gives effective-exactly-once.
```sql
-- Postgres sink with idempotency
INSERT INTO sink_table (event_id, user_id, ...)
VALUES (?, ?, ...)
ON CONFLICT (event_id) DO NOTHING;
```

Works well and doesn't require sink-coordinator awareness. Limit: sinks that can't enforce uniqueness (append-only files) don't support this pattern.
8. Flink Internals — Dataflow, Barriers, State
Architecture
- JobManager: coordinator; runs the ExecutionGraph, schedules tasks, coordinates checkpoints.
- TaskManagers: workers; each runs `N` slots, and each slot runs a subtask of an operator.
- ResourceManager: allocates TaskManagers (typically via YARN or Kubernetes).
- Dispatcher: receives jobs, spawns JobManagers per job.
Tasks, subtasks, and chains
A Flink job is a DAG of operators. Each operator has a parallelism (how many parallel instances run). Each instance is a subtask. A JobManager assigns subtasks to TaskManager slots.
Operator chaining: if an operator's output goes to another operator 1:1 and no shuffle is needed, Flink chains them into a single thread, hugely reducing serialization overhead. `forward` channels are chained; `rebalance`, `keyBy`, etc. are not.
```java
DataStream<Event> events = env.addSource(kafkaSource);              // source
DataStream<Event> filtered = events.filter(ev -> ev.valid);         // chained with source
KeyedStream<Event, String> keyed = filtered.keyBy(ev -> ev.userId); // NOT chained (keyBy shuffles)
DataStream<Result> windowed = keyed.window(...).aggregate(...);     // new chain
```

Task chain = single thread. Processes records with zero network hops and zero ser/de. Fast.
Keyed state vs operator state
- Keyed state: one value per key per operator.
keyBypartitions the stream; each partition holds state for its keys. The main way Flink scales stateful computation.ValueState<T>,ListState<T>,MapState<K,V>,ReducingState<T>,AggregatingState<IN,OUT>.
- Operator state: per-subtask state, not per-key. Used by sources (offset tracking) and custom operators. Sub-types: broadcast state, list state (union or split).
```java
public class CountByUserFn extends KeyedProcessFunction<String, Event, Result> {
    private ValueState<Long> count;

    @Override
    public void open(Configuration cfg) {
        ValueStateDescriptor<Long> d = new ValueStateDescriptor<>("count", Long.class, 0L);
        d.enableTimeToLive(StateTtlConfig.newBuilder(Time.hours(24)).build());
        count = getRuntimeContext().getState(d);
    }

    @Override
    public void processElement(Event ev, Context ctx, Collector<Result> out) throws Exception {
        Long c = count.value();
        c += 1;
        count.update(c);
        out.collect(new Result(ctx.getCurrentKey(), c));
    }
}
```

State backends
- HashMapStateBackend: in-JVM-memory. Fastest. State bounded by heap. Synchronous checkpoints (pauses JVM). Use for small state (< ~2GB).
- EmbeddedRocksDBStateBackend: on-disk key-value store (RocksDB). State bounded by local disk. Asynchronous and incremental checkpoints. Use for state > a few GB.
RocksDB internals:
- SSTables (Sorted String Tables) on disk, immutable.
- Memtable in memory; flushed to SSTable periodically.
- LSM tree structure: multiple levels of SSTables, compacted periodically.
- State read/write: serialize key, look up in memtable + levels (with bloom filters).
Incremental checkpoints
With RocksDB, Flink checkpoints only the SSTables that changed since the last checkpoint. New SSTables uploaded; unchanged ones referenced by manifest. Reduces checkpoint time from "full state" to "delta state."
```
checkpoint N-1: [sst1, sst2, sst3]
  ... operator writes, compaction happens ...
checkpoint N:   [sst1 (ref), sst4, sst5]   <-- only sst4 and sst5 uploaded
```
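The manifest bookkeeping reduces to set arithmetic (an illustrative sketch, not Flink's actual code):

```python
# Incremental-checkpoint bookkeeping: given the SSTable set at the previous
# checkpoint and the current set, only new files are uploaded; surviving
# files are referenced from the manifest.

def incremental_checkpoint(prev_ssts, curr_ssts):
    upload    = sorted(curr_ssts - prev_ssts)  # new since last checkpoint
    reference = sorted(curr_ssts & prev_ssts)  # already in durable storage
    return upload, reference

prev = {"sst1", "sst2", "sst3"}
curr = {"sst1", "sst4", "sst5"}                # compaction replaced sst2/sst3

upload, reference = incremental_checkpoint(prev, curr)
print(upload)     # ['sst4', 'sst5']
print(reference)  # ['sst1']
```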
Savepoints vs checkpoints
- Checkpoints: system-owned, periodic, optimized for fast recovery. May be incremental.
- Savepoints: user-triggered, versioned, portable across Flink versions (with caveats). Use for explicit restarts (deploys, cluster migrations, schema changes).
```shell
flink savepoint <job_id> s3://savepoints/myjob/
flink run -s s3://savepoints/myjob/savepoint-xxx <new-jar>.jar
```

State migration
When you change the shape of state (add a field, change types), you need a migration. Flink supports:
- Schema evolution via Avro-serialized state (auto).
- Explicit state migration: write a one-off job that reads the savepoint with old schema, writes with new.
- Keyed state type migration: limited; usually requires re-keying from scratch.
9. Kafka Internals — ISR, Controllers, Exactly-Once
Topic, partition, offset
- Topic: named append-only log, split into partitions for parallelism.
- Partition: an ordered sequence of records; each record has a monotonically increasing offset within the partition.
- Producers choose partition (via key hash or explicit selection).
- Consumers read partitions sequentially; track their offset.
Parallelism = number of partitions. One consumer per partition per consumer group is the maximum (more consumers than partitions → idle consumers).
Replication
Each partition has N replicas (typically 3):
- Leader: handles reads and writes.
- Followers: pull from the leader asynchronously.
- ISR (In-Sync Replicas): the leader plus the followers that are within `replica.lag.time.max.ms` of the leader.
Writes:
- `acks=0`: fire-and-forget. Producer doesn't wait for any ack.
- `acks=1`: wait for the leader to persist. Lost if the leader fails before followers catch up.
- `acks=all`: wait for all ISRs to persist. No loss within the replication factor (assuming `min.insync.replicas >= 2`).
Controller and leader election
One broker is the controller (elected via ZooKeeper or KRaft). Controller maintains cluster metadata — which broker leads which partition.
On broker failure:
- Controller detects failure (ZooKeeper session expiry or KRaft heartbeat timeout).
- Controller removes failed broker from ISR for partitions it led.
- Controller elects a new leader from surviving ISR (first in the ISR list).
- Controller propagates leadership change to all brokers and producers.
Without unclean leader election, the new leader is guaranteed to have every committed record. With unclean leader election enabled (unclean.leader.election.enable=true), a non-ISR replica can be elected — at the cost of data loss.
KRaft (ZooKeeper-free Kafka)
Kafka 3.x+ supports KRaft mode: the controller runs internally via Raft. No ZooKeeper. Simpler operationally; faster metadata propagation; supports larger clusters (millions of partitions).
Default in new deployments from 2024 onward.
Kafka exactly-once
Kafka offers exactly-once semantics (EOS) for specific patterns:
Producer idempotence (enable.idempotence=true):
- The producer is assigned a unique `producer_id` (PID) by the broker.
- Each record carries a per-partition sequence number.
- Brokers deduplicate on (PID, partition, sequence) for the lifetime of the PID's epoch.
- This prevents duplicates from producer retries within a single producer session.
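A toy model of the broker-side dedup logic (illustrative only, not the actual broker code):

```python
# Broker-side idempotent-producer dedup: a record is accepted only if its
# sequence number is exactly last_seq + 1 for this producer; a retry that
# re-sends an already-persisted sequence is dropped.

class PartitionLog:
    def __init__(self):
        self.records = []
        self.last_seq = {}  # pid -> last accepted sequence number

    def append(self, pid, seq, value):
        last = self.last_seq.get(pid, -1)
        if seq <= last:
            return "duplicate"      # retry of an acked record: dropped
        if seq != last + 1:
            return "out_of_order"   # gap in sequence: broker rejects
        self.records.append(value)
        self.last_seq[pid] = seq
        return "ok"

log = PartitionLog()
print(log.append(pid=7, seq=0, value="a"))  # ok
print(log.append(pid=7, seq=1, value="b"))  # ok
print(log.append(pid=7, seq=1, value="b"))  # duplicate (producer retry)
print(log.records)                          # ['a', 'b']
```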
Transactional producers (transactional.id=my-tx):
- The producer starts a transaction, writes to multiple partitions/topics, then commits.
- All writes are atomic — consumers with `isolation.level=read_committed` see either all or none.
- Critical for exactly-once stream processing where "read from topic A, write to topics B and C" must be all-or-nothing.
```java
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("out_topic_1", ...));
    producer.send(new ProducerRecord<>("out_topic_2", ...));
    producer.sendOffsetsToTransaction(consumedOffsets, consumerGroup);
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
    throw e;
}
```

Consumer rebalance protocol
When consumers in a group join or leave:
Eager rebalance (classic):
- Coordinator triggers rebalance.
- All consumers revoke their partitions (stop processing entirely).
- Coordinator assigns partitions via strategy.
- Consumers resume.
Stop-the-world can be minutes with many consumers / many partitions. Kills latency SLOs.
Cooperative sticky rebalance (Kafka 2.4+):
- Coordinator identifies partitions to move.
- Only affected consumers revoke only the moving partitions.
- New assignments go to target consumers.
- Non-affected consumers continue uninterrupted.
Minimal disruption. Prefer it in all modern deployments: `partition.assignment.strategy=CooperativeStickyAssignor`.
Log compaction
For "changelog" topics keyed by entity ID, Kafka can compact:
- Keeps only the latest record per key.
- Older records with the same key are garbage-collected during compaction.
- Tombstones (null-valued records) delete the key permanently after a grace period.
```
log before compaction:
  key=user1, val={name:A}
  key=user2, val={name:X}
  key=user1, val={name:A'}
  key=user1, val=null        (tombstone)
  key=user2, val={name:Y}

log after compaction:
  key=user1, val=null        (to be deleted after grace)
  key=user2, val={name:Y}
```
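The compaction semantics can be sketched in Python (a toy model; real compaction works segment by segment and is governed by settings like `min.cleanable.dirty.ratio`):

```python
# Log compaction: keep the latest record per key; a None value (tombstone)
# survives compaction only until its grace period elapses.

def compact(log, tombstone_grace_expired=False):
    latest = {}
    for key, value in log:  # later records overwrite earlier ones
        latest[key] = value
    if tombstone_grace_expired:
        latest = {k: v for k, v in latest.items() if v is not None}
    return latest

log = [
    ("user1", {"name": "A"}),
    ("user2", {"name": "X"}),
    ("user1", {"name": "A'"}),
    ("user1", None),  # tombstone
    ("user2", {"name": "Y"}),
]

print(compact(log))                                # {'user1': None, 'user2': {'name': 'Y'}}
print(compact(log, tombstone_grace_expired=True))  # {'user2': {'name': 'Y'}}
```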
Compacted topics serve as durable key-value stores readable as streams. Used for:
- Materializing database CDC (final state of each row).
- State topic backing Kafka Streams applications.
- Configuration distribution.
10. Streaming Joins
Joining two unbounded streams is the hardest operation in streaming. There are several flavors.
Stream-stream window join
Join two streams within a time window: match pairs (a, b) where a.timestamp and b.timestamp fall within a common window and the join key matches.
```java
clicks.join(impressions)
    .where(c -> c.userId).equalTo(i -> i.userId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .apply(new JoinFn());
```

State: O(|clicks| + |impressions|) within the window. Grows linearly with window size and rate.
Stream-stream interval join
Match event pairs within a relative time interval: for each a, find b where a.ts - X ≤ b.ts ≤ a.ts + Y.
```java
clicks.keyBy(c -> c.userId)
    .intervalJoin(impressions.keyBy(i -> i.userId))
    .between(Time.minutes(-1), Time.minutes(5))
    .process(new IntervalJoinFn());
```

Used for causal analysis (impressions preceding clicks) where window alignment would be wrong.
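The matching rule can be sketched with a naive O(n×m) scan (illustrative only; the real operator keeps per-key state and prunes it by watermark):

```python
# Interval join: for each click at time ts_c, match impressions with the
# same key whose timestamp lies in [ts_c - lower_ms, ts_c + upper_ms].

def interval_join(clicks, impressions, lower_ms, upper_ms):
    out = []
    for key_c, ts_c in clicks:
        for key_i, ts_i in impressions:
            if key_c == key_i and ts_c - lower_ms <= ts_i <= ts_c + upper_ms:
                out.append((key_c, ts_c, ts_i))
    return out

clicks      = [("u1", 10_000)]
impressions = [("u1", 6_000), ("u1", 14_000), ("u1", 30_000), ("u2", 10_000)]

# 5s before through 5s after each click
print(interval_join(clicks, impressions, lower_ms=5_000, upper_ms=5_000))
# [('u1', 10000, 6000), ('u1', 10000, 14000)]
```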
Stream-table (temporal) join
Join a stream of events against a slowly-changing table (often a dim broadcast or a compacted topic). Each event looks up the version of the table that was current at event time.
Implementation patterns:
- Broadcast state: broadcast the table updates to all parallel instances of the join operator. Each instance keeps the table in local state. Great when the table is small (< 1 GB).
- Keyed state + async lookup: each event does an async DB lookup, keeping a cache in keyed state.
- Versioned table joins (Flink Table API): `FOR SYSTEM_TIME AS OF` syntax.
```sql
-- Flink SQL: temporal join
SELECT o.*, c.country, c.subscription_tier
FROM orders o
JOIN user_dim FOR SYSTEM_TIME AS OF o.event_ts AS c
  ON o.user_id = c.user_id;
```

The engine maintains the history of `user_dim`; the join picks the version of each user that was current at `event_ts`.
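The point-in-time lookup reduces to "latest version with `valid_from <= event_ts`", sketched here against an illustrative in-memory version history:

```python
import bisect

def as_of(versions, ts):
    """versions: list of (valid_from, row), sorted by valid_from.
    Returns the row that was current at ts, or None if ts predates history."""
    starts = [v[0] for v in versions]
    i = bisect.bisect_right(starts, ts) - 1
    return versions[i][1] if i >= 0 else None

user_dim_history = [
    (0,     {"tier": "free"}),
    (1_000, {"tier": "basic"}),
    (5_000, {"tier": "premium"}),
]

print(as_of(user_dim_history, 999))    # {'tier': 'free'}
print(as_of(user_dim_history, 4_500))  # {'tier': 'basic'}
print(as_of(user_dim_history, 9_999))  # {'tier': 'premium'}
```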
Stream enrichment via async I/O
When the dimension is too large for broadcast state but must be looked up per event:
```java
AsyncDataStream.unorderedWait(
    events,
    new AsyncDatabaseLookup(),
    30, TimeUnit.SECONDS,  // timeout
    1000                   // capacity (max concurrent lookups)
).map(enriched -> process(enriched));
```

Concurrent async lookups keep the pipeline fed while external systems respond. Critical: use `unorderedWait` if order doesn't matter (higher throughput); `orderedWait` otherwise.
11. State Management at Scale
Production streaming pipelines accumulate TBs of state. Managing it is its own discipline.
State TTL
Per-state configurable time-to-live. Flink's StateTtlConfig:
```java
StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.hours(24))
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .cleanupInRocksdbCompactFilter(1000)
    .build();
```

Cleanup strategies:
- In-access: expired state is removed on read (cheap but delays cleanup).
- Full-snapshot: cleanup during checkpoint (exhaustive but costly).
- RocksDB compaction filter: cleanup during RocksDB's background compaction. Preferred for large state.
Bounded session windows via custom expiration
Session windows without explicit max length can grow without bound. Implement max session length:
```java
public class BoundedSessionFn extends KeyedProcessFunction<String, Event, Session> {
    private static final long SESSION_GAP_MS = 30 * 60 * 1000;      // 30-minute inactivity gap
    private static final long MAX_SESSION_MS = 4 * 60 * 60 * 1000;  // hard cap: 4 hours

    private ValueState<Session> current;

    @Override
    public void open(Configuration cfg) {
        current = getRuntimeContext().getState(
            new ValueStateDescriptor<>("current-session", Session.class));
    }

    @Override
    public void processElement(Event ev, Context ctx, Collector<Session> out) throws Exception {
        Session s = current.value();
        if (s == null || ev.ts > s.lastEventTs + SESSION_GAP_MS) {
            // new session; emit the previous one if it exists
            if (s != null) out.collect(s);
            s = new Session(ev);
        } else {
            s.add(ev);
            if (s.durationMs() > MAX_SESSION_MS) {
                out.collect(s);
                s = new Session(ev); // start a new session from this event
            }
        }
        current.update(s);
        ctx.timerService().registerEventTimeTimer(ev.ts + SESSION_GAP_MS);
    }

    @Override
    public void onTimer(long ts, OnTimerContext ctx, Collector<Session> out) throws Exception {
        Session s = current.value();
        if (s != null && ts >= s.lastEventTs + SESSION_GAP_MS) {
            out.collect(s);
            current.clear();
        }
    }
}
```

State partitioning evolution
Change the key you're partitioning on? Flink doesn't let you; you must:
- Savepoint current state.
- Write a migration job that reads savepoint, rewrites state under new keying.
- Start new job from migrated savepoint.
Or: start fresh, lose history. Painful; plan keys carefully upfront.
State size budgeting
Estimate:
```
state_size = num_keys × bytes_per_key_state × num_windows_active_per_key
```
Example:
- 100M users
- 1 KB state per user (a few counters, last-event-ts)
- 1 window active per user (current session)
- → 100 GB of state.
RocksDB backend + local SSD + incremental checkpoints handle this fine. HashMap backend wouldn't (100 GB heap is infeasible).
For 1B users × 1 KB = 1 TB state: still viable with RocksDB + fast SSDs. Above that, consider sharding across multiple Flink jobs or moving state to an external store with async lookup.
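The budgeting arithmetic as a tiny helper (decimal GB, matching the back-of-envelope numbers above):

```python
# state_size = num_keys x bytes_per_key x windows_active_per_key, in decimal GB.

def state_size_gb(num_keys, bytes_per_key, windows_per_key=1):
    return num_keys * bytes_per_key * windows_per_key / 1e9

print(state_size_gb(100_000_000, 1_000))    # 100.0  -> 100 GB: RocksDB territory
print(state_size_gb(1_000_000_000, 1_000))  # 1000.0 -> ~1 TB: still viable on fast SSDs
```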
12. Backpressure and Flow Control
Backpressure: when a downstream operator can't keep up, upstream must slow down. If it doesn't, unbounded buffers, OOM, or data loss follow.
How Flink signals backpressure
Flink uses credit-based flow control. Each downstream operator advertises how much buffer space it has (credit); upstream only sends when credit is available.
If downstream is slow, credit is consumed faster than replenished, and upstream naturally blocks. This propagates back to the source, which (if Kafka) simply stops pulling records.
Diagnosing backpressure
In the Flink UI, each operator shows backpressure status (OK / LOW / HIGH). Trace the chain:
- Find the operator showing HIGH backpressure.
- Its downstream is the bottleneck.
Metrics to check:
- `outputQueueLength`: how full the output buffer is.
- `inputQueueUsage`: same for inputs.
- `numRecordsInPerSecond` vs `numRecordsOutPerSecond`: a disparity indicates buffering.
Fixing backpressure
- Increase downstream parallelism: run more instances of the slow operator.
- Optimize the slow operator: profile; CPU? I/O? state access? The usual culprits are: Python UDFs (in PyFlink), synchronous external calls, state backend misuse.
- Use async I/O for external lookups.
- Shard state across more keys: if one key is hot, repartition with a salt (see Spark's skew fix — same principle applies).
Backpressure and checkpoints
Aligned checkpoints require barriers to reach operators; if an operator is backpressured, barriers pile up behind records in the input buffer, delaying the snapshot. Severe backpressure → checkpoint timeouts → job restarts → fallback loop.
Unaligned checkpoints (Flink 1.11+) mitigate: barriers "jump" ahead of in-flight records in the buffer. The snapshot includes those records as part of the channel state. Larger snapshots, but doesn't stall under backpressure.
```java
env.getCheckpointConfig().enableUnalignedCheckpoints();
```

Enable in production wherever backpressure is possible.
13. Lambda vs Kappa Revisited
Lambda architecture
Two pipelines:
- Batch layer — periodic, fully-accurate, high-latency.
- Speed layer — streaming, best-effort, low-latency.
- Serving layer — merges batch and speed views; clients see the combined result.
Problems:
- Two codebases. Every business rule implemented twice. Drift inevitable.
- Two debugging surfaces. Which layer produced this wrong number?
- Two SLA sets. When one layer lags, is the combined answer correct?
Kappa architecture
One pipeline, streaming. Batch is "streaming with a bigger window" or "streaming over a replay."
- Source: Kafka with long retention (or a lakehouse with time travel).
- Processing: streaming engine in both online and replay modes.
- Correction: replay the stream from the past to reprocess.
One codebase, one mental model.
The pragmatic middle: streaming + batch audit
The most common production pattern:
- Streaming pipeline provides low-latency "hot" answers.
- Batch pipeline runs daily/hourly over the full bronze layer to produce "certified" results.
- Differences between hot and certified are monitored; large gaps are incidents.
- Gold consumption reads whichever is appropriate (or blends by SLO).
This gives streaming's freshness AND batch's correctness guarantees, without Lambda's double-maintenance burden (the streaming version is the "real" code; the batch is a reference/audit path).
14. Streaming Pipeline Example End-to-End
A full example: Netflix-style playback sessionization.
Data flow
```
Mobile client ─► Kafka (events) ─► Flink (sessionize) ─► Kafka (sessions)
                                        │
                                        ├─► Iceberg bronze (raw append)
                                        │
                                        └─► Redis (real-time counters)
```
Flink sessionization job
DataStream<PlaybackEvent> events = env
.fromSource(
KafkaSource.<PlaybackEvent>builder()
.setBootstrapServers("kafka:9092")
.setTopics("playback.events")
.setGroupId("sessionizer")
.setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.LATEST))
.setValueOnlyDeserializer(new PlaybackEventDeserializer())
.build(),
WatermarkStrategy.<PlaybackEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
.withIdleness(Duration.ofMinutes(1))
.withTimestampAssigner((ev, ts) -> ev.eventTime),
"kafka-playback"
);
// Validate and filter
DataStream<PlaybackEvent> validated = events
.filter(ev -> ev.isValid())
.name("validate");
// Sessionize by sessionId with 30-minute event-time gap
OutputTag<PlaybackEvent> lateTag = new OutputTag<>("late-events"){};
SingleOutputStreamOperator<PlaybackSession> sessions = validated
.keyBy(ev -> ev.sessionId)
.window(EventTimeSessionWindows.withGap(Time.minutes(30)))
.allowedLateness(Time.minutes(15))
.sideOutputLateData(lateTag)
.trigger(EventTimeTrigger.create())
.aggregate(
new SessionAggregator(), // incremental aggregation
new SessionEmitter() // attach window metadata on fire
)
.name("sessionize");
// Primary sink: Kafka (for downstream real-time consumers)
sessions.sinkTo(
KafkaSink.<PlaybackSession>builder()
.setBootstrapServers("kafka:9092")
.setRecordSerializer(new SessionSerializer())
.setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
.setTransactionalIdPrefix("session-sink-")
.build()
).name("sink-sessions");
// Secondary sink: Iceberg (for batch analytics)
sessions.sinkTo(
IcebergSink.forRowData(...)
.table(catalog.loadTable(tableId("silver", "fact_playback_session")))
.writeParallelism(8)
.distributionMode(DistributionMode.HASH)
.build()
).name("sink-iceberg");
// Late events go to separate Kafka topic for reconciliation
sessions.getSideOutput(lateTag).sinkTo(
KafkaSink.<PlaybackEvent>builder()
.setBootstrapServers("kafka:9092")
.setRecordSerializer(new EventSerializer("playback.events.late"))
.build()
).name("sink-late");
// Checkpointing: exactly-once, incremental
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
env.getCheckpointConfig().enableUnalignedCheckpoints();
env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // incremental
env.getCheckpointConfig().setCheckpointStorage("s3://checkpoints/sessionizer/");
env.execute("playback-sessionizer");
SessionAggregator: incremental state
public class SessionAggregator
implements AggregateFunction<PlaybackEvent, SessionAccumulator, PlaybackSession> {
@Override
public SessionAccumulator createAccumulator() {
return new SessionAccumulator();
}
@Override
public SessionAccumulator add(PlaybackEvent ev, SessionAccumulator acc) {
if (acc.startTs == 0) acc.startTs = ev.eventTime;
acc.endTs = Math.max(acc.endTs, ev.eventTime);
switch (ev.type) {
case HEARTBEAT: acc.watchMs += ev.watchDeltaMs; break;
case BUFFERING: acc.bufferMs += ev.bufferDeltaMs; break;
case SEEK: acc.seeks += 1; break;
case PAUSE: acc.pauses += 1; break;
case ENDED: acc.endedReason = ev.endedReason; break;
}
acc.maxBitrate = Math.max(acc.maxBitrate, ev.bitrateKbps);
acc.sumBitrate += ev.bitrateKbps; acc.countBitrate += 1;
acc.userId = ev.userId;
acc.titleId = ev.titleId;
return acc;
}
@Override
public PlaybackSession getResult(SessionAccumulator acc) {
return new PlaybackSession(
acc.userId, acc.titleId,
Instant.ofEpochMilli(acc.startTs), Instant.ofEpochMilli(acc.endTs),
acc.watchMs, acc.bufferMs, acc.seeks, acc.pauses,
acc.maxBitrate, acc.sumBitrate / Math.max(acc.countBitrate, 1),
acc.endedReason
);
}
@Override
public SessionAccumulator merge(SessionAccumulator a, SessionAccumulator b) {
// Merge two sessions (used by session windows that get merged)
a.startTs = Math.min(a.startTs, b.startTs);
a.endTs = Math.max(a.endTs, b.endTs);
a.watchMs += b.watchMs; a.bufferMs += b.bufferMs;
a.seeks += b.seeks; a.pauses += b.pauses;
a.maxBitrate = Math.max(a.maxBitrate, b.maxBitrate);
a.sumBitrate += b.sumBitrate; a.countBitrate += b.countBitrate;
a.endedReason = a.endedReason != null ? a.endedReason : b.endedReason;
return a;
}
}
What makes this production-grade
- Event-time with watermarks + idleness. Handles out-of-order and idle producers.
- Side-output for late data. Reconciliation path, not silent drops.
- Incremental aggregation (AggregateFunction). State per session is O(1), not O(events).
- Exactly-once Kafka sink with transactional producer.
- Iceberg sink for analytics continuity.
- Unaligned checkpoints. Resilient under backpressure.
- RocksDB backend. State can scale to TB.
- Reasonable parallelism (distribution mode hash). Avoids skew on session IDs.
What to monitor
- Watermark per operator (Flink UI): stuck watermark = no emission.
- Checkpoint duration: increasing duration = state growing or backpressure.
- Records emitted / received per second: discrepancy = backpressure.
- Late event rate: tune bounded out-of-orderness or allowed lateness.
- End-to-end latency: p50/p95/p99 of now() - eventTime at sink.
- Consumer lag on source topics (committed offset behind latest).
- Kafka sink transaction rate and abort rate (high aborts = something wrong).
Next: 04-spark-internals.md — Catalyst, AQE, shuffle internals, join algorithms, and skew.
16. Idempotent Producers and Exactly-Once in Kafka
"Exactly-once" in Kafka is three separate guarantees composed: idempotent producer, transactional writes across partitions, and transactional consumer-offset commits. Handwaving past any of them produces bugs. Senior candidates can dissect each layer.
Layer 1 — Idempotent producer
The producer attaches a producer_id + sequence_number to each record. The broker tracks the highest sequence seen per (producer, partition). Duplicates — retries from the producer after a network blip — are detected and dropped at the broker. This alone gives you "no duplicates within a single producer session to a single partition."
Limits: the guarantee breaks across producer restarts (new producer_id) and across partitions (no coordination). For exactly-once across partitions, you need transactions.
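The broker-side bookkeeping can be sketched in a few lines of plain Python. This is a deliberate simplification — the real broker tracks a window of recent batch sequences per producer epoch and also rejects sequence gaps, rather than only dropping lower sequences:

```python
# Broker-side view of idempotent produce, per (producer_id, partition).
class PartitionLog:
    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> highest sequence appended

    def append(self, producer_id, seq, value):
        """Drop retried batches whose sequence was already appended."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate from a retry: discard, but ack as success
        self.last_seq[producer_id] = seq
        self.records.append(value)
        return True

log = PartitionLog()
assert log.append(producer_id=1, seq=0, value="a")
assert log.append(producer_id=1, seq=1, value="b")
assert not log.append(producer_id=1, seq=1, value="b")  # network-blip retry
assert log.records == ["a", "b"]
```

Note how the per-producer table is lost on producer restart (the new session gets a fresh producer_id) — exactly the limit described above.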
Layer 2 — Producer transactions
Enable with transactional.id and bracket writes with beginTransaction() / commitTransaction(). On commit or abort, the transaction coordinator writes a control marker into each affected partition. Consumers configured with read_committed filter uncommitted records out of their stream; aborted records remain in the log but are skipped past the abort marker.
Layer 3 — The two-phase commit sink
For true end-to-end exactly-once (consume → process → produce), the processor must commit its source offsets and its output records atomically. Kafka supports this via the sendOffsetsToTransaction() API — the source offset commit is included in the same producer transaction as the output records. Kafka Streams uses this mechanism directly; Flink's exactly-once Kafka sink instead runs its own two-phase commit protocol tied to checkpoints.
What can still go wrong
- Zombie producers. A restarted producer with the same transactional.id gets a bumped epoch and fences the old instance. If you reuse transactional.id naively across pods, you will see fence exceptions in production.
- External side effects. If your processor writes to Kafka and calls a REST API, only Kafka is in the transaction. The REST call can happen twice.
- Read_uncommitted consumers. A downstream consumer not set to read_committed sees uncommitted records. EOS is a configured guarantee, not a default.
17. Complex Event Processing and MATCH_RECOGNIZE
When the interview question is "find users who did X then Y then Z within 10 minutes," you've hit a Complex Event Processing (CEP) problem. Three implementation paths.
Path 1 — Flink CEP library
Flink ships a native CEP API: Pattern.begin("a").where(...).next("b").where(...).within(Time.minutes(10)). The engine compiles to an NFA (non-deterministic finite automaton) and matches against the event stream with full watermark-aware correctness.
Path 2 — SQL MATCH_RECOGNIZE
Standardized in SQL:2016, supported by Flink SQL, Oracle, and Snowflake. Expresses pattern-match queries declaratively:
SELECT *
FROM events
MATCH_RECOGNIZE (
PARTITION BY user_id
ORDER BY event_ts
MEASURES A.event_ts AS start_ts, C.event_ts AS end_ts
ONE ROW PER MATCH
PATTERN (A B* C)
DEFINE
A AS A.event_name = 'view_plan',
B AS B.event_name IN ('hesitate','back','scroll'),
C AS C.event_name = 'purchase'
AND C.event_ts <= A.event_ts + INTERVAL '10' MINUTE
);
Path 3 — Hand-rolled stateful operator
For simple patterns (A then B), a stateful Flink operator keyed by user_id is often cleanest. Store a last-seen-A timestamp per key; on each B event, check whether last-seen-A is within the window. Trade-off: simpler code than CEP, but doesn't generalize to longer patterns.
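A minimal sketch of that operator in plain Python — the per-key dict stands in for Flink's keyed ValueState, and the (key, type, event_time_ms) event shape is assumed for illustration:

```python
# "A then B within N ms" per key; state is last-seen-A timestamp only.
class AThenBDetector:
    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.last_a_ts = {}  # key -> last-seen-A event time (Flink: ValueState)

    def on_event(self, key, event_type, ts):
        """Return a match tuple when B follows A within the window, else None."""
        if event_type == "A":
            self.last_a_ts[key] = ts
            return None
        if event_type == "B":
            a_ts = self.last_a_ts.get(key)
            if a_ts is not None and ts - a_ts <= self.window_ms:
                del self.last_a_ts[key]  # fire at most once per A
                return (key, a_ts, ts)
        return None

det = AThenBDetector(window_ms=600_000)  # "within 10 minutes"
assert det.on_event("u1", "A", 1_000) is None
assert det.on_event("u1", "B", 300_000) == ("u1", 1_000, 300_000)
assert det.on_event("u2", "B", 300_000) is None  # B with no preceding A
```

A production version would also register a timer to expire stale last-seen-A state after window_ms, which this sketch omits.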
Interview probe
"How would you detect account-takeover patterns in real time?" Strong answer: pattern-match on [login-success, IP-change, password-change, payment-added] within N minutes; use MATCH_RECOGNIZE on Flink SQL with a 30-minute watermark; emit to an alert topic; tune false positives with threshold + allow-list.
18. Stream-Table Duality in Depth
Every table is a compressed history of a stream. Every stream is the log of changes to a table. Systems that expose this duality directly (Kafka Streams, Flink, ksqlDB, Debezium + materialized views) let you reason about processing without the batch/stream dichotomy. Senior candidates internalize this until it becomes the natural lens.
Table → Stream (changelog)
Given a table of current state, emit its change log. Tools: CDC from the database's write-ahead log (Debezium on Postgres/MySQL/Oracle), DELETE+INSERT triggers, or a MERGE-with-history pattern. The changelog is itself a topic others can subscribe to.
Stream → Table (materialized view)
Given a changelog stream, maintain a current-state table by replaying with upsert semantics. In Kafka Streams, this is KTable. In Flink, it's the result of SELECT ... GROUP BY key with retractions. The table updates in real time as events arrive.
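The replay itself is just upsert semantics over a keyed log. A plain-Python sketch, assuming the common convention that a None value is a delete tombstone:

```python
# Stream -> table: replay a changelog with last-write-wins upserts.
def materialize(changelog):
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)   # tombstone: delete the key
        else:
            table[key] = value     # upsert: last write wins
    return table

changelog = [("u1", "basic"), ("u2", "premium"), ("u1", "premium"), ("u2", None)]
assert materialize(changelog) == {"u1": "premium"}
```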
The two join semantics
- Stream-table join. Each event on the stream is enriched with the current value of the table. No watermarks needed on the stream side.
- Stream-stream join. Two streams joined within a time window; both need watermarks; late arrivals produce retractions. Substantially more expensive to reason about.
The temporal table join
Sometimes you want "enrich this event with the table state as of the event's timestamp" — not the current value. This is a temporal table join. Flink supports it natively; implementing it by hand requires a versioned table with valid-from / valid-to per row and an as-of subquery.
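The hand-rolled version reduces to a binary search over version timestamps. A plain-Python sketch — the VersionedTable class and its rate values are illustrative, not any library's API:

```python
import bisect

# Versioned table keyed by valid-from timestamp; as_of() answers
# "what was the value for this key at this event's timestamp?"
class VersionedTable:
    def __init__(self):
        self.versions = {}  # key -> sorted list of (valid_from, value)

    def put(self, key, valid_from, value):
        self.versions.setdefault(key, []).append((valid_from, value))
        self.versions[key].sort()

    def as_of(self, key, ts):
        rows = self.versions.get(key, [])
        # last version whose valid_from <= ts (sentinel makes ties inclusive)
        i = bisect.bisect_right(rows, (ts, chr(0x10FFFF)))
        return rows[i - 1][1] if i else None

rates = VersionedTable()
rates.put("EUR", valid_from=100, value="1.05")
rates.put("EUR", valid_from=200, value="1.10")
assert rates.as_of("EUR", 150) == "1.05"   # event at t=150 sees the old rate
assert rates.as_of("EUR", 250) == "1.10"
assert rates.as_of("EUR", 50) is None      # before any version
```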
19. Stateful Functions and Application-Level Patterns
Flink's Stateful Functions and Kafka Streams' Processor API blur the line between stream processing and general-purpose event-driven applications. Three patterns to recognize.
Pattern A — Stateful session accumulation
Keyed by user. State: list of events in current session, last-seen timestamp. Trigger: session window close on 30-minute inactivity. Emit: one summary record per session. Classic streaming sessionization — preferable to SQL sessionization when you need the result within seconds, not the next batch.
Pattern B — The operator-as-state-machine
Keyed by order_id. State: current lifecycle phase (placed, paid, shipped, delivered). Input: milestone events. Transitions: (placed, payment_received) → paid. Invalid transitions emit to a dead-letter side-output. Gives you an accumulating snapshot in streaming form, with explicit invariants.
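The transition table makes the invariants explicit. A plain-Python sketch with a hypothetical order lifecycle; in Flink the state dict would be keyed ValueState and the dead-letter list a side output:

```python
# Valid transitions: (current phase, event) -> next phase.
VALID = {
    ("placed", "payment_received"): "paid",
    ("paid", "shipment_created"): "shipped",
    ("shipped", "delivery_confirmed"): "delivered",
}

def process(events):
    state = {}        # order_id -> current phase
    dead_letter = []  # invalid transitions, for reconciliation
    for order_id, event in events:
        phase = state.get(order_id, "placed")
        nxt = VALID.get((phase, event))
        if nxt is None:
            dead_letter.append((order_id, phase, event))
        else:
            state[order_id] = nxt
    return state, dead_letter

state, dlq = process([
    ("o1", "payment_received"),
    ("o1", "shipment_created"),
    ("o2", "delivery_confirmed"),  # skipped paid/shipped: invalid
])
assert state == {"o1": "shipped"}
assert dlq == [("o2", "placed", "delivery_confirmed")]
```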
Pattern C — Distributed cache with TTL
Keyed by feature_id. State: cached value + expiry. Periodic timer refreshes entries before expiry. Serves as an online feature store that can live alongside the pipeline rather than in a separate system. Useful when the feature set is small enough to fit in state and latency matters more than scale.
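A plain-Python sketch of the idea — here the refresh is modeled lazily on read rather than with a processing-time timer, and the loader callback (hypothetical) stands in for the source of truth:

```python
import time

# Keyed state as a read-through cache with per-entry expiry.
class TTLCache:
    def __init__(self, ttl_s, loader, clock=time.monotonic):
        self.ttl_s, self.loader, self.clock = ttl_s, loader, clock
        self.entries = {}  # key -> (expiry_ts, value)

    def get(self, key):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                        # fresh: serve from state
        value = self.loader(key)                 # expired or missing: refresh
        self.entries[key] = (now + self.ttl_s, value)
        return value

loads = []
cache = TTLCache(ttl_s=60, loader=lambda k: loads.append(k) or f"feat:{k}")
assert cache.get("f1") == "feat:f1"
assert cache.get("f1") == "feat:f1"  # second read served from state
assert loads == ["f1"]               # loader called exactly once
```

The injectable clock keeps the expiry logic testable without sleeping.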
Spark Internals
"Spark is a distributed compiler and memory manager that happens to execute SQL." — if you can hold that sentence and defend it, you're at L5.
This chapter goes past the "Spark has an optimizer" talking points. We walk through what Catalyst actually does rule-by-rule, how Adaptive Query Execution (AQE) re-plans at runtime, how the shuffle service works on disk and over the network, how each join algorithm is chosen with the actual math, how skew is detected and split, and how Tungsten manages memory off-heap with its own binary format.
Contents
- The layered architecture you must carry in your head
- Catalyst: logical plan → optimized plan → physical plan
- The cost model (such as it is) and why it's wrong
- Adaptive Query Execution (AQE): the runtime re-planner
- Shuffle: the disk-and-network truth
- Join algorithms with real math: BHJ, SMJ, SHJ, BNLJ
- Skew: detection, splitting, salting, AQE handling
- Tungsten: off-heap memory, UnsafeRow, codegen
- Memory model: execution, storage, user, reserved
- Partitioning, coalesce, repartition — when each is wrong
- Broadcast internals: why 10 MB default and how to push it
- Pandas UDFs, Arrow, and the JVM↔Python boundary
- Writing the plan diff: a real query walkthrough
- Configuration cheat sheet — what each knob actually does
1. The layered architecture you must carry in your head
When a user submits spark.sql("SELECT ..."), control flows through five distinct layers. Know the layer that owns the problem and you'll skip 90% of the debugging time:
- SQL parser / DataFrame API — produces an unresolved logical plan (tree of LogicalPlan nodes).
- Analyzer — resolves identifiers against the catalog, types, UDFs. Output: resolved logical plan.
- Optimizer (Catalyst) — rule-based rewrites on the logical plan. Output: optimized logical plan.
- Planner — translates logical operators to physical operators (with strategies). Output: physical plan.
- Execution (Tungsten + shuffle + RDD) — whole-stage codegen compiles physical operators into bytecode, shuffle ships data between stages, tasks run on executors.
Debugging rule: if EXPLAIN FORMATTED doesn't show what you expect at layer N, don't go looking at layer N+1. A filter that fails to push down is an optimizer problem, not a shuffle problem.
Use:
df.explain(mode="formatted") # full plan
df.explain(mode="cost")       # plan + cost statistics (post-AQE if enabled)
For even deeper diagnosis:
spark.conf.set("spark.sql.planChangeLog.level", "WARN")
# prints every rule that fires during optimization
2. Catalyst: logical plan → optimized plan → physical plan
Catalyst is a tree-rewriting framework with four passes:
Parser → Unresolved Logical Plan
Analyzer → Resolved Logical Plan (binds column names, checks types)
Optimizer → Optimized Logical Plan (rule-based, equivalent rewrites)
Planner → Physical Plan (chooses strategies: join, aggregate, etc.)
2.1 The tree
A logical plan is an immutable tree of nodes (Project, Filter, Join, Aggregate, LeafNode like Relation). Optimizations are rules — functions LogicalPlan → LogicalPlan that pattern-match on subtrees and return a rewritten tree.
// Simplified Catalyst rule idea
object PushDownFilter extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case Filter(cond, Project(fields, child)) if cond.references.subsetOf(child.outputSet) =>
Project(fields, Filter(cond, child)) // push Filter under Project
}
}
2.2 The rule batches (what actually fires, in order)
The optimizer runs batches of rules, many to fixed point. The production list is long; memorize the greatest hits:
| Batch | Key rules | What it does |
|---|---|---|
| Finish Analysis | ResolveReferences, ResolveSubquery |
resolves names, types |
| Subquery | RewritePredicateSubquery |
correlated subquery → semi/anti join |
| Infer Filters | InferFiltersFromConstraints |
a = b AND a = 5 infers b = 5 |
| Operator Optimizations | PushDownPredicate, ColumnPruning, CombineFilters, ConstantFolding, BooleanSimplification, SimplifyCasts, NullPropagation, EliminateOuterJoin |
the heart of Catalyst |
| Join Reorder | ReorderJoin, CostBasedJoinReorder |
left-deep tree by row-count estimate |
| Decimal Optimizations | DecimalAggregates |
promote decimals for overflow safety |
| LocalRelation | ConvertToLocalRelation |
fold static literals |
Two rules do 80% of the work:
- PushDownPredicate: drives filter push-down through Project, Join, Aggregate, down into the FileSource so Parquet reads the minimum bytes.
- ColumnPruning: keeps only needed columns, eliminating entire column chunks from Parquet/ORC reads.
When they don't fire: UDFs (treated as opaque black boxes for filter push-down; marking a UDF deterministic allows some reordering but does not push it into the source), non-deterministic expressions, window functions (which block push-down past the window), and filters that cast partition columns to incompatible types.
2.3 Analyzing the optimized plan
df = (spark.table("bronze.playback_events")
.filter("event_date = '2026-04-15'")
.filter("user_id = 1234")
.select("title_id", "watch_ms"))
df.explain(mode="formatted")
Optimized logical plan (expected):
== Optimized Logical Plan ==
Project [title_id#1, watch_ms#2]
+- Filter (event_date = 2026-04-15 AND user_id = 1234)
+- Relation bronze.playback_events[...] parquet
Both filters combined, column pruning applied. Now at physical:
== Physical Plan ==
*(1) Project [title_id#1, watch_ms#2]
+- *(1) Filter (user_id = 1234)
+- *(1) ColumnarToRow
+- FileScan parquet bronze.playback_events
PartitionFilters: [event_date = 2026-04-15]
PushedFilters: [EqualTo(user_id, 1234)]
ReadSchema: struct<title_id:bigint,watch_ms:bigint>
The *(1) prefix means whole-stage codegen stage 1. PartitionFilters = Hive/directory-level pruning. PushedFilters = Parquet row-group filter using min/max statistics.
2.4 When plans go wrong: the "why didn't it push down" checklist
In order of likelihood:
- A UDF touched the column (Catalyst treats most UDFs as opaque).
- The column is inside a Window or collect_list boundary.
- A COALESCE(col, 0) = 5 — push-down works only on simple forms; COALESCE blocks it. Rewrite as col = 5 OR (col IS NULL AND 5 = 0).
- The Parquet file has no statistics (written by an old or broken writer).
- Data type mismatch (string column filtered against int) → implicit cast disables push-down.
- You used cache() above the filter; the filter no longer pushes past the cache node.
3. The cost model (such as it is) and why it's wrong
Spark's Cost-Based Optimizer (CBO) uses statistics from ANALYZE TABLE ... COMPUTE STATISTICS stored in the catalog. Disabled by default (spark.sql.cbo.enabled=false).
Key statistics collected:
- Row count, size in bytes
- Per-column: min, max, nullCount, distinctCount (approx, HyperLogLog), avgLen, maxLen
- Histograms (ANALYZE TABLE ... FOR COLUMNS ... WITH HISTOGRAM)
3.1 Why the CBO is a paper tiger
- Stale statistics: nobody re-runs ANALYZE TABLE after every write. The stats you have are from last week.
- Partial statistics: CBO falls back to row-count heuristics when columns lack histograms.
- Bias: join selectivity estimation assumes uniform distribution. Real data is Zipfian (a few users do 100× the activity of the median).
- AQE made it less necessary: at runtime, AQE has real shuffle statistics, which beats any plan-time estimate.
3.2 The two cases where CBO actually earns its keep
- Star schema join reordering: building a left-deep join tree that probes the smallest dimension first. Makes a measurable difference at 10+ joins.
- Choosing broadcast for a reasonably-sized dimension: if the stats say the dimension is 50 MB and spark.sql.autoBroadcastJoinThreshold = 100 MB, CBO chooses BHJ pre-AQE.
3.3 What to run
ANALYZE TABLE silver.dim_title COMPUTE STATISTICS;
ANALYZE TABLE silver.dim_title COMPUTE STATISTICS FOR COLUMNS title_id, genre;
ANALYZE TABLE silver.dim_title COMPUTE STATISTICS FOR ALL COLUMNS;
For partitioned tables you can also do FOR COLUMNS ... PARTITION (event_date='2026-04-15').
4. Adaptive Query Execution (AQE): the runtime re-planner
AQE is the single most important Spark performance feature of the last five years. It flips the model from "compile once, execute" to "compile stage-by-stage, with real statistics from the previous stage's shuffle".
Enabled by default in Spark 3.2+:
spark.sql.adaptive.enabled = true
4.1 What AQE actually does
AQE splits execution at materialization barriers (shuffle, broadcast exchange). After each barrier, it:
- Reads the actual shuffle map output sizes (per partition).
- Re-optimizes the remainder of the plan with those real numbers.
- Executes the next stage.
The three main AQE rules:
| Rule | What it does | Key config |
|---|---|---|
| Coalesce Shuffle Partitions | merges small post-shuffle partitions into fewer, larger ones (reduces task overhead) | spark.sql.adaptive.coalescePartitions.enabled=true, spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB |
| Skew Join Handling | detects skewed partitions, splits them across multiple tasks | spark.sql.adaptive.skewJoin.enabled=true, skewedPartitionFactor=5.0, skewedPartitionThresholdInBytes=256MB |
| Dynamic Join Strategy | converts a planned SMJ to a BHJ at runtime if one side turned out small | spark.sql.adaptive.localShuffleReader.enabled=true |
4.2 The coalesce algorithm
Before AQE: you set spark.sql.shuffle.partitions=200 globally. After a selective filter, you might have 200 tiny partitions of 5 MB each and 200 tasks of overhead. Waste.
AQE algorithm (simplified):
target_bytes = advisory_partition_size_bytes  # e.g. 64 MB
groups, current, current_size = [], [], 0
# Walk partitions in partition-ID order (NOT sorted by size): groups must
# stay contiguous so one reduce task reads a contiguous range of map output.
for p in shuffle_map_output_partitions:
    if current and current_size + p.size > target_bytes:
        groups.append(current)
        current, current_size = [], 0
    current.append(p)
    current_size += p.size
if current:
    groups.append(current)
# Each group becomes one reduce task reading contiguous map output
Result: ~the right number of tasks at the right size, without user tuning.
4.3 The skew handling algorithm
Detection:
median_size = median(partition_sizes)
is_skewed(p) = p.size > median_size * skewedPartitionFactor
AND p.size > skewedPartitionThresholdInBytes
Split: AQE reads the skewed partition in ceil(p.size / advisoryPartitionSizeInBytes) sub-tasks, each doing a streaming join against the full (small) other side.
Trade-off: the "other side" is read once per sub-task — so skew handling only helps when the skewed side is much larger than the other. AQE is conservative: it won't split if the cost is worse.
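The split arithmetic is worth internalizing. A worked example, assuming the default 64 MB advisory size:

```python
import math

# A 2 GiB skewed partition with a 64 MB advisory size splits into 32 sub-tasks.
partition_bytes = 2 * 1024**3
advisory_bytes = 64 * 1024**2

sub_tasks = math.ceil(partition_bytes / advisory_bytes)
assert sub_tasks == 32

# The other side is re-read once per sub-task, i.e. 31 extra scans here —
# which is why AQE only splits when the skewed side dominates that cost.
```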
4.4 Dynamic join strategy conversion
The plan said SMJ, but after filter and aggregation one side's shuffle output is 40 MB (under broadcast threshold). AQE converts to Broadcast Hash Join using the shuffle map output as the broadcast table — cheaper than re-reading the source.
4.5 Verifying AQE fired
In Spark UI, the SQL tab shows the Initial Plan and the Final Plan after AQE. Look for AdaptiveSparkPlan wrapper nodes and CustomShuffleReader with coalesced / skewed indicators.
df.explain(mode="formatted")
# post-execution to see AQE's final plan:
spark.sparkContext.uiWebUrl # browse to SQL tab
4.6 When to turn AQE off (rare)
- Very small queries where re-optimization adds measurable latency (streaming, sub-second).
- Determinism tests where you need plan stability.
- Broken UDFs that crash when the number of partitions changes.
Otherwise: leave it on.
5. Shuffle: the disk-and-network truth
Shuffle is the single biggest cost in any non-trivial Spark job. Understanding exactly what happens buys you a lot of optimization.
5.1 The shuffle lifecycle
Consider a stage boundary like GROUP BY user_id:
[Map Stage] [Shuffle] [Reduce Stage]
Task 1 ──► writes 200 files Block manager + ESS Task 1 ──► reads 1 block
Task 2 ──► writes 200 files advertises locations Task 2 ──► reads 1 block
... to driver ...
Task M ──► writes 200 files Task N ──► reads 1 block
Each map task logically produces N outputs (one per reducer).
Files on disk: legacy hash shuffle materialized M × N files; sort-based shuffle writes M data files + M index files.
Modern Spark uses sort-based shuffle (SortShuffleManager):
- Map task partitions output by partitioner(key) → reducer ID.
- Inserts records into an in-memory buffer, sorted by partition_id (and by key only when map-side aggregation or ordering requires it).
- When buffer full: spills to disk as a sorted file.
- At end of map task: merges all spill files into ONE data file + ONE index file.
- Index file: N entries, each an (offset, length) for reducer R's block.
5.2 The External Shuffle Service (ESS)
By default, map outputs live in the executor's local disk. If the executor dies, reduce tasks can't fetch their blocks — they have to be recomputed. With dynamic allocation, executors die routinely.
The External Shuffle Service (ESS) is a long-running process per NodeManager (or K8s daemonset) that serves shuffle blocks independently of executor lifecycle. Executors write shuffle files locally, but the ESS reads and serves them to reduce tasks.
spark.shuffle.service.enabled = true # enables ESS
spark.dynamicAllocation.enabled = true # now safe to dynamically scale executors
5.3 Push-based shuffle (Magnet — Spark 3.2+)
Problem: the fetch side of shuffle is N reducers × M mappers = random-I/O storm on the ESS disk.
Solution: Magnet pushes mapper output to pre-assigned merger nodes as soon as map tasks finish. The reducer fetches ONE pre-merged block per partition instead of M small blocks.
spark.shuffle.push.enabled = true
spark.shuffle.push.server.mergedShuffleFileManagerImpl = ...
Benefit: sequential reads on ESS, typically 2–3× faster shuffle read for large jobs.
5.4 Shuffle size math
For a job with:
- M map tasks (input partitions)
- N reducer tasks (spark.sql.shuffle.partitions)
- D bytes of shuffled data
Network transfer: D bytes (all must cross the network).
Disk writes (map side): D bytes.
Disk reads (reduce side): D bytes.
Number of fetch connections (without push-based): M × N (tiny messages hurt).
Common mistake: setting spark.sql.shuffle.partitions too high. If D = 10 GB and N = 1000, each partition is 10 MB — fine. If N = 10000, each is 1 MB — task overhead dominates. Target: advisoryPartitionSizeInBytes = 64–128 MB.
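The same arithmetic as runnable Python, using the 10 GB example above (M = 500 is an assumed map-task count for illustration):

```python
# D = 10 GiB shuffled; compare N = 1,000 vs N = 10,000 reduce partitions.
D = 10 * 1024**3   # bytes of shuffled data
M = 500            # map tasks (assumed)

def per_partition_mb(n):
    return D / n / 1024**2

assert round(per_partition_mb(1_000), 2) == 10.24    # ~10 MB each: healthy
assert round(per_partition_mb(10_000), 3) == 1.024   # ~1 MB: overhead dominates

# Fetch connections without push-based shuffle: M × N tiny reads.
assert M * 1_000 == 500_000
```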
5.5 Shuffle on object storage — the Netflix / EMR pattern
On-prem: local NVMe disks + ESS. Cloud: executors are ephemeral, local disks are tiny.
Options:
- EMR on EBS: local but slower; ESS works fine.
- Shuffle plugins that write to S3 (Celeborn, Uniffle, Apache Spark SS on S3): decouples shuffle from executors entirely. Slower per-byte, but enables spot instances / fast scale-down.
- Remote shuffle service (RSS): Apache Celeborn (originally Alibaba's remote shuffle service) and Apache Uniffle (originally Tencent's Firestorm). Now dominant at large shops.
6. Join algorithms with real math: BHJ, SMJ, SHJ, BNLJ
Spark has four join strategies. Catalyst's planner chooses one based on hints, statistics, and sizes. Know the decision tree and its failure modes.
6.1 Decision tree (simplified)
if user hint "broadcast(small)": → BroadcastHashJoin
elif one side < spark.sql.autoBroadcastJoinThreshold (default 10 MB): → BHJ
elif one side fits in memory and has fewer rows (estimated): → ShuffledHashJoin (rare)
elif keys sortable: → SortMergeJoin
else: → BroadcastNestedLoopJoin (correctness last resort)
6.2 Broadcast Hash Join (BHJ)
Cost: O(|large|) per executor. Broadcast cost: driver-collect |small| + fan-out |small| to each executor.
How it works:
- Driver collects small side (collect() over the RDD) into a HashedRelation.
- Each executor joins its slice of the large side by probing the hash table.
- No shuffle.
Math:
- Driver memory: |small|. Rule of thumb: the broadcast table is ~3× larger in driver memory than the raw Parquet size (decompressed, Java object overhead).
- Executor memory: the same |small| on every executor.
- Network: |small| × #executors.
When it breaks:
- Driver OOM at collect().
- spark.driver.maxResultSize (default 1 GB) too small.
- Skewed small side's hash table too big per executor when the threshold is bumped beyond the default 10 MB.
Tuning:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100MB") # raise with care
# Or hint:
from pyspark.sql.functions import broadcast
large.join(broadcast(small), "user_id")
6.3 Sort-Merge Join (SMJ)
Cost: O(|L| log|L| + |R| log|R|) for the shuffle sort; merge itself is O(|L| + |R|).
How it works:
- Both sides shuffled by join key.
- Within each partition, sorted by join key.
- Two pointers walk the sorted streams and emit matches.
Math:
- Memory: bounded (sort-based merge is streaming after the sort).
- Disk: both sides are sorted → spills possible during sort.
- Network: shuffle both sides = |L| + |R|.
Why it's the default for large-large joins: predictable memory behavior, handles arbitrary size.
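The merge phase itself is the classic two-pointer walk; the only subtlety is duplicate keys on both sides, which require emitting the cross product per key group. A plain-Python sketch over pre-sorted inputs:

```python
# Inner merge join of two inputs sorted by key: lists of (key, payload).
def merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Gather the full key group on the right, join each left row with it.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], r[1]) for r in right[j:j_end])
                i += 1
            j = j_end
    return out

L = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
R = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
assert merge_join(L, R) == [(2, "b", "x"), (2, "b", "y"),
                            (2, "c", "x"), (2, "c", "y"), (4, "d", "w")]
```

Only one key group needs to be buffered at a time — the reason SMJ's memory stays bounded after the sort.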
6.4 Shuffled Hash Join (SHJ)
Both sides shuffled, then one side built into an in-memory hash table per partition.
When chosen:
- CBO thinks the build side fits in memory.
- spark.sql.join.preferSortMergeJoin=false (default true in most Spark versions — SHJ is effectively off by default because it can OOM).
Use case: one medium-sized side that fits in executor memory but is too big to broadcast.
6.5 Broadcast Nested-Loop Join (BNLJ)
Fallback for non-equijoin conditions (ON a.x < b.y). Cartesian by default, filtered.
Cost: O(|L| × |R|).
If you see BNLJ in a plan and you didn't intend a cross join, fix the query. Rewrite inequality joins as range joins or pre-filter.
6.6 Join types × strategies — what's supported
Not every strategy supports every join type. INNER, LEFT OUTER, RIGHT OUTER are universal. FULL OUTER needs SMJ (or BHJ with specific conditions). LEFT SEMI / LEFT ANTI are universal. The EXPLAIN tells you what Catalyst picked:
*(3) BroadcastHashJoin [user_id#1], [user_id#10], Inner, BuildRight
*(4) SortMergeJoin [user_id#1], [user_id#10], Inner
*(5) ShuffledHashJoin [user_id#1], [user_id#10], Inner, BuildLeft
7. Skew: detection, splitting, salting, AQE handling
Data skew kills Spark jobs. Here's how to detect and mitigate, in order of sophistication.
7.1 Detecting skew
Symptoms in the Spark UI:
- One task in a stage takes 10× the median duration.
- GC Time column shows 50%+ on the slow task.
- Shuffle Read Size shows a massive outlier partition.
Programmatic check before the job:
from pyspark.sql import functions as F

skew_check = (df.groupBy("join_key")
    .agg(F.count("*").alias("cnt"))   # avoid a column literally named "count"
    .selectExpr("percentile_approx(cnt, 0.50) as p50",
                "percentile_approx(cnt, 0.99) as p99",
                "max(cnt) as max_cnt"))
skew_check.show()
If p99 / p50 > 5 you have skew.
7.2 AQE skew handling (the easy fix)
With spark.sql.adaptive.skewJoin.enabled=true, AQE splits any partition that's:
- > skewedPartitionFactor × median (default 5.0), AND
- > skewedPartitionThresholdInBytes (default 256 MB)
This handles ~80% of skew cases without code changes.
Where it doesn't help: pre-shuffle skew (one input file is enormous), or non-join skew (GROUP BY on a skewed key).
7.3 Salting
When AQE isn't enough — most commonly, aggregations on a skewed key:
from pyspark.sql.functions import col, concat, lit, floor, rand, expr
# Step 1: add a salt to the skewed key on the large side
salt_factor = 100
salted = df.withColumn("salt", floor(rand() * salt_factor)) \
.withColumn("key_salted", concat(col("user_id"), lit("_"), col("salt")))
# Step 2: explode the small side (cross join with salts)
small_exploded = small_df.crossJoin(
spark.range(salt_factor).toDF("salt")
).withColumn("key_salted", concat(col("user_id"), lit("_"), col("salt")))
# Step 3: join on the salted key
joined = salted.join(small_exploded, "key_salted")
Cost: small side grows by salt_factor. Works when small side is truly small.
7.4 Two-stage aggregation (for GROUP BY skew)
from pyspark.sql.functions import floor, rand, sum as sum_  # builtin sum() fails on Columns

# Stage 1: pre-aggregate with salt
stage1 = (df.withColumn("salt", floor(rand() * 100))
    .groupBy("user_id", "salt")
    .agg(sum_("amount").alias("sum_amt")))
# Stage 2: final aggregate without salt
stage2 = stage1.groupBy("user_id").agg(sum_("sum_amt").alias("total"))
Math: stage 1 shuffles with 100 × N distinct keys (uniformly distributed). Stage 2 shuffles a tiny pre-aggregated dataset.
7.5 Isolating the heavy hitters
For extreme skew (one key = 99% of data), split the query:
heavy = df.filter(col("user_id") == "netflix_test_account") # process separately
normal = df.filter(col("user_id") != "netflix_test_account") # normal join
# union at the end
8. Tungsten: off-heap memory, UnsafeRow, codegen
Tungsten is Spark's execution engine rewritten (circa 2015) to eliminate JVM overhead. Three pillars:
- Off-heap memory via sun.misc.Unsafe — Spark manages raw byte arrays outside the JVM heap.
- UnsafeRow — binary row format with direct memory offsets (no Java object per field).
- Whole-stage codegen — generate Java bytecode at runtime that fuses multiple operators into one tight loop.
8.1 UnsafeRow layout
A normal Java Row of 10 fields = 10 boxed objects + header overhead = ~200 bytes. UnsafeRow for the same row = ~80 bytes packed.
+------------------+-----------------+------------------+
| Null bit set | Fixed-width | Variable-length |
| (ceil(N/64) × 8) | (8 bytes each) | (strings, etc.) |
+------------------+-----------------+------------------+
Field access = pointer arithmetic (O(1), no indirection). Enables SIMD-friendly loops.
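The layout can be made concrete with a toy packer. This is an illustrative simplification of the UnsafeRow idea (one bitset word, so ≤ 64 fields; real UnsafeRow also pads variable-length data to 8-byte boundaries):

```python
import struct

# Pack a row as: 8-byte null bitset | 8 bytes per field | var-length data.
# Var fields store (offset << 32 | length) in their fixed-width slot.
def pack_row(values):
    n = len(values)
    var_start = 8 + n * 8            # bitset word + fixed-width region
    null_bits, fixed, var = 0, b"", b""
    for idx, v in enumerate(values):
        if v is None:
            null_bits |= 1 << idx
            fixed += struct.pack("<q", 0)
        elif isinstance(v, int):
            fixed += struct.pack("<q", v)
        else:  # string: pointer into the var-length region
            data = v.encode()
            fixed += struct.pack("<q", ((var_start + len(var)) << 32) | len(data))
            var += data
    return struct.pack("<Q", null_bits) + fixed + var

row = pack_row([1234, None, "drama"])
assert len(row) == 8 + 3 * 8 + 5                      # 37 bytes, vs ~200 boxed
assert struct.unpack_from("<q", row, 8)[0] == 1234    # field 0: O(1) offset math
assert struct.unpack_from("<Q", row, 0)[0] == 0b010   # field 1 is null
```

Reading field i is a single offset computation (8 + i * 8) — no object graph to chase, which is what makes hashing and comparing rows cheap.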
8.2 Whole-stage codegen
For a query like SELECT col1 + col2 FROM t WHERE col3 > 10, Catalyst generates code equivalent to:
while (iter.hasNext()) {
UnsafeRow row = iter.next();
int col3 = row.getInt(2);
if (col3 > 10) {
int col1 = row.getInt(0);
int col2 = row.getInt(1);
int result = col1 + col2;
outputBuffer.putInt(result);
}
}
Instead of an operator tree walking row-by-row with virtual calls, it's one JIT-inlineable hot loop. 2–5× throughput on CPU-bound queries.
Recognize codegen stages in EXPLAIN: operators prefixed with * like *(2) Filter.
8.3 When codegen doesn't kick in
- Operator not supported (some UDFs, complex window specs).
- Generated code too large: Spark falls back when a generated method's bytecode exceeds spark.sql.codegen.hugeMethodLimit (the JVM caps methods at 64 KB of bytecode, and HotSpot stops JIT-compiling far smaller methods).
- Disabled explicitly: spark.sql.codegen.wholeStage=false.
9. Memory model: execution, storage, user, reserved
Spark's unified memory model (since 1.6). Each executor's heap is partitioned as:
Total heap:
┌─────────────────────────────────────────────────────────────┐
│ Reserved (300 MB hardcoded) │
├─────────────────────────────────────────────────────────────┤
│ Spark Memory = (heap - 300MB) × spark.memory.fraction (0.6) │
│ ┌──────────────────────────────┬────────────────────────┐ │
│ │ Execution │ Storage │ │
│ │ (shuffle, joins, aggregates) │ (cache, broadcast) │ │
│ │ │ │ │
│ │ ←── can borrow from Storage ──── can borrow from Exec │ │
│ └──────────────────────────────┴────────────────────────┘ │
│ spark.memory.storageFraction = 0.5 (initial boundary) │
├─────────────────────────────────────────────────────────────┤
│ User Memory = (heap - 300MB) × (1 - 0.6) = 40% default │
│ (user data structures in custom UDFs) │
└─────────────────────────────────────────────────────────────┘
Execution borrows from Storage freely (evicts cached blocks). Storage borrows from Execution only if unused. Execution wins in conflicts.
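The arithmetic above, as a helper you can sanity-check configs with (a sketch; real sizing also depends on off-heap and overhead settings):

```python
def spark_memory_pools(heap_bytes: int,
                       memory_fraction: float = 0.6,
                       storage_fraction: float = 0.5) -> dict:
    """Approximate the unified-memory split for one executor heap.
    Mirrors the diagram: 300 MB reserved, then spark.memory.fraction of the
    remainder for Spark, split between execution and storage by
    spark.memory.storageFraction (a soft, borrowable boundary)."""
    reserved = 300 * 1024 * 1024
    usable = heap_bytes - reserved
    spark_mem = usable * memory_fraction
    return {
        "reserved": reserved,
        "spark_total": int(spark_mem),
        "storage_initial": int(spark_mem * storage_fraction),
        "execution_initial": int(spark_mem * (1 - storage_fraction)),
        "user": int(usable * (1 - memory_fraction)),
    }

pools = spark_memory_pools(8 * 1024**3)   # 8 GB heap
print(pools["spark_total"] // 1024**2)    # ≈ 4735 MB for execution + storage
```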
9.1 Off-heap memory
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size = 4g
Tungsten uses this pool for managed binary data. Adds to executor's container memory request but doesn't contribute to GC pressure. Use when GC is dominant (> 10% of task time).
9.2 The OOM debugging flowchart
When an executor OOMs, check in order:
- Container OOM-killed by YARN/K8s (exit code 137; 143 is SIGTERM, usually preemption, not OOM): total container usage > request. Increase `spark.executor.memoryOverhead` (default max(384 MB, 10% of heap) — often too small for Python/Arrow).
- Java heap OOM: dump the heap and inspect. Usually one of:
- Large broadcast.
- Skewed partition needing to fit in hash table.
- Cached data too large.
- GC overhead OOM: > 98% of time in GC. Almost always skew or over-cached state. Enable off-heap, or raise `spark.memory.fraction` to 0.7.
9.3 memoryOverhead gotcha
Python Pandas UDFs, Arrow, native code (Parquet writer) all use off-heap container memory. If you see Container killed by YARN for exceeding memory limits, your overhead is too low.
Rule of thumb: overhead = 20–30% of executor memory for PySpark jobs with Pandas UDFs.
10. Partitioning, coalesce, repartition — when each is wrong
- `repartition(n)`: full shuffle, round-robin to n partitions. Use when entering a stage and you need more parallelism or a rebalance.
- `repartition(n, col)`: shuffle by hash of `col` to n partitions. Use before a window or GROUP BY if you know the keys are safe.
- `coalesce(n)`: narrow transformation, merges existing partitions WITHOUT a shuffle. `n` must be ≤ current partition count. Use before writing out files.
- `repartitionByRange(n, col)`: sampling + range partitioner for ordered output. Use for sort-merge joins you need to control manually, or for writing out sorted files.
10.1 The classic mistake
```python
df.filter("event_date = '2026-04-15'").coalesce(1).write.parquet(...)
```

You intended one output file. You got: the entire read + filter runs on one executor because `coalesce` propagates upward. This turns a 10-node job into a 1-node job.
Correct:
```python
df.filter("event_date = '2026-04-15'").repartition(1).write.parquet(...)
# OR (better): let AQE coalesce, and use `maxRecordsPerFile`
df.filter(...).write.option("maxRecordsPerFile", 1_000_000).parquet(...)
```

10.2 Writing files — controlling output count
```python
(df.repartition(200, "event_date")
   .sortWithinPartitions("event_date", "user_id")
   .write
   .partitionBy("event_date")
   .option("maxRecordsPerFile", 5_000_000)
   .parquet(path))
```

- `repartition(200, "event_date")`: co-locates all records for a given date in the same task.
- `sortWithinPartitions`: improves Parquet stats and row-group pruning.
- `partitionBy` at write time: one directory per date.
- `maxRecordsPerFile`: bounds each file's size.
11. Broadcast internals: why 10 MB default and how to push it
11.1 The broadcast lifecycle
- Driver `collect()`s the small side into a `HashedRelation`.
- Serializes it (typically with Kryo).
- Publishes it to the `BlockManager` under a `BroadcastBlockId`.
- TorrentBroadcast: executors fetch pieces (~4 MB each) from each other and the driver, BitTorrent-style. Reduces driver egress.
- Each executor caches the broadcast. The join task probes the hash table locally.
11.2 Why 10 MB default
- Driver memory safety: `collect()` must not OOM the driver.
- Network: pushing `|small|` to N executors costs `|small| × N` bytes of driver egress before torrenting kicks in.
- Executor memory: every executor holds a copy.
11.3 Raising the threshold safely
Know your cluster. If:
- Driver has ≥ 8 GB and can comfortably collect the side,
- Executors have ≥ 8 GB,
- There are ≥ 50 executors so the benefit is large,
then:
```python
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "200MB")
```

Also set `spark.driver.maxResultSize` appropriately.
Hint instead of global:
```python
from pyspark.sql.functions import broadcast
fact.join(broadcast(dim), "key")
```

11.4 Broadcast gotchas
- Stale broadcast: a dataset that's borderline 10 MB might grow past the threshold at runtime (old stats). AQE handles this; pre-AQE it fails.
- Collect-time timeout: `spark.sql.broadcastTimeout` (default 300 s). Bump for slow sources.
- Driver OOM on repeated broadcasts: long-running jobs that broadcast inside a loop can retain references.
12. Pandas UDFs, Arrow, and the JVM↔︎Python boundary
A regular Python UDF serializes each row JVM → Python → JVM, one at a time. Throughput: ~10K rows/sec per executor. Terrible.
Pandas UDFs (vectorized UDFs) ship batches of rows via Apache Arrow zero-copy buffers. Throughput: ~1M rows/sec per executor. Essential.
12.1 Types of Pandas UDFs
```python
from typing import Iterator

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# 1. Series → Series (scalar)
@pandas_udf(DoubleType())
def log1p_udf(s: pd.Series) -> pd.Series:
    return pd.Series(np.log1p(s))  # vectorized, no per-element Python loop

# 2. Iterator[Series] → Iterator[Series] (for heavy setup once per worker)
@pandas_udf(DoubleType())
def model_score(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = load_model_once()  # loaded once per Python worker, not per batch
    for s in batches:
        yield pd.Series(model.predict(s.values))

# 3. Grouped map (applyInPandas)
def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["z"] = (pdf["x"] - pdf["x"].mean()) / pdf["x"].std()
    return pdf

df.groupBy("user_id").applyInPandas(
    normalize, schema="user_id long, x double, z double")
```

12.2 The Arrow boundary
JVM executor (scala) Python worker (child process)
+-----------------------+ +------------------------------+
| UnsafeRow batch | socket | Pandas DataFrame |
| → Arrow RecordBatch |===pipe===> | Pandas UDF applies |
| | binary | → Arrow RecordBatch back |
| ← Arrow RecordBatch |<========== | |
+-----------------------+ +------------------------------+
Arrow's columnar representation maps directly to Pandas columns (zero copy for numeric types, small copy for strings).
Config:
spark.sql.execution.arrow.pyspark.enabled = true
spark.sql.execution.arrow.maxRecordsPerBatch = 10000 # tune for memory
12.3 Pandas UDF gotchas
- Memory overhead: each Python worker holds a Pandas DataFrame in memory. If `maxRecordsPerBatch=10000` and rows are 1 KB, that's 10 MB per worker. Multiply by `spark.executor.cores` — it adds up. Increase `spark.executor.memoryOverhead` accordingly.
- Python object columns: strings become Python `str` (object dtype in Pandas 1.x, StringArray in 2.x). `.map()` over them reverts to a Python loop → defeats the vectorization.
- Schema mismatch: returning a DataFrame with a different column order than the declared schema silently produces wrong data. Use `applyInPandas` with an explicit schema and test it.
13. Writing the plan diff: a real query walkthrough
Consider the Netflix-shaped query: "For last 7 days, top 10 titles per country by completions".
```python
completions = (spark.table("silver.fact_playback_completion")
               .filter("event_date >= current_date() - INTERVAL 7 DAYS"))
dim_title = spark.table("silver.dim_title")
dim_country = spark.table("silver.dim_country")

joined = (completions
          .join(dim_title, "title_id")
          .join(dim_country, "country_id"))

from pyspark.sql import Window
from pyspark.sql.functions import count, row_number, col

w = Window.partitionBy("country_name").orderBy(col("completions").desc())
top10 = (joined
         .groupBy("country_name", "title_id", "title_name")
         .agg(count("*").alias("completions"))
         .withColumn("rn", row_number().over(w))
         .filter("rn <= 10"))
top10.explain(mode="formatted")
```

Expected optimized plan:
== Optimized Logical Plan ==
Filter (rn <= 10)
+- Window [row_number() OVER (PARTITION BY country_name ORDER BY completions DESC)]
+- Aggregate [country_name, title_id, title_name], [count(1) AS completions]
+- Project [country_name, title_id, title_name]
+- Join Inner, country_id = country_id
:- Join Inner, title_id = title_id
: :- Filter (event_date >= date_sub(current_date(), 7))
: : +- Relation silver.fact_playback_completion ...
: +- Relation silver.dim_title ...
+- Relation silver.dim_country ...
Expected physical plan (with AQE):
== Physical Plan ==
AdaptiveSparkPlan (isFinalPlan=true)
+- *(6) Filter (rn#5 <= 10)
+- Window [row_number() windowspecdefinition(country_name#6, completions#7 DESC ...) AS rn#5]
+- *(5) Sort [country_name#6 ASC, completions#7 DESC]
+- ShuffleQueryStage (coalesced from 200 to 50)
+- Exchange hashpartitioning(country_name#6, 200)
+- *(4) HashAggregate(keys=[country_name#6, title_id#2, title_name#3], functions=[count(1)])
+- Exchange hashpartitioning(country_name#6, title_id#2, title_name#3, 200)
+- *(3) HashAggregate(keys=[country_name#6, title_id#2, title_name#3], functions=[partial_count(1)])
+- *(3) Project [country_name#6, title_id#2, title_name#3]
+- *(3) BroadcastHashJoin [country_id#4], [country_id#8], Inner
:- *(3) Project [country_id#4, title_id#2, title_name#3]
: +- *(3) BroadcastHashJoin [title_id#1], [title_id#2], Inner
: :- *(3) Filter isnotnull(title_id#1)
: : +- *(3) ColumnarToRow
: : +- FileScan silver.fact_playback_completion
: : PartitionFilters: [event_date >= 2026-04-12]
: : PushedFilters: [IsNotNull(title_id)]
: +- BroadcastExchange
: +- *(1) ColumnarToRow
: +- FileScan silver.dim_title
+- BroadcastExchange
+- *(2) ColumnarToRow
+- FileScan silver.dim_country
Reading the plan top to bottom:
- FileScan fact_playback_completion with `PartitionFilters` — directory pruning to last 7 days. Catalyst pushed the date filter to the partition key.
- BroadcastExchange dim_title and BroadcastExchange dim_country — both dims are small, BHJ chosen.
- Partial aggregate (*(3)) per input partition: combines rows locally.
- Exchange hashpartitioning on the group keys: 200 partitions (planned).
- ShuffleQueryStage (coalesced from 200 to 50) — AQE merged post-shuffle partitions.
- Final HashAggregate.
- Exchange by `country_name` for the window.
- Sort + Window for row_number.
- Filter rn <= 10.
Failure modes to know:
- `dim_title` grew over the broadcast threshold → planner switches to SMJ → shuffle cost doubles. Watch it.
- If the `event_date` filter wraps the partition column in an expression, e.g. `date_trunc('day', event_date) >= ...`, the partition filter doesn't push; you read all 7 × N files.
- If the aggregate keys are skewed, the final `HashAggregate` becomes the long tail.
14. Configuration cheat sheet — what each knob actually does
Only the ones worth knowing.
14.1 Shuffle & AQE
| Config | Default | What it does |
|---|---|---|
| `spark.sql.shuffle.partitions` | 200 | Post-shuffle partitions. With AQE enabled, AQE coalesces — set this high (e.g. 1000) and let AQE merge. |
| `spark.sql.adaptive.enabled` | true (3.2+) | Master AQE switch. |
| `spark.sql.adaptive.advisoryPartitionSizeInBytes` | 64MB | Target size for coalesced partitions. Raise to 128–256 MB on large clusters. |
| `spark.sql.adaptive.coalescePartitions.enabled` | true | Enable partition coalescing. |
| `spark.sql.adaptive.skewJoin.enabled` | true | Detect and split skewed join partitions. |
| `spark.sql.adaptive.skewJoin.skewedPartitionFactor` | 5.0 | Partition is "skewed" if > factor × median. |
| `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes` | 256MB | And absolute size > this. |
14.2 Join & broadcast
| Config | Default | What it does |
|---|---|---|
| `spark.sql.autoBroadcastJoinThreshold` | 10MB | Side size below which BHJ is auto-selected. |
| `spark.sql.broadcastTimeout` | 300s | collect() timeout for broadcast. |
| `spark.sql.join.preferSortMergeJoin` | true | Prefer SMJ over SHJ when both are valid. Leave true. |
14.3 Memory
| Config | Default | What it does |
|---|---|---|
| `spark.executor.memory` | — | Heap per executor. |
| `spark.executor.memoryOverhead` | max(384MB, 10% of heap) | Off-heap container memory. Bump for PySpark/Arrow/RocksDB. |
| `spark.memory.fraction` | 0.6 | Fraction of (heap − 300 MB) for Spark. |
| `spark.memory.storageFraction` | 0.5 | Initial storage:execution ratio. |
| `spark.memory.offHeap.enabled` | false | Use off-heap for Tungsten. |
| `spark.memory.offHeap.size` | 0 | Size of off-heap pool. |
14.4 Execution
| Config | Default | What it does |
|---|---|---|
| `spark.sql.files.maxPartitionBytes` | 128MB | Target size of each input split from file sources. |
| `spark.sql.files.openCostInBytes` | 4MB | Overhead cost of opening a file; favors combining small files. |
| `spark.sql.execution.arrow.pyspark.enabled` | true | Arrow for Pandas UDFs and toPandas(). |
| `spark.sql.execution.arrow.maxRecordsPerBatch` | 10000 | Rows per Arrow batch. |
| `spark.sql.codegen.wholeStage` | true | Whole-stage codegen. Turn off only to debug plans. |
14.5 Dynamic allocation
| Config | Default | What it does |
|---|---|---|
| `spark.dynamicAllocation.enabled` | false | Scale executors in/out with load. |
| `spark.dynamicAllocation.minExecutors` | 0 | Lower bound. |
| `spark.dynamicAllocation.maxExecutors` | ∞ | Upper bound. |
| `spark.dynamicAllocation.executorIdleTimeout` | 60s | Release an idle executor after this. |
| `spark.shuffle.service.enabled` | false | Required for dynamic allocation (unless push-based shuffle or decommissioning). |
Closing framework
Spark problems almost always sit in one of six buckets:
- Plan is wrong: filter didn't push down, join is wrong strategy. → Fix query, add hints, update stats.
- Partition count is wrong: too many tiny tasks or too few giant ones. → Let AQE coalesce; set advisory bytes.
- Skew: one task is the long tail. → AQE skew handling, salting, split heavy hitters.
- Shuffle is too big: try to eliminate it (broadcast), reduce it (pre-aggregate), or move it (RSS / push-based shuffle).
- Memory: OOM, spill, GC thrash. → off-heap, overhead, reduce broadcast, check caching.
- Driver: driver collecting too much. → `maxResultSize`, avoid `collect()`, use `toLocalIterator`.
That framing survives every interview question and every 3am page.
17. Tungsten and the Off-Heap Memory Model
Before Tungsten (Spark 1.4, matured in 1.6+), Spark ran as a "normal" JVM application: Java objects everywhere, GC pauses ruled tail latency. Tungsten reshapes Spark into something closer to a native database engine — sun.misc.Unsafe memory, whole-stage code generation, cache-conscious binary layouts. Senior candidates know why this matters.
The three pillars
- Memory management outside the heap. UnsafeRow format: binary, pointer arithmetic, no Java object headers. Eliminates the object-per-row overhead; a 10-column row as Java objects is ~200 bytes, as an UnsafeRow ~80.
- Cache-aware computation. Operators produce data in patterns that fit the CPU cache line (64 bytes). Sort-merge joins, aggregations and shuffles are written to minimize cache misses.
- Whole-stage code generation. The Catalyst physical plan compiles multiple operators into a single tight Java method at runtime. Virtual dispatch disappears; the JIT can inline everything. Order-of-magnitude speedups on compute-bound microbenchmarks; 2–5× is more typical end to end.
Why interviews probe this
When a candidate says "Spark is slow on a compute-bound job," the interviewer wants to hear: is the whole-stage codegen fallback kicking in? Some operators (complex UDFs, certain UDAFs) disable codegen for their whole stage. You can see this in df.explain(true) — look for * prefixes on operators. Missing stars = missing codegen = 10x slower for no good reason.
The off-heap memory debugging checklist
- `spark.memory.offHeap.enabled=true` and `spark.memory.offHeap.size` — required for Tungsten to run truly off-heap.
- OOMs that say "Container killed by YARN for exceeding memory limits" — Tungsten's memory counts, but the OS container is still bounded. Raise `spark.executor.memoryOverhead`.
- Aggregation falls back to sort-based and spills to disk when the hash table exceeds the executor's execution memory pool. Spills tank performance. Either raise memory or increase shuffle partitions so each task's hash table shrinks.
18. Dynamic Allocation and Elastic Clusters
Dynamic allocation adds and removes executors during a job's lifetime based on pending-task pressure. Done right, it's the cheapest compute model Spark offers. Done wrong, it's a source of intermittent timeouts and surprise costs.
The mechanics
Spark tracks pending tasks per stage. When tasks wait in the queue longer than spark.dynamicAllocation.schedulerBacklogTimeout, new executors are requested. When executors sit idle for spark.dynamicAllocation.executorIdleTimeout, they're released. The cluster manager (YARN, Kubernetes, Databricks) actually provisions or releases the underlying nodes.
The shuffle-data-loss trap
Default behavior: when an executor is released, its shuffle output on local disk is also lost. If a later stage needs that shuffle data, it must recompute the upstream stage. For long-running jobs with expensive shuffles, this can double wall-clock time. Fix: enable the external shuffle service (on YARN) or persistent volumes (on Kubernetes). Without one of these, dynamic allocation is not safe to enable.
When it's the wrong choice
- Jobs under ~2 minutes total. Executor startup dominates; static allocation is faster.
- Streaming jobs. Structured Streaming supports dynamic allocation poorly — back-pressure oscillations trigger thrashing.
- Cost-controlled environments with hard per-job budgets. Dynamic scaling makes cost non-deterministic.
19. Photon and Native Accelerators
Photon (Databricks) and similar native engines (Velox under Presto/Trino, RAPIDS for GPU) replace Spark's Java execution with C++ or CUDA kernels. For the right workloads, 2–3x faster at similar hardware cost; for the wrong ones, neutral or slightly worse.
What Photon accelerates well
- Columnar scans with predicate pushdown. Native code is 3–5x faster than whole-stage-codegen Java.
- Hash aggregations and hash joins on primitive types. 2–3x.
- Certain window functions with simple frame clauses.
What it doesn't help
- Python UDFs. Can't cross into Photon; falls back to Java path with full data serialization penalty.
- Complex type operations (Map, Struct nested access) that aren't yet supported natively fall back.
- Shuffle-bound jobs. Photon's gains are per-operator; if 80% of wall-clock is shuffle, Photon moves the needle 5%.
The interview question
"You enabled Photon and cost went down 40% but also one of your dashboards broke. Why?" Possible answer: Photon returns slightly different results for edge cases (e.g., certain cast overflow behaviors, NULL semantics in specific functions). You need regression tests against both engines before flipping the switch on production.
20. PySpark vs Scala — Where the Performance Actually Goes
The lore says "Scala is faster." The truth is more nuanced: PySpark's DataFrame API runs the same JVM code as Scala — the Python client is just a thin wrapper issuing Catalyst plans. Where Python actually loses is in three specific places.
Where PySpark equals Scala
- Any pure DataFrame operation: joins, filters, aggregations, window functions. The entire execution happens in JVM; Python is not on the hot path.
- SQL strings. `spark.sql("SELECT …")` from PySpark is indistinguishable from the same call from Scala.
Where Python is slower — and by how much
- Python UDFs (non-vectorized). Each row is serialized to Python, function called, result deserialized. 10–30x slower than an equivalent expression in native Spark SQL. Avoid when possible; replace with built-in functions or SQL expressions.
- Pandas UDFs (vectorized). Batches of rows ship to Python as Arrow tables; function processes them in one call. 3–10x faster than row-UDFs, within 2x of native Scala. The pragmatic default when you must use Python logic.
- Driver-side orchestration with many collects. Python's overhead for orchestrating Spark actions is ~5–10 ms per call. A loop of 1,000 small `collect()`s wastes 5–10 seconds in Python glue. Fix: build the query, execute once.
Decision guidance
- DataFrame + SQL only: pick whichever language your team writes better.
- Needs Python libraries (ML, NLP, scipy): PySpark with Pandas UDFs.
- Mission-critical tight loops with custom logic on >1 TB: Scala wins modestly.
- Team owns both: standardize on PySpark unless there's a specific reason. Hiring is easier.
SQL Deep Dive
"SQL is a query language, an optimization problem, a data model, and a contract — all at once. Interviewers at this level want to see you've thought about all four."
This chapter goes past tricks and into the engine. We derive logical-to-physical translation, walk through each join algorithm with complexity, explain what indexes and zone maps actually do on disk, pull window functions apart by their framing math, and finish with the advanced patterns that show up at L5: gaps-and-islands, sessionization without windows, as-of (point-in-time) joins, and probabilistic sketches.
Contents
- The mental model: SQL is declarative; engines are procedural
- Logical plan processing order (the one the spec promises)
- Physical plan processing order (what actually happens)
- Join algorithms with complexity math
- Indexes, zone maps, Bloom filters: what prunes and when
- Subquery types and how they translate to joins
- Window functions: frames, partitions, and performance
- Gaps and islands — the four canonical variants
- Sessionization without windows (Flink-style in SQL)
- As-of joins / point-in-time joins
- Sketches: HyperLogLog, Theta, Bloom, t-digest
- Anti-patterns that kill query plans
- Query tuning workflow
1. The mental model: SQL is declarative; engines are procedural
Write what you want. The engine figures out how. Simple to say, but the gap between the two is where all the hard problems live:
- Query planner chooses join order, strategy, index usage.
- Storage layout (row vs columnar, partitions, zone maps) decides what bytes are read.
- Execution engine (row-at-a-time vs vectorized, pipelined vs materialized) decides how fast rows flow.
At L5 you should be able to look at any query and predict: what will the plan look like? What will cost money? What happens when data doubles?
The three questions you always ask:
- What's the driving table? The biggest one — everything else is joined to it.
- Where do filters apply? At scan time (push-down), or after?
- Where do shuffles happen? Between which operators, on which key?
2. Logical plan processing order (the one the spec promises)
The SQL standard defines the logical processing order of a SELECT, which determines name visibility and semantics:
1. FROM (+ JOINs)
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT (expressions + window functions)
6. DISTINCT
7. ORDER BY
8. LIMIT / OFFSET
Two consequences you must carry:
- A column alias defined in SELECT is not visible in WHERE, because WHERE is logically processed first. (Some engines like Snowflake relax this; portable code doesn't rely on it.)
- Aggregates resolve at step 3, so `HAVING` is how you filter on an aggregate; `WHERE` can't reference an aggregate directly.
```sql
-- Wrong (not portable):
SELECT user_id, COUNT(*) AS cnt
FROM events
WHERE cnt > 5          -- ERROR: cnt doesn't exist yet at WHERE
GROUP BY user_id;

-- Right:
SELECT user_id, COUNT(*) AS cnt
FROM events
GROUP BY user_id
HAVING COUNT(*) > 5;
```

2.1 The quiet reorderings
The logical order is what the query means; it's not what the engine does. The optimizer is free to reorder any way it wants as long as semantics are preserved. Predicate push-down through JOIN is one such reordering: the logical WHERE runs "after" the FROM+JOIN, but the engine may push the predicate into the scan under the join.
3. Physical plan processing order (what actually happens)
For this query on Postgres-like:
SELECT u.country, COUNT(*) AS plays
FROM fact_plays p
JOIN dim_user u ON p.user_id = u.user_id
WHERE p.event_date = '2026-04-15'
AND u.signup_country_id = 42
GROUP BY u.country
HAVING COUNT(*) > 100
ORDER BY plays DESC
LIMIT 10;

The physical plan Postgres will likely choose:
1. Index Scan on dim_user where signup_country_id = 42 -- cheap, produces small set
2. Hash Join -- build hash on dim_user subset
- Right input: Seq Scan on fact_plays with event_date filter (partitioned index)
3. HashAggregate by country
4. Filter HAVING count > 100
5. Sort plays DESC
6. Limit 10
But note step 1: the small, selective side becomes the build side. That's the first thing most planners do. In Spark/Snowflake the analog is broadcasting the small dimension.
3.1 Reading EXPLAIN from any engine
- Postgres: `EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...`. `Buffers: shared hit=X read=Y` tells you the buffer-cache vs disk split.
- Snowflake: `EXPLAIN USING TEXT SELECT ...`. Use the profile view in the UI for runtime stats (partitions pruned, bytes scanned).
- BigQuery: query plan in the UI; a dry run shows bytes billed before execution.
- Spark: `df.explain(mode="formatted")` (see chapter 04).
Common elements across engines:
- Scan (Seq / Index / Partitioned): reads the base table.
- Filter: applies predicate row-by-row (if not pushed down).
- Join (Hash / Merge / Nested Loop): combines inputs.
- Aggregate (Hash / Stream / Sort): groups.
- Sort: explicit order.
- Exchange / Shuffle (distributed engines): network redistribution by key.
4. Join algorithms with complexity math
| Algorithm | Preconditions | Build | Probe | Total | Memory |
|---|---|---|---|---|---|
| Nested Loop | any condition | — | O(N·M) | O(N·M) | O(1) |
| Hash Join | equijoin | O(N) build hash | O(M) probe | O(N+M) | O(N) |
| Sort-Merge Join | equijoin, both sorted | O(N log N + M log M) sort | O(N+M) merge | O((N+M) log (N+M)) | O(1) streaming |
| Index Nested Loop | index on inner | — | O(M · log N) | O(M · log N) | O(1) |
| Zone Map Join | partitioned / clustered | — | O(M + pruned_N) | depends on pruning | O(1) |
4.1 Nested Loop Join
For every row in outer, scan every row in inner. Only reasonable when:
- Inner is tiny (< 100 rows).
- Condition is a non-equijoin (`BETWEEN`, `<`, `>`).
- You have an index-supporting predicate that turns it into an Index Nested Loop.
In Spark: BroadcastNestedLoopJoin — broadcast one side, scan the other with filter. Only for inequality/Cartesian conditions.
4.2 Hash Join
Build side is the smaller (by row count or estimated size). Hash table built in memory; probe side streamed through.
Memory requirement: rows × (hash_entry_size) where hash_entry_size ≈ key + pointer + row or row-reference. Typical: 50–200 bytes/row. For 10M rows, 1–2 GB memory.
If the build side doesn't fit, engines fall back to:
- Grace Hash Join: partition both sides by hash mod k, spill to disk, then hash-join each partition pair. Memory = O(|partition|).
- Hybrid Hash Join: keeps first partition in memory, spills others.
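The grace hash join's partition-then-join idea fits in a few lines. An in-memory sketch — a real engine writes each partition to disk and joins one pair at a time:

```python
from collections import defaultdict

def grace_hash_join(left, right, k=4):
    """Partition both inputs by hash(key) % k, then hash-join each partition
    pair independently. Matching keys always land in the same partition, so
    each pair needs only O(|partition|) memory instead of O(|left|)."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for key, val in left:
        left_parts[hash(key) % k].append((key, val))
    for key, val in right:
        right_parts[hash(key) % k].append((key, val))

    out = []
    for p in range(k):
        build = defaultdict(list)            # classic hash join per partition
        for key, lval in left_parts[p]:
            build[key].append(lval)
        for key, rval in right_parts[p]:
            for lval in build.get(key, []):
                out.append((key, lval, rval))
    return out

left = [(1, "a"), (2, "b"), (2, "c")]
right = [(2, "x"), (3, "y")]
print(sorted(grace_hash_join(left, right)))  # [(2, 'b', 'x'), (2, 'c', 'x')]
```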
4.3 Sort-Merge Join
When both sides are large and hash tables don't fit. Both sides sorted, then merged by walking two pointers.
Strong suit: supports range joins (if you partition-by-range) and handles arbitrary size.
Weakness: sort cost. Engines avoid it unless at least one side is already sorted (clustered index, clustering key in Snowflake, Z-ORDER in Delta).
4.4 Choosing a strategy (mental shortcut)
if small.rows × row_size < memory_budget: hash join (build on small)
elif no equijoin: nested loop (beware Cartesian)
elif both very large: sort-merge
elif index exists on inner.join_key: index nested loop
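The same shortcut as a runnable function (thresholds and inputs are illustrative, not any engine's actual cost model):

```python
def choose_join_strategy(small_rows: int, row_size: int, memory_budget: int,
                         equijoin: bool, both_large: bool,
                         inner_has_index: bool) -> str:
    """Mirror of the mental shortcut above: hash join when the build side
    fits in memory, nested loop when there is no equality condition,
    sort-merge for two large sides, index nested loop when the inner side
    has an index on the join key."""
    if not equijoin:
        return "nested loop"                       # beware Cartesian blowup
    if small_rows * row_size < memory_budget:
        return "hash join (build on small side)"
    if both_large:
        return "sort-merge"
    if inner_has_index:
        return "index nested loop"
    return "sort-merge"                            # safe fallback

# A 10M-row dim at ~150 B/row needs ~1.5 GB — fits a 4 GB budget → hash join
print(choose_join_strategy(10_000_000, 150, 4 * 1024**3, True, False, False))
```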
5. Indexes, zone maps, Bloom filters: what prunes and when
5.1 B-tree indexes (OLTP default)
A B-tree of (key, row pointer). Lookup: O(log N). Insert/delete: O(log N) with node splits/merges. Each non-leaf page fanout ≈ 200–500 on a 4–8KB page → 4 levels cover ~10^8 rows.
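The depth claim is one log away (a sketch; real engines also keep the upper levels pinned in the buffer cache, so a point lookup usually touches disk once):

```python
import math

def btree_levels(num_rows: int, fanout: int = 300) -> int:
    """Smallest number of levels (root included) such that
    fanout ** levels >= num_rows."""
    return max(1, math.ceil(math.log(num_rows, fanout)))

print(btree_levels(10**8))  # 4 — matches the "4 levels cover ~10^8 rows" claim
```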
Covering index: includes all columns needed for the query → the query is satisfied from the index alone, no table lookup.
```sql
CREATE INDEX idx_plays_user_date ON fact_plays (user_id, event_date)
INCLUDE (watch_ms);
-- SELECT watch_ms FROM fact_plays WHERE user_id = 1234 AND event_date = '2026-04-15'
-- satisfied entirely from the index.
```

5.2 Hash indexes (in-memory dbs, Postgres for exact match)
O(1) exact lookup. Can't support range scans. Rarely first choice.
5.3 Bitmap indexes (low-cardinality)
Column with few distinct values (gender, country, status). One bitmap per value, each bitmap has one bit per row. AND/OR between bitmaps is extremely cheap.
Used in: Oracle, DuckDB (generated on-the-fly), warehouse columnar engines (implicit).
5.4 GIN / GiST / BRIN indexes
- GIN: inverted index for JSONB, array, full-text (each distinct term → postings list of row IDs).
- GiST: generalized search tree for geospatial data, range types, and other custom orderings.
- BRIN (Block Range INdex): min/max per block range. Tiny, for large time-ordered tables. O(1) space per block. ≈ what a zone map does.
5.5 Zone maps / min-max statistics (columnar warehouses)
Every Parquet row group has min/max per column. Every Snowflake micro-partition, every Delta file, every Iceberg manifest entry — stores them.
For a query WHERE event_date = '2026-04-15', the engine reads min/max per row group and skips any row group whose [min, max] doesn't include '2026-04-15'.
Zone maps work best on clustered/sorted data. Random writes ruin them: every block has the full range of values → no pruning.
Fix: cluster on the column (Snowflake CLUSTER BY, Iceberg sort, Delta Z-ORDER).
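A zone map is small enough to simulate in a few lines. The sketch below shows why clustering matters: disjoint per-block ranges prune, overlapping ones don't:

```python
def prune_row_groups(zone_maps, lo, hi):
    """Keep only row groups whose [min, max] overlaps [lo, hi].
    This is all a zone map is: one (min, max) pair per block,
    checked before any data bytes are read."""
    return [i for i, (mn, mx) in enumerate(zone_maps)
            if not (mx < lo or mn > hi)]

# Clustered on the column: disjoint ranges → most groups skipped.
clustered = [(1, 100), (101, 200), (201, 300), (301, 400)]
print(prune_row_groups(clustered, 150, 160))   # [1] — 3 of 4 groups skipped

# Randomly written: every group spans the full range → nothing skipped.
shuffled = [(1, 400), (2, 399), (1, 398), (3, 400)]
print(prune_row_groups(shuffled, 150, 160))    # [0, 1, 2, 3]
```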
5.6 Bloom filters
Probabilistic set membership. Configurable false-positive rate, zero false-negatives. Huge win for point-lookup queries on large tables.
Formula: for n elements and desired FP rate p:
- Bits: m = −n × ln(p) / (ln 2)²
- Hashes: k = (m/n) × ln 2
Example: 10M elements, 1% FP → m ≈ 96 Mbit ≈ 12 MB. Stored per Parquet row group or per Iceberg data file.
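The sizing formulas as code, reproducing the example above:

```python
import math

def bloom_params(n: int, p: float) -> tuple[int, int]:
    """Optimal Bloom filter size (bits) and hash count for n elements
    at false-positive rate p, per the formulas above."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k

m, k = bloom_params(10_000_000, 0.01)
print(round(m / 8 / 1024**2, 1), k)   # ≈ 11.4 MiB, 7 hash functions
```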
Parquet supports Bloom filters since 1.12. Spark writes them with parquet.bloom.filter.enabled=true. Iceberg has them as a column hint:
```sql
ALTER TABLE silver.fact_playback_session
SET TBLPROPERTIES ('write.parquet.bloom-filter-enabled.column.user_id' = 'true');
```

When the filter says "maybe" you read the block. When it says "definitely not" you skip. On highly selective queries with non-clustered keys, 10–100× speedup.
5.7 The pruning checklist
Your query is slow, but the data is partitioned/clustered. What pruned and what didn't?
- Partition pruning (directory level): did the planner push the partition filter? Check `PartitionFilters` in the plan.
- Zone-map pruning (file/row-group level): do the min/max stats match the filter? Requires clustering.
- Bloom filter pruning: configured for the column?
- Column pruning (columns not selected): only the needed columns are read.
Each one is a multiplicative win.
6. Subquery types and how they translate to joins
6.1 Scalar subquery (one row, one column)
```sql
SELECT user_id,
       (SELECT MAX(event_ts) FROM sessions WHERE user_id = u.user_id) AS last_seen
FROM dim_user u;
```

Naive execution: for each user, run the inner query. O(N × cost_inner).
Optimizer rewrite (good engines): correlated subquery → left outer join with aggregate.
SELECT u.user_id, s.last_seen
FROM dim_user u
LEFT JOIN (SELECT user_id, MAX(event_ts) AS last_seen FROM sessions GROUP BY user_id) s
ON u.user_id = s.user_id;

6.2 IN / EXISTS (existence)
SELECT * FROM dim_user WHERE user_id IN (SELECT user_id FROM sessions WHERE event_date >= '2026-04-01');
-- Rewritten to:
SELECT u.*
FROM dim_user u
SEMI JOIN (SELECT DISTINCT user_id FROM sessions WHERE event_date >= '2026-04-01') s
ON u.user_id = s.user_id;

SEMI JOIN returns rows from the left side that have a match in the right, without duplicating rows or returning right-side columns.
6.3 NOT IN / NOT EXISTS (anti)
Critical NULL pitfall:
SELECT * FROM dim_user WHERE user_id NOT IN (SELECT user_id FROM blocked_users);

If blocked_users.user_id has ANY NULL, the whole clause returns the empty set (3-valued logic: `x NOT IN (... NULL ...)` → unknown).
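The pitfall reproduces in any engine with standard three-valued logic — here with Python's stdlib sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_user (user_id INT);
    CREATE TABLE blocked_users (user_id INT);
    INSERT INTO dim_user VALUES (1), (2), (3);
    INSERT INTO blocked_users VALUES (2), (NULL);  -- one NULL poisons NOT IN
""")

rows = con.execute(
    "SELECT user_id FROM dim_user "
    "WHERE user_id NOT IN (SELECT user_id FROM blocked_users)").fetchall()
print(rows)  # [] — every comparison against NULL is 'unknown', so no row qualifies
```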
Always prefer NOT EXISTS:
SELECT u.* FROM dim_user u
WHERE NOT EXISTS (SELECT 1 FROM blocked_users b WHERE b.user_id = u.user_id);

6.4 Lateral / CROSS APPLY (correlated row-returning)
SELECT u.user_id, top3.title_id
FROM dim_user u,
LATERAL (
SELECT title_id
FROM fact_plays p
WHERE p.user_id = u.user_id
ORDER BY play_count DESC
LIMIT 3
) top3;
Per-row subquery that returns multiple rows. Executes per outer row (sometimes optimized to a window function). Keep cardinality in check.
7. Window functions: frames, partitions, and performance
Anatomy:
function(args) OVER (
[PARTITION BY partition_cols]
[ORDER BY order_cols]
[ROWS | RANGE | GROUPS frame_spec]
)
7.1 Frame types
- ROWS: physical rows. ROWS BETWEEN 2 PRECEDING AND CURRENT ROW = exactly 3 rows (fewer at the start of the partition).
- RANGE: logical range over ordered values. RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW uses the ORDER BY value to scope the frame.
- GROUPS: peer groups (rows with equal ORDER BY values). GROUPS BETWEEN 1 PRECEDING AND CURRENT ROW = this group + the previous group.
7.2 Function families
- Ranking: ROW_NUMBER, RANK, DENSE_RANK, NTILE, PERCENT_RANK, CUME_DIST.
- Analytic: LAG, LEAD, FIRST_VALUE, LAST_VALUE, NTH_VALUE.
- Aggregates over windows: SUM, AVG, MIN, MAX, COUNT, and more.
7.3 The default frame trap
-- Default frame with ORDER BY is:
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
SELECT user_id, event_ts, SUM(amount) OVER (PARTITION BY user_id ORDER BY event_ts) AS running_total
FROM fact_charges;
Without ORDER BY: default frame is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Different answer.
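The difference is easy to see with Python's bundled SQLite (window functions need SQLite 3.25+, which ships with Python 3.11+). The table and amounts are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_charges (user_id INTEGER, event_ts INTEGER, amount INTEGER);
INSERT INTO fact_charges VALUES (1, 1, 10), (1, 2, 20), (1, 3, 30);
""")

# With ORDER BY: default frame ends at CURRENT ROW -> running total.
running = [r[0] for r in conn.execute(
    "SELECT SUM(amount) OVER (PARTITION BY user_id ORDER BY event_ts) "
    "FROM fact_charges ORDER BY event_ts")]

# Without ORDER BY: frame covers the whole partition -> grand total on every row.
total = [r[0] for r in conn.execute(
    "SELECT SUM(amount) OVER (PARTITION BY user_id) "
    "FROM fact_charges ORDER BY event_ts")]

print(running)  # [10, 30, 60]
print(total)    # [60, 60, 60]
```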
7.4 Window performance model
Per distinct PARTITION BY value, the engine:
- Shuffles rows by partition key.
- Sorts within partition by ORDER BY.
- Walks the partition, applying the frame.
Cost = shuffle + sort + O(partition_size × frame_size).
Optimization tips:
- Always PARTITION BY — otherwise one partition = entire dataset = one task.
- Match the partitioning to existing clustering to avoid a shuffle.
- ROW_NUMBER() <= k filter: engines can short-circuit to top-k per partition (avoids a full sort). In Spark this is a dedicated optimizer rule.
7.5 Running totals pattern
SELECT
user_id, event_ts, amount,
SUM(amount) OVER (PARTITION BY user_id ORDER BY event_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
AVG(amount) OVER (PARTITION BY user_id ORDER BY event_ts
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS trailing_7_avg
FROM fact_charges;
7.6 First/last value pattern
SELECT
user_id, session_id, event_ts,
FIRST_VALUE(event_type) OVER (PARTITION BY session_id ORDER BY event_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_event,
LAST_VALUE(event_type) OVER (PARTITION BY session_id ORDER BY event_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_event
FROM session_events;
Explicit frame ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is critical — the default for LAST_VALUE is "current row" and will return the current row's value, not the last.
8. Gaps and islands — the four canonical variants
A classic family of problems. You have a sequence; find stretches where a property is true or identify "runs" of consecutive values.
8.1 Variant 1: consecutive identical values
-- Find streaks of same status per user
WITH marked AS (
SELECT user_id, event_ts, status,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts) AS rn_all,
ROW_NUMBER() OVER (PARTITION BY user_id, status ORDER BY event_ts) AS rn_status
FROM fact_events
)
SELECT user_id, status,
MIN(event_ts) AS streak_start,
MAX(event_ts) AS streak_end,
COUNT(*) AS streak_len
FROM marked
GROUP BY user_id, status, (rn_all - rn_status);
Why it works: within a streak of same status, rn_all - rn_status is constant. It changes when status changes.
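A runnable check of the rn_all - rn_status trick using Python's bundled SQLite and toy status events:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_events (user_id INTEGER, event_ts INTEGER, status TEXT);
INSERT INTO fact_events VALUES
  (1, 1, 'up'), (1, 2, 'up'), (1, 3, 'down'), (1, 4, 'up'), (1, 5, 'up');
""")

streaks = conn.execute("""
WITH marked AS (
  SELECT user_id, event_ts, status,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts) AS rn_all,
         ROW_NUMBER() OVER (PARTITION BY user_id, status ORDER BY event_ts) AS rn_status
  FROM fact_events
)
SELECT status, MIN(event_ts), MAX(event_ts), COUNT(*)
FROM marked
GROUP BY user_id, status, (rn_all - rn_status)
ORDER BY MIN(event_ts)
""").fetchall()

# Two 'up' streaks separated by one 'down' event:
print(streaks)  # [('up', 1, 2, 2), ('down', 3, 3, 1), ('up', 4, 5, 2)]
```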
8.2 Variant 2: consecutive integers (gap detection)
-- Find missing invoice numbers
SELECT invoice_num + 1 AS gap_start,
next_num - 1 AS gap_end
FROM (
SELECT invoice_num,
LEAD(invoice_num) OVER (ORDER BY invoice_num) AS next_num
FROM invoices
) t
WHERE next_num - invoice_num > 1;
8.3 Variant 3: date gaps (missing daily data)
WITH calendar AS (
SELECT generate_series(DATE '2026-01-01', DATE '2026-04-15', INTERVAL '1 day')::date AS d
)
SELECT d FROM calendar
LEFT JOIN daily_metrics m ON m.metric_date = calendar.d
WHERE m.metric_date IS NULL;
8.4 Variant 4: consecutive timestamp "islands" (session gaps)
See Sessionization below — this is the streaming sessionization pattern in SQL.
9. Sessionization without windows (Flink-style in SQL)
Definition: a session is a run of events from the same user with no gap larger than X minutes.
WITH events AS (
SELECT user_id, event_ts
FROM fact_events
WHERE event_date = '2026-04-15'
),
with_prev AS (
SELECT user_id, event_ts,
LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_ts
FROM events
),
flagged AS (
SELECT user_id, event_ts,
CASE WHEN prev_ts IS NULL OR event_ts - prev_ts > INTERVAL '30 minutes'
THEN 1 ELSE 0 END AS session_start
FROM with_prev
),
numbered AS (
SELECT user_id, event_ts,
SUM(session_start) OVER (PARTITION BY user_id ORDER BY event_ts) AS session_num
FROM flagged
)
SELECT user_id, session_num,
MIN(event_ts) AS session_start_ts,
MAX(event_ts) AS session_end_ts,
COUNT(*) AS event_count,
EXTRACT(EPOCH FROM MAX(event_ts) - MIN(event_ts)) AS duration_seconds
FROM numbered
GROUP BY user_id, session_num;
Why it works:
- LAG finds the prior event's timestamp within each user's timeline.
- Flag a new session whenever the gap > 30m.
- SUM(session_start) cumulatively counts session starts, giving each event its session number.
- GROUP BY (user_id, session_num) aggregates each session.
This is the exact pattern Flink's session window implements in its state machine — only here it's a batch SQL translation.
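The same state machine fits in a dozen lines of plain Python, which is a useful way to sanity-check the SQL. Timestamps are hypothetical epoch seconds; 1800 s = the 30-minute gap:

```python
def sessionize(timestamps, gap_seconds=1800):
    """Split one user's event timestamps into sessions by inactivity gap."""
    sessions = []
    current = []
    for ts in sorted(timestamps):
        # New session whenever the gap to the previous event exceeds the threshold.
        if current and ts - current[-1] > gap_seconds:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

events = [0, 60, 120, 5000, 5100, 99999]
print(sessionize(events))  # [[0, 60, 120], [5000, 5100], [99999]]
```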
10. As-of joins / point-in-time joins
"What was user X's subscription tier at the moment they made this play?"
The classic feature-store / temporal question. The dimension has history (SCD2). The fact has timestamps. You want the dimension row valid at each fact timestamp.
10.1 Approach 1: BETWEEN join (supported natively in many engines)
SELECT p.*, d.plan_name, d.plan_price
FROM fact_playback p
JOIN dim_subscription_history d
ON p.user_id = d.user_id
AND p.event_ts >= d.valid_from
AND p.event_ts < d.valid_to;
Works, but the BETWEEN join is effectively a range join — default hash/equi joins don't handle it. Spark uses BNLJ; Postgres uses merge join on (user_id, valid_from) if you have the right index.
10.2 Approach 2: LATERAL correlated subquery
SELECT p.*, d.plan_name
FROM fact_playback p,
LATERAL (
SELECT plan_name
FROM dim_subscription_history d
WHERE d.user_id = p.user_id
AND d.valid_from <= p.event_ts
ORDER BY d.valid_from DESC
LIMIT 1
) d;
Per fact row: lookup the most recent dim row that started before the fact. Works great with a (user_id, valid_from) index.
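The per-row lookup is just a binary search over valid_from. A minimal Python sketch of the same semantics (the history tuples are hypothetical):

```python
import bisect

def asof_lookup(history, event_ts):
    """history: list of (valid_from, value) sorted by valid_from.
    Returns the value of the most recent row with valid_from <= event_ts."""
    starts = [valid_from for valid_from, _ in history]
    i = bisect.bisect_right(starts, event_ts) - 1
    return history[i][1] if i >= 0 else None

subs = [(0, "free"), (100, "basic"), (500, "premium")]
print(asof_lookup(subs, 99))    # free
print(asof_lookup(subs, 100))   # basic
print(asof_lookup(subs, 9999))  # premium
```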
10.3 Approach 3: Spark-native AS OF joins (PySpark)
# No native ASOF JOIN in Spark SQL; the pandas-on-Spark API exposes merge_asof
# (assumes plays_ps and subs_ps are pandas-on-Spark DataFrames):
import pyspark.pandas as ps
result = ps.merge_asof(plays_ps, subs_ps, left_on="event_ts", right_on="valid_from", by="user_id")
Native asof support is limited in Spark SQL; most teams express the BETWEEN join and rely on AQE + good clustering. Alternative: kdb+ / ClickHouse / DuckDB have real ASOF JOIN syntax.
-- DuckDB / ClickHouse ASOF JOIN
SELECT p.*, d.plan_name
FROM fact_playback p
ASOF LEFT JOIN dim_subscription_history d
ON p.user_id = d.user_id AND p.event_ts >= d.valid_from;
10.4 Approach 4: "expand-and-join" (batch feature stores)
Produce a denormalized table with (user_id, valid_from, valid_to, plan_name). Then INNER JOIN on user_id and BETWEEN. Works. Costs storage. Used at scale by Uber's Michelangelo, others.
11. Sketches: HyperLogLog, Theta, Bloom, t-digest
When exact answers cost too much, sketches trade bounded error for massive cost reduction.
11.1 HyperLogLog (HLL) — approximate distinct count
Core idea: hash each element to a bit string; track the maximum number of leading zeros observed per bucket; rare patterns imply large cardinality.
Error: ~1.04 / √m where m = number of buckets. Typical m = 16384 (2^14) → 0.8% error.
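To make the mechanism concrete, here is a toy HyperLogLog in Python — a from-scratch sketch for intuition, not any library's implementation. The bias constant and linear-counting correction are the standard ones from the HLL literature:

```python
import hashlib
import math

def hll_estimate(items, p=12):
    """Toy HyperLogLog: p index bits -> m = 2^p registers."""
    m = 1 << p
    regs = [0] * m
    for item in items:
        # 64-bit hash of the item.
        h = int.from_bytes(
            hashlib.blake2b(str(item).encode(), digest_size=8).digest(), "big")
        idx = h >> (64 - p)                      # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)         # remaining bits
        rank = (64 - p) - rest.bit_length() + 1  # leading zeros + 1
        regs[idx] = max(regs[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)             # bias constant for m >= 128
    est = alpha * m * m / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if est <= 2.5 * m and zeros:                 # linear-counting correction for small n
        est = m * math.log(m / zeros)
    return est

est = hll_estimate(range(100_000))
print(round(est))  # within a few percent of 100,000 using 4096 registers
```

Merging two sketches is element-wise max over the register arrays, which is why daily sketches can be rolled up into monthly distinct counts.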
-- Snowflake
SELECT HLL(user_id) FROM fact_plays; -- approx distinct
SELECT HLL_ACCUMULATE(user_id) FROM fact_plays; -- serializable state
SELECT HLL_COMBINE(state) FROM (SELECT HLL_ACCUMULATE(user_id) AS state FROM ...);
SELECT HLL_ESTIMATE(HLL_COMBINE(state)) FROM ...; -- final cardinality
-- BigQuery
SELECT APPROX_COUNT_DISTINCT(user_id) FROM fact_plays;
-- Presto/Trino
SELECT approx_distinct(user_id) FROM fact_plays;
11.2 Theta sketch — distinct with set operations
HLL estimates |A|. Theta estimates |A|, |A ∪ B|, |A ∩ B|, |A \ B|. Critical for "users who watched X AND Y" without recomputing from raw data.
Supported in Snowflake (APPROX_COUNT_DISTINCT uses HLL; Theta available via Java UDF or Datasketches).
11.3 Bloom filters — set membership
Covered under indexes. Also stored as materialized columns for fast "is this key present" tests.
11.4 t-digest — approximate quantiles
For latency percentiles over streams. Compact (~5KB for 1% error at p99). Mergeable.
-- Snowflake
SELECT APPROX_PERCENTILE(latency_ms, 0.99) FROM fact_requests;
-- Trino
SELECT approx_percentile(latency_ms, 0.99) FROM fact_requests;
11.5 When to reach for sketches
- Daily/hourly metrics where re-scanning billions of rows is wasteful.
- Feature store cardinality features.
- Cross-partition unions (streaming + batch).
Rule: cost reduction is typically 100× for 1% error.
12. Anti-patterns that kill query plans
12.1 Function on the indexed column
-- Bad: index on event_ts not used
SELECT * FROM fact_plays WHERE DATE(event_ts) = '2026-04-15';
-- Good:
SELECT * FROM fact_plays
WHERE event_ts >= '2026-04-15' AND event_ts < '2026-04-16';
12.2 Implicit casts
-- Bad: user_id is BIGINT, '1234' is VARCHAR → cast every row
SELECT * FROM fact_plays WHERE user_id = '1234';
-- Good:
SELECT * FROM fact_plays WHERE user_id = 1234;
12.3 SELECT *
In columnar warehouses, reads every column → often 10× the I/O. Name columns.
12.4 DISTINCT as a bug fix
If your query needs DISTINCT, you usually have a join multiplication bug. Find the join producing duplicates and fix the join condition.
12.5 OR across tables in a JOIN
-- Bad: optimizer usually can't split this
SELECT * FROM a JOIN b ON a.x = b.x OR a.y = b.y;
-- Good:
SELECT * FROM a JOIN b ON a.x = b.x
UNION
SELECT * FROM a JOIN b ON a.y = b.y;
12.6 NOT IN with nullable subquery
Covered above. Use NOT EXISTS.
12.7 Massive CTEs that don't inline
Postgres ≤ 11: CTEs were optimization fences (always materialized). Use subqueries, or WITH cte AS NOT MATERIALIZED (...) on PG 12+.
12.8 Correlated subquery in SELECT
-- Bad: N × M
SELECT user_id,
(SELECT COUNT(*) FROM sessions s WHERE s.user_id = u.user_id) AS cnt
FROM dim_user u;
-- Good: one GROUP BY
SELECT u.user_id, COALESCE(s.cnt, 0) AS cnt
FROM dim_user u
LEFT JOIN (SELECT user_id, COUNT(*) AS cnt FROM sessions GROUP BY user_id) s
USING (user_id);
13. Query tuning workflow
- Look at the plan. EXPLAIN ANALYZE (Postgres) or the query profile (Snowflake/BigQuery/Spark UI).
- Find the biggest cost node. Usually one scan or one join dominates. 80/20 rule applies.
- Is it the scan?
- Check partition pruning, zone maps, bloom filter usage.
- Can you push a filter further? Check for UDFs and implicit casts.
- Are the right columns read? Remove SELECT *.
- Is it a join?
- What strategy? Is it what you'd pick?
- Estimated vs actual rows — if wildly wrong, stats are stale. Run ANALYZE.
- Is one side broadcast-sized? Hint it.
- Is it skewed? Salt, two-stage aggregate, or isolate heavy hitters.
- Is it a sort / window?
- Can you partition differently to avoid one?
- Is there an unnecessary ORDER BY in the final stage?
- Is it a shuffle?
- Can you pre-partition (bucket/cluster)?
- Can it be eliminated by broadcasting a side?
- Is it I/O?
- Compression codec: Snappy for speed, ZSTD for ratio.
- File sizes: consolidate small files; split massive ones.
13.1 "When in doubt, measure" checklist
- SELECT COUNT(*) on base tables to confirm size.
- SELECT COUNT(*), COUNT(DISTINCT key) to confirm cardinality / skew.
- SELECT key, COUNT(*) FROM t GROUP BY 1 ORDER BY 2 DESC LIMIT 10 to find hot keys.
- SHOW TBLPROPERTIES / SHOW CREATE TABLE — partitioning, clustering, properties.
13.2 A 3-minute triage
For "this query is slow" in an interview:
1. How big are the inputs? (SELECT COUNT(*)...)
2. Is it scan-bound or compute-bound? (look at % time in scan vs join)
3. Did partitions prune? (check plan)
4. Is one join the culprit? (find the big one)
5. Is there skew? (check task time distribution)
6. What's the smallest change that fixes the biggest cost?
That sequence solves 80% of production SQL problems.
15. CTE Materialization — Inlined vs Cached
Common Table Expressions look syntactically identical across engines but behave very differently under the hood. Senior candidates know which engine inlines CTEs (treating them as view substitutions) and which materializes them (computing once, reading N times).
Engine-by-engine behavior
| Engine | Default behavior | Override |
|---|---|---|
| Postgres ≤ 11 | Always materialized (optimization barrier) | Upgrade to 12+ |
| Postgres ≥ 12 | Inlined if referenced once, materialized otherwise | MATERIALIZED / NOT MATERIALIZED keyword |
| Snowflake | Inlined always | Use TEMP TABLE to force materialization |
| BigQuery | Inlined always | Use CREATE TEMP TABLE |
| Spark SQL | Inlined | CACHE TABLE on a subquery |
The performance trap
A CTE referenced twice in an inlined engine is computed twice. If the CTE does a 10-billion-row scan, you're doing 20 billion rows' worth of work. For anything expensive referenced more than once, materialize explicitly.
-- Bad: scans 20B rows
WITH heavy AS (SELECT ... FROM fact_10b_rows WHERE ...)
SELECT ... FROM heavy h1 JOIN heavy h2 ON ...;
-- Good: scans 10B rows once, joins on materialized result
CREATE TEMP TABLE heavy AS SELECT ... FROM fact_10b_rows WHERE ...;
SELECT ... FROM heavy h1 JOIN heavy h2 ON ...;
16. Approximate Aggregations — HLL, t-Digest, CMS
Exact COUNT(DISTINCT) on a billion rows requires shuffling every row. Approximate counterparts (HyperLogLog, t-digest, Count-Min Sketch) give 99% accuracy for 0.01% of the cost. Senior candidates know what each sketch does and when to reach for it.
HyperLogLog — cardinality
APPROX_COUNT_DISTINCT(). Uses ~16 KB of state per distinct-count regardless of input size. Answer within ~2% of exact. Mergeable: HLL sketches from separate shards combine into a single sketch; you can pre-aggregate into daily sketches and then combine across a month for monthly distinct counts without rereading raw data.
T-digest / KLL — quantiles
APPROX_PERCENTILE(). Stores a compressed histogram of value distributions. Supports arbitrary percentiles (p50, p95, p99, p999) from the same sketch. Use for latency dashboards, revenue distributions, any quantile metric on high-cardinality numeric data.
Count-Min Sketch — frequency
Given a stream of events, estimate the frequency of any specific item. Supports "heavy hitters" queries — who are the top-K most-frequent items — cheaply. Less common in SQL engines than HLL but worth naming in interviews for top-K at scale.
When NOT to use sketches
- Compliance-sensitive reporting (finance, regulatory). "Approximate" is never the right word on an audit trail.
- Small datasets under a few million rows — the exact version runs in seconds; approximations add complexity for no gain.
- Critical correctness where a 2% error would change a decision (e.g., fraud thresholds that depend on exact distinct counts).
17. Materialized Views and Incremental Refresh
A materialized view is a cached query result. The interesting question is how it stays fresh. Three refresh strategies, each with its own correctness and cost profile.
Full refresh
Recompute the view from scratch on each refresh. Simple, always correct. Cost scales with base-table size. Use when the view is small, refresh is infrequent, or the base table changes too unpredictably for incremental logic.
Incremental refresh
Apply only the delta since the last refresh. Requires the engine to reason about which base-table rows feed which view rows — a non-trivial analysis that engines support for restricted query shapes only (single table, or simple aggregations over joins with specific key structure).
- Snowflake: "Dynamic Tables" support incremental refresh on many join patterns.
- BigQuery: "Materialized views" support incremental for aggregations and filters over a single base table.
- Postgres: native MATERIALIZED VIEW only supports full refresh; incremental requires third-party tools (pg_ivm) or manual triggers.
Lakehouse MERGE-based patterns
In Iceberg / Delta, the pattern is: compute the delta in a staging table, MERGE INTO the materialized output. The engine doesn't call it a materialized view, but operationally it is one. Write your own refresh job and schedule it in the orchestrator.
18. Query Hints — Per-Engine Grammar
Query hints override the optimizer. They are a sharp tool; use them only when you've measured that the optimizer is wrong and can't be fixed upstream. Major engines:
Snowflake
Snowflake exposes few hints — the philosophy is "trust the optimizer." What exists: USE_CACHED_RESULT=FALSE, query tags, and clustering hints via CLUSTER BY at table DDL time. The right lever in Snowflake is often reshaping the query or the clustering key, not a hint.
BigQuery
Also minimal: BigQuery exposes essentially no optimizer hints. For join-order control you often rewrite the query; inline views or CTEs with explicit ordering work where hints don't.
Spark SQL
Rich hint grammar: /*+ BROADCAST(t1) */, /*+ MERGE(t1, t2) */, /*+ SHUFFLE_HASH(t1) */, /*+ REPARTITION(n) */; a SKEW hint exists on Databricks' runtime. The BROADCAST hint is the most common: force broadcast of a dim table the optimizer refused to broadcast because its statistics were stale.
Postgres
No hints by default — the philosophy is "hints lie over time as data distributions change." The extension pg_hint_plan adds them for those who insist. Preferred tools in Postgres: ANALYZE, VACUUM, index creation, and EXPLAIN (ANALYZE, BUFFERS).
The meta-lesson
Reaching for a hint is a smell. Before hinting, confirm: (a) statistics are up-to-date, (b) the query is written in a shape the optimizer can reason about, (c) you've run EXPLAIN ANALYZE and understand the plan. Interviewers are suspicious of candidates who volunteer hints as a first-resort tuning move.
19. Window Function Choice — Examples by Intent
Pick the wrong window function and the query is subtly wrong in ways the optimizer won't flag. The table below maps the business intent to the correct function, with the most common wrong pick for each.
| You want to… | Use | Common wrong pick | Why it matters |
|---|---|---|---|
| Rank with strict ordering, break ties arbitrarily | ROW_NUMBER() | RANK() | ROW_NUMBER gives exactly N rows for top-N; RANK may return more than N on ties |
| Rank with ties, preserving count | DENSE_RANK() | RANK() | RANK leaves gaps (1,1,3); DENSE_RANK doesn't (1,1,2) — pick by whether gaps matter |
| Top-N with "include all tied at Nth" | DENSE_RANK() | ROW_NUMBER() | ROW_NUMBER arbitrarily picks one of the ties |
| Previous row's value per group | LAG(col) | self-join | LAG is one pass; self-join is N² in the worst case |
| Next row's value per group | LEAD(col) | self-join | Same reason |
| Running total | SUM(col) OVER (ORDER BY …) | correlated subquery | Correlated subquery is O(N²); windowed sum is O(N) |
| Moving 7-day sum | SUM(col) OVER (ORDER BY d ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) | RANGE frame | ROWS counts rows; RANGE counts values — with date gaps, RANGE gives different results |
| Moving 7-calendar-day sum (including empty days) | Generate a dense date grid first, then windowed sum | naive ROWS frame | Sparse data breaks ROW-based windows silently |
| First/last row per group, ordered | FIRST_VALUE() / LAST_VALUE() with explicit frame | no frame | Default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW — LAST_VALUE returns the current row, not the group's last row |
| Percentile within group | PERCENT_RANK() or NTILE(100) | ROW_NUMBER() / COUNT(*) | Manual division gives wrong semantics for ties |
| Session number per user (gaps-and-islands) | SUM(new_session_flag) OVER (PARTITION BY user ORDER BY ts) | recursive CTE | Recursive is correct but 10–100x slower |
| "Nth most recent" per group | ROW_NUMBER() OVER (PARTITION BY grp ORDER BY ts DESC) + filter = N | ORDER BY ts DESC LIMIT N | LIMIT applies to the entire result; won't give N-per-group |
| Cumulative distinct count | HyperLogLog sketch + windowed merge | COUNT(DISTINCT col) OVER (…) | Most engines don't support window COUNT(DISTINCT); performance is terrible where they do |
| Lagged value N rows back | LAG(col, N) | LAG(LAG(LAG(col))) | LAG takes an offset argument; no need to nest |
| Percent of total per group | col / SUM(col) OVER (PARTITION BY grp) | subquery grouping | Windowed is one pass; subquery-grouping is two scans |
The frame-clause trap
The three functions that surprise people most are LAST_VALUE, SUM OVER ORDER BY, and FIRST_VALUE. Their default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which is almost never what the user intended for LAST_VALUE. Always write the frame clause explicitly when using these:
-- Wrong: returns the CURRENT row's value, not the group's last
LAST_VALUE(col) OVER (PARTITION BY g ORDER BY ts)
-- Right: explicitly extends the frame to the group end
LAST_VALUE(col) OVER (
PARTITION BY g ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
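The LAST_VALUE trap reproduces directly in Python's bundled SQLite (window functions need SQLite 3.25+). Toy table and values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (g TEXT, ts INTEGER, col TEXT);
INSERT INTO t VALUES ('a', 1, 'x'), ('a', 2, 'y'), ('a', 3, 'z');
""")

# Default frame (RANGE ... CURRENT ROW): LAST_VALUE is just the current row.
wrong = conn.execute(
    "SELECT col, LAST_VALUE(col) OVER (PARTITION BY g ORDER BY ts) "
    "FROM t ORDER BY ts").fetchall()

# Explicit frame to the partition end: LAST_VALUE is the true last row.
right = conn.execute(
    "SELECT col, LAST_VALUE(col) OVER (PARTITION BY g ORDER BY ts "
    "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) "
    "FROM t ORDER BY ts").fetchall()

print(wrong)  # [('x', 'x'), ('y', 'y'), ('z', 'z')]
print(right)  # [('x', 'z'), ('y', 'z'), ('z', 'z')]
```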
Python for Data Engineering
"Python is the glue, the prototype, and increasingly the runtime. Knowing where its abstractions leak — GIL, GC, copy-vs-view, the Arrow boundary — is how you stop writing slow data code."
This chapter is what a senior data engineer actually needs from Python: GIL semantics down to bytecode, the real differences between Pandas / Polars / PySpark with the math, the Arrow boundary, generators and async that don't crash production, testing strategies that scale, and packaging for distributed runtimes.
Contents
- The Python execution model that matters for DE
- The GIL — what it actually locks
- Memory: refcounting, GC, and the bytes you don't see
- Pandas vs Polars vs PySpark — the trade math
- The Arrow boundary: zero-copy between worlds
- Generators, iterators, and chunked I/O
- Async/await for I/O-bound DE work
- Multiprocessing patterns that work in production
- Type hints, dataclasses, and runtime data validation
- Testing strategy for data pipelines
- Packaging and deploying to Spark / Lambda / Airflow
- Performance debugging toolkit
1. The Python execution model that matters for DE
Three layers, each with its own quirks:
- The interpreter (CPython): source → bytecode → executed by the eval loop.
- Reference counting + cyclic GC: every object has a refcount; cycles collected by a generational GC.
- C extension layer: NumPy, Pandas, PyArrow, etc. — most heavy lifting happens in C and releases the GIL.
You won't write fast Python by writing more Python. You write fast Python by delegating to C/Rust extensions (NumPy, Pandas, Arrow, Polars) and arranging your code so the interpreter loop runs as little as possible.
Quick measurement:
import dis
def add(a, b):
return a + b
dis.dis(add)
# 2 0 RESUME 0
# 3 2 LOAD_FAST 0 (a)
# 4 LOAD_FAST 1 (b)
# 6 BINARY_OP 0 (+)
# 10 RETURN_VALUE
Each bytecode op is dispatched by the eval loop. Python 3.11+ added the specializing adaptive interpreter (PEP 659) — same bytecode, but specialized at runtime for observed types (e.g. BINARY_OP_ADD_INT). Real measurable speedup (10–60% for loops). Python 3.12/3.13 push this further.
2. The GIL — what it actually locks
The Global Interpreter Lock is a mutex that guards CPython's interpreter state — primarily refcount manipulation. Only one thread can hold it; only the holder executes Python bytecode at any instant.
What this means in practice:
- CPU-bound pure-Python multithreading: no speedup. Threads serialize on the GIL.
- I/O-bound multithreading: real speedup. Sockets / file I/O release the GIL while blocking.
- C extensions: depends. NumPy, Pandas, scikit-learn frequently release the GIL during heavy compute.
- AsyncIO: single-threaded — it never bumps into the GIL because it doesn't use threads for concurrency.
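A quick way to feel the I/O-bound case: blocking calls that release the GIL (stubbed here with time.sleep) overlap across threads, so ten 0.2-second waits finish in about 0.2 s, not 2 s:

```python
import threading
import time

def io_task():
    time.sleep(0.2)  # C-level sleep: releases the GIL while blocked

start = time.perf_counter()
threads = [threading.Thread(target=io_task) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"{elapsed:.2f}s")  # ~0.2s: the sleeps overlap instead of serializing
```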
2.1 Switching interval
The GIL is released periodically:
import sys
sys.getswitchinterval() # 0.005 (5 ms in 3.x)
The currently-running thread is told to release the GIL after this many seconds (cooperatively, at the next bytecode boundary). Other threads can then contend.
2.2 What happens during time.sleep()
import threading, time
def worker():
time.sleep(1) # releases GIL while sleeping
print("done")
threading.Thread(target=worker).start()
time.sleep is a C function that releases the GIL during the sleep. So you can have 10000 threads sleeping; the GIL doesn't matter.
2.3 What happens during pandas df.merge()
The merge is in C. It releases the GIL for the duration. Other Python threads can run. In practice, pandas + threads can scale on multicore for large numerical operations. Caveat: object-dtype columns (strings in Pandas 1.x) need the GIL for hash table building, and won't scale.
2.4 The free-threaded build (PEP 703 — Python 3.13+)
Python 3.13 ships an experimental --disable-gil build. Reference counting becomes biased + atomic; per-object locks replace the GIL. Real concurrency for pure Python. Performance overhead ~10–30% single-threaded.
For DE work in 2026: continue assuming the GIL exists in production. Free-threaded adoption is on the horizon but not yet default.
2.5 The shortcut: concurrent.futures
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
# I/O-bound: thread pool fine, GIL not contended
with ThreadPoolExecutor(max_workers=32) as ex:
results = list(ex.map(fetch_url, urls))
# CPU-bound: process pool to bypass GIL
with ProcessPoolExecutor(max_workers=8) as ex:
results = list(ex.map(crunch_numbers, datasets))
3. Memory: refcounting, GC, and the bytes you don't see
Every Python object has at minimum a header (~16 bytes on CPython 3.12) + type pointer + refcount. A trivial int is 28 bytes. A 4-character string is 53 bytes. A 1-element list is 88 bytes.
This is why a Pandas DataFrame with 10M rows × 1 string column can use 2 GB of memory in object dtype but 80 MB as Arrow-backed strings.
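You can verify those header costs directly with sys.getsizeof. Exact numbers vary slightly by CPython version and build; these comments reflect a typical 64-bit CPython 3.11/3.12:

```python
import sys

print(sys.getsizeof(1))        # 28: object header + type pointer + one digit
print(sys.getsizeof("abcd"))   # 53: compact-ASCII header + 4 chars + NUL
print(sys.getsizeof([]))       # 56: empty list header

lst = []
lst.append(1)
print(sys.getsizeof(lst))      # 88: append over-allocates spare capacity
```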
3.1 Refcounting
import sys
x = [1, 2, 3]
sys.getrefcount(x) # 2 (variable + getrefcount's local arg)
When refcount hits zero → object freed immediately. Predictable, but cycles aren't collected this way.
3.2 Generational GC
For cycles. Three generations (0, 1, 2). New objects in gen 0; survivors promoted. GC scans gen 0 frequently, gen 2 rarely.
import gc
gc.set_threshold(700, 10, 10) # collect gen0 every 700 allocations, etc.
gc.disable() # turn off cyclic GC entirely
Disabling can speed up batch jobs that allocate massively but never form cycles (rare; verify first).
3.3 Memory profiling
import tracemalloc
tracemalloc.start()
# ... your code
snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics('lineno')
for stat in top[:10]:
print(stat)
Or for live diagnosis:
pip install py-spy
py-spy top --pid <PID>
py-spy dump --pid <PID>
py-spy is a Rust tool that profiles a running Python process without modifying it (it samples the interpreter's stack by reading process memory from outside). Critical for Spark driver issues.
3.4 The __slots__ lever
By default, instance attributes live in a per-instance __dict__ (~250 bytes). __slots__ declares a fixed attribute set, replacing the dict with a fixed-layout struct.
class Point:
__slots__ = ('x', 'y')
def __init__(self, x, y):
self.x, self.y = x, y
Memory drop: ~50%. Useful when you're holding 10M small objects (rare in modern DE; you'd use Arrow). But know it exists.
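A quick check that the per-instance dict actually disappears. Sizes are CPython-version dependent, so the comparison only relies on the slotted instance being smaller than instance-plus-dict:

```python
import sys

class PlainPoint:
    def __init__(self, x, y):
        self.x, self.y = x, y

class SlotPoint:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x, self.y = x, y

p = PlainPoint(1, 2)
s = SlotPoint(1, 2)

print(hasattr(p, "__dict__"))  # True: attributes live in a dict
print(hasattr(s, "__dict__"))  # False: attributes live in fixed slots
plain_bytes = sys.getsizeof(p) + sys.getsizeof(p.__dict__)
print(sys.getsizeof(s) < plain_bytes)  # True
```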
4. Pandas vs Polars vs PySpark — the trade math
The default question in modern DE: which DataFrame library?
4.1 The honest comparison
| Dimension | Pandas | Polars | PySpark |
|---|---|---|---|
| Backing engine | NumPy + (1.x objects, 2.x optionally Arrow) | Rust + Arrow | JVM + Arrow (via wrapper) |
| Execution | Eager | Lazy + eager | Lazy, distributed |
| Threading | Mostly single-threaded | Multi-core by default (Rayon) | Distributed across cluster |
| Memory model | In-memory only | In-memory + streaming engine (1.0+) | Distributed, spillable |
| Sweet spot | < 10 GB on a laptop, single-threaded analytics | 10–500 GB on one beefy machine | TB+, distributed |
| API surface | The biggest, most documented | Smaller but growing fast | Spark SQL ecosystem |
| String/object-heavy | Slow (object dtype) | Fast (Arrow strings) | OK |
| Window functions | OK | Fast | Yes, distributed |
| Joins | OK | Excellent (parallel) | Distributed |
| User-defined funcs | Slow (per-row) | Fast (Rust) or slow (Python) | Pandas UDF for batches |
4.2 Speed math (hand-wavy benchmarks; orders of magnitude)
For a 10 GB CSV → group by → write Parquet:
- Pandas: 5–10 minutes on a laptop, only if it fits in RAM (it won't; you're spilling to swap).
- Polars (lazy + streaming): 20–60 seconds on the same laptop. It streams.
- PySpark local mode: 1–3 minutes. JVM startup overhead, less efficient single-node.
- PySpark on a 10-node cluster: 30 seconds. Now distributed-friendly.
Rule of thumb: if it fits in RAM, Polars wins. If it doesn't, PySpark wins.
4.3 Pandas-specific traps
# View vs copy (the source of "SettingWithCopyWarning")
df2 = df[df['x'] > 0] # might be a view, might be a copy — depends on memory layout
df2['y'] = ... # whiplash: changes might or might not propagate to df
# Fix:
df2 = df[df['x'] > 0].copy()
df2['y'] = ...
# Or use .loc[]
df.loc[df['x'] > 0, 'y'] = ...
Pandas 2.0 introduced Copy-on-Write mode (pd.set_option("mode.copy_on_write", True)) which makes this deterministic. Turn it on for new code.
4.4 Polars-specific patterns
import polars as pl
# Lazy: build a plan, optimize, execute once
df = (pl.scan_parquet("s3://bucket/path/*.parquet")
.filter(pl.col("event_date") == "2026-04-15")
.group_by("country")
.agg(pl.col("watch_ms").sum().alias("total_ms"))
.sort("total_ms", descending=True)
.head(10))
result = df.collect(streaming=True) # streaming engine for out-of-core
Polars optimizes the plan (predicate push-down into the Parquet scan, projection push-down, common subexpression elimination). Runs Rust-multithreaded.
4.5 PySpark-specific patterns
See chapter 04. Key principle: never iterate row-by-row in Python over a Spark DataFrame. Use:
- Built-in functions (Catalyst native).
- Pandas UDFs (Arrow batches).
- .toPandas() only for small results.
5. The Arrow boundary: zero-copy between worlds
Apache Arrow is a columnar in-memory format with a stable C ABI. Anything that speaks Arrow can hand a buffer to anything else without copying.
5.1 Why it matters
Pre-Arrow:
Spark JVM ──serialize Java objects → bytes → deserialize Python objects → Pandas DF
Cost: O(rows × columns), with object allocation for every cell. Slow.
Post-Arrow:
Spark JVM ──Arrow IPC buffer (binary)──> Python (pyarrow.Table) ──> Pandas DF (zero copy for numeric, light copy for strings)
Cost: O(buffer bytes), pointer arithmetic.
5.2 The libraries that talk Arrow natively
- PyArrow
- Polars
- Pandas 2.0+ (with dtype_backend='pyarrow')
- DuckDB (in-memory and zero-copy with Pandas)
- Spark (for Pandas UDFs and toPandas())
- DataFusion, Dask, Vaex
- Snowpark
- BigQuery storage API
- Iceberg, Parquet (file formats are Arrow-compatible)
5.3 A real example — Polars and DuckDB sharing memory
import polars as pl
import duckdb
pl_df = pl.read_parquet("data.parquet")
duck_result = duckdb.query("SELECT country, SUM(watch_ms) FROM pl_df GROUP BY 1").pl()
# duck_result is a Polars DF — zero copy back
DuckDB sees the Polars DF directly via Arrow. No serialization in either direction.
5.4 The strings caveat
Arrow stores strings in two buffers: an offsets array + a single character buffer. Pandas 1.x stores strings as Python str objects (one per row). Converting Pandas-string ↔︎ Arrow-string requires materializing or building Python objects → not zero-copy.
Pandas 2.0 with dtype_backend='pyarrow' keeps strings as Arrow → all the zero-copy benefits.
import pandas as pd
df = pd.read_parquet("data.parquet", dtype_backend="pyarrow")
df.dtypes # show pyarrow types
6. Generators, iterators, and chunked I/O
For data that doesn't fit in memory, generators are the idiom.
def chunked_csv(path, chunksize=100_000):
for chunk in pd.read_csv(path, chunksize=chunksize):
yield chunk
for chunk in chunked_csv("big.csv"):
process(chunk) # one chunk in memory at a time

6.1 Generator pipelines
Compose stages with generators. Each stage is lazy.
def read_lines(path):
with open(path) as f:
for line in f:
yield line.rstrip()
def parse(lines):
for line in lines:
yield json.loads(line)
def filter_recent(records, since):
for r in records:
if r["ts"] >= since:
yield r
pipeline = filter_recent(parse(read_lines("events.jsonl")), since=T0)
for record in pipeline:
sink(record)

Memory use: O(1) per stage, regardless of input size.
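The O(1)-per-stage claim is observable: each item flows through every stage before the next is produced. A minimal sketch with a shared log:

```python
def numbers(log):
    for i in range(3):
        log.append(f"produce {i}")
        yield i

def squares(nums, log):
    for n in nums:
        log.append(f"square {n}")
        yield n * n

log = []
results = list(squares(numbers(log), log))
assert results == [0, 1, 4]
# Stages interleave item-by-item — nothing is buffered between them:
assert log == ["produce 0", "square 0",
               "produce 1", "square 1",
               "produce 2", "square 2"]
```

If the stages buffered, all "produce" entries would precede all "square" entries; the interleaving proves only one item is in flight at a time.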
6.2 itertools — the standard toolbelt
import itertools
# batch into groups of N
def batched(iterable, n):
iterator = iter(iterable)
while batch := tuple(itertools.islice(iterator, n)):
yield batch
# Cartesian product
itertools.product([1,2], [3,4]) # (1,3), (1,4), (2,3), (2,4)
# Chain iterables
itertools.chain([1,2], [3,4]) # 1, 2, 3, 4
# Group by adjacent equal keys (for pre-sorted data)
for key, group in itertools.groupby(sorted_records, key=lambda r: r["user_id"]):
process_user(key, list(group))

6.3 Async generators
async def stream_pages(client):
page_token = None
while True:
page = await client.fetch(page_token=page_token)
for item in page.items:
yield item
if not page.next_token:
break
page_token = page.next_token
async def main():
async for item in stream_pages(client):
await process(item)

7. Async/await for I/O-bound DE work
AsyncIO is single-threaded, single-event-loop concurrency. Best for I/O-heavy: HTTP scraping, S3 list/get, database queries.
7.1 The fundamentals
import asyncio
import aiohttp
async def fetch(session, url):
async with session.get(url) as resp:
return await resp.text()
async def main(urls):
async with aiohttp.ClientSession() as session:
tasks = [fetch(session, u) for u in urls]
return await asyncio.gather(*tasks)
results = asyncio.run(main(urls))

Throughput: thousands of concurrent requests on a single core, vs one request at a time when fetching sequentially.
7.2 Throttling
asyncio.gather launches everything at once. Throttle:
async def fetch_all(urls, concurrency=20):
sem = asyncio.Semaphore(concurrency)
async def bounded(u):
async with sem:
return await fetch_one(u)
return await asyncio.gather(*[bounded(u) for u in urls])

7.3 The mistake: mixing sync and async
async def bad():
time.sleep(5) # blocks the entire event loop! No other coroutine runs.
async def good():
await asyncio.sleep(5)

Same trap with requests.get() (use aiohttp or httpx.AsyncClient), boto3 (use aioboto3), psycopg2 (use asyncpg).
7.4 Running blocking code from async
import asyncio
async def main():
loop = asyncio.get_running_loop()
result = await loop.run_in_executor(None, blocking_fn, arg1, arg2)
# default executor is a ThreadPoolExecutor

7.5 When NOT to use async
- CPU-bound work: doesn't help; use processes.
- Code with no I/O concurrency need: just use sync. Async adds complexity.
- Frameworks that don't natively support it (Spark transformations).
8. Multiprocessing patterns that work in production
When the GIL blocks you on CPU-bound work, spawn processes.
8.1 ProcessPoolExecutor
from concurrent.futures import ProcessPoolExecutor
def heavy(item):
# CPU-bound transformation
return compute(item)
with ProcessPoolExecutor(max_workers=8) as ex:
results = list(ex.map(heavy, items, chunksize=100))

chunksize is critical: too small = IPC overhead per item; too large = stragglers. Rule: total_items / (cores × 4) is a good start.
8.2 Shared memory (Python 3.8+)
from multiprocessing.shared_memory import SharedMemory
import numpy as np
# In producer:
shm = SharedMemory(create=True, size=4 * 10_000_000)
arr = np.ndarray((10_000_000,), dtype=np.float32, buffer=shm.buf)
arr[:] = data
print(shm.name) # pass name to worker
# In worker:
shm = SharedMemory(name=name)
arr = np.ndarray((10_000_000,), dtype=np.float32, buffer=shm.buf)
# Read/process without copying

Avoids serialization of large NumPy arrays between processes.
8.3 The fork vs spawn gotcha
On Linux, the default start method is fork — the child process is a near-copy of the parent. Cheap, but it inherits everything (file handles, locks). On macOS, the default switched to spawn in Python 3.8 for safety. The mismatch causes "works locally, broken in production".
import multiprocessing as mp
mp.set_start_method('spawn', force=True) # be explicit

9. Type hints, dataclasses, and runtime data validation
Modern DE code uses types liberally — for IDE support, docs, and validation.
9.1 Dataclasses
from dataclasses import dataclass, field
from datetime import datetime
@dataclass(frozen=True, slots=True)
class PlaybackEvent:
user_id: int
title_id: int
event_ts: datetime
watch_ms: int = 0
metadata: dict = field(default_factory=dict)
ev = PlaybackEvent(user_id=1, title_id=42, event_ts=datetime.now())

frozen=True makes it hashable + immutable. slots=True (3.10+) replaces __dict__ with __slots__, saves memory.
9.2 Pydantic for runtime validation
from pydantic import BaseModel, Field, field_validator
class PlaybackEvent(BaseModel):
user_id: int = Field(gt=0)
title_id: int = Field(gt=0)
event_ts: datetime
watch_ms: int = Field(ge=0, le=24*3600*1000)
@field_validator('event_ts')
@classmethod
def not_future(cls, v):
if v > datetime.utcnow():
raise ValueError("event_ts in future")
return v
# Validate untrusted input:
ev = PlaybackEvent(**raw_dict) # raises if invalid

Pydantic v2 is Rust-powered, fast enough to use in-line for streaming validation. Use in:
- API gateways accepting events.
- DLQ recovery scripts.
- dbt-style data tests outside dbt.
9.3 attrs
The pre-Pydantic ergonomics gold standard. Still excellent for non-validation cases.
9.4 Type-checking pipelines
pip install mypy ruff
mypy src/
ruff check src/

Enforce via pre-commit hooks. Don't merge code that fails mypy.
10. Testing strategy for data pipelines
This is where DE-grade Python differs from web-app Python. You're testing transformations on data that's typically too big to fixture.
10.1 The pyramid
UI / E2E (manual or rare)
┌────────────────────────────────┐
│ Integration: end-to-end DAG │ (a few, slow, run on PRs)
│ on a small fixture │
├────────────────────────────────┤
│ Pipeline-step unit tests │ (many, fast)
│ in-memory PySpark/Polars │
├────────────────────────────────┤
│ Pure function unit tests │ (most, ms)
│ on plain Python │
└────────────────────────────────┘
10.2 Pure-function units
# transform.py
def normalize_country(country_raw: str) -> str:
if country_raw is None:
return "UNK"
return country_raw.strip().upper()[:3]
# test_transform.py
def test_normalize_country_basic():
assert normalize_country("united kingdom") == "UNI"
def test_normalize_country_null():
assert normalize_country(None) == "UNK"

10.3 PySpark unit tests with chispa
import pytest
from pyspark.sql import SparkSession
from chispa import assert_df_equality
from my_pipeline.silver import dedupe_events
@pytest.fixture(scope="session")
def spark():
return SparkSession.builder.master("local[2]").appName("test").getOrCreate()
def test_dedupe_keeps_latest(spark):
input_data = [
("u1", "e1", "2026-01-01T10:00:00", 100),
("u1", "e1", "2026-01-01T10:00:01", 200), # later wins
("u2", "e2", "2026-01-01T10:00:00", 50),
]
cols = ["user_id", "event_id", "event_ts", "value"]
df = spark.createDataFrame(input_data, cols)
actual = dedupe_events(df)
expected = spark.createDataFrame([
("u1", "e1", "2026-01-01T10:00:01", 200),
("u2", "e2", "2026-01-01T10:00:00", 50),
], cols)
assert_df_equality(actual, expected, ignore_row_order=True)

chispa (or pyspark-test) gives readable diffs.
10.4 dbt tests
# models/silver/silver_playback_session.yml
models:
- name: silver_playback_session
columns:
- name: session_id
tests:
- unique
- not_null
- name: user_id
tests:
- not_null
- relationships:
to: ref('dim_user')
field: user_id

10.5 Property-based tests with Hypothesis
from hypothesis import given, strategies as st
@given(st.lists(st.text(min_size=0, max_size=20)))
def test_normalize_country_idempotent(countries):
once = [normalize_country(c) for c in countries]
twice = [normalize_country(c) for c in once]
assert once == twice

Hypothesis generates inputs, including edge cases you didn't think of (Unicode, empty strings, BOM characters).
10.6 Data tests vs code tests
Two distinct things:
- Code tests: does the function logic work? (pytest)
- Data tests: does the data in production today satisfy expectations? (Great Expectations, Soda, dbt tests)
You need both. Code tests prevent regressions; data tests catch upstream changes.
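A data test, in miniature — a hand-rolled check_rows (hypothetical) standing in for a Great Expectations or Soda suite, asserting properties of today's rows rather than of the transformation code:

```python
# Minimal "data test": validate live rows against expectations.
def check_rows(rows):
    failures = []
    for i, r in enumerate(rows):
        if r.get("session_id") is None:
            failures.append((i, "session_id is null"))
        if not (0 <= r.get("watch_ms", -1) <= 24 * 3600 * 1000):
            failures.append((i, "watch_ms out of range"))
    return failures

good = [{"session_id": "s1", "watch_ms": 1000}]
bad = [{"session_id": None, "watch_ms": -5}]
assert check_rows(good) == []
assert len(check_rows(bad)) == 2     # both expectations violated
```

A code test would exercise check_rows itself; a data test runs it against production rows on a schedule and alerts on non-empty output.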
11. Packaging and deploying to Spark / Lambda / Airflow
11.1 Project structure
my-pipeline/
├── pyproject.toml
├── README.md
├── src/
│ └── my_pipeline/
│ ├── __init__.py
│ ├── bronze/
│ ├── silver/
│ ├── gold/
│ └── utils/
├── tests/
└── dags/
└── pipeline_dag.py
pyproject.toml (using uv / hatchling):
[project]
name = "my-pipeline"
version = "0.4.2"
requires-python = ">=3.11"
dependencies = [
"pyspark==3.5.1",
"pyarrow>=15",
"pandas>=2.2",
"pydantic>=2.6",
]
[project.optional-dependencies]
dev = ["pytest", "mypy", "ruff", "chispa", "hypothesis"]
11.2 Distributing to Spark
Spark needs the same Python environment on every executor. Four approaches:
- Bake into the cluster image: simplest for stable deps.
- --py-files for small custom code: ship a .zip or .egg. Limited; doesn't handle native deps.
- PYSPARK_DRIVER_PYTHON + virtualenv archive: ship a packed venv.
- conda-pack / venv-pack + --archives: ship a portable env tarball.
venv-pack -o env.tar.gz
spark-submit \
--archives env.tar.gz#environment \
--conf spark.pyspark.driver.python=./environment/bin/python \
--conf spark.pyspark.python=./environment/bin/python \
app.py

11.3 Airflow
Use PythonOperator only for orchestration logic (calling out to Spark, dbt, etc.). Don't run heavy compute in the Airflow worker.
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def trigger_spark(**ctx):
# call EMR / Databricks / whatever
...
with DAG("playback_pipeline", start_date=datetime(2026,1,1), schedule="@daily") as dag:
bronze = PythonOperator(task_id="bronze", python_callable=trigger_spark)

For testable DAGs:
- Keep the DAG file thin (config + ops).
- Move logic into a library (my_pipeline/) that's unit-tested.
11.4 Lambda / serverless
For event-driven enrichment or fan-out work:
- Package as a Lambda layer (deps separate from code).
- Mind cold-start (Polars/Pandas import time is real).
- SnapStart was Java-only for years; it now extends to Python 3.12+ runtimes — check current support for your runtime.
12. Performance debugging toolkit
12.1 Profiling pure Python
import cProfile, pstats
profiler = cProfile.Profile()
profiler.enable()
do_work()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)

For line-level:
pip install line_profiler
kernprof -lv my_script.py # decorate fns with @profile

12.2 Sampling profiler (no code changes)
pip install py-spy
py-spy record -o profile.svg --pid <PID> # flamegraph
py-spy top --pid <PID> # live top
py-spy dump --pid <PID> # what is each thread doing

Works on running processes — perfect for stuck Spark drivers.
12.3 Memory
pip install memray
memray run app.py
memray flamegraph memray-app.bin

Memray hooks malloc and reports allocation flamegraphs.
12.4 Pandas/Polars performance checklist
- Are you in object dtype where you should be Arrow? .dtypes to check.
- Are you using .apply() row-wise? Replace with vectorized ops or .map() with a dict.
- Are you copying when you don't need to? pd.set_option("mode.copy_on_write", True).
- Are you reading more columns than you need? Use usecols= or column projection.
- Are you reading the whole file when you could chunk? chunksize= or lazy scan with predicate push-down.
For any slow Python data job, time goes to one of:
- I/O wait (network, disk read/write)
- Serialization/deserialization (Python ↔︎ JVM, JSON parsing, pickle)
- Pure-Python loop (the GIL on bytecode)
- C extension compute (NumPy/Pandas/Polars internals)
- GC (rare; check with gc.get_stats())
Profile, classify, fix. Repeat.
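The difference between the "pure-Python loop" and "C extension compute" buckets is easy to feel with a stdlib micro-benchmark (illustrative, not rigorous):

```python
import time

data = list(range(1_000_000))

t0 = time.perf_counter()
total_py = 0
for x in data:            # pure-Python loop: one bytecode dispatch per element
    total_py += x
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
total_c = sum(data)       # the same loop, run inside the interpreter's C code
t_builtin = time.perf_counter() - t0

assert total_py == total_c == 499_999_500_000
# On CPython the builtin is typically several times faster; exact ratios vary
# by machine — which is why you profile rather than guess.
```

Same answer, wildly different cost: the classification step is about figuring out which of these two regimes (or I/O, or serde) your job is actually in.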
Closing principle
The best DE Python is mostly not Python — it's NumPy/Arrow/Pandas/Polars/Spark calls with Python orchestration. Your job is to make the data move from one fast engine to another with minimum overhead. Get that right and everything else gets easier.
15. AsyncIO for IO-Bound Pipelines
A data engineer writes plenty of ETL that is bottlenecked on HTTP calls, S3 listings, BigQuery API pagination, and Kafka producer acks — not CPU. For that shape of work, asyncio scales far beyond what threads can handle and is far cheaper than processes.
The mental model
A single thread runs an event loop. Each coroutine yields control at await points. While one coroutine waits on a network socket, the loop runs other coroutines. You get concurrency without threads, without GIL concerns, and with far lower per-task overhead (~1 KB per coroutine vs ~1 MB per thread).
Worked example — S3 key inventory
import asyncio, aioboto3
async def list_prefix(sess, bucket, prefix):
async with sess.client('s3') as s3:
keys = []
paginator = s3.get_paginator('list_objects_v2')
async for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
keys.extend(o['Key'] for o in page.get('Contents', []))
return keys
async def inventory(buckets, prefixes):
sess = aioboto3.Session()
tasks = [list_prefix(sess, b, p) for b in buckets for p in prefixes]
return await asyncio.gather(*tasks)
# Sequential: N*M listings × ~80ms each = minutes
# asyncio: all prefixes in flight at once — wall time ≈ the slowest listing (until S3 throttles)
results = asyncio.run(inventory(['b1','b2'], ['p1','p2','p3']))
Pitfalls
- Don't call synchronous libraries from async code. A blocking requests.get() stalls the entire event loop. Use the async equivalent (aiohttp, httpx).
- Bound concurrency with asyncio.Semaphore. Launching 10,000 coroutines at once will get you rate-limited by the target API and blow up your memory.
- Async is not faster for CPU-bound work. A coroutine doing SHA-256 holds the loop hostage until it finishes. CPU-bound needs multiprocessing or a native library.
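The Semaphore pitfall is easy to verify end-to-end with stdlib only — asyncio.sleep stands in for the network call, and fetch_one/fetch_all are hypothetical names:

```python
import asyncio

async def fetch_one(i, sem, state):
    async with sem:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)        # stand-in for a network call
        state["active"] -= 1
        return i * 2

async def fetch_all(items, concurrency=3):
    sem = asyncio.Semaphore(concurrency)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(*[fetch_one(i, sem, state) for i in items])
    return results, state["peak"]

results, peak = asyncio.run(fetch_all(range(10)))
assert results == [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]  # gather preserves order
assert peak <= 3                         # the semaphore capped in-flight work
```

All ten coroutines are created up front, but the semaphore guarantees at most three are ever inside the "request" at once — the shape you want against a rate-limited API.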
16. DuckDB, Polars, and Arrow — The Zero-Copy Trinity
The last three years have reshaped single-node analytics. Three tools dominate, and they share a common substrate: Apache Arrow columnar format. Senior candidates understand how they compose.
Arrow — the interchange format
Columnar, in-memory, language-agnostic. The key property: Python, Rust, Java, C++ can all read the same memory region without serialization. A 10 GB DataFrame handed from Pandas to DuckDB to Spark via Arrow costs zero CPU for the handoff.
DuckDB — SQL on a laptop
Embedded analytical SQL engine. Runs inside your Python process. Reads Parquet, Arrow, Pandas DataFrames, and remote S3 directly. For datasets that fit in memory or can be streamed, DuckDB is often 10x faster than Spark with zero cluster setup. Ideal for data-quality checks, CI tests, ad-hoc analysis.
Polars — Pandas for the modern era
DataFrame library written in Rust, Arrow-native. Multi-threaded by default, lazy evaluation on expression chains, query optimizer. Typically 5–20x faster than Pandas on equivalent operations, with similar-enough API for reasonable migration. When a Pandas job starts being slow on a single machine, Polars is almost always the right next step before reaching for Spark.
The composition pattern
import duckdb, polars as pl
# Read S3 parquet directly with DuckDB, hand to Polars as Arrow
df = duckdb.sql("""
SELECT region, SUM(amount) AS rev
FROM read_parquet('s3://bucket/orders/*.parquet')
WHERE order_ts >= CURRENT_DATE - INTERVAL 30 DAY
GROUP BY region
""").pl() # zero-copy to Polars
# Enrich in Polars
df = df.join(regions_df, on='region').sort('rev', descending=True)
17. Pydantic and Dataclass Contracts
Data pipelines fail at boundaries. Python gives you three levels of structure for enforcing contracts at those boundaries — pick the right one for the scale.
Level 1 — dataclass
Zero-dependency, type-hint-aware, no runtime validation. Use for internal data structures where static typing (mypy) is enough and runtime validation would be overhead.
from dataclasses import dataclass
@dataclass
class Order:
order_id: int
amount: float
currency: str
Level 2 — pydantic.BaseModel
Runtime validation, coercion, JSON schema generation, great error messages. Use at API boundaries, ingestion layers, config files. The one-line cost is well worth it when the data source is outside your control.
from pydantic import BaseModel, Field, field_validator
class Order(BaseModel):
order_id: int
amount: float = Field(gt=0)
currency: str = Field(pattern=r'^[A-Z]{3}$')
@field_validator('currency')
@classmethod
def known_currency(cls, v):
if v not in {'USD','EUR','GBP','JPY'}: raise ValueError(f'unknown currency {v}')
return v
# Raises ValidationError with a precise path if the JSON violates the contract
order = Order.model_validate(json_payload)
Level 3 — External schema registry
For cross-service contracts, Python validation is not enough — the producer and consumer may be written in different languages. Use Avro or Protobuf with a schema registry (Confluent Schema Registry, Buf). Pydantic is the consumer-side validator; the source of truth lives outside Python.
18. Packaging — pip, Poetry, uv
Python packaging is legendarily messy. Three modern tools cover most of the ground for data teams. Senior candidates have opinions on which to use and why.
pip + requirements.txt
The baseline. Works everywhere. Dependency resolution is weak; no lockfile; easy to drift. Fine for scripts; wrong for production pipelines shared across a team.
Poetry
Real dependency resolver, lockfile, virtualenv management, publishing workflow. The dominant choice in open-source Python for the last five years. Slow on large dependency graphs (2+ minutes to resolve a medium project isn't unusual). Well-understood, well-documented.
uv
Written in Rust. 10–100x faster than Poetry at install and resolve. Drop-in compatible with most pip / Poetry workflows. Still maturing on edge cases but rapidly becoming the default for new projects. Worth naming in interviews as the direction the ecosystem is moving.
The team-level recommendation
- Greenfield Python project: uv.
- Existing Poetry project: stay on Poetry until a specific pain justifies migration.
- One-off notebooks / scripts: pip is still fine.
- Multi-Python-version dev (e.g., testing against 3.10 and 3.12): uv handles this cleanly via its own Python-version manager.
The Docker layering discipline
Regardless of tool, your Dockerfile should install dependencies in a separate layer from your code. Change your code, rebuild is 5 seconds. Change one dependency, rebuild is the full install. Messing up this ordering is a 100x slowdown on every CI run.
FROM python:3.12-slim
WORKDIR /app
# Layer 1: dependencies (changes rarely)
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen
# Layer 2: source code (changes often)
COPY src/ ./src/
CMD ["python", "-m", "src.main"]
Lakehouse: Iceberg & Delta
"A lakehouse table is a directory of immutable files plus a manifest that says which of those files are 'now'. Everything else — ACID, time travel, schema evolution, hidden partitioning — is a side effect of how that manifest is structured."
This chapter unpacks the table format internals: how Iceberg and Delta organize bytes on disk, what makes a snapshot atomic, how MERGE actually works (copy-on-write vs merge-on-read), how compaction is planned, what hidden partitioning gives you, and how time travel and schema evolution are implemented.
Contents
- Why open table formats exist
- Iceberg on disk: the layered metadata
- Delta on disk: the transaction log
- Snapshot isolation: how ACID is achieved on object storage
- Hidden partitioning: Iceberg's killer feature
- MERGE under the hood: COW vs MOR
- Compaction: bin-packing, sorting, Z-ORDER
- Time travel and snapshot expiration
- Schema evolution: column-id semantics
- Delete files: position vs equality
- Iceberg vs Delta vs Hudi — the trade table
- Catalog choices: Hive Metastore, Glue, Nessie, Polaris, Unity
- Operating a lakehouse: gotchas and patterns
1. Why open table formats exist
Pre-lakehouse, you had two camps:
- Data warehouse (Snowflake, BigQuery, Redshift): proprietary storage; ACID, fast queries; expensive; hard to integrate with non-SQL tools (ML).
- Data lake (Parquet on S3 + Hive Metastore): cheap storage, multi-tool, but no ACID, no concurrent writes, no schema evolution beyond Hive's limited model, no transactions, no time travel.
Open table formats — Iceberg (Netflix → Apache 2018), Delta Lake (Databricks 2019), Hudi (Uber 2017) — sit between: a metadata layer over Parquet/ORC files in object storage that gives you:
- ACID transactions
- Snapshot isolation
- Schema evolution
- Time travel
- Hidden partitioning (Iceberg)
- Multi-engine read/write (Spark, Trino, Flink, Snowflake, BigQuery, …)
The data files are still Parquet — query engines read them directly.
2. Iceberg on disk: the layered metadata
The brilliance of Iceberg: a tree of metadata pointers, where every write produces new immutable metadata and a single atomic swap of the "current pointer" commits.
Catalog (HMS / Glue / Nessie):
table: silver.fact_playback
current_metadata_pointer: s3://.../metadata/v00042.metadata.json
s3://bucket/warehouse/silver/fact_playback/
├── data/ # immutable Parquet/ORC files
│ ├── event_date=2026-04-15/
│ │ ├── 00000-1234-5678.parquet
│ │ └── 00001-1234-5678.parquet
│ └── event_date=2026-04-14/
│ └── ...
└── metadata/
├── v00041.metadata.json # previous table state
├── v00042.metadata.json # current — what catalog points to
│ └─ schemas, partition specs, sort orders, snapshots[]
│ └─ each snapshot has manifest_list pointer
├── snap-7843290023-1-...avro # manifest list for snapshot 7843290023
│ └─ rows: each is (manifest_file_path, partitions_summary)
├── 8743-1-...avro # manifest file
│ └─ rows: each is (data_file_path, partition, lower_bounds, upper_bounds, record_count)
└── 8744-1-...avro
2.1 The four metadata levels (top to bottom)
Table metadata (vNN.metadata.json):
- table-uuid, format-version (2 in modern usage)
- schemas[] list (for evolution)
- partition-specs[] (for hidden partitioning evolution)
- sort-orders[]
- snapshots[]
- history
- current-snapshot-id
- properties (compression, format, etc.)

Snapshot (entry in snapshots[]):
- snapshot-id
- parent-snapshot-id
- timestamp-ms
- summary (added/removed records, files)
- manifest-list pointer

Manifest list (snap-NNN-...avro):
- Includes per-partition summaries (lower/upper bounds for each partition column) — used for manifest-level pruning before opening individual manifests.
Manifest (Avro file):
- One row per data file.
- Includes per-column statistics: lower/upper bounds, null count, value count, NaN count.
- File-level filtering happens here.
2.2 What a query does
For SELECT ... WHERE event_date = '2026-04-15' AND user_id = 1234:
- Catalog → current metadata file.
- Read metadata, get current snapshot's manifest-list.
- Read manifest list; filter manifests whose partition summaries don't include 2026-04-15. Most pruned away.
- Read remaining manifests; filter data files whose event_date and user_id bounds don't match. Many pruned away.
- Open and scan the surviving Parquet files; use Parquet's row-group stats and Bloom filters for further row-group pruning.
Every level is a multiplicative pruning step. This is why Iceberg dominates.
2.3 The atomic commit
A commit:
- Write new data files to data/.
- Write new manifest files referring to them.
- Write a new manifest list.
- Write a new metadata.json (vNN+1.metadata.json) with the new snapshot.
- Atomic swap of the catalog pointer from vNN to vNN+1.
The atomic swap is the only operation that needs synchronization. On HMS/Glue: single-row update with optimistic concurrency. On Nessie / Polaris / Unity: branch-aware semantics. On filesystems: rename, with conflict detection.
If two writers race:
- Both build their metadata in parallel.
- One wins the swap.
- The loser's commit fails; it must rebase (re-apply on top of the winner's snapshot) and retry.
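The race can be modeled in a few lines — a toy sketch of the optimistic compare-and-swap, not any real catalog's API:

```python
# Toy model of the optimistic-concurrency commit; real catalogs
# (HMS/Glue/Nessie) implement the compare-and-swap server-side.
class Catalog:
    def __init__(self):
        self.version = 0

    def swap(self, expected, new):
        if self.version != expected:     # someone else committed first
            return False
        self.version = new
        return True

def commit(cat, retries=3):
    for _ in range(retries):
        base = cat.version               # 1. read the start snapshot
        # 2-4. write data files / manifests / metadata (omitted)
        if cat.swap(base, base + 1):     # 5. atomic pointer swap
            return True
        # lost the race → rebase against the new snapshot and retry
    return False

cat = Catalog()
assert commit(cat) and cat.version == 1
cat.version = 7                          # simulate a concurrent winner
assert commit(cat) and cat.version == 8  # rebase + retry succeeds
```

Everything before the swap is conflict-free because it only creates new immutable objects; only the single pointer update needs synchronization.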
3. Delta on disk: the transaction log
Delta uses a different organization — a transaction log of JSON files (then periodic Parquet checkpoints) describes the table state as a sequence of additions/removals.
s3://bucket/warehouse/silver/fact_playback/
├── _delta_log/
│ ├── 00000000000000000000.json # initial commit
│ ├── 00000000000000000001.json # next commit, etc.
│ ├── ...
│ ├── 00000000000000000010.checkpoint.parquet # Parquet checkpoint of state
│ ├── 00000000000000000010.json
│ └── _last_checkpoint # pointer to most recent checkpoint
└── (data files, optionally partition-prefixed)
├── part-00000-abc.snappy.parquet
└── ...
3.1 What's in a JSON commit
{"commitInfo": {...}}
{"protocol": {"minReaderVersion": 1, "minWriterVersion": 4}}
{"metaData": {...schema, partitionColumns, format, properties...}}
{"add": {"path": "part-...parquet", "partitionValues": {...}, "size": 1234567, "stats": "{json}", ...}}
{"add": {"path": "...", ...}}
{"remove": {"path": "old-file.parquet", "deletionTimestamp": ..., ...}}

State at version V = replay all JSON files from 0 to V (or from the last checkpoint to V).
stats JSON contains numRecords, minValues, maxValues, nullCount per column → identical purpose to Iceberg manifest stats.
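The replay rule is small enough to sketch — hypothetical commit payloads, carrying far fewer fields than real Delta actions:

```python
# Each element is one commit (one NNNNNN.json file's actions).
commits = [
    [{"add": {"path": "part-0.parquet"}}],
    [{"add": {"path": "part-1.parquet"}}],
    [{"remove": {"path": "part-0.parquet"}},
     {"add": {"path": "part-2.parquet"}}],
]

def replay(commits, version):
    """Live file set at a version = fold add/remove actions from 0..version."""
    live = set()
    for actions in commits[: version + 1]:
        for a in actions:
            if "add" in a:
                live.add(a["add"]["path"])
            elif "remove" in a:
                live.discard(a["remove"]["path"])
    return live

assert replay(commits, 1) == {"part-0.parquet", "part-1.parquet"}  # time travel
assert replay(commits, 2) == {"part-1.parquet", "part-2.parquet"}  # current
```

Time travel falls out for free: stop the replay at an earlier version. Checkpoints just cache the folded state so the replay doesn't start from zero.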
3.2 Checkpoints
Replaying thousands of JSON files for each query is wasteful. Every 10 commits (configurable), Delta writes a Parquet checkpoint that snapshots the full state. New readers replay from the checkpoint forward.
3.3 Atomic commit
The commit log file (NNNNNN.json) is created with a conditional put (S3 If-None-Match: *). If a file with that version already exists, the writer loses the race and retries.
This requires conditional writes — long supported on Azure Blob, ADLS Gen2, and GCS, and on S3 since 2024. (Before that, Delta on S3 needed DynamoDB-based locking via delta-storage-s3-dynamodb because S3 had no conditional puts; this is no longer required.)
4. Snapshot isolation: how ACID is achieved on object storage
Object storage is the wild west: no rename atomicity, no per-object locks (until conditional writes), eventual consistency historically.
Both Iceberg and Delta achieve serializable isolation through immutability + a single atomic swap point.
4.1 The protocol
- Read your start snapshot: S0 = current_snapshot().
- Compute the changes (which files to add, which to remove).
- Write data + metadata without touching the catalog/log pointer.
- Attempt the atomic swap: change the pointer from S0 to your new snapshot.
- If the pointer is no longer S0 (someone committed in between), conflict: rebase.
- Rebase: re-evaluate whether your changes still apply against the latest snapshot. If safe, retry the swap.
4.2 Conflict resolution rules
For Iceberg/Delta, two writers conflict if they:
- Both modify the same file (one removes a file the other read).
- Both INSERT into the same partition with overlapping logic.
- One writer's MERGE source overlaps with another writer's INSERT.
Conflict-free patterns:
- Two append-only writers to disjoint partitions: never conflict.
- One MERGE writer + one append writer to disjoint files: may not conflict.
- Two MERGE writers to overlapping data: always conflict; one wins, one retries.
4.3 Read-your-writes semantics
A reader gets a consistent snapshot — they see all files included in the snapshot they pinned, none of the files added after. That's snapshot isolation.
Caveat: Iceberg snapshots can be expired (cleaned up by a maintenance job). A long-running reader can fail with "snapshot not found" if expiration runs aggressively. Mitigations: increase retention; pin snapshot-id explicitly; don't run multi-hour queries.
5. Hidden partitioning: Iceberg's killer feature
In Hive-style partitioning, the partition column is part of the directory path (event_date=2026-04-15/...). Queries must filter on the partition column literally:
-- Old Hive trap
SELECT * FROM events WHERE DATE(event_ts) = '2026-04-15';
-- DATE(event_ts) doesn't match the event_date partition column → full scan

Iceberg separates the logical column from the partition transform:
CREATE TABLE silver.events (
event_ts TIMESTAMP,
user_id BIGINT,
...
)
PARTITIONED BY (days(event_ts), bucket(16, user_id));

Now the query:
SELECT * FROM events WHERE event_ts >= '2026-04-15' AND event_ts < '2026-04-16';

Iceberg knows that days(event_ts) is the partition; it computes the matching partition values automatically and prunes. No partition column needed in the query.
5.1 Available transforms
- identity(col): same as the raw value (Hive-style).
- bucket(N, col): hash bucket; distributes writes evenly.
- truncate(W, col): string prefix (first W chars); for ints, the value rounded down to a multiple of W.
- year(ts) / month(ts) / day(ts) / hour(ts): time-based.
5.2 Partition spec evolution
You can change partitioning over time. Old data stays under the old spec; new data uses the new spec; queries handle both transparently.
ALTER TABLE silver.events SET PARTITION SPEC (days(event_ts));
-- evolves from hour(event_ts) to day(event_ts)

Files written before keep the hour partition; new files use day. Iceberg's manifest tracks which spec each file was written under.
6. MERGE under the hood: COW vs MOR
MERGE = upsert + delete + insert in one atomic operation. Both Iceberg and Delta support it. Two execution strategies, with massive performance differences.
6.1 Copy-on-Write (COW) — the original
For each affected data file:
- Read the entire file.
- Apply MERGE changes in memory.
- Write a NEW file with the changed rows.
- Mark the old file as removed in the new snapshot.
Cost: rewriting touched files. If 1% of rows change in a 1 GB file, you rewrite 1 GB. Write amplification: 100×.
Read cost: zero overhead. Each file is a clean Parquet.
Default for most lakehouse tables historically. Best for write-rare, read-heavy workloads.
6.2 Merge-on-Read (MOR) — the streaming-friendly mode
Two strategies for representing the deletes:
- Position deletes: per data file, a delete file listing positions (row_index) of deleted rows.
- Equality deletes: a small delete file listing values (e.g. user_id = 1234) — at read time, matching rows are filtered out.
Updates = delete + insert: position-delete the old row, write a new row.
Cost on write: write the new rows + the delete file. Cheap. Write amplification: 1×.
Cost on read: read data files + apply delete files. Overhead per file with a delete file. Compaction is required periodically to prevent runaway delete-file growth.
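Position deletes at read time, sketched in a few lines (toy data, none of the Parquet/Avro machinery):

```python
# A data file's rows, and a position-delete file: ordinals within that file.
data_file = ["row0", "row1", "row2", "row3", "row4"]
position_deletes = {1, 3}

def read_with_deletes(rows, deleted):
    """MOR read path: drop rows whose position appears in the delete file."""
    return [r for i, r in enumerate(rows) if i not in deleted]

assert read_with_deletes(data_file, position_deletes) == ["row0", "row2", "row4"]
# Write amplification: the (conceptually huge) data file was never rewritten —
# only the tiny delete file was added. The read pays the filtering cost instead.
```

This is the whole COW/MOR trade in miniature: MOR shifts the rewrite cost from write time to read time, which is why compaction must eventually absorb the deletes.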
6.3 Choosing COW vs MOR
| Workload | COW | MOR |
|---|---|---|
| GDPR deletes (rare, scattered) | bad (rewrites lots of data) | good (small delete files) |
| Daily SCD2 merge (5% changed) | OK | good |
| Hourly streaming upsert (always changing) | bad | good |
| Read-heavy, infrequent writes | good | OK |
| Time-series append-only | both fine; COW simpler | — |
Set per-table:
-- Iceberg
ALTER TABLE silver.fact_playback SET TBLPROPERTIES (
'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read',
'write.merge.mode' = 'merge-on-read'
);
-- Delta (deletion vectors are roughly equivalent to position deletes)
ALTER TABLE silver.fact_playback SET TBLPROPERTIES (
'delta.enableDeletionVectors' = 'true'
);

6.4 Delta deletion vectors (the equivalent of MOR)
A bitmap (RoaringBitmap) per data file marking deleted rows. Reads apply the bitmap to skip deleted rows. Writes only touch the bitmap, not the data file.
Trade-offs same as Iceberg position deletes.
6.5 MERGE example
MERGE INTO silver.dim_user t
USING staging.user_updates s
ON t.user_id = s.user_id
WHEN MATCHED AND s.is_deleted THEN DELETE
WHEN MATCHED AND s.hash_diff <> t.hash_diff THEN UPDATE SET
plan = s.plan,
country_id = s.country_id,
updated_ts = current_timestamp()
WHEN NOT MATCHED THEN INSERT (user_id, plan, country_id, updated_ts)
VALUES (s.user_id, s.plan, s.country_id, current_timestamp());

Multi-match ambiguity: by spec, if a single target row matches multiple source rows, the operation is undefined and Iceberg/Delta will throw. Ensure your source has a unique key.
7. Compaction: bin-packing, sorting, Z-ORDER
Streaming writers produce many small files. Updates with MOR produce many delete files. Both kill query performance. Compaction rewrites them periodically.
7.1 Bin-packing (size compaction)
```sql
-- Iceberg
CALL system.rewrite_data_files(
  table => 'silver.fact_playback',
  options => map('target-file-size-bytes', '536870912') -- 512 MB
);

-- Delta
OPTIMIZE silver.fact_playback;
```
Algorithm: pick a partition; pick groups of small files whose total size fits under the target; rewrite each group as one file. The new file appears, the old ones are marked removed. Atomic snapshot.
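The grouping step can be sketched in a few lines of Python — a simplified planner, not the engine's actual algorithm (real implementations also respect partition boundaries and minimum-input-file thresholds):

```python
TARGET = 512 * 1024 * 1024  # 512 MB target file size

def plan_bin_packing(file_sizes: list[int], target: int = TARGET) -> list[list[int]]:
    """Greedily group small files so each rewrite group is close to the target size."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if size >= target:                      # already big enough: leave it alone
            continue
        if current_size + size > target and current:
            groups.append(current)              # flush a full group as one rewrite task
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 100 small 8 MB files collapse into 2 rewrite tasks of ~512 MB each
mb = 1024 * 1024
plan = plan_bin_packing([8 * mb] * 100)
```

Each inner list becomes one rewrite task producing one output file; files already at or above the target are skipped entirely.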
7.2 Sort compaction
```sql
-- Iceberg
CALL system.rewrite_data_files(
  table => 'silver.fact_playback',
  strategy => 'sort',
  sort_order => 'event_ts ASC, user_id ASC'
);
```
Sorts within each output file. Improves Parquet's row-group min/max stats → better zone-map pruning at query time.
7.3 Z-ORDER (Delta + others)
Multi-column locality. The Z-order curve interleaves the bits of two coordinates so that nearby points in N-dimensional space are nearby in linear order.
```sql
-- Delta
OPTIMIZE silver.fact_playback ZORDER BY (user_id, title_id);
```
Result: within a file, rows are clustered in the 2D space of (user_id, title_id). A predicate on either column prunes a large fraction of files.
Math: a plain sort gives full pruning power to the first column and essentially none to the others. Z-order on N columns splits the pruning power roughly evenly: with F files, a predicate on one of the N columns touches on the order of F^((N-1)/N) files instead of all F. For N=2 that's a real win; by N=4 the per-column benefit has shrunk enough to question the column list.
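The bit interleaving itself is mechanical; a minimal 2-column sketch (real engines first map arbitrary column types onto fixed-width unsigned representations before interleaving):

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order (Morton) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies the odd bit positions
    return z

# Points close in 2D stay close in the linear Z-order:
sorted_pts = sorted([(3, 5), (3, 4), (9, 30), (2, 5)], key=lambda p: z_value(*p))
```

Sorting rows by `z_value` before writing is what clusters a file's contents in both dimensions at once, which is why min/max stats on either column stay tight.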
7.4 Manifest compaction
Iceberg maintains many small manifest files when many commits add a few files each. Manifest compaction merges them.
```sql
CALL system.rewrite_manifests('silver.fact_playback');
```
Cheap and runs fast; do it daily for high-write tables.
7.5 Equality / position delete compaction
```sql
-- Iceberg
CALL system.rewrite_position_delete_files('silver.fact_playback');

-- For more aggressive compaction that absorbs deletes into data files:
CALL system.rewrite_data_files(
  table => 'silver.fact_playback',
  options => map('delete-file-threshold', '1') -- rewrite any data file with >= 1 delete
);
```
7.6 Maintenance pattern
Daily / hourly cron:
```sql
-- Compact small files, absorb deletes
CALL system.rewrite_data_files('silver.fact_playback');

-- Compact manifests
CALL system.rewrite_manifests('silver.fact_playback');

-- Expire old snapshots (default keeps 5 days)
CALL system.expire_snapshots(
  table => 'silver.fact_playback',
  older_than => TIMESTAMP '2026-04-09 00:00:00'
);

-- Remove orphan files (data files no longer referenced by any snapshot)
CALL system.remove_orphan_files('silver.fact_playback');
```
8. Time travel and snapshot expiration
Both formats preserve old snapshots until you expire them.
```sql
-- Iceberg: query by snapshot ID or timestamp
SELECT * FROM silver.fact_playback FOR VERSION AS OF 7843290023;
SELECT * FROM silver.fact_playback FOR TIMESTAMP AS OF '2026-04-15 12:00:00';

-- Delta
SELECT * FROM silver.fact_playback VERSION AS OF 42;
SELECT * FROM silver.fact_playback TIMESTAMP AS OF '2026-04-15 12:00:00';
```
8.1 Practical uses
- Audit: "show me what the table looked like before yesterday's bad merge".
- Reproducibility: ML training reading a frozen snapshot ID.
- Rollback: undo a bad write with a single CALL.
- CDC: Iceberg snapshot diff or Delta CDF gives per-snapshot row changes.
8.2 Expiration trade-off
Long retention = huge storage cost (you keep every old version of every file). Short retention = no recovery from incidents.
Common default: 5–7 days. Audit-required tables: longer (90+ days), with cost to match.
```sql
ALTER TABLE silver.fact_playback SET TBLPROPERTIES (
  'history.expire.min-snapshots-to-keep' = '20',
  'history.expire.max-snapshot-age-ms' = '604800000' -- 7 days
);
```
8.3 Rollback
```sql
-- Iceberg
CALL system.rollback_to_snapshot('silver.fact_playback', 7843290023);

-- Delta
RESTORE TABLE silver.fact_playback TO VERSION AS OF 42;
```
Atomic: just swap the current snapshot pointer. The "rolled back" data files reappear (assuming they haven't been expired yet).
9. Schema evolution: column-id semantics
Hive's schema evolution is positional and brittle. Iceberg tracks every column by an immutable column ID, not by name or position.
9.1 What's safe
- Add column: new ID, default value if not in older files.
- Drop column: ID retired; old files keep their data but it's not exposed.
- Rename column: same ID, new name (no file rewrite!).
- Reorder columns: doesn't matter; ID-based reads.
- Promote types: int → long, float → double, decimal precision up. Done in metadata.
9.2 What's not safe (requires rewrite or fails)
- Rename when other engines map by name (some catalogs).
- Change type incompatibly (string → int).
- Change a NOT NULL constraint on existing data with NULLs.
9.3 Why Parquet "just works"
Parquet stores column names, not IDs by default. Iceberg requires writing files with field IDs in Parquet metadata so that reads can map by ID even if names changed. Older readers without ID support fall back to name-based mapping.
```sql
ALTER TABLE silver.events ADD COLUMN device_class STRING;
ALTER TABLE silver.events RENAME COLUMN device_class TO device_type;
ALTER TABLE silver.events DROP COLUMN device_type;
ALTER TABLE silver.events ALTER COLUMN price TYPE decimal(18, 4); -- type promotion
```
All metadata-only operations. No data file rewrite.
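The rename case is the one worth internalizing. A toy resolver (hypothetical structures, not Iceberg's actual classes) shows why ID-based reads survive a rename with zero file rewrites:

```python
# Table schema after a rename: field IDs are stable, names are just labels.
table_schema = {1: "user_id", 2: "device_type"}        # id -> current name

# A data file written *before* the rename carries the old name but the same ID.
old_file_columns = {1: "user_id", 2: "device_class"}   # id -> name at write time
old_file_data = {"user_id": [10, 11], "device_class": ["tv", "phone"]}

def read_column(current_name: str) -> list:
    """Resolve a column by field ID, not by name."""
    field_id = next(i for i, n in table_schema.items() if n == current_name)
    name_in_file = old_file_columns[field_id]   # may differ after a rename
    return old_file_data[name_in_file]

# Reading the renamed column still finds the old file's data:
values = read_column("device_type")   # -> ["tv", "phone"]
```

A name-based (Hive-style) reader would look for "device_type" in the old file, find nothing, and return NULLs — which is exactly the brittleness column IDs eliminate.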
9.4 Delta schema evolution
Similar capabilities via column mapping, which requires newer Delta protocol versions (reader ≥ 2). Older Delta tables (without column mapping) need ALTER TABLE ... SET TBLPROPERTIES('delta.columnMapping.mode' = 'name') to enable rename/drop.
10. Delete files: position vs equality
In Iceberg's MOR mode, two flavors of delete file:
10.1 Position deletes
Per data file, lists row positions (offsets in row order) to skip.
```
delete-file-1.parquet:
file_path                         | pos
----------------------------------+-----
data/event_date=.../00001.parquet | 42
data/event_date=.../00001.parquet | 1003
data/event_date=.../00002.parquet | 7
```
Read-time cost: low; the positions are applied per data file as a row-position bitmap.
10.2 Equality deletes
Lists values that should be deleted.
```
delete-file-1.parquet:
user_id | event_id
--------+----------
1234    | abc
5678    | xyz
```
At read time, every row is checked against the equality predicate.
Read-time cost: higher (requires per-row evaluation). But useful when you don't know exact positions (e.g. CDC stream just says "delete user_id=1234").
10.3 When each is used
- Position deletes: produced by MERGE/UPDATE/DELETE that knows the affected file+row.
- Equality deletes: produced by streaming sinks (Flink) that don't want to read the data file to find positions.
Iceberg can mix both within a snapshot.
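A toy read path showing both flavors applied to one data file (simplified: a real reader builds a per-file position bitmap and pushes equality predicates into the scan, but the cost asymmetry is the same):

```python
# Rows of one data file, in row order (position = list index).
rows = [
    {"user_id": 1234, "event_id": "abc"},
    {"user_id": 5678, "event_id": "def"},
    {"user_id": 9999, "event_id": "ghi"},
]

position_deletes = {0}                  # delete the row at position 0 of this file
equality_deletes = [{"user_id": 5678}]  # delete any row matching these values

def apply_deletes(rows, position_deletes, equality_deletes):
    out = []
    for pos, row in enumerate(rows):
        if pos in position_deletes:          # O(1) per row: a bitmap lookup
            continue
        if any(all(row[k] == v for k, v in eq.items()) for eq in equality_deletes):
            continue                         # per-row predicate evaluation: costlier
        out.append(row)
    return out

live = apply_deletes(rows, position_deletes, equality_deletes)
```

The position check is constant-time per row; the equality check scales with the number of accumulated equality-delete predicates — which is why compaction of equality deletes matters more.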
11. Iceberg vs Delta vs Hudi — the trade table
| Feature | Iceberg | Delta | Hudi |
|---|---|---|---|
| Origin | Netflix | Databricks | Uber |
| Governance | Apache | Linux Foundation (Delta Lake) | Apache |
| Engine support | Spark, Flink, Trino, Snowflake, BigQuery, Athena, ClickHouse | Spark (best), Trino, Flink, Synapse | Spark (best), Trino, Flink |
| Hidden partitioning | yes | no | partial |
| Schema evolution | column IDs | column mapping (opt-in) | yes |
| MERGE strategies | COW + MOR (position + equality deletes) | COW + deletion vectors (position equivalent) | COW + MOR |
| Time travel | yes | yes | yes |
| Branches / tags | yes (with Nessie/Polaris/Iceberg V3) | no (clone is similar) | savepoints |
| Concurrent writers | optimistic; conflict on file overlap | optimistic; conflict on file overlap | optimistic |
| Catalog options | HMS, Glue, REST (Polaris), Nessie, JDBC | Hive, Unity (Databricks), Glue (limited) | HMS |
| Streaming ingestion | Flink, Spark Streaming | Spark Streaming, Flink | Streamer (own tool) |
| Industry traction (2026) | dominant for new builds, Snowflake/BigQuery/Databricks all support | strong on Databricks, Unity Catalog | declining outside Hudi-native shops |
The honest take: Iceberg has won the open standard race. Delta is excellent on Databricks. Hudi is fine if you're already on Hudi.
12. Catalog choices: Hive Metastore, Glue, Nessie, Polaris, Unity
The catalog stores table → current metadata pointer. The catalog choice determines:
- Atomic commit semantics
- Multi-engine interop
- Branch/tag support
- Access control
- Lineage
12.1 Hive Metastore (HMS)
The original. Thrift API, MySQL/Postgres backend. Universally supported. No branches. Limited ACL. Single-region typically.
12.2 AWS Glue Data Catalog
HMS-compatible API, managed, multi-region. Good with EMR / Athena / Redshift Spectrum / Snowflake (via Iceberg integration). No branches.
12.3 Project Nessie
Git-like semantics for catalogs. Branches, tags, commit history per table change. Great for ML reproducibility ("this experiment trained on the prod-2026-04-15 tag").
12.4 Apache Polaris (Snowflake's contributed REST catalog)
Open-source REST catalog implementing Iceberg's REST spec. Multi-engine, RBAC, support for branches / Nessie-style semantics over time. The strongest contender for "the universal Iceberg catalog".
12.5 Unity Catalog (Databricks)
Databricks's metastore with fine-grained ACL, lineage, audit. Native to Databricks; expanding interop via Iceberg REST.
12.6 The pragmatic choice
For Iceberg in 2026: Polaris (REST) or Glue if you're cloud-native; Nessie if you want git-like branching; Unity if you're on Databricks. HMS for legacy.
13. Operating a lakehouse: gotchas and patterns
13.1 The small-files killer
Streaming writers (Flink/Spark Structured Streaming) commit every checkpoint interval. With a 30-second checkpoint and 100 partitions, that's 200 small files per minute → 288K per day. Run compaction.
Mitigations:
- Increase checkpoint interval (1–5 minutes is usually fine).
- Use distribution mode `hash` so each writer covers a single partition.
- Enable Iceberg's auto-compaction (write-side) if available.
- Schedule daily compaction job.
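The arithmetic behind these levers is worth a quick sanity check (assuming one file per writer per checkpoint, as in the pathological case above):

```python
def small_files_per_day(checkpoint_interval_s: int, parallel_writers: int) -> int:
    """Files committed per day if every writer emits one file per checkpoint."""
    commits_per_day = 86_400 // checkpoint_interval_s
    return commits_per_day * parallel_writers

assert small_files_per_day(30, 100) == 288_000    # 30s checkpoints, 100 writers
assert small_files_per_day(300, 100) == 28_800    # 5-min checkpoints: 10x fewer
```

Stretching the checkpoint interval is linear leverage; hash distribution attacks the `parallel_writers` factor instead.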
13.2 The expiration trap
Aggressively expiring snapshots breaks long-running readers. Aggressively expiring orphan files can DELETE files an in-flight reader needed.
Pattern:
- Set `expire_snapshots` `older_than` to "older than the longest possible query" (e.g. 24h).
- Run `remove_orphan_files` with `older_than` even longer (e.g. 72h).
- Never run `remove_orphan_files` with a low threshold — you risk deleting files referenced by a snapshot you're about to expire but haven't yet.
13.3 Concurrent writers conflict spiral
Pattern: stream + batch both writing same table → conflict → batch retries → still conflicts → fails.
Fix:
- Partition by time so stream and batch write to different partitions.
- Or have one writer; use staging tables and merge in a single writer.
- Or use REST catalog with retry-with-backoff.
13.4 Cross-region reads
Iceberg metadata is small; data files are large. If reading from another region, the metadata cost is negligible but data egress is real. Replicate data files with S3 cross-region replication; the catalog can point at the regional bucket; consider per-region copies of the table.
13.5 Auditing / data contracts
Both formats store commit metadata: who, when, what changed. Surface this in your data catalog.
```sql
SELECT *
FROM silver.fact_playback.snapshots
ORDER BY committed_at DESC
LIMIT 10;
```
Iceberg metadata tables (`.snapshots`, `.history`, `.files`, `.partitions`, `.manifests`) are first-class queryable views on the table's state. Use them for monitoring, alerting on anomalous commits, and observability.
13.6 Cost monitoring
S3 LIST and small-file PUT operations are real cost drivers at scale. Monitor:
- Number of files per table per day.
- Average file size.
- Number of manifests.
- Snapshot accumulation.
A healthy table has files in the 128 MB – 1 GB range, manifests merged daily, snapshots expired weekly.
Closing principle
Iceberg/Delta turned object storage into a real database. The cost: you have to operate it like one — compaction, expiration, schema discipline, conflict handling. Get the maintenance jobs right and the lakehouse just works. Skip them and you'll be on a 3am page within 90 days.
16. Iceberg vs Delta vs Hudi — The Feature Matrix
In every lakehouse interview at least one question probes your table-format opinions. The informed answer goes beyond "Iceberg has better engine support" — here's the matrix of real differences.
| Capability | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Snapshot isolation | Manifest-list based | JSON transaction log | Commit timeline + file groups |
| MERGE semantics | Copy-on-write (v1) or merge-on-read (v2) | Copy-on-write default, deletion vectors for MoR | MoR default (merge-on-read with log files) |
| Hidden partitioning | Yes — partition spec evolution | No (Liquid clustering is the answer) | Partial (partition fields declared) |
| Schema evolution | Add, drop, rename, reorder, type-widen — all safe | Add, type-widen only (drop/rename require explicit config) | Add, rename (field IDs) |
| Time travel | Snapshot ID, timestamp | Version, timestamp | Commit timestamp |
| Row-level deletes | Position or equality delete files | Deletion vectors (since 3.0) | Native |
| Engine support | Spark, Trino, Flink, DuckDB, many | Spark (native), Trino (read), others growing | Spark (native), Trino (read) |
| Cross-engine writes | Excellent via REST catalog | Historically Spark-first; improving | Spark-first |
| Streaming ingest optimization | Good (compact-on-write) | Good | Best (MoR designed for upsert streams) |
How to pick — the decision heuristic
- Multiple engines writing the same table? → Iceberg. Cross-engine write semantics are its strongest suit.
- Single-engine Databricks shop? → Delta. Best native integration, newer features land there first.
- High-volume upsert workload (CDC ingest, slowly-changing)? → Hudi. MoR was designed for this.
- Analytics-heavy with occasional updates? → Iceberg or Delta; both fit well.
17. Catalog Architectures — Hive, Glue, REST, Unity, Polaris
The catalog layer determines which engines can write the same table and how metadata is shared across engine boundaries. It's also the single biggest lock-in risk in a lakehouse.
Hive Metastore (HMS)
The original. Stores table metadata in a relational DB (MySQL/Postgres). Thrift API. Good: every query engine in the world speaks it. Bad: single-master, schema-less for table formats, no native ACLs, no multi-tenancy.
AWS Glue
Managed HMS-equivalent. Same API, fewer operational headaches. Lock-in to AWS for metadata. Supports Iceberg and Delta as catalog entries. Fine for AWS-centric shops; painful if you plan a multi-cloud future.
Iceberg REST Catalog
Open spec. HTTP API that any engine can call. Backend-agnostic (can be backed by HMS, Glue, Nessie, Polaris). The direction the industry is moving for vendor-neutral catalogs.
Unity Catalog (Databricks)
Centralized governance across Delta tables, files, ML models, and views. Strong on ACLs, lineage, attribute-based access control. Works best inside Databricks; open-source Unity is catching up on ecosystem support.
Polaris (Snowflake)
Open-source Iceberg REST catalog from Snowflake. Aimed at the same "centralized governance, open engine access" thesis as Unity. Still early but strategically significant.
The migration trap
Switching catalogs is not a metadata-only operation even though it sounds like one. Every engine's configuration must point at the new catalog, every table must be re-registered with matching paths, and any engine-specific extensions (Delta vacuum retention, Iceberg snapshot expiration) must be preserved. A catalog migration is a project, not a weekend. Plan for it the same way you'd plan a table-format migration.
18. Write Amplification and Compaction Strategy
Copy-on-write tables pay a cost each time a partition is updated: the entire partition gets rewritten. That cost is called write amplification — the ratio of bytes rewritten to bytes actually changed.
Quantifying write amplification
Suppose a 1 TB daily partition is written once then a MERGE statement touches 0.1% of rows the next day. Copy-on-write rewrites the whole 1 TB partition to land 1 GB of changes. Write amplification = 1000x. At 30 such MERGEs per day, you're writing 30 TB to land 30 GB of logical changes.
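The same arithmetic, written out (decimal units; the figures are the ones from the paragraph above):

```python
def write_amplification(partition_bytes: float, changed_bytes: float) -> float:
    """Bytes physically rewritten per byte logically changed (copy-on-write)."""
    return partition_bytes / changed_bytes

TB, GB = 10**12, 10**9

wa = write_amplification(1 * TB, 1 * GB)   # whole 1 TB partition rewritten for 1 GB
daily_physical_tb = 30 * (1 * TB) / TB     # 30 such MERGEs/day -> 30 TB written...
daily_logical_gb = 30 * (1 * GB) / GB      # ...to land 30 GB of logical changes
```

At a write amplification of 1000x, compute and I/O spend is dominated by rewriting unchanged bytes — the motivation for merge-on-read.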
Merge-on-read as the fix
Delta's deletion vectors and Iceberg's v2 delete-files let you write a small side-file containing "these rows are deleted / updated to X." Readers apply the overlay at query time. Write cost collapses. Read cost rises — by how much depends on how many overlays have accumulated. That's where compaction comes in.
Compaction strategies
- Bin-pack. Merge many small files into fewer larger ones. Cheapest form of compaction; doesn't touch overlays.
- Sort. Rewrite with a clustering key (`ZORDER` in Delta, sort-order in Iceberg). Improves read performance by improving min/max pruning and compression. More expensive to run.
- Delete-vector application. Materialize deletions into the main data files. Reclaims the read-time overlay cost. Run periodically (weekly / monthly) based on overlay ratio.
Operational rule of thumb
If your read queries on a table are getting slower over time without the data growing, it's probably compaction debt. Monitor the ratio of delete-file bytes to data-file bytes; when it crosses ~5–10%, schedule compaction.
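That rule of thumb is a one-liner once you have the byte totals — in practice pulled from Iceberg's `.files` metadata table or the Delta log; the numbers below are illustrative:

```python
def needs_compaction(data_file_bytes: int, delete_file_bytes: int,
                     threshold: float = 0.05) -> bool:
    """Flag compaction when the delete-overlay ratio crosses the threshold (~5%)."""
    if data_file_bytes == 0:
        return False
    return delete_file_bytes / data_file_bytes >= threshold

assert not needs_compaction(1_000_000_000, 20_000_000)   # 2% overlay: healthy
assert needs_compaction(1_000_000_000, 80_000_000)       # 8% overlay: schedule it
```

Run this per table (or per partition) in the same cron that does bin-packing, and alert rather than silently compact if the ratio spikes suddenly — a spike usually means a misbehaving writer.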
19. Multi-Table Transactions and the Two-Phase Commit
Lakehouse formats provide atomic commits per table. They don't provide atomic commits across tables by default. If your pipeline updates a fact and a dimension that must move together, you need to solve this yourself.
The problem, stated precisely
Pipeline writes fct_order and dim_customer. If fct_order commits but dim_customer fails, consumer queries see new orders referencing missing customer keys. Data corruption.
Three mitigation strategies
- Staging table pattern. Write to `staging_*` tables. Once both staging writes succeed, do a "rename" commit — point the final-table catalog entries at the staging paths atomically. Works; requires careful metadata choreography; error recovery is tricky.
- Two-phase commit via external coordinator. Use a transaction coordinator (ZooKeeper, a custom service, or an orchestrator checkpoint). Prepare writes to both tables; commit only when both prepares succeed, otherwise roll both back. Heavy; rarely worth the complexity.
- Ordering + idempotent consumers. Pick an ordering where the "most dangerous" side is committed last. If dim is committed before fact, consumers can handle missing dim keys gracefully (inferred members). Accept at-least-once committing; downstream consumers tolerate it. This is the most common production pattern.
Iceberg's multi-table transaction feature
Iceberg (as of v1.4+) supports a Transactions API that commits against multiple tables atomically if the catalog supports it. REST Catalog does; Glue doesn't. When supported, it eliminates the strategies above for its scope. Still worth understanding the fallbacks because catalog support is uneven.
Interview Q&A — Real Scenarios
"The point of these questions isn't 'do you know the answer.' It's 'can you reason out loud, ask the right clarifying questions, and stop when you've answered enough.'"
These are scenarios drawn from real Senior / L5 Data Engineering loops — Netflix, Stripe, Airbnb, Pinterest, Uber, Meta, DoorDash. No leetcode. No "reverse this binary tree." Every scenario is something a real engineer faced at 3am or in a design review.
The format for each:
- Scenario — what they ask.
- What they're testing — the underlying skill.
- Answer skeleton — how a strong candidate structures the answer.
- What weak candidates miss.
- Bonus / follow-ups.
Contents
Incident response (page-at-3am)
- The pipeline missed SLA — diagnose
- The dashboard shows half the revenue it did yesterday
- Streaming consumer lag is climbing and won't drain
- Iceberg table reads are 10× slower this week
- Spark job OOMs only on Mondays
- Late-arriving data corrupted yesterday's report
System design
- Design a clickstream pipeline at 1M events/sec
- Design a feature store for ML serving
- Design point-in-time correct training data
- Design the architecture for a daily metric that must be 100% accurate
- Design SCD Type 2 ingestion from Kafka CDC
- Design a multi-region data warehouse
- Design exactly-once for a payments-counting pipeline
Deep internals (gotchas)
- Explain how Spark decides BHJ vs SMJ at runtime
- Why is your shuffle slow and what can you do
- Why doesn't this filter push down
- Walk me through what happens during a Flink checkpoint
- Walk me through an Iceberg commit, end-to-end
- How does a watermark form across a Kafka topic with 12 partitions
- Why does my EXACTLY_ONCE Kafka producer still produce duplicates downstream
Modeling judgement
- Star schema vs OBT — when each
- Should this dimension be SCD2 or SCD1
- Should we persist this Silver model or rebuild from Bronze
- Do we partition by user_id or by date
- How do you handle 'soft deletes' in a fact table
- The PM asked for 'real-time' — what do you ask back
Engineering quality
- How do you test a Spark transformation
- How do you backfill safely
- How do you design a data contract
- How do you measure pipeline quality
- How do you do schema evolution without breaking consumers
- What's your CI/CD for a data warehouse
Trade-offs and judgment
- Latency vs cost vs correctness — pick two
- When would you choose a row-store for analytics
- When would you NOT use Iceberg
- When is Lambda architecture justified in 2026
- Polars or Spark — when each
- Build vs buy: orchestration, lineage, quality, catalog
Behavioural / leadership
- Tell me about a time you said no to a stakeholder
- Tell me about a time a pipeline you owned caused a bad metric
- How do you decide what NOT to build
- How do you onboard the next engineer
Incident: The pipeline missed SLA — diagnose
What they're testing: triage methodology under uncertainty.
Answer skeleton:
- What's the SLA, what's the actual completion time, by how much did it slip? "Missed SLA" might be 2 minutes or 6 hours — the answer is different.
- What's the symptom: late, failed, or partial?
- Look at the orchestrator's view first: which task in the DAG is slow/failing? That narrows the surface area immediately.
- For the slow task: drill into the platform's UI (Spark UI, Flink UI, dbt logs, query history). Look at:
- Stage times (which stage spent the most time?)
- Task time distribution (skew?)
- Data volume read/written (data spike?)
- GC time (memory pressure?)
- Compare to baseline: was yesterday fine? If yes, what changed? Recent code deploy, infra change, upstream data volume change?
- Hypothesis → fix → verify. Don't fix without a hypothesis.
- Communicate: post in #data-incident, set ETA, update if it slips.
Common mistakes: jumping to "it's the cluster size, scale up" before understanding what changed. Fixing in production without a backout plan.
Bonus: prevention — what observability would have caught this earlier? (Volume monitoring on the upstream Kafka topic, partition skew alerts, anomaly detection on per-stage runtime.)
Incident: The dashboard shows half the revenue it did yesterday
What they're testing: data correctness debugging.
Answer skeleton:
- Confirm the symptom: is it the dashboard's filter, time zone, query? Check the dashboard's underlying SQL.
- Check the source-of-truth tables: query directly with the same logic, see if numbers match. If yes → BI/dashboard issue. If no → data issue.
- For a data issue, walk upstream:
- Did the gold model run? (orchestrator)
- Did silver run? Did it produce expected row counts?
- Did bronze ingestion complete?
- Did the source produce normal volume? (Kafka topic lag, source DB row counts)
- Most likely culprits in order: (a) duplicate-suppression bug now over-suppressing; (b) join condition newly produces no match for a category; (c) upstream schema change dropped a column to NULL; (d) timezone shift; (e) partial data due to a missed late arrival.
- Reproduce in a notebook: bisect by date and category to isolate. "It dropped at 6pm UTC and only for SUBSCRIPTION revenue" tells you a lot.
- Hot-fix the metric (if possible), file an incident, do a postmortem.
Common mistakes: guessing at causes; fixing the dashboard query without finding the root cause; not checking row counts at every layer.
Incident: Streaming consumer lag is climbing and won't drain
What they're testing: streaming systems debugging.
Answer skeleton:
- Quantify: which topic, which consumer group, how much lag, growing how fast (msg/sec)? Per-partition or uniform?
- Check upstream: is the producer rate higher than usual? Spike, sustained, or normal?
- Check downstream: if the consumer writes to a sink (DB, S3), is the sink the bottleneck? Sink latency, throttling errors?
- Check the consumer process: CPU? Memory? GC? Look at Flink/Spark UI: backpressure indicators, busy operators, checkpoint durations.
- Check skew: is one partition's lag 10× the others? → key skew. Salt the key, repartition with a different partitioner, or fix the hot key at the producer.
- Check rebalances: storms of group rebalances cause processing pauses. Look at consumer group state changes.
- Triage actions in order:
- Add more consumers / parallelism (if uniform load).
- Bypass DLQ for known-bad messages (if poison pill).
- Pause non-critical sinks to give CPU back.
- As a last resort, increase the Kafka partition count (triggers rebalances, changes the key → partition mapping, and is effectively irreversible).
- Don't reset offsets unless you know what you're doing — risk of data loss or duplication.
Common mistakes: blindly scaling up consumers (won't help if downstream is the bottleneck); resetting offsets without consequences understood.
Bonus: how does Flink's credit-based backpressure manifest? (Upstream operator slows down because downstream stops issuing credits — backPressuredTimeMsPerSecond metric.)
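The skew check above is mechanical once you have per-partition lag — the numbers below are illustrative; in practice they come from `kafka-consumer-groups --describe` or a lag exporter:

```python
from statistics import median

def skewed_partitions(lag_by_partition: dict[int, int],
                      factor: float = 10.0) -> list[int]:
    """Partitions whose lag exceeds `factor` x the median lag -> likely key skew."""
    med = median(lag_by_partition.values())
    return [p for p, lag in lag_by_partition.items()
            if med > 0 and lag > factor * med]

lags = {0: 1_200, 1: 900, 2: 1_100, 3: 450_000}   # partition 3 holds the hot key
hot = skewed_partitions(lags)
```

A positive result here changes the triage path: more consumers won't help, because one partition is the bottleneck regardless of group size.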
Incident: Iceberg table reads are 10× slower this week
What they're testing: lakehouse operations.
Answer skeleton:
- Check the file count and average file size: query the metadata table.
```sql
SELECT COUNT(*)                AS file_count,
       AVG(file_size_in_bytes) AS avg_size,
       SUM(file_size_in_bytes) AS total_size
FROM silver.fact_playback.files
WHERE partition_event_date = DATE '2026-04-15';
```
- Check the manifest count: lots of small manifests = lots of LIST/GET on metadata.
- Most likely: streaming writer started running every minute, producing 1440 small files per partition per day. Reads must open all of them.
- Fix: schedule compaction.
```sql
CALL system.rewrite_data_files('silver.fact_playback');
CALL system.rewrite_manifests('silver.fact_playback');
```
- Long-term: enable async compaction at write time, or reduce streaming commit frequency.
Common mistakes: blaming the query engine; not checking metadata.
Incident: Spark job OOMs only on Mondays
What they're testing: data-volume-aware debugging.
Answer skeleton:
- What's special about Mondays: typically aggregating a weekend's worth of data → 3× the volume.
- Where does it OOM: container OOM (exit 137) or JVM heap OOM? They have different fixes.
- Container OOM: Pandas UDF or Arrow buffer overflow → bump `spark.executor.memoryOverhead`.
- JVM OOM in shuffle/join: a partition that fits Tue–Sun no longer fits Mon → AQE skew handling, salting, raising the broadcast threshold, or splitting heavy hitters.
- JVM OOM in a Pandas UDF aggregation: the Python worker explodes on a single mega-group → check group sizes, rewrite using SQL aggregations or `applyInPandas` with smaller groups.
- Long-term fix: don't let runtime scale linearly with data; use incremental / windowed aggregation.
Common mistakes: increasing executor memory without diagnosing whether it's heap or overhead.
Incident: Late-arriving data corrupted yesterday's report
What they're testing: event-time vs processing-time understanding.
Answer skeleton:
- Confirm: was yesterday's report finalized at "watermark close" or "processing-time end-of-day"? Different bugs each.
- If watermark-based: late events past the allowed lateness were dropped. Configure higher allowed lateness, or accept that some events will appear in tomorrow's report instead.
- If processing-time: yesterday included only events that arrived yesterday, not events whose event_time was yesterday. Switch the report to event-time semantics.
- For lakehouse: re-process the affected day with a backfill that overwrites only the affected partition. Use `INSERT OVERWRITE`, `replaceWhere`, or MERGE.
- Communicate: data was corrected at T+24h; explain the trade-off (you can't have low-latency AND complete data without retraction).
Common mistakes: ignoring the event-time/processing-time distinction; reporting on stream output without watermark discipline.
Design: a clickstream pipeline at 1M events/sec
What they're testing: end-to-end systems thinking, capacity planning.
Answer skeleton:
- Clarify: 1M events/sec average or peak? Event size (bytes)? Latency requirement (sec, min, hour)? Use cases (real-time dashboard, ML feature, batch analytics)? Retention (days, years)?
- Capacity math: 1M × 1KB = 1 GB/sec ingress = 86 TB/day. Storage: 30 days = 2.6 PB raw; with Snappy compression / Parquet → ~600 TB.
- Architecture (sketch):
```
Edge SDK → API Gateway → Kafka (topic, ~100 partitions)
  ├─ Real-time: Flink → Druid/Pinot for sub-second dashboard
  ├─ Near-real-time: Flink → Iceberg (1-min commit) → Trino for ad-hoc
  └─ Batch enrichment: Spark daily → silver/gold Iceberg tables
```
- Kafka design: 100 partitions × 3 replicas × 7-day retention; partition by user_id (most queries are per-user) or random (avoid skew). Consider tiered storage.
- Schemas: Avro/Protobuf with Schema Registry. Backward + forward compat policy.
- Failure modes:
- Edge SDK can't reach gateway → local buffer + retry with backoff.
- Kafka unavailable → producer queue → DLQ.
- Flink job dies → exactly-once via checkpoints + transactional Kafka writes.
- Iceberg commit conflict → REST catalog with retry.
- Cost: Kafka (compute + storage) > Flink (CPU) > S3 (storage); aim for 70/20/10 distribution.
- Monitoring: lag per consumer group, SLO per stage, p99 latency, schema validation failures.
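The capacity math from step 2, written out (assumptions stated inline: 1 KB average event, 30-day retention, ~4× columnar compression):

```python
EVENTS_PER_SEC = 1_000_000
EVENT_BYTES = 1_000        # assumed 1 KB average payload -- say this out loud
COMPRESSION = 4            # assumed Parquet + Snappy ratio

ingress_gb_per_sec = EVENTS_PER_SEC * EVENT_BYTES / 1e9   # 1.0 GB/s ingress
daily_tb = ingress_gb_per_sec * 86_400 / 1_000            # ~86.4 TB/day raw
raw_30d_pb = daily_tb * 30 / 1_000                        # ~2.6 PB raw, 30 days
compressed_30d_tb = daily_tb * 30 / COMPRESSION           # ~650 TB columnar
```

Doing this on the whiteboard before drawing boxes is what separates a capacity-planned design from a logo diagram — every downstream choice (partition count, Flink parallelism, S3 layout) falls out of these four numbers.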
What weak candidates miss: capacity math; Kafka partition design; the difference between "real-time dashboard" and "real-time data lake".
Bonus: how do you handle hot keys? (Salting the partition key but joining back later, or pre-aggregating before partitioning.)
Design: a feature store for ML serving
What they're testing: balance of online + offline systems.
Answer skeleton:
- Clarify: how many features, how many models, online QPS, offline volume, freshness SLA per feature.
- Two-tier architecture:
- Offline store (Iceberg/Delta on S3): feature values over time, used for training.
- Online store (DynamoDB / Redis / ScyllaDB): latest values per entity, low-latency lookup.
- Materialization:
- Batch features (daily aggregations) computed in Spark, written to both stores.
- Streaming features (last 5 minutes) computed in Flink, written to both.
- Same definition, two paths — risk of skew. Use a single feature definition (Tecton-like, or your own DSL).
- Point-in-time correctness: training data must use the feature's value as it was at the prediction time, not as it is now. Implement an as-of join (see chapter 05).
- Lineage and discovery: catalog of feature definitions, owners, freshness, schema, tests.
- Failure modes: online store stale (alert), offline/online skew (compare hourly), feature drift (statistical tests).
What weak candidates miss: point-in-time joins for training; the offline/online consistency problem.
Design: point-in-time correct training data
See as-of joins and the feature-store design above. Key principle: every feature value has a valid_from/valid_to; every label has an event_ts; the join takes the latest feature value where valid_from <= event_ts. Implement with BETWEEN join + LATERAL or DuckDB ASOF JOIN.
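The lookup half of an as-of join can be sketched with `bisect` (toy data; production versions use the SQL patterns referenced above or DuckDB's ASOF JOIN):

```python
from bisect import bisect_right

# Feature history per entity: (valid_from_ts, value), sorted by valid_from_ts.
feature_history = {
    "user_1": [(100, 0.2), (200, 0.5), (300, 0.9)],
}

def asof_lookup(entity: str, event_ts: int):
    """Latest feature value whose valid_from <= event_ts, else None."""
    history = feature_history.get(entity, [])
    idx = bisect_right(history, (event_ts, float("inf"))) - 1
    return history[idx][1] if idx >= 0 else None

# A label at ts=250 must see the value that was live at 250, not today's value:
assert asof_lookup("user_1", 250) == 0.5
assert asof_lookup("user_1", 99) is None   # no feature existed yet: no leakage
```

The second assertion is the leakage guard: if no feature value existed at the label's event time, the training row gets NULL, never a future value.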
Design: a daily metric that must be 100% accurate
What they're testing: when to give up on streaming.
Answer skeleton:
- Clarify "100%": matched to a system of record (payments DB)? Eventually consistent (T+1)?
- If matched to system of record: don't stream. Run a batch ETL after the source system has closed the day (T+24h). Reconcile against the SOR with a tolerance check (e.g., difference < $1).
- Architecture:
- Source DB → daily Snowflake/BigQuery export at end of day.
- Reconcile total revenue from source vs total in warehouse.
- If diff > tolerance, halt downstream pipelines and alert.
- Only after reconciliation succeeds, publish the metric.
- Why not stream? Streaming has bounded out-of-orderness; late events change the answer. For audit-grade metrics, accept latency.
- Hybrid: real-time approximate + daily authoritative. Make sure consumers know which they're reading.
What weak candidates miss: the reconciliation step; understanding that streaming results are provisional for audit-grade metrics — late data can always revise them.
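The reconciliation gate can be sketched as (a minimal illustration; the `reconcile` helper and the $1 tolerance follow the example above and are not a real API):

```python
def reconcile(source_total, warehouse_total, tolerance=1.00):
    """Publish the metric only if the warehouse matches the system of record
    within tolerance; otherwise halt downstream pipelines and alert."""
    diff = abs(source_total - warehouse_total)
    return {"diff": round(diff, 2), "publish": diff <= tolerance}

print(reconcile(1_000_000.00, 1_000_000.50))  # within $1 → publish
print(reconcile(1_000_000.00, 1_000_500.00))  # $500 off → halt and alert
```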
Design: SCD Type 2 ingestion from Kafka CDC
What they're testing: streaming + dimensional modeling joined.
Answer skeleton:
- Source: Debezium → Kafka topic with one envelope per change: {op: c|u|d, before, after, source: {ts_ms, snapshot}}.
- Sink: Iceberg dim table with valid_from, valid_to, is_current.
- Pattern A: streaming MERGE per micro-batch (Spark Structured Streaming):
- Each batch: deduplicate by user_id keeping latest LSN/ts_ms.
- MERGE INTO dim:
- WHEN MATCHED AND hash differs: close current row (valid_to = batch_ts, is_current = false), insert new row.
- WHEN NOT MATCHED: insert new row.
- WHEN MATCHED AND op = 'd': close current row, optionally insert tombstone.
- Pattern B: Flink streaming with state:
- Keyed by user_id, store last-seen hash + open-row pointer.
- Emit two records per change: a "close" + an "open" — written via Iceberg sink with MERGE/upsert semantics.
- Idempotency: at-least-once Kafka delivery means dupes possible. The MERGE must be idempotent: deduping by (user_id, source_ts_ms) before MERGE.
- Late events: out-of-order CDC is rare per partition (Debezium preserves order per source row), but can happen across partitions. Reject events with source_ts_ms older than the dim's current row.
What weak candidates miss: dedup before merge; handling deletes; out-of-order tolerance.
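The dedup-before-MERGE step can be sketched in pure Python (a stand-in for the per-batch logic; field names follow the Debezium envelope above):

```python
def dedupe_latest(batch):
    """Collapse an at-least-once CDC micro-batch to one change per key,
    keeping the latest by source timestamp (Debezium's ts_ms)."""
    latest = {}
    for ev in batch:
        cur = latest.get(ev["user_id"])
        if cur is None or ev["ts_ms"] >= cur["ts_ms"]:
            latest[ev["user_id"]] = ev
    return sorted(latest.values(), key=lambda e: e["user_id"])

batch = [
    {"user_id": 1, "ts_ms": 100, "op": "u"},
    {"user_id": 1, "ts_ms": 100, "op": "u"},  # duplicate delivery
    {"user_id": 1, "ts_ms": 120, "op": "u"},  # later change wins
    {"user_id": 2, "ts_ms": 90,  "op": "c"},
]
print(dedupe_latest(batch))  # one row per user_id; user 1 at ts_ms=120
```

In Spark this would be a window over user_id ordered by ts_ms descending, keeping row_number() = 1 — same idea, distributed.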
Design: a multi-region data warehouse
What they're testing: distributed system trade-offs at the storage layer.
Answer skeleton:
- What's the goal: latency for queries near users, DR, regulatory data residency?
- Pattern A: single global warehouse, regional caches — easy if latency is OK; one source of truth.
- Pattern B: per-region warehouses, eventual sync — low latency reads, complex consistency. Each region writes locally; replication daily or streamed via CDC.
- Pattern C: per-region storage of regional data, federated query — complies with data residency; cross-region queries are slow.
- Conflict handling for writes: avoid by partitioning ownership (US users write US warehouse). Last-writer-wins with timestamps. Or use Iceberg with REST catalog and replicate snapshots forward.
- Cross-region S3 replication for backup; understand cost (egress is expensive).
- Catalog: one global catalog (e.g., Glue replicated, or REST catalog with HA) or per-region catalogs synced. The global catalog is simpler to reason about.
What weak candidates miss: data residency/regulatory; cost of cross-region egress; per-region writability vs read-only replicas.
Design: exactly-once for a payments-counting pipeline
What they're testing: depth on exactly-once semantics.
Answer skeleton:
See chapter 03, sections 7 and 14. The five questions:
- Source: replayable with deterministic offsets? Kafka, yes. HTTP webhook, no — need a buffer.
- State: durable across failures? Use Flink's keyed state on RocksDB with checkpoints; or Spark structured streaming with checkpoint location.
- Sink: idempotent or transactional?
- Transactional: Kafka EOS (transactional producer), JDBC with XA.
- Idempotent: writing to a sink keyed by (payment_id, event_id) so dupes upsert harmlessly.
- Effective once at consumer: consumers must read transactionally (isolation.level=read_committed).
- End-to-end: even with EOS, downstream consumers can dedupe on a unique business key as belt-and-suspenders.
Architecture:
Payment events → Kafka (transactional producer) → Flink (checkpoints, EOS sink) → Iceberg
↓
Side output → Audit log
What weak candidates miss: thinking exactly-once means "no dupes ever". It really means "exactly-once effect on the durable state". The right framing is idempotent sink + transactional or replayable source.
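The "idempotent sink" framing can be shown with a toy sketch — an upsert keyed by the business key, so a replayed event changes nothing (class and names are illustrative):

```python
class IdempotentSink:
    """Upsert keyed by a business key: replays overwrite, never double-count."""
    def __init__(self):
        self.rows = {}

    def write(self, payment_id, amount):
        self.rows[payment_id] = amount  # upsert: a duplicate replay is a no-op

    def total(self):
        return sum(self.rows.values())

sink = IdempotentSink()
for pid, amt in [("p1", 10), ("p2", 5), ("p1", 10)]:  # p1 replayed after restart
    sink.write(pid, amt)
print(sink.total())  # → 15, not 25
```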
Internals: How does Spark decide BHJ vs SMJ at runtime?
See chapter 04, sections 4 and 6. Key points:
- Plan time: Catalyst's planner chooses BHJ if (a) user hint, or (b) one side estimated < autoBroadcastJoinThreshold. Otherwise SMJ.
- Runtime (AQE): after the shuffle, AQE re-evaluates. If the post-shuffle map output of one side fits the threshold, AQE converts to BroadcastHashJoin using the shuffled output as broadcast — this is Dynamic Join Selection.
- Skew handling: AQE further detects skewed partitions in SMJ and splits them.
A senior should mention: estimates can be wildly wrong → why AQE matters; broadcast can fail at runtime if the side is bigger than the driver can hold.
Internals: Why is your shuffle slow and what can you do?
See chapter 04, sections 5 and 7.
Reasons in order of likelihood:
- Too many partitions → small files, fetch overhead. AQE coalesces.
- Skew → one task is the long pole. AQE skew handling, salting.
- Disk → ESS disk full or slow. Push-based shuffle (Magnet) helps.
- Network → cross-AZ traffic, slow NICs. Co-locate where possible.
- Wrong join strategy → SMJ when BHJ would have eliminated the shuffle entirely.
Fix in order: enable AQE, increase advisory partition size, enable push-based shuffle, broadcast where possible, remove the shuffle entirely (pre-bucket the table).
Internals: Why doesn't this filter push down?
See chapter 04, section 2.4. Top reasons:
- UDF on the column (Catalyst is conservative).
- Function on the column: WHERE date(ts) = ... — doesn't push.
- Cast type mismatch (string column compared against int).
- Window/aggregate sits between Filter and Scan.
- Cache (.cache()) above the filter.
How to verify: df.explain(mode="formatted") and look for PushedFilters in the FileScan.
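For the `WHERE date(ts) = ...` case, the standard fix is rewriting the predicate as a half-open range on the raw column so the scan can prune. A sketch of the bound math (the helper name is illustrative):

```python
from datetime import date, datetime, timedelta

def day_bounds(d: date):
    """Rewrite `WHERE date(ts) = :d` as a sargable half-open range on ts:
    lo <= ts < hi, which pushes down to the scan where date(ts) = :d cannot."""
    lo = datetime(d.year, d.month, d.day)
    return lo, lo + timedelta(days=1)

lo, hi = day_bounds(date(2026, 4, 20))
ts = datetime(2026, 4, 20, 23, 59, 59)
print(lo <= ts < hi)  # → True: last second of the day is still inside
```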
Internals: Walk me through what happens during a Flink checkpoint
See chapter 03, section 9. Crisp version:
- JobManager triggers a checkpoint and broadcasts a checkpoint barrier to every source.
- Each source emits the barrier into its output stream (between data records).
- Operators receive barriers on each input. With aligned checkpoints, they wait for barriers on all inputs (buffering); with unaligned, they snapshot in-flight buffers immediately.
- When all barriers are received, the operator snapshots its state to RocksDB → uploads to S3 (incremental: only new SSTables since last checkpoint).
- JobManager collects acks; once all operators ack, checkpoint metadata is written and the checkpoint is "complete".
- On failure, all operators restart from the last completed checkpoint and replay from the offsets stored in source state.
Bonus mention: incremental checkpoints (RocksDB SSTable diffs), savepoints (manual + format-stable), exactly-once requires sink commits to be tied to checkpoint completion (two-phase commit).
Internals: Walk me through an Iceberg commit, end-to-end
See chapter 07, sections 2 and 4. Crisp version:
- Writer reads current snapshot (S0) from catalog.
- Writes new Parquet data files to data/.
- Writes new manifest file(s) referring to the data files.
- Writes a new manifest list combining new manifests with existing ones from S0.
- Writes a new metadata.json (v00043.metadata.json) that records the new snapshot S1 with its manifest list, parent = S0.
- Atomic compare-and-swap: ask the catalog to update the table pointer from v00042 to v00043. If the catalog says "still at v00042" → success. If "now at v00043" (someone else won) → conflict.
- On conflict: discard the new metadata.json (the data files are orphaned, but that's fine for now), refresh to the latest snapshot, re-evaluate the changes, retry from step 2 (or just step 5 if the changes still apply).
- Periodically, an orphan-file cleanup removes data files not referenced by any reachable snapshot.
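The commit protocol reduces to a compare-and-swap retry loop; a toy sketch (the `Catalog` class is a stand-in, not the Iceberg API):

```python
class Catalog:
    """Toy catalog: the table pointer only advances via compare-and-swap."""
    def __init__(self, version=42):
        self.version = version

    def cas(self, expected, new):
        if self.version != expected:
            return False  # someone else committed first → conflict
        self.version = new
        return True

def commit(catalog, max_retries=3):
    for _ in range(max_retries):
        base = catalog.version             # step 1: read current snapshot
        # steps 2–4 would write data files, manifests, metadata.json here
        if catalog.cas(base, base + 1):    # step 5: atomic pointer swap
            return base + 1
        # conflict: refresh to the latest snapshot and retry
    raise RuntimeError("commit failed after retries")

print(commit(Catalog(42)))  # → 43
```

The point of the sketch: all the heavy work (data files, manifests) happens outside the critical section; only the pointer swap must be atomic.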
Internals: How does a watermark form across a Kafka topic with 12 partitions?
See chapter 03, section 3. Key points:
- Each Kafka partition has its own watermark (computed from the timestamps of events in that partition with WatermarkStrategy.forBoundedOutOfOrderness(...)).
- The Flink source operator's watermark is the minimum of its partition watermarks.
- Downstream operators receive watermarks from each upstream channel; their effective watermark is the min of all incoming watermarks.
- An idle partition can stall the global watermark forever. Use withIdleness(Duration.ofMinutes(2)) so an idle source partition's watermark is excluded from the min calculation after the timeout.
Bonus: with parallelism > partitions, multiple source subtasks share partitions but the math holds — each subtask emits one watermark; downstream takes min.
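The min-with-idleness math can be sketched directly (a toy model, not Flink's API):

```python
def effective_watermark(partition_watermarks, idle):
    """Downstream watermark = min over active partitions; idle partitions are
    excluded (what withIdleness achieves after its timeout)."""
    active = [wm for p, wm in partition_watermarks.items() if p not in idle]
    return min(active) if active else None

wms = {"p0": 1000, "p1": 950, "p2": 20}  # p2 hasn't seen events in a while
print(effective_watermark(wms, idle=set()))   # → 20 (stalled by p2)
print(effective_watermark(wms, idle={"p2"}))  # → 950 (p2 marked idle)
```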
Internals: Why does my EXACTLY_ONCE Kafka producer still produce duplicates downstream?
What they're testing: layered understanding.
Answer:
- Are downstream consumers reading committed data only? They need isolation.level=read_committed. The default is read_uncommitted → they read aborted transactions too.
- Is the sink idempotent on its own key? EOS on the producer side prevents one Kafka message becoming two. But if the sink's key is (user_id, ts) and your processing produces non-deterministic timestamps on retry, you get two distinct rows downstream.
- Is your transformation deterministic? Replay from a checkpoint must produce identical outputs to the prior run, otherwise the sink sees both versions.
- Are there multiple producers writing the same key? Two upstream pipelines independently writing the same business event to the topic → two messages → two downstream rows. EOS doesn't dedupe across producers.
Modeling: Star schema vs OBT — when each?
See chapter 01, section 11.
- Star schema: governance, multi-consumer, evolving dimensions, conformed dimensions across many facts. Snowflake/BigQuery execution prefers it. Most warehouses.
- OBT (One Big Table): feature engineering for ML, ad-hoc exploration where re-joins would dominate cost, low-cardinality dimensions stable enough to denormalize. Modern columnar warehouses with run-length encoding store denormalized cheaply.
Heuristic: keep dimensions normalized (star) for the warehouse; materialize OBTs as gold-layer marts for specific high-traffic use cases.
Modeling: SCD2 or SCD1 for this dimension?
What they're testing: requirement-driven design.
Answer pattern: ask the question back. SCD2 if any downstream use case needs to know the value at a past point in time (audits, point-in-time joins for ML, retroactive cohort analysis). SCD1 if "current value only" is fine.
Common SCD2 cases: subscription plan, account country, billing address, organization ownership. Common SCD1 cases: name spelling fixes, internal IDs that don't change semantically, fields where history is held in another system.
Cost trade-off: SCD2 multiplies row count by churn rate. For low-churn dims it's free; for high-churn (5% per day) it gets large.
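A quick way to size the churn cost (back-of-envelope sketch; `scd2_rows` is a hypothetical helper approximating one open row per entity plus one closed row per change):

```python
def scd2_rows(n_entities, daily_churn, days):
    """Expected SCD2 row count: one current row per entity, plus roughly
    n_entities * daily_churn * days closed historical rows."""
    return round(n_entities * (1 + daily_churn * days))

print(scd2_rows(1_000_000, 0.0005, 365))  # low churn: ≈ 1.18M rows — cheap
print(scd2_rows(1_000_000, 0.05, 365))    # 5%/day churn: ≈ 19.25M rows
```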
Modeling: Persist this Silver model or rebuild from Bronze?
What they're testing: cost-vs-correctness reasoning.
Persist if:
- Downstream pipelines depend on it daily (recompute cost > storage cost).
- Source data is volatile and you need stable history.
- The transformation is expensive (joins, windowing, dedup at scale).
Rebuild if:
- The transformation is cheap and the source is small.
- You need to evolve the logic frequently (each silver state is "as of last run" — re-running is the only way to get the latest logic).
In practice: persist Silver, but also persist Bronze (raw ingestion, immutable). When logic changes, backfill Silver from Bronze.
Modeling: Partition by user_id or by date?
What they're testing: partition design intuition.
By date if queries filter by date (most analytical queries do). By user_id if queries are point-lookups by user (operational). Both with primary partition by date and clustering / Z-ORDER by user_id — the modern lakehouse default.
Avoid high-cardinality columns as partition: 10M users × 365 days = 3.65B partitions = catastrophic. Partition is for ~100s to ~10000s of values; clustering/sorting/bucketing for higher.
Modeling: Soft deletes in a fact table?
What they're testing: temporal modeling.
Facts are typically immutable — what happened, happened. "Soft delete" of a fact often signals a model issue: the user cancelled something, that's a new fact (the cancellation event), not a deletion of the original.
If you must mark a fact as voided, add is_voided BOOLEAN and voided_at TIMESTAMP. Downstream queries WHERE NOT is_voided. Don't actually DELETE — keep audit trail.
If GDPR-driven hard delete: lakehouse MERGE-ON-READ with deletion vectors, applied on a per-user basis, with retention controls.
Modeling: PM asked for "real-time" — what do you ask back?
What they're testing: requirement gathering.
Ask:
- What latency is acceptable: 1 second, 1 minute, 5 minutes, 1 hour? "Real-time" varies by user.
- What's the use case: dashboard refresh, ML feature, alerting, operational decision?
- How frequently is the data viewed? (Refreshing a dashboard every 30s for a metric viewed once a day is wasteful.)
- What's the cost budget? Sub-second is 10–100× more expensive than 5-minute.
- What's the quality bar? Approximate or exact? Does late data need retraction?
Most "real-time" requests turn out to be "5–15 minutes is fine" once you ask, and that's a vastly different system.
Quality: How do you test a Spark transformation?
See chapter 06, section 10. Three layers:
- Pure-function unit tests on transformations decomposed from the DataFrame logic.
- End-to-end PySpark tests with chispa on small fixture DataFrames.
- Integration tests against a temp Iceberg table on local filesystem.
Plus dbt-style data tests on the actual table in production (uniqueness, nulls, referential integrity).
Quality: How do you backfill safely?
See chapter 02, section 6. Crisp principles:
- Idempotent overwrite per partition: deterministic so reruns are safe.
- Bounded parallelism: don't reprocess 365 days at once on a 10-node cluster.
- Staging table first: write backfilled data to a staging table; validate row counts, key uniqueness, sums; then atomic swap into prod.
- Audit the diff: compare backfill output vs current prod for each partition; require sign-off if delta > threshold.
- Notify downstream: backfill changes data; downstream caches must invalidate.
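The staging-then-swap validation can be sketched as a simple gate (illustrative names and thresholds):

```python
def validate_staging(staging, prod, key, max_delta=0.05):
    """Gate the atomic swap: keys must be unique in staging, and the row-count
    delta vs current prod must be within the threshold.
    staging/prod: lists of row dicts; key: primary-key column name."""
    keys = [r[key] for r in staging]
    checks = {
        "unique_keys": len(keys) == len(set(keys)),
        "row_delta_ok": not prod
            or abs(len(staging) - len(prod)) / len(prod) <= max_delta,
    }
    return all(checks.values()), checks

ok, detail = validate_staging(
    staging=[{"id": 1}, {"id": 2}, {"id": 3}],
    prod=[{"id": 1}, {"id": 2}, {"id": 3}],
    key="id",
)
print(ok, detail)  # → True {'unique_keys': True, 'row_delta_ok': True}
```

In practice the same gate would also compare sums of key measures per partition; row counts alone miss value-level regressions.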
Quality: How do you design a data contract?
See chapter 01, section 13. Components:
- Schema (columns, types, nullability, constraints).
- Semantics (what each column means, units, valid range).
- Freshness SLA (data is < N hours old).
- Volume range (rows/day expected; alert outside ±X%).
- Owner & on-call.
- Change process (versioned; consumers notified before breaking changes).
- Tests (run on each ingestion; halt downstream if broken).
Stored as YAML in the producer's repo; checked in CI; surfaced in the data catalog.
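A minimal sketch of enforcing one clause of such a contract (the contract shape and names are illustrative, mirroring the components listed above):

```python
contract = {  # illustrative contract, mirroring the YAML shape described above
    "table": "silver.users",
    "columns": {
        "user_id": {"type": "bigint", "nullable": False},
        "country": {"type": "string", "nullable": True},
    },
    "freshness_hours": 6,
    "rows_per_day": {"min": 900_000, "max": 1_100_000},
    "owner": "identity-team",
}

def check_volume(contract, rows_today):
    """Volume-range clause: alert (and halt downstream) outside the band."""
    lo, hi = contract["rows_per_day"]["min"], contract["rows_per_day"]["max"]
    return lo <= rows_today <= hi

print(check_volume(contract, 950_000))  # → True: normal day
print(check_volume(contract, 400_000))  # → False: halt downstream, page owner
```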
Quality: How do you measure pipeline quality?
Six dimensions:
- Freshness: time since last successful run. SLO measurable.
- Completeness: row count vs expected; missing partitions.
- Accuracy: spot-checks against a system of record; statistical anomaly detection.
- Consistency: same metric computed two ways agrees within tolerance.
- Validity: passes schema and referential integrity tests.
- Uniqueness: primary keys are unique; no duplicate facts.
Surface as an observability layer (Monte Carlo, Datafold, custom). Alert on regressions.
Quality: Schema evolution without breaking consumers?
Strategies:
- Add only: new columns are non-breaking; consumers without them ignore.
- Deprecate, don't drop: keep old columns with NULL or last-known value during deprecation window; remove only after consumers migrated.
- Renames via copy: add new column populated from old; deprecate old.
- Type changes: only widening (int → bigint, decimal precision up); other changes need explicit consumer support.
- Versioned tables / views: silver.fact_playback_v2 as a new table, redirect consumers gradually.
For Kafka topics: Schema Registry compatibility checks (BACKWARD, FORWARD, FULL); reject incompatible producer changes.
Quality: CI/CD for a data warehouse?
Pattern:
- Source control (git) for SQL models, transformations, DAG definitions, schemas.
- PR pipeline:
- Lint (sqlfluff, ruff).
- Type check (mypy on Python; dbt parse on SQL).
- Unit tests.
- dbt build on a sample / dev database.
- dbt test on dev.
- Deploy to staging environment on merge to main.
- Smoke tests on staging.
- Promote to prod via versioned release.
- Observability post-deploy: row count check, freshness, anomaly detection.
For Spark / Flink jobs: containerize, deploy via terraform/k8s/argo; canary via shadow runs comparing outputs against current prod.
Trade-offs: Latency vs cost vs correctness — pick two
What they're testing: judgment.
The triangle:
- Latency + correctness: real-time + exact = expensive (high replication, transactional sinks, big clusters).
- Latency + cost: real-time + cheap = approximate (sketches, sampling, eventual consistency).
- Cost + correctness: cheap + exact = batch with reconciliation, T+24h.
You pick based on use case. Operational alerting → latency + correctness. Marketing dashboard → cost + correctness. Personalization features → latency + cost (ish).
Trade-offs: Row-store for analytics?
What they're testing: knowing the rule and its exceptions.
Rule: columnar for analytics. Exceptions:
- Point-lookup heavy workloads (single-row, single-column reads).
- Operational analytics on small tables (< 1M rows) where columnar overhead doesn't pay off.
- HTAP (hybrid transactional/analytical) systems where the same row is updated AND analyzed (Singlestore, Postgres with extensions).
For 99% of pure analytics: columnar.
Trade-offs: When NOT to use Iceberg?
- Single-engine, single-region, no concurrent writers, no schema evolution → plain Parquet on S3 + Hive Metastore is fine.
- Tables < 1 GB, few writes → not worth metadata overhead.
- Streaming-heavy with sub-minute commits and no compaction strategy → Iceberg metadata explodes.
- Compute fully on a managed warehouse (Snowflake native tables) where you don't need open format.
For most other cases: use Iceberg.
Trade-offs: Lambda architecture in 2026?
What they're testing: knowing why Kappa won mostly, but Lambda still has corners.
Kappa (single streaming pipeline) is the default now: simpler, less code, single source of truth. Use Kappa when:
- Streaming engine handles batch via the same code (Flink, Spark Streaming).
- You're OK with "real-time + late corrections via reprocessing".
Lambda (separate batch + streaming) still justified when:
- Batch is the audit-grade source; streaming is a real-time approximation explicitly labeled.
- Different teams own batch vs streaming with different SLAs.
- The streaming engine can't express the batch-quality computation (rare today).
Trade-offs: Polars or Spark — when each?
See chapter 06, section 4.
- Polars: fits in memory of one machine (incl. its streaming engine for spill), local notebook, ad-hoc work, ETL where Spark cluster startup is overkill.
- Spark: > one machine of data, distributed shuffle/join, integration with Iceberg/Delta at scale, multi-tenant clusters.
Heuristic: if your data fits on a r6i.16xlarge (512 GB RAM), Polars wins on speed and cost. Above that, Spark.
Trade-offs: Build vs buy?
| Layer | Build | Buy |
|---|---|---|
| Orchestration | Airflow (open) | Astronomer, Prefect Cloud, Dagster Cloud |
| Lineage | OpenLineage + custom | Datafold, Atlan, Alation |
| Quality | Great Expectations / Soda + custom alerts | Monte Carlo, Bigeye, Anomalo |
| Catalog | DataHub (open) | Atlan, Alation, Collibra |
| Lakehouse compute | EMR / OSS Spark | Databricks, Snowflake |
| Streaming | Flink on K8s | Confluent Cloud, Decodable, Aiven |
| BI | Superset (open), Lightdash | Looker, Tableau, Power BI, Mode |
Heuristic: buy what's commodity (BI, catalog, lineage), build what's differentiated (custom quality rules, domain transformations). Don't build orchestration.
Behavioural: A time you said no to a stakeholder
What they're testing: judgment, communication, ownership.
Structure (STAR):
- Situation: PM wanted real-time engagement metrics in 1 minute latency for a feature launch.
- Task: I was the DE owner; engineering had to deliver within 4 weeks.
- Action: I quantified the cost (3× current pipeline budget) and complexity (Flink job, new infra, on-call burden), and presented an alternative — 5-minute latency at 1.2× cost. Worked with the PM to confirm that 5-minute was actually acceptable for the use case.
- Result: Shipped at 5-minute latency in 2 weeks, freed the team for higher-priority work, set a precedent for trading latency for cost transparently.
Avoid: cynical "no", or "yes" without surfacing trade-offs.
Behavioural: A pipeline you owned caused a bad metric
What they're testing: ownership, blamelessness, root-cause discipline.
Structure:
- The bug (be specific and technical).
- How it was found (you, monitoring, downstream).
- Immediate response (timeline of mitigation, communication to stakeholders).
- Root cause analysis (5-whys; structural, not personal).
- The fix (immediate + systemic).
- Prevention (test, monitor, contract).
Don't blame teammates, vendors, or upstream. Own it.
Behavioural: How do you decide what NOT to build?
- Will it deliver measurable user value?
- Is the team's time better spent elsewhere?
- Can we buy/use OSS instead?
- Is the demand stable or speculative?
- What's the maintenance cost over 3 years?
Decide explicitly. Document the decision and the alternatives. Revisit when context changes.
Behavioural: How do you onboard the next engineer?
- Day 1: their dev env runs the smallest end-to-end pipeline.
- Week 1: they own a small change behind a flag, reviewed thoroughly.
- Month 1: they own a non-critical pipeline end-to-end, including on-call.
- Month 3: they're contributing to architecture discussions.
Pair on incidents. Have them write/update the runbook. Make them speak in design reviews early.
Closing meta-advice
For the actual interview:
- Slow down before answering. Restate the question to confirm you got it right.
- Ask 1–3 clarifying questions before diving in. Don't interrogate, but don't assume.
- Talk through trade-offs, not just "the right answer". L5 is judgment.
- Use real numbers (data volume, latency, cost) — concrete is more impressive than abstract.
- Go deep when asked, broad when asked. Read the cue.
- It's OK to say "I don't know" — and immediately follow with "but here's how I'd find out".
- End with a sentence about what you'd do next — what you'd validate, monitor, document.
That's how senior engineers think on their feet. Practice it out loud. Good luck.
Observability & Data Quality Scenarios
Design a data-quality monitoring system
Prompt: our data team has 400 tables across bronze/silver/gold. Design a monitoring system that catches data quality issues before stakeholders do.
Strong answer structure:
- Four dimensions per table: freshness (last successful write), volume (rows in last load vs expected), schema (column count, type drift), values (NULL rate, uniqueness on business key, range sanity on numeric columns).
- Thresholds are learned, not hand-set. Compute a 30-day rolling baseline per metric per table; alert on deviations beyond 3 sigma or engineered thresholds. Static thresholds create noise.
- Alert routing follows ownership. Each table has a designated owner; alerts go to that owner's channel, not a global firehose. A global firehose gets muted within a week.
- Severity tiers. Tier 1 (pager): gold tables backing exec dashboards, revenue reports. Tier 2 (on-call): silver tables with downstream consumers. Tier 3 (daily digest): bronze + low-traffic tables.
- Prevention, not just detection. Data contracts at the bronze→silver boundary with enforced schemas. Column-level lineage so a schema change in bronze surfaces to impacted silver/gold owners via PR comment or Slack bot.
Senior signal: name the tension. Strict monitoring creates alert fatigue and slows deploys. Lax monitoring lets issues reach stakeholders. The cheap middle path is: strict on the top 20 business-critical tables, advisory on the rest. Revisit the ratio quarterly.
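The learned-threshold idea reduces to a rolling baseline plus a sigma test; a hedged sketch (function names and the 3-sigma default are illustrative):

```python
from statistics import mean, stdev

def anomalous(history, today, k=3.0):
    """Alert if today's metric deviates more than k sigma from the
    rolling baseline (e.g. the last 30 daily row counts)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > k * sigma

baseline = [100_000, 102_000, 98_000, 101_000, 99_000, 100_500, 99_500]
print(anomalous(baseline, 100_800))  # normal day → False
print(anomalous(baseline, 40_000))   # half the rows missing → True
```

Real systems layer seasonality on top (weekday vs weekend baselines); the sigma test is the core.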
How do you design freshness SLAs?
Freshness SLA ≠ pipeline SLA. Freshness is end-to-end: event emission → table queryable. Pipeline SLA is internal: job start → job success. Conflating them misses the retry tail, the watermark, and the downstream materialization lag.
A proper freshness SLA has four parameters: target (95% of data available within 2 hours of event time), measurement window (rolling 7 days), error budget (5% = 8.4 hours/week can miss target), breach consequence (page on-call when error budget hits 50%). Without all four, an SLA is a wish.
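The error-budget arithmetic is worth having cold; a sketch of the numbers quoted above:

```python
def error_budget_hours(target, window_hours=7 * 24):
    """Hours per measurement window allowed to miss the freshness target."""
    return (1 - target) * window_hours

budget = error_budget_hours(0.95)  # 5% of a 168-hour week
print(round(budget, 1))            # → 8.4 hours/week can miss target
print(round(budget * 0.5, 1))      # → 4.2 hours burned triggers the page
```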
A dashboard shows zero for a KPI that should be 10K
Investigation sequence: (1) is the source pipeline green? (2) is the underlying table populated? (3) is the query filter correctly matching timezone, date bucket, and filter predicates? (4) has the downstream aggregation job run? Most often the cause is at (3) — an upstream deploy changed a column's casing, enum value, or timezone, and the query now matches zero rows silently. The fix at the system level is: schema tests for enum values and a row-count floor check on every aggregation output.
Cloud-Specific Scenarios
Your Snowflake bill tripled last month. What do you check first?
- Query history by warehouse. Which warehouse's credit consumption grew? Sort queries by credit spent; find the top 10.
- Warehouse utilization. Is a warehouse set too large for its workload? Oversized warehouses consume credits on idle time if AUTO_SUSPEND is too long.
- Storage growth. Did a table grow 5x because of a bad MERGE that didn't vacuum? Run TABLE_STORAGE_METRICS.
- Serverless features. Snowpipe, Automatic Clustering, Materialized Views — each has its own credit line that doesn't appear in warehouse consumption. Check AUTOMATIC_CLUSTERING_HISTORY.
- New users or external sharing. A new BI tool integration pulling hourly extracts can double compute silently.
Senior answer: after finding the driver, propose a tag-based cost allocation so the next month's surprise lives inside a named team's budget. "Free compute" disguises real cost.
Your BigQuery slots are exhausted during business hours. What's the playbook?
Three levers. (1) Reservations + assignments: carve slots across workloads so production-critical jobs are insulated from ad-hoc analyst queries. (2) Query tuning: convert full-table scans on wide tables to partitioned/clustered scans; the top three wide-table scans usually account for most of the slot time. (3) Scheduling: move nightly batches out of the 9am–5pm window; use flex slots with off-peak pricing. The mistake is to buy more slots before tuning the top 20 offenders — expensive and doesn't fix the underlying sprawl.
S3 list-objects is slow on a 10 M object prefix
The prefix has become a scan hotspot because listing is O(objects under prefix). Three mitigations: (a) partition by date as a prefix structure (/2026/04/20/), so listing only touches the day; (b) use Iceberg or Delta — they replace list-based discovery with metadata files and scan O(snapshot size), not O(files); (c) S3 Inventory exports a daily manifest of all objects; use it for bulk reconciliation instead of live listing.
ML & Feature Store Scenarios
Design a feature store with online/offline parity
The parity requirement: every feature value used to train a model must be identically computable at serving time. Two architecture patterns.
Pattern A — dual-write from a single computation. A feature pipeline writes to both the offline store (Iceberg / BigQuery) and the online store (Redis / DynamoDB) from the same code path. Parity by construction. Cost: the feature pipeline must run in both batch and stream modes correctly.
Pattern B — online store derived from offline. Compute features in batch into the offline store; periodically load into online. Simpler; online store lags offline by the refresh interval. Acceptable when model serving tolerates stale-by-N-minutes features.
Anti-pattern: two separate pipelines (one batch for training, one streaming for serving) computing "the same" feature with different code. This produces silent drift and model degradation that is nearly impossible to debug.
Your model's offline accuracy is 92% but online performance is 84%. What happened?
Systematic checklist: (1) Feature drift: distribution of input features at serving differs from training. Measure with a statistical test on production vs. training features per day. (2) Training-serving skew: features computed differently in the two environments. Test with a canary — run the same row through both pipelines, compare bit-for-bit. (3) Label leakage in training: the training label was accidentally derivable from a feature that isn't available at serving time. Audit feature timestamps against label timestamps. (4) Target distribution shift: the thing you're predicting has changed since training. Retrain. (5) Infrastructure bugs: online feature lookup returning stale or default values for a subset of users.
The order matters. Start with (5) — fastest to rule in/out, and a surprising fraction of "model issues" are actually serving infra bugs.
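For step (1), one concrete choice of statistical test is the Population Stability Index over binned feature proportions — a pure-Python sketch (PSI is one option among several; the thresholds are a common rule of thumb, not a standard):

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.
    Rule of thumb: < 0.1 stable, 0.1–0.25 moderate drift, > 0.25 significant."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

train = [0.25, 0.25, 0.25, 0.25]      # training-time feature distribution
serve_ok = [0.24, 0.26, 0.25, 0.25]   # serving looks similar
serve_bad = [0.05, 0.10, 0.25, 0.60]  # heavy drift
print(round(psi(train, serve_ok), 4))   # well under 0.1 → stable
print(round(psi(train, serve_bad), 4))  # well over 0.25 → investigate
```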
How would you retrain a production model without breaking serving?
Four stages. (a) Shadow training: new model trains in parallel, scores the same production traffic as the current champion, logs go to an evaluation table. (b) Offline comparison: compute business metrics from the logged shadow decisions vs. champion decisions. (c) Canary rollout: 1% of traffic to the new model; watch key metrics for 24–48 hours. (d) Ramp: 10%, 50%, 100%, with automated rollback if metrics regress. Each stage must have a rollback trigger and a named decision-maker. The biggest failure mode is a candidate who describes this at a high level but can't name the metrics or thresholds that govern each stage gate.
The Prep Program
The previous parts taught you the territory. This one teaches you how to move through it under fire. Interviews are not knowledge tests — they are a performance discipline, and this is the practice plan for that performance.
1. The Four-Week Roadmap
Four weeks is the minimum a working engineer needs to credibly prepare for a senior loop at a top-tier company. Less than that and either the loop itself is easy or you are relying on accumulated muscle memory. The roadmap below assumes ~10 focused hours per week (two weekday evenings + one weekend block). Adjust duration, not phasing.
The single most common preparation mistake is studying more topics instead of fewer topics deeper, with reps. You cannot talk your way through system design if you've never said the words out loud.
Week 1 — Foundations: SQL and Modeling
Goal at end of week: you can write window-function-heavy SQL under time pressure without pausing to look up syntax, and you can sketch a star schema for any business domain in under 10 minutes on a whiteboard.
Day 1 — SQL joins and set operations
- Re-derive when LEFT JOIN and NOT EXISTS give different answers (NULL semantics, duplicate rows). Write both for the same problem.
- Practice: five problems at medium difficulty from a public SQL practice set. Time yourself to 12 minutes each.
- End-of-day drill: explain out loud why COUNT(*) vs COUNT(col) can differ and under what exact conditions.
Day 2 — Window functions
- Rebuild the mental model: frame clauses, partitioning, ordering, RANGE vs ROWS. Know the default frame for each window function by heart.
- Practice: month-over-month revenue with sparse months, top-N per group with ties, running totals with resets, gaps and islands for sessionization.
- End-of-day drill: write percentile_cont by hand using NTILE and interpolation.
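For the percentile drill, a plain-Python reference implementation of the linear-interpolation math behind PERCENTILE_CONT is useful for checking your SQL against (the helper name is illustrative):

```python
def percentile_cont(values, p):
    """Linear-interpolation percentile, the math behind SQL PERCENTILE_CONT."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)  # fractional rank into the sorted list
    lo = int(pos)
    frac = pos - lo
    if lo + 1 >= len(xs):
        return float(xs[lo])
    # interpolate between the two neighboring sorted values
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])

print(percentile_cont([10, 20, 30, 40], 0.5))   # → 25.0 (median of even set)
print(percentile_cont([10, 20, 30, 40], 0.25))  # → 17.5
```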
Day 3 — Advanced SQL patterns
- Cohort retention (signup-week × activity-week matrix). Sessionization with timeout. Last-touch attribution. Point-in-time joins (as-of).
- Practice the canonical cohort retention query end-to-end, starting from raw events.
- End-of-day drill: explain the difference between point-in-time correctness and snapshot correctness in 90 seconds.
Day 4 — Dimensional modeling
- Draw a star schema for three domains from memory: subscription streaming service, two-sided marketplace, advertising network. For each, state the grain of each fact.
- Write DDL and SCD Type-2 MERGE for one customer dimension. Include effective-from / effective-to / current-flag columns.
- End-of-day drill: state in one sentence what conformed dimensions are and why they matter.
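Before writing the MERGE in SQL, it helps to internalise the Type-2 state machine itself. A minimal Python-over-sqlite3 sketch (table and column names here are illustrative, not a prescribed layout): on a changed attribute, expire the current version and open a new one.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer(
    customer_id INTEGER, name TEXT,
    effective_from TEXT, effective_to TEXT, is_current INTEGER)""")

HIGH_DATE = "9999-12-31"   # conventional open-ended effective_to sentinel

def scd2_upsert(con, customer_id, name, as_of):
    """Close the current version on change and open a new one (SCD Type-2)."""
    row = con.execute(
        "SELECT name FROM dim_customer WHERE customer_id = ? AND is_current = 1",
        (customer_id,)).fetchone()
    if row is not None and row[0] == name:
        return  # no attribute change: no new version
    if row is not None:
        # expire the old version as of the change date
        con.execute(
            "UPDATE dim_customer SET effective_to = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1", (as_of, customer_id))
    con.execute("INSERT INTO dim_customer VALUES (?, ?, ?, ?, 1)",
                (customer_id, name, as_of, HIGH_DATE))

scd2_upsert(con, 42, "Acme", "2024-01-01")
scd2_upsert(con, 42, "Acme Corp", "2024-06-01")   # rename opens version 2
history = con.execute(
    "SELECT name, effective_from, effective_to, is_current "
    "FROM dim_customer ORDER BY effective_from").fetchall()
# [('Acme', '2024-01-01', '2024-06-01', 0),
#  ('Acme Corp', '2024-06-01', '9999-12-31', 1)]
```

The lakehouse MERGE version expresses exactly this: one WHEN MATCHED branch to expire, one insert path to open the new version.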
Day 5 — Contracts and NULL semantics
- Re-read Part 01 §12–13. Write a schema contract in the format your target company uses (YAML or JSON Schema).
- Draft a "contract violation" runbook: what the producer does, what the consumer does, where it breaks the pipeline.
- End-of-day drill: pick one table you know well and enumerate every column that could be NULL and what NULL actually means for each (unknown, not-applicable, pending, sentinel).
Weekend block — SQL mock round
- Two hours. Six problems in a notebook. No internet, no autocomplete. Time each to 15 minutes max. Record a screencast of yourself speaking through problem 4 or 5.
- Review the tape: where did you pause? Where did you talk in circles? Where did you jump to a query before stating assumptions? Those are the gaps.
Week 2 — Processing Engines and Distributed Compute
Goal at end of week: you can explain how a Spark job is physically executed from df.join().groupBy().agg().write() down to tasks and files, and you can talk about streaming guarantees without handwaving "exactly once."
Day 6 — Spark from the top down
- Catalyst rule passes. Logical → optimized → physical → RDD. AQE and what it rewrites at runtime.
- Drill: take a 5-line PySpark script and narrate every stage boundary, every shuffle, every file the executor will write.
Day 7 — Shuffle, joins, and skew
- Broadcast vs sort-merge vs shuffle-hash. The exact threshold math. When AQE converts SMJ to BHJ mid-job.
- Skew detection and the salt trick. What "skew join" in AQE actually does under the hood.
- Drill: you have a 10 TB fact joined to a 50 GB dim and jobs time out. Walk through your debugging sequence — what do you look at first and why?
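The threshold math reduces to a small decision rule. This is a deliberately simplified model of the Spark planner's equi-join choice (the 10 MB default mirrors spark.sql.autoBroadcastJoinThreshold; real planning also weighs statistics quality, hints, and AQE's runtime re-optimisation):

```python
def pick_join_strategy(small_side_bytes,
                       broadcast_threshold=10 * 1024 * 1024,
                       prefer_sort_merge=True):
    """Simplified sketch of Spark's equi-join strategy choice."""
    if small_side_bytes <= broadcast_threshold:
        return "broadcast-hash"   # ship the small side to every executor
    if prefer_sort_merge:
        return "sort-merge"       # default when both sides are large
    return "shuffle-hash"         # spark.sql.join.preferSortMergeJoin=false

print(pick_join_strategy(5 * 1024 * 1024))    # broadcast-hash
print(pick_join_strategy(50 * 1024**3))       # sort-merge
```

In the drill above, the 50 GB dim is far over any sane broadcast threshold, which is exactly why "just broadcast it" times out and the conversation moves to skew and bucketing.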
Day 8 — Batch patterns and idempotency
- MERGE semantics on lakehouse tables. The "idempotent over the same interval" proof. Backfill strategies.
- Drill: describe, in order, what happens when you re-run yesterday's job at noon today. Which rows are touched, which partitions are rewritten, how downstream caches invalidate.
Day 9 — Streaming fundamentals
- Event-time vs processing-time. Watermarks. Windows. Late data. Trigger semantics.
- Exactly-once in Kafka + Flink: the two-phase commit sink and what can still go wrong.
- Drill: explain the difference between "no duplicates in output" and "exactly-once processing" in one minute.
Day 10 — Streaming patterns
- Stream-stream joins with watermarks. Stream-table joins. Temporal joins. Enrichment patterns.
- Backpressure, state backends, checkpointing, recovery time objectives.
- Drill: the job restarts and reads 30 minutes of state — what guarantees does the consumer downstream still have?
Weekend block — Engine mock round
- Pick one prompt from Part 08 that involves Spark or streaming. Stand at a whiteboard or a blank page. Talk for 45 minutes out loud. Record it.
- Re-watch. Count filler words. Count how many times you said "it depends" without stating the axis.
Week 3 — System Design, Lakehouse, and Tradeoffs
Goal at end of week: you can scope, decompose, and defend a 45-minute data system design for any prompt, including volumes, latency, failure modes, and cost.
Day 11 — The system-design frame
- The five-step opener: restate, scope, volumes and SLAs, decompose, defend one pick.
- Draft a "volumes cheat sheet" for every scale you might claim — 1M/10M/100M/1B events per day, what each means for Kafka partitions, Spark cluster size, storage cost per month.
Day 12 — Lakehouse table formats in depth
- Iceberg vs Delta vs Hudi: snapshot layout, metadata tree, MERGE internals, compaction behavior, schema evolution rules.
- Catalog architectures: Hive, Glue, REST, Unity, Polaris. Which ones allow cross-engine writes and which don't.
- Drill: why does Iceberg's hidden partitioning avoid the "partition column drift" problem?
Day 13 — System design prompt 1
- "Design a pipeline that ingests 10 M events/day from a mobile app and serves both a near-real-time operational dashboard and a nightly revenue report."
- Solve on paper in 45 minutes. Use the transcript in §5 below as the grading rubric.
Day 14 — System design prompt 2
- "Migrate a 50 TB partitioned Hive table to Iceberg with zero downtime for consumers."
- Solve on paper in 45 minutes. Focus on the migration sequence, consumer cutover, and rollback.
Day 15 — Observability and data quality
- Data-SLA vs system-SLA. Freshness, completeness, validity, uniqueness. Where to emit metrics. When to fail the pipeline vs quarantine vs warn.
- Drill: design the alerting strategy for a pipeline where the input can legitimately drop to zero on some days (e.g., marketing campaigns). How do you avoid noise?
Weekend block — System-design mock round
- Pair with a peer if possible. They ask the prompt, you whiteboard for 45 minutes, they probe. No book, no laptop.
- If solo: pick a prompt, run a 45-minute self-tape, re-watch, grade yourself on the rubric in §8.
Week 4 — Mocks, Behavioral, and Polish
Goal at end of week: three full mock loops completed, eight STAR frames rehearsed, and a one-page "signal deck" of the projects you'll reference.
Day 16 — Behavioral foundations
- The eight STAR frames in §6. Write one story per frame, not more. Aim for 3–4 minutes spoken each.
- Record yourself telling two of them. Watch for mealy-mouthed passive voice. Rewrite.
Day 17 — Mock loop #1 (full)
- 45 min coding/SQL → 45 min Spark/streaming deep dive → 45 min system design → 30 min behavioral. Back to back. Lunch break OK.
- Post-mortem the gaps that same evening. Don't wait a day — decay is fast.
Day 18 — Weakness patch
- Whatever failed in Mock #1, spend today on. Don't study anything new.
Day 19 — Mock loop #2 (full)
- Same structure. Ideally different interviewer if you have one.
Day 20 — Resume drill-down prep
- For every project on your resume, prepare: scale (numbers), your contribution (verbs), the tradeoff you made, the failure mode you avoided, the thing you'd do differently now.
- Four bullets max per project. Practice delivering each in under 90 seconds.
Weekend block — Mock loop #3 + final polish
- Third and last full mock. Fix the one or two tics that have survived this long.
- Day before the real loop: no new study. Light review. Sleep.
Anti-Goals — What Not to Do
- Do not grind leetcode for a DE senior loop. You will be asked at most one lightly algorithmic problem in SQL or Python; topic breadth matters more than graph-traversal fluency.
- Do not read five blog posts per topic. Pick the one canonical source (a section of this guide, a chapter of a book, a primary doc page) and read it three times.
- Do not try to memorize command flags. Memorize the decisions the flags control.
- Do not skip the out-loud practice. Reading a solution and speaking a solution are different skills; the interview only tests the second.
2. How Candidates Lose Offers — The Failure Catalog
This section is uncomfortable on purpose. These are the specific failure modes that show up in post-debrief scorecards, not the abstract "communication issues" that everybody writes about. Most of them are behavioural, not technical — which is precisely why they're survivable if named.
A. Pre-flight and resume failures
A1 — The "resume says senior, answers say mid"
You claim on your resume that you "owned the streaming platform" but when the interviewer asks what Kafka version you ran, what your broker count was, what the retention policy was, and what happened the last time you ran out of disk — you freeze. Scorecard verdict: "Scope inflation, no depth behind the title."
Fix: for every line on the resume, pre-build a three-layer drill-down. Top line, two supporting paragraphs of numbers and verbs, one anecdote of a production incident you debugged.
A2 — The unexplained gap between title and output
Your last role was "Staff Data Engineer" but your project list reads like three-month feature work. Interviewers silently score you down for weak scope signal. You never recover.
Fix: lead with scope numbers — number of downstream consumers, SLAs owned, size of the footprint, org reach. If the numbers are modest, claim the title you can defend, not the title on the offer letter.
B. Coding and SQL round failures
B1 — Query-before-clarify
The prompt is "Find users whose average order value grew MoM." You start typing immediately. You don't ask whether "user" means account or visitor, whether "month" is calendar or rolling, whether deleted users count, how to handle months with zero orders. The interviewer silently writes down "jumped to implementation" and moves the bar up.
Fix: 90 seconds of clarification is never too much. Ask at least three questions before your cursor touches the query. Restate the problem back in your own words first.
B2 — The window-function shortcut
You know that ROW_NUMBER() OVER (PARTITION BY ...) is the trick for "top per group." You reach for it without thinking through whether ties matter, whether NULLs should be included, whether you want dense-rank instead. Your answer is close but subtly wrong on edge cases. Scorecard: "Knows the tools, hasn't internalized the semantics."
Fix: for any window function, say out loud: "the frame is X, ties resolve Y, NULLs sort Z." Every time. Even when obvious.
B3 — Silent scratch-paper math
You stop talking. You do mental math for 45 seconds. The interviewer has no signal on how you think. By the time you speak again they've already decided.
Fix: narrate. "I'm going to try ROW_NUMBER here — wait, ties matter for revenue buckets so RANK would over-count. Let me check…" Even wrong narration beats silent correctness in the first 20 minutes of the round.
B4 — The one-liner trap
You compress a three-CTE problem into a single deeply nested query because "it's more elegant." The interviewer cannot read it in real time and neither can you. You can't debug it when they ask a follow-up. Verdict: "Writes for themselves, not for review."
Fix: default to named CTEs with short, descriptive names. Elegance is for Twitter, not interview rounds.
C. Engine / systems deep-dive failures
C1 — "AQE handles it"
Asked how Spark handles skew, you say "AQE handles it automatically." Asked to go deeper — how does AQE detect skew, what config controls the threshold, what's the trade-off vs broadcast — you have nothing. Verdict: "Knows the keyword, not the mechanism."
Fix: for every feature you name, know the three layers beneath it. Configuration, algorithm, trade-off. If you can't describe all three, don't volunteer the feature.
C2 — Exactly-once handwaving
"We used Kafka exactly-once so there are no duplicates." The interviewer probes: what about the sink? What about application-level retries? What happens during a rebalance mid-commit? You handwave. Verdict: "Treats EOS as a buzzword."
Fix: memorize the two-phase commit sink diagram and the exact failure modes it does and does not cover. Also memorize the phrase "exactly-once effectively" and when it differs from "exactly-once processing."
C3 — Partitioning by hope
Your system design partitions a fact table by "user_id" because partitioning is good. You don't state the cardinality, don't state the access pattern, don't state the partition file count at six-month retention. The interviewer asks "how many files is that" and you guess wildly.
Fix: before you name a partition key, compute the file count out loud. Cardinality × retention / compaction target = files. If the answer is 10 million, change the key.
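One way to make that arithmetic concrete out loud (a toy calculation; the 128 MB compaction target and per-partition data volume are assumptions):

```python
import math

def partition_file_count(key_cardinality, retention_periods,
                         gb_per_partition=0.1, target_file_mb=128):
    """Rough file count for a scheme: partition count x files per partition."""
    partitions = key_cardinality * retention_periods
    files_per_partition = max(1, math.ceil(gb_per_partition * 1024 / target_file_mb))
    return partitions * files_per_partition

# Partitioning by raw user_id: 10 M users x 180 daily periods is hopeless.
print(partition_file_count(10_000_000, 180))   # 1800000000
# Partitioning by day + 64 user-id buckets stays manageable.
print(partition_file_count(64, 180))           # 11520
```

If the first number comes out of your mouth before the interviewer asks "how many files is that," you have flipped the failure mode into a signal.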
D. System design failures
D1 — Scope sprawl
The prompt is "design a pipeline for 10 M events/day." You spend 15 minutes on Kafka producer ack semantics and never get to storage, serving, or cost. Time runs out. Verdict: "Deep in one corner, blank elsewhere."
Fix: budget your 45 minutes. Roughly: 5 min scope, 5 min volumes/SLAs, 15 min happy path, 10 min failure modes, 5 min cost, 5 min trade-offs. Put it on your scratch paper before you start.
D2 — Solutioning before scoping
Interviewer: "Design a pipeline for 10 M events/day." You: "I'd use Kafka, then Spark Streaming, then Iceberg, then Looker." One minute in. You never asked what the events are, what the SLA is, who the consumer is, what the business question is.
Fix: five scoping questions before any box appears on the board. "What's the event schema? What are the downstream consumers? What's the freshness SLA? What's the volume distribution — steady or bursty? What's the retention and access pattern?"
D3 — The "golden path" trap
Your design works perfectly on the happy path. You never mention: what happens when Kafka is down, what happens when the transform fails on one record, what happens when the sink table is being vacuumed, what happens when a consumer is 6 hours behind. Verdict: "Designs for the demo."
Fix: reserve the second half of the round for failure modes. Treat "what happens when X breaks" as first-class content, not a footnote.
D4 — No cost awareness
Your design streams everything with millisecond latency when the business question is a daily dashboard. You triple-replicate data across warehouses "for safety." You never state the monthly cloud bill. Verdict: "Optimizes for the whiteboard, not the business."
Fix: always close the design with a one-line cost estimate — storage per TB, compute per hour, egress if relevant. Even order-of-magnitude is fine. Silence is not.
E. Behavioral failures
E1 — Hero mode
Every story is "I single-handedly…". The interviewer, who works on teams, silently down-scores collaboration. By round four, the hiring committee has "limited team signal" in bold.
Fix: 60/40. In every story, spend 60% on your specific actions and 40% on how you enabled others. Name them.
E2 — STAR drift
You start with Situation → Task, but by minute three you've slid into "and then generally we do X at our company." You lost the interviewer 90 seconds ago. They're waiting for the "R" — what happened, what did you learn.
Fix: write every story on a 4×1 index card. Situation (2 sentences). Task (1 sentence). Action (4 sentences). Result (2 sentences with numbers). Under 4 minutes spoken, always.
E3 — "I would have…"
Asked for a time you failed, you describe what you would have done differently. You never described the actual failure. Verdict: "Can't own an outcome."
Fix: the failure story starts with what actually broke, in one blunt sentence. "We shipped a backfill that doubled the revenue numbers on the executive dashboard for six hours." Then the rest.
E4 — Over-prepared monotone
You've rehearsed the story so many times it sounds canned. The interviewer senses it. Authenticity tanks. Verdict: "Felt like a script."
Fix: rehearse the structure, not the words. Pick three different phrasings of the opening line. Use whichever fits the conversational flow that day.
F. Closing-round and hiring-committee failures
F1 — The "why this company" gap
You've answered everything technically well. The bar raiser asks "why us specifically?" and you recite a line from the website. Verdict: "Strong IC, weak mission fit." That's enough for a committee to pass.
Fix: have three specific reasons ready, at least one of which references something from an engineering blog, an open-source commit, or a recent talk by someone there. Show you actually looked.
F2 — Underselling in the wrap
Asked "is there anything we should know that we haven't asked?" you say "no, I think we covered it." You just gave up your last 90 seconds of scoring time.
Fix: always have one prepared "closing signal" — a specific project they didn't ask about, or a failure recovery story that shows judgment. Rehearsed, short, positive.
F3 — Weak reverse questions
"Do you have questions for me?" You ask about work-from-home policy and team size. Those are HR questions. Engineering interviewers want to see what you'd actually want to know on day one.
Fix: three tiers of questions ready: (a) one about their specific team's current technical challenge, (b) one about how decisions get made when the team disagrees, (c) one about what the interviewer personally wishes was better.
G. The "almost-hires" pattern — why committees pass on strong candidates
The most painful losses are not the obvious fails. They're the loops where every round is "lean hire" but the committee still passes. Three patterns drive this:
- Narrow specialist signal. Every round showed the same skill. Committee can't tell if you'd be useful on a different problem next quarter. Fix: use the four rounds to show four different skills. If round 1 was SQL depth, round 2 should be operations judgment, round 3 should be scope negotiation, round 4 should be mentoring. You choose the stories.
- No "hell yes" round. Four "lean hires" is a statistical hire probability below 50% at most committees. You need one round where someone writes "strong hire, would fight for." Fix: pick the round you expect to do best in and prepare a signature answer — a story or a design move that is unusual enough to be memorable in debrief.
- Level ambiguity. Scorecards split between L4 and L5. The committee defaults down. Fix: in every round, demonstrate at least one behaviour from the level above — ambiguity tolerance in coding, org-level reasoning in design, mentoring framing in behavioural.
3. What Changes at Staff+ and Principal
The interview loop for senior (L5), staff (L6), and principal (L7) uses the same rounds — but the scorecard rubric shifts. Miscalibrating the level you're targeting is one of the most common reasons for "down-leveling" at offer time. What follows is the signal map.
The Level Matrix
| Axis | L4 — Mid | L5 — Senior | L6 — Staff | L7 — Principal |
|---|---|---|---|---|
| Scope | A task inside a feature | A feature end-to-end | A multi-team system | An org-wide technical direction |
| Ambiguity | Told what to do | Told what outcome to deliver | Defines the outcome | Defines which problems are worth solving |
| Tech judgment | Uses existing patterns | Picks the right pattern | Decides when to invent a pattern | Sets the patterns others will use for years |
| Influence | Influences own code reviews | Influences a team | Influences peer teams | Influences the org and sometimes the industry |
| Cost awareness | Aware of query cost | Owns pipeline cost | Owns team budget | Shapes the cost curve of the platform |
| Failure recovery | Fixes their bug | Owns the incident | Changes the system so this class of incident can't recur | Changes the org's practices around failure |
| Mentoring | Receives mentoring | Mentors juniors | Mentors seniors and raises the team's ceiling | Mentors staff engineers and managers |
| Writing | PR descriptions | Design docs for a feature | Strategy docs that reshape decisions | Technical vision papers that set roadmaps |
Staff (L6) Signals
At the staff level, interviews are no longer asking "can you build it" — they're asking "would you be the person others run a hard decision by?" Three signals carry most of the weight.
S1 — Scope negotiation
When given a prompt, staff candidates compress or expand the scope deliberately. Example: the prompt is "design a recommendation pipeline." A senior answers the prompt. A staff engineer says: "The interesting question here isn't the pipeline — it's how we version the feature store, because that's where the ML team and the infra team disagree. Let me scope the answer around that disagreement and mention the pipeline details as they come up."
That move — naming the real problem inside the prompt — is staff-level. It requires judgment about what is worth engineering time in the real world, not just technical skill.
S2 — Cross-team trade-off reasoning
Senior engineers answer "what's the right design." Staff engineers answer "what's the right design given what the adjacent team is doing, what the org's cost posture is, and what we're willing to deprecate in twelve months." Every technical answer has an org-level sentence.
S3 — Mentorship embedded in the story
In behaviourals, staff candidates naturally mention other engineers by role — "the senior on the team," "the junior who joined last quarter," "the staff eng on the adjacent team who had the opposite opinion." Their stories are never about themselves alone. A candidate whose every story uses "I" without ever naming who they worked with reads as individual-contributor strong, team-lead weak.
Principal (L7) Signals
At principal, the interview bar shifts again. The scorecard looks for evidence that you set technical direction across an organization, and that your framing changes how others think — not just what they build.
P1 — Problem selection
Given a system-design prompt, principal candidates often reframe the question. "The prompt asks how to serve 10 M events per day at 200 ms p99 latency. In my experience, at that scale the real question is whether we even need 200 ms — because every order of magnitude cheaper we can make the infrastructure unlocks a different set of products. Let me walk through three designs at three latency points and show the cost curve."
Principal engineers select which problems to solve, not just how to solve them.
P2 — Org-level influence without authority
Principal behaviourals feature stories where the candidate changed the behaviour of people who did not report to them and were not asked to. A principal engineer rarely describes winning an argument — they describe reframing the argument so the outcome was inevitable.
P3 — Narrative ownership
Principal candidates can answer "what's your three-year technical bet?" with specificity. They have a thesis — about where the stack is going, which abstractions are about to break, what the team should be investing in now to be ready. That thesis is testable and defensible.
Common Down-Leveling Traps
- L5 claiming L6. You work on a big team, so you claim staff. But every story is scoped to a feature, not a system. Fix: either claim L5 cleanly and ace it, or find two stories where you owned a cross-team decision, and rehearse them cold.
- L6 claiming L7. You've delivered multi-team systems, but you can't articulate a three-year bet. Fix: prepare a written one-page "technical narrative" before the loop. Even if it never comes up, writing it will sharpen every answer.
- L7 under-demonstrating. You're so deep in the problem that you forget to name the org-level framing. Fix: every answer opens with the framing sentence. "The interesting tension here is between X team's need for Y and Z team's need for W." Then dive in.
4. The SQL Question Bank — Senior Tier
These are the SQL problems that actually separate seniors from mid-level engineers in interview rounds. Each comes with a prompt, the schema, the expected output shape, a walkthrough of the thinking, and a reference solution. The difficulty target is the "hard tier" on public SQL practice platforms — multi-CTE, window-function-heavy, business-grounded, and with at least one non-obvious edge case.
Every solution is written in ANSI-style SQL that runs on Snowflake, BigQuery, Redshift, and Postgres with minor dialect tweaks. Spark SQL equivalents are noted where they diverge.
The fastest way to fail these problems in an interview is to type before you've restated the requirements out loud. Every solution below opens with "Let me restate what I'm solving."
Q1 — Cohort Retention Matrix
Prompt
For a subscription product, build the signup-month × activity-month retention matrix. Each cell should contain the percentage of users who signed up in the row month and were active in the column month. Include month 0 (100% by definition). Go up to month 12. The output should be sorted by signup month descending.
Schema
users(user_id BIGINT, signup_ts TIMESTAMP)
events(user_id BIGINT, event_ts TIMESTAMP, event_name STRING)
-- "active" = at least one event in that calendar month
Why this problem is hard
Three traps: (1) the month-diff must be computed carefully across year boundaries; (2) a naive join will double-count users who had multiple events in a month; (3) cohorts with zero retained users at month N must still appear — you can't filter them out with an inner join.
Walkthrough
- Build the cohort dimension: one row per user, with their signup month truncated to the first of the month.
- Build the activity dimension: distinct (user_id, activity_month) pairs from events.
- Join cohort → activity on user_id, compute month_number = month_diff(activity_month, signup_month).
- Aggregate: for each (signup_month, month_number), count distinct users. Divide by the cohort size to get the retention percentage.
- Pivot into the matrix shape — either with conditional aggregation or with your engine's PIVOT clause.
Reference solution
WITH cohort AS (
SELECT
user_id,
DATE_TRUNC('month', signup_ts) AS signup_month
FROM users
),
cohort_size AS (
SELECT signup_month, COUNT(*) AS cohort_n
FROM cohort
GROUP BY signup_month
),
activity AS (
SELECT DISTINCT
user_id,
DATE_TRUNC('month', event_ts) AS activity_month
FROM events
),
retention AS (
SELECT
c.signup_month,
DATEDIFF('month', c.signup_month, a.activity_month) AS month_n,
COUNT(DISTINCT a.user_id) AS retained
FROM cohort c
JOIN activity a USING (user_id)
WHERE a.activity_month >= c.signup_month
AND DATEDIFF('month', c.signup_month, a.activity_month) <= 12
GROUP BY 1, 2
)
SELECT
r.signup_month,
s.cohort_n,
-- COALESCE to 0 so cells with zero retained users show 0, not NULL (trap 3)
COALESCE(MAX(CASE WHEN month_n = 0 THEN retained END), 0) AS m0,
COALESCE(MAX(CASE WHEN month_n = 1 THEN retained END), 0) AS m1,
COALESCE(MAX(CASE WHEN month_n = 2 THEN retained END), 0) AS m2,
COALESCE(MAX(CASE WHEN month_n = 3 THEN retained END), 0) AS m3,
COALESCE(MAX(CASE WHEN month_n = 6 THEN retained END), 0) AS m6,
COALESCE(MAX(CASE WHEN month_n = 12 THEN retained END), 0) AS m12,
ROUND(100.0 * COALESCE(MAX(CASE WHEN month_n = 1 THEN retained END), 0) / s.cohort_n, 1) AS pct_m1,
ROUND(100.0 * COALESCE(MAX(CASE WHEN month_n = 3 THEN retained END), 0) / s.cohort_n, 1) AS pct_m3,
ROUND(100.0 * COALESCE(MAX(CASE WHEN month_n = 12 THEN retained END), 0) / s.cohort_n, 1) AS pct_m12
FROM retention r
JOIN cohort_size s USING (signup_month)
GROUP BY r.signup_month, s.cohort_n
ORDER BY r.signup_month DESC;
Follow-up probes the interviewer will ask
- "What if we want rolling 30-day retention instead of calendar-month retention?" → replace
DATE_TRUNC('month', …)with a bucket derived fromFLOOR(DATEDIFF('day', signup_ts, event_ts) / 30). - "What if some users have backdated signup timestamps because they were imported?" → add a validity filter on signup_ts and a separate "import" cohort.
- "How would you compute the confidence interval on each retention percentage?" → Wilson interval, or bootstrap, depending on cohort size.
Q2 — Sessionization with Idle Timeout
Prompt
You have raw event logs with user_id and event timestamps. Define a "session" as a contiguous sequence of events from the same user where no gap between consecutive events exceeds 30 minutes. For each session, output: user_id, session_id, session_start, session_end, duration_seconds, event_count. The session_id should be stable across re-runs for the same input.
Schema
events(user_id BIGINT, event_ts TIMESTAMP, event_name STRING)
Why this problem is hard
This is a "gaps and islands" problem in disguise. Two traps: (1) you need to compute the gap against the previous event per user, not globally; (2) the session_id must be deterministic — a naive ROW_NUMBER() changes across runs if ties exist. A typical junior solution uses a self-join which scans O(n²); the senior solution uses a single pass with LAG.
Walkthrough
- Partition by user, order by timestamp, compute LAG(event_ts) to get the previous event's timestamp.
- Flag a new session when the gap > 30 minutes or when there is no previous event (first row per user).
- Take a running SUM of the flag over the partition to assign monotonically increasing session numbers per user.
- Aggregate by (user_id, session_number) for the final output. Build a stable session_id by hashing user_id + session_start.
Reference solution
WITH e AS (
SELECT
user_id,
event_ts,
event_name,
LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_ts
FROM events
),
flagged AS (
SELECT
e.*,
CASE
WHEN prev_ts IS NULL THEN 1
WHEN DATEDIFF('second', prev_ts, event_ts) > 1800 THEN 1
ELSE 0
END AS new_session_flag
FROM e
),
numbered AS (
SELECT
f.*,
SUM(new_session_flag) OVER (
PARTITION BY user_id ORDER BY event_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_number
FROM flagged f
)
SELECT
user_id,
MD5(CAST(user_id AS STRING) || '_' || CAST(MIN(event_ts) AS STRING)) AS session_id,
MIN(event_ts) AS session_start,
MAX(event_ts) AS session_end,
DATEDIFF('second', MIN(event_ts), MAX(event_ts)) AS duration_seconds,
COUNT(*) AS event_count
FROM numbered
GROUP BY user_id, session_number
ORDER BY user_id, session_start;
Follow-up probes
- "How would you make this incremental?" → state the watermark: "I'd only reprocess sessions that extend into the new batch window, which means replaying the tail of each user's most recent session from a checkpoint."
- "What if two events share a timestamp?" → the
ORDER BYmust include a tiebreaker — typicallyevent_id— or sessions become non-deterministic across runs. - "Can this be done in Spark with Structured Streaming?" → yes, using
session_windowdirectly; walk through the watermark and state timeout semantics.
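The LAG / flag / running-sum pipeline above has a direct single-pass translation into plain Python, useful for checking the SQL's output on a small sample (the hash-based session_id is omitted for brevity):

```python
def sessionize(events, timeout_s=1800):
    """events: (user_id, ts_epoch) tuples, pre-sorted by (user_id, ts).
    Returns one (user_id, session_start, session_end, event_count) per session."""
    sessions = []
    prev_user, prev_ts = None, None
    for user_id, ts in events:
        # same flag logic as the SQL: first row per user, or gap > timeout
        if user_id != prev_user or ts - prev_ts > timeout_s:
            sessions.append([user_id, ts, ts, 1])
        else:
            sessions[-1][2] = ts     # extend session_end
            sessions[-1][3] += 1     # bump event_count
        prev_user, prev_ts = user_id, ts
    return [tuple(s) for s in sessions]

events = [(1, 0), (1, 100), (1, 2500), (2, 50)]   # 2500 - 100 > 1800: new session
print(sessionize(events))
# [(1, 0, 100, 2), (1, 2500, 2500, 1), (2, 50, 50, 1)]
```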
Q3 — Last-Touch Attribution With Lookback Window
Prompt
For every purchase, find the last marketing touch (channel + campaign) that occurred within the 7 days preceding the purchase, for the same user. If no touch exists in that window, attribute the purchase to "organic." Output: purchase_id, user_id, purchase_ts, purchase_amount, attribution_channel, attribution_campaign, days_between_touch_and_purchase.
Schema
purchases(purchase_id BIGINT, user_id BIGINT, purchase_ts TIMESTAMP, amount DECIMAL)
touches(user_id BIGINT, touch_ts TIMESTAMP, channel STRING, campaign STRING)
Why this problem is hard
Two traps: (1) this is a range join — every purchase potentially joins to many touches, and a careless query will multiply rows before the rank, producing wrong totals; (2) ties at the same timestamp must be broken deterministically. The correct approach ranks candidates inside a bounded sub-query rather than joining the full cross-product.
Walkthrough
- Join purchases to touches on user_id and the 7-day window. Compute the lag in seconds.
- Rank touches per purchase_id by touch_ts DESC (last-touch wins), with channel alphabetical as a deterministic tiebreaker.
- Keep only rank = 1. Left-join back to the full purchases table so unattributed rows show "organic."
Reference solution
WITH candidates AS (
SELECT
p.purchase_id,
p.user_id,
p.purchase_ts,
p.amount,
t.touch_ts,
t.channel,
t.campaign,
ROW_NUMBER() OVER (
PARTITION BY p.purchase_id
ORDER BY t.touch_ts DESC, t.channel ASC
) AS rn
FROM purchases p
LEFT JOIN touches t
ON t.user_id = p.user_id
AND t.touch_ts <= p.purchase_ts
AND t.touch_ts >= p.purchase_ts - INTERVAL '7 days'
)
SELECT
purchase_id,
user_id,
purchase_ts,
amount AS purchase_amount,
COALESCE(channel, 'organic') AS attribution_channel,
COALESCE(campaign, 'none') AS attribution_campaign,
CASE WHEN touch_ts IS NULL THEN NULL
ELSE DATEDIFF('day', touch_ts, purchase_ts)
END AS days_between_touch_and_purchase
FROM candidates
WHERE rn = 1
ORDER BY purchase_ts;
Follow-up probes
- "How would the query change for first-touch attribution?" → swap
ORDER BY t.touch_ts DESCto ASC, and consider expanding the window to the user's lifetime. - "How would you handle cross-device users?" → introduce an identity table and replace user_id with a resolved identity_id upstream.
- "What does the query cost at 1 B purchases and 10 B touches?" → range joins are expensive; discuss pre-filtering touches by date, using broadcast hash joins if touches fits, and partitioning by user_id on both sides.
Q4 — Month-over-Month Growth With Sparse Months
Prompt
Compute month-over-month revenue growth for every product, for every calendar month in the past 24 months. If a product had no sales in a given month, treat its revenue as zero (not NULL). Output: product_id, month, revenue, prev_month_revenue, mom_growth_pct.
Why this problem is hard
The sparse-months trap. A naive LAG() over the sales table will compare April to the previous row, which might be from February if March had no sales. The answer is wrong and doesn't flag itself. The fix is to generate a dense calendar (product × month) and left-join actual revenue onto it.
Reference solution
WITH months AS (
SELECT DATE_TRUNC('month', d) AS month
FROM GENERATE_SERIES(
DATE_TRUNC('month', CURRENT_DATE - INTERVAL '24 months'),
DATE_TRUNC('month', CURRENT_DATE),
INTERVAL '1 month'
) AS d
),
products AS (SELECT DISTINCT product_id FROM sales),
grid AS (
SELECT p.product_id, m.month
FROM products p CROSS JOIN months m
),
monthly AS (
SELECT
product_id,
DATE_TRUNC('month', sale_ts) AS month,
SUM(amount) AS revenue
FROM sales
GROUP BY 1, 2
),
filled AS (
SELECT
g.product_id,
g.month,
COALESCE(m.revenue, 0) AS revenue
FROM grid g
LEFT JOIN monthly m USING (product_id, month)
)
SELECT
product_id,
month,
revenue,
LAG(revenue) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_revenue,
CASE
WHEN LAG(revenue) OVER (PARTITION BY product_id ORDER BY month) IS NULL THEN NULL
WHEN LAG(revenue) OVER (PARTITION BY product_id ORDER BY month) = 0 AND revenue > 0 THEN 9999.0 -- sentinel: growth from a zero base is undefined
WHEN LAG(revenue) OVER (PARTITION BY product_id ORDER BY month) = 0 THEN 0
ELSE ROUND(
100.0 * (revenue - LAG(revenue) OVER (PARTITION BY product_id ORDER BY month))
/ LAG(revenue) OVER (PARTITION BY product_id ORDER BY month), 2
)
END AS mom_growth_pct
FROM filled
ORDER BY product_id, month;
Follow-up probes
- "How do you handle products that were launched mid-window?" → emit NULL rather than 0 for months before the product's first sale; use a
first_sale_monthCTE. - "The query is slow at 50 M rows. What do you change?" → pre-aggregate sales to daily first; partition-prune on sale_ts; materialize the monthly roll-up as a daily batch job.
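The dense-grid fix is easy to demonstrate in a few lines of Python. This sketch (a hypothetical `fill_months` helper, with months as plain strings for brevity) shows why the naive previous-row comparison is wrong: the grid restores the empty month, so March compares against February's zero instead of skipping back to January:

```python
def fill_months(monthly, months):
    """monthly: {(product, month): revenue}; months: ordered list of every calendar month.

    Returns rows of (product, month, revenue, prev_revenue, growth_pct),
    with sparse months filled as 0 — the moral equivalent of grid + LEFT JOIN + LAG.
    """
    products = {p for p, _ in monthly}
    rows = []
    for p in sorted(products):
        prev = None
        for m in months:
            rev = monthly.get((p, m), 0)  # sparse month -> 0, not skipped
            # growth undefined for the first month and for a zero base
            growth = None if prev in (None, 0) else round(100 * (rev - prev) / prev, 2)
            rows.append((p, m, rev, prev, growth))
            prev = rev
    return rows
```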
Q5 — Top-N Per Group With Ties and Pagination
Prompt
For each region, return the top 3 products by revenue in the current quarter. If two products tie on revenue, break the tie by product name alphabetically. The query must be correct even when a region has fewer than 3 products.
Why it's hard
The window function choice matters: ROW_NUMBER arbitrarily picks a winner on ties, RANK leaves gaps, DENSE_RANK doesn't. For "top 3 with ties resolved deterministically" you almost always want ROW_NUMBER with a deterministic tiebreaker inside the ORDER BY. That's what this question is really testing.
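The difference between the three functions is easiest to see on a small tied list. A minimal sketch (hypothetical `ranks` helper, ranking a list of revenues descending the way the engines would):

```python
def ranks(revenues):
    """Return (value, row_number, rank, dense_rank) per value, descending."""
    out, rank, dense = [], 0, 0
    prev = object()  # sentinel that equals nothing
    for i, r in enumerate(sorted(revenues, reverse=True), start=1):
        if r != prev:
            rank, dense, prev = i, dense + 1, r  # RANK jumps to row position; DENSE_RANK just increments
        out.append((r, i, rank, dense))
    return out
```

On `[100, 90, 90, 80]` this yields row numbers 1–4, ranks 1, 2, 2, 4 (RANK skips 3 after the tie), and dense ranks 1, 2, 2, 3 — which is exactly why "top 3" means different things under each function.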
Reference solution
WITH q AS (
SELECT
region,
product_id,
product_name,
SUM(amount) AS revenue
FROM sales
WHERE sale_ts >= DATE_TRUNC('quarter', CURRENT_DATE)
AND sale_ts < DATE_TRUNC('quarter', CURRENT_DATE + INTERVAL '3 months')
GROUP BY region, product_id, product_name
),
ranked AS (
SELECT
q.*,
ROW_NUMBER() OVER (
PARTITION BY region
ORDER BY revenue DESC, product_name ASC
) AS rn
FROM q
)
SELECT region, product_id, product_name, revenue, rn AS rank_in_region
FROM ranked
WHERE rn <= 3
ORDER BY region, rn;
Follow-up probes
- "What if the business wants all ties included in the top 3?" → switch to
DENSE_RANK()and filter ondense_rank <= 3. - "How would you page this to return rank 4–6 next?" → parametrize the filter; note the performance hit if the engine doesn't push the filter below the window.
Q6 — As-Of / Point-in-Time Join
Prompt
For each order, find the customer's subscription tier as of the order timestamp. A customer's tier can change over time and we have a history table with valid-from / valid-to per tier. Output: order_id, order_ts, customer_id, tier_at_order_time.
Why it's hard
This is a point-in-time join — the most common mistake is to join on "tier where valid_from <= order_ts and valid_to >= order_ts." That works only if the history is gap-free and exactly one row is valid at any moment. Real-world histories have gaps, overlaps from bad backfills, and open-ended current rows (valid_to = NULL). The senior answer handles all three.
Reference solution (gap-tolerant, picks most-recent valid)
WITH tier_history AS (
SELECT
customer_id,
tier,
valid_from,
COALESCE(valid_to, TIMESTAMP '9999-12-31 00:00:00') AS valid_to
FROM customer_tier_history
),
candidates AS (
SELECT
o.order_id,
o.order_ts,
o.customer_id,
h.tier,
h.valid_from,
ROW_NUMBER() OVER (
PARTITION BY o.order_id
ORDER BY h.valid_from DESC
) AS rn
FROM orders o
LEFT JOIN tier_history h
ON h.customer_id = o.customer_id
AND h.valid_from <= o.order_ts
AND h.valid_to > o.order_ts
)
SELECT
order_id,
order_ts,
customer_id,
COALESCE(tier, 'free') AS tier_at_order_time
FROM candidates
WHERE rn = 1
ORDER BY order_ts;
Follow-up probes
- "What do you do if the history has overlaps?" → rank by valid_from DESC, then by loaded_at DESC, and take the most recently loaded record. Flag the overlap to a data-quality dashboard.
- "When would you reach for an engine-native as-of join (e.g. pandas merge_asof, or ps.merge_asof in Spark's pandas API)?" → when the fact side is much larger than the history side and you'd otherwise shuffle both; an as-of join can keep the small history side broadcast.
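The gap-tolerant lookup itself is a binary search. This sketch (hypothetical `tier_at` helper; timestamps simplified to ints, history for one customer sorted by valid_from) matches the SQL's semantics — valid_from inclusive, valid_to exclusive, open-ended current rows as None, and "free" when no interval covers the order:

```python
import bisect

def tier_at(history, order_ts):
    """history: sorted list of (valid_from, valid_to, tier); valid_to may be None (open-ended)."""
    starts = [h[0] for h in history]
    i = bisect.bisect_right(starts, order_ts) - 1  # last interval starting <= order_ts
    while i >= 0:
        valid_from, valid_to, tier = history[i]
        if valid_to is None or valid_to > order_ts:
            return tier
        i -= 1  # overlap-tolerant: walk back past expired rows
    return "free"  # gap (or pre-history): no covering interval
```

The walk-back loop is the Python analogue of ranking covering rows by valid_from DESC and taking rn = 1.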
Q7 — Funnel Analysis With Ordered Steps
Prompt
A signup funnel has four ordered steps: visit → view_plan → start_checkout → complete_signup. For users in the past 30 days, compute (a) how many users reached each step, and (b) the step-to-step conversion rate. A user only "reaches" a step if they also reached every earlier step, and the events must appear in order within a single 24-hour window.
Why it's hard
Ordered funnels with a time constraint cannot be solved with simple COUNT(DISTINCT CASE WHEN …) over the events table — that would count users who did step 3 before step 1 or across weeks. The senior answer uses window functions to enforce ordering within a window, or array_agg with pattern-matching on the sequence.
Reference solution
WITH ordered AS (
SELECT
user_id,
event_name,
event_ts,
MIN(CASE WHEN event_name = 'visit' THEN event_ts END) OVER w AS t_visit,
MIN(CASE WHEN event_name = 'view_plan' THEN event_ts END) OVER w AS t_view,
MIN(CASE WHEN event_name = 'start_checkout' THEN event_ts END) OVER w AS t_start,
MIN(CASE WHEN event_name = 'complete_signup' THEN event_ts END) OVER w AS t_done
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '30 days'
WINDOW w AS (PARTITION BY user_id)
),
per_user AS (
SELECT
user_id,
t_visit,
CASE WHEN t_view > t_visit AND t_view <= t_visit + INTERVAL '24 hours' THEN t_view END AS t_view_ok,
CASE WHEN t_start > t_view AND t_start <= t_visit + INTERVAL '24 hours' THEN t_start END AS t_start_ok,
CASE WHEN t_done > t_start AND t_done <= t_visit + INTERVAL '24 hours' THEN t_done END AS t_done_ok
FROM ordered
WHERE t_visit IS NOT NULL
GROUP BY user_id, t_visit, t_view, t_start, t_done
)
SELECT
COUNT(DISTINCT CASE WHEN t_visit IS NOT NULL THEN user_id END) AS n_visit,
COUNT(DISTINCT CASE WHEN t_view_ok IS NOT NULL THEN user_id END) AS n_view,
COUNT(DISTINCT CASE WHEN t_start_ok IS NOT NULL THEN user_id END) AS n_start,
COUNT(DISTINCT CASE WHEN t_done_ok IS NOT NULL THEN user_id END) AS n_done,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN t_view_ok IS NOT NULL THEN user_id END)
/ NULLIF(COUNT(DISTINCT CASE WHEN t_visit IS NOT NULL THEN user_id END), 0), 2) AS visit_to_view_pct,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN t_start_ok IS NOT NULL THEN user_id END)
/ NULLIF(COUNT(DISTINCT CASE WHEN t_view_ok IS NOT NULL THEN user_id END), 0), 2) AS view_to_start_pct,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN t_done_ok IS NOT NULL THEN user_id END)
/ NULLIF(COUNT(DISTINCT CASE WHEN t_start_ok IS NOT NULL THEN user_id END), 0), 2) AS start_to_done_pct
FROM per_user;
Follow-up probes
- "What if a user can enter the funnel more than once?" → redefine "reaching a step" per funnel-instance, not per user. Add a
funnel_idvia sessionization (reuse Q2's trick). - "Engines vary on ordering guarantees in nested windows. Which engine are you targeting?" → call out Snowflake / BigQuery vs. Spark differences.
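The ordering-plus-deadline rule is worth unit-testing on its own before trusting the SQL. A minimal per-user sketch (hypothetical `furthest_step` helper, assuming the four-step funnel from the prompt):

```python
from datetime import datetime, timedelta

STEPS = ["visit", "view_plan", "start_checkout", "complete_signup"]

def furthest_step(events, deadline=timedelta(hours=24)):
    """events: list of (event_name, ts) for one user.

    Returns the index of the deepest step reached in order within the
    deadline window anchored at the first visit, or -1 if no visit.
    """
    firsts = {}
    for name, ts in events:
        if name in STEPS and (name not in firsts or ts < firsts[name]):
            firsts[name] = ts  # earliest occurrence of each step
    if "visit" not in firsts:
        return -1
    t_visit, prev, reached = firsts["visit"], firsts["visit"], 0
    for i, step in enumerate(STEPS[1:], start=1):
        ts = firsts.get(step)
        # each step must follow the previous step AND land within 24h of the visit
        if ts is None or ts <= prev or ts > t_visit + deadline:
            break
        reached, prev = i, ts
    return reached
```

Counting users per step is then a fold over `furthest_step` per user — a user at step k contributes to every step index up to k, which is the "reached every earlier step" semantics.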
Q8 — Anomaly Detection by Z-Score Per Group
Prompt
For each merchant, flag daily transaction counts in the past 90 days that are more than 3 standard deviations from the merchant's own 90-day mean. Output: merchant_id, txn_date, txn_count, rolling_mean, rolling_stddev, z_score, is_anomaly.
Why it's hard
Three traps: (1) you need a rolling mean and stddev over the last 90 days per merchant, which is a moving window — not just a group aggregate; (2) merchants with fewer than, say, 14 days of history should be excluded or flagged separately because z-scores on small N are meaningless; (3) if you include the current row in the mean you bias the anomaly test.
Reference solution
WITH daily AS (
SELECT merchant_id, DATE(txn_ts) AS txn_date, COUNT(*) AS txn_count
FROM transactions
WHERE txn_ts >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY 1, 2
),
stats AS (
SELECT
merchant_id,
txn_date,
txn_count,
AVG(txn_count) OVER w AS rolling_mean,
STDDEV(txn_count) OVER w AS rolling_stddev,
COUNT(*) OVER w AS n_days
FROM daily
WINDOW w AS (
PARTITION BY merchant_id
ORDER BY txn_date
ROWS BETWEEN 89 PRECEDING AND 1 PRECEDING
)
)
SELECT
merchant_id,
txn_date,
txn_count,
ROUND(rolling_mean, 2) AS rolling_mean,
ROUND(rolling_stddev, 2) AS rolling_stddev,
CASE
WHEN n_days < 14 THEN NULL
WHEN rolling_stddev = 0 THEN NULL
ELSE ROUND((txn_count - rolling_mean) / rolling_stddev, 2)
END AS z_score,
CASE
WHEN n_days < 14 OR rolling_stddev IS NULL OR rolling_stddev = 0 THEN FALSE
WHEN ABS((txn_count - rolling_mean) / rolling_stddev) > 3 THEN TRUE
ELSE FALSE
END AS is_anomaly
FROM stats
ORDER BY merchant_id, txn_date;
Follow-up probes
- "Z-score assumes normality — transaction counts are often Poisson. What would you use instead?" → MAD (median absolute deviation), or a Poisson-based confidence interval, or a seasonally-decomposed residual.
- "How would you productionize this?" → as a materialized view refreshed hourly; alerts go into a low-cardinality table, not straight to PagerDuty; include a 15-minute debounce.
Q9 — Median Per Group Without a MEDIAN Function
Prompt
Compute the median order value per country for the past year. The target engine does not have a MEDIAN function (assume older Postgres). Use window functions and produce deterministic output for both odd and even counts per group.
Reference solution
WITH ordered AS (
SELECT
country,
amount,
ROW_NUMBER() OVER (PARTITION BY country ORDER BY amount ASC, order_id ASC) AS rn_asc,
COUNT(*) OVER (PARTITION BY country) AS n
FROM orders
WHERE order_ts >= CURRENT_DATE - INTERVAL '365 days'
),
medians AS (
SELECT country, amount
FROM ordered
WHERE
(n % 2 = 1 AND rn_asc = (n + 1) / 2)
OR
(n % 2 = 0 AND rn_asc IN (n / 2, n / 2 + 1))
)
SELECT country, AVG(amount) AS median_amount
FROM medians
GROUP BY country
ORDER BY country;
Follow-up probes
- "How would you compute the 95th percentile the same way?" → the general form is rn_asc = CEIL(n * 0.95) for the lower index, with interpolation if you need continuous percentiles.
- "What about approximate percentiles at scale?" → introduce APPROX_PERCENTILE / t-digest and discuss the accuracy / cost trade-off.
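The rank arithmetic in the WHERE clause is compact enough to verify directly. A sketch of the same selection rule in Python (hypothetical `median_by_rank` helper, 1-indexed ranks as in ROW_NUMBER):

```python
def median_by_rank(values):
    """Middle row for odd n; average of the two middle rows for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]            # rn_asc = (n + 1) / 2
    lo, hi = ordered[n // 2 - 1], ordered[n // 2]   # rn_asc IN (n/2, n/2 + 1)
    return (lo + hi) / 2                             # the AVG over the two rows
```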
Q10 — Mutual / Reciprocal Relationships
Prompt
Given a directed follows table (follower_id, followee_id), find all mutual pairs — where A follows B and B follows A — with each pair listed only once, ordered by the smaller user_id first.
Reference solution
SELECT
LEAST(follower_id, followee_id) AS user_a,
GREATEST(follower_id, followee_id) AS user_b
FROM follows f1
WHERE EXISTS (
SELECT 1 FROM follows f2
WHERE f2.follower_id = f1.followee_id
AND f2.followee_id = f1.follower_id
)
AND f1.follower_id < f1.followee_id;
Why the < filter matters
Without f1.follower_id < f1.followee_id, each mutual pair appears twice (once as A→B, once as B→A). The combination of EXISTS and the asymmetric filter keeps each pair exactly once.
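The same dedup trick translates to a one-liner over a set of directed edges. A minimal sketch (hypothetical `mutual_pairs` helper; the set handles duplicate rows in the input):

```python
def mutual_pairs(follows):
    """follows: iterable of (follower_id, followee_id) directed edges."""
    edges = set(follows)
    return sorted(
        (a, b)
        for a, b in edges
        if (b, a) in edges and a < b  # reciprocal, and smaller id first — each pair once
    )
```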
Follow-up probes
- "How do you extend this to find friend-of-a-friend recommendations?" → self-join the mutual-pairs CTE on either user, exclude existing follows, rank by count of common connections.
- "This query is slow at 500 M rows. What do you change?" → cluster
followsbyfollower_id, consider a pre-computedmutualstable refreshed nightly.
5. System Design Transcripts — Bad, Average, Strong
Reading a system-design answer is not the same as hearing one. What follows are four prompts rendered as transcripts at three skill levels each, with a scorecard on why each lands where it does. Read them out loud. The rhythm is part of what you're practicing.
Scenario A — Design an Event Pipeline for 10 M Events/Day
Prompt (verbatim)
Our mobile app sends ~10 M events per day. Product wants both a near-real-time operational dashboard (freshness under 5 minutes) and a nightly business report. Design the pipeline.
Bad answer (what not to do)
"I'd use Kafka, then Spark Structured Streaming, then write to Snowflake. The dashboard reads from Snowflake. The nightly report also reads from Snowflake."
Why it fails: No clarifying questions. No volumes (10 M/day is only the headline — what's the peak multiple of the mean? what's the payload size?). No SLA decomposition (operational dashboards and nightly reports have different correctness tolerances, not just freshness). No failure modes. No cost. The interviewer will push and push, hoping to find ground.
Average answer
"10 M events per day is ~115 events per second average, probably 1,000 per second peak. Events are small, maybe 2 KB each, so ~20 GB per day. I'd use Kafka for ingest with three brokers, retention of 7 days for replay. A Spark Structured Streaming job writes to a bronze Iceberg table every 2 minutes with micro-batch trigger. A second job transforms bronze to silver hourly. The dashboard queries silver via a serverless warehouse. The nightly report runs on gold tables materialized at 2am.
For failure handling, I'd use idempotent writes keyed on event_id. If Spark dies mid-batch, the checkpoint lets it resume. If Kafka is backed up, the dashboard shows stale data but doesn't break."
Why it's only average: correct, complete enough to pass a senior bar, but doesn't surface tension. No mention of schema evolution, late data, or the cost gap between streaming bronze and batching silver. No mention of who owns which SLA when something breaks.
Strong answer (Staff+)
"Three clarifying questions before I design. One: the 5-minute freshness — is that end-to-end from app-emit to dashboard, or from ingestion to dashboard? The difference is a mobile SDK retry tail that can add 30 minutes. Two: is the dashboard aggregated or row-level? Aggregations can tolerate more latency because they're less sensitive to individual drop-outs. Three: who is the 3am owner of this pipeline, and what breakage wakes them up versus opens a ticket?
Assuming end-to-end 5 min, aggregated dashboard, and on-call is a SRE team that wants noise-free alerts — here's the design.
Mobile SDK batches events and posts to a collector behind a CDN. The collector validates schema and writes to Kafka partitioned by user_id for ordering. Retention is 7 days so we can replay a full week. Spark Structured Streaming consumes with a 1-minute trigger, writes to a bronze Iceberg table with hidden partitioning on event_date. That gives us cheap compaction and schema evolution without consumer breakage.
A second streaming job projects bronze into silver with typed columns and business-key resolution. The dashboard queries silver, filtered to the past 24 hours, via a warehouse that auto-scales on idle. For the nightly report, a batch job at 2am reads silver, joins with reference data, and writes a gold table consumed by the BI tool.
Three failure modes worth naming. First: late data. SDK retries can arrive 12 hours after the event. I'd set a watermark of 2 hours on silver and emit late arrivals to a "late" side-table that the nightly batch re-incorporates. The dashboard never sees late data — it's already moved on. Second: schema drift. I'd require the bronze writer to fail-open with a quarantine record whenever an unknown field appears, and alert the producer team, not the consumer team. Third: cost runaway. If the mobile SDK ships a bug that 10x's event volume, Kafka's retention will still hold 7 days, Spark will throttle at its configured max, and the dashboard will stale rather than crash. That's the correct failure mode because a stale dashboard is a pageable event and a crashed cluster is a catastrophe.
One trade-off to flag: we're duplicating storage (bronze + silver + gold). The alternative is materialized views on raw, which saves storage but gives us less control over cost and less visibility into pipeline failures. At 20 GB/day raw, that's only ~7 TB/year; keeping all three tiers costs ~$2,000/year in storage. Cheap. I'd keep the three-tier design. If we were at 10x the scale, the trade-off would reverse."
Why it's strong: opens with scoping, states assumptions before designing, names who pages at 3am (org awareness), discusses failure modes as first-class content, quantifies cost, closes with a trade-off the interviewer didn't ask about. That last move is the staff-level signature — volunteering the tension you chose to live with.
Scenario B — "The Dashboard Numbers Don't Match" (Live Debug)
Prompt (verbatim)
It's Tuesday morning. Two dashboards show different values for the same KPI — yesterday's revenue. The finance team reports $4.2M; the executive dashboard reports $3.9M. Both were "right" last week. Walk me through how you investigate.
Bad answer
"I'd check the queries behind each dashboard and compare them."
Why it fails: this is what the junior tells the senior who then has to do the actual work. The interviewer is testing investigation methodology, not tooling. A one-sentence answer signals "I've never owned an incident like this."
Average answer
"First I'd reproduce both numbers myself against the warehouse — confirm finance's $4.2M and exec's $3.9M are the right numbers from each dashboard's source query. Then I'd inspect the SQL of both queries. Usually the difference is timezone — one uses UTC and one uses local time — or a join filter that includes/excludes refunds.
If SQL is identical, I'd check the source tables — are they reading the same table? If one reads gold.revenue and the other reads silver.orders, those pipelines might have different lag. I'd look at pipeline run logs for yesterday.
Once I find the root cause, I'd document it, fix it, and post a message in the data slack channel with the explanation."
Why it's only average: correct debugging sequence, but no prioritization, no mention of stopping the bleeding while investigating, no consideration of downstream consumers, no ownership of communications beyond "post in slack." This is senior IC thinking, not staff thinking.
Strong answer
"Before I touch anything technical — three moves in parallel. One: I IM both dashboard owners to say 'we see a discrepancy, investigating, will post to #data-incidents within 30 min, please don't publish commentary on these numbers externally until we post.' That buys me a window and avoids the CFO telling the CEO a wrong number at 10am. Two: I note the current values from both dashboards with screenshots — the numbers can drift during investigation. Three: I check if there's an active incident, since discrepancies often correlate with upstream failures.
Now the investigation. I treat this as a funnel: at which layer does the $300K disagreement first appear? I run both queries myself against the warehouse and confirm. Then I go up the pipeline — do the upstream silver tables agree? Do the bronze raw tables agree? Somewhere in that chain the numbers diverge, and that's the layer where the bug lives.
The four most likely root causes in order: timezone boundary (revenue before/after midnight attributes to different days), filter drift (one query includes cancellations, the other excludes), late-arriving data (finance pulled later and saw more rows), and join duplication (a new dim row caused some facts to be double-counted after a recent deploy).
Let's say I find the cause: a dim_store refresh last night added a duplicate mapping for two store_ids, which inflated finance's number. Now the fix. Short-term: I roll back the dim_store snapshot. Medium-term: I fix the dim loader to surface uniqueness violations as a pipeline failure, not a warning. Long-term — and this is what most engineers skip — I want to understand why this wasn't caught by the data-quality monitor. Was the monitor not configured on this table? Was the threshold too wide? That's the conversation that prevents recurrence.
For communications, I'd post to #data-incidents with a four-part summary: impact, root cause, fix, prevention. I'd notify finance and the exec team directly about the corrected number, with the delta explained — because the trust repair is as important as the fix.
One meta-point I'd flag: we shouldn't have two dashboards consuming the same KPI from two different sources. That's the real bug. I'd follow up with a proposal to consolidate onto a single certified source, owned by the data team, with a deprecation path for whichever dashboard loses."
Why it's strong: parallel moves (comms + triage + investigation), explicit stakeholder management, structured root-cause hypothesis, short/medium/long-term fix framing, and the closing move — naming the organizational root cause — is unmistakably staff-level.
Scenario C — Migrate 50 TB Hive Table to Iceberg With Zero Consumer Downtime
Prompt (verbatim)
You have a 50 TB Hive-external table on S3, partitioned by day, with 40+ downstream consumers. Product wants to migrate it to Iceberg so we get time travel, safe schema evolution, and compaction. Design the migration. Zero consumer downtime is a hard requirement.
Bad answer
"I'd use Iceberg's migrate-table procedure, run it overnight, and notify consumers in the morning."
Why it fails: doesn't engage the zero-downtime constraint, doesn't state which tools support both the old and new formats during cutover, doesn't plan rollback, doesn't acknowledge 40+ consumers means 40+ migration coordination problems.
Average answer
"Iceberg has two migration paths — migrate which replaces the Hive table in place, and snapshot which creates a parallel Iceberg table pointing at the same files. I'd use snapshot first so the Hive table stays live.
Then I'd point one pilot consumer at the Iceberg table and validate it for a week. If good, I'd ask each consumer team to update their reader, one per week, until everyone is on the new table. Last step: decommission the Hive table.
For rollback, since the snapshot doesn't modify the underlying files, I can always delete the Iceberg metadata and fall back to Hive."
Why it's only average: correct at a high level, but treats "40 consumers update their reader" as a simple line item. In reality that's the bulk of the migration work and the bulk of the risk. Also doesn't address what happens to writes during the cutover window, when both tables must agree.
Strong answer
"Scoping first. 'Zero consumer downtime' needs unpacking — do we mean zero dashboard gaps, or zero read failures? Those are different SLAs. And 'migrate' — are we snapshotting to keep the same files, or ingesting fresh? If we need time travel from before today, we need snapshot. If not, we can do a parallel rebuild and save ourselves a class of bugs.
Assuming snapshot-based migration with zero read failures as the hard SLA, here's the plan in five phases.
Phase 1 — parallel registry. Run Iceberg's snapshot procedure to create a parallel Iceberg table pointing at the same underlying files. The Hive table is unchanged. No consumer sees anything. This phase is safe to test, revert, and retry.
Phase 2 — dual-write. Every job that writes to the Hive table now also writes to the Iceberg table. I'd rather not do this for 50 TB historically, so dual-write applies to new partitions only. Historical partitions are snapshotted once and read-only. I add a data-quality check that sums both tables nightly and pages on divergence over 0.01%.
Phase 3 — pilot consumers. I identify three low-risk consumers: a BI dashboard with a weekly refresh, a data-science notebook used by one team, and a back-office report. I migrate them first. I give each one two weeks to raise issues. I publish a migration guide with the exact reader config change required — engine, version, one-liner diff.
Phase 4 — tiered rollout. I rank the remaining 37 consumers by three axes: business criticality, reader engine, and how many consumers share that engine. I migrate them in cohorts, one cohort per sprint, with a "flag day" where all readers for a given engine cut over on the same Monday. I require every cohort to explicitly sign off — no passive migration.
Phase 5 — deprecation. Only after 100% of the 40 consumers are on Iceberg do I shut off the Hive metastore entry for the old table. I keep the underlying S3 files for 90 days for forensic rollback, then delete.
Rollback plan, which I want to name explicitly because it's the part that determines whether I'll actually ship this: at every phase, the previous state is still reachable. If Phase 2 dual-write introduces a bug, I disable the Iceberg write and the Hive table is still correct. If Phase 3 pilot fails, consumers swap their reader back. If Phase 4 surfaces a platform-wide bug, we pause and fix before the next cohort. I would not start without these rollback contracts written down.
Three risks worth naming. First: engine support. Not every consumer's reader speaks Iceberg at the version we'd target — the Presto team might be on an older release that doesn't support Iceberg's v2 spec. I need a reader-engine inventory before Phase 1. Second: catalog. If the company uses Glue as the Hive metastore but we want Iceberg on a REST catalog, we're also migrating the catalog. That's a separate project and I'd treat it as such. Third: compaction. Iceberg's advantage is smaller files via compaction, but until we run compaction the new table looks identical to the old. Consumers will not feel the benefit for weeks, which is a narrative risk — leadership might ask "why are we doing this" mid-migration. I'd schedule the first compaction for right after Phase 3 so there's something to show."
Why it's strong: phases are named and sequenced with explicit entry/exit criteria; rollback is treated as first-class, not a footnote; risks are named with mitigations; and the narrative risk in the last paragraph is a staff-level move — the candidate is thinking about the political economy of the migration, not just the technical one.
Scenario D — Real-Time Fraud Detection for Card Transactions
Prompt (verbatim)
Design a real-time fraud detection system for card transactions at a payments processor. Volume: 5,000 TPS steady, 20,000 TPS peak during holidays. Decision budget: sub-300 ms p99 from transaction arrival to approve/decline. The model team will own the model; you own everything else.
Bad answer
"I'd stream transactions through Kafka, call a model service, return the result. If the model is slow I'd add a cache."
Why it fails: no scoping, no SLAs decomposed, no feature-store plan, no fallback, no model team interface, no regulation mention, no audit log. Payments is a compliance-heavy domain and a good answer must signal awareness of that.
Average answer
"Transactions arrive via an API gateway, land in Kafka, and are consumed by a Flink job that enriches them with features — card history, merchant history, geo-velocity — and calls the fraud model. The model returns a score; if it's above a threshold, the transaction is declined. The result is posted back to the gateway within the latency budget.
For features, I'd use a feature store with online and offline tiers. The online tier is Redis or DynamoDB for sub-10ms lookups. The offline tier is a lakehouse table for training.
For the model, I'd support A/B by shadow-scoring with a second model and logging both outputs.
For audit, every decision is logged with the input features, the score, and the threshold applied."
Why it's only average: the shape is right, but the answer is generic — it would describe any ML serving system. Missing: the 300 ms budget decomposed into how many ms for each hop; the failure behavior when the model is unavailable; the regulatory constraint that declines cannot be retried on a different model version; and the interface contract with the model team that prevents them from breaking the latency budget unilaterally.
Strong answer
"A few scoping questions. One: is 300 ms p99 end-to-end from gateway-in to gateway-out, or is it just the decision engine? The answer changes the budget for feature lookup. Two: what's the fallback when the model is unavailable? 'Approve everything' is fraud-friendly; 'decline everything' is customer-friendly. Someone — probably not us — has a business answer. Three: is this the only decision system, or is there a rules engine ahead of or behind us? Interaction with rules is often where the latency budget gets eaten.
Assuming 300 ms end-to-end, deny-on-failure at the rules engine, and we're the single ML decision point — here's the design.
Budget decomposition: 20 ms network gateway→us, 20 ms us→gateway, 50 ms feature lookup, 100 ms model inference, 50 ms auxiliary logging and control plane, 60 ms buffer. That's the shape of the SLA. Anyone who eats into it — a model change that adds 50 ms, a feature that requires a fresh join — has to justify it against that budget.
Transactions land at a regional API gateway that validates, authenticates, and writes to a Kafka topic partitioned by card_hash. A stateless Flink job consumes with per-key ordering guarantees. The job reads online features from Redis (cached card- and merchant-level aggregates refreshed every 5 minutes) and from DynamoDB (longer-tail features). The model is deployed as a gRPC service with per-request timeouts of 150 ms; if it times out, we emit a fallback decision from the rules engine and flag the transaction for review.
The feature store has two sides — online (Redis + DynamoDB) for real-time lookups, offline (Iceberg) for training. The online-offline parity guarantee is critical: every feature must be written through the same code path to both stores so training and serving can't drift. I'd enforce this with a feature-registration contract — model team cannot deploy a feature in prod that isn't registered in the offline store and backfilled.
For model deployment, shadow-mode first: new models score alongside the current champion for a week without affecting decisions. If the shadow passes drift and calibration checks, a canary cohort (say, 1% of traffic) gets the new decisions. Only after a clean 7 days do we ramp to 100%. Every step is automatically rolled back on a pager page for latency regression or fraud-recall regression.
Three regulatory and operational points. First: every decline must be explainable. I'd log the top three feature contributors per decision, sourced from the model's SHAP output, retained for 7 years. That's a compliance minimum in this industry. Second: decisions cannot be retried on a different model version. If we decline, a retry hitting a newer model version that approves would constitute a split decision — a disaster for audit. I'd pin the model version per decision_id and never serve a retry with a different version. Third: the model team owns model quality, I own the system. I'd publish a two-line contract: they guarantee p99 inference under 100 ms at any given version, I guarantee feature freshness within SLA and fallback-on-timeout. Anything outside those contracts goes to a monthly joint review.
Cost: at 5,000 TPS steady with 20,000 peak, you're looking at roughly $50–100K/month just for the online feature store at that request rate, plus model GPU costs. I'd budget accordingly and flag early that if we push sub-100 ms we're in a different cost regime.
Finally, one risk I'd name loudly: holiday peak is 4x steady. If we autoscale on traffic and the model inference is stateful (GPU warm-up takes 2 min), we'll see cold-start latency violations during the ramp. The cleanest mitigation is capacity-plan to peak, accept the cost, and run scheduled scale-up the week before known peaks. Autoscaling alone is the wrong tool here."
Why it's strong: latency budget decomposed, online-offline parity named, shadow → canary → ramp rollout, compliance signals (explainability, model-version pinning, retention period), an explicit contract with the model team, a cost estimate, and the closing "holidays + autoscaling = cold starts" callout — the kind of specific failure-mode naming that separates a staff answer from a merely senior one.
6. Behavioral STAR Frames
Every senior loop includes at least one dedicated behavioral round, and behavioral signal gets probed inside every technical round too. The STAR framework (Situation-Task-Action-Result) is the industry default — but most candidates apply it badly. This section is the correction.
STAR Is Not A Script. It's A Skeleton.
The frame has four parts, each with a strict time budget:
- Situation (2 sentences, ~20 seconds): where and when, role of others, why it mattered. No backstory dumps.
- Task (1 sentence, ~10 seconds): what you specifically were responsible for.
- Action (4–6 sentences, ~90 seconds): what you did, with verbs and decisions. Not what the team did — what you did.
- Result (2 sentences with numbers, ~30 seconds): the outcome with a metric, and what you learned or changed because of it.
Total spoken time: the budgets sum to about two and a half minutes — comfortably under 4 even with elaboration. If you can't tell the story in 4 minutes, either it's two stories or you're hiding something in the actions.
The Eight Canonical Frames
You need one prepared story for each of these frames: written on an index card, rehearsed out loud with a timer.
- Disagreement resolved. A time you disagreed with a senior person and worked through it. Tests influence without authority, emotional regulation.
- Failure owned. A significant technical failure that was your responsibility. Tests ownership and growth mindset.
- Trade-off under pressure. A time you had to ship something imperfect because of a deadline or cost constraint. Tests judgment and pragmatism.
- Influence across teams. A time you got people outside your team to do something they weren't planning to do. Tests scope and communication.
- Mentored someone. A time you materially raised the skill of another engineer. Tests teaching ability — critical at staff+.
- Said no to a stakeholder. A time you declined a reasonable-sounding ask because the engineering cost wasn't worth the value. Tests principled pushback.
- Led through ambiguity. A time the problem was undefined and you had to define it. Tests senior-level scoping.
- Drove alignment. A time multiple teams were working at cross-purposes and you moved them to a shared plan. Tests staff-level scope.
Worked Example — "Disagreement Resolved"
Bad version: "The lead architect wanted to use Kafka, I wanted to use Pulsar. We argued about it and eventually settled on Kafka."
What's wrong: no stakes, no actions, no outcome. Reads like losing the argument and still submitting it as an answer.
Average version: "Our lead architect wanted to standardize on Kafka. I'd used Pulsar at my previous company and thought the geo-replication story was better for our multi-region rollout. I wrote a short doc comparing the two and we discussed it in a design review. We ended up going with Kafka because of team familiarity. I learned that tool familiarity beats feature-set at most decisions."
What's right: structured, correctly humble ending, actual lesson. What's missing: no specific actions, no tension beyond the abstract. Forgettable.
Strong version: "[Situation] At a payments company in 2024, we were about to stand up a new event platform for multi-region fraud detection. Our principal engineer made an early decision on Kafka because the team had just finished a six-month upgrade and felt battle-tested. [Task] I was the tech lead for one of three consumer teams and had four weeks of experience running Pulsar at my previous job. I thought geo-replication would matter more than we were weighting it.
[Action] I didn't push back in the meeting — I wanted to do the work first. I spent two evenings writing a six-page doc comparing the two on five axes: throughput ceiling, ops burden, geo-replication correctness under partition, cost at our projected scale, and team ramp-up. I deliberately included our fraud team's exact SLAs and projected a twelve-month cost envelope. I shared it with the principal engineer first, in a private 1:1, with the explicit opener: 'I'm probably wrong but I want you to check my math before I share this.' He caught one real error in my cost model and reshaped the throughput section. Then we shared the revised doc with the broader group.
[Result] We went with Kafka for the core platform but adopted Pulsar for the cross-region mirror because it met a specific geo-replication need that Kafka didn't cover cleanly. I would have lost a pure-principle argument. The written doc, and asking him to review it first, turned the disagreement from 'who wins' into 'what's right.' Three months later the principal engineer asked me to co-author the next major platform doc. [Lesson] A thing I do now as a default: when I disagree with a senior decision, I write the case as if I'll be proven wrong, and I give the person I disagree with the first review."
Why this version works: specific numbers (six pages, two evenings, five axes, twelve-month envelope), a specific sentence the candidate said out loud ("I'm probably wrong but…"), a nuanced outcome (not a clean win, a compromise that served both concerns), and a closing "what I do now" that signals durable behavioral change.
Worked Example — "Failure Owned"
Weak opening: "I made a mistake once where…"
Strong opening: "In 2023 I pushed a backfill at 3pm on a Thursday that doubled the revenue numbers on the executive dashboard for six hours. The CFO saw it before we caught it."
The strong opening names the impact in the first sentence. No warm-up, no hedging. The rest of the story then earns the right to be heard.
Three rules for failure stories:
- Own it in the first sentence. "I pushed," not "we had an incident where."
- Name the specific decision that failed, not a vague "we didn't have enough tests." What decision — yours — would you make differently?
- The fix is systemic, not just personal. "I added a test" is junior. "I proposed, and we shipped, a pre-deploy data-diff gate that catches this class of error at the CI layer" is senior.
Anti-Patterns — The STAR Failures
- The we-drift. You start with "I" and by minute two you're saying "we" for everything. The interviewer loses track of your specific contribution. Fix: every paragraph of the Action section must contain the word "I" with a verb.
- The hero arc. Every story ends with you being thanked, promoted, or proven right. It rings false. Real senior stories include at least one "what I'd do differently" in the Result.
- The silent co-star. You never name the other humans in the story. Interviewers reading between the lines assume you can't work with people. Fix: name roles (not real names) — "the staff engineer on the adjacent team," "the product manager new to our group."
- The cliffhanger result. Your Result is "and then I changed jobs" or "and then the project got reprioritized." Pick a different story. The Result section must actually resolve.
- The date-less incident. "At a previous company, some time back…" is unanchored and hard to assess. Always anchor in a year and a team size.
The Cross-Round Story Portfolio
Across a 4-round loop, you'll have maybe 3–4 chances to tell a prepared story, plus 2–3 smaller "tell me about a time" asks woven into technical rounds. You want to show different sides — not tell the same story in three different frames.
Map your eight prepared stories to at least four distinct skills: technical depth, cross-team influence, mentoring, and operational judgment. If you look at your portfolio and three of the eight are "I debugged a hard incident," rotate. Variety is what hiring committees reward.
7. Decision Frameworks — The Cheat Sheet
Many interview rounds test whether you can make a reasoned decision between two legitimate options, not whether you know the options exist. These are the most common forks and the axis each hinges on. Memorize the frames, not the conclusions — every real decision depends on context, and the reasoning is what gets scored.
Batch vs Streaming
The decision axis is value of freshness vs cost of continuous compute. If the downstream consumer acts on the data within minutes, streaming. If a human reads it daily or a model retrains weekly, batch. The trap: candidates default to streaming because it sounds modern. Real answer: streaming is roughly 3–10x the ops overhead of batch and should be justified by a concrete business lever, not a feeling.
A clarifying question that works in any round: "Who acts on this data, and how soon after arrival?" If the answer is "a dashboard refreshed at 9am" — batch. If it's "a fraud decision in 200 ms" — streaming. If it's "a data scientist who runs queries ad hoc" — batch with a small streaming tier for the last hour.
Normalize vs Denormalize
The decision axis is update frequency vs read pattern fan-out. Normalize when the source of truth changes often and the consumers are narrow. Denormalize when the source is stable and the consumers are many and varied. In a lakehouse, denormalized wide tables are almost always the right answer for analytics layers: columnar storage makes the read cost of unused columns near-zero, and eliminating joins is the dominant win in the access-pattern trade-off.
Ask: "Is this table read a thousand times for every write?" If yes, denormalize aggressively. "Is the write source a system of record that will audit every change?" If yes, normalize the source and denormalize only the consumer layer.
Build vs Buy
Three questions in order. One: is this a core competency of our business? If yes, build — the control and optionality matter more than the cost. Two: does an off-the-shelf option meet 80% of our needs at 20% of the cost? If yes, buy, and put the 20% engineering savings toward the core. Three: what's the exit cost if we buy? If vendor lock-in on this layer would require a year-long migration, discount the "buy" option heavily.
Avoid the false economy of "we'll build a simple version." Every simple version becomes a complex version in three years with no maintainer. If you wouldn't staff a team of three to own it, don't build it.
Schema-on-Read vs Schema-on-Write
Schema-on-read is attractive in bronze/raw layers where the cost of rejecting a malformed record is higher than the cost of a downstream error. Schema-on-write is right in silver/gold where consumers depend on predictable typing. The senior pattern is schema-on-read at ingest with typed contracts at the bronze→silver transform — reject bad records to a quarantine table, alert the producer, keep the main pipeline flowing.
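The bronze→silver gate above can be sketched in plain Python — a minimal illustration under assumed names (`SILVER_SCHEMA` and `bronze_to_silver` are hypothetical); a real pipeline would do this in a Spark or Flink transform and write the quarantine rows to a table while alerting the producer:

```python
from datetime import date

# Typed contract for the silver layer (hypothetical columns for illustration).
SILVER_SCHEMA = {"order_id": str, "amount": float, "order_date": date}

def bronze_to_silver(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Route records that satisfy the typed contract to silver; quarantine the rest."""
    silver, quarantine = [], []
    for rec in records:
        # A record passes only if every contracted column exists with the right type.
        typed_ok = all(isinstance(rec.get(col), typ) for col, typ in SILVER_SCHEMA.items())
        (silver if typed_ok else quarantine).append(rec)
    return silver, quarantine

silver, quarantine = bronze_to_silver([
    {"order_id": "A1", "amount": 9.99, "order_date": date(2024, 5, 1)},
    {"order_id": "A2", "amount": "9.99", "order_date": date(2024, 5, 1)},  # string amount
    {"order_id": "A3", "amount": 4.50},                                    # missing column
])
assert len(silver) == 1 and len(quarantine) == 2  # bad rows parked, pipeline keeps flowing
```

The key property: a malformed record never stops the batch — it is parked with enough context to alert the producer while the main pipeline keeps flowing.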
Warehouse vs Lakehouse vs Operational Store
Warehouses (Snowflake, BigQuery) optimize for analytics read patterns with tight governance, fast answer times, and high cost-per-GB at scale. Lakehouses (Iceberg/Delta on S3) win when data volume crosses ~100 TB or when the same data serves both analytics and ML training. Operational stores (Postgres, DynamoDB) are for sub-10 ms online lookups, not analytics. The three should coexist; the anti-pattern is forcing one to do another's job because "we already have it."
Spark vs SQL Engine
Spark is the right tool when you're writing non-trivial logic (Python UDFs, iterative algorithms, complex joins over 100 GB+ data). A warehouse SQL engine is the right tool for analytics queries, dimensional joins, and anything a BI tool will render. The overlap zone — moderate-size ETL — goes to whichever platform your team already operates well. Ops maturity beats theoretical fit.
8. The Mock Interview Loop
Mock interviews are the single highest-leverage activity in prep. Most candidates do too few and do them wrong. This section fixes both.
Weekly Cadence
Target three full mocks in weeks 3–4 of the roadmap. Each mock should simulate a real round: 45 minutes, interviewer plays the role faithfully, no pausing to look things up. The post-mock review is as important as the mock itself — budget another 45 minutes to review the recording the same day.
Self-Tape Method (Solo Practice)
When you don't have a partner, record yourself solving a prompt out loud. Set a timer. No pausing. At the end, watch it back with a rubric. Count filler words ("um," "so," "basically"), measure time-to-first-assumption, note every moment you paused mid-sentence. The tape is uncomfortable to watch — that's the point.
Three specific drills:
- The 45-minute system design tape. Pick a prompt from §5. Solve on a whiteboard or blank page. Record. Review for: time budgeting, scoping questions, failure-mode coverage, and the closing trade-off.
- The 20-minute SQL tape. Pick a question from §4. Solve on paper with narration. Review for: clarifying questions asked, narration cadence, and silent math.
- The 4-minute behavioral tape. Pick a frame from §6. Tell the story. Review for: time budget, "I" vs "we" count, date/team-size anchoring.
Pair-Up Protocol
If you're mocking with another candidate, swap roles. You interview first (learning what good questions look like), then you're interviewed. The act of playing interviewer sharpens your own answers more than any study session.
Set rules before you start: interviewer won't help unless the candidate is truly stuck for 3+ minutes, candidate must narrate, no looking up anything. Afterwards, 10 minutes of specific feedback using the rubric below. Vague feedback like "you did great" is worthless.
Feedback Rubric — Five Dimensions, 1–4 Scale
| Dimension | 1 (weak) | 2 (mid) | 3 (senior) | 4 (staff+) |
|---|---|---|---|---|
| Scoping | Jumped to solution | One clarifying question | Restated + 2–3 questions | Named the real problem under the prompt |
| Narration | Silent stretches >30s | Narrated half the time | Narrated continuously | Narrated and invited collaboration |
| Technical depth | Buzzword-level | Knew the mechanism | Knew the mechanism and the trade-off | Invented a framing to compare mechanisms |
| Failure thinking | Happy path only | Named one failure mode | Systematic failure analysis | Named the class of failures + mitigation |
| Close | Trailed off / time ran out | Summarized the answer | Summarized + one trade-off | Surfaced a tension the interviewer didn't ask about |
Score yourself honestly. A 3 across all five dimensions is a clean senior pass. A 4 anywhere is Staff+ signal. A 1 anywhere is a gap that must be closed before the real loop — not after.
9. The Pre-Interview Checklist
This is the list to run through the week before the loop.
Resume
- For every bullet: scale numbers, your verbs, the trade-off, the failure mode, the "would do differently."
- One sentence — just one — that captures your career arc. Interviewers ask this as an opener in 80% of rounds.
- Remove anything you can't defend under drill-down. If a bullet mentions Kafka, you must be ready for 20 minutes on Kafka.
Project Deep-Dives
- Pick your three strongest projects. Prepare a 2-minute pitch and a 10-minute deep-dive for each.
- The deep-dive must include: architecture diagram (drawn from memory, on paper), the decision you'd revisit, the failure you recovered from, the metric you moved.
- Rehearse out loud. Time yourself.
Company-Specific
- Read at least one engineering blog post from the target team. Pick one specific technical claim to reference in the loop.
- Know the team's stack. If they use a specific cloud, brush up on its nuances the night before.
- Have three reverse questions prepared (see §F3 above).
Logistics
- Test your video, mic, screen-share 48 hours out. Do not test them 10 minutes before the round.
- Second monitor — one for the prompt, one for your scratch notes. Do not screen-share the scratch monitor.
- Water, snack, silent phone. The loop is 4+ hours and most candidates crash at hour 3 from caffeine and dehydration.
The Day Before
- No new study. Re-read your own notes.
- Sleep 8 hours. Interview performance drops measurably below 7 hours.
- Plan your morning routine to the minute: wake, shower, coffee, walk, setup, first round. Remove decisions from the morning.
10. Day-Of Playbook
Morning
- Exercise for 20 minutes, even lightly. It measurably improves cognitive performance for the next 4–6 hours.
- Breakfast with protein. Skip the heavy carbs — the crash will hit mid-round.
- Review your one-page "stories portfolio" — the eight STAR frames, one sentence each. Not the full stories; just the index.
15 Minutes Before Each Round
- Stand up, move, breathe. Do not re-read notes. You already know what you know.
- Look at your three reverse questions. Pick which one fits this round.
- Water. Bathroom. Phone on silent.
During The Round
- Open with a warm "thanks for making time, excited to dig in." Not "let's get started" — this is a conversation, not a meeting.
- In the first 2 minutes, say one thing that isn't an answer to a question. It humanizes you. "I noticed you mentioned [X] in your bio — I've been curious about that."
- Narrate your thinking. Silence is a scoring vacuum.
- If you don't know something: "I haven't worked with that specifically, but based on [related thing] I'd guess it works like…" Buy time to think by demonstrating how you'd think.
- Watch the interviewer's body language / tone. If they're restless, compress. If they're leaning in, go deeper.
- Always leave 3 minutes at the end for questions.
Between Rounds
- Stand up, stretch, breathe. Do not ruminate on the previous round. The next interviewer has not talked to the last interviewer yet — it's a clean slate.
- Write one sentence of notes on the previous round for the post-loop debrief, then close the notebook.
- Walk for 3 minutes if time allows. Oxygen and blood flow matter.
Post-Loop
- Write a 30-minute debrief: one page per round, covering what you were asked, what you answered, what you wish you'd answered.
- Send a brief thank-you within 24 hours to each interviewer, referencing one specific thing they said. Generic thank-yous are neutral; specific ones are memorable.
- Don't re-litigate the loop mentally. You're done. Wait.
11. Tricky Behavioral Questions — Model Answers
Managers ask a specific class of behavioral questions that don't fit the neat STAR frames above. They are deliberately probing for the seams: how you handle a peer who won't back down, how you revive an idea after it's been shot down, how you absorb tough feedback without losing your footing. This section gives the specific moves, with worked examples and phrases you can actually say in the room.
The underlying skill these questions test isn't "do you avoid conflict" — it's can you stay in the conversation when it gets uncomfortable without either capitulating or escalating. Most candidates fail by doing one or the other.
11.1 — "Tell me about a time you had a conflict with a peer"
What the interviewer is really probing
Three things at once: (a) can you surface disagreement directly, or do you route around it; (b) do you separate the person from the position; (c) do you walk away with a working relationship intact. The candidates who fail this question do so by describing a conflict where they were entirely right, the other person was entirely wrong, and the outcome was that the other person "came around." That sounds like a win; to experienced interviewers it signals weak empathy and low collaboration.
The move — the four-beat answer
- Name the tension in neutral terms. "We disagreed on whether to refactor the upstream service before building the new feature."
- State your actual conviction — and acknowledge theirs. "I was confident the new feature would accumulate tech debt if we skipped the refactor. My peer was equally confident that the refactor would burn three sprints for marginal gain."
- Describe the move you made that took the heat out. "I proposed we spike the feature without the refactor for one sprint, instrument the specific debt I was worried about, and let the data tell us whether to refactor before continuing."
- Close with the relationship, not just the outcome. "We ended up shipping on a hybrid plan. Three months later he was the first person I asked when a related decision came up."
Sample phrases that land
- "I notice we're solving two different problems — can we agree on which problem we're actually disagreeing about?" This single sentence defuses more engineering conflicts than any other.
- "You might be right, and if you are, here's what I'd want to see in the data to know." Concedes possibility without conceding the decision.
- "Let me restate your position back — tell me if I have it right before I respond." Most conflicts are people talking past each other; this breaks the pattern.
11.2 — "What if the peer is cocky and won't back down?"
What the interviewer is really probing
Whether you escalate at the right threshold. "I went to my manager" too early signals avoidance; "I just kept pushing" signals bad judgment. The answer lives in the middle: you invest meaningfully in the 1:1 path, you collect evidence, and you escalate with a process proposal, not a person complaint.
The move — four stages, named
- Stage 1 — ask for the underlying principle. Cocky disagreement is often surface-level. Ask: "If the data showed X, would your position change?" If the answer is "no, regardless of data" — you now know it's not a technical disagreement, it's identity. Route accordingly.
- Stage 2 — take the work offline. "Let's not solve this in the standup. I'll put the comparison in a doc by Friday, you add your counter, we regroup Monday." The written form neutralizes verbal dominance games.
- Stage 3 — bring a third technical voice. Not your manager. A respected senior outside both of your reporting lines. Frame it as "we both respect X, let's get their read." This is very different from escalation — you're enlarging the decision-making set, not appealing to authority.
- Stage 4 — genuine escalation. If stages 1–3 fail and the decision must be made, surface to both managers together with a written doc that fairly represents both positions. The last move is always process-oriented: "we need a decision, here's the doc with both sides, we need you to adjudicate."
The sentences that work on a cocky peer
- "I hear your conclusion. I don't yet hear the reasoning that got you there — can you walk me through it?" Invites specificity without conceding anything.
- "We've been going in circles on this for 20 minutes. Let's write it up instead." Breaks the verbal dynamic entirely.
- "I may be wrong. Tell me what I'd need to believe to agree with you." Surprisingly disarming; invites them to articulate the model, which often exposes where the disagreement actually lives.
What not to say
- "I'll defer to you on this one." — you just told the interviewer you fold under pressure.
- "He was just being a jerk." — even if true, you've revealed you can't frame people charitably.
- "I went to my manager." — unless you've first described the 1:1 investment. Manager-first is a red flag.
11.3 — "You presented a new idea. One peer is hell-bent on rejecting it, and he has reasons. How do you handle it?"
What's actually being tested
Whether you can distinguish good-faith opposition from obstruction, and whether you can use good-faith opposition to make the idea better instead of defending it. The best idea-pushers treat a rigorous objector as a gift. That mindset is a strong hire signal.
The move — steelman then separate
- Steelman their objection out loud. "Let me restate what I hear you saying, and tell me if it's right. You believe the approach is the wrong move because of [specific reason 1], [specific reason 2], [specific reason 3]. Did I miss anything?" Do not skip this. Most objectors are never accurately restated back — just doing it creates real goodwill.
- Ask what evidence would change their mind. "If [specific outcome], would that address [specific objection]?" Two outcomes: (a) they give you a concrete test, which is a gift — now you can address it; (b) they can't name any evidence that would change their mind, which means the objection isn't actually technical. Now you know what you're working with.
- Separate the decidable from the undecidable. Some parts of the objection can be settled with a prototype, benchmark, or pilot. Others are pure judgment calls where reasonable people differ. Address the decidable ones with actual work; name the judgment calls as judgment calls and propose how the org should decide them (usually: someone with the authority to adjudicate decides, and you commit either way).
- Invite them to co-own the next step. "If we ran the pilot for four weeks with [metric] as the kill criterion, would you be willing to co-design the pilot?" The skeptic who co-designs the pilot becomes your most credible advocate or your most credible killer — either way you're unstuck.
Phrases that change the room
- "You're asking a good question — I don't have the answer yet. Give me two days to come back with data." Protects your credibility better than faking an answer.
- "If we're going to kill this, let's be clear on the criteria — what would have to be true for us to revisit?" Turns a rejection into a conditional rejection, which is recoverable.
- "I'd rather we killed this for the right reason than ship it and regret it." Signals that you're attached to the outcome, not the idea.
The failure mode
The worst thing to do is double down and try to "win" by sheer persistence. Even if you win the meeting, you've spent political capital that took years to accumulate. The senior move is to separate being right from being heard — and to recognize that an idea killed today can be revived in three months with better data.
11.4 — "Tell me about a time you received feedback that was hard to hear"
What the interviewer is probing
Whether you can hold feedback without either (a) performing humility theater or (b) getting defensive. The bad answer: "I got feedback that my communication was too blunt and I realized I should be nicer." The interviewer has heard that 400 times and it lands as canned.
The senior answer shape
Name the feedback specifically. Name your initial reaction honestly — including the part where it stung. Name what you did with it over the next 90 days, with a specific behavior change. Close with what you'd do differently now if you got the same feedback earlier.
"A director told me in my six-month review that I wrote beautiful design docs that no one finished reading — they were too long and too exhaustive for anyone but me. My first reaction, honestly, was defensive: I thought the docs were my strength. I sat with it for a week before I could see that what felt like thoroughness to me felt like friction to readers. I rewrote my last three docs to cap at one page for the executive summary, with a linked appendix for the depth. Response rate on my doc reviews roughly tripled. The feedback I wish I'd gotten six months earlier was the specific diagnosis — 'too long' is easy to hear once, hard to internalize; 'nobody finishes them' forced me to re-examine the format."
11.5 — "Tell me about a decision you made that turned out to be wrong"
The honest answer
Bad answers for this question are versions of "I took a risk that didn't pay off but I'd do it again." That's not a mistake — that's a risk. The interviewer wants an actual wrong call you made.
Good answer shape: name the decision, name your reasoning at the time, name the signal you had that you ignored or under-weighted, name the outcome, name what you do differently now when that signal reappears.
"In 2023 I picked PostgreSQL as the primary store for a new system where the read pattern was going to be 98% aggregated analytics. My reasoning was that PostgreSQL was battle-tested and we had the operational expertise. The signal I under-weighted: our own internal latency requirements said p99 under 200 ms for queries touching 30 days of data, and I did not actually benchmark whether PostgreSQL could hit that with the data we'd be accumulating. Six months in, a BI dashboard that needed to scan 500 M rows was taking 12 seconds. We migrated to a columnar warehouse, which was the right answer from day one. What I do now: if the access pattern is 90%+ analytics at any projected scale, I start from the analytics store and justify the OLTP side, not the other way around."
11.6 — "Describe a situation where you had to push back on a manager"
What's actually being tested
Whether you can disagree with authority without either rolling over or becoming a problem. Most candidates err in one direction or the other — both are disqualifying for staff+.
The shape of the good answer
Your manager asked for something that you had reason to believe was the wrong call. You invested in understanding their reasoning before pushing back. You made your counter-case in writing with specifics, not in a meeting with heat. You accepted the final decision whether it went your way or not, and if it didn't go your way you executed as if it were your idea.
That last sentence — "I executed as if it were my idea" — is what separates senior from staff. Pushing back is an IC skill. Executing the manager's call with your full effort after you pushed back is a leadership skill. Demonstrate both.
11.7 — "What do you do when you realize mid-project that the plan won't work?"
Why this question gets asked
Most real projects do not go to plan. Interviewers want to see whether you escalate early and clearly, or whether you try to hero-ship and only surface the problem once it's catastrophic.
The four-move answer
- Quantify the miss. Don't say "we're behind." Say "we're three weeks behind on an eight-week plan, and the current trajectory puts us at 12 weeks if nothing changes."
- Reassess the goal, not just the plan. Sometimes the goal itself has drifted and the old plan was chasing the wrong thing. Restate what we're actually trying to achieve.
- Present three options with honest trade-offs. "Option A: keep the scope, slip the date. Option B: cut scope X and Y, hit the original date. Option C: add two engineers, hit the original date with full scope. I recommend B because…" Presenting options is respectful; arriving with only one feels like you've already decided.
- Escalate to whoever owns the trade-off. Schedule the decision, don't wait for it to surface. Attach a deadline: "we need a decision by Thursday to preserve optionality on Option C."
11.8 — The meta-pattern these questions share
Across all of the above, the interviewer is testing a single underlying capability: can you hold tension in a working relationship without either collapsing into it or blowing it up? The candidates who score highest on behavioral rounds never have the cleanest stories — they have stories where both people had real points, where the outcome was messy, where the candidate's growth is named specifically, and where the relationship survived and often improved.
Three practices that make this easier to do live in the interview:
- Name the other person charitably before you describe the disagreement. "A senior engineer I respected…" is much stronger framing than "this guy who insisted…". The framing signals to the interviewer what kind of colleague you are.
- Include at least one sentence where you were wrong, misjudged, or had to revise. Stories with a 100% success rate for the narrator are suspect.
- End with what you'd do differently now or how the experience changed a default you operate with. Without that, the story is an anecdote; with it, it's evidence of growth.
Closing Note
Interviews reward a specific kind of preparation — the kind that produces fluent speech under pressure, not encyclopedic knowledge in silence. Everything in this part is aimed at the first, not the second.
If you do the four-week roadmap, run three mocks, prepare eight STAR frames, and memorize the first 30 seconds of your answer to each scenario in §5 — you will walk into the real loop with more readiness than most candidates ever have. That's the bar to aim for.
Good luck.