Overview
A multi-file knowledge base for deep, production-grade data engineering. Each file goes far beyond surface treatment — internals, math, code samples, war stories, and the things that get asked in senior interview loops.
How to use this
Read in order if you're studying. Jump to a file if you're cramming for a specific area. The final file is interview Q&A built from real scenarios — page-at-3am incidents, design rounds, debugging walkthroughs — not leetcode.
Each file is self-contained. Code samples are runnable (Spark/PySpark, Flink, SQL on Snowflake/BigQuery/Postgres, Python). All examples assume Python 3.11+, Spark 3.5+, Flink 1.18+ unless stated otherwise.
Files
| # | File | What's in it |
|---|---|---|
| 01 | 01-data-modeling.md | Bounded vs unbounded data shapes, Kimball/Inmon/Vault, SCD 0–7 with code, fact grain, OBT vs star, lakehouse modeling, NULL semantics, contract design |
| 02 | 02-batch-processing.md | Bounded data theory, idempotency proofs, incremental + backfill patterns, MERGE patterns, file-format internals, partition design math |
| 03 | 03-streaming-processing.md | Unbounded data, event time vs processing time, watermark math, window mechanics, Flink internals, Kafka protocol, exactly-once proofs |
| 04 | 04-spark-internals.md | Catalyst rules, AQE behavior, shuffle service, broadcast vs SMJ vs SHJ math, skew detection algorithms, memory model, Tungsten |
| 05 | 05-sql-deep-dive.md | Logical → physical plan translation, join algorithms with complexity, indexes & zone maps, window function internals, advanced patterns (gaps, sessionize, PIT, sketches) |
| 06 | 06-python-de.md | GIL at the bytecode level, Pandas vs Polars vs PySpark trade-off math, pandas/Arrow boundary, generators, async patterns, testing strategy, packaging |
| 07 | 07-lakehouse-iceberg.md | Iceberg & Delta on-disk layout, snapshot isolation, MERGE internals, compaction, hidden partitioning, time travel, schema evolution semantics |
| 08 | 08-interview-qa-scenarios.md | 40+ real interview scenarios with full answer skeletons — incident response, system design, deep internals, judgment calls. Not leetcode. |
Reading paths
Cramming for an interview in a week: 08 first to know what you're aiming for, then 02/03 (processing), then 05 (SQL), then 01 (modeling), then 04 (Spark) and 07 (lakehouse) for depth on stack-specific questions.
Building a real system: 01 → 02 → 03 → 07 → 04. SQL and Python depth as needed.
Refresher / lookup: jump to whichever file has the topic. Each file's TOC is browsable.
What "deep" means here
Where the previous deep dive said "watermarks are a promise that no events with event_time ≤ W will arrive," this version derives the watermark formula, walks through how Flink computes it across operators with different latencies, shows the actual math for tuning bounded-out-of-orderness, and gives the code.
Where the previous version said "AQE handles skew automatically," this version shows the algorithm — how Spark detects a skewed partition, how it splits it, what the trade-offs are with broadcast vs shuffle, and the configs that govern each step.
That's the bar.
Data Modeling
The shape of your data at rest outlives every other choice. Code gets rewritten; schemas get migrated but never quite leave. This file goes deeper than "use a star schema" — it derives the decisions from first principles, shows the math that justifies them, and gives full DDL and MERGE patterns you can actually run.
1. Bounded vs Unbounded Data at Rest
Before storage models, the underlying question: is the dataset bounded (finite, the end is known) or unbounded (arriving forever)?
- Bounded at rest: a daily snapshot, a historical archive, a reference table. You can scan it, sort it, compute global aggregates, and re-derive it from source on demand.
- Unbounded at rest: an append-only log of events where you're modeling "everything that ever happened." Storage grows forever. Queries must always constrain time.
This distinction drives physical choices:
| Aspect | Bounded tables | Unbounded tables |
|---|---|---|
| Partitioning | Optional, by category | Required, by time |
| Sort key | By query predicate | By time, always |
| Vacuum policy | Rare, small savings | Aggressive — old partitions dropped or cold-tiered |
| Query pattern | Full scans tolerable | Must always include time bound |
| Compaction | Manual, when files accumulate | Continuous or scheduled |
The engineering mistake is treating an unbounded stream-origin table as bounded — e.g., SELECT COUNT(DISTINCT user_id) FROM events without a dt filter. Over time this becomes unrunnable. Build guards into your table design: required partition columns and partition-prune-or-fail settings like BigQuery's require_partition_filter=TRUE. (Snowflake has no DDL-level equivalent; there, enforcement is operational.)
-- BigQuery: reject queries that don't filter on the partition column
CREATE TABLE playback.events_daily (
event_ts TIMESTAMP,
user_id STRING,
title_id STRING,
watch_ms INT64,
dt DATE
)
PARTITION BY dt
OPTIONS (
require_partition_filter = TRUE,
partition_expiration_days = 730 -- auto-drop after 2 years
);
-- Snowflake equivalent: rely on clustering + session param
ALTER SESSION SET QUERY_TAG = 'enforce_partition_filter';
-- Enforcement is done via resource monitors + query profiling, not DDL.

The unbounded nature also affects backfills. For a bounded dimension, "regenerate from source" is a few GB; for an unbounded fact, it's years of events.
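Since Snowflake won't enforce this in DDL, the guard can live in pipeline code instead. A minimal sketch of the idea, using a naive regex check rather than a real SQL parser (the helper name is invented; a production version would inspect the parsed plan):

```python
import re

def has_partition_filter(sql: str, partition_col: str = "dt") -> bool:
    """Crude guard: does the query's WHERE clause mention the partition
    column at all? (A real implementation would walk the parsed AST.)"""
    where = re.search(r"\bWHERE\b(.*)", sql, re.IGNORECASE | re.DOTALL)
    return bool(where and re.search(rf"\b{partition_col}\b", where.group(1)))

has_partition_filter("SELECT COUNT(DISTINCT user_id) FROM events")           # → False
has_partition_filter("SELECT COUNT(*) FROM events WHERE dt = '2026-04-16'")  # → True
```

Wire a check like this into the submission path for known unbounded tables and the "accidental full scan" class of incident mostly disappears.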
2. OLTP vs OLAP — The Physical Reality
The choice between row-oriented and column-oriented storage isn't philosophical — it's dictated by what the hardware does with your queries.
Row-oriented storage (Postgres, MySQL, Oracle)
A row is laid out contiguously on disk. Reading row 42 = one random I/O + parse the row. Fast for "give me all columns of one row by primary key." Slow for "give me the average of column X across 100M rows" because every row must be loaded even though only one column is needed.
The storage page (typically 8 KB in Postgres, 16 KB in MySQL) is the unit of I/O. Postgres reads whole pages into shared_buffers. TID is (page_number, slot_number). B-tree indexes point at TIDs; looking up row 42 = B-tree walk (log N pages) + heap fetch.
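To put numbers on the log N claim, a quick estimate of B-tree height (the fanout of ~400 entries per 8 KB page is an assumed round number, not a Postgres constant):

```python
import math

def btree_height(n_rows: int, fanout: int = 400) -> int:
    """Approximate B-tree levels touched by a point lookup: ceil(log_fanout(N))."""
    return math.ceil(math.log(n_rows, fanout))

btree_height(100_000_000)  # → 4: a lookup in 100M rows walks ~4 index pages
```

The tree stays shallow even at huge row counts, which is exactly why point lookups are cheap in a row store while full-column aggregates are not.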
Column-oriented storage (Parquet, ORC, Snowflake FDN, Redshift, BigQuery Capacitor)
Each column stored separately, with its own compression. Reading column X = read just X's bytes, skipping everything else. Compression ratios of 5–20× are routine because a single column has low cardinality locally.
A Parquet file is organized as:
- File footer: schema, row-group locations, column stats (min/max/null_count/distinct_count).
- Row groups (typically 128 MB): a horizontal slice of rows.
- Column chunks: within a row group, one per column.
- Pages (typically 1 MB): within a chunk, the unit of decoding.
The footer stats are the basis of row-group skipping — a WHERE filter can exclude row groups whose min/max don't match. This is why WHERE dt = '2026-04-16' AND country = 'US' is fast on a Parquet table partitioned by dt and sorted by country: both conditions prune.
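The skipping decision itself is simple enough to sketch in a few lines. Below, the stats dicts stand in for what a reader pulls from the Parquet footer (the layout is invented for illustration, not pyarrow's API):

```python
def prune_row_groups(row_groups, predicates):
    """Return indexes of row groups whose min/max stats could satisfy
    every equality predicate. row_groups: list of {col: (min, max)}."""
    def may_match(stats, col, value):
        lo, hi = stats[col]
        return lo <= value <= hi
    return [
        i for i, stats in enumerate(row_groups)
        if all(may_match(stats, col, v) for col, v in predicates.items())
    ]

groups = [
    {"dt": ("2026-04-15", "2026-04-15"), "country": ("AR", "FR")},
    {"dt": ("2026-04-16", "2026-04-16"), "country": ("AA", "MX")},
    {"dt": ("2026-04-16", "2026-04-16"), "country": ("NL", "ZA")},
]
prune_row_groups(groups, {"dt": "2026-04-16", "country": "US"})  # → [2]
```

Only the third row group survives; the other two are excluded by footer stats alone, before any column data is fetched. Sorting within partitions is what makes the country min/max ranges narrow enough to prune.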
Compression schemes within a column:
- RLE (Run-Length Encoding): AAAA BBBB CCCC → (A,4)(B,4)(C,4). Great on sorted columns with runs.
- Dictionary encoding: map values to ints; store ints + dictionary. Great on low-cardinality strings.
- Bit-packing: encode small integers in <8 bits each. Often combined with RLE.
- Delta encoding: store differences from previous value. Great on timestamps, sorted sequences.
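As a toy illustration of the first and last schemes (conceptual only, not Parquet's actual bit-level formats):

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode: AAAABBB -> [('A', 4), ('B', 3)]."""
    return [(v, len(list(run))) for v, run in groupby(values)]

def delta_encode(values):
    """Store the first value plus successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

rle_encode("AAAABBBBCCCC")        # → [('A', 4), ('B', 4), ('C', 4)]
delta_encode([1000, 1002, 1005])  # → [1000, 2, 3]
```

Note how both schemes reward sorted data: sorting maximizes run lengths for RLE and minimizes the deltas, which then bit-pack into fewer bits. This is why sort keys matter for compression, not just pruning.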
The column-oriented nature has a direct modeling implication: adding a column is nearly free at read time (you only pay for it if you select it). Wide tables (OBT) aren't the disaster they'd be in a row store.
OLTP modeling implications
- Normalize: each update touches one row, so redundancy is expensive.
- B-tree indexes on everything queried. Covering indexes for hot queries.
- FK constraints enforced at write.
- Partitioning is a tool for manageability (vacuum, detach old partitions) more than performance.
OLAP modeling implications
- Denormalize within reason. Joins are expensive in distributed systems; repeated string values cost almost nothing with dictionary encoding.
- Sort / cluster by the most common filter key (usually date).
- No FK constraints; integrity is pipeline-enforced.
- Partitioning is the #1 physical layout decision. Get it wrong and queries scan entire years.
HTAP blur
New engines (TiDB, CockroachDB, SingleStore, Snowflake Hybrid Tables) offer row + column storage of the same logical table — writes go to the row store, reads to the column store with asynchronous replication. Nice for mid-size workloads; still limited in scale or consistency compared to pure OLTP/OLAP.
3. Normalization Derived From Functional Dependencies
Normalization is taught as a set of rules. It's actually a consequence of functional dependencies — statements of the form "column set X determines column set Y" (written X → Y).
Given dependencies:
order_id → customer_id, order_date, status
customer_id → customer_name, country
The relation Orders(order_id, customer_id, order_date, status, customer_name, country) violates 3NF because customer_id (non-key) determines customer_name, country — a transitive dependency. Decomposition:
Orders(order_id, customer_id, order_date, status)
Customers(customer_id, customer_name, country)
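A functional dependency is a testable property of the data: X → Y holds iff no two rows agree on X but disagree on Y. A small checker, useful for validating a proposed decomposition against real rows (the helper and sample data are illustrative):

```python
def fd_holds(rows, lhs, rhs):
    """Check X -> Y: every distinct lhs value maps to exactly one rhs value."""
    seen = {}
    for r in rows:
        x = tuple(r[c] for c in lhs)
        y = tuple(r[c] for c in rhs)
        if seen.setdefault(x, y) != y:
            return False  # same X, different Y: dependency violated
    return True

rows = [
    {"order_id": 1, "customer_id": 7, "customer_name": "Ada"},
    {"order_id": 2, "customer_id": 7, "customer_name": "Ada"},
    {"order_id": 3, "customer_id": 8, "customer_name": "Bo"},
]
fd_holds(rows, ["customer_id"], ["customer_name"])  # → True
```

If this returns False for a dependency your decomposition assumes, the split tables will silently disagree after the migration. Cheap to run on a sample before committing to a schema.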
1NF — atomic values
No arrays, no comma-separated strings, no nested structures that the DB can't query. tags = "red,blue,green" forces LIKE '%blue%' queries that can't use indexes. Decompose into a child table order_tags(order_id, tag).
Modern engines allow arrays (Postgres text[], BigQuery ARRAY<STRING>) with GIN/ARRAY-function support. That's a departure from strict 1NF but acceptable when the array operations are well-supported.
2NF — full functional dependency on the key
Only relevant for composite keys. If PK is (order_id, line_number) and order_date depends on order_id alone, order_date doesn't belong in the line-item table. Move it.
3NF — no transitive dependencies
Shown above. In practice, 3NF is the ceiling for OLTP — further decomposition rarely helps.
BCNF (3.5NF) — every determinant is a candidate key
The textbook failure case: ClassRoom(course_id, instructor, room) where (course_id, instructor) → room and instructor → room. BCNF requires instructor to be a candidate key; it isn't, so decompose. These situations are rare in real data.
4NF & 5NF — multi-valued & join dependencies
Multi-valued dependencies arise when two independent multi-valued facts coexist: a teacher teaches multiple courses AND speaks multiple languages, and all combinations appear. 4NF says split these. 5NF handles even more arcane join dependencies. In 20 years of modeling work I've applied these consciously twice.
Practical heuristic
Normalize to 3NF. Denormalize deliberately, with a comment explaining why. A typical OLTP schema has 90% 3NF + 10% intentional denormalization (caching displayed values, storing denormalized totals for performance).
4. Dimensional Modeling — Kimball in Full
Kimball's dimensional model is a theory of analytic queries. It says: most analytics look like "a measure, broken down by attributes, filtered by other attributes." If your storage matches that shape, queries are trivial and fast.
The four-step process (and what each step actually means)
Step 1 — Business process. Not "the whole business" — a single process: orders, shipments, returns, playback sessions. Business processes are where events get generated; each is a candidate fact table. Get them from business users, not from source-system tables.
Step 2 — Grain. Declare exactly what one fact row represents. "One row per order line item per order" is a grain. "One row per playback session per user" is another. Grain must be atomic — prefer finer grain; you can always aggregate up.
The test: pick any two rows; they must represent genuinely distinct events. If two rows could be the same event counted twice, your grain is wrong.
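That two-row test turns into a pipeline assertion: group by the declared grain columns and flag anything appearing twice. A sketch over plain dicts (in SQL this is GROUP BY grain HAVING COUNT(*) > 1):

```python
from collections import Counter

def grain_violations(rows, grain_cols):
    """Return grain-key tuples that appear more than once."""
    counts = Counter(tuple(r[c] for c in grain_cols) for r in rows)
    return {k: n for k, n in counts.items() if n > 1}

rows = [
    {"session_id": "s1", "user_id": "u1", "watch_ms": 100},
    {"session_id": "s2", "user_id": "u1", "watch_ms": 200},
    {"session_id": "s2", "user_id": "u1", "watch_ms": 200},  # duplicate event
]
grain_violations(rows, ["session_id"])  # → {('s2',): 2}
```

Run the SQL equivalent after every load. A non-empty result means either the grain declaration is wrong or the pipeline is emitting duplicates; both corrupt every downstream SUM.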
Step 3 — Dimensions. The descriptors: who, what, where, when, why, how. For each dimension, identify the grain and attributes. Dimensions are denormalized (all attributes in one table) — you want category, subcategory, brand, department all present in dim_product even though they form a hierarchy.
Step 4 — Facts. The numeric measures at the declared grain. Prefer additive measures (quantity, extended_price). Document semi-additive (balance — additive across accounts but not time) and non-additive (ratio — don't sum ever).
Full star schema — playback analytics
-- -----------------------------------------------------
-- Date dimension (discussed more in §8)
-- -----------------------------------------------------
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- 20260419
full_date DATE NOT NULL,
day_of_week SMALLINT NOT NULL, -- 1=Monday
day_name VARCHAR(10) NOT NULL,
day_of_month SMALLINT NOT NULL,
day_of_year SMALLINT NOT NULL,
week_of_year SMALLINT NOT NULL, -- ISO week
month_num SMALLINT NOT NULL,
month_name VARCHAR(10) NOT NULL,
quarter SMALLINT NOT NULL,
year SMALLINT NOT NULL,
fiscal_year SMALLINT NOT NULL,
fiscal_quarter SMALLINT NOT NULL,
fiscal_period SMALLINT NOT NULL,
is_weekend BOOLEAN NOT NULL,
is_holiday BOOLEAN NOT NULL,
holiday_name VARCHAR(50),
is_business_day BOOLEAN NOT NULL,
days_from_today INT -- maintained via view
);
-- -----------------------------------------------------
-- User dimension — SCD Type 2
-- -----------------------------------------------------
CREATE TABLE dim_user (
user_key BIGINT PRIMARY KEY, -- surrogate
user_id VARCHAR(36) NOT NULL, -- natural/durable
email VARCHAR(320),
country_code CHAR(2),
subscription_tier VARCHAR(20), -- basic/standard/premium
household_size SMALLINT,
profile_language VARCHAR(10),
signup_date DATE,
churn_date DATE,
valid_from TIMESTAMPTZ NOT NULL,
valid_to TIMESTAMPTZ,
is_current BOOLEAN NOT NULL,
hash_diff CHAR(64) NOT NULL -- SHA-256 of tracked attrs
);
CREATE UNIQUE INDEX dim_user_natural ON dim_user(user_id, valid_from);
CREATE INDEX dim_user_current ON dim_user(user_id) WHERE is_current;
-- -----------------------------------------------------
-- Title dimension — mostly Type 1, some Type 2 (rating)
-- -----------------------------------------------------
CREATE TABLE dim_title (
title_key BIGINT PRIMARY KEY,
title_id VARCHAR(36) NOT NULL,
title VARCHAR(500),
title_type VARCHAR(20), -- movie/series/documentary
genre_primary VARCHAR(50),
genre_secondary VARCHAR(50),
runtime_minutes INT,
release_year SMALLINT,
language_primary VARCHAR(10),
content_rating VARCHAR(20), -- MA, PG-13, etc.
current_rating NUMERIC(3,2), -- aggregated, Type 1
season_number SMALLINT,
episode_number SMALLINT,
is_original BOOLEAN,
valid_from TIMESTAMPTZ NOT NULL,
valid_to TIMESTAMPTZ,
is_current BOOLEAN NOT NULL
);
-- -----------------------------------------------------
-- Device dimension — Type 1 (device facts don't "change" per user)
-- -----------------------------------------------------
CREATE TABLE dim_device (
device_key BIGINT PRIMARY KEY,
device_id VARCHAR(64) NOT NULL,
device_type VARCHAR(20), -- tv/mobile/tablet/web
manufacturer VARCHAR(50),
model VARCHAR(100),
os VARCHAR(50),
os_version VARCHAR(20),
app_version VARCHAR(20)
);
-- -----------------------------------------------------
-- Playback session fact — transaction grain
-- One row per playback session (session starts and ends)
-- -----------------------------------------------------
CREATE TABLE fact_playback_session (
session_key BIGINT PRIMARY KEY, -- surrogate per session
session_id VARCHAR(64) NOT NULL, -- natural, from client
user_key BIGINT NOT NULL REFERENCES dim_user,
title_key BIGINT NOT NULL REFERENCES dim_title,
device_key BIGINT NOT NULL REFERENCES dim_device,
start_date_key INT NOT NULL REFERENCES dim_date,
end_date_key INT NOT NULL REFERENCES dim_date,
start_ts TIMESTAMPTZ NOT NULL,
end_ts TIMESTAMPTZ NOT NULL,
-- Measures (all numeric, additive where possible)
watch_ms BIGINT NOT NULL, -- additive
buffer_ms BIGINT NOT NULL, -- additive
seeks INT NOT NULL, -- additive
pauses INT NOT NULL, -- additive
max_bitrate_kbps INT, -- non-additive (use avg)
avg_bitrate_kbps INT, -- non-additive (use avg)
completion_ratio NUMERIC(4,3), -- non-additive ratio
-- Degenerate dimensions
ended_reason VARCHAR(20), -- user/credits/error
qoe_score NUMERIC(3,2),
-- Partition
dt DATE NOT NULL
)
PARTITION BY RANGE (dt);

Key details people miss:
- Every FK is a surrogate _key, not a natural _id. The natural IDs are attributes, preserved for lookup.
- start_date_key and end_date_key both reference dim_date — same table, role-playing.
- session_id is stored as a degenerate dimension (an attribute on the fact with no corresponding dim table) because there's no other data to put in a dim_session.
- max_bitrate_kbps and avg_bitrate_kbps are non-additive measures. You need both if you want to compute weighted averages at higher grains.
The query this shape makes easy
-- Total watch hours by country and content rating, last 7 days, weekdays only
SELECT
u.country_code,
t.content_rating,
SUM(f.watch_ms) / 3600.0 / 1000 AS watch_hours
FROM fact_playback_session f
JOIN dim_user u ON u.user_key = f.user_key
JOIN dim_title t ON t.title_key = f.title_key
JOIN dim_date d ON d.date_key = f.start_date_key
WHERE d.full_date >= CURRENT_DATE - 7
AND d.is_weekend = FALSE
GROUP BY u.country_code, t.content_rating
ORDER BY watch_hours DESC;

No CASE statements, no date arithmetic, no EXTRACT — just filter and group. This is why Kimball wins.
Why surrogate keys
Four reasons, in increasing order of importance:
- Natural keys change. Source systems renumber, reassign IDs during migrations. Your warehouse shouldn't break.
- SCD2 requires multiple rows per natural ID — the surrogate disambiguates.
- Integer surrogates are small and fast in joins. A BIGINT is 8 bytes; a VARCHAR(36) UUID is 36 bytes + overhead and comparison is slower.
- Surrogates decouple the warehouse's keyspace from source systems. You can integrate two sources whose customer_id namespaces collide.
Generate surrogates deterministically with SHA-256 of the natural key (for idempotent re-runs) or from a sequence (for insertion-order keys). The deterministic approach is safer in modern pipelines.
-- Deterministic surrogate from natural key
SELECT
HASH(user_id, valid_from) AS user_key, -- Snowflake HASH(x1,...) -> BIGINT
user_id, valid_from, ...
FROM staging_users;
-- Spark / PySpark
from pyspark.sql.functions import sha2, concat_ws, col
df = df.withColumn(
"user_key",
sha2(concat_ws("||", col("user_id"), col("valid_from")), 256)
)

Use a hashing function that's stable across restarts; Spark's built-in hash() isn't guaranteed stable across versions, but sha2 and md5 are.
5. Fact Table Grain (The Most Important Decision)
Grain is where every modeling interview should start. The failure mode is a fact table where some rows are at one grain and others at another. The bug is invisible until a sum is 2× too large.
Three transactional fact patterns
Transaction fact — one row per event. Most common. Fully additive. Example: fact_playback_session above.
Periodic snapshot fact — one row per entity per period. Measures describe the state at period end. Semi-additive.
CREATE TABLE fact_subscription_daily (
snapshot_date_key INT,
user_key BIGINT,
-- Semi-additive: summable across users, NOT across days
mrr_cents BIGINT, -- monthly recurring revenue
status VARCHAR(20),
days_until_renewal SMALLINT,
-- Additive within the day
payments_made INT,
refunds_issued INT,
PRIMARY KEY (snapshot_date_key, user_key)
);

The MRR column is the classic semi-additive trap. Users sum MRR across two days and get 2× the real revenue. Solutions: label the column in the metadata catalog; always aggregate via AVG across time and SUM across entity; make the column name end in _point_in_time.
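The safe-aggregation rule in miniature: SUM across users within a day, then AVG across days, never SUM across days. With invented numbers:

```python
# (user, snapshot_date, mrr_cents) rows for two daily snapshots
rows = [
    ("u1", "2026-04-01", 1000), ("u2", "2026-04-01", 2000),
    ("u1", "2026-04-02", 1000), ("u2", "2026-04-02", 2000),
]

by_day = {}
for user, day, mrr in rows:
    by_day[day] = by_day.get(day, 0) + mrr        # SUM across users: fine

naive_total = sum(m for _, _, m in rows)          # also summed across days
correct_mrr = sum(by_day.values()) / len(by_day)  # AVG across time

assert naive_total == 6000  # wrong: 2x the real revenue
assert correct_mrr == 3000  # right: the actual MRR
```

The wrong answer is plausible-looking, which is what makes the trap dangerous: nothing errors, the dashboard just reports double.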
Accumulating snapshot fact — one row per process instance, updated as it progresses through milestones.
CREATE TABLE fact_order_fulfillment (
order_key BIGINT PRIMARY KEY,
customer_key BIGINT,
-- Milestones (date keys; NULL until reached)
order_date_key INT NOT NULL,
paid_date_key INT,
packed_date_key INT,
shipped_date_key INT,
delivered_date_key INT,
returned_date_key INT,
-- Durations, derived at each update
hours_to_pay INT,
hours_to_pack INT,
hours_to_ship INT,
hours_to_deliver INT,
-- Measures
order_total_cents BIGINT,
items_count INT
);

Pros: a single row per order gives "lead time analysis" trivially. Cons: heavy on updates (updates are the enemy of columnar engines). Use when the process has bounded, well-defined milestones.
Additivity
Additivity is the single most-often-wrong attribute in a fact table. Get it wrong and every dashboard off that table is silently lying. The taxonomy is three-way, but the real skill is recognizing which bucket any given measure falls into across domains. The table below expands the usual starter examples into a cross-industry reference you can pattern-match against.
| Additivity | Sum across any dim? | Representative measures | Warehouse pattern |
|---|---|---|---|
| Additive | Yes | revenue, quantity_sold, watch_ms, clicks, impressions, seeks, bytes_transferred, kWh_consumed, calls_handled, miles_driven, tickets_sold, pageviews, messages_sent, COGS, gross_margin_dollars | SUM() everywhere |
| Semi-additive | Some dims only | account_balance, loan_principal_outstanding, inventory_on_hand, headcount, open_positions, active_users_point_in_time, queue_depth, subscriber_count, MRR/ARR snapshot, portfolio_market_value, occupancy_count, beds_available, in-flight_aircraft, open_tickets, WIP_units, cache_entry_count | SUM() across non-time dims; AVG(), LAST_VALUE(), or MAX(snapshot_date) across time |
| Non-additive | Never | unit_price, margin_pct, conversion_rate, bounce_rate, CTR, CPC, CPM, ROAS, churn_rate, NPS, CSAT, avg_session_duration, LTV, CAC_ratio, z_score, cohort_retention_pct, exchange_rate, utilization_pct, SLA_pct, fill_rate_pct, stock_price, cap_rate, interest_rate | Store numerator + denominator; compute ratio in SELECT. Never pre-aggregate the ratio itself. |
Domain-specific additivity cheat sheet
Use this to pattern-match new domains against known ones. The same measure shape shows up again and again.
| Domain | Additive | Semi-additive (snapshot) | Non-additive (ratios) |
|---|---|---|---|
| SaaS / subscription | signups, cancellations, payments_received, refunds, feature_events | MRR, ARR, active_subscribers, seats_in_use, storage_GB_used | churn_rate, gross_retention_pct, NRR, ARPU, LTV/CAC |
| E-commerce / retail | orders_placed, units_sold, revenue, returns, shipping_cost, gift_cards_issued | inventory_on_hand, cart_value_total, wishlist_size, open_orders | conversion_rate, AOV, return_rate, margin_pct, sell_through_rate |
| Fintech / banking | deposits, withdrawals, transactions_count, interest_accrued, fees_charged | account_balance, loan_outstanding, credit_exposure, AUM | APR, NPL_ratio, NIM, capital_ratio, Sharpe_ratio |
| Payments / fraud | transactions, approved_count, declined_count, chargebacks, refunds_dollar | active_card_count, in_review_queue_depth, fraud_blocklist_size | approval_rate, fraud_rate_bps, false_positive_rate, chargeback_ratio |
| Advertising / adtech | impressions, clicks, conversions, spend_dollars, video_starts, video_completes | active_campaigns, remaining_budget, inventory_available | CTR, CPC, CPM, CPA, ROAS, viewability_pct, VTR |
| Streaming media | watch_ms, starts, completes, seeks, downloads, new_titles_added | concurrent_streams, active_accounts, catalog_size | completion_rate, stream_start_success_rate, rebuffer_ratio |
| Gaming | sessions, purchases, XP_earned, deaths, matches_played, currency_spent | active_players, open_lobbies, inventory_items_held, mmr_snapshot | D1/D7/D30_retention, ARPDAU, win_rate, K/D_ratio, match_fill_rate |
| Ride-share / logistics | rides_completed, miles_driven, fares_collected, driver_hours_online, tips | drivers_online, riders_in_queue, open_requests, fleet_size | acceptance_rate, cancellation_rate, utilization_pct, ETA_accuracy, $/mile |
| IoT / industrial | readings_count, alerts_raised, kWh, bytes_telemetered, maintenance_events | devices_online, queue_depth, battery_level_snapshot, firmware_version_count | uptime_pct, packet_loss_rate, alert_false_positive_rate, MTBF |
| Healthcare / payer | claims_submitted, visits, prescriptions_filled, procedures, payments | members_enrolled, authorizations_open, beds_occupied, PMPM_exposure | denial_rate, readmission_rate, MLR, PMPM_cost_trend, claim_cycle_time |
| Ops / on-call | incidents_opened, pages_sent, deploys, rollbacks, PRs_merged | open_incidents, queue_depth, on_call_count | MTTR, change_failure_rate, deploy_freq_norm, SLO_pct, error_budget_remaining |
| Marketplace (two-sided) | listings_created, bookings, transactions, disputes_opened | active_supply_nodes, active_demand_nodes, open_listings | match_rate, take_rate, supply/demand_ratio, repeat_use_rate |
Measures that look additive but aren't
The following look like counts — they are not safely sum-able without context. Each has a specific remediation.
| Misleading measure | Why it breaks | Fix |
|---|---|---|
| distinct_users per day | Summing two days double-counts users active in both | Store users_active_hll (HyperLogLog sketch); merge sketches, then cardinality |
| unique_sessions | Sessions can span day boundaries; naive sum over-counts or under-counts | Sessionize with a fixed boundary (e.g., always assign to session-start day) |
| avg_session_duration per day | Average of averages weights each day equally regardless of traffic | Store sum_duration_ms + session_count; divide at query time |
| median_order_value | Medians don't aggregate linearly | Store a t-digest or KLL sketch per day; merge sketches, then query percentiles |
| max_concurrent_streams | Summing daily maxes ≠ global max; daily max can coincide or not | Store interval-level counts; compute max over intersection in SELECT |
| returning_visitor_count | "Returning" is relative to a baseline that shifts per query window | Reclassify per query; store first_seen_ts per visitor, derive returning flag on demand |
| churn_rate (pre-computed) | Churn % cannot be averaged across cohorts of different sizes | Store churned_count + cohort_size; compute ratio in SELECT |
| weighted_avg_price | Re-weighted averages don't compose | Store sum_price_times_qty + sum_qty; divide in SELECT |
| rank_position from LIMIT queries | Rank is relative to the surrounding rowset, not intrinsic | Never persist rank as a column; compute via window function on read |
| latency_p99 per minute | Percentiles don't average — p99 of (p99s) is not the global p99 | Store histogram buckets or a sketch; query-time merge |
The canonical non-additive mistake: you have fact_product_daily(product_key, date_key, avg_price). A BI user computes "average price this week = AVG(avg_price) across 7 days." That's an average of averages — not the weighted average they want. Fix: store sum_of_prices and price_count, compute the ratio in the SELECT. Same rule for everything in the third table above — the shape is always "store the components, derive the ratio on read."
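The component fix in code, with a week's worth of the hypothetical fact reduced to (day, sum_of_prices, price_count) tuples:

```python
# Store the components, never the pre-computed ratio
days = [("mon", 100.0, 10), ("tue", 900.0, 30)]

# Average of averages: each day weighted equally regardless of volume
avg_of_avgs = sum(s / n for _, s, n in days) / len(days)

# Weighted average: divide summed components at query time
weighted = sum(s for _, s, _ in days) / sum(n for _, _, n in days)

assert avg_of_avgs == 20.0  # (10 + 30) / 2, wrong
assert weighted == 25.0     # 1000 / 40, right
```

Monday's 10 sales and Tuesday's 30 get equal weight in the naive version; the gap widens with traffic variance, which is exactly when someone finally notices the dashboard is off.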
Factless fact tables
A fact table with only FKs — no measures. Models events or eligibility:
-- "Which users had access to which titles on which day"
CREATE TABLE fact_title_availability_daily (
date_key INT,
user_key BIGINT,
title_key BIGINT,
-- no measures
PRIMARY KEY (date_key, user_key, title_key)
);

The "measure" is COUNT(*) — "how many users had access to Stranger Things in France on 2026-04-01."
These get huge fast (N_users × N_titles × N_days). Sparse-encode where possible (don't emit rows for absent combinations).
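A back-of-envelope sizing shows why sparse encoding is non-optional here (all counts invented):

```python
n_users, n_titles, n_days = 10_000_000, 20_000, 365

dense = n_users * n_titles * n_days   # a row for every combination
assert dense == 73_000_000_000_000    # 73 trillion rows

# Sparse-encode: emit rows only for combinations that actually exist,
# e.g. if roughly 1% of (user, title) pairs are eligible on a given day
sparse = dense // 100
assert sparse == 730_000_000_000      # still large, but 100x smaller
```

When even the sparse count is untenable, the next step is changing grain: model availability as (title, region, day) plus a user-to-region mapping, and derive the per-user view at query time.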
Multi-grain fact: how to avoid it
The canonical multi-grain bug: an orders fact table where some rows are "one per order" and some are "one per line item." Summing order_total sums header rows plus line rows — each order's total appears many times.
Rules:
- One fact table per declared grain.
- If you need both grains, build two fact tables: fact_order_header and fact_order_line. They roll up via shared dimensions.
- Never introduce an "aggregation type" column to disambiguate rows of different grains. That's a band-aid on a severed arm.
6. Dimensions — Conformed, Junk, Degenerate, Mini, Role-Playing
Conformed dimensions
A dimension is conformed when multiple fact tables share it. This is how you answer "return rate by customer segment" — fact_returns.customer_key and fact_orders.customer_key must reference the same dim_customer.
Conformed dimensions are an organizational commitment as much as a technical one. A dim_customer owned by marketing with fields marketing cares about, and another owned by billing with billing fields, is a failure. You need a single dim_customer with all stakeholders as consumers.
The enterprise bus matrix (Kimball's term) is a table of business processes × dimensions, with an X where they intersect. Use it to plan which dimensions must be conformed:
| Process | dim_user | dim_title | dim_device | dim_date | dim_geo | dim_promo |
|---|---|---|---|---|---|---|
| playback_session | X | X | X | X | X | - |
| billing_charge | X | - | - | X | X | X |
| search_query | X | - | X | X | X | - |
| signup | X | - | X | X | X | X |
Junk dimensions
When you have a handful of low-cardinality flags and indicators that don't belong together but don't warrant their own dimensions: combine them into a junk dimension.
-- Instead of: is_trial BOOLEAN, is_promo BOOLEAN, device_is_new BOOLEAN on every fact row...
CREATE TABLE dim_session_flags (
flags_key INT PRIMARY KEY,
is_trial BOOLEAN,
is_promo BOOLEAN,
is_new_device BOOLEAN,
signal_source VARCHAR(20)
);
-- Only 2 × 2 × 2 × (distinct sources) = ~16 rows.

Then the fact has one flags_key FK and you get clean reporting.
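The whole junk dimension can be pre-materialized as a cross product of the flag domains, with the surrogate key derived arithmetically from the combination; a sketch (the signal_source values are invented):

```python
from itertools import product

sources = ["app", "partner"]

# Materialize the entire dimension up front: 2 x 2 x 2 x 2 = 16 rows
rows = [
    {"flags_key": i, "is_trial": t, "is_promo": p,
     "is_new_device": d, "signal_source": s}
    for i, (t, p, d, s) in enumerate(
        product([False, True], [False, True], [False, True], sources)
    )
]

def flags_key(is_trial, is_promo, is_new_device, signal_source):
    """Derive the FK for a fact row's flag combination without a lookup join."""
    return (is_trial * 8 + is_promo * 4 + is_new_device * 2
            + sources.index(signal_source))
```

Because the dimension is exhaustive and the key is derivable, the ETL never needs a lookup join against the junk dim; it computes the FK inline while building fact rows.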
Degenerate dimensions
An attribute on the fact with no corresponding dim table — like session_id above. Reason: there's no other data about a session beyond what's already in the fact row. Creating dim_session would just be a table with one column.
Mini-dimensions
When a dimension has a subset of frequently-changing attributes that would inflate SCD2 storage: split them off.
-- dim_user (stable attrs)
user_key, user_id, email, country_code, signup_date, ...
-- dim_user_demographics (mini-dim; snapshots of volatile attrs)
demographics_key, age_bracket, income_bracket, household_size, interest_segment, snapshot_ts

The fact has FKs to both: user_key (as-current) and demographics_key (as-at-event-time). Saves SCD2 storage when the volatile attributes change often and the stable ones don't.
Role-playing dimensions
One physical dimension, multiple roles in the fact. dim_date plays order_date, ship_date, delivered_date. Implement as views aliasing the base dim so query joins read naturally:
CREATE VIEW dim_order_date AS SELECT * FROM dim_date;
CREATE VIEW dim_ship_date AS SELECT * FROM dim_date;
CREATE VIEW dim_delivered_date AS SELECT * FROM dim_date;
SELECT
od.year, od.month_num,
sd.day_name AS ship_day,
dd.week_of_year AS delivered_week,
SUM(f.order_total_cents)
FROM fact_order_fulfillment f
JOIN dim_order_date od ON od.date_key = f.order_date_key
JOIN dim_ship_date sd ON sd.date_key = f.shipped_date_key
JOIN dim_delivered_date dd ON dd.date_key = f.delivered_date_key
GROUP BY od.year, od.month_num, sd.day_name, dd.week_of_year;

The physical storage is one table; the logical model has three.
Swappable / outrigger
When a dimension attribute has its own attributes worth rolling up: a small normalized "outrigger" joined to the main dim. dim_product.category_id → dim_category.name, parent_category. Violates pure star; sometimes necessary.
7. Slowly Changing Dimensions — All Seven Types, with Code
Type 0 — Retain original
Never update. Fields like date_of_birth, signup_date, birth_country. No code needed beyond rejecting updates.
Type 1 — Overwrite
Just update. No history. Use for typo corrections, GDPR erasure of PII.
-- Stage has latest values; merge overwrites
MERGE INTO dim_user AS tgt
USING stg_user AS src
ON tgt.user_id = src.user_id
WHEN MATCHED THEN UPDATE SET
email = src.email,
country_code = src.country_code,
updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (...) VALUES (...);

Type 2 — Add new row, track history
The standard. Full history. Every change creates a new row; old row closed with valid_to.
The MERGE pattern (dialect-independent)
Implemented as two statements because a single MERGE can't both update and insert the same logical record.
-- Step 1: close out rows whose tracked attributes have changed
UPDATE dim_user
SET valid_to = CURRENT_TIMESTAMP,
is_current = FALSE
WHERE is_current = TRUE
AND user_id IN (SELECT user_id FROM stg_user)
AND (
-- Use a hash_diff for concise comparison
hash_diff <> (
SELECT SHA2(CONCAT_WS('||',
COALESCE(s.email, ''),
COALESCE(s.country_code, ''),
COALESCE(s.subscription_tier, ''),
COALESCE(s.household_size::VARCHAR, ''),
COALESCE(s.profile_language, '')
), 256)
FROM stg_user s WHERE s.user_id = dim_user.user_id
)
);
-- Step 2: insert a row for every natural key with no current row
INSERT INTO dim_user (
user_key, user_id, email, country_code, subscription_tier,
household_size, profile_language, signup_date, churn_date,
valid_from, valid_to, is_current, hash_diff
)
SELECT
nextval('dim_user_seq'), -- surrogate
s.user_id,
s.email, s.country_code, s.subscription_tier,
s.household_size, s.profile_language,
s.signup_date, s.churn_date,
CURRENT_TIMESTAMP, NULL, TRUE,
SHA2(CONCAT_WS('||',
COALESCE(s.email, ''),
COALESCE(s.country_code, ''),
COALESCE(s.subscription_tier, ''),
COALESCE(s.household_size::VARCHAR, ''),
COALESCE(s.profile_language, '')
), 256)
FROM stg_user s
LEFT JOIN dim_user d
ON d.user_id = s.user_id AND d.is_current = TRUE
WHERE d.user_id IS NULL;Why hash_diff: comparing every column individually is verbose, error-prone, and slow. Hashing the concatenated tracked columns lets you compare two states with a single =. Use SHA-256 (not MD5; collision-free enough for dim-change detection).
Why two statements: MERGE's WHEN MATCHED runs once per matched row. You can't both close the old row and insert a new one in the same MERGE without duplication. Separate steps are clearer anyway.
PySpark / Delta Lake version
from delta.tables import DeltaTable
from pyspark.sql.functions import sha2, concat_ws, coalesce, lit, col, current_timestamp, monotonically_increasing_id
dim = DeltaTable.forName(spark, "warehouse.dim_user")
stg = spark.read.table("stg_user")
# Add hash_diff to the stage
stg_h = stg.withColumn(
"hash_diff",
sha2(concat_ws("||",
coalesce(col("email"), lit("")),
coalesce(col("country_code"), lit("")),
coalesce(col("subscription_tier"), lit("")),
coalesce(col("household_size").cast("string"), lit("")),
coalesce(col("profile_language"), lit(""))
), 256)
)
# Step 1: close rows whose hash changed
(dim.alias("t")
.merge(stg_h.alias("s"),
"t.user_id = s.user_id AND t.is_current = TRUE AND t.hash_diff <> s.hash_diff")
.whenMatchedUpdate(set={
"valid_to": "current_timestamp()",
"is_current": "FALSE"
})
.execute())
# Step 2: insert new versions
changed = (stg_h.alias("s")
.join(dim.toDF().alias("t"),
(col("s.user_id") == col("t.user_id")) & (col("t.is_current") == True),
"left_anti")) # keep s rows not present as current
changed_with_keys = changed.withColumn("user_key", monotonically_increasing_id() + <max_key>)
changed_with_keys.write.format("delta").mode("append").saveAsTable("warehouse.dim_user")

monotonically_increasing_id() per partition + offset is one pattern for surrogate generation. Deterministic hashing is safer across re-runs.
dbt snapshot (the declarative way)
-- snapshots/dim_user.sql
{% snapshot dim_user_snapshot %}
{{ config(
target_schema='snapshots',
unique_key='user_id',
strategy='check',
check_cols=['email', 'country_code', 'subscription_tier',
'household_size', 'profile_language']
) }}
SELECT user_id, email, country_code, subscription_tier,
household_size, profile_language, signup_date, churn_date
FROM {{ source('raw', 'users') }}
{% endsnapshot %}

dbt handles the entire Type 2 lifecycle — adds dbt_valid_from, dbt_valid_to, compares columns, inserts new versions. Production-grade with one file.
Type 3 — Add new column (previous + current)
Tracks exactly one prior value. Niche — useful only when business cares about "previous region" specifically.
ALTER TABLE dim_user
ADD COLUMN previous_country_code CHAR(2),
ADD COLUMN country_changed_at TIMESTAMPTZ;

On update: UPDATE ... SET previous_country_code = country_code, country_code = new, country_changed_at = now().
Type 4 — History in a side table
Current row in the main dim (Type 1 semantics for ease of query); history in a separate table.
-- Main dim (current-state only)
CREATE TABLE dim_user_current (
user_key BIGINT PRIMARY KEY,
user_id VARCHAR(36) UNIQUE,
email, country_code, ...
);
-- History (append-only, every change)
CREATE TABLE dim_user_history (
user_id VARCHAR(36),
email, country_code, ...,
valid_from TIMESTAMPTZ,
valid_to TIMESTAMPTZ,
change_reason VARCHAR(50)
);

Good when the main dim is hit constantly by dashboards and you want a narrow table. Point-in-time joins use the history table.
Type 6 — 1 + 2 + 3
A Type 2 dim where every historical row also carries the current value of select attributes. Lets you report "sales by the customer's current segment" even on old facts.
CREATE TABLE dim_user (
user_key BIGINT PRIMARY KEY,
user_id VARCHAR(36) NOT NULL,
-- Historical values (Type 2 — different per version)
email, country_code, subscription_tier,
-- Current values (Type 1 — same across all versions of same user_id)
current_email, current_country_code, current_subscription_tier,
-- History metadata
valid_from, valid_to, is_current,
hash_diff
);

The Type 1 columns are updated across all rows with the same user_id whenever the current state changes — an expensive broadcast update, but query-time simplicity worth it.
Query patterns:
- "Watch hours by historical segment": join on user_key, group by subscription_tier.
- "Watch hours by current segment": join on user_key, group by current_subscription_tier.
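The two query patterns, side by side on a toy Type 6 dimension (SQLite via Python's sqlite3; the data is invented):

```python
import sqlite3

# Toy Type 6 dimension: user u1 upgraded from 'basic' to 'premium'.
# Both versions carry current_subscription_tier = 'premium' after the
# Type 1 broadcast update described above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_user (
  user_key INTEGER PRIMARY KEY, user_id TEXT,
  subscription_tier TEXT,          -- Type 2: differs per version
  current_subscription_tier TEXT,  -- Type 1: same across versions
  is_current INTEGER);
INSERT INTO dim_user VALUES
  (1, 'u1', 'basic',   'premium', 0),
  (2, 'u1', 'premium', 'premium', 1);
CREATE TABLE fact_watch (user_key INTEGER, watch_hours REAL);
INSERT INTO fact_watch VALUES (1, 10.0), (2, 5.0);  -- 10h while basic, 5h after
""")

historical = dict(con.execute("""
  SELECT d.subscription_tier, SUM(f.watch_hours)
  FROM fact_watch f JOIN dim_user d ON d.user_key = f.user_key
  GROUP BY d.subscription_tier"""))
current = dict(con.execute("""
  SELECT d.current_subscription_tier, SUM(f.watch_hours)
  FROM fact_watch f JOIN dim_user d ON d.user_key = f.user_key
  GROUP BY d.current_subscription_tier"""))

assert historical == {"basic": 10.0, "premium": 5.0}  # tier at event time
assert current == {"premium": 15.0}                   # all hours under current tier
```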
Type 7 — Dual keys
The fact carries both the surrogate (user_key — version at event time) and the durable natural key (user_id). Consumers pick which to join on. Rare but powerful for complex reporting platforms.
ALTER TABLE fact_playback_session ADD COLUMN user_id VARCHAR(36); -- durable
-- Historical view
JOIN dim_user d ON d.user_key = f.user_key
-- Current view (resolve through natural key)
JOIN dim_user_current d ON d.user_id = f.user_id

Picking a type per attribute
Attributes within one dimension almost always use different types. The skill is choosing correctly per attribute. The expanded table below covers the attributes you'll see in most real dimensions, with the reasoning.
| Attribute | Type | Why |
|---|---|---|
| Customer / user dimension | | |
| date_of_birth | 0 | Immutable; a correction is rare and reported as data-entry fix |
| signup_date | 0 | Facts of origin; never changes |
| birth_country | 0 | Cannot change retroactively |
| email (typo-fix) | 1 | Corrections should replace silently, not create a history row |
| gdpr_erased_flag | 1 | Privacy operations overwrite by design |
| country_code_current | 2 | Relocations must not retro-attribute old orders |
| subscription_tier | 2 | Revenue attribution by tier depends on tier at event time |
| loyalty_level | 2 | Historical level drove the discount historically given |
| referred_by_user_id | 3 (prev + curr) | Short history matters; full history rarely queried |
| display_name | 6 | UI needs current name everywhere; analytics may need point-in-time |
| full_name (post-marriage) | 6 | Same — current in app, historical for compliance |
| last_churn_date | 1 + audit log | Most-recent value is the useful one; audit trail elsewhere |
| Product / SKU dimension | | |
| product_id (natural) | 0 | Primary identity, must never mutate |
| product_name_display | 1 | Rebrand should update cleanly; history of names is noise |
| category_hierarchy | 2 | Reorganizations break historical category rollups if overwritten |
| list_price | 2 | Revenue by price-tier depends on price-at-order-time |
| cost_of_goods | 2 | Margin math requires the historical COGS at order time |
| is_active flag | 2 | Window of availability matters for supply analyses |
| current_inventory | — (not a dim attr) | Move to a periodic-snapshot fact, not a dimension |
| Store / location dimension | | |
| store_open_date | 0 | Facts of origin |
| manager_name | 2 | Attribution by manager-of-record at event time |
| store_format (big-box vs mall) | 2 | Format reshapes affect historical comps |
| physical_address | 2 | Tax jurisdiction is historical |
| branding_name | 1 | Marketing rename; no analytical impact |
| square_footage | 2 | Renovations affect per-sqft metrics historically |
| Employee / HR dimension | | |
| hire_date | 0 | Origin fact |
| legal_name | 6 | Legal name history retained for compliance; display uses current |
| employment_status | 2 | Employed-at-event-time drives most HR analytics |
| reporting_manager | 2 | Org-chart snapshots matter historically |
| level_band / grade | 2 | Promotion-aware analyses need the level at event time |
| salary_band | 2 | Compensation history drives budget rollups |
| personal_email | 1 | Pure contact info update |
| Marketing campaign dimension | | |
| campaign_id | 0 | Identity |
| launch_date | 0 | Origin |
| budget_usd | 2 | Budget revisions attribute to the window they applied |
| internal_name | 1 | Naming hygiene only |
| channel | 2 | Channel reclassification affects attribution history |
| owning_team | 2 | Ownership transfers need historical accuracy for credit |
| Account / subscription dimension (B2B) | | |
| account_id | 0 | Identity |
| contract_start_date | 0 | Origin |
| plan_name | 2 | Historical plan drives historical entitlement |
| seats_purchased | 2 | Contract-size history is revenue math |
| renewal_date | 1 | Current renewal date is the useful one |
| owner_account_exec | 2 | Historical ownership for commission attribution |
| industry_classification | 2 | Reclassification should not rewrite historical industry rollups |
Document every per-attribute decision in the data catalog. A newcomer should be able to read the catalog and predict whether an update to country_code produces a new row in dim_user. If the answer is ambiguous, the catalog is incomplete.
The red-flag shapes
Three attribute categories warrant extra suspicion when choosing an SCD type:
- Anything regulatory. Legal name, address, tax ID, consent flags — treat as Type 2 minimum, plus a separate audit trail, regardless of analytical need. Compliance trumps modeling elegance.
- Anything that drives financial attribution. Tier, plan, price, territory assignment, account executive ownership — always Type 2. A future auditor will ask "who owned this account when the deal closed." Type 1 cannot answer that.
- Anything computed. Lifetime value, segment classification, engagement score — don't SCD these; move them to a periodic snapshot fact. Dimensions should hold identity and attributes; computed aggregates belong with the facts.
8. Date & Time Dimensions Done Right
A first-class date dimension makes fiscal/holiday/weekend/quarter queries trivial. Generate it once, populate a few decades of rows (the script below covers 2010–2040), and never touch it again.
Generating dim_date (Postgres)
INSERT INTO dim_date (
date_key, full_date, day_of_week, day_name,
day_of_month, day_of_year, week_of_year,
month_num, month_name, quarter, year,
fiscal_year, fiscal_quarter, fiscal_period,
is_weekend, is_holiday, holiday_name, is_business_day
)
SELECT
TO_CHAR(d, 'YYYYMMDD')::INT AS date_key,
d AS full_date,
EXTRACT(ISODOW FROM d) AS day_of_week,
TO_CHAR(d, 'FMDay') AS day_name,
EXTRACT(DAY FROM d) AS day_of_month,
EXTRACT(DOY FROM d) AS day_of_year,
EXTRACT(WEEK FROM d) AS week_of_year,
EXTRACT(MONTH FROM d) AS month_num,
TO_CHAR(d, 'FMMonth') AS month_name,
EXTRACT(QUARTER FROM d) AS quarter,
EXTRACT(YEAR FROM d) AS year,
-- Netflix fiscal year happens to be calendar; substitute your rules
EXTRACT(YEAR FROM d) AS fiscal_year,
EXTRACT(QUARTER FROM d) AS fiscal_quarter,
EXTRACT(MONTH FROM d) AS fiscal_period,
(EXTRACT(ISODOW FROM d) IN (6,7)) AS is_weekend,
FALSE AS is_holiday,
NULL AS holiday_name,
(EXTRACT(ISODOW FROM d) BETWEEN 1 AND 5) AS is_business_day
FROM generate_series(DATE '2010-01-01', DATE '2040-12-31', INTERVAL '1 day') d;
-- Holidays (join a separate table, mark is_holiday)
UPDATE dim_date d SET is_holiday = TRUE, holiday_name = h.name, is_business_day = FALSE
FROM holidays h WHERE h.observed_date = d.full_date;

Time-of-day dimension (separate)
Never combine date + time into one dimension — cardinality explodes (30 years of dates × 86,400 seconds per day ≈ 1 billion rows).
CREATE TABLE dim_time (
time_key SMALLINT PRIMARY KEY, -- 0..1439 (minute-of-day)
hour SMALLINT,
minute SMALLINT,
period VARCHAR(2), -- AM/PM
hour_12 SMALLINT,
time_of_day_segment VARCHAR(20), -- 'morning', 'evening'
is_peak_hours BOOLEAN
);
-- 1440 rows total.

Facts reference both: start_date_key + start_time_key, derived from start_ts.
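Populating dim_time is a one-time script; a minimal Python sketch (the segment boundaries and peak-hour window are assumptions, not from the DDL above):

```python
def build_dim_time():
    """Generate the 1,440-row minute-of-day dimension sketched above.
    Segment boundaries and peak hours are illustrative choices."""
    rows = []
    for minute_of_day in range(1440):
        hour, minute = divmod(minute_of_day, 60)
        segment = ("night" if hour < 6 else
                   "morning" if hour < 12 else
                   "afternoon" if hour < 18 else "evening")
        rows.append({
            "time_key": minute_of_day,
            "hour": hour,
            "minute": minute,
            "period": "AM" if hour < 12 else "PM",
            "hour_12": (hour % 12) or 12,
            "time_of_day_segment": segment,
            "is_peak_hours": 19 <= hour <= 22,  # assumed 7pm-11pm peak
        })
    return rows

dim_time = build_dim_time()
assert len(dim_time) == 1440
assert dim_time[0] == {"time_key": 0, "hour": 0, "minute": 0, "period": "AM",
                       "hour_12": 12, "time_of_day_segment": "night",
                       "is_peak_hours": False}
assert dim_time[1230]["hour"] == 20 and dim_time[1230]["is_peak_hours"]
```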
Timezone handling — the ironclad rule
Store everything in UTC. Convert for display at presentation.
- Source systems emit timestamps; normalize to UTC in the first transform (bronze → silver).
- Use TIMESTAMP WITH TIME ZONE (TIMESTAMPTZ) columns in Postgres/Redshift. Beware dialect differences: Snowflake's plain TIMESTAMP defaults to the naive TIMESTAMP_NTZ, while BigQuery's TIMESTAMP is UTC-anchored and its DATETIME is naive — pick one convention and wrap it consistently.
- For local-time analytics ("plays during 8pm local per user"), precompute the local date/time on ingest using the user's known timezone. This avoids runtime AT TIME ZONE conversion over millions of rows.
-- Compute user-local ts at ingest
SELECT
event_ts AT TIME ZONE 'UTC' AT TIME ZONE u.timezone AS local_ts,
(event_ts AT TIME ZONE 'UTC' AT TIME ZONE u.timezone)::DATE AS local_date
FROM raw_events e JOIN dim_user u ON u.user_id = e.user_id;

DST is the silent bug. Whenever you convert, be prepared for 2026-03-08 02:30 US/Eastern — a wall-clock time that doesn't exist. Use library support that surfaces the gap (pytz's localize(..., is_dst=None) raises; java.time resolves gaps by a documented rule) rather than ad-hoc offset math that silently shifts.
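With the stdlib zoneinfo (which, unlike pytz, does not raise on gap times), you can detect a nonexistent wall-clock time by round-tripping through UTC; exists_on_wall_clock is a hypothetical helper:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

def exists_on_wall_clock(dt):
    """Detect DST-gap times: round-trip through UTC and compare wall clocks.
    zoneinfo silently maps nonexistent local times onto a real instant, so
    the round trip comes back with a different wall-clock reading."""
    back = dt.astimezone(timezone.utc).astimezone(dt.tzinfo)
    return back.replace(tzinfo=None) == dt.replace(tzinfo=None)

eastern = ZoneInfo("America/New_York")

# 2026-03-08 02:30 falls inside the spring-forward gap: it never happened
assert not exists_on_wall_clock(datetime(2026, 3, 8, 2, 30, tzinfo=eastern))
# One hour later is a perfectly valid EDT time
assert exists_on_wall_clock(datetime(2026, 3, 8, 3, 30, tzinfo=eastern))
```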
9. Data Vault 2.0 — When and Why
Data Vault is a modeling methodology for enterprises with many source systems, strong auditability requirements, and the need to evolve quickly without reshaping downstream.
Core structures
Hub — unique list of business keys.
CREATE TABLE hub_customer (
customer_hk CHAR(64) PRIMARY KEY, -- hash of business key
customer_bk VARCHAR(100) NOT NULL, -- natural / business key
load_dts TIMESTAMPTZ NOT NULL,
record_source VARCHAR(100) NOT NULL
);

Link — relationships between hubs.
CREATE TABLE link_order (
order_hk CHAR(64) PRIMARY KEY, -- hash(customer_bk, product_bk, order_bk)
customer_hk CHAR(64) NOT NULL REFERENCES hub_customer,
product_hk CHAR(64) NOT NULL REFERENCES hub_product,
order_bk VARCHAR(100) NOT NULL,
load_dts TIMESTAMPTZ NOT NULL,
record_source VARCHAR(100) NOT NULL
);

Satellite — descriptive attributes, with full history built in.
CREATE TABLE sat_customer_details (
customer_hk CHAR(64) NOT NULL,
load_dts TIMESTAMPTZ NOT NULL,
load_end_dts TIMESTAMPTZ,
hash_diff CHAR(64) NOT NULL,
-- Attributes
email VARCHAR(320),
country_code CHAR(2),
subscription_tier VARCHAR(20),
record_source VARCHAR(100) NOT NULL,
PRIMARY KEY (customer_hk, load_dts)
);

Loading pattern
Highly parallel — each hub, link, and sat can be loaded independently as long as hubs are loaded before their sats. No dependencies between sources.
-- Load pattern (per source, per hub)
INSERT INTO hub_customer (customer_hk, customer_bk, load_dts, record_source)
SELECT DISTINCT
SHA2(s.customer_id, 256) AS customer_hk,
s.customer_id,
CURRENT_TIMESTAMP,
'billing_system'
FROM stage_billing s
LEFT JOIN hub_customer h ON h.customer_hk = SHA2(s.customer_id, 256)
WHERE h.customer_hk IS NULL;

Pros
- Source-system-agnostic: the vault survives upstream changes with zero refactor.
- Auditable: every fact has load_dts + record_source; lineage is built in.
- Parallel loading: no cross-source dependencies.
- Adding new sources is an additive operation.
Cons
- Unqueryable directly. You build a Kimball-style "information mart" on top for BI. Two layers = more moving parts.
- Many joins: hubs + sats + links to get a single business view.
- Overkill for small/single-source warehouses.
- Team needs training; the discipline is easy to violate.
When to use it
- 50+ source systems to integrate.
- Regulatory compliance (insurance, banking, healthcare) requires full audit trail.
- Large data engineering team that can maintain both vault + marts.
When NOT to use it: small warehouses (<10 sources), teams without DV expertise, or when business moves too fast for a two-layer architecture.
10. One Big Table & the Columnar Revolution
Columnar compression changes the math. With dictionary encoding, 100M rows of the string "premium" shrink to 100M 1-byte dictionary codes plus a single dictionary entry for "premium" — roughly 100 MB, versus ~700 MB for the raw 7-byte strings. With RLE on sorted data, runs of "premium" collapse further, to a few bytes per run.

This means denormalizing a dimension into a fact table adds almost no storage. The join-elimination benefit is real. Hence OBT.
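The arithmetic is worth making explicit; a back-of-envelope sketch with assumed encoding constants (1-byte codes, ~6 bytes per RLE run — illustrative, not any specific format's on-disk layout):

```python
def encoded_sizes(n_rows, value="premium", n_runs=1000):
    """Back-of-envelope sizes (bytes) for one low-cardinality string column.
    Assumes 1-byte dictionary codes and ~6 bytes per RLE (value, run-length)
    pair - illustrative constants only."""
    raw = n_rows * len(value)              # plain string storage
    dictionary = n_rows * 1 + len(value)   # one code per row + the dict entry
    rle = n_runs * 6                       # one entry per run of equal values
    return raw, dictionary, rle

raw, dictionary, rle = encoded_sizes(100_000_000)
assert raw == 700_000_000          # ~700 MB of raw 7-byte strings
assert dictionary == 100_000_007   # ~100 MB of 1-byte codes
assert rle == 6_000                # a few KB if the column is run-friendly
```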
What OBT looks like
-- A wide, denormalized table: 200 columns, billions of rows, no joins required
CREATE TABLE gold.playback_sessions_enriched (
-- Primary identifiers
session_id VARCHAR(64),
session_key BIGINT,
-- User attributes (denormalized from dim_user)
user_id VARCHAR(36),
user_country_code CHAR(2),
user_subscription_tier VARCHAR(20),
user_household_size SMALLINT,
user_profile_language VARCHAR(10),
user_signup_date DATE,
user_segment VARCHAR(50),
-- Title attributes
title_id VARCHAR(36),
title VARCHAR(500),
title_type VARCHAR(20),
title_genre_primary VARCHAR(50),
title_runtime_min INT,
title_is_original BOOLEAN,
-- Device attributes
device_type VARCHAR(20),
device_os VARCHAR(50),
-- Session measures
watch_ms BIGINT,
buffer_ms BIGINT,
seeks INT,
pauses INT,
max_bitrate_kbps INT,
completion_ratio NUMERIC(4,3),
-- Derived (not in star)
watched_in_peak_hours BOOLEAN,
was_binge_session BOOLEAN,
qoe_score NUMERIC(3,2),
-- Time
start_ts TIMESTAMPTZ,
end_ts TIMESTAMPTZ,
dt DATE
)
PARTITIONED BY (dt)
CLUSTER BY (user_country_code, title_genre_primary);

Consumer queries have zero joins:
SELECT
user_country_code, title_genre_primary,
SUM(watch_ms) / 3600e3 AS watch_hours
FROM gold.playback_sessions_enriched
WHERE dt BETWEEN '2026-04-12' AND '2026-04-18'
GROUP BY user_country_code, title_genre_primary;

Zone maps + clustering make this scan a tiny fraction of the table.
OBT trade-offs
Pros:
- One join-less table.
- Dashboards fly.
- Simple mental model for analysts.
- Compressed storage cost is often lower than the original star due to better dictionary encoding across a wider table.
Cons:
- SCD2 becomes painful. Do you carry every historical attribute on every fact row, or pin user attributes as-of event time and give up "current" reporting? Either way, make the choice explicitly.
- Backfilling dimension changes is expensive. If user_segment logic changes, every historical row must be rewritten.
- Data contracts grow. 200 columns × many downstream consumers = a lot of coordination on changes.
- Governance nightmare. Sensitive fields proliferate — you now have email on billions of fact rows; GDPR deletes touch all of them.
Hybrid: Kimball silver + OBT gold
The pragmatic pattern most serious teams end up with:
- Silver = conformed star schema (dims + facts), Kimball-style.
- Gold = one OBT per consumption pattern, materialized from silver.
Regenerate gold cheaply when dimensions change. Silver provides the single source of truth. BI reads gold for speed.
11. Medallion Architecture (Bronze/Silver/Gold)
The lakehouse convention. Each layer has a distinct purpose:
Bronze — raw, append-only, preserved
Never business-logic-applied. Schema-on-read preserved (or schema promoted but all fields retained). Idempotent ingest by source event ID.
# Ingest Kafka events to Bronze Iceberg
# `stream` is a streaming DataFrame, e.g. built via spark.readStream.format("kafka")
(stream
.writeStream
.format("iceberg")
.outputMode("append")
.option("path", "s3://lake/bronze/events/")
.option("checkpointLocation", "s3://lake/checkpoints/bronze_events/")
.partitionBy("ingest_date")
.trigger(processingTime="5 minutes")
.start())

Bronze is the "replayability" layer. If silver/gold ever get corrupted, you can always rebuild from bronze.
Silver — cleaned, conformed, typed, deduped
Business entities emerge. Dimensions and facts in Kimball style. SCD handling lives here.
Gold — consumption-ready, OBT per use case
One gold table per dashboard domain. Materialized from silver. Rebuilt cheaply.
The implicit contract
- Bronze is owned by the ingestion team. Contract: "we will deliver every source event exactly once, preserve schema, and never backfill by overwrite."
- Silver is owned by the platform team. Contract: "conformed dimensions, typed, documented, stable schema."
- Gold is owned by the consumer team. Contract: "you build it, you own it; break it quietly and it's on you."
Medallion + streaming
Works the same way. Bronze = append stream into Iceberg / Delta. Silver = streaming sessionization / dedup / enrichment, written as upsert into Iceberg. Gold = periodic or streaming rollup. Iceberg's row-level DELETE/MERGE (v2) makes this feasible at real latencies.
12. NULL Semantics — The Silent Source of Bugs
NULL in SQL is three-valued logic. NULL = NULL is NULL (not TRUE). x = NULL always yields NULL. This is the source of countless quiet bugs.
NULL in joins
a.x = b.x never matches when either side is NULL. If you want NULL to equal NULL, use:
- a.x IS NOT DISTINCT FROM b.x (Postgres, Spark)
- COALESCE(a.x, '') = COALESCE(b.x, '') (universal but ugly; only works if '' isn't a valid value)
- equal_null(a.x, b.x) (Snowflake)
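SQLite (whose IS operator is its spelling of null-safe equality) makes the join mismatch concrete:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (x TEXT);  CREATE TABLE b (x TEXT);
INSERT INTO a VALUES ('US'), (NULL);
INSERT INTO b VALUES ('US'), (NULL);
""")

# Plain equality: the NULL rows never match each other
plain = con.execute(
    "SELECT COUNT(*) FROM a JOIN b ON a.x = b.x").fetchone()[0]
# Null-safe equality (SQLite's IS behaves like IS NOT DISTINCT FROM)
nullsafe = con.execute(
    "SELECT COUNT(*) FROM a JOIN b ON a.x IS b.x").fetchone()[0]

assert plain == 1      # only 'US' = 'US'
assert nullsafe == 2   # 'US' = 'US' plus NULL-matches-NULL
```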
NULL in aggregations
- COUNT(*) counts all rows. COUNT(x) counts non-NULL x.
- SUM, AVG, MIN, MAX ignore NULLs.
- AVG(x) = SUM(x) / COUNT(x) — it divides by the non-NULL count. Surprising when you expect "nulls treated as zero."
- COUNT(DISTINCT x) counts distinct non-NULL values — a column with values {1, NULL, NULL} has COUNT DISTINCT 1.
NULL in filters
WHERE x <> 5 excludes rows where x is 5 AND rows where x is NULL. If you want nulls too: WHERE x IS NULL OR x <> 5. Forgetting this is a dominant source of silent data loss.
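Both the aggregation and filter traps fit in a few lines of SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (5,), (9,), (None,)])

cnt_star, cnt_x, avg_x = con.execute(
    "SELECT COUNT(*), COUNT(x), AVG(x) FROM t").fetchone()
assert cnt_star == 4   # all rows, including the NULL
assert cnt_x == 3      # non-NULL values only
assert avg_x == 5.0    # (1+5+9)/3, NOT (1+5+9)/4

# x <> 5 silently drops the NULL row as well as the 5
assert con.execute("SELECT COUNT(*) FROM t WHERE x <> 5").fetchone()[0] == 2
# Keeping NULLs requires asking for them explicitly
assert con.execute(
    "SELECT COUNT(*) FROM t WHERE x IS NULL OR x <> 5").fetchone()[0] == 3
```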
NULL ordering
- Postgres: NULLs sort LAST by default in ASC, FIRST in DESC. Override with NULLS FIRST / NULLS LAST.
- MySQL: NULLs sort FIRST in ASC.
- Be explicit. Always.
NULL in GROUP BY
NULLs form their own group. GROUP BY country returns one row with country = NULL for all rows missing country. Surface this explicitly in reports or filter before aggregation.
NULLs in window functions
LAG(x) returns NULL before the first row. Use LAG(x, 1, default_value) or COALESCE. IGNORE NULLS on navigation functions (Snowflake, BigQuery, and others — exact syntax placement varies by dialect) skips NULLs and returns the last non-NULL value — powerful for "last known state" patterns.
-- Last non-null country (for users with intermittent country updates)
SELECT user_id, event_ts,
LAST_VALUE(country IGNORE NULLS) OVER (
PARTITION BY user_id ORDER BY event_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS country_known
FROM events;

Rule: every NULL should be documented
Either NULL means "not yet known" (e.g., churn_date) or "not applicable" (e.g., return_reason on non-returned orders). Document which. Consider sentinel values (unknown codes instead of NULL) when the latter — they force explicit handling downstream.
-- Instead of NULL for "unknown country":
country_code CHAR(2) NOT NULL DEFAULT 'ZZ', -- ZZ = unknown
-- Reports naturally include "ZZ" as a group rather than silently dropping.

13. Data Contracts & Schema Evolution
A data contract is a guarantee between a producer and consumers about a dataset's schema, semantics, freshness, and quality. Without it, consumers build on sand.
What a contract specifies
# data_contracts/playback_sessions.yaml
id: bronze.playback_sessions
version: 3.1.0
owner: playback-platform@co
schema:
event_id: { type: string, required: true, unique: true }
session_id: { type: string, required: true }
user_id: { type: string, required: true, pii: true }
title_id: { type: string, required: true }
event_ts: { type: timestamp, required: true, timezone: UTC }
event_type: { type: string, required: true, enum: [start, heartbeat, end, error] }
watch_ms: { type: long, required: false, constraints: "value >= 0" }
sla:
freshness: "p99 < 15 minutes after event_ts"
completeness: ">= 99.5% of upstream client emissions"
uniqueness: "100% on event_id within 7 days"
tests:
- name: no_negative_watch_ms
sql: "SELECT COUNT(*) FROM {{ this }} WHERE watch_ms < 0"
threshold: 0
breaking_changes_require:
- major_version_bump
- 30_day_deprecation_notice
- consumer_acknowledgment_from: [analytics, ml-platform, product-insights]

Schema evolution rules
| Change | Compatibility | Action |
|---|---|---|
| Add optional column | Backward | Minor version |
| Add required column | Breaking | Major version, default-fill historical rows |
| Drop column | Breaking | Deprecate (write-null for N versions, then drop) |
| Rename column | Breaking | Add new, deprecate old, drop after transition |
| Widen type (INT → BIGINT) | Backward | Minor version |
| Narrow type | Breaking | Major version, data validation |
| Change semantics without renaming | Catastrophic | Never. Rename + deprecate. |
Enforcing contracts
Automatic checks in CI:
# contract_check.py
import yaml, pyarrow.parquet as pq
spec = yaml.safe_load(open("data_contracts/playback_sessions.yaml"))
sample_file = "data/sample.parquet"  # illustrative path: any recent file from the dataset
schema = pq.ParquetFile(sample_file).schema_arrow
observed = {f.name: str(f.type) for f in schema}
for field_name, field_spec in spec["schema"].items():
if field_spec["required"] and field_name not in observed:
raise ValueError(f"Required field {field_name} missing")
# Type, enum, constraint checks...

Pair with runtime data quality tests (Great Expectations, Soda, dbt tests) that run on each pipeline execution and block promotion on failure.
Lakehouse-native schema evolution
Iceberg and Delta both support safe evolution:
- Iceberg tracks columns by unique ID, not name. Rename is metadata-only, no rewrite.
- Add column: metadata-only; existing rows treated as NULL.
- Drop column: metadata-only (writes stop emitting it; old files untouched).
- Reorder: metadata-only.
- Type promotion (INT → LONG, FLOAT → DOUBLE): metadata-only, allowed.
- Type narrowing: not allowed.
-- Iceberg
ALTER TABLE silver.playback_sessions ADD COLUMN engagement_score DOUBLE;
ALTER TABLE silver.playback_sessions RENAME COLUMN qoe_score TO quality_score;

14. Modeling Checklist & Anti-Patterns
Before shipping a model:
- Grain declared and documented? If two engineers could disagree about what a row represents, the model is broken.
- Are all keys surrogate? Natural keys are attributes, not PKs on dims.
- Is every measure's additivity documented? Labels in the data catalog; column naming conventions (_point_in_time, _ratio).
- Is history handled deliberately? SCD type per attribute, not per dim.
- Are dimensions conformed across facts? Single dim_customer, multiple FKs.
- Is the date dimension a real table? Not a function call.
- Are slowly-changing fact attributes modeled as mini-dimensions? Not VARCHAR columns.
- Are all NULLs meaningful and documented? Sentinel values where applicable.
- Are timestamps UTC with explicit timezone storage? Local times precomputed for user-local analytics.
- Does the table have a partition/cluster strategy that prunes typical queries? Default: partition by date, cluster by most-filtered high-cardinality column.
- Are contracts published and enforced? Schema + SLA + tests in version control, blocked by CI.
- Can the table be rebuilt from bronze in a reasonable window? If not, you have single-source fragility.
Anti-patterns
- Mixed-grain fact tables. Invisible bug, impossible to audit.
- Natural keys as PKs. Breaks the first time source-system renumbering happens.
- FLOAT for money. Accumulation errors. Use BIGINT cents or DECIMAL(18,2).
- No date dimension. Every report ends up with CASE WHEN EXTRACT(MONTH ...) scattered everywhere.
- TIMESTAMP without timezone. Analytics broken after the next DST change.
- "Latest" table that's actually mutable. A view over SCD2 with WHERE is_current is safe; an actual table that gets UPDATE'd is a mutation-ordering bug farm.
- Soft-delete without a partial index. WHERE deleted = FALSE becomes a full scan on billions of rows.
- EAV (entity-attribute-value) inside a warehouse. (entity_id, attr_name, attr_value) looks flexible, makes every query a pivot, kills indexability.
- One column per language (name_en, name_fr, name_de, ...). Unbounded schema drift. Normalize to a translations child table.
- Copying source tables verbatim into the warehouse as "the model." Source schema serves source transactions; warehouse schema should serve analytics.
- Logic embedded in views with no materialization. When the dashboard query gets slow, caching becomes a retrofit.
- Dimensions that aren't dimensions. If a column has 8 distinct values, it's a mini-dim candidate, not a VARCHAR fact column.
Next: 02-batch-processing.md — the theory and mechanics of bounded-data processing, idempotency, MERGE patterns, file formats, and partition design.
15. Surrogate Key Strategies in Depth
A surrogate key is an engineered stand-in for a natural key. Choosing the wrong surrogate strategy is a recurring source of production bugs, late-arriving data hell, and expensive re-keying projects two years in.
The three strategies
Sequence (identity column). Monotonic integer issued by the database. Tiny, fast, great for indexing. Fatal weakness: you cannot generate it independently in two systems. If you load facts in Spark and dimensions in Snowflake, you need a round-trip to assign keys. That round-trip is slow, fragile, and forbids truly distributed pipelines.
Hash (deterministic on business key). Apply a collision-resistant hash (SHA-256, xxhash) to the natural key tuple and store the hash. Any system can produce the same key without coordination. Works beautifully for Data Vault and lakehouse pipelines. Trade-off: keys are larger (16–32 bytes) and not human-readable. Birthday-paradox collisions are theoretically possible but practically never occur at data-warehouse scale with 128+ bit hashes.
UUID (random). Nice for unordered distributed generation but terrible for range scans and clustering. Never use random UUIDs as a primary key in a columnar warehouse — they destroy compression and min/max pruning. UUIDv7 (time-sortable) is acceptable if you really want UUID semantics.
Decision table
| Requirement | Sequence | Hash | UUID |
|---|---|---|---|
| Cross-system generation | No | Yes | Yes |
| Clustered storage efficiency | Best | Good | Worst |
| Human-readable | Yes | No | No |
| Idempotent re-run | No (re-assigns) | Yes | No |
| Size | 4–8 bytes | 16–32 bytes | 16 bytes |
Hashed surrogate — worked example
-- Deterministic surrogate from the natural-key tuple; use a 256-bit hash
-- (Spark/Snowflake: SHA2(x, 256); BigQuery: TO_HEX(SHA256(x)))
SELECT
SHA2(CONCAT_WS('|',
LOWER(TRIM(source_system)),
CAST(source_id AS STRING),
CAST(effective_date AS STRING)
), 256) AS customer_sk,
source_system,
source_id,
effective_date,
...
FROM staging_customer;
Key properties: lowercase + trim on text columns (defensive against inconsistent casing), an explicit delimiter that cannot appear in the payload, and inclusion of the temporal component for SCD Type 2. The same row re-ingested produces the same key — idempotency for free.
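The same pattern in plain Python — handy when a Spark job and a warehouse load must mint identical keys independently (surrogate_key is an illustrative helper):

```python
import hashlib

def surrogate_key(source_system, source_id, effective_date):
    """Deterministic surrogate: normalize text, join with a delimiter the
    payload cannot contain, hash. Any system computing this independently
    gets the same key - no sequence round-trip needed."""
    parts = (source_system.strip().lower(), str(source_id), str(effective_date))
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

# Idempotent: re-ingesting the same row yields the same key, and
# casing/whitespace drift in the source doesn't fork identities
k1 = surrogate_key("Billing ", 42, "2026-01-01")
k2 = surrogate_key("billing", "42", "2026-01-01")
assert k1 == k2

# Different effective_date (a new SCD2 version) -> different key
assert k1 != surrogate_key("billing", 42, "2026-02-01")
```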
16. Bridge Tables and Many-to-Many Dimensions
Not every dimension attaches to a fact with a clean foreign key. A policy can have multiple beneficiaries. An insurance claim can have multiple diagnosis codes. A marketing touch can belong to multiple campaigns. These require bridge tables — and bridge tables are where naive data modelers get caught.
The problem with denormalizing
Tempting shortcut: flatten the many-to-many into a wide fact with diagnosis_code_1, diagnosis_code_2, diagnosis_code_3, …. This works until the day you get a claim with 12 codes. Now you either drop data, create a 20-column wasteland, or reshape the fact — which breaks every downstream consumer. Avoid this pattern.
The bridge table pattern
Three tables: the fact (fct_claim), the dimension (dim_diagnosis), and the bridge (bridge_claim_diagnosis). The bridge stores one row per (claim_sk, diagnosis_sk) with an optional weight column.
CREATE TABLE bridge_claim_diagnosis (
claim_sk BIGINT NOT NULL,
diagnosis_sk BIGINT NOT NULL,
is_primary BOOLEAN NOT NULL,
weight DECIMAL(6,4), -- optional: allocate claim value across codes
PRIMARY KEY (claim_sk, diagnosis_sk)
);
The weight column is optional in the schema but not in practice: without it, you cannot aggregate claim dollars by diagnosis without double-counting. Ten codes on a claim would cause the same dollars to be attributed ten times when joined naively. Interviewers probe for this.
Aggregation safely across a bridge
-- Correct: weighted allocation
SELECT
d.diagnosis_category,
SUM(f.claim_amount * b.weight) AS allocated_amount
FROM fct_claim f
JOIN bridge_claim_diagnosis b ON b.claim_sk = f.claim_sk
JOIN dim_diagnosis d ON d.diagnosis_sk = b.diagnosis_sk
GROUP BY d.diagnosis_category;
-- Wrong: naive join double-counts
SELECT d.diagnosis_category, SUM(f.claim_amount) FROM ... -- DO NOT
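The double-counting failure is easy to reproduce in plain Python with toy rows (the names are illustrative): the naive join attributes the full claim amount once per bridge row, while the weighted join conserves the total.

```python
# One claim worth $1,000 carrying three diagnosis codes.
claim = {"claim_sk": 1, "claim_amount": 1000.0}
bridge = [
    {"claim_sk": 1, "diagnosis_sk": 10, "weight": 0.5},
    {"claim_sk": 1, "diagnosis_sk": 11, "weight": 0.3},
    {"claim_sk": 1, "diagnosis_sk": 12, "weight": 0.2},
]

# Naive join: every bridge row drags in the full amount -> 3x the real dollars.
naive_total = sum(claim["claim_amount"] for b in bridge if b["claim_sk"] == claim["claim_sk"])

# Weighted join: dollars are allocated, so the grand total is conserved.
weighted_total = sum(claim["claim_amount"] * b["weight"] for b in bridge)

assert naive_total == 3000.0              # triple-counted
assert abs(weighted_total - 1000.0) < 1e-6
```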
16. Late-Arriving Facts and Dimensions
"Late-arriving" data is the term of art for rows whose business timestamp is in the past but whose arrival timestamp is now. Two distinct cases, each with its own playbook.
Late-arriving facts
A mobile app emits an event on Tuesday but the SDK holds it in local cache due to offline mode, and uploads it Thursday. The fact has a business date of Tuesday. You must decide: does Tuesday's daily aggregate get corrected retroactively, or is the late record attributed to Thursday?
The senior answer is: both tables exist. A "by business date" partition table recomputes Tuesday's totals when the late row arrives. A "by arrival date" partition table leaves Tuesday alone and attributes the row to Thursday. Each serves different consumers — finance cares about business date, operations cares about arrival date — and conflating them is a classic bug.
Late-arriving dimensions
A fact arrives referencing a new customer_id that doesn't exist in dim_customer yet. Three strategies:
- Inferred member. Insert a placeholder row into dim_customer with the known natural key and NULL attributes. The fact joins successfully. When the full dim row arrives later, update in place (SCD Type 1) or issue a new version (SCD Type 2). Best default.
- Quarantine. Hold the fact in a quarantine table until the dim row appears. Replay when it does. Use when joining with missing context would be meaningless.
- Orphan facts. Let the fact land with a NULL dim key. Never recommended — downstream SQL silently drops rows.
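The inferred-member default can be sketched in plain Python, with dicts standing in for dim_customer; the attribute names are illustrative. The fact load inserts a placeholder so the join never drops rows, and the real dim row later overwrites it Type 1-style.

```python
dim_customer = {}  # natural_key -> attribute dict

def ensure_dim_row(customer_id: str) -> None:
    """Insert an inferred member if the fact references an unknown customer."""
    if customer_id not in dim_customer:
        dim_customer[customer_id] = {"name": None, "segment": None, "inferred": True}

def load_fact(fact: dict) -> dict:
    ensure_dim_row(fact["customer_id"])  # the join can no longer orphan the fact
    return {**fact, "customer_attrs": dim_customer[fact["customer_id"]]}

def upsert_dim(customer_id: str, attrs: dict) -> None:
    """Full dim row arrives later: overwrite the placeholder (SCD Type 1)."""
    dim_customer[customer_id] = {**attrs, "inferred": False}

f = load_fact({"customer_id": "c-9", "amount": 25.0})
assert dim_customer["c-9"]["inferred"] is True   # placeholder created
upsert_dim("c-9", {"name": "Ada", "segment": "premium"})
assert dim_customer["c-9"]["inferred"] is False
```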
17. Accumulating Snapshot Fact Tables
Most facts are transactional (one row per event) or periodic (one row per entity per period). The third, less-known kind is the accumulating snapshot — one row per business process instance, updated in place as the process progresses. The canonical example is an order lifecycle.
Structure
One row per order. Columns for every milestone: placed_ts, paid_ts, shipped_ts, delivered_ts, returned_ts. Plus computed lag columns: days_to_ship, days_to_deliver. The row is created at order placement and UPDATEd at every milestone. NULL columns mean "hasn't happened yet."
CREATE TABLE fct_order_lifecycle (
order_sk BIGINT NOT NULL,
customer_sk BIGINT NOT NULL,
placed_ts TIMESTAMP NOT NULL,
paid_ts TIMESTAMP,
shipped_ts TIMESTAMP,
delivered_ts TIMESTAMP,
returned_ts TIMESTAMP,
days_to_pay INT,
days_to_ship INT,
days_to_deliver INT,
order_amount DECIMAL(10,2) NOT NULL,
current_status VARCHAR(20) NOT NULL
);
Why this fact type exists
Without it, "how long does it take to ship a paid order" requires a self-join on a transactional fact across millions of rows. With accumulating snapshot, it's AVG(days_to_ship). The cost is maintenance complexity: every milestone update has to find and update the right row, ideally via a MERGE on the natural key.
Lakehouse implementation
On Iceberg or Delta, accumulating snapshots are implemented via MERGE INTO. The row is inserted on the first event and updated on every subsequent event. Use a current_status column to track where in the lifecycle the row currently sits — this makes recovery after a failed merge straightforward.
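The insert-then-update flow can be sketched in plain Python, with a dict standing in for the MERGE target; the column names follow the DDL above, and the event shape is an assumption for illustration.

```python
from datetime import datetime

orders: dict[int, dict] = {}  # order_sk -> lifecycle row

MILESTONES = {"placed": "placed_ts", "paid": "paid_ts", "shipped": "shipped_ts",
              "delivered": "delivered_ts", "returned": "returned_ts"}

def apply_event(order_sk: int, milestone: str, ts: datetime) -> None:
    """MERGE-like upsert: insert the row on the first event, update in place afterwards."""
    row = orders.setdefault(order_sk, {col: None for col in MILESTONES.values()})
    row[MILESTONES[milestone]] = ts
    row["current_status"] = milestone
    # Recompute lag columns from whatever milestones exist so far.
    if row["paid_ts"] and row["shipped_ts"]:
        row["days_to_ship"] = (row["shipped_ts"] - row["paid_ts"]).days

apply_event(1, "placed", datetime(2026, 4, 1))
apply_event(1, "paid", datetime(2026, 4, 2))
apply_event(1, "shipped", datetime(2026, 4, 5))
assert orders[1]["days_to_ship"] == 3
assert orders[1]["delivered_ts"] is None  # "hasn't happened yet"
```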
Batch Processing
Batch processing is the workhorse of the data warehouse. It's also where the subtle bugs live — the ones that take three weeks to notice and two days to find. This file goes into the theory of bounded-data processing, the mathematics of idempotency, the anatomy of MERGE, the internals of Parquet/ORC, partition design with real math, and how to build pipelines that survive backfills, reruns, and bad data.
1. Mental Model: Bounded vs Unbounded Data (Deep)
A batch job processes bounded data — a finite set of rows with a defined beginning and end. You can scan it, sort it globally, compute the total, and wait for everything before you emit. These are luxuries a streaming job doesn't have.
But "bounded" is a simplification. Most production batch jobs operate on windows of an unbounded stream: "process yesterday's events" is a bounded view of the event firehose. The sealed-ness of that window depends entirely on how late data can arrive.
The three dimensions of boundedness
- Temporal boundedness — is there a cutoff time after which no more data for the window can arrive?
- Cardinality boundedness — is the count of rows bounded? (It always is in retrospect, but the job needs to know when it's seen them all.)
- Completeness boundedness — can you prove all upstream producers have emitted?
Real-world jobs rarely have all three cleanly:
- A daily job on mobile app events: temporally bounded (sealing at midnight + safety window), cardinality uncertain (user activity varies), completeness probabilistic (some clients reconnect days later).
- An S3 daily drop: temporally bounded + cardinality bounded once the drop completes, with a completeness signal from a `_SUCCESS` file or a manifest.
- A CDC stream dump for the prior day: cardinality bounded by rows in the source table; temporal boundedness depends on the CDC tool's lag.
Implications for pipeline design
A truly bounded source (a sealed S3 drop) admits:
- Strict correctness checks: row counts, checksums, column distributions against expected ranges.
- Full refresh fallback: if anything looks off, reprocess from source.
A temporally-bounded-but-late-arriving source admits:
- Safety windows: process `event_time ≥ T - 7 days` even on the "today" job, so late arrivals are included.
- Periodic reconciliation: a separate weekly job re-processes the last N days to catch late data.
- Grace periods: "seal" day D only after 48h of additional quiet.
Watermark-as-promise, applied to batch
Even batch has watermarks — implicit ones. "Process yesterday" with a 1-hour safety buffer means: I trust that by now() - 1 hour, no events with event_time in [yesterday_start, yesterday_end] will still be arriving. Make this explicit:
from datetime import date, datetime, time, timedelta

def run_daily(dt: date, safety_hours: int = 2):
    """Process dt's events; assume all events for dt have arrived by dt+1day+safety_hours."""
    assert datetime.utcnow() >= (datetime.combine(dt + timedelta(days=1), time.min)
                                 + timedelta(hours=safety_hours)), \
        f"Not safe to process {dt} yet"
    ...

The assert prevents silent correctness loss when backfilling to "today."
Bounded storage, unbounded log
Even a bounded table has an unbounded "log" of changes — the sequence of insertions, updates, deletions over time. Modern lakehouse formats (Iceberg, Delta) expose this log: you can read the table as-of any past snapshot (time travel) or stream the log of changes (Change Data Feed in Delta, incremental reads in Iceberg).
This lets batch jobs consume batch sources as if they were streams:
# Iceberg incremental read — only snapshots since the last run
df = (spark.read
.format("iceberg")
.option("start-snapshot-id", last_processed_snapshot)
.option("end-snapshot-id", current_snapshot)
.load("warehouse.bronze.events"))

The line between batch and streaming blurs. Modern advice: unify the code path; only the trigger differs.
2. Anatomy of a Batch Job
A production batch job has these stages, in order. Each can fail; each must be recoverable.
- Parameter binding — job receives `logical_date` (the date being processed), not `current_date`. This is the single most important contract; without it, the job is not idempotent.
- Source existence check — upstream data exists and is complete. Fail fast if missing (don't produce an empty output that looks successful).
- Source read — pull from bronze / staging / raw.
- Deduplication — producers retry, sources double-emit.
- Conforming — type casting, UTC normalization, null handling, enum validation.
- Enrichment — joins to dimensions to hydrate descriptors.
- Transformation — the actual business logic.
- Validation — row counts, distribution checks, uniqueness assertions. This is the airbag.
- Write — atomic commit to the target.
- Publish — signal to downstream consumers (dataset events, XComs, manifest writes).
- Audit log — row counts in/out, source snapshot IDs, emission latency, job version.
Minimal Python/Spark skeleton
from dataclasses import dataclass
from datetime import date, datetime, timedelta
import logging
log = logging.getLogger(__name__)
@dataclass
class RunContext:
logical_date: date
job_version: str
run_id: str
spark: "SparkSession"
def run(ctx: RunContext) -> None:
log.info("job_start", extra={"extra": ctx.__dict__})
source_exists(ctx) # fail fast if upstream missing
raw = read_source(ctx) # (1)
deduped = dedupe(raw) # (2)
conformed = conform_types(deduped) # (3)
enriched = enrich_dimensions(ctx, conformed) # (4)
output = transform(enriched) # (5)
audit = validate(output, ctx) # (6) — raises on hard-fail
if audit.soft_warnings:
notify(audit.soft_warnings)
write_atomic(output, ctx) # (7)
publish_dataset_event(ctx, audit) # (8)
log.info("job_done", extra={"extra": {"rows": audit.row_count}})

Each helper should be independently unit-testable; the orchestrator is thin.
The validation gate
This is the most important step and the most often skipped.
from pyspark.sql import functions as F

@dataclass
class Audit:
row_count: int
unique_rate: float
null_rates: dict[str, float]
sum_checks: dict[str, int]
soft_warnings: list[str]
def validate(df, ctx) -> Audit:
# Hard assertions (raise)
row_count = df.count()
if row_count == 0:
raise AssertionError("Empty output — upstream issue or bug")
if row_count < ctx.expected_min_rows:
raise AssertionError(f"Row count {row_count} below min {ctx.expected_min_rows}")
dup_rate = 1 - df.select("event_id").distinct().count() / row_count
if dup_rate > 0.001:
raise AssertionError(f"Uniqueness violated: {dup_rate:.4%} duplicates")
# Soft warnings (log, notify)
soft = []
null_rates = {}
for col_name in ["user_id", "title_id", "watch_ms"]:
nr = df.filter(F.col(col_name).isNull()).count() / row_count
null_rates[col_name] = nr
if nr > 0.005: # 0.5% threshold
soft.append(f"High null rate in {col_name}: {nr:.2%}")
return Audit(row_count, 1 - dup_rate, null_rates, {}, soft)

Store audit records in a permanent table — historical audit is the tool for diagnosing "what changed overnight."
3. Idempotency — Proofs and Patterns
Definition: running the job twice with the same parameters produces the same output as running it once.
Formally: if J_p is the job with parameters p, viewed as a function from warehouse state to warehouse state, then J_p(J_p(S_0)) = J_p(S_0) for any starting state S_0. The second run is a no-op relative to the first.
Why it matters
- Reruns on failure: if your job crashes mid-way, you need to retry without producing duplicates or partial writes.
- Backfills: reprocessing a historical date must overwrite cleanly, not accumulate.
- Human error: "I accidentally ran the job twice" shouldn't cause an incident.
- Distributed retries: orchestrators retry automatically; idempotency prevents them causing damage.
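The definition is easy to see with two toy write modes in plain Python, where a dict of partitions stands in for the table: partition overwrite satisfies f(f(S)) = f(S); blind append does not.

```python
def overwrite_partition(state: dict, dt: str, rows: list) -> dict:
    """Idempotent: the partition's contents depend only on the inputs."""
    return {**state, dt: list(rows)}

def append_partition(state: dict, dt: str, rows: list) -> dict:
    """Not idempotent: every run grows the partition."""
    return {**state, dt: state.get(dt, []) + list(rows)}

s0 = {}
rows = [{"id": 1}, {"id": 2}]

once = overwrite_partition(s0, "2026-04-19", rows)
twice = overwrite_partition(once, "2026-04-19", rows)
assert once == twice                      # f(f(S)) == f(S)

once_a = append_partition(s0, "2026-04-19", rows)
twice_a = append_partition(once_a, "2026-04-19", rows)
assert len(twice_a["2026-04-19"]) == 4    # rerun doubled the data
```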
Pattern 1 — Partition overwrite
For append-only outputs partitioned by the job parameter (usually date):
(df.writeTo("warehouse.silver.playback_sessions")
    .overwrite(F.col("dt") == F.lit(str(ctx.logical_date))))

The overwrite condition scopes the write to that one partition (Spark's DataFrameWriterV2 API; on Delta the v1 writer's `.option("replaceWhere", ...)` achieves the same). Rerun = same partition overwritten with same data = no-op. Iceberg's atomic commit guarantees no reader sees a half-written state.
Safety check: verify df only contains rows for that partition.
assert df.filter(F.col("dt") != ctx.logical_date).count() == 0, \
"Output contains rows outside target partition — partitioning bug"

Pattern 2 — MERGE by key
For outputs with compound keys and potential updates:
MERGE INTO silver.dim_user AS t
USING stg_user AS s
ON t.user_id = s.user_id
WHEN MATCHED AND t.updated_at < s.updated_at THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...) VALUES (...);

Idempotent because a second run against the same stg_user finds no rows where t.updated_at < s.updated_at (they're equal). Requires a proper version column (updated_at); otherwise you might silently overwrite a newer value with an older one.
Pattern 3 — Hash + upsert
For CDC-like feeds:
MERGE INTO silver.events AS t
USING (
SELECT *, SHA2(CONCAT_WS('||', event_id, event_ts, payload), 256) AS row_hash
FROM stg_events
) AS s
ON t.event_id = s.event_id
WHEN MATCHED AND t.row_hash <> s.row_hash THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...) VALUES (...);

Second run: same hashes match, no updates. Idempotent.
Anti-pattern: INSERT ... SELECT without dedup
-- BAD: second run doubles the data
INSERT INTO silver.events SELECT * FROM stg_events;

This is the single most common idempotency violation. Fix:
- Replace with `MERGE` or partition-overwrite.
- Or: truncate the target partition first (but now two statements must be atomic — use Iceberg/Delta semantics).
Anti-pattern: now() in transform
-- BAD: second run produces different output because now() changes
SELECT *, NOW() AS processed_at FROM stg_events;

Fix: use ctx.logical_date or a per-run constant. If you need a "processed_at" column, set it once at the start of the run and pass it in.
Anti-pattern: monotonically increasing keys that differ across runs
# BAD: Spark's monotonically_increasing_id() isn't stable across runs
df.withColumn("key", F.monotonically_increasing_id())

Fix: deterministic hash of natural keys.
df.withColumn("key", F.sha2(F.concat_ws("||", *natural_key_cols), 256))

Testing idempotency
Write a test that runs the job twice and diffs the output:
def test_idempotent_run(tmp_table, test_data):
run(RunContext(logical_date=date(2026,4,10), ...))
snapshot1 = spark.table(tmp_table).collect()
run(RunContext(logical_date=date(2026,4,10), ...))
snapshot2 = spark.table(tmp_table).collect()
assert snapshot1 == snapshot2, "Output differs between runs — not idempotent"

This catches now() leaks, non-deterministic ordering, and accidental row duplication.
4. MERGE Under the Hood
MERGE (aka UPSERT) looks simple; it's anything but. Understanding its execution is essential to reasoning about performance and correctness.
The logical model
MERGE INTO target t
USING source s
ON t.key = s.key
WHEN MATCHED AND <cond> THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
WHEN MATCHED AND <cond2> THEN DELETE
WHEN NOT MATCHED BY SOURCE THEN DELETE; -- supported by some engines

Conceptually: for every (t, s) pair where t.key = s.key, apply the first matching WHEN MATCHED clause. For every s with no match in t, apply WHEN NOT MATCHED. For every t with no match in s (when WHEN NOT MATCHED BY SOURCE is present), apply that clause.
The multi-match rule
If a single target row matches multiple source rows, behavior is undefined and most engines error:
ERROR: Cannot perform MERGE as multiple source rows matched and attempted to modify
the same target row.

Fix: deduplicate the source to a single row per join key before the MERGE.
MERGE INTO target t
USING (
SELECT DISTINCT ON (user_id) * -- Postgres
FROM stg_user
ORDER BY user_id, updated_at DESC
) s
ON t.user_id = s.user_id
...

Or using window functions (portable):
USING (
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) rn
FROM stg_user
) WHERE rn = 1
) s

Execution on lakehouse (Iceberg/Delta)
MERGE on a lakehouse is not an in-place update. Files are immutable. The steps:
- Scan target to find files touched by matched rows.
- Scan source and join against target on key.
- For each target file touched, read it fully, apply the merge, write a new file with updated/inserted rows.
- Commit: swap the manifest to reference new files instead of old.
- Old files remain (for time travel) until vacuum.
Implications:
- Write amplification: updating 1% of rows in a 10 GB file still rewrites the whole 10 GB file. This is why small, well-sized files matter — bigger files = more write amplification per update.
- Merge-on-read vs copy-on-write: Iceberg v2 and Delta support "merge-on-read" mode where deletes/updates are stored as separate delete files (positional or equality deletes), merged at read time. Cheaper writes, more expensive reads until compaction.
- Compaction is essential. Without periodic `OPTIMIZE` / `rewrite_data_files`, merge-on-read tables slow down over time.
Copy-on-write vs merge-on-read
Copy-on-write (default in Iceberg v1, Delta default):
Update → rewrite touched files entirely.
Pros: fast reads, no merge at query time.
Cons: slow writes, high write amplification.
Use: update-light workloads (SCD2, daily overwrites).
Merge-on-read (Iceberg v2, Delta with deletion vectors):
Update → write delete file + new rows file.
Pros: fast writes.
Cons: reads must apply delete vectors; slower until compacted.
Use: update-heavy workloads (CDC, streaming upserts).
Configure per table based on expected update pattern.
Partition-aware MERGE
If the target is partitioned and the merge key lets the planner prune partitions, only those are touched. Make sure the partition column appears in the ON clause:
-- Good: planner prunes partitions where dt doesn't appear in source
MERGE INTO target t
USING source s
ON t.user_id = s.user_id AND t.dt = s.dt -- includes partition col
...

Without the partition predicate in the ON clause, the engine scans all partitions.
Spark / Delta MERGE performance
from delta.tables import DeltaTable
# Enable deletion vectors (merge-on-read) for this table
spark.sql("""
ALTER TABLE silver.events SET TBLPROPERTIES (
'delta.enableDeletionVectors' = 'true'
)
""")
t = DeltaTable.forName(spark, "silver.events")
(t.alias("t")
.merge(
source=stg.alias("s"),
condition="t.event_id = s.event_id AND t.dt = s.dt"
)
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute())

Tips:
- Narrow the MERGE with a partition predicate if possible.
- Use `spark.databricks.delta.merge.repartitionBeforeWrite.enabled = true` to redistribute writes for better parallelism.
- Monitor `numOutputRows` vs `numTargetRowsUpdated` — if you're rewriting 100% for a 1% update, reconsider file sizing.
5. Incremental Processing Patterns
Full refresh is simple but expensive above ~100 GB. Incremental processing is cheap but introduces correctness traps.
Pattern 1 — High watermark incremental
Track the max processed timestamp per source; next run pulls everything since.
def get_high_watermark() -> datetime:
return spark.sql("""
SELECT MAX(event_ts) FROM silver.events
""").collect()[0][0] or datetime.min
def run_incremental():
hwm = get_high_watermark()
new_rows = spark.read.table("bronze.events") \
.filter(F.col("event_ts") > hwm)
# process...
new_rows.write.mode("append").saveAsTable("silver.events")

Correctness traps:
- Late-arriving data with `event_ts < hwm` never gets processed. Fix: widen the window (process `event_ts > hwm - safety_window`, then dedupe).
- Clock skew across producers can push the HWM forward too fast: one future-dated event_ts skips everything that arrives behind it. Fix: cap the HWM at the current time, or derive it from an ingestion timestamp the pipeline assigns rather than trusting producer clocks.
- A failed run that half-updated the target leaves the HWM in a wrong state. Fix: watermark is derived from the committed target, not stored separately.
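The widen-and-dedupe fix from the first trap, sketched in plain Python; the row shape, safety window, and tracking of already-landed event IDs are illustrative.

```python
from datetime import datetime, timedelta

def incremental_pull(source, target_ids, hwm, safety=timedelta(hours=24)):
    """Pull everything after (hwm - safety window), then drop rows already landed."""
    candidates = [r for r in source if r["event_ts"] > hwm - safety]
    return [r for r in candidates if r["event_id"] not in target_ids]

hwm = datetime(2026, 4, 19, 0, 0)
source = [
    {"event_id": "a", "event_ts": datetime(2026, 4, 18, 22, 0)},  # already processed
    {"event_id": "b", "event_ts": datetime(2026, 4, 18, 23, 0)},  # late arrival, ts < hwm
    {"event_id": "c", "event_ts": datetime(2026, 4, 19, 1, 0)},   # genuinely new
]
already_landed = {"a"}

new_rows = incremental_pull(source, already_landed, hwm)
# A naive `event_ts > hwm` pull would have silently lost "b".
assert {r["event_id"] for r in new_rows} == {"b", "c"}
```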
Pattern 2 — Partition-based incremental
The target has daily partitions; each run processes one partition's worth of source data.
def run_incremental(dt: date):
source_for_day = (spark.read.table("bronze.events")
.filter(F.col("dt") == dt)) # safety: always one partition
processed = transform(source_for_day)
(processed.writeTo("silver.events")
    .overwrite(F.col("dt") == F.lit(str(dt))))

Cleaner than HWM — each partition is an atomic unit. Reruns are safe. Backfills loop over dates.
Pattern 3 — Snapshot-diff incremental (Iceberg / Delta)
Read only the changes to a source table since the last processed snapshot.
# Iceberg
df_changes = (spark.read
.format("iceberg")
.option("start-snapshot-id", last_snapshot)
.option("end-snapshot-id", current_snapshot)
.load("warehouse.bronze.events"))
# Delta: Change Data Feed (CDF)
df_changes = (spark.read
.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", last_version)
.option("endingVersion", current_version)
.load("/path/to/bronze/events"))
# Rows include _change_type = 'insert' | 'update_preimage' | 'update_postimage' | 'delete'

Much cheaper than scanning the whole table when most data is unchanged. Used for streaming-from-batch and silver → gold refresh.
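Applying a change feed downstream amounts to replaying _change_type rows against the target, which is easy to model in plain Python; the change-type strings follow Delta's CDF, and the row shape is illustrative.

```python
def apply_changes(target: dict, changes: list) -> dict:
    """Replay a change feed: inserts/postimages upsert, deletes remove, preimages are skipped."""
    for c in changes:
        kind, key = c["_change_type"], c["id"]
        if kind in ("insert", "update_postimage"):
            target[key] = c["value"]
        elif kind == "delete":
            target.pop(key, None)
        # 'update_preimage' carries the old value; no action needed on apply
    return target

t = {1: "a", 2: "b"}
changes = [
    {"_change_type": "update_preimage",  "id": 1, "value": "a"},
    {"_change_type": "update_postimage", "id": 1, "value": "A"},
    {"_change_type": "delete",           "id": 2, "value": "b"},
    {"_change_type": "insert",           "id": 3, "value": "c"},
]
assert apply_changes(t, changes) == {1: "A", 3: "c"}
```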
Pattern 4 — Merge-on-arrival (streaming batch)
For sources that arrive as a stream but are processed in micro-batches (Spark Structured Streaming with Trigger.AvailableNow):
(spark.readStream
.format("iceberg")
.load("warehouse.bronze.events")
.writeStream
.format("iceberg")
.option("path", "warehouse.silver.events")
.option("checkpointLocation", "s3://.../ckpt/silver_events/")
.trigger(availableNow=True) # process whatever's available, then stop
.start()
.awaitTermination())

Best of both: streaming semantics (state, checkpoints, exactly-once), batch cadence (scheduled run, then stop).
6. Backfills — Design, Safety, and Throttling
A backfill is how pipelines prove themselves. The healthy pattern:
- Parameterized job (takes `logical_date`).
- Idempotent writes (partition overwrite or MERGE).
- Separate orchestrator (dedicated DAG or job) that doesn't block the daily schedule.
- Rate limiting to avoid flooding upstream or spiking cost.
- Monitoring: backfill progress, failure rate, error logs.
Backfill driver script
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def backfill(start: date, end: date, parallelism: int = 4, throttle_s: int = 30):
dates = [start + timedelta(days=i) for i in range((end - start).days + 1)]
with ThreadPoolExecutor(max_workers=parallelism) as pool:
futures = {}
for i, d in enumerate(dates):
futures[pool.submit(run_daily, d)] = d
time.sleep(throttle_s / parallelism) # stagger starts
for f in as_completed(futures):
d = futures[f]
try:
f.result()
log.info(f"backfilled {d}")
except Exception as e:
log.error(f"backfill failed for {d}: {e}")
# Decide: halt entire backfill, or continue and report?

Key knobs:
- Parallelism: run N days simultaneously. Too high = overwhelms source / cluster.
- Throttle: pause between submissions. Prevents thundering herd.
- Error policy: fail fast vs continue. Depends on whether days are independent.
Airflow dynamic task mapping for backfills
from airflow.decorators import dag, task
from datetime import datetime, timedelta
@dag(
schedule=None, # triggered manually
start_date=datetime(2026, 1, 1),
catchup=False,
max_active_tasks=8, # concurrency limit
)
def backfill_playback():
@task
def list_dates(start: str, end: str) -> list[str]:
s = datetime.fromisoformat(start).date()
e = datetime.fromisoformat(end).date()
return [(s + timedelta(days=i)).isoformat() for i in range((e-s).days+1)]
@task(retries=2)
def run_one(dt: str):
run_daily_for(dt)
run_one.expand(dt=list_dates("{{ params.start }}", "{{ params.end }}"))
backfill_playback()

Trigger with airflow dags trigger backfill_playback --conf '{"start":"2026-01-01","end":"2026-02-01"}'.
Backfill correctness
- If the current pipeline version differs from the one that produced the historical data, backfill produces different results. Document this; version-tag the output.
- Upstream sources may have had different schemas historically; backfill code must handle all historical shapes OR start from a normalized bronze.
- Dimensions are SCD2: a fact backfilled today joins to dim versions that were current on the backfill date, not today.
Cost-aware backfills
A naive "backfill one year of events" can cost $50k on a warehouse. Tactics:
- Sampling: backfill 10% first; verify output shape; extrapolate cost; decide.
- Coarse-grained backfill: backfill weekly rollups first; fine-grained only where needed.
- Use a dedicated cluster/warehouse: don't let the backfill contend with production.
- Cold-path data: older data goes to cheaper storage tiers (S3 Glacier, Snowflake external tables).
7. File Format Internals — Parquet, ORC, Avro
Parquet (columnar, analytic)
Layout:
File
├── Magic bytes "PAR1"
├── Row Group 1
│ ├── Column Chunk: col_a
│ │ ├── Page 1 (data + RLE/dict encoding)
│ │ ├── Page 2
│ │ └── ...
│ ├── Column Chunk: col_b
│ └── ...
├── Row Group 2
├── ...
├── File Metadata (schema, row group locations, column stats)
└── Magic bytes "PAR1"
- Row group size: 128 MB default. Smaller = finer pruning but more overhead per file. Larger = more efficient scans but coarser filtering.
- Page size: 1 MB default. The unit of decompression.
- Dictionary page: encode low-cardinality columns as ints pointing to a dictionary. Spark writes dictionary pages up to 1 MB; then falls back to plain encoding.
Statistics available:
- Per row group, per column: min, max, null_count, distinct_count (optional).
- Bloom filters (Parquet 2.5+, optional): probabilistic set membership per column chunk.
Encodings:
- PLAIN — raw values, no compression.
- RLE_DICTIONARY — RLE of dictionary IDs. Default for most columns.
- DELTA_BINARY_PACKED — delta encoding. Good for sorted integers / timestamps.
- DELTA_BYTE_ARRAY — delta encoding for strings with common prefixes.
Compression:
- SNAPPY — fast, moderate compression. Default.
- ZSTD — slower, better compression (10-20% smaller than SNAPPY). Preferred in 2026.
- GZIP — very slow, best compression. Rarely used.
- LZ4 — fastest decompression.
Reading pattern (what the engine does):
- Open file, seek to footer, read metadata.
- Apply filters against per-row-group stats; skip non-matching row groups.
- For each surviving row group, for each needed column, seek to column chunk.
- Decode pages; apply predicates again at page level (if dictionary-filter eligible).
- Assemble rows.
This is why selecting only needed columns and filtering on stat-having columns is so fast in Parquet.
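Steps 2 and 3, the stats-based skip, can be simulated in a few lines of plain Python; the row-group stats values are made up for illustration.

```python
# Each row group carries per-column min/max stats in the footer.
row_groups = [
    {"id": 0, "stats": {"watch_ms": (0, 5_000)}},
    {"id": 1, "stats": {"watch_ms": (4_000, 90_000)}},
    {"id": 2, "stats": {"watch_ms": (100_000, 400_000)}},
]

def surviving_groups(groups, column, lo, hi):
    """Keep a row group only if its [min, max] overlaps the predicate range [lo, hi]."""
    out = []
    for g in groups:
        gmin, gmax = g["stats"][column]
        if gmax >= lo and gmin <= hi:
            out.append(g["id"])
    return out

# Predicate: WHERE watch_ms BETWEEN 50_000 AND 95_000 -> only row group 1 is read.
assert surviving_groups(row_groups, "watch_ms", 50_000, 95_000) == [1]
```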
ORC (columnar, Hive-era)
Similar to Parquet but:
- Stripe = ORC's row group (64 MB default).
- Richer lightweight indexes (per-10k-row bloom filters, not just per-stripe).
- Better complex-type handling historically (structs, maps).
- Compresses slightly better (ZSTD/ZLIB) due to more aggressive type-aware encoding.
Most modern lakehouses use Parquet for ecosystem reasons. ORC dominant in traditional Hadoop.
Avro (row-oriented, streaming)
Row-oriented format. Bad for analytics (must read full rows); good for:
- Streaming source data: Kafka topics often serialize records as Avro with a schema registry.
- Backup / archival: faster to produce, less CPU to write.
- Schema evolution: Avro has a rich schema resolution model (reader schema vs writer schema, nullable union types).
# Writing Avro
df.write.format("avro").save("s3://backup/events/")
# Reading with explicit schema
import avro.schema
schema = avro.schema.parse(open("events.avsc").read())

Format choice
- Analytic tables: Parquet (lakehouse default) or ORC (Hive legacy).
- Streaming bronze (Kafka → lake): Avro in-flight (schema registry), Parquet on landing.
- Archival / backup: Avro or gzipped JSON.
- Row-level state: neither — use a row store (Postgres, RocksDB).
8. Partition Design Math
Partition choice is the #1 physical layout decision in a lakehouse. Bad partitioning wastes more compute than any other mistake.
The cardinality sweet spot
Rule: each partition should hold hundreds of MB to a few GB of uncompressed data, with at least a few hundred partitions total for a large table and not more than ~tens of thousands.
Math:
- Row group size target: 128 MB.
- Files per partition: 1–10 for good parallelism without small-file overhead.
- → Partition size target: 128 MB – 1.3 GB.
For a table with 10 TB of data:
- 10 TB / 500 MB per partition ≈ 20,000 partitions. Upper edge.
- 10 TB / 5 GB per partition ≈ 2,000 partitions. Nicer.
If the natural partition column (e.g., dt) gives you 10,000 days — that's 10,000 partitions, ~1 GB each.
If the natural partition column gives you 10M users — that's 10M partitions of tiny data each. Disaster. Repartition by a coarser key (day + country, or user_id % 1000).
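That arithmetic as a tiny calculator, a plain-Python sketch whose thresholds mirror this section's rule of thumb:

```python
def partition_report(table_bytes: float, partition_count: int) -> str:
    """Classify average partition size against the ~128 MB to ~5 GB sweet spot."""
    mb = 1024 ** 2
    avg = table_bytes / partition_count
    if avg < 128 * mb:
        return f"too small (~{avg / mb:.0f} MB avg): small-file overhead dominates"
    if avg > 5 * 1024 * mb:
        return f"too big (~{avg / mb / 1024:.1f} GB avg): coarse pruning, giant tasks"
    return f"ok (~{avg / mb:.0f} MB avg)"

TB = 1024 ** 4
assert partition_report(10 * TB, 20_000).startswith("ok")          # ~512 MB each
assert partition_report(10 * TB, 10_000_000).startswith("too small")
assert partition_report(10 * TB, 100).startswith("too big")
```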
Partition evolution
Iceberg supports partition evolution: change the partition spec without rewriting data. Old data keeps its partitioning; new writes use the new spec.
-- Started with PARTITIONED BY (days(event_ts)); add a bucket dimension without rewriting
ALTER TABLE silver.events ADD PARTITION FIELD bucket(16, user_id);
-- Old daily partitions survive; new writes use day + user_bucket

This is huge. In traditional Hive, changing partitioning meant rewriting the entire table.
Partition pruning — what actually happens
Query: SELECT * FROM silver.events WHERE dt = '2026-04-19' AND country = 'US'.
- Planner reads table metadata → sees partition spec `days(event_ts)` → partition column dt.
- Planner matches filter `dt = '2026-04-19'` → determines the relevant partition(s).
- Planner reads manifest files for matching partitions only.
- Manifests list data files with per-column stats (min/max of `country`).
- Planner prunes files where `country` stats exclude 'US'.
- Spark launches tasks only against surviving files.
What breaks pruning:
- Function on partition column: `WHERE DATE(event_ts) = '2026-04-19'` — the planner can't prove this matches partition `days(event_ts) = 2026-04-19`. Rewrite: `WHERE event_ts >= '2026-04-19' AND event_ts < '2026-04-20'`.
- Iceberg's hidden partitioning fixes this — `days(event_ts)` is an implicit partition derivation, and filters on `event_ts` directly are pruned.
- Implicit type casts: `WHERE dt = '20260419'` (string) vs partition column is DATE — some engines won't prune. Match types.
- UDFs in predicates: opaque; never pruned.
Hidden partitioning (Iceberg's killer feature)
Hive-style: the partition column is user-managed. You must always write WHERE dt = '...' matching the partition value exactly.
Iceberg: partition derived from source column. User writes WHERE event_ts BETWEEN ...; Iceberg transparently computes days(event_ts) and prunes.
-- Iceberg
CREATE TABLE silver.events (
event_id STRING,
event_ts TIMESTAMP,
user_id STRING,
...
)
USING iceberg
PARTITIONED BY (days(event_ts), bucket(16, user_id));
-- Query naturally uses event_ts (no dt column visible)
SELECT COUNT(*) FROM silver.events
WHERE event_ts BETWEEN '2026-04-19' AND '2026-04-20'
AND user_id = 'abc';
-- Prunes: days(event_ts) matches 2026-04-19 partition
-- Prunes: bucket(16, 'abc') matches one of 16 buckets

Clustering (sort within partition)
Even with good partitioning, within a partition the data order matters. Clustering (Snowflake), z-ordering (Delta), or sort-order (Iceberg) physically sorts rows so filters prune at the file / row-group level.
-- Iceberg: write new data sorted by user_id within each partition
ALTER TABLE silver.events WRITE ORDERED BY user_id;
-- Delta: z-order after writes (multi-column space-filling curve)
OPTIMIZE silver.events ZORDER BY (user_id, country);
-- Snowflake: clustering key (automatic maintenance)
ALTER TABLE silver.events CLUSTER BY (user_id);

When to cluster:
- You frequently filter on a specific high-cardinality column (e.g., `user_id`).
- That column is too high-cardinality to partition on directly.
- Query patterns are read-heavy; the sort amortizes.
Z-order packs rows with similar values across multiple columns. Best when queries filter on a pair of columns unpredictably — e.g., sometimes user_id, sometimes country, sometimes both. Single-column sort is better when there's one dominant filter.
Bucket partitioning
Divide rows into N buckets by hash(key) % N. Useful when:
- Key is high-cardinality (can't partition directly).
- You want predictable file sizes (each bucket has roughly 1/N of data).
- Joins on that key can use bucketed joins (skip shuffle if both sides are bucketed identically).
CREATE TABLE silver.events ( ... )
USING iceberg
PARTITIONED BY (days(event_ts), bucket(32, user_id));

32 buckets × 365 daily partitions = 11,680 partitions for a year. Each partition covers 1 day × (1/32 of users). A query filtering on user_id = 'abc' prunes to 1 day × 1 bucket = small.
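The pruning fraction is easy to check in plain Python. Here md5 stands in for Iceberg's actual bucket transform (which is Murmur3-based), so the bucket numbers won't match Iceberg's, but the determinism and the partition arithmetic carry over:

```python
import hashlib

DAYS, BUCKETS = 365, 32

def bucket(user_id: str, n: int = BUCKETS) -> int:
    """Stable hash -> bucket number. (Python's built-in hash() is salted per process; md5 isn't.)"""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n

total_partitions = DAYS * BUCKETS
assert total_partitions == 11_680
# A query pinned to one day and one user touches 1 of 11,680 partitions.
assert bucket("abc") == bucket("abc")   # deterministic across runs and machines
assert 0 <= bucket("abc") < BUCKETS
```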
9. Small Files and Compaction
The small files problem: tiny files kill read performance because each file incurs fixed overhead (open, footer read, metadata parse). A table with millions of tiny files is slower than one with thousands of right-sized files, despite holding the same data.
Causes of small files
- High write parallelism + low data volume. 200 Spark partitions × 200 hourly runs = 40,000 files even if each run is small.
- Streaming micro-batches with short triggers. 1-minute triggers for a low-volume stream: ~525,000 files per year per topic.
- Skewed partitioning. A partition with lots of data gets sub-partitioned; one with little gets one tiny file.
Prevention at write time
- df.coalesce(N) before write: reduces file count without shuffle. Use N = target_data_size / target_file_size.
- df.repartition(N, ...) before write: shuffles to exactly N partitions. Costs a shuffle but guarantees even file sizes.
- Spark AQE: spark.sql.adaptive.advisoryPartitionSizeInBytes = 128MB — AQE coalesces output partitions to hit this target.
- Iceberg write.target-file-size-bytes table property.
# Target 128 MB files (134217728 bytes). Note: a trailing comment after a
# line-continuation backslash is a syntax error, so it lives up here.
df.write \
.format("iceberg") \
.option("target-file-size-bytes", 134217728) \
.mode("append") \
.saveAsTable("silver.events")
Compaction (post-write)
Running OPTIMIZE or rewrite_data_files periodically rewrites many small files into fewer large ones.
-- Iceberg: compact files smaller than 100 MB, combining to ~512 MB
CALL system.rewrite_data_files(
table => 'silver.events',
options => map(
'target-file-size-bytes', '536870912',
'min-file-size-bytes', '104857600'
)
);
-- Delta
OPTIMIZE silver.events;
Schedule: daily for streaming-ingested tables, weekly for batch-ingested.
Garbage collection (expiring snapshots)
Iceberg/Delta keep old snapshots for time travel. Expire them to reclaim storage:
-- Iceberg: expire snapshots older than 7 days, keep at least 5 most recent
CALL system.expire_snapshots(
table => 'silver.events',
older_than => TIMESTAMP '2026-04-12 00:00:00',
retain_last => 5
);
-- Delta
VACUUM silver.events RETAIN 168 HOURS; -- 7 days
Caution: vacuuming kills time-travel for data older than the retention window. If compliance requires 7-year retention, configure accordingly.
10. CDC (Change Data Capture) Patterns
Source OLTP database → analytics warehouse. The canonical patterns:
Full snapshot (baseline)
Daily full dump of the source table. Simple, correct, expensive. Viable up to ~tens of GB.
# JDBC to Iceberg (full reload)
(spark.read
.format("jdbc")
.option("url", "jdbc:postgresql://...")
.option("dbtable", "public.orders")
.option("partitionColumn", "order_id")
.option("numPartitions", 20)
.option("lowerBound", "1")
.option("upperBound", "100000000")
.load()
.write
.format("iceberg")
.mode("overwrite")
.saveAsTable("bronze.orders"))
Pros: guaranteed correctness, no state. Cons: expensive at scale, lag = 1 day.
Incremental snapshot (changed-rows-only)
Source table has updated_at; pull rows where updated_at > last_pull.
hwm = get_high_watermark("bronze.orders", "updated_at")
(spark.read
.format("jdbc")
.option("dbtable", f"(SELECT * FROM orders WHERE updated_at > '{hwm}') AS t")
...
.write
.format("delta")
.mode("append")
.saveAsTable("bronze.orders_changes"))
# Upsert into bronze.orders using MERGE
Pros: cheaper than full reload. Cons: misses hard deletes (row gone from source with no marker); relies on accurate updated_at.
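The snippet above calls a `get_high_watermark` helper without defining it. A minimal sketch of the bookkeeping it implies — hypothetical names throughout, with a dict standing in for what would be a small control table (Delta/Postgres) in production:

```python
EPOCH = "1970-01-01 00:00:00"

class WatermarkStore:
    """Tracks the max updated_at pulled per (table, column). In production
    this state lives in a control table; a dict stands in here."""
    def __init__(self):
        self._state = {}

    def get(self, table: str, column: str) -> str:
        # First run: no watermark yet, pull everything since the epoch.
        return self._state.get((table, column), EPOCH)

    def advance(self, table: str, column: str, new_hwm: str) -> str:
        # Monotonic: never move the watermark backwards, so a replayed or
        # out-of-order run can only widen the next pull, never skip rows.
        # ISO-8601 strings compare lexicographically == chronologically.
        current = self.get(table, column)
        self._state[(table, column)] = max(current, new_hwm)
        return self._state[(table, column)]

store = WatermarkStore()
store.advance("bronze.orders", "updated_at", "2026-04-19 10:00:00")
store.advance("bronze.orders", "updated_at", "2026-04-18 00:00:00")  # replay: ignored
hwm = store.get("bronze.orders", "updated_at")  # still 2026-04-19 10:00:00
```

The key design choice is advancing the watermark only after the pull *and* the MERGE succeed; advancing it before the write completes is how incremental pipelines silently lose rows.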
Log-based CDC (Debezium, Fivetran, AWS DMS)
Read from the database's write-ahead log (Postgres logical replication, MySQL binlog, SQL Server CDC). Emit Kafka messages per row change, with operation type:
{
"before": {"order_id": 42, "status": "paid", "total": 100},
"after": {"order_id": 42, "status": "shipped", "total": 100},
"op": "u",
"ts_ms": 1713567890000,
"source": {"schema": "public", "table": "orders", "lsn": "0/30A3B8"}
}
Ops: c (create/insert), u (update), d (delete), r (snapshot read).
Pros:
- Sub-second lag.
- Hard deletes captured.
- Exact change history, not just "current state."
- Works without touching the source.
Cons:
- Operationally complex: Kafka, Debezium, schema registry.
- Requires database-side logging (WAL) configured correctly.
- Schema drift at source breaks pipelines.
Upserting CDC into a lakehouse
cdc_stream = (spark.readStream
.format("kafka")
.option("subscribe", "cdc.public.orders")
...
.load())
# Parse Debezium envelope
parsed = cdc_stream.select(
F.from_json(F.col("value").cast("string"), debezium_schema).alias("env")
).select(
"env.op", "env.after", "env.before", "env.ts_ms"
)
# Apply to target with forEachBatch + MERGE
def merge_batch(batch_df, batch_id):
from delta.tables import DeltaTable
t = DeltaTable.forName(spark, "silver.orders")
(t.alias("t")
.merge(
batch_df.alias("s"),
"t.order_id = s.after.order_id OR (s.op = 'd' AND t.order_id = s.before.order_id)"
)
.whenMatchedDelete(condition="s.op = 'd'")
.whenMatchedUpdateAll(condition="s.op = 'u' AND s.ts_ms > t.cdc_ts_ms")
.whenNotMatchedInsert(
condition="s.op IN ('c','r')",
values={
"order_id": "s.after.order_id",
"status": "s.after.status",
...
"cdc_ts_ms": "s.ts_ms"
}
)
.execute())
(parsed.writeStream
.foreachBatch(merge_batch)
.option("checkpointLocation", "s3://.../ckpt/orders_cdc/")
.trigger(processingTime="30 seconds")
.start())
The cdc_ts_ms > t.cdc_ts_ms guard prevents out-of-order CDC events from overwriting a newer state with an older one.
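A minimal in-memory model of that MERGE (illustrative, not the Delta API) makes the guard's behavior easy to verify: apply Debezium-style events to a keyed table, with the ts_ms check rejecting stale updates exactly like the `whenMatchedUpdateAll` condition above:

```python
def apply_cdc(table: dict, event: dict) -> None:
    """Apply one Debezium-style envelope to an in-memory keyed table."""
    op, ts = event["op"], event["ts_ms"]
    if op in ("c", "r"):                      # insert / snapshot read
        row = dict(event["after"], cdc_ts_ms=ts)
        table[row["order_id"]] = row
    elif op == "u":                           # update, guarded by ts_ms
        key = event["after"]["order_id"]
        current = table.get(key)
        if current is not None and ts > current["cdc_ts_ms"]:
            table[key] = dict(event["after"], cdc_ts_ms=ts)
    elif op == "d":                           # hard delete
        table.pop(event["before"]["order_id"], None)

orders = {}
apply_cdc(orders, {"op": "c", "ts_ms": 100, "after": {"order_id": 42, "status": "paid"}})
apply_cdc(orders, {"op": "u", "ts_ms": 300, "after": {"order_id": 42, "status": "shipped"}})
# A delayed event with an older ts_ms arrives last — the guard drops it:
apply_cdc(orders, {"op": "u", "ts_ms": 200, "after": {"order_id": 42, "status": "paid"}})
```

Note that, like the MERGE it mirrors, an update for a key that was never inserted is silently dropped — acceptable only if the initial snapshot (`r` events) is guaranteed to precede updates.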
CDC + SCD2
For downstream dim tables: the stream of CDC events drives SCD2 insertion of new rows. Every update event closes the current row and inserts a new one; every delete marks the current row as invalidated.
11. Data Quality in Batch Pipelines
Quality checks live alongside logic — not "if we have time later."
Taxonomy of checks
| Check type | Example | When to run |
|---|---|---|
| Schema | columns exist with expected types | Every run, fail fast |
| Row count | count > 0, count within ±20% of 7-day avg | Every run, warn on drift |
| Uniqueness | PK is unique | Every run, hard fail on violation |
| Referential | all FKs resolve | Every run, warn or fail |
| Null rate | non-key columns < 1% null | Every run, warn |
| Distribution | numeric columns within expected p99 | Periodic, drift detection |
| Freshness | max(event_ts) within SLO | Every run, SLA alert |
| Volume | same volume as upstream, ±5% | Every run, warn |
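The "count within ±20% of 7-day avg" row-count check from the table above can be sketched in a few lines (hypothetical helper, thresholds as stated):

```python
def row_count_drift(today: int, history: list[int], tolerance: float = 0.20):
    """Warn-on-drift check: is today's row count within ±tolerance of the
    trailing-window average? history = daily counts for the last 7 days.
    Returns (ok, baseline) so the caller can log the baseline it compared to."""
    baseline = sum(history) / len(history)
    ok = abs(today - baseline) <= tolerance * baseline
    return ok, baseline

ok, baseline = row_count_drift(80_000, [100_000] * 7)
# 80,000 is exactly -20% of the 100,000 baseline: passes at the boundary
```

In practice this fires as a warning, not a hard fail — legitimate traffic dips (holidays, incidents upstream) would otherwise page you for correct data.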
dbt tests (declarative)
# models/silver/playback_sessions.yml
version: 2
models:
- name: playback_sessions
description: "Sessionized playback events"
columns:
- name: session_id
tests:
- unique
- not_null
- name: user_id
tests:
- not_null
- relationships:
to: ref('dim_user')
field: user_id
- name: watch_ms
tests:
- dbt_utils.accepted_range:
min_value: 0
max_value: 86400000 # 24h in ms
tests:
- dbt_utils.recency:
datepart: hour
field: max_end_ts
interval: 2
Great Expectations (imperative + metadata)
import great_expectations as gx
context = gx.get_context()
ds = context.sources.add_pandas("pd")
asset = ds.add_parquet_asset(
"playback_sessions",
filepath_or_buffer="s3://lake/silver/playback_sessions/dt=2026-04-19/*.parquet"
)
batch = asset.build_batch_request()
validator = context.get_validator(batch_request=batch)
validator.expect_column_values_to_not_be_null("session_id")
validator.expect_column_values_to_be_unique("session_id")
validator.expect_column_values_to_be_between("watch_ms", min_value=0, max_value=86_400_000)
validator.expect_column_mean_to_be_between("watch_ms", min_value=60_000, max_value=3_600_000)
validator.save_expectation_suite("playback_sessions.suite")
result = validator.validate()
if not result.success:
raise ValueError(f"DQ failed: {result}")
Soda Core (YAML-first)
checks for silver.playback_sessions:
- row_count > 1000000
- missing_count(session_id) = 0
- duplicate_count(session_id) = 0
- values in (device_type) must be in ['tv','mobile','tablet','web','other']
- freshness(end_ts) < 1h
- anomaly score for row_count < 3
Designing "what to check"
- Every PK column: not null, unique.
- Every FK: relationships to parent (or tolerate N% unresolved with a warning).
- Every measure: physically plausible range (no negatives on watch time, etc.).
- Every timestamp: within expected date range (catch 1970-01-01 defaults).
- Every enum column: values in allowed set.
- Daily volumes: within ±20% of 7-day rolling average; alert on anomalies.
Over-checking is cheap; under-checking is expensive.
12. Orchestration Patterns for Batch
The simple scheduled DAG
One task per stage, linear dependencies, cron schedule.
extract → dedupe → transform → validate → write → publish
Best for: small, well-understood pipelines.
Asset-based orchestration (Dagster)
Instead of tasks-with-dependencies, declare the datasets (assets) and their producers. Dagster figures out the dependency graph.
from dagster import asset
@asset
def bronze_events(context):
return pull_from_kafka(context)
@asset
def silver_events(bronze_events):
return transform(bronze_events)
@asset
def gold_daily_plays(silver_events):
return aggregate(silver_events)
Materializing gold_daily_plays rebuilds upstream as needed.
Sensor-based triggering
Instead of cron, trigger when upstream data arrives.
# Airflow
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
wait = S3KeySensor(
task_id="wait_for_upstream",
bucket_key="s3://upstream/events/{{ ds }}/_SUCCESS",
timeout=60 * 60 * 4,
poke_interval=300,
)
For lakehouse tables: dataset-based scheduling (Airflow 2.4+) reacts when the upstream table is updated.
Backfill separation
Never use the same DAG for daily and backfill. Daily runs should not contend with backfills. Backfill DAG:
- Dedicated compute pool.
- Lower priority.
- Separate alerting (backfill failures rarely need 3am pages).
SLA management
Every table should have:
- Freshness SLO — "updated within 1h of upstream seal."
- Completeness SLO — "≥99% of expected row count."
- Correctness SLO — "DQ tests green on every run."
Expose as metrics; alert on violations with an on-call rotation.
# Airflow SLA miss callback
def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
notify_on_call(...)
default_args = {
"sla": timedelta(hours=1),
"sla_miss_callback": sla_miss_callback,
}
Next: 03-streaming-processing.md — unbounded data, event-time processing, watermark math, Flink and Kafka internals, exactly-once proofs.
13. Dependency Graphs and Critical Path Analysis
A batch "pipeline" at scale is never a chain — it's a directed acyclic graph of hundreds of tasks with cross-dataset dependencies. Two questions dominate: when will the whole graph finish, and which node, if it slips, slips the whole graph? These are critical-path questions.
Modeling the graph
Each node is a task with a duration. Each edge expresses "task B cannot start until task A succeeds." Total graph completion time is the longest-path sum from any source to any sink. That longest path is the critical path.
Nodes off the critical path have slack — the amount they can slip without affecting total completion time. Nodes on the critical path have zero slack. A single extra minute on any critical-path task adds a minute to the whole pipeline.
Why this matters for interviews
"Our daily pipeline finishes at 6am but the business wants it by 4am — how do you speed it up?" Wrong answer: "I'd optimize the slowest job." Right answer: "I'd find the critical path and optimize the longest task on it. Optimizing off-path tasks won't move completion time at all."
The Airflow / Dagster reality
Most orchestrators expose task duration logs. Extract them, build the adjacency matrix, compute longest-path via topological sort + dynamic programming. O(V+E). For a 500-task graph, runs in sub-second and tells you exactly which five tasks you must optimize.
# Simplified: compute critical path duration
def critical_path(nodes, edges, duration):
# topological sort
indeg = {n: 0 for n in nodes}
for a, b in edges: indeg[b] += 1
order, q = [], [n for n,d in indeg.items() if d == 0]
while q:
n = q.pop(0); order.append(n)
for a, b in edges:
if a == n:
indeg[b] -= 1
if indeg[b] == 0: q.append(b)
# DP: earliest finish time
finish = {n: duration[n] for n in nodes}
for n in order:
for a, b in edges:
if a == n:
finish[b] = max(finish[b], finish[n] + duration[b])
return max(finish.values())
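The function above returns only the total makespan. Finding *slack* — how much each node can slip — needs a second, backward pass. A sketch under the same DAG model (adjacency lists and a deque keep it O(V+E)):

```python
from collections import deque

def slack_per_node(nodes, edges, duration):
    """Earliest-start / latest-start DP over a task DAG.
    Nodes with zero slack form the critical path; a node with positive
    slack can slip by that much without delaying total completion."""
    succ = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    order, q = [], deque(n for n in nodes if indeg[n] == 0)
    while q:                                   # Kahn's topological sort
        n = q.popleft(); order.append(n)
        for b in succ[n]:
            indeg[b] -= 1
            if indeg[b] == 0: q.append(b)
    earliest = {n: 0 for n in nodes}           # forward pass: earliest start
    for n in order:
        for b in succ[n]:
            earliest[b] = max(earliest[b], earliest[n] + duration[n])
    makespan = max(earliest[n] + duration[n] for n in nodes)
    latest = {n: makespan - duration[n] for n in nodes}
    for n in reversed(order):                  # backward pass: latest start
        for b in succ[n]:
            latest[n] = min(latest[n], latest[b] - duration[n])
    return {n: latest[n] - earliest[n] for n in nodes}

# a(1h) feeds b(5h) and c(2h): a→b is the critical path; c has 3h of slack
slack = slack_per_node(["a", "b", "c"], [("a", "b"), ("a", "c")],
                       {"a": 1, "b": 5, "c": 2})
```

Reporting slack per task is what turns the interview answer from "find the critical path" into a concrete optimization list: sort ascending by slack and start at zero.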
14. The Batch Cost Model
Senior candidates should be able to estimate batch cost without invoking the cloud-provider pricing page. The model has four terms.
The four costs
- Storage. GB-months × $/GB-month. Flat. For S3 Standard, ~$0.023/GB-month. For Glacier, ~1/10 of that.
- Compute. Core-hours × $/core-hour. EC2 spot typically $0.01–0.02 per vCPU-hour; on-demand 3x that.
- Shuffle / intermediate I/O. Bytes written to local disk or the shuffle service. On ephemeral disk, free but capacity-limited. On remote shuffle (cloud managed), charged per GB.
- Egress. Cross-region or internet egress. The silent killer. $0.02–0.09 per GB depending on direction. Can dwarf compute cost for lift-and-shift workloads.
Worked estimate — 10 TB daily batch
Input 10 TB. One Spark job with a 3-way join producing 2 TB output. Shuffle amplification typically 2–4x input — say 30 TB of shuffle. Runtime 40 min on 100 vCPUs.
- Compute: 100 vCPU × 0.67 hour × $0.02 = $1.33
- Storage (same-region S3, input rereadable): negligible incremental
- Shuffle: 30 TB local disk, free
- Egress: zero if same region, ~$900 if cross-region output
The answer "this job costs roughly $1.50 same-region, $900+ cross-region" is the senior-level answer. Notice the 600x swing just on egress — that's why "lift-and-shift to another region" is so often the wrong design move.
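The worked estimate above reduces to a back-of-envelope function. The prices are the document's illustrative figures, not live quotes — treat the function as a sketch of the model, not a billing tool:

```python
def batch_cost(vcpus: int, hours: float, price_per_vcpu_hour: float = 0.02,
               egress_tb: float = 0.0, egress_per_gb: float = 0.09) -> float:
    """Two of the four cost terms dominate most same-account batch jobs:
    compute (core-hours) and egress (the silent killer). Storage and local
    shuffle are treated as negligible here, per the worked example."""
    compute = vcpus * hours * price_per_vcpu_hour
    egress = egress_tb * 1024 * egress_per_gb   # TB → GB at the per-GB rate
    return round(compute + egress, 2)

same_region = batch_cost(100, 0.67)                  # ≈ $1.34: compute only
cross_region = batch_cost(100, 0.67, egress_tb=10)   # ≈ $923: egress dominates
```

The ~600x swing between the two calls is the whole point of the model: the job didn't change, only where its output landed.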
15. Parallelism Math and the Skew Tax
Perfectly parallel work distributes across N workers with speedup ≈ N. Real work rarely achieves this. Two sources of loss: sequential fraction (Amdahl's law) and skew (one partition is fatter than the rest).
The skew tax, quantified
If you partition 1 TB of work into 100 evenly-sized pieces of 10 GB each, total wall-clock time is ≈ (10 GB work / per-worker throughput). If instead one partition is 500 GB and the other 99 average 5 GB, total wall-clock is bounded below by 500 GB / throughput — 50x slower than the balanced case, despite the same total work. That's the skew tax.
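The bound in the paragraph above — wall-clock time is at least the largest single partition, no matter how many workers you add — can be demonstrated with a toy greedy scheduler (an illustration of the principle, not how Spark actually assigns tasks):

```python
def makespan_minutes(partition_sizes_gb, workers, throughput_gb_per_min=1.0):
    """Greedy longest-first assignment: each partition goes to the currently
    least-loaded worker. Wall-clock = the busiest worker's total load."""
    loads = [0.0] * workers
    for size in sorted(partition_sizes_gb, reverse=True):
        loads[loads.index(min(loads))] += size
    return max(loads) / throughput_gb_per_min

balanced = makespan_minutes([10] * 100, workers=100)       # 10 min: 1 TB, even
skewed = makespan_minutes([500] + [5] * 99, workers=100)   # 500 min: same ~1 TB
```

Same total data, same worker count, 50x the wall-clock — the skew tax, quantified. Adding workers does nothing here; only splitting the 500 GB partition helps.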
Detecting skew ahead of time
- Compute the partition size distribution: SELECT part_key, COUNT(*), SUM(bytes) FROM … GROUP BY 1 ORDER BY 2 DESC LIMIT 20. If the top partition is 10x the median, you have skew.
- Coefficient of variation (stddev / mean) on partition sizes. Above 0.5 = significant skew.
Mitigations
- Salting. Suffix the skewed key with a small random integer (e.g., key || '_' || CAST(FLOOR(RAND() * 10) AS INT)), join, then aggregate up. Fixes the shuffle but doubles join complexity.
- Broadcast. If the other side is small, broadcast it and avoid shuffling the skewed key entirely.
- Isolate the hotspot. Split the skewed keys into a separate job with different parallelism.
- AQE. Spark's adaptive skew-join detects skewed partitions at runtime and splits them. Configure spark.sql.adaptive.skewJoin.enabled=true, and know what skewedPartitionThresholdInBytes controls.
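To see what salting actually does to the shuffle, here is a pure-Python sketch (hypothetical helper names; in a real job these would be column expressions on the fact and dimension DataFrames):

```python
import random

N_SALTS = 10  # number of sub-keys the hot key is spread across

def salt_fact_key(key: str) -> str:
    """Fact side: append a random salt, so the hot key's rows spread
    across N_SALTS shuffle partitions instead of one."""
    return f"{key}_{random.randrange(N_SALTS)}"

def explode_dim_key(key: str) -> list[str]:
    """Dimension side: replicate each row once per salt so every salted
    fact key still finds its match in the join."""
    return [f"{key}_{i}" for i in range(N_SALTS)]

random.seed(0)
salted = [salt_fact_key("hot_user") for _ in range(10_000)]
per_salt = {k: salted.count(k) for k in set(salted)}  # ~1,000 rows per sub-key
```

The cost is visible in `explode_dim_key`: the dimension side is replicated N_SALTS times, which is why salting only pays off when the skewed side dwarfs the other.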
16. Checkpoint and Restart Semantics
Batch jobs are not point-in-time atomic. A 6-hour job that fails at hour 4 either resumes from a checkpoint (if designed for it) or restarts from zero. The senior design decision is which.
Three restart strategies
- Fully idempotent restart from start. Every job is safe to rerun end-to-end. Requires deterministic inputs and idempotent sinks (MERGE on business key). Simple; operationally resilient. Cost: wasted compute on every failure.
- Task-level checkpointing. Orchestrator records task completion; restart resumes at the failed task. Works for DAG-structured jobs. Requires each task's output to be written atomically (typically via temp-then-rename).
- Intra-task checkpointing. A long task persists progress state (e.g., partition cursor) and resumes from it. Necessary for multi-hour jobs. Expensive to implement right; usually worth it only for jobs that exceed ~2 hours.
The atomic-write discipline
Every job output must be all-or-nothing. The pattern: write to a temp location, validate, then atomic rename or commit. Never "write in place and pray." Lakehouse formats (Iceberg, Delta) get you atomic commits for free. Hive partitioned tables do not — use _SUCCESS markers and write to a staging path first.
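For outputs outside a lakehouse format, the temp-then-rename pattern looks like this — a minimal single-file sketch (JSON for illustration; the same discipline applies to staging directories and `_SUCCESS` markers):

```python
import json
import os
import tempfile

def atomic_write_json(path: str, rows: list[dict]) -> None:
    """Write to a temp file in the same directory, fsync, then atomically
    rename into place. A reader sees either the old file or the complete
    new one, never a partial write. os.replace is atomic on POSIX when
    source and destination are on the same filesystem."""
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(rows, f)
            f.flush()
            os.fsync(f.fileno())   # durable on disk before it becomes visible
        os.replace(tmp, path)      # the atomic commit point
    except BaseException:
        os.unlink(tmp)             # crash cleanup: never leave half-written temps
        raise
```

The temp file living in the *same directory* matters: rename is only atomic within a filesystem, which is also why object stores (no atomic rename) need the lakehouse commit protocols instead.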
The "restart at noon" drill
Interviewers love to ask: "Yesterday's job failed at 11pm. You rerun at noon today. What happens downstream?" The senior answer enumerates: which tables get rewritten, which downstream caches invalidate, which dashboards see a stale→fresh transition, which SLAs trip. If you can't answer this for your own pipelines, you don't own them operationally — no matter what the org chart says.
17. Partition Key Decisions by Domain
Choosing a partition key is the single highest-leverage physical-design decision in a lakehouse or warehouse table. A good choice makes common queries prune 99% of the data; a bad one produces billion-row scans for every dashboard. Below is a pattern-match table of typical domains, the natural partition key, the access patterns that justify it, and the most common wrong choice candidates reach for first.
| Domain / table | Recommended partition key | Secondary (clustering/sort) | Why | Common wrong pick |
|---|---|---|---|---|
| Events (app, web, mobile) | event_date | user_id, event_name | Nearly all queries bound by a date range | user_id — cardinality explosion, files per user explode |
| Orders / transactions | order_date | customer_id, region | Daily batches, MoM reports, finance closes — all date-bounded | order_id — gives one row per partition |
| Clickstream / impressions | event_date + hour (hourly) | session_id, campaign_id | High volume; hourly grain keeps partitions under ~1 TB | campaign_id — skew on hero campaigns |
| Payments / ledger | posting_date | merchant_id, account_id | Financial close runs on posting date | merchant_id — power-law skew |
| Subscription billing | invoice_month | account_id | Billing cycles are monthly | account_id — tenant skew; too few files for small tenants |
| CDC / change log | change_date | source_table, pk_hash | Replay and backfill are date-bounded | source_table — one huge partition per high-volume table |
| IoT telemetry | ingest_date + hour | device_id bucket (16–64 buckets) | Continuous high-volume stream; hourly prevents partition bloat | device_id raw — cardinality blowup |
| Logs (systems) | log_date + hour | service, severity | Investigations bounded by time; joining across services is rare | host_id — millions of tiny partitions |
| Media catalog | — (no partition; small) | title_id cluster | A catalog of 10–100 K items is too small to benefit from partitioning | partitioning by genre — rarely prunes |
| Playback / watch events | event_date | title_id, region | Engagement reporting is date-bounded; title-level rollups are frequent enough to cluster on | title_id — one-hit wonders create skew |
| Ad impressions | event_date + hour | advertiser_id, placement_id | Volume is extreme; hourly is the right grain | advertiser_id — top advertiser dominates |
| Inventory snapshots | snapshot_date | warehouse_id, sku | Snapshots are by date; comparisons are date-vs-date | sku — SKU count is in the millions |
| Clinical claims | service_date | payer_id, provider_id | Service date anchors claims processing and audit | patient_id — cardinality and privacy issues |
| Ride / trip data | trip_end_date | city_id, driver_id | Operational reporting is daily per city | driver_id — long tail of one-trip drivers |
| Support tickets | opened_date | product_area, priority | SLA and backlog queries are date-bounded | customer_id — sparse per customer |
| CRM activity | activity_date | owner_id, account_id | Rep productivity reports run by week/month | account_id — uneven size |
| Search queries | query_date | query_hash bucket | Investigations and model training are both date-bounded | query_text — one partition per unique query |
| Model training features | feature_date | entity_id bucket | Training sets are date-bounded; entity_id is bucketed to avoid skew | entity_id raw — skew + small partitions |
Sizing heuristic
Aim for data files of ~128 MB – 1 GB after compaction, and keep any single partition roughly under ~1 TB. Below ~100 MB per partition, metadata overhead dominates; far above ~1 TB, queries that touch one partition still scan enormous amounts of data. For a 100 GB/day fact, daily partitioning is comfortable. For a 1 TB/day fact, daily is at the upper edge but workable. For a 10 TB/day fact, move to hourly.
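The heuristic implied by those examples can be written down as a tiny grain chooser — thresholds are the ones sketched above, not universal constants:

```python
def pick_grain(daily_gb: float) -> str:
    """Choose a time-partition grain from daily data volume, keeping one
    partition's data roughly under ~1 TB after compaction."""
    if daily_gb < 1:        # tiny table: metadata overhead would dominate
        return "none"
    if daily_gb <= 1_000:   # up to ~1 TB/day: one partition per day
        return "daily"
    return "hourly"         # beyond that, split each day into 24 partitions
```

Usage: `pick_grain(100)` returns `"daily"`, `pick_grain(10_000)` returns `"hourly"` — matching the 100 GB/day and 10 TB/day examples above.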
The bucketed-id trick
When you have a high-cardinality key (user_id, device_id) that users filter on but can't partition on directly, hash it into a fixed number of buckets and use that as a secondary partition. The query can push the bucket predicate down (WHERE user_bucket = hash(user_id) % 32), pruning ~97% of partitions while keeping each partition reasonable size. This is how Iceberg's hidden partitioning works under the hood for bucket transforms.
-- Iceberg: hidden bucketing — no user code changes
CREATE TABLE events (
event_date DATE,
user_id BIGINT,
event_name STRING,
...
)
USING ICEBERG
PARTITIONED BY (event_date, bucket(32, user_id));
-- Queries that filter on user_id automatically prune 31/32 buckets
SELECT * FROM events
WHERE event_date = '2026-04-20' AND user_id = 12345;
18. NULL Semantics — Examples by Domain
NULL is never a single thing. It carries different meanings across columns and domains, and conflating them is a root cause of subtle downstream bugs. The rule: every nullable column in a contract must document which meaning applies. Examples below.
| NULL meaning | Column examples | Recommended handling |
|---|---|---|
| Unknown (value exists but not captured) | customer_dob, visit_purpose, referrer_url | Leave NULL; document that absence ≠ absence-of-fact |
| Not applicable (value cannot exist for this row) | spouse_name for single customers, return_reason for non-returns, churn_date for active users | Leave NULL; contract explicitly enumerates which combinations preclude the column |
| Pending (value will arrive later) | shipped_ts, paid_ts, resolved_ts in an accumulating snapshot | NULL is the signal; downstream code tests IS NULL explicitly |
| Sentinel for "no match" | fraud_model_score on transactions the model declined to score, segment_id for un-enrollable users | Prefer a sentinel value over NULL: -1 for scores, 'UNASSIGNED' for enums |
| Soft-deleted | deleted_at populated = row is logically deleted; NULL = active | NULL means "not deleted"; every consumer must filter on deleted_at IS NULL |
| End-date of open interval | valid_to, employment_end_date, lease_end_date | NULL means "still valid"; OR set to 9999-12-31 for range-join friendliness |
| Privacy-suppressed | customer_ip, email post-GDPR | Prefer an explicit status column (pii_status='erased') alongside NULL |
| Default-not-set | preferred_language, notification_opt_in | Fill with a system default ('en-US', FALSE); NULL here is usually a bug |
| Zero vs NULL (the classic) | refund_amount, discount_amount, gift_card_value | Zero when we know no refund occurred; NULL only if we don't know. The two are different rows in finance. |
The aggregation trap
SUM(column_with_nulls) ignores NULLs, but SUM(column_with_nulls) / COUNT(*) does not — COUNT(*) includes NULL rows. If you want the average over non-null rows, use AVG(col) or SUM(col) / COUNT(col). Mixing SUM and COUNT(*) is the single most common "the numbers are off" bug in dashboards that mix nullable measures.
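The trap is easy to demonstrate with SQL semantics mimicked in Python (`None` standing in for NULL):

```python
rows = [100, None, 300, None]   # a nullable measure column, 4 rows

sql_sum = sum(v for v in rows if v is not None)     # SUM(col)   skips NULLs: 400
count_star = len(rows)                              # COUNT(*)   counts all rows: 4
count_col = sum(1 for v in rows if v is not None)   # COUNT(col) skips NULLs: 2

wrong_avg = sql_sum / count_star   # 100.0 — silently divides by NULL rows too
right_avg = sql_sum / count_col    # 200.0 — what AVG(col) actually returns
```

The two "averages" differ by 2x on the same data — exactly the silent dashboard discrepancy the paragraph above describes.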
The NULL-vs-zero revenue bug
Finance often wants "revenue = 0 for days with no sales." If your fact table uses NULL to mean "no sales," your MoM chart will produce NULL MoM for zero-revenue-days, not "down 100%." Generate a dense calendar grid (see Q4 in Part 09's SQL bank), COALESCE NULLs to 0, and you're safe.
Streaming Processing
Streaming is where data engineering gets hard — not because the APIs are complex, but because the physics of distributed time makes correctness subtle. This file goes into watermark math, window mechanics, Flink's checkpoint algorithm, Kafka's internal protocols, and the proofs behind exactly-once semantics.
1. Unbounded Data — What Actually Changes
An unbounded source has four properties that break batch intuitions:
- No sealed state. You never "have all the data." Every computation is "as of now," subject to revision.
- Arbitrary lateness. Events can arrive minutes, hours, or days after they happened (mobile reconnects, buffering clients).
- Out-of-order arrival. An event timestamped 10:05 may arrive after an event timestamped 10:07, because it took a different route through the system.
- Continuous state. The processor must maintain state (windows, joins, aggregations) that grows without bound unless explicitly expired.
Consequence 1: "correct" becomes "correct as of watermark W"
A batch SQL SELECT SUM(watch_ms) GROUP BY dt has one answer. A streaming equivalent has an answer that keeps updating as late data arrives — and at any moment, the answer is "the sum of all data observed up to the current watermark, with the agreement that events older than W are considered closed."
Consequence 2: computations have four dimensions, not one
Beam's influential model:
- What are you computing? (the aggregation function)
- Where in event time? (the windowing scheme)
- When in processing time? (the trigger — when to emit a result)
- How do refinements relate? (accumulating, discarding, accumulating-with-retractions)
Naming them explicitly clarifies every streaming design decision.
Consequence 3: state is load-bearing
A batch job's state is transient (shuffle files, intermediate RDDs). A streaming job's state is the product. Lose it, lose correctness. Managing state durably (checkpoints, savepoints, RocksDB), bounding its growth (TTL, key expiration), and recovering it on failure (from checkpoint) dominates streaming engineering.
Consequence 4: time is a first-class citizen
In batch, "what time is it?" is a parameter. In streaming, time is the dataflow itself. Watermarks advance; triggers fire; windows close; state expires. Every operation is time-indexed.
2. The Three Times (Event, Processing, Ingestion)
Define precisely:
- Event time (ET) — when the event actually occurred in the real world. Stamped by the producer (the client, the sensor, the database transaction). This is what users care about. "Watch hours on 2026-04-18" means "hours of watching whose event_time falls on 2026-04-18."
- Ingestion time (IT) — when the event was received by the messaging system (e.g., the time Kafka assigned on append).
- Processing time (PT) — when the stream processor is evaluating the event.
Relationships:
- ET ≤ IT ≤ PT (assuming correct clocks — events can't be received before they happened, and processing can't finish before receipt; producer clock skew can violate the first inequality in practice).
- IT - ET is the producer lateness — how long the event took to reach the messaging system.
- PT - IT is the processing lag — how behind the consumer is.
- PT - ET is the end-to-end latency experienced by consumers.
Distribution of ET → IT lag
Real production distributions are heavily right-skewed:
- p50: milliseconds to seconds.
- p99: tens of seconds to minutes.
- p99.9: hours (mobile reconnects, app-in-background).
- p99.99: days (offline clients syncing weeks later).
Lesson: watermarks must be tuned to the distribution, not the average. Setting a 1-minute lateness tolerance drops ~1% of events in most mobile datasets.
When does processing time ever make sense?
- Ad-hoc system health ("events/sec arriving right now").
- Simple at-least-once transforms with no time semantics.
- Systems that explicitly want "wall clock" behavior (alarms, heartbeats).
For any business metric, event time is correct. Always start with event time.
3. Watermarks — Math and Mechanics
A watermark is an assertion: "I assert that no more events with event_time ≤ W will arrive." The stream processor uses this to decide when windows can close.
Perfect vs heuristic watermarks
- Perfect watermark: actually correct. Only achievable when you have metadata (e.g., Kafka producer timestamps with bounded clock skew and source-aware committed offsets). Rare.
- Heuristic watermark: a guess, usually max(event_time_seen_so_far) - safety_margin. Wrong sometimes; that's what late-data handling is for.
Bounded out-of-orderness watermark
The common practical watermark:
W(t) = max(event_time seen before t) - B
where B is the bounded out-of-orderness (expected max lateness).
Example with B = 5 seconds:
- Stream sees events {ET: 10:00:00, ET: 10:00:02, ET: 10:00:01, ET: 10:00:04, ET: 10:00:03, ...}.
- After ET: 10:00:04 is observed, watermark = 10:00:04 - 5s = 09:59:59.
- Event ET: 10:00:03 arrives next: does it fall below the watermark? 10:00:03 > 09:59:59, so not late. Accepted into windows.
- After ET: 10:00:10 is observed, watermark = 10:00:05. Events arriving with ET < 10:00:05 are now late.
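That walkthrough can be replayed with a small tracker implementing W = max(event_time seen) − B — a sketch of the mechanism, using seconds after 10:00:00 as event times:

```python
class BoundedOutOfOrdernessWatermark:
    """W = max(event_time seen so far) - B.
    An event is late iff its event_time is strictly below W."""
    def __init__(self, b_seconds: float):
        self.b = b_seconds
        self.max_et = float("-inf")

    def watermark(self) -> float:
        return self.max_et - self.b

    def observe(self, event_time: float) -> bool:
        """Returns True if the event is on time, False if late.
        Advances the max-event-time high-water mark either way."""
        on_time = event_time >= self.watermark()
        self.max_et = max(self.max_et, event_time)
        return on_time

wm = BoundedOutOfOrdernessWatermark(5)   # B = 5 s, as in the example
for et in [0, 2, 1, 4]:                  # 10:00:00, :02, :01, :04
    wm.observe(et)
assert wm.watermark() == -1              # 10:00:04 - 5s = 09:59:59
assert wm.observe(3) is True             # ET 10:00:03 > W: accepted
wm.observe(10)                           # W advances to 10:00:05
assert wm.observe(3) is False            # same ET is late now
```

A multi-input operator would hold one such tracker per input and take `min` of their watermarks — which is exactly why one slow source stalls the whole graph.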
Choosing B (bounded out-of-orderness)
Empirical approach:
- Instrument production: log processing_time - event_time for every event.
- Compute percentiles: p50, p95, p99, p99.9 of lateness.
- Pick B as p99 or p99.9 — the tail you're willing to drop vs block.
Math: if B = p99 lateness, ~1% of events are late (dropped or routed to side output).
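The empirical approach above, sketched end-to-end (a naive nearest-rank percentile for illustration; production code would use a proper quantile sketch over the lateness stream):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: sort and index. Fine for a sketch;
    use t-digest/HDR histograms on real, unbounded lateness streams."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(p * len(s)))
    return s[idx]

# Toy right-skewed lateness distribution (seconds), like the one described:
# most events arrive fast, a thin tail arrives hours or days late.
lateness = [0.5] * 900 + [20.0] * 90 + [300.0] * 9 + [86_400.0]

B = percentile(lateness, 0.99)   # pick p99: ~1% of events will be late
```

With this toy distribution B lands at 300 s (5 minutes) — within the "web/mobile analytics B = 30s–5m" range quoted below, and it drops only the one-in-a-thousand day-late stragglers.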
Higher B:
- Fewer late events.
- Longer window emission delays (latency ↑).
- Larger active state (windows held open longer).
Lower B:
- Shorter latency.
- More late events (correctness ↓).
- Smaller state.
Typical values: web/mobile analytics B = 30s–5m; sensor/IoT with good clocks B = 1–10s; batch-like (daily S3 drops) B = hours.
Watermark propagation in a dataflow graph
In Flink, each operator has its own watermark. Upstream operators propagate watermarks downstream. When an operator has multiple inputs, it takes the minimum watermark of its inputs:
source A: watermark advancing at 10:05
source B: watermark stuck at 10:02 (slow source)
joined operator: watermark = min(10:05, 10:02) = 10:02
This is correct: the downstream operator can't claim "no events before 10:05" because source B might still emit events with event time between 10:02 and 10:05.
Practical implication: a single slow source stalls the entire pipeline. Diagnose by watching per-operator watermarks in Flink UI. Solutions:
- Ensure all sources produce events regularly (heartbeat events if idle).
- Use WatermarkStrategy.forMonotonousTimestamps() where applicable (no out-of-orderness expected).
- Configure idle source detection — a source is marked idle after inactivity and its watermark contribution is ignored.
WatermarkStrategy<Event> strategy = WatermarkStrategy
.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withIdleness(Duration.ofMinutes(1))
.withTimestampAssigner((ev, ts) -> ev.eventTime);
Watermarks across Kafka partitions
A single Kafka topic has multiple partitions. Each is an independent log. Event times within a partition are monotonic only if the producer writes in event-time order (usually not).
Flink's Kafka source computes per-partition watermarks, then combines via min. This is correct but introduces subtle issues:
- A partition with slow producers drags the watermark.
- An empty partition (no events) can cause the overall watermark to freeze unless idleness is configured.
Set watermarks per partition, with idleness detection, and ensure your partitioning key doesn't create empty partitions for prolonged times.
The "punctuated" variant
Instead of heuristics, the producer emits explicit watermark marker records: "I'm at event time T, no more events before T from me." Flink supports this via WatermarkGenerator with onEvent advancing and onPeriodicEmit emitting. Ideal when the source is a database CDC stream with explicit sequencing.
4. Windows — Types, State, and Trade-offs
A window groups events by time (or session). Each window maintains state (the aggregating values) and emits a result when it's ready.
Tumbling window
Fixed size, non-overlapping. Each event belongs to exactly one window.
|----W1----|----W2----|----W3----|
^ ^ ^
events fall in one window
State: O(1) per key per active window. Simplest.
```python
# PySpark Structured Streaming — 1-minute tumbling window
(events
 .withWatermark("event_ts", "5 seconds")
 .groupBy(F.window("event_ts", "1 minute"), "title_id")
 .agg(F.sum("watch_ms").alias("watch_ms")))
```

```java
// Flink
events
    .keyBy(e -> e.titleId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new SumAggregator());
```

Sliding (hopping) window
Fixed size, slides forward by a step smaller than the size. Each event belongs to size/slide windows.
size=10m, slide=1m: every event is in 10 overlapping windows.
State: O(size/slide) × keys. Quickly becomes expensive.
Use when: you need rolling metrics ("watch hours in the trailing 10 minutes, updated every minute").
Tip: sliding windows can often be emulated by a tumbling window at the fine granularity + downstream rollup, which is cheaper.
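The size/slide membership count is easy to verify with a sketch (illustrative helper, integer timestamps; windows that would start before time 0 are clipped):

```python
# Sliding-window assignment math: an event at time ts belongs to every
# window [start, start+size) whose start is a multiple of slide and
# covers ts — i.e., size/slide windows.

def sliding_window_starts(ts, size, slide):
    last_start = ts - ts % slide                 # latest window containing ts
    return [s for s in range(last_start, ts - size, -slide) if s >= 0][::-1]

# size=10, slide=2 -> each event is in 10/2 = 5 windows
starts = sliding_window_starts(ts=11, size=10, slide=2)
print(starts)       # [2, 4, 6, 8, 10]
print(len(starts))  # 5 == size // slide
```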
Session window
Dynamic size, closes when there's a gap of inactivity longer than the session timeout.
```
events:  * * * *    * *      *     (gap >= timeout means new session)
windows: |------|   |---|   |-|
```
State: O(events per active session) × active sessions. Can be unbounded for users with continuous activity.
```java
events
    .keyBy(e -> e.userId)
    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
    .aggregate(new SessionBuilder());
```

Challenge: a single "always-active" user (a bot, a scraper, a stuck client) holds a session open forever, preventing emission. Solutions:
- Maximum session length (cap at e.g. 4 hours).
- Bot filtering at ingestion.
- Late "ejection" if max latency exceeded.
Global window
One window for the entire stream, with custom triggers deciding when to emit. Rarely correct unless you know exactly why you want it (e.g., a running counter that never resets).
Custom windows (e.g., calendar-aligned)
Windows aligned to calendar boundaries ("this day in user's local timezone", "this business week"). Implement with custom window assigners in Flink:
```java
public class CalendarDayWindowAssigner extends WindowAssigner<Event, TimeWindow> {
    @Override
    public Collection<TimeWindow> assignWindows(Event ev, long ts, WindowAssignerContext ctx) {
        ZoneId zone = ZoneId.of(ev.userTimezone);
        ZonedDateTime zdt = Instant.ofEpochMilli(ev.eventTime).atZone(zone);
        ZonedDateTime dayStart = zdt.toLocalDate().atStartOfDay(zone);
        ZonedDateTime dayEnd = dayStart.plusDays(1);
        return Collections.singleton(new TimeWindow(
            dayStart.toInstant().toEpochMilli(),
            dayEnd.toInstant().toEpochMilli()
        ));
    }
    // ... getDefaultTrigger, getWindowSerializer, etc.
}
```

Window state size
For `N_keys` keys and `W_active` simultaneously active windows per key:
- Tumbling: `W_active = 1`; state = `O(N_keys)`.
- Sliding (size `S`, slide `St`): `W_active = S/St`; state = `O(N_keys × S/St)`.
- Session: `W_active = active_sessions`; state = `O(active_sessions × avg_events_per_session)`.

For high cardinality + long windows + many sliding steps, state explodes. Mitigations:
- RocksDB backend (disk-backed state, not memory-bound).
- Pre-aggregate (use `aggregate`, not `apply` — maintains incremental state rather than buffering raw events).
- TTL on state (Flink's `StateTtlConfig`).
5. Triggers and Emission Policy
A trigger decides when a window's current result is emitted. Default: on watermark (once per window, when watermark passes end-of-window). But you can do much more.
When to use custom triggers
- Speculative emission: emit partial results early, even before the window closes. Trade accuracy for latency.
- Per-row triggering: emit every N events (useful for low-latency monitoring).
- Processing-time-based: emit every 10 seconds regardless of watermark (for real-time dashboards).
- Data-driven: emit when some condition in the data is met (e.g., "emit when window's count exceeds threshold").
Example: early + on-watermark + late
```java
events
    .keyBy(e -> e.titleId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .trigger(
        EventTimeTrigger.create()
        // speculative: emit every 10s while window is open (processing time)
        // Flink's built-in trigger doesn't combine these directly;
        // implement a custom Trigger extending EventTimeTrigger
    )
    .aggregate(new ViewerCountAgg());
```

Accumulation mode
When you emit multiple times for the same window, what do the downstream consumers see?
- Accumulating: each emission is the latest running total. Downstream must be an upsert sink that replaces the previous value. Example: `(title_id=X, window=[10:00,10:01), count=42)` replaces the earlier `count=37`.
- Discarding: each emission contains only the new events since the last emission. Downstream must sum across emissions to get the total. Harder downstream, but smaller messages.
- Accumulating + retracting: each emission includes a retraction of the previous value plus the new value. Allows downstream to handle both "live" updates and "final" snapshots. Most expressive, most expensive.
Choose based on downstream semantics. If sink is Kafka with compacted topic keyed by (title_id, window_start): accumulating + upsert. If sink is event log: retracting. If sink is an idempotent append-only store: discarding.
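A toy simulation makes the three modes concrete. Suppose one window's true running count goes 37 → 42 → 50 across three trigger firings (values illustrative):

```python
# What downstream sees under each accumulation mode, for three firings
# of the same window whose running total goes 37 -> 42 -> 50.

running_totals = [37, 42, 50]

accumulating = running_totals                              # latest total each time
discarding = [b - a for a, b in zip([0] + running_totals, running_totals)]
retracting = []                                            # (retract_old, emit_new)
prev = None
for t in running_totals:
    retracting.append((prev, t))
    prev = t

print(discarding)       # [37, 5, 8] -> downstream must sum
print(sum(discarding))  # 50         -> recovers the total
print(retracting)       # [(None, 37), (37, 42), (42, 50)]
```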
6. Late Data Strategies
An event is "late" if it arrives after its window's watermark has passed.
Strategy 1: Drop (default in Spark)
Late events are silently discarded. Fast, simple, incorrect for audit-sensitive workloads.
Strategy 2: Allowed lateness (extend window life)
Hold the window open past watermark by some duration; late events that arrive within this grace period update the window and trigger re-emission.
```java
events
    .keyBy(e -> e.userId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .allowedLateness(Time.minutes(10)) // 10 extra minutes
    .aggregate(new Counter());
```

State cost: windows are held for `window_size + allowed_lateness`. For a 1-minute window with 10-minute lateness, state is held 11 minutes per window.
Downstream must handle re-emissions — sink must be upsert-capable.
Strategy 3: Side output (recommended)
Route late events to a separate "late" stream for reconciliation downstream.
```java
OutputTag<Event> lateTag = new OutputTag<>("late"){};

SingleOutputStreamOperator<Result> result = events
    .keyBy(e -> e.userId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .sideOutputLateData(lateTag)
    .aggregate(new Counter());

DataStream<Event> lateEvents = result.getSideOutput(lateTag);
lateEvents.addSink(new LateEventSink()); // write to separate topic/table
```

Dual-path reconciliation:
- Streaming pipeline handles on-time events.
- A batch job reads the late-event table daily and re-processes affected windows.
- Gold tables are rebuilt from the reconciled silver.
This is the production pattern. It's a Kappa + audit architecture.
Strategy 4: Retractions
Emit a retraction for the previous result and a new result reflecting the late event. Downstream must support retractions (a log of changes, not a snapshot).
Supported natively in Flink Table API with Retract mode. Upsert sinks handle retractions by deletion+insertion.
Measuring lateness in production
Emit a metric per event: `lateness_ms = processing_time - event_time`. Plot the distribution and use it to tune the watermark's out-of-orderness bound `B`.

```python
# Spark Structured Streaming: compute a lateness column
events_with_lateness = events.withColumn(
    "lateness_ms",
    F.unix_millis(F.current_timestamp()) - F.unix_millis(F.col("event_ts"))
)
```

7. Delivery Semantics Proved
"Exactly-once" is often said without precision. Let's nail it down.
Definitions
Let M be a message in a stream. Let effect(M) be the externally-observable effect of processing M (a row written, a counter incremented, a record published).
- At-most-once: for each `M`, `effect(M)` happens 0 or 1 times.
- At-least-once: for each `M`, `effect(M)` happens 1 or more times.
- Exactly-once: for each `M`, `effect(M)` happens exactly 1 time.
The impossibility footnote
In a distributed system with possible crashes, the physical delivery of a message cannot be guaranteed exactly-once: any single network request may or may not have arrived despite no response. But the effect can be, by coordinating retries with deduplication or idempotent sinks.
Hence the industry term "effectively exactly-once": the physical operation may happen multiple times, but the side effect happens exactly once.
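A minimal sketch of that distinction, assuming a sink that dedups on an idempotency key (all names illustrative, not a real client API):

```python
# "Effectively exactly-once": physical delivery happens many times (retries),
# but the sink deduplicates on an idempotency key, so the effect happens once.

class DedupSink:
    def __init__(self):
        self.rows = {}    # idempotency_key -> payload (the observable effect)
        self.writes = 0   # physical write attempts observed

    def write(self, key, payload):
        self.writes += 1
        self.rows.setdefault(key, payload)  # duplicate keys are no-ops

sink = DedupSink()
for _attempt in range(3):                   # producer retries the same message
    sink.write("evt-123", {"amount": 10})

print(sink.writes)     # 3 physical deliveries
print(len(sink.rows))  # 1 effect
```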
Three necessary conditions
For effectively-exactly-once:
- Replayable source. On restart after failure, the system can re-consume from a precisely defined offset. (Kafka offsets satisfy this; naive HTTP push sources do not.)
- Deterministic or idempotent processing. Re-processing the same input produces the same output, OR the sink absorbs duplicates idempotently.
- Atomic commit coordinating source offsets and sink output. Either both the new offsets and the new output are committed, or neither.
Any one of these missing, and "exactly-once" fails.
Spark Structured Streaming's exactly-once
Spark provides effectively-exactly-once for specific combinations:
Mechanism:
- Source tracks offsets per micro-batch (stored in `offsets/` in the checkpoint directory).
- Each micro-batch's computation is deterministic given the offsets.
- Sink commits atomically:
  - File sinks (Parquet, Iceberg, Delta): use a two-phase commit with a `commits/` log. After the data files are written, a commit marker is appended atomically. On restart, uncommitted micro-batches are re-executed.
  - Kafka sink: uses Kafka transactions (`transactional.id`). The micro-batch produces within a transaction; the transaction commits together with the offset commit.
- On failure, restart reads `offsets/` and `commits/` to find the last committed micro-batch, then re-executes the next one deterministically.
Where it breaks:
- Non-deterministic query (`rand()`, `current_timestamp()`, reading from a mutable source).
- Non-atomic sinks (HTTP POST, naive JDBC without a dedup key).
- Kafka transactions disabled or misconfigured.
- Checkpoint directory corruption.
```python
(stream
 .writeStream
 .format("iceberg")
 .option("checkpointLocation", "s3://ckpt/events/")
 .trigger(processingTime="30 seconds")
 .start())
```

Spark guarantees EO when using checkpoints + Iceberg/Delta sinks + deterministic transforms.
Flink's exactly-once via 2PC and checkpoints
Flink uses a variant of the Chandy-Lamport distributed-snapshot algorithm (asynchronous barrier snapshotting) for checkpoints, combined with two-phase commit at sinks.
Barrier-based checkpointing:
- The JobManager injects a barrier with checkpoint ID `C` into each source partition.
- As barriers flow through the dataflow graph, each operator aligns: when it has received the barrier from all its inputs, it snapshots its state to durable storage.
- The state snapshot includes operator state, keyed state, and source offsets.
- The operator forwards the barrier to all its outputs.
- Sinks acknowledge completion. When all sinks ack, checkpoint `C` is globally complete.
```
[Source] ──barrier──▶ [Map] ──barrier──▶ [Window] ──barrier──▶ [Sink]
    ↓                   ↓                    ↓                    ↓
 snapshot            snapshot             snapshot             commit
```
Aligned vs unaligned barriers:
- Aligned: operator waits for barriers from all inputs before snapshotting. Clean semantics but blocks processing of inputs whose barrier arrived first.
- Unaligned (Flink 1.11+): the operator snapshots immediately when the first barrier arrives, including in-flight records. Less backpressure sensitivity; larger snapshots. Enabled via `env.getCheckpointConfig().enableUnalignedCheckpoints()`.
Two-phase commit at sinks:
- On barrier: sink does a pre-commit (write to staging, reserve transaction ID).
- When JobManager confirms the checkpoint globally complete: sink does commit (make data visible).
- On restart: replay uncommitted pre-commits as needed.
```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000); // every 60s
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
env.getCheckpointConfig().setCheckpointTimeout(600_000);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
```

What EO does NOT guarantee
- Ordering across keys: events may still be emitted to the sink in an order different from their arrival.
- No retractions: updates to already-committed results are still visible downstream as new records.
- Cross-system atomicity: EO within one Flink job + one Kafka cluster. Across two Kafka clusters, or Kafka + a different sink, EO requires explicit 2PC support end-to-end.
- Effective-exactly-once for the system, not the business: if the input data is duplicated (upstream producer retried without dedup), EO still processes both copies.
Idempotent sink as an alternative to 2PC
Often simpler than 2PC is to make the sink idempotent by key:
- Write each record with a unique idempotency key.
- The sink's `INSERT IGNORE` / `ON CONFLICT DO NOTHING` / upsert-by-key handles duplicates.
- Combined with at-least-once semantics, this gives effective-exactly-once.
```sql
-- Postgres sink with idempotency
INSERT INTO sink_table (event_id, user_id, ...)
VALUES (?, ?, ...)
ON CONFLICT (event_id) DO NOTHING;
```

Works well and doesn't require sink-coordinator awareness. Limit: sinks that can't enforce uniqueness (append-only files) don't support this pattern.
8. Flink Internals — Dataflow, Barriers, State
Architecture
- JobManager: coordinator; runs the ExecutionGraph, schedules tasks, coordinates checkpoints.
- TaskManagers: workers; each runs `N` slots, and each slot runs a subtask of an operator.
- ResourceManager: allocates TaskManagers (typically via YARN or Kubernetes).
- Dispatcher: receives jobs, spawns JobManagers per job.
Tasks, subtasks, and chains
A Flink job is a DAG of operators. Each operator has a parallelism (how many parallel instances run). Each instance is a subtask. A JobManager assigns subtasks to TaskManager slots.
Operator chaining: if an operator's output goes to another operator 1:1 and no shuffle is needed, Flink chains them into a single thread, hugely reducing serialization overhead. `forward` channels are chained; `rebalance`, `keyBy`, etc. are not.
```java
DataStream<Event> events = env.addSource(kafkaSource);              // source
DataStream<Event> filtered = events.filter(ev -> ev.valid);         // chained with source
KeyedStream<Event, String> keyed = filtered.keyBy(ev -> ev.userId); // NOT chained (keyBy shuffles)
DataStream<Result> windowed = keyed.window(...).aggregate(...);     // new chain
```

Task chain = single thread. Processes records with zero network hops and zero ser/de. Fast.
Keyed state vs operator state
- Keyed state: one value per key per operator.
keyBypartitions the stream; each partition holds state for its keys. The main way Flink scales stateful computation.ValueState<T>,ListState<T>,MapState<K,V>,ReducingState<T>,AggregatingState<IN,OUT>.
- Operator state: per-subtask state, not per-key. Used by sources (offset tracking) and custom operators. Sub-types: broadcast state, list state (union or split).
```java
public class CountByUserFn extends KeyedProcessFunction<String, Event, Result> {
    private ValueState<Long> count;

    @Override
    public void open(Configuration cfg) {
        ValueStateDescriptor<Long> d = new ValueStateDescriptor<>("count", Long.class, 0L);
        d.enableTimeToLive(StateTtlConfig.newBuilder(Time.hours(24)).build());
        count = getRuntimeContext().getState(d);
    }

    @Override
    public void processElement(Event ev, Context ctx, Collector<Result> out) throws Exception {
        Long c = count.value();
        c += 1;
        count.update(c);
        out.collect(new Result(ctx.getCurrentKey(), c));
    }
}
```

State backends
- HashMapStateBackend: in-JVM-memory. Fastest. State bounded by heap. Synchronous checkpoints (pauses JVM). Use for small state (< ~2GB).
- EmbeddedRocksDBStateBackend: on-disk key-value store (RocksDB). State bounded by local disk. Asynchronous and incremental checkpoints. Use for state > a few GB.
RocksDB internals:
- SSTables (Sorted String Tables) on disk, immutable.
- Memtable in memory; flushed to SSTable periodically.
- LSM tree structure: multiple levels of SSTables, compacted periodically.
- State read/write: serialize key, look up in memtable + levels (with bloom filters).
Incremental checkpoints
With RocksDB, Flink checkpoints only the SSTables that changed since the last checkpoint. New SSTables uploaded; unchanged ones referenced by manifest. Reduces checkpoint time from "full state" to "delta state."
```
checkpoint N-1: [sst1, sst2, sst3]
  ... operator writes, compaction happens ...
checkpoint N:   [sst1 (ref), sst4, sst5]   <-- only sst4 and sst5 uploaded
```
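The manifest bookkeeping reduces to set arithmetic (an illustrative sketch, not Flink's actual code):

```python
# Incremental-checkpoint bookkeeping: given the SSTable set at the previous
# checkpoint and the current set, only new files are uploaded; surviving
# files are referenced from the manifest.

def incremental_checkpoint(prev_ssts, curr_ssts):
    upload    = sorted(curr_ssts - prev_ssts)  # new since last checkpoint
    reference = sorted(curr_ssts & prev_ssts)  # already in durable storage
    return upload, reference

prev = {"sst1", "sst2", "sst3"}
curr = {"sst1", "sst4", "sst5"}                # compaction replaced sst2/sst3

upload, reference = incremental_checkpoint(prev, curr)
print(upload)     # ['sst4', 'sst5']
print(reference)  # ['sst1']
```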
Savepoints vs checkpoints
- Checkpoints: system-owned, periodic, optimized for fast recovery. May be incremental.
- Savepoints: user-triggered, versioned, portable across Flink versions (with caveats). Use for explicit restarts (deploys, cluster migrations, schema changes).
```shell
flink savepoint <job_id> s3://savepoints/myjob/
flink run -s s3://savepoints/myjob/savepoint-xxx <new-jar>.jar
```

State migration
When you change the shape of state (add a field, change types), you need a migration. Flink supports:
- Schema evolution via Avro-serialized state (auto).
- Explicit state migration: write a one-off job that reads the savepoint with old schema, writes with new.
- Keyed state type migration: limited; usually requires re-keying from scratch.
9. Kafka Internals — ISR, Controllers, Exactly-Once
Topic, partition, offset
- Topic: named append-only log, split into partitions for parallelism.
- Partition: an ordered sequence of records; each record has a monotonically increasing offset within the partition.
- Producers choose partition (via key hash or explicit selection).
- Consumers read partitions sequentially; track their offset.
Parallelism = number of partitions. One consumer per partition per consumer group is the maximum (more consumers than partitions → idle consumers).
Replication
Each partition has N replicas (typically 3):
- Leader: handles reads and writes.
- Followers: pull from the leader asynchronously.
- ISR (In-Sync Replicas): the leader plus the followers that are within `replica.lag.time.max.ms` of the leader.
Writes:
- `acks=0`: fire-and-forget. Producer doesn't wait for any ack.
- `acks=1`: wait for the leader to persist. Lost if the leader fails before followers catch up.
- `acks=all`: wait for all ISRs to persist. No loss within the replication factor (assuming `min.insync.replicas >= 2`).
Controller and leader election
One broker is the controller (elected via ZooKeeper or KRaft). Controller maintains cluster metadata — which broker leads which partition.
On broker failure:
- Controller detects failure (ZooKeeper session expiry or KRaft heartbeat timeout).
- Controller removes failed broker from ISR for partitions it led.
- Controller elects a new leader from surviving ISR (first in the ISR list).
- Controller propagates leadership change to all brokers and producers.
Without unclean leader election, the new leader is guaranteed to have every committed record. With unclean leader election enabled (unclean.leader.election.enable=true), a non-ISR replica can be elected — at the cost of data loss.
KRaft (ZooKeeper-free Kafka)
Kafka 3.x+ supports KRaft mode: the controller runs internally via Raft. No ZooKeeper. Simpler operationally; faster metadata propagation; supports larger clusters (millions of partitions).
Default in new deployments from 2024 onward.
Kafka exactly-once
Kafka offers exactly-once semantics (EOS) for specific patterns:
Producer idempotence (enable.idempotence=true):
- The producer is assigned a unique `producer_id` (PID) by the broker.
- Each record carries a per-partition sequence number.
- Brokers deduplicate on (PID, partition, sequence) for the lifetime of the PID's epoch.
- This prevents duplicates from producer retries within a single producer session.
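A toy model of the broker-side dedup logic (illustrative only, not the actual broker code):

```python
# Broker-side idempotent-producer dedup: a record is accepted only if its
# sequence number is exactly last_seq + 1 for this producer; a retry that
# re-sends an already-persisted sequence is dropped.

class PartitionLog:
    def __init__(self):
        self.records = []
        self.last_seq = {}  # pid -> last accepted sequence number

    def append(self, pid, seq, value):
        last = self.last_seq.get(pid, -1)
        if seq <= last:
            return "duplicate"      # retry of an acked record: dropped
        if seq != last + 1:
            return "out_of_order"   # gap in sequence: broker rejects
        self.records.append(value)
        self.last_seq[pid] = seq
        return "ok"

log = PartitionLog()
print(log.append(pid=7, seq=0, value="a"))  # ok
print(log.append(pid=7, seq=1, value="b"))  # ok
print(log.append(pid=7, seq=1, value="b"))  # duplicate (producer retry)
print(log.records)                          # ['a', 'b']
```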
Transactional producers (transactional.id=my-tx):
- The producer starts a transaction, writes to multiple partitions/topics, then commits.
- All writes are atomic — consumers with `isolation.level=read_committed` see either all or none.
- Critical for exactly-once stream processing where "read from topic A, write to topics B and C" must be all-or-nothing.
```java
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("out_topic_1", ...));
    producer.send(new ProducerRecord<>("out_topic_2", ...));
    producer.sendOffsetsToTransaction(consumedOffsets, consumerGroup);
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
    throw e;
}
```

Consumer rebalance protocol
When consumers in a group join or leave:
Eager rebalance (classic):
- Coordinator triggers rebalance.
- All consumers revoke their partitions (stop processing entirely).
- Coordinator assigns partitions via strategy.
- Consumers resume.
Stop-the-world can be minutes with many consumers / many partitions. Kills latency SLOs.
Cooperative sticky rebalance (Kafka 2.4+):
- Coordinator identifies partitions to move.
- Only affected consumers revoke only the moving partitions.
- New assignments go to target consumers.
- Non-affected consumers continue uninterrupted.
Minimal disruption. Prefer it in all modern deployments: `partition.assignment.strategy=CooperativeStickyAssignor`.
Log compaction
For "changelog" topics keyed by entity ID, Kafka can compact:
- Keeps only the latest record per key.
- Older records with the same key are garbage-collected during compaction.
- Tombstones (null-valued records) delete the key permanently after a grace period.
```
log before compaction:
  key=user1, val={name:A}
  key=user2, val={name:X}
  key=user1, val={name:A'}
  key=user1, val=null        (tombstone)
  key=user2, val={name:Y}

log after compaction:
  key=user1, val=null        (to be deleted after grace)
  key=user2, val={name:Y}
```
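The compaction semantics can be sketched in Python (a toy model; real compaction works segment by segment and is governed by settings like `min.cleanable.dirty.ratio`):

```python
# Log compaction: keep the latest record per key; a None value (tombstone)
# survives compaction only until its grace period elapses.

def compact(log, tombstone_grace_expired=False):
    latest = {}
    for key, value in log:  # later records overwrite earlier ones
        latest[key] = value
    if tombstone_grace_expired:
        latest = {k: v for k, v in latest.items() if v is not None}
    return latest

log = [
    ("user1", {"name": "A"}),
    ("user2", {"name": "X"}),
    ("user1", {"name": "A'"}),
    ("user1", None),  # tombstone
    ("user2", {"name": "Y"}),
]

print(compact(log))                                # {'user1': None, 'user2': {'name': 'Y'}}
print(compact(log, tombstone_grace_expired=True))  # {'user2': {'name': 'Y'}}
```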
Compacted topics serve as durable key-value stores readable as streams. Used for:
- Materializing database CDC (final state of each row).
- State topic backing Kafka Streams applications.
- Configuration distribution.
10. Streaming Joins
Joining two unbounded streams is the hardest operation in streaming. There are several flavors.
Stream-stream window join
Join two streams within a time window: match pairs (a, b) where a.timestamp and b.timestamp fall within a common window and the join key matches.
```java
clicks.join(impressions)
    .where(c -> c.userId).equalTo(i -> i.userId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .apply(new JoinFn());
```

State: O(|clicks| + |impressions|) within the window. Grows linearly with window size and rate.
Stream-stream interval join
Match event pairs within a relative time interval: for each a, find b where a.ts - X ≤ b.ts ≤ a.ts + Y.
```java
clicks.keyBy(c -> c.userId)
    .intervalJoin(impressions.keyBy(i -> i.userId))
    .between(Time.minutes(-1), Time.minutes(5))
    .process(new IntervalJoinFn());
```

Used for causal analysis (impressions preceding clicks) where window alignment would be wrong.
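The matching rule can be sketched with a naive O(n×m) scan (illustrative only; the real operator keeps per-key state and prunes it by watermark):

```python
# Interval join: for each click at time ts_c, match impressions with the
# same key whose timestamp lies in [ts_c - lower_ms, ts_c + upper_ms].

def interval_join(clicks, impressions, lower_ms, upper_ms):
    out = []
    for key_c, ts_c in clicks:
        for key_i, ts_i in impressions:
            if key_c == key_i and ts_c - lower_ms <= ts_i <= ts_c + upper_ms:
                out.append((key_c, ts_c, ts_i))
    return out

clicks      = [("u1", 10_000)]
impressions = [("u1", 6_000), ("u1", 14_000), ("u1", 30_000), ("u2", 10_000)]

# 5s before through 5s after each click
print(interval_join(clicks, impressions, lower_ms=5_000, upper_ms=5_000))
# [('u1', 10000, 6000), ('u1', 10000, 14000)]
```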
Stream-table (temporal) join
Join a stream of events against a slowly-changing table (often a dim broadcast or a compacted topic). Each event looks up the version of the table that was current at event time.
Implementation patterns:
- Broadcast state: broadcast the table updates to all parallel instances of the join operator. Each instance keeps the table in local state. Great when the table is small (< 1 GB).
- Keyed state + async lookup: each event does an async DB lookup, keeping a cache in keyed state.
- Versioned table joins (Flink Table API): `FOR SYSTEM_TIME AS OF` syntax.
```sql
-- Flink SQL: temporal join
SELECT o.*, c.country, c.subscription_tier
FROM orders o
JOIN user_dim FOR SYSTEM_TIME AS OF o.event_ts AS c
  ON o.user_id = c.user_id;
```

The engine maintains the history of `user_dim`; the join picks the version of each user that was current at `event_ts`.
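The point-in-time lookup reduces to "latest version with `valid_from <= event_ts`", sketched here against an illustrative in-memory version history:

```python
import bisect

def as_of(versions, ts):
    """versions: list of (valid_from, row), sorted by valid_from.
    Returns the row that was current at ts, or None if ts predates history."""
    starts = [v[0] for v in versions]
    i = bisect.bisect_right(starts, ts) - 1
    return versions[i][1] if i >= 0 else None

user_dim_history = [
    (0,     {"tier": "free"}),
    (1_000, {"tier": "basic"}),
    (5_000, {"tier": "premium"}),
]

print(as_of(user_dim_history, 999))    # {'tier': 'free'}
print(as_of(user_dim_history, 4_500))  # {'tier': 'basic'}
print(as_of(user_dim_history, 9_999))  # {'tier': 'premium'}
```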
Stream enrichment via async I/O
When the dimension is too large for broadcast state but must be looked up per event:
```java
AsyncDataStream.unorderedWait(
    events,
    new AsyncDatabaseLookup(),
    30, TimeUnit.SECONDS,  // timeout
    1000                   // capacity (max concurrent lookups)
).map(enriched -> process(enriched));
```

Concurrent async lookups keep the pipeline fed while external systems respond. Critical: use `unorderedWait` if order doesn't matter (higher throughput); `orderedWait` otherwise.
11. State Management at Scale
Production streaming pipelines accumulate TBs of state. Managing it is its own discipline.
State TTL
Per-state configurable time-to-live. Flink's StateTtlConfig:
```java
StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.hours(24))
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .cleanupInRocksdbCompactFilter(1000)
    .build();
```

Cleanup strategies:
- In-access: expired state is removed on read (cheap but delays cleanup).
- Full-snapshot: cleanup during checkpoint (exhaustive but costly).
- RocksDB compaction filter: cleanup during RocksDB's background compaction. Preferred for large state.
Bounded session windows via custom expiration
Session windows without explicit max length can grow without bound. Implement max session length:
```java
public class BoundedSessionFn extends KeyedProcessFunction<String, Event, Session> {
    private static final long SESSION_GAP_MS = 30 * 60 * 1000;      // 30-minute inactivity gap
    private static final long MAX_SESSION_MS = 4 * 60 * 60 * 1000;  // hard cap: 4 hours

    private ValueState<Session> current;

    @Override
    public void open(Configuration cfg) {
        current = getRuntimeContext().getState(
            new ValueStateDescriptor<>("current-session", Session.class));
    }

    @Override
    public void processElement(Event ev, Context ctx, Collector<Session> out) throws Exception {
        Session s = current.value();
        if (s == null || ev.ts > s.lastEventTs + SESSION_GAP_MS) {
            // new session; emit the previous one if it exists
            if (s != null) out.collect(s);
            s = new Session(ev);
        } else {
            s.add(ev);
            if (s.durationMs() > MAX_SESSION_MS) {
                out.collect(s);
                s = new Session(ev); // start a new session from this event
            }
        }
        current.update(s);
        ctx.timerService().registerEventTimeTimer(ev.ts + SESSION_GAP_MS);
    }

    @Override
    public void onTimer(long ts, OnTimerContext ctx, Collector<Session> out) throws Exception {
        Session s = current.value();
        if (s != null && ts >= s.lastEventTs + SESSION_GAP_MS) {
            out.collect(s);
            current.clear();
        }
    }
}
```

State partitioning evolution
Change the key you're partitioning on? Flink doesn't let you; you must:
- Savepoint current state.
- Write a migration job that reads savepoint, rewrites state under new keying.
- Start new job from migrated savepoint.
Or: start fresh, lose history. Painful; plan keys carefully upfront.
State size budgeting
Estimate:
```
state_size = num_keys × bytes_per_key_state × num_windows_active_per_key
```
Example:
- 100M users
- 1 KB state per user (a few counters, last-event-ts)
- 1 window active per user (current session)
- → 100 GB of state.
RocksDB backend + local SSD + incremental checkpoints handle this fine. HashMap backend wouldn't (100 GB heap is infeasible).
For 1B users × 1 KB = 1 TB state: still viable with RocksDB + fast SSDs. Above that, consider sharding across multiple Flink jobs or moving state to an external store with async lookup.
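The budgeting arithmetic as a tiny helper (decimal GB, matching the back-of-envelope numbers above):

```python
# state_size = num_keys x bytes_per_key x windows_active_per_key, in decimal GB.

def state_size_gb(num_keys, bytes_per_key, windows_per_key=1):
    return num_keys * bytes_per_key * windows_per_key / 1e9

print(state_size_gb(100_000_000, 1_000))    # 100.0  -> 100 GB: RocksDB territory
print(state_size_gb(1_000_000_000, 1_000))  # 1000.0 -> ~1 TB: still viable on fast SSDs
```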
12. Backpressure and Flow Control
Backpressure: when a downstream operator can't keep up, upstream must slow down. If it doesn't, unbounded buffers, OOM, or data loss follow.
How Flink signals backpressure
Flink uses credit-based flow control. Each downstream operator advertises how much buffer space it has (credit); upstream only sends when credit is available.
If downstream is slow, credit is consumed faster than replenished, and upstream naturally blocks. This propagates back to the source, which (if Kafka) simply stops pulling records.
Diagnosing backpressure
In the Flink UI, each operator shows backpressure status (OK / LOW / HIGH). Trace the chain:
- Find the operator showing HIGH backpressure.
- Its downstream is the bottleneck.
Metrics to check:
- `outputQueueLength`: how full the output buffer is.
- `inputQueueUsage`: same for inputs.
- `numRecordsInPerSecond` vs `numRecordsOutPerSecond`: a disparity indicates buffering.
Fixing backpressure
- Increase downstream parallelism: run more instances of the slow operator.
- Optimize the slow operator: profile; CPU? I/O? state access? The usual culprits are: Python UDFs (in PyFlink), synchronous external calls, state backend misuse.
- Use async I/O for external lookups.
- Shard state across more keys: if one key is hot, repartition with a salt (see Spark's skew fix — same principle applies).
Backpressure and checkpoints
Aligned checkpoints require barriers to reach operators; if an operator is backpressured, barriers pile up behind records in the input buffer, delaying the snapshot. Severe backpressure → checkpoint timeouts → job restarts → fallback loop.
Unaligned checkpoints (Flink 1.11+) mitigate: barriers "jump" ahead of in-flight records in the buffer. The snapshot includes those records as part of the channel state. Larger snapshots, but doesn't stall under backpressure.
```java
env.getCheckpointConfig().enableUnalignedCheckpoints();
```

Enable in production wherever backpressure is possible.
13. Lambda vs Kappa Revisited
Lambda architecture
Two pipelines:
- Batch layer — periodic, fully-accurate, high-latency.
- Speed layer — streaming, best-effort, low-latency.
- Serving layer — merges batch and speed views; clients see the combined result.
Problems:
- Two codebases. Every business rule implemented twice. Drift inevitable.
- Two debugging surfaces. Which layer produced this wrong number?
- Two SLA sets. When one layer lags, is the combined answer correct?
Kappa architecture
One pipeline, streaming. Batch is "streaming with a bigger window" or "streaming over a replay."
- Source: Kafka with long retention (or a lakehouse with time travel).
- Processing: streaming engine in both online and replay modes.
- Correction: replay the stream from the past to reprocess.
One codebase, one mental model.
The pragmatic middle: streaming + batch audit
The most common production pattern:
- Streaming pipeline provides low-latency "hot" answers.
- Batch pipeline runs daily/hourly over the full bronze layer to produce "certified" results.
- Differences between hot and certified are monitored; large gaps are incidents.
- Gold consumption reads whichever is appropriate (or blends by SLO).
This gives streaming's freshness AND batch's correctness guarantees, without Lambda's double-maintenance burden (the streaming version is the "real" code; the batch is a reference/audit path).
14. Streaming Pipeline Example End-to-End
A full example: Netflix-style playback sessionization.
Data flow
```
Mobile client ─► Kafka (events) ─► Flink (sessionize) ─► Kafka (sessions)
                                        │
                                        ├─► Iceberg bronze (raw append)
                                        │
                                        └─► Redis (real-time counters)
```
Flink sessionization job
DataStream<PlaybackEvent> events = env
.fromSource(
KafkaSource.<PlaybackEvent>builder()
.setBootstrapServers("kafka:9092")
.setTopics("playback.events")
.setGroupId("sessionizer")
.setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.LATEST))
.setValueOnlyDeserializer(new PlaybackEventDeserializer())
.build(),
WatermarkStrategy.<PlaybackEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
.withIdleness(Duration.ofMinutes(1))
.withTimestampAssigner((ev, ts) -> ev.eventTime),
"kafka-playback"
);
// Validate and filter
DataStream<PlaybackEvent> validated = events
.filter(ev -> ev.isValid())
.name("validate");
// Sessionize by sessionId with 30-minute event-time gap
OutputTag<PlaybackEvent> lateTag = new OutputTag<>("late-events"){};
SingleOutputStreamOperator<PlaybackSession> sessions = validated
.keyBy(ev -> ev.sessionId)
.window(EventTimeSessionWindows.withGap(Time.minutes(30)))
.allowedLateness(Time.minutes(15))
.sideOutputLateData(lateTag)
.trigger(EventTimeTrigger.create())
.aggregate(
new SessionAggregator(), // incremental aggregation
new SessionEmitter() // attach window metadata on fire
)
.name("sessionize");
// Primary sink: Kafka (for downstream real-time consumers)
sessions.sinkTo(
KafkaSink.<PlaybackSession>builder()
.setBootstrapServers("kafka:9092")
.setRecordSerializer(new SessionSerializer())
.setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
.setTransactionalIdPrefix("session-sink-")
.build()
).name("sink-sessions");
// Secondary sink: Iceberg (for batch analytics)
sessions.sinkTo(
IcebergSink.forRowData(...)
.table(catalog.loadTable(tableId("silver", "fact_playback_session")))
.writeParallelism(8)
.distributionMode(DistributionMode.HASH)
.build()
).name("sink-iceberg");
// Late events go to separate Kafka topic for reconciliation
sessions.getSideOutput(lateTag).sinkTo(
KafkaSink.<PlaybackEvent>builder()
.setBootstrapServers("kafka:9092")
.setRecordSerializer(new EventSerializer("playback.events.late"))
.build()
).name("sink-late");
// Checkpointing: exactly-once, incremental
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
env.getCheckpointConfig().enableUnalignedCheckpoints();
env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // incremental
env.getCheckpointConfig().setCheckpointStorage("s3://checkpoints/sessionizer/");
env.execute("playback-sessionizer");
SessionAggregator: incremental state
public class SessionAggregator
implements AggregateFunction<PlaybackEvent, SessionAccumulator, PlaybackSession> {
@Override
public SessionAccumulator createAccumulator() {
return new SessionAccumulator();
}
@Override
public SessionAccumulator add(PlaybackEvent ev, SessionAccumulator acc) {
if (acc.startTs == 0) acc.startTs = ev.eventTime;
acc.endTs = Math.max(acc.endTs, ev.eventTime);
switch (ev.type) {
case HEARTBEAT: acc.watchMs += ev.watchDeltaMs; break;
case BUFFERING: acc.bufferMs += ev.bufferDeltaMs; break;
case SEEK: acc.seeks += 1; break;
case PAUSE: acc.pauses += 1; break;
case ENDED: acc.endedReason = ev.endedReason; break;
}
acc.maxBitrate = Math.max(acc.maxBitrate, ev.bitrateKbps);
acc.sumBitrate += ev.bitrateKbps; acc.countBitrate += 1;
acc.userId = ev.userId;
acc.titleId = ev.titleId;
return acc;
}
@Override
public PlaybackSession getResult(SessionAccumulator acc) {
return new PlaybackSession(
acc.userId, acc.titleId,
Instant.ofEpochMilli(acc.startTs), Instant.ofEpochMilli(acc.endTs),
acc.watchMs, acc.bufferMs, acc.seeks, acc.pauses,
acc.maxBitrate, acc.sumBitrate / Math.max(acc.countBitrate, 1),
acc.endedReason
);
}
@Override
public SessionAccumulator merge(SessionAccumulator a, SessionAccumulator b) {
// Merge two sessions (used by session windows that get merged)
a.startTs = Math.min(a.startTs, b.startTs);
a.endTs = Math.max(a.endTs, b.endTs);
a.watchMs += b.watchMs; a.bufferMs += b.bufferMs;
a.seeks += b.seeks; a.pauses += b.pauses;
a.maxBitrate = Math.max(a.maxBitrate, b.maxBitrate);
a.sumBitrate += b.sumBitrate; a.countBitrate += b.countBitrate;
a.endedReason = a.endedReason != null ? a.endedReason : b.endedReason;
return a;
}
}
What makes this production-grade
- Event-time with watermarks + idleness. Handles out-of-order and idle producers.
- Side-output for late data. Reconciliation path, not silent drops.
- Incremental aggregation (AggregateFunction). State per session is O(1), not O(events).
- Exactly-once Kafka sink with transactional producer.
- Iceberg sink for analytics continuity.
- Unaligned checkpoints. Resilient under backpressure.
- RocksDB backend. State can scale to TB.
- Reasonable parallelism (distribution mode hash). Avoids skew on session IDs.
What to monitor
- Watermark per operator (Flink UI): stuck watermark = no emission.
- Checkpoint duration: increasing duration = state growing or backpressure.
- Records emitted / received per second: discrepancy = backpressure.
- Late event rate: tune bounded out-of-orderness or allowed lateness.
- End-to-end latency: p50/p95/p99 of now() - eventTime at sink.
- Consumer lag on source topics (committed offset behind latest).
- Kafka sink transaction rate and abort rate (high aborts = something wrong).
Next: 04-spark-internals.md — Catalyst, AQE, shuffle internals, join algorithms, and skew.
16. Idempotent Producers and Exactly-Once in Kafka
"Exactly-once" in Kafka is three separate guarantees composed: idempotent producer, transactional writes across partitions, and transactional consumer-offset commits. Handwaving past any of them produces bugs. Senior candidates can dissect each layer.
Layer 1 — Idempotent producer
The producer attaches a producer_id + sequence_number to each record. The broker tracks the highest sequence seen per (producer, partition). Duplicates — retries from the producer after a network blip — are detected and dropped at the broker. This alone gives you "no duplicates within a single producer session to a single partition."
Limits: the guarantee breaks across producer restarts (new producer_id) and across partitions (no coordination). For exactly-once across partitions, you need transactions.
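The broker-side bookkeeping can be sketched in a few lines of plain Python. This is a deliberate simplification — the real broker tracks a window of recent batch sequences per producer epoch and also rejects sequence gaps, rather than only dropping lower sequences:

```python
# Broker-side view of idempotent produce, per (producer_id, partition).
class PartitionLog:
    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> highest sequence appended

    def append(self, producer_id, seq, value):
        """Drop retried batches whose sequence was already appended."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate from a retry: discard, but ack as success
        self.last_seq[producer_id] = seq
        self.records.append(value)
        return True

log = PartitionLog()
assert log.append(producer_id=1, seq=0, value="a")
assert log.append(producer_id=1, seq=1, value="b")
assert not log.append(producer_id=1, seq=1, value="b")  # network-blip retry
assert log.records == ["a", "b"]
```

Note how the per-producer table is lost on producer restart (the new session gets a fresh producer_id) — exactly the limit described above.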
Layer 2 — Producer transactions
Enable with transactional.id and bracket writes with beginTransaction() / commitTransaction(). On commit or abort, the transaction coordinator writes a control marker into each affected partition. Consumers configured with read_committed filter uncommitted records out of their stream; aborted records remain in the log but are skipped past the abort marker.
Layer 3 — The two-phase commit sink
For true end-to-end exactly-once (consume → process → produce), the processor must commit its source offsets and its output records atomically. Kafka supports this via the sendOffsetsToTransaction() API — the source offset commit is included in the same producer transaction as the output records. Kafka Streams uses this mechanism directly; Flink's exactly-once Kafka sink instead runs its own two-phase commit protocol tied to checkpoints.
What can still go wrong
- Zombie producers. A restarted producer with the same transactional.id gets a bumped epoch and fences the old instance. If you reuse transactional.id naively across pods, you will see fence exceptions in production.
- External side effects. If your processor writes to Kafka and calls a REST API, only Kafka is in the transaction. The REST call can happen twice.
- Read_uncommitted consumers. A downstream consumer not set to read_committed sees uncommitted records. EOS is a configured guarantee, not a default.
17. Complex Event Processing and MATCH_RECOGNIZE
When the interview question is "find users who did X then Y then Z within 10 minutes," you've hit a Complex Event Processing (CEP) problem. Three implementation paths.
Path 1 — Flink CEP library
Flink ships a native CEP API: Pattern.begin("a").where(...).next("b").where(...).within(Time.minutes(10)). The engine compiles to an NFA (non-deterministic finite automaton) and matches against the event stream with full watermark-aware correctness.
Path 2 — SQL MATCH_RECOGNIZE
Standardized in SQL:2016, supported by Flink SQL, Oracle, and Snowflake. Expresses pattern-match queries declaratively:
SELECT *
FROM events
MATCH_RECOGNIZE (
PARTITION BY user_id
ORDER BY event_ts
MEASURES A.event_ts AS start_ts, C.event_ts AS end_ts
ONE ROW PER MATCH
PATTERN (A B* C)
DEFINE
A AS A.event_name = 'view_plan',
B AS B.event_name IN ('hesitate','back','scroll'),
C AS C.event_name = 'purchase'
AND C.event_ts <= A.event_ts + INTERVAL '10' MINUTE
);
Path 3 — Hand-rolled stateful operator
For simple patterns (A then B), a stateful Flink operator keyed by user_id is often cleanest. Store a last-seen-A timestamp per key; on each B event, check whether last-seen-A is within the window. Trade-off: simpler code than CEP, but doesn't generalize to longer patterns.
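A minimal sketch of that operator in plain Python — the per-key dict stands in for Flink's keyed ValueState, and the (key, type, event_time_ms) event shape is assumed for illustration:

```python
# "A then B within N ms" per key; state is last-seen-A timestamp only.
class AThenBDetector:
    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.last_a_ts = {}  # key -> last-seen-A event time (Flink: ValueState)

    def on_event(self, key, event_type, ts):
        """Return a match tuple when B follows A within the window, else None."""
        if event_type == "A":
            self.last_a_ts[key] = ts
            return None
        if event_type == "B":
            a_ts = self.last_a_ts.get(key)
            if a_ts is not None and ts - a_ts <= self.window_ms:
                del self.last_a_ts[key]  # fire at most once per A
                return (key, a_ts, ts)
        return None

det = AThenBDetector(window_ms=600_000)  # "within 10 minutes"
assert det.on_event("u1", "A", 1_000) is None
assert det.on_event("u1", "B", 300_000) == ("u1", 1_000, 300_000)
assert det.on_event("u2", "B", 300_000) is None  # B with no preceding A
```

A production version would also register a timer to expire stale last-seen-A state after window_ms, which this sketch omits.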
Interview probe
"How would you detect account-takeover patterns in real time?" Strong answer: pattern-match on [login-success, IP-change, password-change, payment-added] within N minutes; use MATCH_RECOGNIZE on Flink SQL with a 30-minute watermark; emit to an alert topic; tune false positives with threshold + allow-list.
18. Stream-Table Duality in Depth
Every table is a compressed history of a stream. Every stream is the log of changes to a table. Systems that expose this duality directly (Kafka Streams, Flink, ksqlDB, Debezium + materialized views) let you reason about processing without the batch/stream dichotomy. Senior candidates internalize this until it becomes the natural lens.
Table → Stream (changelog)
Given a table of current state, emit its change log. Tools: CDC from the database's write-ahead log (Debezium on Postgres/MySQL/Oracle), DELETE+INSERT triggers, or a MERGE-with-history pattern. The changelog is itself a topic others can subscribe to.
Stream → Table (materialized view)
Given a changelog stream, maintain a current-state table by replaying with upsert semantics. In Kafka Streams, this is KTable. In Flink, it's the result of SELECT ... GROUP BY key with retractions. The table updates in real time as events arrive.
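The replay itself is just upsert semantics over a keyed log. A plain-Python sketch, assuming the common convention that a None value is a delete tombstone:

```python
# Stream -> table: replay a changelog with last-write-wins upserts.
def materialize(changelog):
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)   # tombstone: delete the key
        else:
            table[key] = value     # upsert: last write wins
    return table

changelog = [("u1", "basic"), ("u2", "premium"), ("u1", "premium"), ("u2", None)]
assert materialize(changelog) == {"u1": "premium"}
```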
The two join semantics
- Stream-table join. Each event on the stream is enriched with the current value of the table. No watermarks needed on the stream side.
- Stream-stream join. Two streams joined within a time window; both need watermarks; late arrivals produce retractions. Substantially more expensive to reason about.
The temporal table join
Sometimes you want "enrich this event with the table state as of the event's timestamp" — not the current value. This is a temporal table join. Flink supports it natively; implementing it by hand requires a versioned table with valid-from / valid-to per row and an as-of subquery.
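The hand-rolled version reduces to a binary search over version timestamps. A plain-Python sketch — the VersionedTable class and its rate values are illustrative, not any library's API:

```python
import bisect

# Versioned table keyed by valid-from timestamp; as_of() answers
# "what was the value for this key at this event's timestamp?"
class VersionedTable:
    def __init__(self):
        self.versions = {}  # key -> sorted list of (valid_from, value)

    def put(self, key, valid_from, value):
        self.versions.setdefault(key, []).append((valid_from, value))
        self.versions[key].sort()

    def as_of(self, key, ts):
        rows = self.versions.get(key, [])
        # last version whose valid_from <= ts (sentinel makes ties inclusive)
        i = bisect.bisect_right(rows, (ts, chr(0x10FFFF)))
        return rows[i - 1][1] if i else None

rates = VersionedTable()
rates.put("EUR", valid_from=100, value="1.05")
rates.put("EUR", valid_from=200, value="1.10")
assert rates.as_of("EUR", 150) == "1.05"   # event at t=150 sees the old rate
assert rates.as_of("EUR", 250) == "1.10"
assert rates.as_of("EUR", 50) is None      # before any version
```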
19. Stateful Functions and Application-Level Patterns
Flink's Stateful Functions and Kafka Streams' Processor API blur the line between stream processing and general-purpose event-driven applications. Three patterns to recognize.
Pattern A — Stateful session accumulation
Keyed by user. State: list of events in current session, last-seen timestamp. Trigger: session window close on 30-minute inactivity. Emit: one summary record per session. Classic streaming sessionization — preferable to SQL sessionization when you need the result within seconds, not the next batch.
Pattern B — The operator-as-state-machine
Keyed by order_id. State: current lifecycle phase (placed, paid, shipped, delivered). Input: milestone events. Transitions: (placed, payment_received) → paid. Invalid transitions emit to a dead-letter side-output. Gives you an accumulating snapshot in streaming form, with explicit invariants.
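The transition table makes the invariants explicit. A plain-Python sketch with a hypothetical order lifecycle; in Flink the state dict would be keyed ValueState and the dead-letter list a side output:

```python
# Valid transitions: (current phase, event) -> next phase.
VALID = {
    ("placed", "payment_received"): "paid",
    ("paid", "shipment_created"): "shipped",
    ("shipped", "delivery_confirmed"): "delivered",
}

def process(events):
    state = {}        # order_id -> current phase
    dead_letter = []  # invalid transitions, for reconciliation
    for order_id, event in events:
        phase = state.get(order_id, "placed")
        nxt = VALID.get((phase, event))
        if nxt is None:
            dead_letter.append((order_id, phase, event))
        else:
            state[order_id] = nxt
    return state, dead_letter

state, dlq = process([
    ("o1", "payment_received"),
    ("o1", "shipment_created"),
    ("o2", "delivery_confirmed"),  # skipped paid/shipped: invalid
])
assert state == {"o1": "shipped"}
assert dlq == [("o2", "placed", "delivery_confirmed")]
```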
Pattern C — Distributed cache with TTL
Keyed by feature_id. State: cached value + expiry. Periodic timer refreshes entries before expiry. Serves as an online feature store that can live alongside the pipeline rather than in a separate system. Useful when the feature set is small enough to fit in state and latency matters more than scale.
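A plain-Python sketch of the idea — here the refresh is modeled lazily on read rather than with a processing-time timer, and the loader callback (hypothetical) stands in for the source of truth:

```python
import time

# Keyed state as a read-through cache with per-entry expiry.
class TTLCache:
    def __init__(self, ttl_s, loader, clock=time.monotonic):
        self.ttl_s, self.loader, self.clock = ttl_s, loader, clock
        self.entries = {}  # key -> (expiry_ts, value)

    def get(self, key):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                        # fresh: serve from state
        value = self.loader(key)                 # expired or missing: refresh
        self.entries[key] = (now + self.ttl_s, value)
        return value

loads = []
cache = TTLCache(ttl_s=60, loader=lambda k: loads.append(k) or f"feat:{k}")
assert cache.get("f1") == "feat:f1"
assert cache.get("f1") == "feat:f1"  # second read served from state
assert loads == ["f1"]               # loader called exactly once
```

The injectable clock keeps the expiry logic testable without sleeping.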
Spark Internals
"Spark is a distributed compiler and memory manager that happens to execute SQL." — if you can hold that sentence and defend it, you're at L5.
This chapter goes past the "Spark has an optimizer" talking points. We walk through what Catalyst actually does rule-by-rule, how Adaptive Query Execution (AQE) re-plans at runtime, how the shuffle service works on disk and over the network, how each join algorithm is chosen with the actual math, how skew is detected and split, and how Tungsten manages memory off-heap with its own binary format.
Contents
- The layered architecture you must carry in your head
- Catalyst: logical plan → optimized plan → physical plan
- The cost model (such as it is) and why it's wrong
- Adaptive Query Execution (AQE): the runtime re-planner
- Shuffle: the disk-and-network truth
- Join algorithms with real math: BHJ, SMJ, SHJ, BNLJ
- Skew: detection, splitting, salting, AQE handling
- Tungsten: off-heap memory, UnsafeRow, codegen
- Memory model: execution, storage, user, reserved
- Partitioning, coalesce, repartition — when each is wrong
- Broadcast internals: why 10 MB default and how to push it
- Pandas UDFs, Arrow, and the JVM↔Python boundary
- Writing the plan diff: a real query walkthrough
- Configuration cheat sheet — what each knob actually does
1. The layered architecture you must carry in your head
When a user submits spark.sql("SELECT ..."), control flows through five distinct layers. Know the layer that owns the problem and you'll skip 90% of the debugging time:
- SQL parser / DataFrame API — produces an unresolved logical plan (tree of LogicalPlan nodes).
- Analyzer — resolves identifiers against the catalog, types, UDFs. Output: resolved logical plan.
- Optimizer (Catalyst) — rule-based rewrites on the logical plan. Output: optimized logical plan.
- Planner — translates logical operators to physical operators (with strategies). Output: physical plan.
- Execution (Tungsten + shuffle + RDD) — whole-stage codegen compiles physical operators into bytecode, shuffle ships data between stages, tasks run on executors.
Debugging rule: if EXPLAIN FORMATTED doesn't show what you expect at layer N, don't go looking at layer N+1. A filter that fails to push down is an optimizer problem, not a shuffle problem.
Use:
df.explain(mode="formatted") # full plan
df.explain(mode="cost")       # plan + cost statistics (post-AQE if enabled)
For even deeper diagnosis:
spark.conf.set("spark.sql.planChangeLog.level", "WARN")
# prints every rule that fires during optimization
2. Catalyst: logical plan → optimized plan → physical plan
Catalyst is a tree-rewriting framework with four passes:
Parser → Unresolved Logical Plan
Analyzer → Resolved Logical Plan (binds column names, checks types)
Optimizer → Optimized Logical Plan (rule-based, equivalent rewrites)
Planner → Physical Plan (chooses strategies: join, aggregate, etc.)
2.1 The tree
A logical plan is an immutable tree of nodes (Project, Filter, Join, Aggregate, LeafNode like Relation). Optimizations are rules — functions LogicalPlan → LogicalPlan that pattern-match on subtrees and return a rewritten tree.
// Simplified Catalyst rule idea
object PushDownFilter extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case Filter(cond, Project(fields, child)) if cond.references.subsetOf(child.outputSet) =>
Project(fields, Filter(cond, child)) // push Filter under Project
}
}
2.2 The rule batches (what actually fires, in order)
The optimizer runs batches of rules, many to fixed point. The production list is long; memorize the greatest hits:
| Batch | Key rules | What it does |
|---|---|---|
| Finish Analysis | ResolveReferences, ResolveSubquery |
resolves names, types |
| Subquery | RewritePredicateSubquery |
correlated subquery → semi/anti join |
| Infer Filters | InferFiltersFromConstraints |
a = b AND a = 5 infers b = 5 |
| Operator Optimizations | PushDownPredicate, ColumnPruning, CombineFilters, ConstantFolding, BooleanSimplification, SimplifyCasts, NullPropagation, EliminateOuterJoin |
the heart of Catalyst |
| Join Reorder | ReorderJoin, CostBasedJoinReorder |
left-deep tree by row-count estimate |
| Decimal Optimizations | DecimalAggregates |
promote decimals for overflow safety |
| LocalRelation | ConvertToLocalRelation |
fold static literals |
Two rules do 80% of the work:
- PushDownPredicate: drives filter push-down through Project, Join, Aggregate, down into the FileSource so Parquet reads the minimum bytes.
- ColumnPruning: keeps only needed columns, eliminating entire column chunks from Parquet/ORC reads.
When they don't fire: UDFs (treated as opaque black boxes for filter push-down; marking a UDF deterministic allows some reordering but does not push it into the source), non-deterministic expressions, window functions (which block push-down past the window), and filters that cast partition columns to incompatible types.
2.3 Analyzing the optimized plan
df = (spark.table("bronze.playback_events")
.filter("event_date = '2026-04-15'")
.filter("user_id = 1234")
.select("title_id", "watch_ms"))
df.explain(mode="formatted")
Optimized logical plan (expected):
== Optimized Logical Plan ==
Project [title_id#1, watch_ms#2]
+- Filter (event_date = 2026-04-15 AND user_id = 1234)
+- Relation bronze.playback_events[...] parquet
Both filters combined, column pruning applied. Now at physical:
== Physical Plan ==
*(1) Project [title_id#1, watch_ms#2]
+- *(1) Filter (user_id = 1234)
+- *(1) ColumnarToRow
+- FileScan parquet bronze.playback_events
PartitionFilters: [event_date = 2026-04-15]
PushedFilters: [EqualTo(user_id, 1234)]
ReadSchema: struct<title_id:bigint,watch_ms:bigint>
The *(1) prefix means whole-stage codegen stage 1. PartitionFilters = Hive/directory-level pruning. PushedFilters = Parquet row-group filter using min/max statistics.
2.4 When plans go wrong: the "why didn't it push down" checklist
In order of likelihood:
- A UDF touched the column (Catalyst treats most UDFs as opaque).
- The column is inside a Window or collect_list boundary.
- A COALESCE(col, 0) = 5 — push-down works only on simple forms; COALESCE blocks it. Rewrite as col = 5 OR (col IS NULL AND 5 = 0).
- The Parquet file has no statistics (written by an old or broken writer).
- Data type mismatch (string column filtered against int) → implicit cast disables push-down.
- You used cache() above the filter; the filter no longer pushes past the cache node.
3. The cost model (such as it is) and why it's wrong
Spark's Cost-Based Optimizer (CBO) uses statistics from ANALYZE TABLE ... COMPUTE STATISTICS stored in the catalog. Disabled by default (spark.sql.cbo.enabled=false).
Key statistics collected:
- Row count, size in bytes
- Per-column: min, max, nullCount, distinctCount (approx, HyperLogLog), avgLen, maxLen
- Histograms (ANALYZE TABLE ... FOR COLUMNS ... WITH HISTOGRAM)
3.1 Why the CBO is a paper tiger
- Stale statistics: nobody re-runs ANALYZE TABLE after every write. The stats you have are from last week.
- Partial statistics: CBO falls back to row-count heuristics when columns lack histograms.
- Bias: join selectivity estimation assumes uniform distribution. Real data is Zipfian (a few users do 100× the activity of the median).
- AQE made it less necessary: at runtime, AQE has real shuffle statistics, which beats any plan-time estimate.
3.2 The two cases where CBO actually earns its keep
- Star schema join reordering: building a left-deep join tree that probes the smallest dimension first. Makes a measurable difference at 10+ joins.
- Choosing broadcast for a reasonably-sized dimension: if the stats say the dimension is 50 MB and spark.sql.autoBroadcastJoinThreshold = 100 MB, CBO chooses BHJ pre-AQE.
3.3 What to run
ANALYZE TABLE silver.dim_title COMPUTE STATISTICS;
ANALYZE TABLE silver.dim_title COMPUTE STATISTICS FOR COLUMNS title_id, genre;
ANALYZE TABLE silver.dim_title COMPUTE STATISTICS FOR ALL COLUMNS;
For partitioned tables you can also do FOR COLUMNS ... PARTITION (event_date='2026-04-15').
4. Adaptive Query Execution (AQE): the runtime re-planner
AQE is the single most important Spark performance feature of the last five years. It flips the model from "compile once, execute" to "compile stage-by-stage, with real statistics from the previous stage's shuffle".
Enabled by default in Spark 3.2+:
spark.sql.adaptive.enabled = true
4.1 What AQE actually does
AQE splits execution at materialization barriers (shuffle, broadcast exchange). After each barrier, it:
- Reads the actual shuffle map output sizes (per partition).
- Re-optimizes the remainder of the plan with those real numbers.
- Executes the next stage.
The three main AQE rules:
| Rule | What it does | Key config |
|---|---|---|
| Coalesce Shuffle Partitions | merges small post-shuffle partitions into fewer, larger ones (reduces task overhead) | spark.sql.adaptive.coalescePartitions.enabled=true, spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB |
| Skew Join Handling | detects skewed partitions, splits them across multiple tasks | spark.sql.adaptive.skewJoin.enabled=true, skewedPartitionFactor=5.0, skewedPartitionThresholdInBytes=256MB |
| Dynamic Join Strategy | converts a planned SMJ to a BHJ at runtime if one side turned out small | spark.sql.adaptive.localShuffleReader.enabled=true |
4.2 The coalesce algorithm
Before AQE: you set spark.sql.shuffle.partitions=200 globally. After a selective filter, you might have 200 tiny partitions of 5 MB each and 200 tasks of overhead. Waste.
AQE algorithm (simplified):
target_bytes = advisory_partition_size_bytes  # e.g. 64 MB
groups, current, current_size = [], [], 0
# Walk partitions in partition-ID order (NOT sorted by size): groups must
# stay contiguous so one reduce task reads a contiguous range of map output.
for p in shuffle_map_output_partitions:
    if current and current_size + p.size > target_bytes:
        groups.append(current)
        current, current_size = [], 0
    current.append(p)
    current_size += p.size
if current:
    groups.append(current)
# Each group becomes one reduce task reading contiguous map output
Result: ~the right number of tasks at the right size, without user tuning.
4.3 The skew handling algorithm
Detection:
median_size = median(partition_sizes)
is_skewed(p) = p.size > median_size * skewedPartitionFactor
AND p.size > skewedPartitionThresholdInBytes
Split: AQE reads the skewed partition in ceil(p.size / advisoryPartitionSizeInBytes) sub-tasks, each doing a streaming join against the full (small) other side.
Trade-off: the "other side" is read once per sub-task — so skew handling only helps when the skewed side is much larger than the other. AQE is conservative: it won't split if the cost is worse.
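The split arithmetic is worth internalizing. A worked example, assuming the default 64 MB advisory size:

```python
import math

# A 2 GiB skewed partition with a 64 MB advisory size splits into 32 sub-tasks.
partition_bytes = 2 * 1024**3
advisory_bytes = 64 * 1024**2

sub_tasks = math.ceil(partition_bytes / advisory_bytes)
assert sub_tasks == 32

# The other side is re-read once per sub-task, i.e. 31 extra scans here —
# which is why AQE only splits when the skewed side dominates that cost.
```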
4.4 Dynamic join strategy conversion
The plan said SMJ, but after filter and aggregation one side's shuffle output is 40 MB (under broadcast threshold). AQE converts to Broadcast Hash Join using the shuffle map output as the broadcast table — cheaper than re-reading the source.
4.5 Verifying AQE fired
In Spark UI, the SQL tab shows the Initial Plan and the Final Plan after AQE. Look for AdaptiveSparkPlan wrapper nodes and CustomShuffleReader with coalesced / skewed indicators.
df.explain(mode="formatted")
# post-execution to see AQE's final plan:
spark.sparkContext.uiWebUrl # browse to SQL tab
4.6 When to turn AQE off (rare)
- Very small queries where re-optimization adds measurable latency (streaming, sub-second).
- Determinism tests where you need plan stability.
- Broken UDFs that crash when the number of partitions changes.
Otherwise: leave it on.
5. Shuffle: the disk-and-network truth
Shuffle is the single biggest cost in any non-trivial Spark job. Understanding exactly what happens buys you a lot of optimization.
5.1 The shuffle lifecycle
Consider a stage boundary like GROUP BY user_id:
[Map Stage] [Shuffle] [Reduce Stage]
Task 1 ──► writes 200 files Block manager + ESS Task 1 ──► reads 1 block
Task 2 ──► writes 200 files advertises locations Task 2 ──► reads 1 block
... to driver ...
Task M ──► writes 200 files Task N ──► reads 1 block
Each map task logically produces N outputs (one per reducer).
Files on disk: legacy hash shuffle materialized M × N files; sort-based shuffle writes M data files + M index files.
Modern Spark uses sort-based shuffle (SortShuffleManager):
- Map task partitions output by partitioner(key) → reducer ID.
- Inserts records into an in-memory buffer, sorted by partition_id (and by key only when map-side aggregation or ordering requires it).
- When buffer full: spills to disk as a sorted file.
- At end of map task: merges all spill files into ONE data file + ONE index file.
- Index file: N entries, each an (offset, length) for reducer R's block.
5.2 The External Shuffle Service (ESS)
By default, map outputs live in the executor's local disk. If the executor dies, reduce tasks can't fetch their blocks — they have to be recomputed. With dynamic allocation, executors die routinely.
The External Shuffle Service (ESS) is a long-running process per NodeManager (or K8s daemonset) that serves shuffle blocks independently of executor lifecycle. Executors write shuffle files locally, but the ESS reads and serves them to reduce tasks.
spark.shuffle.service.enabled = true # enables ESS
spark.dynamicAllocation.enabled = true # now safe to dynamically scale executors
5.3 Push-based shuffle (Magnet — Spark 3.2+)
Problem: the fetch side of shuffle is N reducers × M mappers = random-I/O storm on the ESS disk.
Solution: Magnet pushes mapper output to pre-assigned merger nodes as soon as map tasks finish. The reducer fetches ONE pre-merged block per partition instead of M small blocks.
spark.shuffle.push.enabled = true
spark.shuffle.push.server.mergedShuffleFileManagerImpl = ...
Benefit: sequential reads on ESS, typically 2–3× faster shuffle read for large jobs.
5.4 Shuffle size math
For a job with:
- M map tasks (input partitions)
- N reducer tasks (spark.sql.shuffle.partitions)
- D bytes of shuffled data
Network transfer: D bytes (all must cross the network).
Disk writes (map side): D bytes.
Disk reads (reduce side): D bytes.
Number of fetch connections (without push-based): M × N (tiny messages hurt).
Common mistake: setting spark.sql.shuffle.partitions too high. If D = 10 GB and N = 1000, each partition is 10 MB — fine. If N = 10000, each is 1 MB — task overhead dominates. Target: advisoryPartitionSizeInBytes = 64–128 MB.
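The same arithmetic as runnable Python, using the 10 GB example above (M = 500 is an assumed map-task count for illustration):

```python
# D = 10 GiB shuffled; compare N = 1,000 vs N = 10,000 reduce partitions.
D = 10 * 1024**3   # bytes of shuffled data
M = 500            # map tasks (assumed)

def per_partition_mb(n):
    return D / n / 1024**2

assert round(per_partition_mb(1_000), 2) == 10.24    # ~10 MB each: healthy
assert round(per_partition_mb(10_000), 3) == 1.024   # ~1 MB: overhead dominates

# Fetch connections without push-based shuffle: M × N tiny reads.
assert M * 1_000 == 500_000
```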
5.5 Shuffle on object storage — the Netflix / EMR pattern
On-prem: local NVMe disks + ESS. Cloud: executors are ephemeral, local disks are tiny.
Options:
- EMR on EBS: local but slower; ESS works fine.
- Shuffle plugins that write to S3 (Celeborn, Uniffle, Apache Spark SS on S3): decouples shuffle from executors entirely. Slower per-byte, but enables spot instances / fast scale-down.
- Remote shuffle service (RSS): Apache Celeborn (originally Alibaba's remote shuffle service) and Apache Uniffle (originally Tencent's Firestorm). Now dominant at large shops.
6. Join algorithms with real math: BHJ, SMJ, SHJ, BNLJ
Spark has four join strategies. Catalyst's planner chooses one based on hints, statistics, and sizes. Know the decision tree and its failure modes.
6.1 Decision tree (simplified)
if user hint "broadcast(small)": → BroadcastHashJoin
elif one side < spark.sql.autoBroadcastJoinThreshold (default 10 MB): → BHJ
elif one side fits in memory and has fewer rows (estimated): → ShuffledHashJoin (rare)
elif keys sortable: → SortMergeJoin
else: → BroadcastNestedLoopJoin (correctness last resort)
6.2 Broadcast Hash Join (BHJ)
Cost: O(|large|) per executor. Broadcast cost: driver-collect |small| + fan-out |small| to each executor.
How it works:
- Driver collects small side (collect() over the RDD) into a HashedRelation.
- Each executor joins its slice of the large side by probing the hash table.
- No shuffle.
Math:
- Driver memory: |small|. Rule of thumb: the broadcast table is ~3× larger in driver memory than the raw Parquet size (decompressed, Java object overhead).
- Executor memory: the same |small| on every executor.
- Network: |small| × #executors.
When it breaks:
- Driver OOM at collect().
- spark.driver.maxResultSize (default 1 GB) too small.
- Skewed small side's hash table too big per executor when the threshold is bumped beyond the default 10 MB.
Tuning:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100MB") # raise with care
# Or hint:
from pyspark.sql.functions import broadcast
large.join(broadcast(small), "user_id")
6.3 Sort-Merge Join (SMJ)
Cost: O(|L| log|L| + |R| log|R|) for the shuffle sort; merge itself is O(|L| + |R|).
How it works:
- Both sides shuffled by join key.
- Within each partition, sorted by join key.
- Two pointers walk the sorted streams and emit matches.
Math:
- Memory: bounded (sort-based merge is streaming after the sort).
- Disk: both sides are sorted → spills possible during sort.
- Network: shuffle both sides = |L| + |R|.
Why it's the default for large-large joins: predictable memory behavior, handles arbitrary size.
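The merge phase itself is the classic two-pointer walk; the only subtlety is duplicate keys on both sides, which require emitting the cross product per key group. A plain-Python sketch over pre-sorted inputs:

```python
# Inner merge join of two inputs sorted by key: lists of (key, payload).
def merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Gather the full key group on the right, join each left row with it.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], r[1]) for r in right[j:j_end])
                i += 1
            j = j_end
    return out

L = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
R = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
assert merge_join(L, R) == [(2, "b", "x"), (2, "b", "y"),
                            (2, "c", "x"), (2, "c", "y"), (4, "d", "w")]
```

Only one key group needs to be buffered at a time — the reason SMJ's memory stays bounded after the sort.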
6.4 Shuffled Hash Join (SHJ)
Both sides shuffled, then one side built into an in-memory hash table per partition.
When chosen:
- CBO thinks the build side fits in memory.
- spark.sql.join.preferSortMergeJoin=false (default true in most Spark versions — SHJ is effectively off by default because it can OOM).
Use case: one medium-sized side that fits in executor memory but is too big to broadcast.
6.5 Broadcast Nested-Loop Join (BNLJ)
Fallback for non-equijoin conditions (ON a.x < b.y). Cartesian by default, filtered.
Cost: O(|L| × |R|).
If you see BNLJ in a plan and you didn't intend a cross join, fix the query. Rewrite inequality joins as range joins or pre-filter.
6.6 Join types × strategies — what's supported
Not every strategy supports every join type. INNER, LEFT OUTER, RIGHT OUTER are universal. FULL OUTER needs SMJ (or BHJ with specific conditions). LEFT SEMI / LEFT ANTI are universal. The EXPLAIN tells you what Catalyst picked:
*(3) BroadcastHashJoin [user_id#1], [user_id#10], Inner, BuildRight
*(4) SortMergeJoin [user_id#1], [user_id#10], Inner
*(5) ShuffledHashJoin [user_id#1], [user_id#10], Inner, BuildLeft
7. Skew: detection, splitting, salting, AQE handling
Data skew kills Spark jobs. Here's how to detect and mitigate, in order of sophistication.
7.1 Detecting skew
Symptoms in the Spark UI:
- One task in a stage takes 10× the median duration.
- GC Time column shows 50%+ on the slow task.
- Shuffle Read Size shows a massive outlier partition.
Programmatic check before the job:
from pyspark.sql import functions as F

skew_check = (df.groupBy("join_key")
    .agg(F.count("*").alias("cnt"))   # avoid a column literally named "count"
    .selectExpr("percentile_approx(cnt, 0.50) as p50",
                "percentile_approx(cnt, 0.99) as p99",
                "max(cnt) as max_cnt"))
skew_check.show()
If p99 / p50 > 5 you have skew.
7.2 AQE skew handling (the easy fix)
With spark.sql.adaptive.skewJoin.enabled=true, AQE splits any partition that's:
- > skewedPartitionFactor × median (default 5.0), AND
- > skewedPartitionThresholdInBytes (default 256 MB)
This handles ~80% of skew cases without code changes.
Where it doesn't help: pre-shuffle skew (one input file is enormous), or non-join skew (GROUP BY on a skewed key).
7.3 Salting
When AQE isn't enough — most commonly, aggregations on a skewed key:
from pyspark.sql.functions import col, concat, lit, floor, rand, expr
# Step 1: add a salt to the skewed key on the large side
salt_factor = 100
salted = df.withColumn("salt", floor(rand() * salt_factor)) \
.withColumn("key_salted", concat(col("user_id"), lit("_"), col("salt")))
# Step 2: explode the small side (cross join with salts)
small_exploded = small_df.crossJoin(
spark.range(salt_factor).toDF("salt")
).withColumn("key_salted", concat(col("user_id"), lit("_"), col("salt")))
# Step 3: join on the salted key
joined = salted.join(small_exploded, "key_salted")
Cost: small side grows by salt_factor. Works when small side is truly small.
7.4 Two-stage aggregation (for GROUP BY skew)
from pyspark.sql.functions import floor, rand, sum as sum_  # builtin sum() fails on Columns

# Stage 1: pre-aggregate with salt
stage1 = (df.withColumn("salt", floor(rand() * 100))
    .groupBy("user_id", "salt")
    .agg(sum_("amount").alias("sum_amt")))
# Stage 2: final aggregate without salt
stage2 = stage1.groupBy("user_id").agg(sum_("sum_amt").alias("total"))
Math: stage 1 shuffles with 100 × N distinct keys (uniformly distributed). Stage 2 shuffles a tiny pre-aggregated dataset.
7.5 Isolating the heavy hitters
For extreme skew (one key = 99% of data), split the query:
heavy = df.filter(col("user_id") == "netflix_test_account") # process separately
normal = df.filter(col("user_id") != "netflix_test_account") # normal join
# union at the end
8. Tungsten: off-heap memory, UnsafeRow, codegen
Tungsten is Spark's execution engine rewritten (circa 2015) to eliminate JVM overhead. Three pillars:
- Off-heap memory via sun.misc.Unsafe — Spark manages raw byte arrays outside the JVM heap.
- UnsafeRow — binary row format with direct memory offsets (no Java object per field).
- Whole-stage codegen — generate Java bytecode at runtime that fuses multiple operators into one tight loop.
8.1 UnsafeRow layout
A normal Java Row of 10 fields = 10 boxed objects + header overhead = ~200 bytes. UnsafeRow for the same row = ~80 bytes packed.
+------------------+-----------------+------------------+
| Null bit set | Fixed-width | Variable-length |
| (ceil(N/64) × 8) | (8 bytes each) | (strings, etc.) |
+------------------+-----------------+------------------+
Field access = pointer arithmetic (O(1), no indirection). Enables SIMD-friendly loops.
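The layout can be made concrete with a toy packer. This is an illustrative simplification of the UnsafeRow idea (one bitset word, so ≤ 64 fields; real UnsafeRow also pads variable-length data to 8-byte boundaries):

```python
import struct

# Pack a row as: 8-byte null bitset | 8 bytes per field | var-length data.
# Var fields store (offset << 32 | length) in their fixed-width slot.
def pack_row(values):
    n = len(values)
    var_start = 8 + n * 8            # bitset word + fixed-width region
    null_bits, fixed, var = 0, b"", b""
    for idx, v in enumerate(values):
        if v is None:
            null_bits |= 1 << idx
            fixed += struct.pack("<q", 0)
        elif isinstance(v, int):
            fixed += struct.pack("<q", v)
        else:  # string: pointer into the var-length region
            data = v.encode()
            fixed += struct.pack("<q", ((var_start + len(var)) << 32) | len(data))
            var += data
    return struct.pack("<Q", null_bits) + fixed + var

row = pack_row([1234, None, "drama"])
assert len(row) == 8 + 3 * 8 + 5                      # 37 bytes, vs ~200 boxed
assert struct.unpack_from("<q", row, 8)[0] == 1234    # field 0: O(1) offset math
assert struct.unpack_from("<Q", row, 0)[0] == 0b010   # field 1 is null
```

Reading field i is a single offset computation (8 + i * 8) — no object graph to chase, which is what makes hashing and comparing rows cheap.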
8.2 Whole-stage codegen
For a query like SELECT col1 + col2 FROM t WHERE col3 > 10, Catalyst generates code equivalent to:
while (iter.hasNext()) {
UnsafeRow row = iter.next();
int col3 = row.getInt(2);
if (col3 > 10) {
int col1 = row.getInt(0);
int col2 = row.getInt(1);
int result = col1 + col2;
outputBuffer.putInt(result);
}
}
Instead of an operator tree walking row-by-row with virtual calls, it's one JIT-inlineable hot loop. 2–5× throughput on CPU-bound queries.
Recognize codegen stages in EXPLAIN: operators prefixed with * like *(2) Filter.
8.3 When codegen doesn't kick in
- Operator not supported (some UDFs, complex window specs).
- Generated code too large: Spark falls back when a generated method's bytecode exceeds spark.sql.codegen.hugeMethodLimit (the JVM caps methods at 64 KB of bytecode, and HotSpot stops JIT-compiling far smaller methods).
- Disabled explicitly: spark.sql.codegen.wholeStage=false.
9. Memory model: execution, storage, user, reserved
Spark's unified memory model (since 1.6). Each executor's heap is partitioned as:
Total heap:
┌─────────────────────────────────────────────────────────────┐
│ Reserved (300 MB hardcoded) │
├─────────────────────────────────────────────────────────────┤
│ Spark Memory = (heap - 300MB) × spark.memory.fraction (0.6) │
│ ┌──────────────────────────────┬────────────────────────┐ │
│ │ Execution │ Storage │ │
│ │ (shuffle, joins, aggregates) │ (cache, broadcast) │ │
│ │ │ │ │
│ │ ←── can borrow from Storage ──── can borrow from Exec │ │
│ └──────────────────────────────┴────────────────────────┘ │
│ spark.memory.storageFraction = 0.5 (initial boundary) │
├─────────────────────────────────────────────────────────────┤
│ User Memory = (heap - 300MB) × (1 - 0.6) = 40% default │
│ (user data structures in custom UDFs) │
└─────────────────────────────────────────────────────────────┘
Execution borrows from Storage freely (evicts cached blocks). Storage borrows from Execution only if unused. Execution wins in conflicts.
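The arithmetic above, as a helper you can sanity-check configs with (a sketch; real sizing also depends on off-heap and overhead settings):

```python
def spark_memory_pools(heap_bytes: int,
                       memory_fraction: float = 0.6,
                       storage_fraction: float = 0.5) -> dict:
    """Approximate the unified-memory split for one executor heap.
    Mirrors the diagram: 300 MB reserved, then spark.memory.fraction of the
    remainder for Spark, split between execution and storage by
    spark.memory.storageFraction (a soft, borrowable boundary)."""
    reserved = 300 * 1024 * 1024
    usable = heap_bytes - reserved
    spark_mem = usable * memory_fraction
    return {
        "reserved": reserved,
        "spark_total": int(spark_mem),
        "storage_initial": int(spark_mem * storage_fraction),
        "execution_initial": int(spark_mem * (1 - storage_fraction)),
        "user": int(usable * (1 - memory_fraction)),
    }

pools = spark_memory_pools(8 * 1024**3)   # 8 GB heap
print(pools["spark_total"] // 1024**2)    # ≈ 4735 MB for execution + storage
```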
9.1 Off-heap memory
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size = 4g
Tungsten uses this pool for managed binary data. Adds to executor's container memory request but doesn't contribute to GC pressure. Use when GC is dominant (> 10% of task time).
9.2 The OOM debugging flowchart
When an executor OOMs, check in order:
- Container OOM-killed by YARN/K8s (exit code 137; 143 is SIGTERM, usually preemption, not OOM): total container usage > request. Increase `spark.executor.memoryOverhead` (default max(384 MB, 10% of heap) — often too small for Python/Arrow).
- Java heap OOM: dump the heap and inspect. Usually one of:
- Large broadcast.
- Skewed partition needing to fit in hash table.
- Cached data too large.
- GC overhead OOM: > 98% of time in GC. Almost always skew or over-cached state. Enable off-heap, or raise `spark.memory.fraction` to 0.7.
9.3 memoryOverhead gotcha
Python Pandas UDFs, Arrow, native code (Parquet writer) all use off-heap container memory. If you see Container killed by YARN for exceeding memory limits, your overhead is too low.
Rule of thumb: overhead = 20–30% of executor memory for PySpark jobs with Pandas UDFs.
10. Partitioning, coalesce, repartition — when each is wrong
- `repartition(n)`: full shuffle, round-robin to n partitions. Use when entering a stage and you need more parallelism or a rebalance.
- `repartition(n, col)`: shuffle by hash of `col` to n partitions. Use before a window or GROUP BY if you know the keys are safe.
- `coalesce(n)`: narrow transformation, merges existing partitions WITHOUT a shuffle. `n` must be ≤ current partition count. Use before writing out files.
- `repartitionByRange(n, col)`: sampling + range partitioner for ordered output. Use for sort-merge joins you need to control manually, or for writing out sorted files.
10.1 The classic mistake
```python
df.filter("event_date = '2026-04-15'").coalesce(1).write.parquet(...)
```

You intended one output file. You got: the entire read + filter runs on one executor because `coalesce` propagates upward. This turns a 10-node job into a 1-node job.
Correct:
```python
df.filter("event_date = '2026-04-15'").repartition(1).write.parquet(...)
# OR (better): let AQE coalesce, and use `maxRecordsPerFile`
df.filter(...).write.option("maxRecordsPerFile", 1_000_000).parquet(...)
```

10.2 Writing files — controlling output count
```python
(df.repartition(200, "event_date")
   .sortWithinPartitions("event_date", "user_id")
   .write
   .partitionBy("event_date")
   .option("maxRecordsPerFile", 5_000_000)
   .parquet(path))
```

- `repartition(200, "event_date")`: co-locates all records for a given date in the same task.
- `sortWithinPartitions`: improves Parquet stats and row-group pruning.
- `partitionBy` at write time: one directory per date.
- `maxRecordsPerFile`: bounds each file's size.
11. Broadcast internals: why 10 MB default and how to push it
11.1 The broadcast lifecycle
- Driver `collect()`s the small side into a `HashedRelation`.
- Serializes it (typically with Kryo).
- Publishes it to the `BlockManager` under a `BroadcastBlockId`.
- TorrentBroadcast: executors fetch pieces (~4 MB each) from each other and the driver, BitTorrent-style. Reduces driver egress.
- Each executor caches the broadcast. The join task probes the hash table locally.
11.2 Why 10 MB default
- Driver memory safety: `collect()` must not OOM the driver.
- Network: pushing `|small|` to N executors costs `|small| × N` bytes of driver egress before torrenting kicks in.
- Executor memory: every executor holds a copy.
11.3 Raising the threshold safely
Know your cluster. If:
- Driver has ≥ 8 GB and can comfortably collect the side,
- Executors have ≥ 8 GB,
- There are ≥ 50 executors so the benefit is large,
then:
```python
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "200MB")
```

Also set `spark.driver.maxResultSize` appropriately.
Hint instead of global:
```python
from pyspark.sql.functions import broadcast
fact.join(broadcast(dim), "key")
```

11.4 Broadcast gotchas
- Stale broadcast: a dataset that's borderline 10 MB might grow past the threshold at runtime (old stats). AQE handles this; pre-AQE it fails.
- Collect-time timeout: `spark.sql.broadcastTimeout` (default 300 s). Bump for slow sources.
- Driver OOM on repeated broadcasts: long-running jobs that broadcast inside a loop can retain references.
12. Pandas UDFs, Arrow, and the JVM↔︎Python boundary
A regular Python UDF serializes each row JVM → Python → JVM, one at a time. Throughput: ~10K rows/sec per executor. Terrible.
Pandas UDFs (vectorized UDFs) ship batches of rows via Apache Arrow zero-copy buffers. Throughput: ~1M rows/sec per executor. Essential.
12.1 Types of Pandas UDFs
```python
from typing import Iterator

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# 1. Series → Series (scalar)
@pandas_udf(DoubleType())
def log1p_udf(s: pd.Series) -> pd.Series:
    return pd.Series(np.log1p(s))  # vectorized, no per-element Python loop

# 2. Iterator[Series] → Iterator[Series] (for heavy setup once per worker)
@pandas_udf(DoubleType())
def model_score(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = load_model_once()  # loaded once per Python worker, not per batch
    for s in batches:
        yield pd.Series(model.predict(s.values))

# 3. Grouped map (applyInPandas)
def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["z"] = (pdf["x"] - pdf["x"].mean()) / pdf["x"].std()
    return pdf

df.groupBy("user_id").applyInPandas(
    normalize, schema="user_id long, x double, z double")
```

12.2 The Arrow boundary
JVM executor (scala) Python worker (child process)
+-----------------------+ +------------------------------+
| UnsafeRow batch | socket | Pandas DataFrame |
| → Arrow RecordBatch |===pipe===> | Pandas UDF applies |
| | binary | → Arrow RecordBatch back |
| ← Arrow RecordBatch |<========== | |
+-----------------------+ +------------------------------+
Arrow's columnar representation maps directly to Pandas columns (zero copy for numeric types, small copy for strings).
Config:
spark.sql.execution.arrow.pyspark.enabled = true
spark.sql.execution.arrow.maxRecordsPerBatch = 10000 # tune for memory
12.3 Pandas UDF gotchas
- Memory overhead: each Python worker holds a Pandas DataFrame in memory. If `maxRecordsPerBatch=10000` and rows are 1 KB, that's 10 MB per worker. Multiply by `spark.executor.cores` — it adds up. Increase `spark.executor.memoryOverhead` accordingly.
- Python object columns: strings become Python `str` (object dtype in Pandas 1.x, StringArray in 2.x). `.map()` over them reverts to a Python loop → defeats the vectorization.
- Schema mismatch: returning a DataFrame with a different column order than the declared schema silently produces wrong data. Use `applyInPandas` with an explicit schema and test it.
13. Writing the plan diff: a real query walkthrough
Consider the Netflix-shaped query: "For last 7 days, top 10 titles per country by completions".
```python
completions = (spark.table("silver.fact_playback_completion")
               .filter("event_date >= current_date() - INTERVAL 7 DAYS"))
dim_title = spark.table("silver.dim_title")
dim_country = spark.table("silver.dim_country")

joined = (completions
          .join(dim_title, "title_id")
          .join(dim_country, "country_id"))

from pyspark.sql import Window
from pyspark.sql.functions import count, row_number, col

w = Window.partitionBy("country_name").orderBy(col("completions").desc())
top10 = (joined
         .groupBy("country_name", "title_id", "title_name")
         .agg(count("*").alias("completions"))
         .withColumn("rn", row_number().over(w))
         .filter("rn <= 10"))
top10.explain(mode="formatted")
```

Expected optimized plan:
== Optimized Logical Plan ==
Filter (rn <= 10)
+- Window [row_number() OVER (PARTITION BY country_name ORDER BY completions DESC)]
+- Aggregate [country_name, title_id, title_name], [count(1) AS completions]
+- Project [country_name, title_id, title_name]
+- Join Inner, country_id = country_id
:- Join Inner, title_id = title_id
: :- Filter (event_date >= date_sub(current_date(), 7))
: : +- Relation silver.fact_playback_completion ...
: +- Relation silver.dim_title ...
+- Relation silver.dim_country ...
Expected physical plan (with AQE):
== Physical Plan ==
AdaptiveSparkPlan (isFinalPlan=true)
+- *(6) Filter (rn#5 <= 10)
+- Window [row_number() windowspecdefinition(country_name#6, completions#7 DESC ...) AS rn#5]
+- *(5) Sort [country_name#6 ASC, completions#7 DESC]
+- ShuffleQueryStage (coalesced from 200 to 50)
+- Exchange hashpartitioning(country_name#6, 200)
+- *(4) HashAggregate(keys=[country_name#6, title_id#2, title_name#3], functions=[count(1)])
+- Exchange hashpartitioning(country_name#6, title_id#2, title_name#3, 200)
+- *(3) HashAggregate(keys=[country_name#6, title_id#2, title_name#3], functions=[partial_count(1)])
+- *(3) Project [country_name#6, title_id#2, title_name#3]
+- *(3) BroadcastHashJoin [country_id#4], [country_id#8], Inner
:- *(3) Project [country_id#4, title_id#2, title_name#3]
: +- *(3) BroadcastHashJoin [title_id#1], [title_id#2], Inner
: :- *(3) Filter isnotnull(title_id#1)
: : +- *(3) ColumnarToRow
: : +- FileScan silver.fact_playback_completion
: : PartitionFilters: [event_date >= 2026-04-12]
: : PushedFilters: [IsNotNull(title_id)]
: +- BroadcastExchange
: +- *(1) ColumnarToRow
: +- FileScan silver.dim_title
+- BroadcastExchange
+- *(2) ColumnarToRow
+- FileScan silver.dim_country
Reading the plan top to bottom:
- FileScan fact_playback_completion with `PartitionFilters` — directory pruning to last 7 days. Catalyst pushed the date filter to the partition key.
- BroadcastExchange dim_title and BroadcastExchange dim_country — both dims are small, BHJ chosen.
- Partial aggregate (*(3)) per input partition: combines rows locally.
- Exchange hashpartitioning on the group keys: 200 partitions (planned).
- ShuffleQueryStage (coalesced from 200 to 50) — AQE merged post-shuffle partitions.
- Final HashAggregate.
- Exchange by `country_name` for the window.
- Sort + Window for row_number.
- Filter rn <= 10.
Failure modes to know:
- `dim_title` grew over the broadcast threshold → planner switches to SMJ → shuffle cost doubles. Watch it.
- If the `event_date` filter wraps the partition column in an expression, e.g. `date_trunc('day', event_date) >= ...`, the partition filter doesn't push; you read all 7 × N files.
- If the aggregate keys are skewed, the final `HashAggregate` becomes the long tail.
14. Configuration cheat sheet — what each knob actually does
Only the ones worth knowing.
14.1 Shuffle & AQE
| Config | Default | What it does |
|---|---|---|
| `spark.sql.shuffle.partitions` | 200 | Post-shuffle partitions. With AQE enabled, AQE coalesces — set this high (e.g. 1000) and let AQE merge. |
| `spark.sql.adaptive.enabled` | true (3.2+) | Master AQE switch. |
| `spark.sql.adaptive.advisoryPartitionSizeInBytes` | 64MB | Target size for coalesced partitions. Raise to 128–256 MB on large clusters. |
| `spark.sql.adaptive.coalescePartitions.enabled` | true | Enable partition coalescing. |
| `spark.sql.adaptive.skewJoin.enabled` | true | Detect and split skewed join partitions. |
| `spark.sql.adaptive.skewJoin.skewedPartitionFactor` | 5.0 | Partition is "skewed" if > factor × median. |
| `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes` | 256MB | And absolute size > this. |
14.2 Join & broadcast
| Config | Default | What it does |
|---|---|---|
| `spark.sql.autoBroadcastJoinThreshold` | 10MB | Side size below which BHJ is auto-selected. |
| `spark.sql.broadcastTimeout` | 300s | collect() timeout for broadcast. |
| `spark.sql.join.preferSortMergeJoin` | true | Prefer SMJ over SHJ when both are valid. Leave true. |
14.3 Memory
| Config | Default | What it does |
|---|---|---|
| `spark.executor.memory` | — | Heap per executor. |
| `spark.executor.memoryOverhead` | max(384MB, 10% of heap) | Off-heap container memory. Bump for PySpark/Arrow/RocksDB. |
| `spark.memory.fraction` | 0.6 | Fraction of (heap − 300 MB) for Spark. |
| `spark.memory.storageFraction` | 0.5 | Initial storage:execution ratio. |
| `spark.memory.offHeap.enabled` | false | Use off-heap for Tungsten. |
| `spark.memory.offHeap.size` | 0 | Size of off-heap pool. |
14.4 Execution
| Config | Default | What it does |
|---|---|---|
| `spark.sql.files.maxPartitionBytes` | 128MB | Target size of each input split from file sources. |
| `spark.sql.files.openCostInBytes` | 4MB | Overhead cost of opening a file; favors combining small files. |
| `spark.sql.execution.arrow.pyspark.enabled` | true | Arrow for Pandas UDFs and toPandas(). |
| `spark.sql.execution.arrow.maxRecordsPerBatch` | 10000 | Rows per Arrow batch. |
| `spark.sql.codegen.wholeStage` | true | Whole-stage codegen. Turn off only to debug plans. |
14.5 Dynamic allocation
| Config | Default | What it does |
|---|---|---|
| `spark.dynamicAllocation.enabled` | false | Scale executors in/out with load. |
| `spark.dynamicAllocation.minExecutors` | 0 | Lower bound. |
| `spark.dynamicAllocation.maxExecutors` | ∞ | Upper bound. |
| `spark.dynamicAllocation.executorIdleTimeout` | 60s | Release an idle executor after this. |
| `spark.shuffle.service.enabled` | false | Required for dynamic allocation (unless push-based shuffle or decommissioning). |
Closing framework
Spark problems almost always sit in one of six buckets:
- Plan is wrong: filter didn't push down, join is wrong strategy. → Fix query, add hints, update stats.
- Partition count is wrong: too many tiny tasks or too few giant ones. → Let AQE coalesce; set advisory bytes.
- Skew: one task is the long tail. → AQE skew handling, salting, split heavy hitters.
- Shuffle is too big: try to eliminate it (broadcast), reduce it (pre-aggregate), or move it (RSS / push-based shuffle).
- Memory: OOM, spill, GC thrash. → off-heap, overhead, reduce broadcast, check caching.
- Driver: driver collecting too much. → `maxResultSize`, avoid `collect()`, use `toLocalIterator`.
That framing survives every interview question and every 3am page.
17. Tungsten and the Off-Heap Memory Model
Before Tungsten (Spark 1.4, matured in 1.6+), Spark ran as a "normal" JVM application: Java objects everywhere, GC pauses ruled tail latency. Tungsten reshapes Spark into something closer to a native database engine — sun.misc.Unsafe memory, whole-stage code generation, cache-conscious binary layouts. Senior candidates know why this matters.
The three pillars
- Memory management outside the heap. UnsafeRow format: binary, pointer arithmetic, no Java object headers. Eliminates the object-per-row overhead; a 10-column row as Java objects is ~200 bytes, as an UnsafeRow ~80.
- Cache-aware computation. Operators produce data in patterns that fit the CPU cache line (64 bytes). Sort-merge joins, aggregations and shuffles are written to minimize cache misses.
- Whole-stage code generation. The Catalyst physical plan compiles multiple operators into a single tight Java method at runtime. Virtual dispatch disappears; the JIT can inline everything. Order-of-magnitude speedups on compute-bound microbenchmarks; 2–5× is more typical end to end.
Why interviews probe this
When a candidate says "Spark is slow on a compute-bound job," the interviewer wants to hear: is the whole-stage codegen fallback kicking in? Some operators (complex UDFs, certain UDAFs) disable codegen for their whole stage. You can see this in df.explain(true) — look for * prefixes on operators. Missing stars = missing codegen = 10x slower for no good reason.
The off-heap memory debugging checklist
- `spark.memory.offHeap.enabled=true` and `spark.memory.offHeap.size` — required for Tungsten to run truly off-heap.
- OOMs that say "Container killed by YARN for exceeding memory limits" — Tungsten's memory counts, but the OS container is still bounded. Raise `spark.executor.memoryOverhead`.
- Aggregation falls back to sort-based and spills to disk when the hash table exceeds the executor's execution memory pool. Spills tank performance. Either raise memory or increase shuffle partitions so each task's hash table shrinks.
18. Dynamic Allocation and Elastic Clusters
Dynamic allocation adds and removes executors during a job's lifetime based on pending-task pressure. Done right, it's the cheapest compute model Spark offers. Done wrong, it's a source of intermittent timeouts and surprise costs.
The mechanics
Spark tracks pending tasks per stage. When tasks wait in the queue longer than spark.dynamicAllocation.schedulerBacklogTimeout, new executors are requested. When executors sit idle for spark.dynamicAllocation.executorIdleTimeout, they're released. The cluster manager (YARN, Kubernetes, Databricks) actually provisions or releases the underlying nodes.
The shuffle-data-loss trap
Default behavior: when an executor is released, its shuffle output on local disk is also lost. If a later stage needs that shuffle data, it must recompute the upstream stage. For long-running jobs with expensive shuffles, this can double wall-clock time. Fix: enable the external shuffle service (on YARN) or persistent volumes (on Kubernetes). Without one of these, dynamic allocation is not safe to enable.
When it's the wrong choice
- Jobs under ~2 minutes total. Executor startup dominates; static allocation is faster.
- Streaming jobs. Structured Streaming supports dynamic allocation poorly — back-pressure oscillations trigger thrashing.
- Cost-controlled environments with hard per-job budgets. Dynamic scaling makes cost non-deterministic.
19. Photon and Native Accelerators
Photon (Databricks) and similar native engines (Velox under Presto/Trino, RAPIDS for GPU) replace Spark's Java execution with C++ or CUDA kernels. For the right workloads, 2–3x faster at similar hardware cost; for the wrong ones, neutral or slightly worse.
What Photon accelerates well
- Columnar scans with predicate pushdown. Native code is 3–5x faster than whole-stage-codegen Java.
- Hash aggregations and hash joins on primitive types. 2–3x.
- Certain window functions with simple frame clauses.
What it doesn't help
- Python UDFs. Can't cross into Photon; falls back to Java path with full data serialization penalty.
- Complex type operations (Map, Struct nested access) that aren't yet supported natively fall back.
- Shuffle-bound jobs. Photon's gains are per-operator; if 80% of wall-clock is shuffle, Photon moves the needle 5%.
The interview question
"You enabled Photon and cost went down 40% but also one of your dashboards broke. Why?" Possible answer: Photon returns slightly different results for edge cases (e.g., certain cast overflow behaviors, NULL semantics in specific functions). You need regression tests against both engines before flipping the switch on production.
20. PySpark vs Scala — Where the Performance Actually Goes
The lore says "Scala is faster." The truth is more nuanced: PySpark's DataFrame API runs the same JVM code as Scala — the Python client is just a thin wrapper issuing Catalyst plans. Where Python actually loses is in three specific places.
Where PySpark equals Scala
- Any pure DataFrame operation: joins, filters, aggregations, window functions. The entire execution happens in JVM; Python is not on the hot path.
- SQL strings. `spark.sql("SELECT …")` from PySpark is indistinguishable from the same call from Scala.
Where Python is slower — and by how much
- Python UDFs (non-vectorized). Each row is serialized to Python, function called, result deserialized. 10–30x slower than an equivalent expression in native Spark SQL. Avoid when possible; replace with built-in functions or SQL expressions.
- Pandas UDFs (vectorized). Batches of rows ship to Python as Arrow tables; function processes them in one call. 3–10x faster than row-UDFs, within 2x of native Scala. The pragmatic default when you must use Python logic.
- Driver-side orchestration with many collects. Python's overhead for orchestrating Spark actions is ~5–10 ms per call. A loop of 1,000 small `collect()`s wastes 5–10 seconds in Python glue. Fix: build the query, execute once.
Decision guidance
- DataFrame + SQL only: pick whichever language your team writes better.
- Needs Python libraries (ML, NLP, scipy): PySpark with Pandas UDFs.
- Mission-critical tight loops with custom logic on >1 TB: Scala wins modestly.
- Team owns both: standardize on PySpark unless there's a specific reason. Hiring is easier.
SQL Deep Dive
"SQL is a query language, an optimization problem, a data model, and a contract — all at once. Interviewers at this level want to see you've thought about all four."
This chapter goes past tricks and into the engine. We derive logical-to-physical translation, walk through each join algorithm with complexity, explain what indexes and zone maps actually do on disk, pull window functions apart by their framing math, and finish with the advanced patterns that show up at L5: gaps-and-islands, sessionization without windows, as-of (point-in-time) joins, and probabilistic sketches.
Contents
- The mental model: SQL is declarative; engines are procedural
- Logical plan processing order (the one the spec promises)
- Physical plan processing order (what actually happens)
- Join algorithms with complexity math
- Indexes, zone maps, Bloom filters: what prunes and when
- Subquery types and how they translate to joins
- Window functions: frames, partitions, and performance
- Gaps and islands — the four canonical variants
- Sessionization without windows (Flink-style in SQL)
- As-of joins / point-in-time joins
- Sketches: HyperLogLog, Theta, Bloom, t-digest
- Anti-patterns that kill query plans
- Query tuning workflow
1. The mental model: SQL is declarative; engines are procedural
Write what you want. The engine figures out how. Simple to say, but the gap between the two is where all the hard problems live:
- Query planner chooses join order, strategy, index usage.
- Storage layout (row vs columnar, partitions, zone maps) decides what bytes are read.
- Execution engine (row-at-a-time vs vectorized, pipelined vs materialized) decides how fast rows flow.
At L5 you should be able to look at any query and predict: what will the plan look like? What will cost money? What happens when data doubles?
The three questions you always ask:
- What's the driving table? The biggest one — everything else is joined to it.
- Where do filters apply? At scan time (push-down), or after?
- Where do shuffles happen? Between which operators, on which key?
2. Logical plan processing order (the one the spec promises)
The SQL standard defines the logical processing order of a SELECT, which determines name visibility and semantics:
1. FROM (+ JOINs)
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT (expressions + window functions)
6. DISTINCT
7. ORDER BY
8. LIMIT / OFFSET
Two consequences you must carry:
- A column alias defined in SELECT is not visible in WHERE, because WHERE is logically processed first. (Some engines like Snowflake relax this; portable code doesn't rely on it.)
- Aggregates resolve at step 3, so `HAVING` is how you filter on an aggregate; `WHERE` can't reference an aggregate directly.
```sql
-- Wrong (not portable):
SELECT user_id, COUNT(*) AS cnt
FROM events
WHERE cnt > 5          -- ERROR: cnt doesn't exist yet at WHERE
GROUP BY user_id;

-- Right:
SELECT user_id, COUNT(*) AS cnt
FROM events
GROUP BY user_id
HAVING COUNT(*) > 5;
```

2.1 The quiet reorderings
The logical order is what the query means; it's not what the engine does. The optimizer is free to reorder any way it wants as long as semantics are preserved. Predicate push-down through JOIN is one such reordering: the logical WHERE runs "after" the FROM+JOIN, but the engine may push the predicate into the scan under the join.
3. Physical plan processing order (what actually happens)
For this query on Postgres-like:
SELECT u.country, COUNT(*) AS plays
FROM fact_plays p
JOIN dim_user u ON p.user_id = u.user_id
WHERE p.event_date = '2026-04-15'
AND u.signup_country_id = 42
GROUP BY u.country
HAVING COUNT(*) > 100
ORDER BY plays DESC
LIMIT 10;

The physical plan Postgres will likely choose:
1. Index Scan on dim_user where signup_country_id = 42 -- cheap, produces small set
2. Hash Join -- build hash on dim_user subset
- Right input: Seq Scan on fact_plays with event_date filter (partitioned index)
3. HashAggregate by country
4. Filter HAVING count > 100
5. Sort plays DESC
6. Limit 10
But note step 1: the small, selective side becomes the build side. That's the first thing most planners do. In Spark/Snowflake the analog is broadcasting the small dimension.
3.1 Reading EXPLAIN from any engine
- Postgres: `EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...`. `Buffers: shared hit=X read=Y` tells you the buffer-cache vs disk split.
- Snowflake: `EXPLAIN USING TEXT SELECT ...`. Use the profile view in the UI for runtime stats (partitions pruned, bytes scanned).
- BigQuery: query plan in the UI; a dry run shows bytes billed before execution.
- Spark: `df.explain(mode="formatted")` (see chapter 04).
Common elements across engines:
- Scan (Seq / Index / Partitioned): reads the base table.
- Filter: applies predicate row-by-row (if not pushed down).
- Join (Hash / Merge / Nested Loop): combines inputs.
- Aggregate (Hash / Stream / Sort): groups.
- Sort: explicit order.
- Exchange / Shuffle (distributed engines): network redistribution by key.
4. Join algorithms with complexity math
| Algorithm | Preconditions | Build | Probe | Total | Memory |
|---|---|---|---|---|---|
| Nested Loop | any condition | — | O(N·M) | O(N·M) | O(1) |
| Hash Join | equijoin | O(N) build hash | O(M) probe | O(N+M) | O(N) |
| Sort-Merge Join | equijoin, both sorted | O(N log N + M log M) sort | O(N+M) merge | O((N+M) log (N+M)) | O(1) streaming |
| Index Nested Loop | index on inner | — | O(M · log N) | O(M · log N) | O(1) |
| Zone Map Join | partitioned / clustered | — | O(M + pruned_N) | depends on pruning | O(1) |
4.1 Nested Loop Join
For every row in outer, scan every row in inner. Only reasonable when:
- Inner is tiny (< 100 rows).
- Condition is a non-equijoin (`BETWEEN`, `<`, `>`).
- You have an index-supporting predicate that turns it into an Index Nested Loop.
In Spark: BroadcastNestedLoopJoin — broadcast one side, scan the other with filter. Only for inequality/Cartesian conditions.
4.2 Hash Join
Build side is the smaller (by row count or estimated size). Hash table built in memory; probe side streamed through.
Memory requirement: rows × (hash_entry_size) where hash_entry_size ≈ key + pointer + row or row-reference. Typical: 50–200 bytes/row. For 10M rows, 1–2 GB memory.
If the build side doesn't fit, engines fall back to:
- Grace Hash Join: partition both sides by hash mod k, spill to disk, then hash-join each partition pair. Memory = O(|partition|).
- Hybrid Hash Join: keeps first partition in memory, spills others.
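The grace hash join's partition-then-join idea fits in a few lines. An in-memory sketch — a real engine writes each partition to disk and joins one pair at a time:

```python
from collections import defaultdict

def grace_hash_join(left, right, k=4):
    """Partition both inputs by hash(key) % k, then hash-join each partition
    pair independently. Matching keys always land in the same partition, so
    each pair needs only O(|partition|) memory instead of O(|left|)."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for key, val in left:
        left_parts[hash(key) % k].append((key, val))
    for key, val in right:
        right_parts[hash(key) % k].append((key, val))

    out = []
    for p in range(k):
        build = defaultdict(list)            # classic hash join per partition
        for key, lval in left_parts[p]:
            build[key].append(lval)
        for key, rval in right_parts[p]:
            for lval in build.get(key, []):
                out.append((key, lval, rval))
    return out

left = [(1, "a"), (2, "b"), (2, "c")]
right = [(2, "x"), (3, "y")]
print(sorted(grace_hash_join(left, right)))  # [(2, 'b', 'x'), (2, 'c', 'x')]
```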
4.3 Sort-Merge Join
When both sides are large and hash tables don't fit. Both sides sorted, then merged by walking two pointers.
Strong suit: supports range joins (if you partition-by-range) and handles arbitrary size.
Weakness: sort cost. Engines avoid it unless at least one side is already sorted (clustered index, clustering key in Snowflake, Z-ORDER in Delta).
4.4 Choosing a strategy (mental shortcut)
if small.rows × row_size < memory_budget: hash join (build on small)
elif no equijoin: nested loop (beware Cartesian)
elif both very large: sort-merge
elif index exists on inner.join_key: index nested loop
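The same shortcut as a runnable function (thresholds and inputs are illustrative, not any engine's actual cost model):

```python
def choose_join_strategy(small_rows: int, row_size: int, memory_budget: int,
                         equijoin: bool, both_large: bool,
                         inner_has_index: bool) -> str:
    """Mirror of the mental shortcut above: hash join when the build side
    fits in memory, nested loop when there is no equality condition,
    sort-merge for two large sides, index nested loop when the inner side
    has an index on the join key."""
    if not equijoin:
        return "nested loop"                       # beware Cartesian blowup
    if small_rows * row_size < memory_budget:
        return "hash join (build on small side)"
    if both_large:
        return "sort-merge"
    if inner_has_index:
        return "index nested loop"
    return "sort-merge"                            # safe fallback

# A 10M-row dim at ~150 B/row needs ~1.5 GB — fits a 4 GB budget → hash join
print(choose_join_strategy(10_000_000, 150, 4 * 1024**3, True, False, False))
```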
5. Indexes, zone maps, Bloom filters: what prunes and when
5.1 B-tree indexes (OLTP default)
A B-tree of (key, row pointer). Lookup: O(log N). Insert/delete: O(log N) with node splits/merges. Each non-leaf page fanout ≈ 200–500 on a 4–8KB page → 4 levels cover ~10^8 rows.
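The depth claim is one log away (a sketch; real engines also keep the upper levels pinned in the buffer cache, so a point lookup usually touches disk once):

```python
import math

def btree_levels(num_rows: int, fanout: int = 300) -> int:
    """Smallest number of levels (root included) such that
    fanout ** levels >= num_rows."""
    return max(1, math.ceil(math.log(num_rows, fanout)))

print(btree_levels(10**8))  # 4 — matches the "4 levels cover ~10^8 rows" claim
```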
Covering index: includes all columns needed for the query → the query is satisfied from the index alone, no table lookup.
```sql
CREATE INDEX idx_plays_user_date ON fact_plays (user_id, event_date)
INCLUDE (watch_ms);
-- SELECT watch_ms FROM fact_plays WHERE user_id = 1234 AND event_date = '2026-04-15'
-- satisfied entirely from the index.
```

5.2 Hash indexes (in-memory dbs, Postgres for exact match)
O(1) exact lookup. Can't support range scans. Rarely first choice.
5.3 Bitmap indexes (low-cardinality)
Column with few distinct values (gender, country, status). One bitmap per value, each bitmap has one bit per row. AND/OR between bitmaps is extremely cheap.
Used in: Oracle, DuckDB (generated on-the-fly), warehouse columnar engines (implicit).
5.4 GIN / GiST / BRIN indexes
- GIN: inverted index for JSONB, array, full-text (each distinct term → postings list of row IDs).
- GiST: generalized search tree for geospatial data, range types, and other custom orderings.
- BRIN (Block Range INdex): min/max per block range. Tiny, for large time-ordered tables. O(1) space per block. ≈ what a zone map does.
5.5 Zone maps / min-max statistics (columnar warehouses)
Every Parquet row group has min/max per column. Every Snowflake micro-partition, every Delta file, every Iceberg manifest entry — stores them.
For a query WHERE event_date = '2026-04-15', the engine reads min/max per row group and skips any row group whose [min, max] doesn't include '2026-04-15'.
Zone maps work best on clustered/sorted data. Random writes ruin them: every block has the full range of values → no pruning.
Fix: cluster on the column (Snowflake CLUSTER BY, Iceberg sort, Delta Z-ORDER).
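A zone map is small enough to simulate in a few lines. The sketch below shows why clustering matters: disjoint per-block ranges prune, overlapping ones don't:

```python
def prune_row_groups(zone_maps, lo, hi):
    """Keep only row groups whose [min, max] overlaps [lo, hi].
    This is all a zone map is: one (min, max) pair per block,
    checked before any data bytes are read."""
    return [i for i, (mn, mx) in enumerate(zone_maps)
            if not (mx < lo or mn > hi)]

# Clustered on the column: disjoint ranges → most groups skipped.
clustered = [(1, 100), (101, 200), (201, 300), (301, 400)]
print(prune_row_groups(clustered, 150, 160))   # [1] — 3 of 4 groups skipped

# Randomly written: every group spans the full range → nothing skipped.
shuffled = [(1, 400), (2, 399), (1, 398), (3, 400)]
print(prune_row_groups(shuffled, 150, 160))    # [0, 1, 2, 3]
```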
5.6 Bloom filters
Probabilistic set membership. Configurable false-positive rate, zero false-negatives. Huge win for point-lookup queries on large tables.
Formula: for n elements and desired FP rate p:
- Bits: m = −n × ln(p) / (ln 2)²
- Hashes: k = (m/n) × ln 2
Example: 10M elements, 1% FP → m ≈ 96 Mbit ≈ 12 MB. Stored per Parquet row group or per Iceberg data file.
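The sizing formulas as code, reproducing the example above:

```python
import math

def bloom_params(n: int, p: float) -> tuple[int, int]:
    """Optimal Bloom filter size (bits) and hash count for n elements
    at false-positive rate p, per the formulas above."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k

m, k = bloom_params(10_000_000, 0.01)
print(round(m / 8 / 1024**2, 1), k)   # ≈ 11.4 MiB, 7 hash functions
```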
Parquet supports Bloom filters since 1.12. Spark writes them with parquet.bloom.filter.enabled=true. Iceberg has them as a column hint:
```sql
ALTER TABLE silver.fact_playback_session
SET TBLPROPERTIES ('write.parquet.bloom-filter-enabled.column.user_id' = 'true');
```

When the filter says "maybe" you read the block. When it says "definitely not" you skip. On highly selective queries with non-clustered keys, 10–100× speedup.
5.7 The pruning checklist
Your query is slow, but the data is partitioned/clustered. What pruned and what didn't?
- Partition pruning (directory level): did the planner push the partition filter? Check `PartitionFilters` in the plan.
- Zone-map pruning (file/row-group level): do the min/max stats match the filter? Requires clustering.
- Bloom filter pruning: configured for the column?
- Column pruning (columns not selected): only the needed columns are read.
Each one is a multiplicative win.
6. Subquery types and how they translate to joins
6.1 Scalar subquery (one row, one column)
```sql
SELECT user_id,
       (SELECT MAX(event_ts) FROM sessions WHERE user_id = u.user_id) AS last_seen
FROM dim_user u;
```

Naive execution: for each user, run the inner query. O(N × cost_inner).
Optimizer rewrite (good engines): correlated subquery → left outer join with aggregate.
SELECT u.user_id, s.last_seen
FROM dim_user u
LEFT JOIN (SELECT user_id, MAX(event_ts) AS last_seen FROM sessions GROUP BY user_id) s
ON u.user_id = s.user_id;

6.2 IN / EXISTS (existence)
SELECT * FROM dim_user WHERE user_id IN (SELECT user_id FROM sessions WHERE event_date >= '2026-04-01');
-- Rewritten to:
SELECT u.*
FROM dim_user u
SEMI JOIN (SELECT DISTINCT user_id FROM sessions WHERE event_date >= '2026-04-01') s
ON u.user_id = s.user_id;

SEMI JOIN returns rows from the left side that have a match in the right, without duplicating rows or returning right-side columns.
6.3 NOT IN / NOT EXISTS (anti)
Critical NULL pitfall:
SELECT * FROM dim_user WHERE user_id NOT IN (SELECT user_id FROM blocked_users);

If blocked_users.user_id has ANY NULL, the whole clause returns the empty set (3-valued logic: `x NOT IN (... NULL ...)` → unknown).
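The pitfall reproduces in any engine with standard three-valued logic — here with Python's stdlib sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_user (user_id INT);
    CREATE TABLE blocked_users (user_id INT);
    INSERT INTO dim_user VALUES (1), (2), (3);
    INSERT INTO blocked_users VALUES (2), (NULL);  -- one NULL poisons NOT IN
""")

rows = con.execute(
    "SELECT user_id FROM dim_user "
    "WHERE user_id NOT IN (SELECT user_id FROM blocked_users)").fetchall()
print(rows)  # [] — every comparison against NULL is 'unknown', so no row qualifies
```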
Always prefer NOT EXISTS:
SELECT u.* FROM dim_user u
WHERE NOT EXISTS (SELECT 1 FROM blocked_users b WHERE b.user_id = u.user_id);

6.4 Lateral / CROSS APPLY (correlated row-returning)
SELECT u.user_id, top3.title_id
FROM dim_user u,
LATERAL (
SELECT title_id
FROM fact_plays p
WHERE p.user_id = u.user_id
ORDER BY play_count DESC
LIMIT 3
) top3;
Per-row subquery that returns multiple rows. Executes per outer row (sometimes optimized to a window function). Keep cardinality in check.
7. Window functions: frames, partitions, and performance
Anatomy:
function(args) OVER (
[PARTITION BY partition_cols]
[ORDER BY order_cols]
[ROWS | RANGE | GROUPS frame_spec]
)
7.1 Frame types
- ROWS: physical rows. ROWS BETWEEN 2 PRECEDING AND CURRENT ROW = exactly 3 rows (fewer at the start of the partition).
- RANGE: logical range over ordered values. RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW uses the ORDER BY value to scope the frame.
- GROUPS: peer groups (rows with equal ORDER BY values). GROUPS BETWEEN 1 PRECEDING AND CURRENT ROW = this group + the previous group.
7.2 Function families
- Ranking: ROW_NUMBER, RANK, DENSE_RANK, NTILE, PERCENT_RANK, CUME_DIST.
- Analytic: LAG, LEAD, FIRST_VALUE, LAST_VALUE, NTH_VALUE.
- Aggregates over windows: SUM, AVG, MIN, MAX, COUNT, and more.
7.3 The default frame trap
-- Default frame with ORDER BY is:
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
SELECT user_id, event_ts, SUM(amount) OVER (PARTITION BY user_id ORDER BY event_ts) AS running_total
FROM fact_charges;
Without ORDER BY: default frame is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Different answer.
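The difference is easy to see with Python's bundled SQLite (window functions need SQLite 3.25+, which ships with Python 3.11+). The table and amounts are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_charges (user_id INTEGER, event_ts INTEGER, amount INTEGER);
INSERT INTO fact_charges VALUES (1, 1, 10), (1, 2, 20), (1, 3, 30);
""")

# With ORDER BY: default frame ends at CURRENT ROW -> running total.
running = [r[0] for r in conn.execute(
    "SELECT SUM(amount) OVER (PARTITION BY user_id ORDER BY event_ts) "
    "FROM fact_charges ORDER BY event_ts")]

# Without ORDER BY: frame covers the whole partition -> grand total on every row.
total = [r[0] for r in conn.execute(
    "SELECT SUM(amount) OVER (PARTITION BY user_id) "
    "FROM fact_charges ORDER BY event_ts")]

print(running)  # [10, 30, 60]
print(total)    # [60, 60, 60]
```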
7.4 Window performance model
Per distinct PARTITION BY value, the engine:
- Shuffles rows by partition key.
- Sorts within partition by ORDER BY.
- Walks the partition, applying the frame.
Cost = shuffle + sort + O(partition_size × frame_size).
Optimization tips:
- Always PARTITION BY — otherwise one partition = entire dataset = one task.
- Match the partitioning to existing clustering to avoid a shuffle.
- ROW_NUMBER() <= k filter: engines can short-circuit to top-k per partition (avoids a full sort). In Spark this is a dedicated optimizer rule.
7.5 Running totals pattern
SELECT
user_id, event_ts, amount,
SUM(amount) OVER (PARTITION BY user_id ORDER BY event_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
AVG(amount) OVER (PARTITION BY user_id ORDER BY event_ts
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS trailing_7_avg
FROM fact_charges;
7.6 First/last value pattern
SELECT
user_id, session_id, event_ts,
FIRST_VALUE(event_type) OVER (PARTITION BY session_id ORDER BY event_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_event,
LAST_VALUE(event_type) OVER (PARTITION BY session_id ORDER BY event_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_event
FROM session_events;
Explicit frame ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is critical — the default for LAST_VALUE is "current row" and will return the current row's value, not the last.
8. Gaps and islands — the four canonical variants
A classic family of problems. You have a sequence; find stretches where a property is true or identify "runs" of consecutive values.
8.1 Variant 1: consecutive identical values
-- Find streaks of same status per user
WITH marked AS (
SELECT user_id, event_ts, status,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts) AS rn_all,
ROW_NUMBER() OVER (PARTITION BY user_id, status ORDER BY event_ts) AS rn_status
FROM fact_events
)
SELECT user_id, status,
MIN(event_ts) AS streak_start,
MAX(event_ts) AS streak_end,
COUNT(*) AS streak_len
FROM marked
GROUP BY user_id, status, (rn_all - rn_status);
Why it works: within a streak of same status, rn_all - rn_status is constant. It changes when status changes.
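A runnable check of the rn_all - rn_status trick using Python's bundled SQLite and toy status events:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_events (user_id INTEGER, event_ts INTEGER, status TEXT);
INSERT INTO fact_events VALUES
  (1, 1, 'up'), (1, 2, 'up'), (1, 3, 'down'), (1, 4, 'up'), (1, 5, 'up');
""")

streaks = conn.execute("""
WITH marked AS (
  SELECT user_id, event_ts, status,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts) AS rn_all,
         ROW_NUMBER() OVER (PARTITION BY user_id, status ORDER BY event_ts) AS rn_status
  FROM fact_events
)
SELECT status, MIN(event_ts), MAX(event_ts), COUNT(*)
FROM marked
GROUP BY user_id, status, (rn_all - rn_status)
ORDER BY MIN(event_ts)
""").fetchall()

# Two 'up' streaks separated by one 'down' event:
print(streaks)  # [('up', 1, 2, 2), ('down', 3, 3, 1), ('up', 4, 5, 2)]
```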
8.2 Variant 2: consecutive integers (gap detection)
-- Find missing invoice numbers
SELECT invoice_num + 1 AS gap_start,
next_num - 1 AS gap_end
FROM (
SELECT invoice_num,
LEAD(invoice_num) OVER (ORDER BY invoice_num) AS next_num
FROM invoices
) t
WHERE next_num - invoice_num > 1;
8.3 Variant 3: date gaps (missing daily data)
WITH calendar AS (
SELECT generate_series(DATE '2026-01-01', DATE '2026-04-15', INTERVAL '1 day')::date AS d
)
SELECT d FROM calendar
LEFT JOIN daily_metrics m ON m.metric_date = calendar.d
WHERE m.metric_date IS NULL;
8.4 Variant 4: consecutive timestamp "islands" (session gaps)
See Sessionization below — this is the streaming sessionization pattern in SQL.
9. Sessionization without windows (Flink-style in SQL)
Definition: a session is a run of events from the same user with no gap larger than X minutes.
WITH events AS (
SELECT user_id, event_ts
FROM fact_events
WHERE event_date = '2026-04-15'
),
with_prev AS (
SELECT user_id, event_ts,
LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_ts
FROM events
),
flagged AS (
SELECT user_id, event_ts,
CASE WHEN prev_ts IS NULL OR event_ts - prev_ts > INTERVAL '30 minutes'
THEN 1 ELSE 0 END AS session_start
FROM with_prev
),
numbered AS (
SELECT user_id, event_ts,
SUM(session_start) OVER (PARTITION BY user_id ORDER BY event_ts) AS session_num
FROM flagged
)
SELECT user_id, session_num,
MIN(event_ts) AS session_start_ts,
MAX(event_ts) AS session_end_ts,
COUNT(*) AS event_count,
EXTRACT(EPOCH FROM MAX(event_ts) - MIN(event_ts)) AS duration_seconds
FROM numbered
GROUP BY user_id, session_num;
Why it works:
- LAG finds the prior event's timestamp within each user's timeline.
- Flag a new session whenever the gap > 30m.
- SUM(session_start) cumulatively counts session starts, giving each event its session number.
- GROUP BY (user_id, session_num) aggregates each session.
This is the exact pattern Flink's session window implements in its state machine — only here it's a batch SQL translation.
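The same state machine fits in a dozen lines of plain Python, which is a useful way to sanity-check the SQL. Timestamps are hypothetical epoch seconds; 1800 s = the 30-minute gap:

```python
def sessionize(timestamps, gap_seconds=1800):
    """Split one user's event timestamps into sessions by inactivity gap."""
    sessions = []
    current = []
    for ts in sorted(timestamps):
        # New session whenever the gap to the previous event exceeds the threshold.
        if current and ts - current[-1] > gap_seconds:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

events = [0, 60, 120, 5000, 5100, 99999]
print(sessionize(events))  # [[0, 60, 120], [5000, 5100], [99999]]
```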
10. As-of joins / point-in-time joins
"What was user X's subscription tier at the moment they made this play?"
The classic feature-store / temporal question. The dimension has history (SCD2). The fact has timestamps. You want the dimension row valid at each fact timestamp.
10.1 Approach 1: BETWEEN join (supported natively in many engines)
SELECT p.*, d.plan_name, d.plan_price
FROM fact_playback p
JOIN dim_subscription_history d
ON p.user_id = d.user_id
AND p.event_ts >= d.valid_from
AND p.event_ts < d.valid_to;
Works, but the BETWEEN join is effectively a range join — default hash/equi joins don't handle it. Spark uses BNLJ; Postgres uses merge join on (user_id, valid_from) if you have the right index.
10.2 Approach 2: LATERAL correlated subquery
SELECT p.*, d.plan_name
FROM fact_playback p,
LATERAL (
SELECT plan_name
FROM dim_subscription_history d
WHERE d.user_id = p.user_id
AND d.valid_from <= p.event_ts
ORDER BY d.valid_from DESC
LIMIT 1
) d;
Per fact row: lookup the most recent dim row that started before the fact. Works great with a (user_id, valid_from) index.
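The per-row lookup is just a binary search over valid_from. A minimal Python sketch of the same semantics (the history tuples are hypothetical):

```python
import bisect

def asof_lookup(history, event_ts):
    """history: list of (valid_from, value) sorted by valid_from.
    Returns the value of the most recent row with valid_from <= event_ts."""
    starts = [valid_from for valid_from, _ in history]
    i = bisect.bisect_right(starts, event_ts) - 1
    return history[i][1] if i >= 0 else None

subs = [(0, "free"), (100, "basic"), (500, "premium")]
print(asof_lookup(subs, 99))    # free
print(asof_lookup(subs, 100))   # basic
print(asof_lookup(subs, 9999))  # premium
```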
10.3 Approach 3: Spark-native AS OF joins (PySpark)
# No native ASOF JOIN in Spark SQL; the pandas-on-Spark API exposes merge_asof
# (assumes plays_ps and subs_ps are pandas-on-Spark DataFrames):
import pyspark.pandas as ps
result = ps.merge_asof(plays_ps, subs_ps, left_on="event_ts", right_on="valid_from", by="user_id")
Native asof support is limited in Spark SQL; most teams express the BETWEEN join and rely on AQE + good clustering. Alternative: kdb+ / ClickHouse / DuckDB have real ASOF JOIN syntax.
-- DuckDB / ClickHouse ASOF JOIN
SELECT p.*, d.plan_name
FROM fact_playback p
ASOF LEFT JOIN dim_subscription_history d
ON p.user_id = d.user_id AND p.event_ts >= d.valid_from;
10.4 Approach 4: "expand-and-join" (batch feature stores)
Produce a denormalized table with (user_id, valid_from, valid_to, plan_name). Then INNER JOIN on user_id and BETWEEN. Works. Costs storage. Used at scale by Uber's Michelangelo, others.
11. Sketches: HyperLogLog, Theta, Bloom, t-digest
When exact answers cost too much, sketches trade bounded error for massive cost reduction.
11.1 HyperLogLog (HLL) — approximate distinct count
Core idea: hash each element to a bit string; track the maximum number of leading zeros observed per bucket; rare patterns imply large cardinality.
Error: ~1.04 / √m where m = number of buckets. Typical m = 16384 (2^14) → 0.8% error.
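To make the mechanism concrete, here is a toy HyperLogLog in Python — a from-scratch sketch for intuition, not any library's implementation. The bias constant and linear-counting correction are the standard ones from the HLL literature:

```python
import hashlib
import math

def hll_estimate(items, p=12):
    """Toy HyperLogLog: p index bits -> m = 2^p registers."""
    m = 1 << p
    regs = [0] * m
    for item in items:
        # 64-bit hash of the item.
        h = int.from_bytes(
            hashlib.blake2b(str(item).encode(), digest_size=8).digest(), "big")
        idx = h >> (64 - p)                      # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)         # remaining bits
        rank = (64 - p) - rest.bit_length() + 1  # leading zeros + 1
        regs[idx] = max(regs[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)             # bias constant for m >= 128
    est = alpha * m * m / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if est <= 2.5 * m and zeros:                 # linear-counting correction for small n
        est = m * math.log(m / zeros)
    return est

est = hll_estimate(range(100_000))
print(round(est))  # within a few percent of 100,000 using 4096 registers
```

Merging two sketches is element-wise max over the register arrays, which is why daily sketches can be rolled up into monthly distinct counts.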
-- Snowflake
SELECT HLL(user_id) FROM fact_plays; -- approx distinct
SELECT HLL_ACCUMULATE(user_id) FROM fact_plays; -- serializable state
SELECT HLL_COMBINE(state) FROM (SELECT HLL_ACCUMULATE(user_id) AS state FROM ...);
SELECT HLL_ESTIMATE(HLL_COMBINE(state)) FROM ...; -- final cardinality
-- BigQuery
SELECT APPROX_COUNT_DISTINCT(user_id) FROM fact_plays;
-- Presto/Trino
SELECT approx_distinct(user_id) FROM fact_plays;
11.2 Theta sketch — distinct with set operations
HLL estimates |A|. Theta estimates |A|, |A ∪ B|, |A ∩ B|, |A \ B|. Critical for "users who watched X AND Y" without recomputing from raw data.
Supported in Snowflake (APPROX_COUNT_DISTINCT uses HLL; Theta available via Java UDF or Datasketches).
11.3 Bloom filters — set membership
Covered under indexes. Also stored as materialized columns for fast "is this key present" tests.
11.4 t-digest — approximate quantiles
For latency percentiles over streams. Compact (~5KB for 1% error at p99). Mergeable.
-- Snowflake
SELECT APPROX_PERCENTILE(latency_ms, 0.99) FROM fact_requests;
-- Trino
SELECT approx_percentile(latency_ms, 0.99) FROM fact_requests;
11.5 When to reach for sketches
- Daily/hourly metrics where re-scanning billions of rows is wasteful.
- Feature store cardinality features.
- Cross-partition unions (streaming + batch).
Rule: cost reduction is typically 100× for 1% error.
12. Anti-patterns that kill query plans
12.1 Function on the indexed column
-- Bad: index on event_ts not used
SELECT * FROM fact_plays WHERE DATE(event_ts) = '2026-04-15';
-- Good:
SELECT * FROM fact_plays
WHERE event_ts >= '2026-04-15' AND event_ts < '2026-04-16';
12.2 Implicit casts
-- Bad: user_id is BIGINT, '1234' is VARCHAR → cast every row
SELECT * FROM fact_plays WHERE user_id = '1234';
-- Good:
SELECT * FROM fact_plays WHERE user_id = 1234;
12.3 SELECT *
In columnar warehouses, reads every column → often 10× the I/O. Name columns.
12.4 DISTINCT as a bug fix
If your query needs DISTINCT, you usually have a join multiplication bug. Find the join producing duplicates and fix the join condition.
12.5 OR across tables in a JOIN
-- Bad: optimizer usually can't split this
SELECT * FROM a JOIN b ON a.x = b.x OR a.y = b.y;
-- Good:
SELECT * FROM a JOIN b ON a.x = b.x
UNION
SELECT * FROM a JOIN b ON a.y = b.y;
12.6 NOT IN with nullable subquery
Covered above. Use NOT EXISTS.
12.7 Massive CTEs that don't inline
Postgres ≤ 11: CTEs were optimization fences (always materialized). Use subqueries, or WITH cte AS NOT MATERIALIZED (...) on PG 12+.
12.8 Correlated subquery in SELECT
-- Bad: N × M
SELECT user_id,
(SELECT COUNT(*) FROM sessions s WHERE s.user_id = u.user_id) AS cnt
FROM dim_user u;
-- Good: one GROUP BY
SELECT u.user_id, COALESCE(s.cnt, 0) AS cnt
FROM dim_user u
LEFT JOIN (SELECT user_id, COUNT(*) AS cnt FROM sessions GROUP BY user_id) s
USING (user_id);
13. Query tuning workflow
- Look at the plan. EXPLAIN ANALYZE (Postgres) or the query profile (Snowflake/BigQuery/Spark UI).
- Find the biggest cost node. Usually one scan or one join dominates. 80/20 rule applies.
- Is it the scan?
- Check partition pruning, zone maps, bloom filter usage.
- Can you push a filter further? Check for UDFs and implicit casts.
- Are the right columns read? Remove SELECT *.
- Is it a join?
- What strategy? Is it what you'd pick?
- Estimated vs actual rows — if wildly wrong, stats are stale. Run ANALYZE.
- Is one side broadcast-sized? Hint it.
- Is it skewed? Salt, two-stage aggregate, or isolate heavy hitters.
- Is it a sort / window?
- Can you partition differently to avoid one?
- Is there an unnecessary ORDER BY in the final stage?
- Is it a shuffle?
- Can you pre-partition (bucket/cluster)?
- Can it be eliminated by broadcasting a side?
- Is it I/O?
- Compression codec: Snappy for speed, ZSTD for ratio.
- File sizes: consolidate small files; split massive ones.
13.1 "When in doubt, measure" checklist
- SELECT COUNT(*) on base tables to confirm size.
- SELECT COUNT(*), COUNT(DISTINCT key) to confirm cardinality / skew.
- SELECT key, COUNT(*) FROM t GROUP BY 1 ORDER BY 2 DESC LIMIT 10 to find hot keys.
- SHOW TBLPROPERTIES / SHOW CREATE TABLE — partitioning, clustering, properties.
13.2 A 3-minute triage
For "this query is slow" in an interview:
1. How big are the inputs? (SELECT COUNT(*)...)
2. Is it scan-bound or compute-bound? (look at % time in scan vs join)
3. Did partitions prune? (check plan)
4. Is one join the culprit? (find the big one)
5. Is there skew? (check task time distribution)
6. What's the smallest change that fixes the biggest cost?
That sequence solves 80% of production SQL problems.
15. CTE Materialization — Inlined vs Cached
Common Table Expressions look syntactically identical across engines but behave very differently under the hood. Senior candidates know which engine inlines CTEs (treating them as view substitutions) and which materializes them (computing once, reading N times).
Engine-by-engine behavior
| Engine | Default behavior | Override |
|---|---|---|
| Postgres ≤ 11 | Always materialized (optimization barrier) | Upgrade to 12+ |
| Postgres ≥ 12 | Inlined if referenced once, materialized otherwise | MATERIALIZED / NOT MATERIALIZED keyword |
| Snowflake | Inlined always | Use TEMP TABLE to force materialization |
| BigQuery | Inlined always | Use CREATE TEMP TABLE |
| Spark SQL | Inlined | CACHE TABLE on a subquery |
The performance trap
A CTE referenced twice in an inlined engine is computed twice. If the CTE does a 10-billion-row scan, you're doing 20 billion rows' worth of work. For anything expensive referenced more than once, materialize explicitly.
-- Bad: scans 20B rows
WITH heavy AS (SELECT ... FROM fact_10b_rows WHERE ...)
SELECT ... FROM heavy h1 JOIN heavy h2 ON ...;
-- Good: scans 10B rows once, joins on materialized result
CREATE TEMP TABLE heavy AS SELECT ... FROM fact_10b_rows WHERE ...;
SELECT ... FROM heavy h1 JOIN heavy h2 ON ...;
16. Approximate Aggregations — HLL, t-Digest, CMS
Exact COUNT(DISTINCT) on a billion rows requires shuffling every row. Approximate counterparts (HyperLogLog, t-digest, Count-Min Sketch) give 99% accuracy for 0.01% of the cost. Senior candidates know what each sketch does and when to reach for it.
HyperLogLog — cardinality
APPROX_COUNT_DISTINCT(). Uses ~16 KB of state per distinct-count regardless of input size. Answer within ~2% of exact. Mergeable: HLL sketches from separate shards combine into a single sketch; you can pre-aggregate into daily sketches and then combine across a month for monthly distinct counts without rereading raw data.
T-digest / KLL — quantiles
APPROX_PERCENTILE(). Stores a compressed histogram of value distributions. Supports arbitrary percentiles (p50, p95, p99, p999) from the same sketch. Use for latency dashboards, revenue distributions, any quantile metric on high-cardinality numeric data.
Count-Min Sketch — frequency
Given a stream of events, estimate the frequency of any specific item. Supports "heavy hitters" queries — who are the top-K most-frequent items — cheaply. Less common in SQL engines than HLL but worth naming in interviews for top-K at scale.
When NOT to use sketches
- Compliance-sensitive reporting (finance, regulatory). "Approximate" is never the right word on an audit trail.
- Small datasets under a few million rows — the exact version runs in seconds; approximations add complexity for no gain.
- Critical correctness where a 2% error would change a decision (e.g., fraud thresholds that depend on exact distinct counts).
17. Materialized Views and Incremental Refresh
A materialized view is a cached query result. The interesting question is how it stays fresh. Three refresh strategies, each with its own correctness and cost profile.
Full refresh
Recompute the view from scratch on each refresh. Simple, always correct. Cost scales with base-table size. Use when the view is small, refresh is infrequent, or the base table changes too unpredictably for incremental logic.
Incremental refresh
Apply only the delta since the last refresh. Requires the engine to reason about which base-table rows feed which view rows — a non-trivial analysis that engines support for restricted query shapes only (single table, or simple aggregations over joins with specific key structure).
- Snowflake: "Dynamic Tables" support incremental refresh on many join patterns.
- BigQuery: "Materialized views" support incremental for aggregations and filters over a single base table.
- Postgres: native MATERIALIZED VIEW only supports full refresh; incremental requires third-party tools (pg_ivm) or manual triggers.
Lakehouse MERGE-based patterns
In Iceberg / Delta, the pattern is: compute the delta in a staging table, MERGE INTO the materialized output. The engine doesn't call it a materialized view, but operationally it is one. Write your own refresh job and schedule it in the orchestrator.
18. Query Hints — Per-Engine Grammar
Query hints override the optimizer. They are a sharp tool; use them only when you've measured that the optimizer is wrong and can't be fixed upstream. Major engines:
Snowflake
Snowflake exposes few hints — the philosophy is "trust the optimizer." What exists: USE_CACHED_RESULT=FALSE, query tags, and clustering hints via CLUSTER BY at table DDL time. The right lever in Snowflake is often reshaping the query or the clustering key, not a hint.
BigQuery
Also minimal: BigQuery exposes essentially no optimizer hints. For join-order control you often rewrite the query; inline views or CTEs with explicit ordering work where hints don't.
Spark SQL
Rich hint grammar: /*+ BROADCAST(t1) */, /*+ MERGE(t1, t2) */, /*+ SHUFFLE_HASH(t1) */, /*+ REPARTITION(n) */; a SKEW hint exists on Databricks' runtime. The BROADCAST hint is the most common: force broadcast of a dim table the optimizer refused to broadcast because its statistics were stale.
Postgres
No hints by default — the philosophy is "hints lie over time as data distributions change." The extension pg_hint_plan adds them for those who insist. Preferred tools in Postgres: ANALYZE, VACUUM, index creation, and EXPLAIN (ANALYZE, BUFFERS).
The meta-lesson
Reaching for a hint is a smell. Before hinting, confirm: (a) statistics are up-to-date, (b) the query is written in a shape the optimizer can reason about, (c) you've run EXPLAIN ANALYZE and understand the plan. Interviewers are suspicious of candidates who volunteer hints as a first-resort tuning move.
19. Window Function Choice — Examples by Intent
Pick the wrong window function and the query is subtly wrong in ways the optimizer won't flag. The table below maps the business intent to the correct function, with the most common wrong pick for each.
| You want to… | Use | Common wrong pick | Why it matters |
|---|---|---|---|
| Rank with strict ordering, break ties arbitrarily | ROW_NUMBER() | RANK() | ROW_NUMBER gives exactly N rows for top-N; RANK may return more than N on ties |
| Rank with ties, preserving count | DENSE_RANK() | RANK() | RANK leaves gaps (1,1,3); DENSE_RANK doesn't (1,1,2) — pick by whether gaps matter |
| Top-N with "include all tied at Nth" | DENSE_RANK() | ROW_NUMBER() | ROW_NUMBER arbitrarily picks one of the ties |
| Previous row's value per group | LAG(col) | self-join | LAG is one pass; self-join is N² in the worst case |
| Next row's value per group | LEAD(col) | self-join | Same reason |
| Running total | SUM(col) OVER (ORDER BY …) | correlated subquery | Correlated subquery is O(N²); windowed sum is O(N) |
| Moving 7-day sum | SUM(col) OVER (ORDER BY d ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) | RANGE frame | ROWS counts rows; RANGE counts values — with date gaps, RANGE gives different results |
| Moving 7-calendar-day sum (including empty days) | Generate a dense date grid first, then windowed sum | naive ROWS frame | Sparse data breaks ROW-based windows silently |
| First/last row per group, ordered | FIRST_VALUE() / LAST_VALUE() with explicit frame | no frame | Default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW — LAST_VALUE returns the current row, not the group's last row |
| Percentile within group | PERCENT_RANK() or NTILE(100) | ROW_NUMBER() / COUNT(*) | Manual division gives wrong semantics for ties |
| Session number per user (gaps-and-islands) | SUM(new_session_flag) OVER (PARTITION BY user ORDER BY ts) | recursive CTE | Recursive is correct but 10–100x slower |
| "Nth most recent" per group | ROW_NUMBER() OVER (PARTITION BY grp ORDER BY ts DESC) + filter = N | ORDER BY ts DESC LIMIT N | LIMIT applies to the entire result; won't give N-per-group |
| Cumulative distinct count | HyperLogLog sketch + windowed merge | COUNT(DISTINCT col) OVER (…) | Most engines don't support window COUNT(DISTINCT); performance is terrible where they do |
| Lagged value N rows back | LAG(col, N) | LAG(LAG(LAG(col))) | LAG takes an offset argument; no need to nest |
| Percent of total per group | col / SUM(col) OVER (PARTITION BY grp) | subquery grouping | Windowed is one pass; subquery-grouping is two scans |
The frame-clause trap
The three functions that surprise people most are LAST_VALUE, SUM OVER ORDER BY, and FIRST_VALUE. Their default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which is almost never what the user intended for LAST_VALUE. Always write the frame clause explicitly when using these:
-- Wrong: returns the CURRENT row's value, not the group's last
LAST_VALUE(col) OVER (PARTITION BY g ORDER BY ts)
-- Right: explicitly extends the frame to the group end
LAST_VALUE(col) OVER (
PARTITION BY g ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
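The LAST_VALUE trap reproduces directly in Python's bundled SQLite (window functions need SQLite 3.25+). Toy table and values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (g TEXT, ts INTEGER, col TEXT);
INSERT INTO t VALUES ('a', 1, 'x'), ('a', 2, 'y'), ('a', 3, 'z');
""")

# Default frame (RANGE ... CURRENT ROW): LAST_VALUE is just the current row.
wrong = conn.execute(
    "SELECT col, LAST_VALUE(col) OVER (PARTITION BY g ORDER BY ts) "
    "FROM t ORDER BY ts").fetchall()

# Explicit frame to the partition end: LAST_VALUE is the true last row.
right = conn.execute(
    "SELECT col, LAST_VALUE(col) OVER (PARTITION BY g ORDER BY ts "
    "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) "
    "FROM t ORDER BY ts").fetchall()

print(wrong)  # [('x', 'x'), ('y', 'y'), ('z', 'z')]
print(right)  # [('x', 'z'), ('y', 'z'), ('z', 'z')]
```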
Python for Data Engineering
"Python is the glue, the prototype, and increasingly the runtime. Knowing where its abstractions leak — GIL, GC, copy-vs-view, the Arrow boundary — is how you stop writing slow data code."
This chapter is what a senior data engineer actually needs from Python: GIL semantics down to bytecode, the real differences between Pandas / Polars / PySpark with the math, the Arrow boundary, generators and async that don't crash production, testing strategies that scale, and packaging for distributed runtimes.
Contents
- The Python execution model that matters for DE
- The GIL — what it actually locks
- Memory: refcounting, GC, and the bytes you don't see
- Pandas vs Polars vs PySpark — the trade math
- The Arrow boundary: zero-copy between worlds
- Generators, iterators, and chunked I/O
- Async/await for I/O-bound DE work
- Multiprocessing patterns that work in production
- Type hints, dataclasses, and runtime data validation
- Testing strategy for data pipelines
- Packaging and deploying to Spark / Lambda / Airflow
- Performance debugging toolkit
1. The Python execution model that matters for DE
Three layers, each with its own quirks:
- The interpreter (CPython): source → bytecode → executed by the eval loop.
- Reference counting + cyclic GC: every object has a refcount; cycles collected by a generational GC.
- C extension layer: NumPy, Pandas, PyArrow, etc. — most heavy lifting happens in C and releases the GIL.
You won't write fast Python by writing more Python. You write fast Python by delegating to C/Rust extensions (NumPy, Pandas, Arrow, Polars) and arranging your code so the interpreter loop runs as little as possible.
Quick measurement:
import dis
def add(a, b):
return a + b
dis.dis(add)
# 2 0 RESUME 0
# 3 2 LOAD_FAST 0 (a)
# 4 LOAD_FAST 1 (b)
# 6 BINARY_OP 0 (+)
# 10 RETURN_VALUE
Each bytecode op is dispatched by the eval loop. Python 3.11+ added the specializing adaptive interpreter (PEP 659) — same bytecode, but specialized at runtime for observed types (e.g. BINARY_OP_ADD_INT). Real measurable speedup (10–60% for loops). Python 3.12/3.13 push this further.
2. The GIL — what it actually locks
The Global Interpreter Lock is a mutex that guards CPython's interpreter state — primarily refcount manipulation. Only one thread can hold it; only the holder executes Python bytecode at any instant.
What this means in practice:
- CPU-bound pure-Python multithreading: no speedup. Threads serialize on the GIL.
- I/O-bound multithreading: real speedup. Sockets / file I/O release the GIL while blocking.
- C extensions: depends. NumPy, Pandas, scikit-learn frequently release the GIL during heavy compute.
- AsyncIO: single-threaded — it never bumps into the GIL because it doesn't use threads for concurrency.
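A quick way to feel the I/O-bound case: blocking calls that release the GIL (stubbed here with time.sleep) overlap across threads, so ten 0.2-second waits finish in about 0.2 s, not 2 s:

```python
import threading
import time

def io_task():
    time.sleep(0.2)  # C-level sleep: releases the GIL while blocked

start = time.perf_counter()
threads = [threading.Thread(target=io_task) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"{elapsed:.2f}s")  # ~0.2s: the sleeps overlap instead of serializing
```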
2.1 Switching interval
The GIL is released periodically:
import sys
sys.getswitchinterval() # 0.005 (5 ms in 3.x)
The currently-running thread is told to release the GIL after this many seconds (cooperatively, at the next bytecode boundary). Other threads can then contend.
2.2 What happens during time.sleep()
import threading, time
def worker():
time.sleep(1) # releases GIL while sleeping
print("done")
threading.Thread(target=worker).start()
time.sleep is a C function that releases the GIL during the sleep. So you can have 10000 threads sleeping; the GIL doesn't matter.
2.3 What happens during pandas df.merge()
The merge is in C. It releases the GIL for the duration. Other Python threads can run. In practice, pandas + threads can scale on multicore for large numerical operations. Caveat: object-dtype columns (strings in Pandas 1.x) need the GIL for hash table building, and won't scale.
2.4 The free-threaded build (PEP 703 — Python 3.13+)
Python 3.13 ships an experimental --disable-gil build. Reference counting becomes biased + atomic; per-object locks replace the GIL. Real concurrency for pure Python. Performance overhead ~10–30% single-threaded.
For DE work in 2026: continue assuming the GIL exists in production. Free-threaded adoption is on the horizon but not yet default.
2.5 The shortcut: concurrent.futures
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
# I/O-bound: thread pool fine, GIL not contended
with ThreadPoolExecutor(max_workers=32) as ex:
results = list(ex.map(fetch_url, urls))
# CPU-bound: process pool to bypass GIL
with ProcessPoolExecutor(max_workers=8) as ex:
results = list(ex.map(crunch_numbers, datasets))
3. Memory: refcounting, GC, and the bytes you don't see
Every Python object has at minimum a header (~16 bytes on CPython 3.12) + type pointer + refcount. A trivial int is 28 bytes. A 4-character string is 53 bytes. A 1-element list is 88 bytes.
This is why a Pandas DataFrame with 10M rows × 1 string column can use 2 GB of memory in object dtype but 80 MB as Arrow-backed strings.
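You can verify those header costs directly with sys.getsizeof. Exact numbers vary slightly by CPython version and build; these comments reflect a typical 64-bit CPython 3.11/3.12:

```python
import sys

print(sys.getsizeof(1))        # 28: object header + type pointer + one digit
print(sys.getsizeof("abcd"))   # 53: compact-ASCII header + 4 chars + NUL
print(sys.getsizeof([]))       # 56: empty list header

lst = []
lst.append(1)
print(sys.getsizeof(lst))      # 88: append over-allocates spare capacity
```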
3.1 Refcounting
import sys
x = [1, 2, 3]
sys.getrefcount(x) # 2 (variable + getrefcount's local arg)
When refcount hits zero → object freed immediately. Predictable, but cycles aren't collected this way.
3.2 Generational GC
For cycles. Three generations (0, 1, 2). New objects in gen 0; survivors promoted. GC scans gen 0 frequently, gen 2 rarely.
import gc
gc.set_threshold(700, 10, 10) # collect gen0 every 700 allocations, etc.
gc.disable() # turn off cyclic GC entirely
Disabling can speed up batch jobs that allocate massively but never form cycles (rare; verify first).
3.3 Memory profiling
import tracemalloc
tracemalloc.start()
# ... your code
snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics('lineno')
for stat in top[:10]:
print(stat)
Or for live diagnosis:
pip install py-spy
py-spy top --pid <PID>
py-spy dump --pid <PID>
py-spy is a Rust tool that profiles a running Python process without modifying it (it samples the interpreter's stack by reading process memory from outside). Critical for Spark driver issues.
3.4 The __slots__ lever
By default, instance attributes live in a per-instance __dict__ (~250 bytes). __slots__ declares a fixed attribute set, replacing the dict with a fixed-layout struct.
class Point:
__slots__ = ('x', 'y')
def __init__(self, x, y):
self.x, self.y = x, y
Memory drop: ~50%. Useful when you're holding 10M small objects (rare in modern DE; you'd use Arrow). But know it exists.
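A quick check that the per-instance dict actually disappears. Sizes are CPython-version dependent, so the comparison only relies on the slotted instance being smaller than instance-plus-dict:

```python
import sys

class PlainPoint:
    def __init__(self, x, y):
        self.x, self.y = x, y

class SlotPoint:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x, self.y = x, y

p = PlainPoint(1, 2)
s = SlotPoint(1, 2)

print(hasattr(p, "__dict__"))  # True: attributes live in a dict
print(hasattr(s, "__dict__"))  # False: attributes live in fixed slots
plain_bytes = sys.getsizeof(p) + sys.getsizeof(p.__dict__)
print(sys.getsizeof(s) < plain_bytes)  # True
```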
4. Pandas vs Polars vs PySpark — the trade math
The default question in modern DE: which DataFrame library?
4.1 The honest comparison
| Dimension | Pandas | Polars | PySpark |
|---|---|---|---|
| Backing engine | NumPy + (1.x objects, 2.x optionally Arrow) | Rust + Arrow | JVM + Arrow (via wrapper) |
| Execution | Eager | Lazy + eager | Lazy, distributed |
| Threading | Mostly single-threaded | Multi-core by default (Rayon) | Distributed across cluster |
| Memory model | In-memory only | In-memory + streaming engine (1.0+) | Distributed, spillable |
| Sweet spot | < 10 GB on a laptop, single-threaded analytics | 10–500 GB on one beefy machine | TB+, distributed |
| API surface | The biggest, most documented | Smaller but growing fast | Spark SQL ecosystem |
| String/object-heavy | Slow (object dtype) | Fast (Arrow strings) | OK |
| Window functions | OK | Fast | Yes, distributed |
| Joins | OK | Excellent (parallel) | Distributed |
| User-defined funcs | Slow (per-row) | Fast (Rust) or slow (Python) | Pandas UDF for batches |
4.2 Speed math (hand-wavy benchmarks; orders of magnitude)
For a 10 GB CSV → group by → write Parquet:
- Pandas: 5–10 minutes on a laptop, only if it fits in RAM (it won't; you're spilling to swap).
- Polars (lazy + streaming): 20–60 seconds on the same laptop. It streams.
- PySpark local mode: 1–3 minutes. JVM startup overhead, less efficient single-node.
- PySpark on a 10-node cluster: 30 seconds. Now distributed-friendly.
Rule of thumb: if it fits in RAM, Polars wins. If it doesn't, PySpark wins.
4.3 Pandas-specific traps
# View vs copy (the source of "SettingWithCopyWarning")
df2 = df[df['x'] > 0] # might be a view, might be a copy — depends on memory layout
df2['y'] = ... # whiplash: changes might or might not propagate to df
# Fix:
df2 = df[df['x'] > 0].copy()
df2['y'] = ...
# Or use .loc[]
df.loc[df['x'] > 0, 'y'] = ...
Pandas 2.0 introduced Copy-on-Write mode (pd.set_option("mode.copy_on_write", True)) which makes this deterministic. Turn it on for new code.
4.4 Polars-specific patterns
import polars as pl
# Lazy: build a plan, optimize, execute once
df = (pl.scan_parquet("s3://bucket/path/*.parquet")
.filter(pl.col("event_date") == "2026-04-15")
.group_by("country")
.agg(pl.col("watch_ms").sum().alias("total_ms"))
.sort("total_ms", descending=True)
.head(10))
result = df.collect(streaming=True) # streaming engine for out-of-core
Polars optimizes the plan (predicate push-down into the Parquet scan, projection push-down, common subexpression elimination). Runs Rust-multithreaded.
4.5 PySpark-specific patterns
See chapter 04. Key principle: never iterate row-by-row in Python over a Spark DataFrame. Use:
- Built-in functions (Catalyst native).
- Pandas UDFs (Arrow batches).
- .toPandas() only for small results.
5. The Arrow boundary: zero-copy between worlds
Apache Arrow is a columnar in-memory format with a stable C ABI. Anything that speaks Arrow can hand a buffer to anything else without copying.
5.1 Why it matters
Pre-Arrow:
Spark JVM ──serialize Java objects → bytes → deserialize Python objects → Pandas DF
Cost: O(rows × columns), with object allocation for every cell. Slow.
Post-Arrow:
Spark JVM ──Arrow IPC buffer (binary)──> Python (pyarrow.Table) ──> Pandas DF (zero copy for numeric, light copy for strings)
Cost: O(buffer bytes), pointer arithmetic.
5.2 The libraries that talk Arrow natively
- PyArrow
- Polars
- Pandas 2.0+ (with dtype_backend='pyarrow')
- DuckDB (in-memory and zero-copy with Pandas)
- Spark (for Pandas UDFs and toPandas())
- DataFusion, Dask, Vaex
- Snowpark
- BigQuery storage API
- Iceberg, Parquet (file formats are Arrow-compatible)
5.3 A real example — Polars and DuckDB sharing memory
import polars as pl
import duckdb
pl_df = pl.read_parquet("data.parquet")
duck_result = duckdb.query("SELECT country, SUM(watch_ms) FROM pl_df GROUP BY 1").pl()
# duck_result is a Polars DF — zero copy back
DuckDB sees the Polars DF directly via Arrow. No serialization in either direction.
5.4 The strings caveat
Arrow stores strings in two buffers: an offsets array + a single character buffer. Pandas 1.x stores strings as Python str objects (one per row). Converting Pandas-string ↔︎ Arrow-string requires materializing or building Python objects → not zero-copy.
Pandas 2.0 with dtype_backend='pyarrow' keeps strings as Arrow → all the zero-copy benefits.
import pandas as pd
df = pd.read_parquet("data.parquet", dtype_backend="pyarrow")
df.dtypes # show pyarrow types
6. Generators, iterators, and chunked I/O
For data that doesn't fit in memory, generators are the idiom.
def chunked_csv(path, chunksize=100_000):
for chunk in pd.read_csv(path, chunksize=chunksize):
yield chunk
for chunk in chunked_csv("big.csv"):
process(chunk) # one chunk in memory at a time

6.1 Generator pipelines
Compose stages with generators. Each stage is lazy.
def read_lines(path):
with open(path) as f:
for line in f:
yield line.rstrip()
def parse(lines):
for line in lines:
yield json.loads(line)
def filter_recent(records, since):
for r in records:
if r["ts"] >= since:
yield r
pipeline = filter_recent(parse(read_lines("events.jsonl")), since=T0)
for record in pipeline:
sink(record)

Memory use: O(1) per stage, regardless of input size.
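The O(1)-per-stage claim is observable: each item flows through every stage before the next is produced. A minimal sketch with a shared log:

```python
def numbers(log):
    for i in range(3):
        log.append(f"produce {i}")
        yield i

def squares(nums, log):
    for n in nums:
        log.append(f"square {n}")
        yield n * n

log = []
results = list(squares(numbers(log), log))
assert results == [0, 1, 4]
# Stages interleave item-by-item — nothing is buffered between them:
assert log == ["produce 0", "square 0",
               "produce 1", "square 1",
               "produce 2", "square 2"]
```

If the stages buffered, all "produce" entries would precede all "square" entries; the interleaving proves only one item is in flight at a time.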
6.2 itertools — the standard toolbelt
import itertools
# batch into groups of N
def batched(iterable, n):
iterator = iter(iterable)
while batch := tuple(itertools.islice(iterator, n)):
yield batch
# Cartesian product
itertools.product([1,2], [3,4]) # (1,3), (1,4), (2,3), (2,4)
# Chain iterables
itertools.chain([1,2], [3,4]) # 1, 2, 3, 4
# Group by adjacent equal keys (for pre-sorted data)
for key, group in itertools.groupby(sorted_records, key=lambda r: r["user_id"]):
process_user(key, list(group))

6.3 Async generators
async def stream_pages(client):
page_token = None
while True:
page = await client.fetch(page_token=page_token)
for item in page.items:
yield item
if not page.next_token:
break
page_token = page.next_token
async def main():
async for item in stream_pages(client):
await process(item)

7. Async/await for I/O-bound DE work
AsyncIO is single-threaded, single-event-loop concurrency. Best for I/O-heavy: HTTP scraping, S3 list/get, database queries.
7.1 The fundamentals
import asyncio
import aiohttp
async def fetch(session, url):
async with session.get(url) as resp:
return await resp.text()
async def main(urls):
async with aiohttp.ClientSession() as session:
tasks = [fetch(session, u) for u in urls]
return await asyncio.gather(*tasks)
results = asyncio.run(main(urls))

Throughput: thousands of concurrent requests on a single core, vs one request at a time when fetching sequentially.
7.2 Throttling
asyncio.gather launches everything at once. Throttle:
async def fetch_all(urls, concurrency=20):
sem = asyncio.Semaphore(concurrency)
async def bounded(u):
async with sem:
return await fetch_one(u)
return await asyncio.gather(*[bounded(u) for u in urls])

7.3 The mistake: mixing sync and async
async def bad():
time.sleep(5) # blocks the entire event loop! No other coroutine runs.
async def good():
await asyncio.sleep(5)

Same trap with requests.get() (use aiohttp or httpx.AsyncClient), boto3 (use aioboto3), psycopg2 (use asyncpg).
7.4 Running blocking code from async
import asyncio
async def main():
loop = asyncio.get_running_loop()
result = await loop.run_in_executor(None, blocking_fn, arg1, arg2)
# default executor is a ThreadPoolExecutor

7.5 When NOT to use async
- CPU-bound work: doesn't help; use processes.
- Code with no I/O concurrency need: just use sync. Async adds complexity.
- Frameworks that don't natively support it (Spark transformations).
8. Multiprocessing patterns that work in production
When the GIL blocks you on CPU-bound work, spawn processes.
8.1 ProcessPoolExecutor
from concurrent.futures import ProcessPoolExecutor
def heavy(item):
# CPU-bound transformation
return compute(item)
with ProcessPoolExecutor(max_workers=8) as ex:
results = list(ex.map(heavy, items, chunksize=100))

chunksize is critical: too small = IPC overhead per item; too large = stragglers. Rule: total_items / (cores × 4) is a good start.
8.2 Shared memory (Python 3.8+)
from multiprocessing.shared_memory import SharedMemory
import numpy as np
# In producer:
shm = SharedMemory(create=True, size=4 * 10_000_000)
arr = np.ndarray((10_000_000,), dtype=np.float32, buffer=shm.buf)
arr[:] = data
print(shm.name) # pass name to worker
# In worker:
shm = SharedMemory(name=name)
arr = np.ndarray((10_000_000,), dtype=np.float32, buffer=shm.buf)
# Read/process without copying

Avoids serialization of large NumPy arrays between processes.
8.3 The fork vs spawn gotcha
On Linux, the default start method is fork — the child process is a near-copy of the parent. Cheap, but it inherits everything (file handles, locks). On macOS, the default switched to spawn in Python 3.8 for safety. The mismatch causes "works locally, broken in production".
import multiprocessing as mp
mp.set_start_method('spawn', force=True) # be explicit

9. Type hints, dataclasses, and runtime data validation
Modern DE code uses types liberally — for IDE support, docs, and validation.
9.1 Dataclasses
from dataclasses import dataclass, field
from datetime import datetime
@dataclass(frozen=True, slots=True)
class PlaybackEvent:
user_id: int
title_id: int
event_ts: datetime
watch_ms: int = 0
metadata: dict = field(default_factory=dict)
ev = PlaybackEvent(user_id=1, title_id=42, event_ts=datetime.now())

frozen=True makes it hashable + immutable. slots=True (3.10+) replaces __dict__ with __slots__, saves memory.
9.2 Pydantic for runtime validation
from pydantic import BaseModel, Field, field_validator
class PlaybackEvent(BaseModel):
user_id: int = Field(gt=0)
title_id: int = Field(gt=0)
event_ts: datetime
watch_ms: int = Field(ge=0, le=24*3600*1000)
@field_validator('event_ts')
@classmethod
def not_future(cls, v):
if v > datetime.utcnow():
raise ValueError("event_ts in future")
return v
# Validate untrusted input:
ev = PlaybackEvent(**raw_dict) # raises if invalid

Pydantic v2 is Rust-powered, fast enough to use in-line for streaming validation. Use in:
- API gateways accepting events.
- DLQ recovery scripts.
- dbt-style data tests outside dbt.
9.3 attrs
The pre-Pydantic ergonomics gold standard. Still excellent for non-validation cases.
9.4 Type-checking pipelines
pip install mypy ruff
mypy src/
ruff check src/

Enforce via pre-commit hooks. Don't merge code that fails mypy.
10. Testing strategy for data pipelines
This is where DE-grade Python differs from web-app Python. You're testing transformations on data that's typically too big to fixture.
10.1 The pyramid
UI / E2E (manual or rare)
┌────────────────────────────────┐
│ Integration: end-to-end DAG │ (a few, slow, run on PRs)
│ on a small fixture │
├────────────────────────────────┤
│ Pipeline-step unit tests │ (many, fast)
│ in-memory PySpark/Polars │
├────────────────────────────────┤
│ Pure function unit tests │ (most, ms)
│ on plain Python │
└────────────────────────────────┘
10.2 Pure-function units
# transform.py
def normalize_country(country_raw: str) -> str:
if country_raw is None:
return "UNK"
return country_raw.strip().upper()[:3]
# test_transform.py
def test_normalize_country_basic():
assert normalize_country("united kingdom") == "UNI"
def test_normalize_country_null():
assert normalize_country(None) == "UNK"

10.3 PySpark unit tests with chispa
import pytest
from pyspark.sql import SparkSession
from chispa import assert_df_equality
from my_pipeline.silver import dedupe_events
@pytest.fixture(scope="session")
def spark():
return SparkSession.builder.master("local[2]").appName("test").getOrCreate()
def test_dedupe_keeps_latest(spark):
input_data = [
("u1", "e1", "2026-01-01T10:00:00", 100),
("u1", "e1", "2026-01-01T10:00:01", 200), # later wins
("u2", "e2", "2026-01-01T10:00:00", 50),
]
cols = ["user_id", "event_id", "event_ts", "value"]
df = spark.createDataFrame(input_data, cols)
actual = dedupe_events(df)
expected = spark.createDataFrame([
("u1", "e1", "2026-01-01T10:00:01", 200),
("u2", "e2", "2026-01-01T10:00:00", 50),
], cols)
assert_df_equality(actual, expected, ignore_row_order=True)

chispa (or pyspark-test) gives readable diffs.
10.4 dbt tests
# models/silver/silver_playback_session.yml
models:
- name: silver_playback_session
columns:
- name: session_id
tests:
- unique
- not_null
- name: user_id
tests:
- not_null
- relationships:
to: ref('dim_user')
field: user_id

10.5 Property-based tests with Hypothesis
from hypothesis import given, strategies as st
@given(st.lists(st.text(min_size=0, max_size=20)))
def test_normalize_country_idempotent(countries):
once = [normalize_country(c) for c in countries]
twice = [normalize_country(c) for c in once]
assert once == twice

Hypothesis generates inputs, including edge cases you didn't think of (Unicode, empty strings, BOM characters).
10.6 Data tests vs code tests
Two distinct things:
- Code tests: does the function logic work? (pytest)
- Data tests: does the data in production today satisfy expectations? (Great Expectations, Soda, dbt tests)
You need both. Code tests prevent regressions; data tests catch upstream changes.
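A data test, in miniature — a hand-rolled check_rows (hypothetical) standing in for a Great Expectations or Soda suite, asserting properties of today's rows rather than of the transformation code:

```python
# Minimal "data test": validate live rows against expectations.
def check_rows(rows):
    failures = []
    for i, r in enumerate(rows):
        if r.get("session_id") is None:
            failures.append((i, "session_id is null"))
        if not (0 <= r.get("watch_ms", -1) <= 24 * 3600 * 1000):
            failures.append((i, "watch_ms out of range"))
    return failures

good = [{"session_id": "s1", "watch_ms": 1000}]
bad = [{"session_id": None, "watch_ms": -5}]
assert check_rows(good) == []
assert len(check_rows(bad)) == 2     # both expectations violated
```

A code test would exercise check_rows itself; a data test runs it against production rows on a schedule and alerts on non-empty output.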
11. Packaging and deploying to Spark / Lambda / Airflow
11.1 Project structure
my-pipeline/
├── pyproject.toml
├── README.md
├── src/
│ └── my_pipeline/
│ ├── __init__.py
│ ├── bronze/
│ ├── silver/
│ ├── gold/
│ └── utils/
├── tests/
└── dags/
└── pipeline_dag.py
pyproject.toml (using uv / hatchling):
[project]
name = "my-pipeline"
version = "0.4.2"
requires-python = ">=3.11"
dependencies = [
"pyspark==3.5.1",
"pyarrow>=15",
"pandas>=2.2",
"pydantic>=2.6",
]
[project.optional-dependencies]
dev = ["pytest", "mypy", "ruff", "chispa", "hypothesis"]
11.2 Distributing to Spark
Spark needs the same Python environment on every executor. Four approaches:
- Bake into the cluster image: simplest for stable deps.
- --py-files for small custom code: ship a .zip or .egg. Limited; doesn't handle native deps.
- PYSPARK_DRIVER_PYTHON + virtualenv archive: ship a packed venv.
- conda-pack / venv-pack + --archives: ship a portable env tarball.
venv-pack -o env.tar.gz
spark-submit \
--archives env.tar.gz#environment \
--conf spark.pyspark.driver.python=./environment/bin/python \
--conf spark.pyspark.python=./environment/bin/python \
app.py

11.3 Airflow
Use PythonOperator only for orchestration logic (calling out to Spark, dbt, etc.). Don't run heavy compute in the Airflow worker.
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def trigger_spark(**ctx):
# call EMR / Databricks / whatever
...
with DAG("playback_pipeline", start_date=datetime(2026,1,1), schedule="@daily") as dag:
bronze = PythonOperator(task_id="bronze", python_callable=trigger_spark)

For testable DAGs:
- Keep the DAG file thin (config + ops).
- Move logic into a library (my_pipeline/) that's unit-tested.
11.4 Lambda / serverless
For event-driven enrichment or fan-out work:
- Package as a Lambda layer (deps separate from code).
- Mind cold-start (Polars/Pandas import time is real).
- SnapStart was Java-only for years; it now extends to Python 3.12+ runtimes — check current support for your runtime.
12. Performance debugging toolkit
12.1 Profiling pure Python
import cProfile, pstats
profiler = cProfile.Profile()
profiler.enable()
do_work()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)

For line-level:
pip install line_profiler
kernprof -lv my_script.py # decorate fns with @profile

12.2 Sampling profiler (no code changes)
pip install py-spy
py-spy record -o profile.svg --pid <PID> # flamegraph
py-spy top --pid <PID> # live top
py-spy dump --pid <PID> # what is each thread doing

Works on running processes — perfect for stuck Spark drivers.
12.3 Memory
pip install memray
memray run app.py
memray flamegraph memray-app.bin

Memray hooks malloc and reports allocation flamegraphs.
12.4 Pandas/Polars performance checklist
- Are you in object dtype where you should be Arrow? .dtypes to check.
- Are you using .apply() row-wise? Replace with vectorized ops or .map() with a dict.
- Are you copying when you don't need to? pd.set_option("mode.copy_on_write", True).
- Are you reading more columns than you need? Use usecols= or column projection.
- Are you reading the whole file when you could chunk? chunksize= or lazy scan with predicate push-down.
For any slow Python data job, time goes to one of:
- I/O wait (network, disk read/write)
- Serialization/deserialization (Python ↔︎ JVM, JSON parsing, pickle)
- Pure-Python loop (the GIL on bytecode)
- C extension compute (NumPy/Pandas/Polars internals)
- GC (rare; check with gc.get_stats())
Profile, classify, fix. Repeat.
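The difference between the "pure-Python loop" and "C extension compute" buckets is easy to feel with a stdlib micro-benchmark (illustrative, not rigorous):

```python
import time

data = list(range(1_000_000))

t0 = time.perf_counter()
total_py = 0
for x in data:            # pure-Python loop: one bytecode dispatch per element
    total_py += x
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
total_c = sum(data)       # the same loop, run inside the interpreter's C code
t_builtin = time.perf_counter() - t0

assert total_py == total_c == 499_999_500_000
# On CPython the builtin is typically several times faster; exact ratios vary
# by machine — which is why you profile rather than guess.
```

Same answer, wildly different cost: the classification step is about figuring out which of these two regimes (or I/O, or serde) your job is actually in.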
Closing principle
The best DE Python is mostly not Python — it's NumPy/Arrow/Pandas/Polars/Spark calls with Python orchestration. Your job is to make the data move from one fast engine to another with minimum overhead. Get that right and everything else gets easier.
15. AsyncIO for IO-Bound Pipelines
A data engineer writes plenty of ETL that is bottlenecked on HTTP calls, S3 listings, BigQuery API pagination, and Kafka producer acks — not CPU. For that shape of work, asyncio scales far beyond what threads can handle and is far cheaper than processes.
The mental model
A single thread runs an event loop. Each coroutine yields control at await points. While one coroutine waits on a network socket, the loop runs other coroutines. You get concurrency without threads, without GIL concerns, and with far lower per-task overhead (~1 KB per coroutine vs ~1 MB per thread).
Worked example — S3 key inventory
import asyncio, aioboto3
async def list_prefix(sess, bucket, prefix):
async with sess.client('s3') as s3:
keys = []
paginator = s3.get_paginator('list_objects_v2')
async for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
keys.extend(o['Key'] for o in page.get('Contents', []))
return keys
async def inventory(buckets, prefixes):
sess = aioboto3.Session()
tasks = [list_prefix(sess, b, p) for b in buckets for p in prefixes]
return await asyncio.gather(*tasks)
# Sequential: N*M listings × ~80ms each = minutes
# asyncio: all prefixes in flight at once — wall time ≈ the slowest listing (until S3 throttles)
results = asyncio.run(inventory(['b1','b2'], ['p1','p2','p3']))
Pitfalls
- Don't call synchronous libraries from async code. A blocking requests.get() stalls the entire event loop. Use the async equivalent (aiohttp, httpx).
- Bound concurrency with asyncio.Semaphore. Launching 10,000 coroutines at once will get you rate-limited by the target API and blow up your memory.
- Async is not faster for CPU-bound work. A coroutine doing SHA-256 holds the loop hostage until it finishes. CPU-bound needs multiprocessing or a native library.
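The Semaphore pitfall is easy to verify end-to-end with stdlib only — asyncio.sleep stands in for the network call, and fetch_one/fetch_all are hypothetical names:

```python
import asyncio

async def fetch_one(i, sem, state):
    async with sem:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)        # stand-in for a network call
        state["active"] -= 1
        return i * 2

async def fetch_all(items, concurrency=3):
    sem = asyncio.Semaphore(concurrency)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(*[fetch_one(i, sem, state) for i in items])
    return results, state["peak"]

results, peak = asyncio.run(fetch_all(range(10)))
assert results == [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]  # gather preserves order
assert peak <= 3                         # the semaphore capped in-flight work
```

All ten coroutines are created up front, but the semaphore guarantees at most three are ever inside the "request" at once — the shape you want against a rate-limited API.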
16. DuckDB, Polars, and Arrow — The Zero-Copy Trinity
The last three years have reshaped single-node analytics. Three tools dominate, and they share a common substrate: Apache Arrow columnar format. Senior candidates understand how they compose.
Arrow — the interchange format
Columnar, in-memory, language-agnostic. The key property: Python, Rust, Java, C++ can all read the same memory region without serialization. A 10 GB DataFrame handed from Pandas to DuckDB to Spark via Arrow costs zero CPU for the handoff.
DuckDB — SQL on a laptop
Embedded analytical SQL engine. Runs inside your Python process. Reads Parquet, Arrow, Pandas DataFrames, and remote S3 directly. For datasets that fit in memory or can be streamed, DuckDB is often 10x faster than Spark with zero cluster setup. Ideal for data-quality checks, CI tests, ad-hoc analysis.
Polars — Pandas for the modern era
DataFrame library written in Rust, Arrow-native. Multi-threaded by default, lazy evaluation on expression chains, query optimizer. Typically 5–20x faster than Pandas on equivalent operations, with similar-enough API for reasonable migration. When a Pandas job starts being slow on a single machine, Polars is almost always the right next step before reaching for Spark.
The composition pattern
import duckdb, polars as pl
# Read S3 parquet directly with DuckDB, hand to Polars as Arrow
df = duckdb.sql("""
SELECT region, SUM(amount) AS rev
FROM read_parquet('s3://bucket/orders/*.parquet')
WHERE order_ts >= CURRENT_DATE - INTERVAL 30 DAY
GROUP BY region
""").pl() # zero-copy to Polars
# Enrich in Polars
df = df.join(regions_df, on='region').sort('rev', descending=True)
17. Pydantic and Dataclass Contracts
Data pipelines fail at boundaries. Python gives you three levels of structure for enforcing contracts at those boundaries — pick the right one for the scale.
Level 1 — dataclass
Zero-dependency, type-hint-aware, no runtime validation. Use for internal data structures where static typing (mypy) is enough and runtime validation would be overhead.
from dataclasses import dataclass
@dataclass
class Order:
order_id: int
amount: float
currency: str
Level 2 — pydantic.BaseModel
Runtime validation, coercion, JSON schema generation, great error messages. Use at API boundaries, ingestion layers, config files. The one-line cost is well worth it when the data source is outside your control.
from pydantic import BaseModel, Field, field_validator
class Order(BaseModel):
order_id: int
amount: float = Field(gt=0)
currency: str = Field(pattern=r'^[A-Z]{3}$')
@field_validator('currency')
@classmethod
def known_currency(cls, v):
if v not in {'USD','EUR','GBP','JPY'}: raise ValueError(f'unknown currency {v}')
return v
# Raises ValidationError with a precise path if the JSON violates the contract
order = Order.model_validate(json_payload)
Level 3 — External schema registry
For cross-service contracts, Python validation is not enough — the producer and consumer may be written in different languages. Use Avro or Protobuf with a schema registry (Confluent Schema Registry, Buf). Pydantic is the consumer-side validator; the source of truth lives outside Python.
18. Packaging — pip, Poetry, uv
Python packaging is legendarily messy. Three modern tools cover most of the ground for data teams. Senior candidates have opinions on which to use and why.
pip + requirements.txt
The baseline. Works everywhere. Dependency resolution is weak; no lockfile; easy to drift. Fine for scripts; wrong for production pipelines shared across a team.
Poetry
Real dependency resolver, lockfile, virtualenv management, publishing workflow. The dominant choice in open-source Python for the last five years. Slow on large dependency graphs (2+ minutes to resolve a medium project isn't unusual). Well-understood, well-documented.
uv
Written in Rust. 10–100x faster than Poetry at install and resolve. Drop-in compatible with most pip / Poetry workflows. Still maturing on edge cases but rapidly becoming the default for new projects. Worth naming in interviews as the direction the ecosystem is moving.
The team-level recommendation
- Greenfield Python project: uv.
- Existing Poetry project: stay on Poetry until a specific pain justifies migration.
- One-off notebooks / scripts: pip is still fine.
- Multi-Python-version dev (e.g., testing against 3.10 and 3.12): uv handles this cleanly via its own Python-version manager.
The Docker layering discipline
Regardless of tool, your Dockerfile should install dependencies in a separate layer from your code. Change your code, rebuild is 5 seconds. Change one dependency, rebuild is the full install. Messing up this ordering is a 100x slowdown on every CI run.
FROM python:3.12-slim
WORKDIR /app
# Layer 1: dependencies (changes rarely)
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen
# Layer 2: source code (changes often)
COPY src/ ./src/
CMD ["python", "-m", "src.main"]
Lakehouse: Iceberg & Delta
"A lakehouse table is a directory of immutable files plus a manifest that says which of those files are 'now'. Everything else — ACID, time travel, schema evolution, hidden partitioning — is a side effect of how that manifest is structured."
This chapter unpacks the table format internals: how Iceberg and Delta organize bytes on disk, what makes a snapshot atomic, how MERGE actually works (copy-on-write vs merge-on-read), how compaction is planned, what hidden partitioning gives you, and how time travel and schema evolution are implemented.
Contents
- Why open table formats exist
- Iceberg on disk: the layered metadata
- Delta on disk: the transaction log
- Snapshot isolation: how ACID is achieved on object storage
- Hidden partitioning: Iceberg's killer feature
- MERGE under the hood: COW vs MOR
- Compaction: bin-packing, sorting, Z-ORDER
- Time travel and snapshot expiration
- Schema evolution: column-id semantics
- Delete files: position vs equality
- Iceberg vs Delta vs Hudi — the trade table
- Catalog choices: Hive Metastore, Glue, Nessie, Polaris, Unity
- Operating a lakehouse: gotchas and patterns
1. Why open table formats exist
Pre-lakehouse, you had two camps:
- Data warehouse (Snowflake, BigQuery, Redshift): proprietary storage; ACID, fast queries; expensive; hard to integrate with non-SQL tools (ML).
- Data lake (Parquet on S3 + Hive Metastore): cheap storage, multi-tool, but no ACID, no concurrent writes, no schema evolution beyond Hive's limited model, no transactions, no time travel.
Open table formats — Iceberg (Netflix → Apache 2018), Delta Lake (Databricks 2019), Hudi (Uber 2017) — sit between: a metadata layer over Parquet/ORC files in object storage that gives you:
- ACID transactions
- Snapshot isolation
- Schema evolution
- Time travel
- Hidden partitioning (Iceberg)
- Multi-engine read/write (Spark, Trino, Flink, Snowflake, BigQuery, …)
The data files are still Parquet — query engines read them directly.
2. Iceberg on disk: the layered metadata
The brilliance of Iceberg: a tree of metadata pointers, where every write produces new immutable metadata and a single atomic swap of the "current pointer" commits.
Catalog (HMS / Glue / Nessie):
table: silver.fact_playback
current_metadata_pointer: s3://.../metadata/v00042.metadata.json
s3://bucket/warehouse/silver/fact_playback/
├── data/ # immutable Parquet/ORC files
│ ├── event_date=2026-04-15/
│ │ ├── 00000-1234-5678.parquet
│ │ └── 00001-1234-5678.parquet
│ └── event_date=2026-04-14/
│ └── ...
└── metadata/
├── v00041.metadata.json # previous table state
├── v00042.metadata.json # current — what catalog points to
│ └─ schemas, partition specs, sort orders, snapshots[]
│ └─ each snapshot has manifest_list pointer
├── snap-7843290023-1-...avro # manifest list for snapshot 7843290023
│ └─ rows: each is (manifest_file_path, partitions_summary)
├── 8743-1-...avro # manifest file
│ └─ rows: each is (data_file_path, partition, lower_bounds, upper_bounds, record_count)
└── 8744-1-...avro
2.1 The four metadata levels (top to bottom)
Table metadata (vNN.metadata.json):
- table-uuid, format-version (2 in modern usage)
- schemas[] list (for evolution)
- partition-specs[] (for hidden partitioning evolution)
- sort-orders[]
- snapshots[]
- history
- current-snapshot-id
- properties (compression, format, etc.)

Snapshot (entry in snapshots[]):
- snapshot-id
- parent-snapshot-id
- timestamp-ms
- summary (added/removed records, files)
- manifest-list pointer

Manifest list (snap-NNN-...avro):
- Includes per-partition summaries (lower/upper bounds for each partition column) — used for manifest-level pruning before opening individual manifests.
Manifest (Avro file):
- One row per data file.
- Includes per-column statistics: lower/upper bounds, null count, value count, NaN count.
- File-level filtering happens here.
2.2 What a query does
For SELECT ... WHERE event_date = '2026-04-15' AND user_id = 1234:
- Catalog → current metadata file.
- Read metadata, get current snapshot's manifest-list.
- Read manifest list; filter manifests whose partition summaries don't include 2026-04-15. Most pruned away.
- Read remaining manifests; filter data files whose event_date and user_id bounds don't match. Many pruned away.
- Open and scan the surviving Parquet files; use Parquet's row-group stats and Bloom filters for further row-group pruning.
Every level is a multiplicative pruning step. This is why Iceberg dominates.
2.3 The atomic commit
A commit:
- Write new data files to data/.
- Write new manifest files referring to them.
- Write a new manifest list.
- Write a new metadata.json (vNN+1.metadata.json) with the new snapshot.
- Atomic swap of the catalog pointer from vNN to vNN+1.
The atomic swap is the only operation that needs synchronization. On HMS/Glue: single-row update with optimistic concurrency. On Nessie / Polaris / Unity: branch-aware semantics. On filesystems: rename, with conflict detection.
If two writers race:
- Both build their metadata in parallel.
- One wins the swap.
- The loser's commit fails; it must rebase (re-apply on top of the winner's snapshot) and retry.
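The race can be modeled in a few lines — a toy sketch of the optimistic compare-and-swap, not any real catalog's API:

```python
# Toy model of the optimistic-concurrency commit; real catalogs
# (HMS/Glue/Nessie) implement the compare-and-swap server-side.
class Catalog:
    def __init__(self):
        self.version = 0

    def swap(self, expected, new):
        if self.version != expected:     # someone else committed first
            return False
        self.version = new
        return True

def commit(cat, retries=3):
    for _ in range(retries):
        base = cat.version               # 1. read the start snapshot
        # 2-4. write data files / manifests / metadata (omitted)
        if cat.swap(base, base + 1):     # 5. atomic pointer swap
            return True
        # lost the race → rebase against the new snapshot and retry
    return False

cat = Catalog()
assert commit(cat) and cat.version == 1
cat.version = 7                          # simulate a concurrent winner
assert commit(cat) and cat.version == 8  # rebase + retry succeeds
```

Everything before the swap is conflict-free because it only creates new immutable objects; only the single pointer update needs synchronization.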
3. Delta on disk: the transaction log
Delta uses a different organization — a transaction log of JSON files (then periodic Parquet checkpoints) describes the table state as a sequence of additions/removals.
s3://bucket/warehouse/silver/fact_playback/
├── _delta_log/
│ ├── 00000000000000000000.json # initial commit
│ ├── 00000000000000000001.json # next commit, etc.
│ ├── ...
│ ├── 00000000000000000010.checkpoint.parquet # Parquet checkpoint of state
│ ├── 00000000000000000010.json
│ └── _last_checkpoint # pointer to most recent checkpoint
└── (data files, optionally partition-prefixed)
├── part-00000-abc.snappy.parquet
└── ...
3.1 What's in a JSON commit
{"commitInfo": {...}}
{"protocol": {"minReaderVersion": 1, "minWriterVersion": 4}}
{"metaData": {...schema, partitionColumns, format, properties...}}
{"add": {"path": "part-...parquet", "partitionValues": {...}, "size": 1234567, "stats": "{json}", ...}}
{"add": {"path": "...", ...}}
{"remove": {"path": "old-file.parquet", "deletionTimestamp": ..., ...}}

State at version V = replay all JSON files from 0 to V (or from the last checkpoint to V).
stats JSON contains numRecords, minValues, maxValues, nullCount per column → identical purpose to Iceberg manifest stats.
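The replay rule is small enough to sketch — hypothetical commit payloads, carrying far fewer fields than real Delta actions:

```python
# Each element is one commit (one NNNNNN.json file's actions).
commits = [
    [{"add": {"path": "part-0.parquet"}}],
    [{"add": {"path": "part-1.parquet"}}],
    [{"remove": {"path": "part-0.parquet"}},
     {"add": {"path": "part-2.parquet"}}],
]

def replay(commits, version):
    """Live file set at a version = fold add/remove actions from 0..version."""
    live = set()
    for actions in commits[: version + 1]:
        for a in actions:
            if "add" in a:
                live.add(a["add"]["path"])
            elif "remove" in a:
                live.discard(a["remove"]["path"])
    return live

assert replay(commits, 1) == {"part-0.parquet", "part-1.parquet"}  # time travel
assert replay(commits, 2) == {"part-1.parquet", "part-2.parquet"}  # current
```

Time travel falls out for free: stop the replay at an earlier version. Checkpoints just cache the folded state so the replay doesn't start from zero.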
3.2 Checkpoints
Replaying thousands of JSON files for each query is wasteful. Every 10 commits (configurable), Delta writes a Parquet checkpoint that snapshots the full state. New readers replay from the checkpoint forward.
3.3 Atomic commit
The commit log file (NNNNNN.json) is created with a conditional put (S3 If-None-Match: *). If a file with that version already exists, the writer loses the race and retries.
This requires conditional writes — long supported on Azure Blob, ADLS Gen2, and GCS, and on S3 since 2024. (Before that, Delta on S3 needed DynamoDB-based locking via delta-storage-s3-dynamodb because S3 had no conditional puts; this is no longer required.)
4. Snapshot isolation: how ACID is achieved on object storage
Object storage is the wild west: no rename atomicity, no per-object locks (until conditional writes), eventual consistency historically.
Both Iceberg and Delta achieve serializable isolation through immutability + a single atomic swap point.
4.1 The protocol
- Read your start snapshot: S0 = current_snapshot().
- Compute the changes (which files to add, which to remove).
- Write data + metadata without touching the catalog/log pointer.
- Attempt the atomic swap: change the pointer from S0 to your new snapshot.
- If the pointer is no longer S0 (someone committed in between), conflict: rebase.
- Rebase: re-evaluate whether your changes still apply against the latest snapshot. If safe, retry the swap.
4.2 Conflict resolution rules
For Iceberg/Delta, two writers conflict if they:
- Both modify the same file (one removes a file the other read).
- Both INSERT into the same partition with overlapping logic.
- One writer's MERGE source overlaps with another writer's INSERT.
Conflict-free patterns:
- Two append-only writers to disjoint partitions: never conflict.
- One MERGE writer + one append writer to disjoint files: may not conflict.
- Two MERGE writers to overlapping data: always conflict; one wins, one retries.
4.3 Read-your-writes semantics
A reader gets a consistent snapshot — they see all files included in the snapshot they pinned, none of the files added after. That's snapshot isolation.
Caveat: Iceberg snapshots can be expired (cleaned up by a maintenance job). A long-running reader can fail with "snapshot not found" if expiration runs aggressively. Mitigations: increase retention; pin snapshot-id explicitly; don't run multi-hour queries.
5. Hidden partitioning: Iceberg's killer feature
In Hive-style partitioning, the partition column is part of the directory path (event_date=2026-04-15/...). Queries must filter on the partition column literally:
-- Old Hive trap
SELECT * FROM events WHERE DATE(event_ts) = '2026-04-15';
-- DATE(event_ts) doesn't match the event_date partition column → full scan

Iceberg separates the logical column from the partition transform:
CREATE TABLE silver.events (
event_ts TIMESTAMP,
user_id BIGINT,
...
)
PARTITIONED BY (days(event_ts), bucket(16, user_id));

Now the query:
SELECT * FROM events WHERE event_ts >= '2026-04-15' AND event_ts < '2026-04-16';

Iceberg knows that days(event_ts) is the partition; it computes the matching partition values automatically and prunes. No partition column needed in the query.
5.1 Available transforms
- identity(col): same as the raw value (Hive-style).
- bucket(N, col): hash bucket; distributes writes evenly.
- truncate(W, col): string prefix (first W chars); for ints, the value rounded down to a multiple of W.
- year(ts) / month(ts) / day(ts) / hour(ts): time-based.
5.2 Partition spec evolution
You can change partitioning over time. Old data stays under the old spec; new data uses the new spec; queries handle both transparently.
ALTER TABLE silver.events SET PARTITION SPEC (days(event_ts));
-- evolves from hour(event_ts) to day(event_ts)

Files written before keep the hour partition; new files use day. Iceberg's manifest tracks which spec each file was written under.
6. MERGE under the hood: COW vs MOR
MERGE = upsert + delete + insert in one atomic operation. Both Iceberg and Delta support it. Two execution strategies, with massive performance differences.
6.1 Copy-on-Write (COW) — the original
For each affected data file:
- Read the entire file.
- Apply MERGE changes in memory.
- Write a NEW file with the changed rows.
- Mark the old file as removed in the new snapshot.
Cost: rewriting touched files. If 1% of rows change in a 1 GB file, you rewrite 1 GB. Write amplification: 100×.
Read cost: zero overhead. Each file is a clean Parquet.
Default for most lakehouse tables historically. Best for write-rare, read-heavy workloads.
6.2 Merge-on-Read (MOR) — the streaming-friendly mode
Two strategies for representing the deletes:
- Position deletes: per data file, a delete file listing positions (row_index) of deleted rows.
- Equality deletes: a small delete file listing values (e.g. user_id = 1234) — at read time, matching rows are filtered out.
Updates = delete + insert: position-delete the old row, write a new row.
Cost on write: write the new rows + the delete file. Cheap. Write amplification: 1×.
Cost on read: read data files + apply delete files. Overhead per file with a delete file. Compaction is required periodically to prevent runaway delete-file growth.
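Position deletes at read time, sketched in a few lines (toy data, none of the Parquet/Avro machinery):

```python
# A data file's rows, and a position-delete file: ordinals within that file.
data_file = ["row0", "row1", "row2", "row3", "row4"]
position_deletes = {1, 3}

def read_with_deletes(rows, deleted):
    """MOR read path: drop rows whose position appears in the delete file."""
    return [r for i, r in enumerate(rows) if i not in deleted]

assert read_with_deletes(data_file, position_deletes) == ["row0", "row2", "row4"]
# Write amplification: the (conceptually huge) data file was never rewritten —
# only the tiny delete file was added. The read pays the filtering cost instead.
```

This is the whole COW/MOR trade in miniature: MOR shifts the rewrite cost from write time to read time, which is why compaction must eventually absorb the deletes.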
6.3 Choosing COW vs MOR
| Workload | COW | MOR |
|---|---|---|
| GDPR deletes (rare, scattered) | bad (rewrites lots of data) | good (small delete files) |
| Daily SCD2 merge (5% changed) | OK | good |
| Hourly streaming upsert (always changing) | bad | good |
| Read-heavy, infrequent writes | good | OK |
| Time-series append-only | both fine; COW simpler | — |
Set per-table:
-- Iceberg
ALTER TABLE silver.fact_playback SET TBLPROPERTIES (
'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read',
'write.merge.mode' = 'merge-on-read'
);
-- Delta (deletion vectors are roughly equivalent to position deletes)
ALTER TABLE silver.fact_playback SET TBLPROPERTIES (
'delta.enableDeletionVectors' = 'true'
);

6.4 Delta deletion vectors (the equivalent of MOR)
A bitmap (RoaringBitmap) per data file marking deleted rows. Reads apply the bitmap to skip deleted rows. Writes only touch the bitmap, not the data file.
Trade-offs same as Iceberg position deletes.
6.5 MERGE example
MERGE INTO silver.dim_user t
USING staging.user_updates s
ON t.user_id = s.user_id
WHEN MATCHED AND s.is_deleted THEN DELETE
WHEN MATCHED AND s.hash_diff <> t.hash_diff THEN UPDATE SET
plan = s.plan,
country_id = s.country_id,
updated_ts = current_timestamp()
WHEN NOT MATCHED THEN INSERT (user_id, plan, country_id, updated_ts)
VALUES (s.user_id, s.plan, s.country_id, current_timestamp());

Multi-match ambiguity: by spec, if a single target row matches multiple source rows, the operation is undefined and Iceberg/Delta will throw. Ensure your source has a unique key.
7. Compaction: bin-packing, sorting, Z-ORDER
Streaming writers produce many small files. Updates with MOR produce many delete files. Both kill query performance. Compaction rewrites them periodically.
7.1 Bin-packing (size compaction)
```sql
-- Iceberg
CALL system.rewrite_data_files(
  table => 'silver.fact_playback',
  options => map('target-file-size-bytes', '536870912') -- 512 MB
);

-- Delta
OPTIMIZE silver.fact_playback;
```
Algorithm: pick a partition; pick groups of small files whose total size fits under the target; rewrite each group as one file. The new file appears, the old ones are marked removed. Atomic snapshot.
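The grouping step can be sketched in a few lines of Python — a simplified planner, not the engine's actual algorithm (real implementations also respect partition boundaries and minimum-input-file thresholds):

```python
TARGET = 512 * 1024 * 1024  # 512 MB target file size

def plan_bin_packing(file_sizes: list[int], target: int = TARGET) -> list[list[int]]:
    """Greedily group small files so each rewrite group is close to the target size."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if size >= target:                      # already big enough: leave it alone
            continue
        if current_size + size > target and current:
            groups.append(current)              # flush a full group as one rewrite task
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 100 small 8 MB files collapse into 2 rewrite tasks of ~512 MB each
mb = 1024 * 1024
plan = plan_bin_packing([8 * mb] * 100)
```

Each inner list becomes one rewrite task producing one output file; files already at or above the target are skipped entirely.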
7.2 Sort compaction
```sql
-- Iceberg
CALL system.rewrite_data_files(
  table => 'silver.fact_playback',
  strategy => 'sort',
  sort_order => 'event_ts ASC, user_id ASC'
);
```
Sorts within each output file. Improves Parquet's row-group min/max stats → better zone-map pruning at query time.
7.3 Z-ORDER (Delta + others)
Multi-column locality. The Z-order curve interleaves the bits of two coordinates so that nearby points in N-dimensional space are nearby in linear order.
```sql
-- Delta
OPTIMIZE silver.fact_playback ZORDER BY (user_id, title_id);
```
Result: within a file, rows are clustered in the 2D space of (user_id, title_id). A predicate on either column prunes a large fraction of files.
Math: a plain sort gives full pruning power to the first column and essentially none to the others. Z-order on N columns splits the pruning power roughly evenly: with F files, a predicate on one of the N columns touches on the order of F^((N-1)/N) files instead of all F. For N=2 that's a real win; by N=4 the per-column benefit has shrunk enough to question the column list.
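The bit interleaving itself is mechanical; a minimal 2-column sketch (real engines first map arbitrary column types onto fixed-width unsigned representations before interleaving):

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order (Morton) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies the odd bit positions
    return z

# Points close in 2D stay close in the linear Z-order:
sorted_pts = sorted([(3, 5), (3, 4), (9, 30), (2, 5)], key=lambda p: z_value(*p))
```

Sorting rows by `z_value` before writing is what clusters a file's contents in both dimensions at once, which is why min/max stats on either column stay tight.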
7.4 Manifest compaction
Iceberg maintains many small manifest files when many commits add a few files each. Manifest compaction merges them.
```sql
CALL system.rewrite_manifests('silver.fact_playback');
```
Cheap and runs fast; do it daily for high-write tables.
7.5 Equality / position delete compaction
```sql
-- Iceberg
CALL system.rewrite_position_delete_files('silver.fact_playback');

-- For more aggressive compaction that absorbs deletes into data files:
CALL system.rewrite_data_files(
  table => 'silver.fact_playback',
  options => map('delete-file-threshold', '1') -- rewrite any data file with >= 1 delete
);
```
7.6 Maintenance pattern
Daily / hourly cron:
```sql
-- Compact small files, absorb deletes
CALL system.rewrite_data_files('silver.fact_playback');

-- Compact manifests
CALL system.rewrite_manifests('silver.fact_playback');

-- Expire old snapshots (default keeps 5 days)
CALL system.expire_snapshots(
  table => 'silver.fact_playback',
  older_than => TIMESTAMP '2026-04-09 00:00:00'
);

-- Remove orphan files (data files no longer referenced by any snapshot)
CALL system.remove_orphan_files('silver.fact_playback');
```
8. Time travel and snapshot expiration
Both formats preserve old snapshots until you expire them.
```sql
-- Iceberg: query by snapshot ID or timestamp
SELECT * FROM silver.fact_playback FOR VERSION AS OF 7843290023;
SELECT * FROM silver.fact_playback FOR TIMESTAMP AS OF '2026-04-15 12:00:00';

-- Delta
SELECT * FROM silver.fact_playback VERSION AS OF 42;
SELECT * FROM silver.fact_playback TIMESTAMP AS OF '2026-04-15 12:00:00';
```
8.1 Practical uses
- Audit: "show me what the table looked like before yesterday's bad merge".
- Reproducibility: ML training reading a frozen snapshot ID.
- Rollback: undo a bad write with a single CALL.
- CDC: Iceberg snapshot diff or Delta CDF gives per-snapshot row changes.
8.2 Expiration trade-off
Long retention = huge storage cost (you keep every old version of every file). Short retention = no recovery from incidents.
Common default: 5–7 days. Audit-required tables: longer (90+ days), with cost to match.
```sql
ALTER TABLE silver.fact_playback SET TBLPROPERTIES (
  'history.expire.min-snapshots-to-keep' = '20',
  'history.expire.max-snapshot-age-ms' = '604800000' -- 7 days
);
```
8.3 Rollback
```sql
-- Iceberg
CALL system.rollback_to_snapshot('silver.fact_playback', 7843290023);

-- Delta
RESTORE TABLE silver.fact_playback TO VERSION AS OF 42;
```
Atomic: just swap the current snapshot pointer. The "rolled back" data files reappear (assuming they haven't been expired yet).
9. Schema evolution: column-id semantics
Hive's schema evolution is positional and brittle. Iceberg tracks every column by an immutable column ID, not by name or position.
9.1 What's safe
- Add column: new ID, default value if not in older files.
- Drop column: ID retired; old files keep their data but it's not exposed.
- Rename column: same ID, new name (no file rewrite!).
- Reorder columns: doesn't matter; ID-based reads.
- Promote types: int → long, float → double, decimal precision up. Done in metadata.
9.2 What's not safe (requires rewrite or fails)
- Rename when other engines map by name (some catalogs).
- Change type incompatibly (string → int).
- Change a NOT NULL constraint on existing data with NULLs.
9.3 Why Parquet "just works"
Parquet stores column names, not IDs by default. Iceberg requires writing files with field IDs in Parquet metadata so that reads can map by ID even if names changed. Older readers without ID support fall back to name-based mapping.
```sql
ALTER TABLE silver.events ADD COLUMN device_class STRING;
ALTER TABLE silver.events RENAME COLUMN device_class TO device_type;
ALTER TABLE silver.events DROP COLUMN device_type;
ALTER TABLE silver.events ALTER COLUMN price TYPE decimal(18, 4); -- type promotion
```
All metadata-only operations. No data file rewrite.
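The rename case is the one worth internalizing. A toy resolver (hypothetical structures, not Iceberg's actual classes) shows why ID-based reads survive a rename with zero file rewrites:

```python
# Table schema after a rename: field IDs are stable, names are just labels.
table_schema = {1: "user_id", 2: "device_type"}        # id -> current name

# A data file written *before* the rename carries the old name but the same ID.
old_file_columns = {1: "user_id", 2: "device_class"}   # id -> name at write time
old_file_data = {"user_id": [10, 11], "device_class": ["tv", "phone"]}

def read_column(current_name: str) -> list:
    """Resolve a column by field ID, not by name."""
    field_id = next(i for i, n in table_schema.items() if n == current_name)
    name_in_file = old_file_columns[field_id]   # may differ after a rename
    return old_file_data[name_in_file]

# Reading the renamed column still finds the old file's data:
values = read_column("device_type")   # -> ["tv", "phone"]
```

A name-based (Hive-style) reader would look for "device_type" in the old file, find nothing, and return NULLs — which is exactly the brittleness column IDs eliminate.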
9.4 Delta schema evolution
Similar capabilities via column mapping, which requires newer Delta protocol versions (reader ≥ 2). Older Delta tables (without column mapping) need ALTER TABLE ... SET TBLPROPERTIES('delta.columnMapping.mode' = 'name') to enable rename/drop.
10. Delete files: position vs equality
In Iceberg's MOR mode, two flavors of delete file:
10.1 Position deletes
Per data file, lists row positions (offsets in row order) to skip.
```
delete-file-1.parquet:
file_path                         | pos
----------------------------------+-----
data/event_date=.../00001.parquet | 42
data/event_date=.../00001.parquet | 1003
data/event_date=.../00002.parquet | 7
```
Read-time cost: low; the positions are applied per data file as a row-position bitmap.
10.2 Equality deletes
Lists values that should be deleted.
```
delete-file-1.parquet:
user_id | event_id
--------+----------
1234    | abc
5678    | xyz
```
At read time, every row is checked against the equality predicate.
Read-time cost: higher (requires per-row evaluation). But useful when you don't know exact positions (e.g. CDC stream just says "delete user_id=1234").
10.3 When each is used
- Position deletes: produced by MERGE/UPDATE/DELETE that knows the affected file+row.
- Equality deletes: produced by streaming sinks (Flink) that don't want to read the data file to find positions.
Iceberg can mix both within a snapshot.
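A toy read path showing both flavors applied to one data file (simplified: a real reader builds a per-file position bitmap and pushes equality predicates into the scan, but the cost asymmetry is the same):

```python
# Rows of one data file, in row order (position = list index).
rows = [
    {"user_id": 1234, "event_id": "abc"},
    {"user_id": 5678, "event_id": "def"},
    {"user_id": 9999, "event_id": "ghi"},
]

position_deletes = {0}                  # delete the row at position 0 of this file
equality_deletes = [{"user_id": 5678}]  # delete any row matching these values

def apply_deletes(rows, position_deletes, equality_deletes):
    out = []
    for pos, row in enumerate(rows):
        if pos in position_deletes:          # O(1) per row: a bitmap lookup
            continue
        if any(all(row[k] == v for k, v in eq.items()) for eq in equality_deletes):
            continue                         # per-row predicate evaluation: costlier
        out.append(row)
    return out

live = apply_deletes(rows, position_deletes, equality_deletes)
```

The position check is constant-time per row; the equality check scales with the number of accumulated equality-delete predicates — which is why compaction of equality deletes matters more.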
11. Iceberg vs Delta vs Hudi — the trade table
| Feature | Iceberg | Delta | Hudi |
|---|---|---|---|
| Origin | Netflix | Databricks | Uber |
| Governance | Apache | Linux Foundation (Delta Lake) | Apache |
| Engine support | Spark, Flink, Trino, Snowflake, BigQuery, Athena, ClickHouse | Spark (best), Trino, Flink, Synapse | Spark (best), Trino, Flink |
| Hidden partitioning | yes | no | partial |
| Schema evolution | column IDs | column mapping (opt-in) | yes |
| MERGE strategies | COW + MOR (position + equality deletes) | COW + deletion vectors (position equivalent) | COW + MOR |
| Time travel | yes | yes | yes |
| Branches / tags | yes (with Nessie/Polaris/Iceberg V3) | no (clone is similar) | savepoints |
| Concurrent writers | optimistic; conflict on file overlap | optimistic; conflict on file overlap | optimistic |
| Catalog options | HMS, Glue, REST (Polaris), Nessie, JDBC | Hive, Unity (Databricks), Glue (limited) | HMS |
| Streaming ingestion | Flink, Spark Streaming | Spark Streaming, Flink | Streamer (own tool) |
| Industry traction (2026) | dominant for new builds, Snowflake/BigQuery/Databricks all support | strong on Databricks, Unity Catalog | declining outside Hudi-native shops |
The honest take: Iceberg has won the open standard race. Delta is excellent on Databricks. Hudi is fine if you're already on Hudi.
12. Catalog choices: Hive Metastore, Glue, Nessie, Polaris, Unity
The catalog stores table → current metadata pointer. The catalog choice determines:
- Atomic commit semantics
- Multi-engine interop
- Branch/tag support
- Access control
- Lineage
12.1 Hive Metastore (HMS)
The original. Thrift API, MySQL/Postgres backend. Universally supported. No branches. Limited ACL. Single-region typically.
12.2 AWS Glue Data Catalog
HMS-compatible API, managed, multi-region. Good with EMR / Athena / Redshift Spectrum / Snowflake (via Iceberg integration). No branches.
12.3 Project Nessie
Git-like semantics for catalogs. Branches, tags, commit history per table change. Great for ML reproducibility ("this experiment trained on the prod-2026-04-15 tag").
12.4 Apache Polaris (Snowflake's contributed REST catalog)
Open-source REST catalog implementing Iceberg's REST spec. Multi-engine, RBAC, support for branches / Nessie-style semantics over time. The strongest contender for "the universal Iceberg catalog".
12.5 Unity Catalog (Databricks)
Databricks's metastore with fine-grained ACL, lineage, audit. Native to Databricks; expanding interop via Iceberg REST.
12.6 The pragmatic choice
For Iceberg in 2026: Polaris (REST) or Glue if you're cloud-native; Nessie if you want git-like branching; Unity if you're on Databricks. HMS for legacy.
13. Operating a lakehouse: gotchas and patterns
13.1 The small-files killer
Streaming writers (Flink/Spark Structured Streaming) commit every checkpoint interval. With a 30-second checkpoint and 100 partitions, that's 200 small files per minute → 288K per day. Run compaction.
Mitigations:
- Increase checkpoint interval (1–5 minutes is usually fine).
- Use distribution mode `hash` so each writer covers a single partition.
- Enable Iceberg's auto-compaction (write-side) if available.
- Schedule daily compaction job.
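The arithmetic behind these levers is worth a quick sanity check (assuming one file per writer per checkpoint, as in the pathological case above):

```python
def small_files_per_day(checkpoint_interval_s: int, parallel_writers: int) -> int:
    """Files committed per day if every writer emits one file per checkpoint."""
    commits_per_day = 86_400 // checkpoint_interval_s
    return commits_per_day * parallel_writers

assert small_files_per_day(30, 100) == 288_000    # 30s checkpoints, 100 writers
assert small_files_per_day(300, 100) == 28_800    # 5-min checkpoints: 10x fewer
```

Stretching the checkpoint interval is linear leverage; hash distribution attacks the `parallel_writers` factor instead.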
13.2 The expiration trap
Aggressively expiring snapshots breaks long-running readers. Aggressively expiring orphan files can DELETE files an in-flight reader needed.
Pattern:
- Set `expire_snapshots` `older_than` to "older than the longest possible query" (e.g. 24h).
- Run `remove_orphan_files` with `older_than` even longer (e.g. 72h).
- Never run `remove_orphan_files` with a low threshold — you risk deleting files referenced by a snapshot you're about to expire but haven't yet.
13.3 Concurrent writers conflict spiral
Pattern: stream + batch both writing same table → conflict → batch retries → still conflicts → fails.
Fix:
- Partition by time so stream and batch write to different partitions.
- Or have one writer; use staging tables and merge in a single writer.
- Or use REST catalog with retry-with-backoff.
13.4 Cross-region reads
Iceberg metadata is small; data files are large. If reading from another region, the metadata cost is negligible but data egress is real. Replicate data files with S3 cross-region replication; the catalog can point at the regional bucket; consider per-region copies of the table.
13.5 Auditing / data contracts
Both formats store commit metadata: who, when, what changed. Surface this in your data catalog.
```sql
SELECT *
FROM silver.fact_playback.snapshots
ORDER BY committed_at DESC
LIMIT 10;
```
Iceberg metadata tables (`.snapshots`, `.history`, `.files`, `.partitions`, `.manifests`) are first-class queryable views on the table's state. Use them for monitoring, alerting on anomalous commits, and observability.
13.6 Cost monitoring
S3 LIST and small-file PUT operations are real cost drivers at scale. Monitor:
- Number of files per table per day.
- Average file size.
- Number of manifests.
- Snapshot accumulation.
A healthy table has files in the 128 MB – 1 GB range, manifests merged daily, snapshots expired weekly.
Closing principle
Iceberg/Delta turned object storage into a real database. The cost: you have to operate it like one — compaction, expiration, schema discipline, conflict handling. Get the maintenance jobs right and the lakehouse just works. Skip them and you'll be on a 3am page within 90 days.
16. Iceberg vs Delta vs Hudi — The Feature Matrix
In every lakehouse interview at least one question probes your table-format opinions. The informed answer goes beyond "Iceberg has better engine support" — here's the matrix of real differences.
| Capability | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Snapshot isolation | Manifest-list based | JSON transaction log | Commit timeline + file groups |
| MERGE semantics | Copy-on-write (v1) or merge-on-read (v2) | Copy-on-write default, deletion vectors for MoR | MoR default (merge-on-read with log files) |
| Hidden partitioning | Yes — partition spec evolution | No (Liquid clustering is the answer) | Partial (partition fields declared) |
| Schema evolution | Add, drop, rename, reorder, type-widen — all safe | Add, type-widen only (drop/rename require explicit config) | Add, rename (field IDs) |
| Time travel | Snapshot ID, timestamp | Version, timestamp | Commit timestamp |
| Row-level deletes | Position or equality delete files | Deletion vectors (since 3.0) | Native |
| Engine support | Spark, Trino, Flink, DuckDB, many | Spark (native), Trino (read), others growing | Spark (native), Trino (read) |
| Cross-engine writes | Excellent via REST catalog | Historically Spark-first; improving | Spark-first |
| Streaming ingest optimization | Good (compact-on-write) | Good | Best (MoR designed for upsert streams) |
How to pick — the decision heuristic
- Multiple engines writing the same table? → Iceberg. Cross-engine write semantics are its strongest suit.
- Single-engine Databricks shop? → Delta. Best native integration, newer features land there first.
- High-volume upsert workload (CDC ingest, slowly-changing)? → Hudi. MoR was designed for this.
- Analytics-heavy with occasional updates? → Iceberg or Delta; both fit well.
17. Catalog Architectures — Hive, Glue, REST, Unity, Polaris
The catalog layer determines which engines can write the same table and how metadata is shared across engine boundaries. It's also the single biggest lock-in risk in a lakehouse.
Hive Metastore (HMS)
The original. Stores table metadata in a relational DB (MySQL/Postgres). Thrift API. Good: every query engine in the world speaks it. Bad: single-master, schema-less for table formats, no native ACLs, no multi-tenancy.
AWS Glue
Managed HMS-equivalent. Same API, fewer operational headaches. Lock-in to AWS for metadata. Supports Iceberg and Delta as catalog entries. Fine for AWS-centric shops; painful if you plan a multi-cloud future.
Iceberg REST Catalog
Open spec. HTTP API that any engine can call. Backend-agnostic (can be backed by HMS, Glue, Nessie, Polaris). The direction the industry is moving for vendor-neutral catalogs.
Unity Catalog (Databricks)
Centralized governance across Delta tables, files, ML models, and views. Strong on ACLs, lineage, attribute-based access control. Works best inside Databricks; open-source Unity is catching up on ecosystem support.
Polaris (Snowflake)
Open-source Iceberg REST catalog from Snowflake. Aimed at the same "centralized governance, open engine access" thesis as Unity. Still early but strategically significant.
The migration trap
Switching catalogs is not a metadata-only operation even though it sounds like one. Every engine's configuration must point at the new catalog, every table must be re-registered with matching paths, and any engine-specific extensions (Delta vacuum retention, Iceberg snapshot expiration) must be preserved. A catalog migration is a project, not a weekend. Plan for it the same way you'd plan a table-format migration.
18. Write Amplification and Compaction Strategy
Copy-on-write tables pay a cost each time a partition is updated: the entire partition gets rewritten. That cost is called write amplification — the ratio of bytes rewritten to bytes actually changed.
Quantifying write amplification
Suppose a 1 TB daily partition is written once then a MERGE statement touches 0.1% of rows the next day. Copy-on-write rewrites the whole 1 TB partition to land 1 GB of changes. Write amplification = 1000x. At 30 such MERGEs per day, you're writing 30 TB to land 30 GB of logical changes.
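The same arithmetic, written out (decimal units; the figures are the ones from the paragraph above):

```python
def write_amplification(partition_bytes: float, changed_bytes: float) -> float:
    """Bytes physically rewritten per byte logically changed (copy-on-write)."""
    return partition_bytes / changed_bytes

TB, GB = 10**12, 10**9

wa = write_amplification(1 * TB, 1 * GB)   # whole 1 TB partition rewritten for 1 GB
daily_physical_tb = 30 * (1 * TB) / TB     # 30 such MERGEs/day -> 30 TB written...
daily_logical_gb = 30 * (1 * GB) / GB      # ...to land 30 GB of logical changes
```

At a write amplification of 1000x, compute and I/O spend is dominated by rewriting unchanged bytes — the motivation for merge-on-read.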
Merge-on-read as the fix
Delta's deletion vectors and Iceberg's v2 delete-files let you write a small side-file containing "these rows are deleted / updated to X." Readers apply the overlay at query time. Write cost collapses. Read cost rises — by how much depends on how many overlays have accumulated. That's where compaction comes in.
Compaction strategies
- Bin-pack. Merge many small files into fewer larger ones. Cheapest form of compaction; doesn't touch overlays.
- Sort. Rewrite with a clustering key (`ZORDER` in Delta, sort-order in Iceberg). Improves read performance by improving min/max pruning and compression. More expensive to run.
- Delete-vector application. Materialize deletions into the main data files. Reclaims the read-time overlay cost. Run periodically (weekly / monthly) based on overlay ratio.
Operational rule of thumb
If your read queries on a table are getting slower over time without the data growing, it's probably compaction debt. Monitor the ratio of delete-file bytes to data-file bytes; when it crosses ~5–10%, schedule compaction.
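That rule of thumb is a one-liner once you have the byte totals — in practice pulled from Iceberg's `.files` metadata table or the Delta log; the numbers below are illustrative:

```python
def needs_compaction(data_file_bytes: int, delete_file_bytes: int,
                     threshold: float = 0.05) -> bool:
    """Flag compaction when the delete-overlay ratio crosses the threshold (~5%)."""
    if data_file_bytes == 0:
        return False
    return delete_file_bytes / data_file_bytes >= threshold

assert not needs_compaction(1_000_000_000, 20_000_000)   # 2% overlay: healthy
assert needs_compaction(1_000_000_000, 80_000_000)       # 8% overlay: schedule it
```

Run this per table (or per partition) in the same cron that does bin-packing, and alert rather than silently compact if the ratio spikes suddenly — a spike usually means a misbehaving writer.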
19. Multi-Table Transactions and the Two-Phase Commit
Lakehouse formats provide atomic commits per table. They don't provide atomic commits across tables by default. If your pipeline updates a fact and a dimension that must move together, you need to solve this yourself.
The problem, stated precisely
Pipeline writes fct_order and dim_customer. If fct_order commits but dim_customer fails, consumer queries see new orders referencing missing customer keys. Data corruption.
Three mitigation strategies
- Staging table pattern. Write to `staging_*` tables. Once both staging writes succeed, do a "rename" commit — point the final-table catalog entries at the staging paths atomically. Works; requires careful metadata choreography; error recovery is tricky.
- Two-phase commit via external coordinator. Use a transaction coordinator (ZooKeeper, a custom service, or an orchestrator checkpoint). Prepare writes to both tables; commit only when both prepares succeed, otherwise roll both back. Heavy; rarely worth the complexity.
- Ordering + idempotent consumers. Pick an ordering where the "most dangerous" side is committed last. If dim is committed before fact, consumers can handle missing dim keys gracefully (inferred members). Accept at-least-once committing; downstream consumers tolerate it. This is the most common production pattern.
Iceberg's multi-table transaction feature
Iceberg (as of v1.4+) supports a Transactions API that commits against multiple tables atomically if the catalog supports it. REST Catalog does; Glue doesn't. When supported, it eliminates the strategies above for its scope. Still worth understanding the fallbacks because catalog support is uneven.
Interview Q&A — Real Scenarios
"The point of these questions isn't 'do you know the answer.' It's 'can you reason out loud, ask the right clarifying questions, and stop when you've answered enough.'"
These are scenarios drawn from real Senior / L5 Data Engineering loops — Netflix, Stripe, Airbnb, Pinterest, Uber, Meta, DoorDash. No leetcode. No "reverse this binary tree." Every scenario is something a real engineer faced at 3am or in a design review.
The format for each:
- Scenario — what they ask.
- What they're testing — the underlying skill.
- Answer skeleton — how a strong candidate structures the answer.
- What weak candidates miss.
- Bonus / follow-ups.
Contents
Incident response (page-at-3am)
- The pipeline missed SLA — diagnose
- The dashboard shows half the revenue it did yesterday
- Streaming consumer lag is climbing and won't drain
- Iceberg table reads are 10× slower this week
- Spark job OOMs only on Mondays
- Late-arriving data corrupted yesterday's report
System design
- Design a clickstream pipeline at 1M events/sec
- Design a feature store for ML serving
- Design point-in-time correct training data
- Design the architecture for a daily metric that must be 100% accurate
- Design SCD Type 2 ingestion from Kafka CDC
- Design a multi-region data warehouse
- Design exactly-once for a payments-counting pipeline
Deep internals (gotchas)
- Explain how Spark decides BHJ vs SMJ at runtime
- Why is your shuffle slow and what can you do
- Why doesn't this filter push down
- Walk me through what happens during a Flink checkpoint
- Walk me through an Iceberg commit, end-to-end
- How does a watermark form across a Kafka topic with 12 partitions
- Why does my EXACTLY_ONCE Kafka producer still produce duplicates downstream
Modeling judgement
- Star schema vs OBT — when each
- Should this dimension be SCD2 or SCD1
- Should we persist this Silver model or rebuild from Bronze
- Do we partition by user_id or by date
- How do you handle 'soft deletes' in a fact table
- The PM asked for 'real-time' — what do you ask back
Engineering quality
- How do you test a Spark transformation
- How do you backfill safely
- How do you design a data contract
- How do you measure pipeline quality
- How do you do schema evolution without breaking consumers
- What's your CI/CD for a data warehouse
Trade-offs and judgment
- Latency vs cost vs correctness — pick two
- When would you choose a row-store for analytics
- When would you NOT use Iceberg
- When is Lambda architecture justified in 2026
- Polars or Spark — when each
- Build vs buy: orchestration, lineage, quality, catalog
Behavioural / leadership
- Tell me about a time you said no to a stakeholder
- Tell me about a time a pipeline you owned caused a bad metric
- How do you decide what NOT to build
- How do you onboard the next engineer
Incident: The pipeline missed SLA — diagnose
What they're testing: triage methodology under uncertainty.
Answer skeleton:
- What's the SLA, what's the actual completion time, by how much did it slip? "Missed SLA" might be 2 minutes or 6 hours — the answer is different.
- What's the symptom: late, failed, or partial?
- Look at the orchestrator's view first: which task in the DAG is slow/failing? That narrows the surface area immediately.
- For the slow task: drill into the platform's UI (Spark UI, Flink UI, dbt logs, query history). Look at:
- Stage times (which stage spent the most time?)
- Task time distribution (skew?)
- Data volume read/written (data spike?)
- GC time (memory pressure?)
- Compare to baseline: was yesterday fine? If yes, what changed? Recent code deploy, infra change, upstream data volume change?
- Hypothesis → fix → verify. Don't fix without a hypothesis.
- Communicate: post in #data-incident, set ETA, update if it slips.
Common mistakes: jumping to "it's the cluster size, scale up" before understanding what changed. Fixing in production without a backout plan.
Bonus: prevention — what observability would have caught this earlier? (Volume monitoring on the upstream Kafka topic, partition skew alerts, anomaly detection on per-stage runtime.)
Incident: The dashboard shows half the revenue it did yesterday
What they're testing: data correctness debugging.
Answer skeleton:
- Confirm the symptom: is it the dashboard's filter, time zone, query? Check the dashboard's underlying SQL.
- Check the source-of-truth tables: query directly with the same logic, see if numbers match. If yes → BI/dashboard issue. If no → data issue.
- For a data issue, walk upstream:
- Did the gold model run? (orchestrator)
- Did silver run? Did it produce expected row counts?
- Did bronze ingestion complete?
- Did the source produce normal volume? (Kafka topic lag, source DB row counts)
- Most likely culprits in order: (a) duplicate-suppression bug now over-suppressing; (b) join condition newly produces no match for a category; (c) upstream schema change dropped a column to NULL; (d) timezone shift; (e) partial data due to a missed late arrival.
- Reproduce in a notebook: bisect by date and category to isolate. "It dropped at 6pm UTC and only for SUBSCRIPTION revenue" tells you a lot.
- Hot-fix the metric (if possible), file an incident, do a postmortem.
Common mistakes: guessing at causes; fixing the dashboard query without finding the root cause; not checking row counts at every layer.
Incident: Streaming consumer lag is climbing and won't drain
What they're testing: streaming systems debugging.
Answer skeleton:
- Quantify: which topic, which consumer group, how much lag, growing how fast (msg/sec)? Per-partition or uniform?
- Check upstream: is the producer rate higher than usual? Spike, sustained, or normal?
- Check downstream: if the consumer writes to a sink (DB, S3), is the sink the bottleneck? Sink latency, throttling errors?
- Check the consumer process: CPU? Memory? GC? Look at Flink/Spark UI: backpressure indicators, busy operators, checkpoint durations.
- Check skew: is one partition's lag 10× the others? → key skew. Salt the key, repartition with a different partitioner, or fix the hot key at the producer.
- Check rebalances: storms of group rebalances cause processing pauses. Look at consumer group state changes.
- Triage actions in order:
- Add more consumers / parallelism (if uniform load).
- Bypass DLQ for known-bad messages (if poison pill).
- Pause non-critical sinks to give CPU back.
- As a last resort, increase the Kafka partition count (triggers rebalances, changes the key → partition mapping, and is effectively irreversible).
- Don't reset offsets unless you know what you're doing — risk of data loss or duplication.
Common mistakes: blindly scaling up consumers (won't help if downstream is the bottleneck); resetting offsets without consequences understood.
Bonus: how does Flink's credit-based backpressure manifest? (Upstream operator slows down because downstream stops issuing credits — backPressuredTimeMsPerSecond metric.)
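The skew check above is mechanical once you have per-partition lag — the numbers below are illustrative; in practice they come from `kafka-consumer-groups --describe` or a lag exporter:

```python
from statistics import median

def skewed_partitions(lag_by_partition: dict[int, int],
                      factor: float = 10.0) -> list[int]:
    """Partitions whose lag exceeds `factor` x the median lag -> likely key skew."""
    med = median(lag_by_partition.values())
    return [p for p, lag in lag_by_partition.items()
            if med > 0 and lag > factor * med]

lags = {0: 1_200, 1: 900, 2: 1_100, 3: 450_000}   # partition 3 holds the hot key
hot = skewed_partitions(lags)
```

A positive result here changes the triage path: more consumers won't help, because one partition is the bottleneck regardless of group size.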
Incident: Iceberg table reads are 10× slower this week
What they're testing: lakehouse operations.
Answer skeleton:
- Check the file count and average file size: query the metadata table.
```sql
SELECT COUNT(*)                AS file_count,
       AVG(file_size_in_bytes) AS avg_size,
       SUM(file_size_in_bytes) AS total_size
FROM silver.fact_playback.files
WHERE partition_event_date = DATE '2026-04-15';
```
- Check the manifest count: lots of small manifests = lots of LIST/GET on metadata.
- Most likely: streaming writer started running every minute, producing 1440 small files per partition per day. Reads must open all of them.
- Fix: schedule compaction.
```sql
CALL system.rewrite_data_files('silver.fact_playback');
CALL system.rewrite_manifests('silver.fact_playback');
```
- Long-term: enable async compaction at write time, or reduce streaming commit frequency.
Common mistakes: blaming the query engine; not checking metadata.
Incident: Spark job OOMs only on Mondays
What they're testing: data-volume-aware debugging.
Answer skeleton:
- What's special about Mondays: typically aggregating a weekend's worth of data → 3× the volume.
- Where does it OOM: container OOM (exit 137) or JVM heap OOM? They have different fixes.
- Container OOM: Pandas UDF or Arrow buffer overflow → bump `spark.executor.memoryOverhead`.
- JVM OOM in shuffle/join: a partition that fits Tue–Sun no longer fits Mon → AQE skew handling, salting, raising the broadcast threshold, or splitting heavy hitters.
- JVM OOM in a Pandas UDF aggregation: the Python worker explodes on a single mega-group → check group sizes, rewrite using SQL aggregations or `applyInPandas` with smaller groups.
- Long-term fix: don't let runtime scale linearly with data; use incremental / windowed aggregation.
Common mistakes: increasing executor memory without diagnosing whether it's heap or overhead.
Incident: Late-arriving data corrupted yesterday's report
What they're testing: event-time vs processing-time understanding.
Answer skeleton:
- Confirm: was yesterday's report finalized at "watermark close" or "processing-time end-of-day"? Different bugs each.
- If watermark-based: late events past the allowed lateness were dropped. Configure higher allowed lateness, or accept that some events will appear in tomorrow's report instead.
- If processing-time: yesterday included only events that arrived yesterday, not events whose event_time was yesterday. Switch the report to event-time semantics.
- For lakehouse: re-process the affected day with a backfill that overwrites only the affected partition. Use `INSERT OVERWRITE`, `replaceWhere`, or MERGE.
- Communicate: data was corrected at T+24h; explain the trade-off (you can't have low-latency AND complete data without retraction).
Common mistakes: ignoring the event-time/processing-time distinction; reporting on stream output without watermark discipline.
Design: a clickstream pipeline at 1M events/sec
What they're testing: end-to-end systems thinking, capacity planning.
Answer skeleton:
- Clarify: 1M events/sec average or peak? Event size (bytes)? Latency requirement (sec, min, hour)? Use cases (real-time dashboard, ML feature, batch analytics)? Retention (days, years)?
- Capacity math: 1M × 1KB = 1 GB/sec ingress = 86 TB/day. Storage: 30 days = 2.6 PB raw; with Snappy compression / Parquet → ~600 TB.
- Architecture (sketch):
```
Edge SDK → API Gateway → Kafka (topic, ~100 partitions)
  ├─ Real-time: Flink → Druid/Pinot for sub-second dashboard
  ├─ Near-real-time: Flink → Iceberg (1-min commit) → Trino for ad-hoc
  └─ Batch enrichment: Spark daily → silver/gold Iceberg tables
```
- Kafka design: 100 partitions × 3 replicas × 7-day retention; partition by user_id (most queries are per-user) or random (avoid skew). Consider tiered storage.
- Schemas: Avro/Protobuf with Schema Registry. Backward + forward compat policy.
- Failure modes:
- Edge SDK can't reach gateway → local buffer + retry with backoff.
- Kafka unavailable → producer queue → DLQ.
- Flink job dies → exactly-once via checkpoints + transactional Kafka writes.
- Iceberg commit conflict → REST catalog with retry.
- Cost: Kafka (compute + storage) > Flink (CPU) > S3 (storage); aim for 70/20/10 distribution.
- Monitoring: lag per consumer group, SLO per stage, p99 latency, schema validation failures.
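The capacity math from step 2, written out (assumptions stated inline: 1 KB average event, 30-day retention, ~4× columnar compression):

```python
EVENTS_PER_SEC = 1_000_000
EVENT_BYTES = 1_000        # assumed 1 KB average payload -- say this out loud
COMPRESSION = 4            # assumed Parquet + Snappy ratio

ingress_gb_per_sec = EVENTS_PER_SEC * EVENT_BYTES / 1e9   # 1.0 GB/s ingress
daily_tb = ingress_gb_per_sec * 86_400 / 1_000            # ~86.4 TB/day raw
raw_30d_pb = daily_tb * 30 / 1_000                        # ~2.6 PB raw, 30 days
compressed_30d_tb = daily_tb * 30 / COMPRESSION           # ~650 TB columnar
```

Doing this on the whiteboard before drawing boxes is what separates a capacity-planned design from a logo diagram — every downstream choice (partition count, Flink parallelism, S3 layout) falls out of these four numbers.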
What weak candidates miss: capacity math; Kafka partition design; the difference between "real-time dashboard" and "real-time data lake".
Bonus: how do you handle hot keys? (Salting the partition key but joining back later, or pre-aggregating before partitioning.)
Design: a feature store for ML serving
What they're testing: balance of online + offline systems.
Answer skeleton:
- Clarify: how many features, how many models, online QPS, offline volume, freshness SLA per feature.
- Two-tier architecture:
- Offline store (Iceberg/Delta on S3): feature values over time, used for training.
- Online store (DynamoDB / Redis / ScyllaDB): latest values per entity, low-latency lookup.
- Materialization:
- Batch features (daily aggregations) computed in Spark, written to both stores.
- Streaming features (last 5 minutes) computed in Flink, written to both.
- Same definition, two paths — risk of skew. Use a single feature definition (Tecton-like, or your own DSL).
- Point-in-time correctness: training data must use the feature's value as it was at the prediction time, not as it is now. Implement an as-of join (see chapter 05).
- Lineage and discovery: catalog of feature definitions, owners, freshness, schema, tests.
- Failure modes: online store stale (alert), offline/online skew (compare hourly), feature drift (statistical tests).
What weak candidates miss: point-in-time joins for training; the offline/online consistency problem.
Design: point-in-time correct training data
See as-of joins and the feature-store design above. Key principle: every feature value has a valid_from/valid_to; every label has an event_ts; the join takes the latest feature value where valid_from <= event_ts. Implement with BETWEEN join + LATERAL or DuckDB ASOF JOIN.
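The lookup half of an as-of join can be sketched with `bisect` (toy data; production versions use the SQL patterns referenced above or DuckDB's ASOF JOIN):

```python
from bisect import bisect_right

# Feature history per entity: (valid_from_ts, value), sorted by valid_from_ts.
feature_history = {
    "user_1": [(100, 0.2), (200, 0.5), (300, 0.9)],
}

def asof_lookup(entity: str, event_ts: int):
    """Latest feature value whose valid_from <= event_ts, else None."""
    history = feature_history.get(entity, [])
    idx = bisect_right(history, (event_ts, float("inf"))) - 1
    return history[idx][1] if idx >= 0 else None

# A label at ts=250 must see the value that was live at 250, not today's value:
assert asof_lookup("user_1", 250) == 0.5
assert asof_lookup("user_1", 99) is None   # no feature existed yet: no leakage
```

The second assertion is the leakage guard: if no feature value existed at the label's event time, the training row gets NULL, never a future value.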
Design: a daily metric that must be 100% accurate
What they're testing: when to give up on streaming.
Answer skeleton:
- Clarify "100%": matched to a system of record (payments DB)? Eventually consistent (T+1)?
- If matched to system of record: don't stream. Run a batch ETL after the source system has closed the day (T+24h). Reconcile against the SOR with a tolerance check (e.g., difference < $1).
- Architecture:
- Source DB → daily Snowflake/BigQuery export at end of day.
- Reconcile total revenue from source vs total in warehouse.
- If diff > tolerance, halt downstream pipelines and alert.
- Only after reconciliation succeeds, publish the metric.
- Why not stream? Streaming has bounded out-of-orderness; late events change the answer. For audit-grade metrics, accept latency.
- Hybrid: real-time approximate + daily authoritative. Make sure consumers know which they're reading.
What weak candidates miss: the reconciliation step; understanding that streaming results are provisional for audit-grade metrics — late data can always revise them.
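The reconciliation gate can be sketched as (a minimal illustration; the `reconcile` helper and the $1 tolerance follow the example above and are not a real API):

```python
def reconcile(source_total, warehouse_total, tolerance=1.00):
    """Publish the metric only if the warehouse matches the system of record
    within tolerance; otherwise halt downstream pipelines and alert."""
    diff = abs(source_total - warehouse_total)
    return {"diff": round(diff, 2), "publish": diff <= tolerance}

print(reconcile(1_000_000.00, 1_000_000.50))  # within $1 → publish
print(reconcile(1_000_000.00, 1_000_500.00))  # $500 off → halt and alert
```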
Design: SCD Type 2 ingestion from Kafka CDC
What they're testing: streaming + dimensional modeling joined.
Answer skeleton:
- Source: Debezium → Kafka topic with one envelope per change: {op: c|u|d, before, after, source: {ts_ms, snapshot}}.
- Sink: Iceberg dim table with valid_from, valid_to, is_current.
- Pattern A: streaming MERGE per micro-batch (Spark Structured Streaming):
- Each batch: deduplicate by user_id keeping latest LSN/ts_ms.
- MERGE INTO dim:
- WHEN MATCHED AND hash differs: close current row (valid_to = batch_ts, is_current = false), insert new row.
- WHEN NOT MATCHED: insert new row.
- WHEN MATCHED AND op = 'd': close current row, optionally insert tombstone.
- Pattern B: Flink streaming with state:
- Keyed by user_id, store last-seen hash + open-row pointer.
- Emit two records per change: a "close" + an "open" — written via Iceberg sink with MERGE/upsert semantics.
- Idempotency: at-least-once Kafka delivery means dupes possible. The MERGE must be idempotent: deduping by (user_id, source_ts_ms) before MERGE.
- Late events: out-of-order CDC is rare per partition (Debezium preserves order per source row), but can happen across partitions. Reject events with source_ts_ms older than the dim's current row.
What weak candidates miss: dedup before merge; handling deletes; out-of-order tolerance.
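The dedup-before-MERGE step can be sketched in pure Python (a stand-in for the per-batch logic; field names follow the Debezium envelope above):

```python
def dedupe_latest(batch):
    """Collapse an at-least-once CDC micro-batch to one change per key,
    keeping the latest by source timestamp (Debezium's ts_ms)."""
    latest = {}
    for ev in batch:
        cur = latest.get(ev["user_id"])
        if cur is None or ev["ts_ms"] >= cur["ts_ms"]:
            latest[ev["user_id"]] = ev
    return sorted(latest.values(), key=lambda e: e["user_id"])

batch = [
    {"user_id": 1, "ts_ms": 100, "op": "u"},
    {"user_id": 1, "ts_ms": 100, "op": "u"},  # duplicate delivery
    {"user_id": 1, "ts_ms": 120, "op": "u"},  # later change wins
    {"user_id": 2, "ts_ms": 90,  "op": "c"},
]
print(dedupe_latest(batch))  # one row per user_id; user 1 at ts_ms=120
```

In Spark this would be a window over user_id ordered by ts_ms descending, keeping row_number() = 1 — same idea, distributed.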
Design: a multi-region data warehouse
What they're testing: distributed system trade-offs at the storage layer.
Answer skeleton:
- What's the goal: latency for queries near users, DR, regulatory data residency?
- Pattern A: single global warehouse, regional caches — easy if latency is OK; one source of truth.
- Pattern B: per-region warehouses, eventual sync — low latency reads, complex consistency. Each region writes locally; replication daily or streamed via CDC.
- Pattern C: per-region storage of regional data, federated query — complies with data residency; cross-region queries are slow.
- Conflict handling for writes: avoid by partitioning ownership (US users write US warehouse). Last-writer-wins with timestamps. Or use Iceberg with REST catalog and replicate snapshots forward.
- Cross-region S3 replication for backup; understand cost (egress is expensive).
- Catalog: one global catalog (e.g., Glue replicated, or REST catalog with HA) or per-region catalogs synced. The global catalog is simpler to reason about.
What weak candidates miss: data residency/regulatory; cost of cross-region egress; per-region writability vs read-only replicas.
Design: exactly-once for a payments-counting pipeline
What they're testing: depth on exactly-once semantics.
Answer skeleton:
See chapter 03, sections 7 and 14. The five questions:
- Source: replayable with deterministic offsets? Kafka, yes. HTTP webhook, no — need a buffer.
- State: durable across failures? Use Flink's keyed state on RocksDB with checkpoints; or Spark structured streaming with checkpoint location.
- Sink: idempotent or transactional?
- Transactional: Kafka EOS (transactional producer), JDBC with XA.
- Idempotent: writing to a sink keyed by (payment_id, event_id) so dupes upsert harmlessly.
- Effective once at consumer: consumers must read transactionally (isolation.level=read_committed).
- End-to-end: even with EOS, downstream consumers can dedupe on a unique business key as belt-and-suspenders.
Architecture:
Payment events → Kafka (transactional producer) → Flink (checkpoints, EOS sink) → Iceberg
↓
Side output → Audit log
What weak candidates miss: thinking exactly-once means "no dupes ever". It really means "exactly-once effect on the durable state". The right framing is idempotent sink + transactional or replayable source.
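The "idempotent sink" framing can be shown with a toy sketch — an upsert keyed by the business key, so a replayed event changes nothing (class and names are illustrative):

```python
class IdempotentSink:
    """Upsert keyed by a business key: replays overwrite, never double-count."""
    def __init__(self):
        self.rows = {}

    def write(self, payment_id, amount):
        self.rows[payment_id] = amount  # upsert: a duplicate replay is a no-op

    def total(self):
        return sum(self.rows.values())

sink = IdempotentSink()
for pid, amt in [("p1", 10), ("p2", 5), ("p1", 10)]:  # p1 replayed after restart
    sink.write(pid, amt)
print(sink.total())  # → 15, not 25
```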
Internals: How does Spark decide BHJ vs SMJ at runtime?
See chapter 04, sections 4 and 6. Key points:
- Plan time: Catalyst's planner chooses BHJ if (a) user hint, or (b) one side estimated < autoBroadcastJoinThreshold. Otherwise SMJ.
- Runtime (AQE): after the shuffle, AQE re-evaluates. If the post-shuffle map output of one side fits the threshold, AQE converts to BroadcastHashJoin using the shuffled output as broadcast — this is Dynamic Join Selection.
- Skew handling: AQE further detects skewed partitions in SMJ and splits them.
A senior should mention: estimates can be wildly wrong → why AQE matters; broadcast can fail at runtime if the side is bigger than the driver can hold.
Internals: Why is your shuffle slow and what can you do?
See chapter 04, sections 5 and 7.
Reasons in order of likelihood:
- Too many partitions → small files, fetch overhead. AQE coalesces.
- Skew → one task is the long pole. AQE skew handling, salting.
- Disk → ESS disk full or slow. Push-based shuffle (Magnet) helps.
- Network → cross-AZ traffic, slow NICs. Co-locate where possible.
- Wrong join strategy → SMJ when BHJ would have eliminated the shuffle entirely.
Fix in order: enable AQE, increase advisory partition size, enable push-based shuffle, broadcast where possible, remove the shuffle entirely (pre-bucket the table).
Internals: Why doesn't this filter push down?
See chapter 04, section 2.4. Top reasons:
- UDF on the column (Catalyst is conservative).
- Function on the column: WHERE date(ts) = ... — doesn't push.
- Cast type mismatch (string column compared against int).
- Window/aggregate sits between Filter and Scan.
- Cache (.cache()) above the filter.
How to verify: df.explain(mode="formatted") and look for PushedFilters in the FileScan.
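For the `WHERE date(ts) = ...` case, the standard fix is rewriting the predicate as a half-open range on the raw column so the scan can prune. A sketch of the bound math (the helper name is illustrative):

```python
from datetime import date, datetime, timedelta

def day_bounds(d: date):
    """Rewrite `WHERE date(ts) = :d` as a sargable half-open range on ts:
    lo <= ts < hi, which pushes down to the scan where date(ts) = :d cannot."""
    lo = datetime(d.year, d.month, d.day)
    return lo, lo + timedelta(days=1)

lo, hi = day_bounds(date(2026, 4, 20))
ts = datetime(2026, 4, 20, 23, 59, 59)
print(lo <= ts < hi)  # → True: last second of the day is still inside
```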
Internals: Walk me through what happens during a Flink checkpoint
See chapter 03, section 9. Crisp version:
- JobManager triggers a checkpoint and broadcasts a checkpoint barrier to every source.
- Each source emits the barrier into its output stream (between data records).
- Operators receive barriers on each input. With aligned checkpoints, they wait for barriers on all inputs (buffering); with unaligned, they snapshot in-flight buffers immediately.
- When all barriers are received, the operator snapshots its state to RocksDB → uploads to S3 (incremental: only new SSTables since last checkpoint).
- JobManager collects acks; once all operators ack, checkpoint metadata is written and the checkpoint is "complete".
- On failure, all operators restart from the last completed checkpoint and replay from the offsets stored in source state.
Bonus mention: incremental checkpoints (RocksDB SSTable diffs), savepoints (manual + format-stable), exactly-once requires sink commits to be tied to checkpoint completion (two-phase commit).
Internals: Walk me through an Iceberg commit, end-to-end
See chapter 07, sections 2 and 4. Crisp version:
- Writer reads current snapshot (S0) from catalog.
- Writes new Parquet data files to data/.
- Writes new manifest file(s) referring to the data files.
- Writes a new manifest list combining new manifests with existing ones from S0.
- Writes a new metadata.json (v00043.metadata.json) that records the new snapshot S1 with its manifest list, parent = S0.
- Atomic compare-and-swap: ask the catalog to update the table pointer from v00042 to v00043. If the catalog says "still at v00042" → success. If "now at v00043" (someone else won) → conflict.
- On conflict: discard the new metadata.json (the data files are orphaned, but that's fine for now), refresh to the latest snapshot, re-evaluate the changes, retry from step 2 (or just step 5 if the changes still apply).
- Periodically, an orphan-file cleanup removes data files not referenced by any reachable snapshot.
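The commit protocol reduces to a compare-and-swap retry loop; a toy sketch (the `Catalog` class is a stand-in, not the Iceberg API):

```python
class Catalog:
    """Toy catalog: the table pointer only advances via compare-and-swap."""
    def __init__(self, version=42):
        self.version = version

    def cas(self, expected, new):
        if self.version != expected:
            return False  # someone else committed first → conflict
        self.version = new
        return True

def commit(catalog, max_retries=3):
    for _ in range(max_retries):
        base = catalog.version             # step 1: read current snapshot
        # steps 2–4 would write data files, manifests, metadata.json here
        if catalog.cas(base, base + 1):    # step 5: atomic pointer swap
            return base + 1
        # conflict: refresh to the latest snapshot and retry
    raise RuntimeError("commit failed after retries")

print(commit(Catalog(42)))  # → 43
```

The point of the sketch: all the heavy work (data files, manifests) happens outside the critical section; only the pointer swap must be atomic.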
Internals: How does a watermark form across a Kafka topic with 12 partitions?
See chapter 03, section 3. Key points:
- Each Kafka partition has its own watermark (computed from the timestamps of events in that partition with WatermarkStrategy.forBoundedOutOfOrderness(...)).
- The Flink source operator's watermark is the minimum of its partition watermarks.
- Downstream operators receive watermarks from each upstream channel; their effective watermark is the min of all incoming watermarks.
- An idle partition can stall the global watermark forever. Use withIdleness(Duration.ofMinutes(2)) so an idle source partition's watermark is excluded from the min calculation after the timeout.
Bonus: with parallelism > partitions, multiple source subtasks share partitions but the math holds — each subtask emits one watermark; downstream takes min.
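The min-with-idleness math can be sketched directly (a toy model, not Flink's API):

```python
def effective_watermark(partition_watermarks, idle):
    """Downstream watermark = min over active partitions; idle partitions are
    excluded (what withIdleness achieves after its timeout)."""
    active = [wm for p, wm in partition_watermarks.items() if p not in idle]
    return min(active) if active else None

wms = {"p0": 1000, "p1": 950, "p2": 20}  # p2 hasn't seen events in a while
print(effective_watermark(wms, idle=set()))   # → 20 (stalled by p2)
print(effective_watermark(wms, idle={"p2"}))  # → 950 (p2 marked idle)
```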
Internals: Why does my EXACTLY_ONCE Kafka producer still produce duplicates downstream?
What they're testing: layered understanding.
Answer:
- Are downstream consumers reading committed data only? They need isolation.level=read_committed. The default is read_uncommitted → they read aborted transactions too.
- Is the sink idempotent on its own key? EOS on the producer side prevents one Kafka message becoming two. But if the sink's key is (user_id, ts) and your processing produces non-deterministic timestamps on retry, you get two distinct rows downstream.
- Is your transformation deterministic? Replay from a checkpoint must produce identical outputs to the prior run, otherwise the sink sees both versions.
- Are there multiple producers writing the same key? Two upstream pipelines independently writing the same business event to the topic → two messages → two downstream rows. EOS doesn't dedupe across producers.
Modeling: Star schema vs OBT — when each?
See chapter 01, section 11.
- Star schema: governance, multi-consumer, evolving dimensions, conformed dimensions across many facts. Snowflake/BigQuery execution prefers it. Most warehouses.
- OBT (One Big Table): feature engineering for ML, ad-hoc exploration where re-joins would dominate cost, low-cardinality dimensions stable enough to denormalize. Modern columnar warehouses with run-length encoding store denormalized cheaply.
Heuristic: keep dimensions normalized (star) for the warehouse; materialize OBTs as gold-layer marts for specific high-traffic use cases.
Modeling: SCD2 or SCD1 for this dimension?
What they're testing: requirement-driven design.
Answer pattern: ask the question back. SCD2 if any downstream use case needs to know the value at a past point in time (audits, point-in-time joins for ML, retroactive cohort analysis). SCD1 if "current value only" is fine.
Common SCD2 cases: subscription plan, account country, billing address, organization ownership. Common SCD1 cases: name spelling fixes, internal IDs that don't change semantically, fields where history is held in another system.
Cost trade-off: SCD2 multiplies row count by churn rate. For low-churn dims it's free; for high-churn (5% per day) it gets large.
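A quick way to size the churn cost (back-of-envelope sketch; `scd2_rows` is a hypothetical helper approximating one open row per entity plus one closed row per change):

```python
def scd2_rows(n_entities, daily_churn, days):
    """Expected SCD2 row count: one current row per entity, plus roughly
    n_entities * daily_churn * days closed historical rows."""
    return round(n_entities * (1 + daily_churn * days))

print(scd2_rows(1_000_000, 0.0005, 365))  # low churn: ≈ 1.18M rows — cheap
print(scd2_rows(1_000_000, 0.05, 365))    # 5%/day churn: ≈ 19.25M rows
```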
Modeling: Persist this Silver model or rebuild from Bronze?
What they're testing: cost-vs-correctness reasoning.
Persist if:
- Downstream pipelines depend on it daily (recompute cost > storage cost).
- Source data is volatile and you need stable history.
- The transformation is expensive (joins, windowing, dedup at scale).
Rebuild if:
- The transformation is cheap and the source is small.
- You need to evolve the logic frequently (each silver state is "as of last run" — re-running is the only way to get the latest logic).
In practice: persist Silver, but also persist Bronze (raw ingestion, immutable). When logic changes, backfill Silver from Bronze.
Modeling: Partition by user_id or by date?
What they're testing: partition design intuition.
By date if queries filter by date (most analytical queries do). By user_id if queries are point-lookups by user (operational). Both with primary partition by date and clustering / Z-ORDER by user_id — the modern lakehouse default.
Avoid high-cardinality columns as partition: 10M users × 365 days = 3.65B partitions = catastrophic. Partition is for ~100s to ~10000s of values; clustering/sorting/bucketing for higher.
Modeling: Soft deletes in a fact table?
What they're testing: temporal modeling.
Facts are typically immutable — what happened, happened. "Soft delete" of a fact often signals a model issue: the user cancelled something, that's a new fact (the cancellation event), not a deletion of the original.
If you must mark a fact as voided, add is_voided BOOLEAN and voided_at TIMESTAMP. Downstream queries WHERE NOT is_voided. Don't actually DELETE — keep audit trail.
If GDPR-driven hard delete: lakehouse MERGE-ON-READ with deletion vectors, applied on a per-user basis, with retention controls.
Modeling: PM asked for "real-time" — what do you ask back?
What they're testing: requirement gathering.
Ask:
- What latency is acceptable: 1 second, 1 minute, 5 minutes, 1 hour? "Real-time" varies by user.
- What's the use case: dashboard refresh, ML feature, alerting, operational decision?
- How frequently is the data viewed? (Refreshing a dashboard every 30s for a metric viewed once a day is wasteful.)
- What's the cost budget? Sub-second is 10–100× more expensive than 5-minute.
- What's the quality bar? Approximate or exact? Does late data need retraction?
Most "real-time" requests turn out to be "5–15 minutes is fine" once you ask, and that's a vastly different system.
Quality: How do you test a Spark transformation?
See chapter 06, section 10. Three layers:
- Pure-function unit tests on transformations decomposed from the DataFrame logic.
- End-to-end PySpark tests with chispa on small fixture DataFrames.
- Integration tests against a temp Iceberg table on local filesystem.
Plus dbt-style data tests on the actual table in production (uniqueness, nulls, referential integrity).
Quality: How do you backfill safely?
See chapter 02, section 6. Crisp principles:
- Idempotent overwrite per partition: deterministic so reruns are safe.
- Bounded parallelism: don't reprocess 365 days at once on a 10-node cluster.
- Staging table first: write backfilled data to a staging table; validate row counts, key uniqueness, sums; then atomic swap into prod.
- Audit the diff: compare backfill output vs current prod for each partition; require sign-off if delta > threshold.
- Notify downstream: backfill changes data; downstream caches must invalidate.
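The staging-then-swap validation can be sketched as a simple gate (illustrative names and thresholds):

```python
def validate_staging(staging, prod, key, max_delta=0.05):
    """Gate the atomic swap: keys must be unique in staging, and the row-count
    delta vs current prod must be within the threshold.
    staging/prod: lists of row dicts; key: primary-key column name."""
    keys = [r[key] for r in staging]
    checks = {
        "unique_keys": len(keys) == len(set(keys)),
        "row_delta_ok": not prod
            or abs(len(staging) - len(prod)) / len(prod) <= max_delta,
    }
    return all(checks.values()), checks

ok, detail = validate_staging(
    staging=[{"id": 1}, {"id": 2}, {"id": 3}],
    prod=[{"id": 1}, {"id": 2}, {"id": 3}],
    key="id",
)
print(ok, detail)  # → True {'unique_keys': True, 'row_delta_ok': True}
```

In practice the same gate would also compare sums of key measures per partition; row counts alone miss value-level regressions.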
Quality: How do you design a data contract?
See chapter 01, section 13. Components:
- Schema (columns, types, nullability, constraints).
- Semantics (what each column means, units, valid range).
- Freshness SLA (data is < N hours old).
- Volume range (rows/day expected; alert outside ±X%).
- Owner & on-call.
- Change process (versioned; consumers notified before breaking changes).
- Tests (run on each ingestion; halt downstream if broken).
Stored as YAML in the producer's repo; checked in CI; surfaced in the data catalog.
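A minimal sketch of enforcing one clause of such a contract (the contract shape and names are illustrative, mirroring the components listed above):

```python
contract = {  # illustrative contract, mirroring the YAML shape described above
    "table": "silver.users",
    "columns": {
        "user_id": {"type": "bigint", "nullable": False},
        "country": {"type": "string", "nullable": True},
    },
    "freshness_hours": 6,
    "rows_per_day": {"min": 900_000, "max": 1_100_000},
    "owner": "identity-team",
}

def check_volume(contract, rows_today):
    """Volume-range clause: alert (and halt downstream) outside the band."""
    lo, hi = contract["rows_per_day"]["min"], contract["rows_per_day"]["max"]
    return lo <= rows_today <= hi

print(check_volume(contract, 950_000))  # → True: normal day
print(check_volume(contract, 400_000))  # → False: halt downstream, page owner
```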
Quality: How do you measure pipeline quality?
Six dimensions:
- Freshness: time since last successful run. SLO measurable.
- Completeness: row count vs expected; missing partitions.
- Accuracy: spot-checks against a system of record; statistical anomaly detection.
- Consistency: same metric computed two ways agrees within tolerance.
- Validity: passes schema and referential integrity tests.
- Uniqueness: primary keys are unique; no duplicate facts.
Surface as an observability layer (Monte Carlo, Datafold, custom). Alert on regressions.
Quality: Schema evolution without breaking consumers?
Strategies:
- Add only: new columns are non-breaking; consumers without them ignore.
- Deprecate, don't drop: keep old columns with NULL or last-known value during deprecation window; remove only after consumers migrated.
- Renames via copy: add new column populated from old; deprecate old.
- Type changes: only widening (int → bigint, decimal precision up); other changes need explicit consumer support.
- Versioned tables / views: silver.fact_playback_v2 as a new table, redirect consumers gradually.
For Kafka topics: Schema Registry compatibility checks (BACKWARD, FORWARD, FULL); reject incompatible producer changes.
Quality: CI/CD for a data warehouse?
Pattern:
- Source control (git) for SQL models, transformations, DAG definitions, schemas.
- PR pipeline:
- Lint (sqlfluff, ruff).
- Type check (mypy on Python; dbt parse on SQL).
- Unit tests.
- dbt build on a sample / dev database.
- dbt test on dev.
- Deploy to staging environment on merge to main.
- Smoke tests on staging.
- Promote to prod via versioned release.
- Observability post-deploy: row count check, freshness, anomaly detection.
For Spark / Flink jobs: containerize, deploy via terraform/k8s/argo; canary via shadow runs comparing outputs against current prod.
Trade-offs: Latency vs cost vs correctness — pick two
What they're testing: judgment.
The triangle:
- Latency + correctness: real-time + exact = expensive (high replication, transactional sinks, big clusters).
- Latency + cost: real-time + cheap = approximate (sketches, sampling, eventual consistency).
- Cost + correctness: cheap + exact = batch with reconciliation, T+24h.
You pick based on use case. Operational alerting → latency + correctness. Marketing dashboard → cost + correctness. Personalization features → latency + cost (ish).
Trade-offs: Row-store for analytics?
What they're testing: knowing the rule and its exceptions.
Rule: columnar for analytics. Exceptions:
- Point-lookup heavy workloads (single-row, single-column reads).
- Operational analytics on small tables (< 1M rows) where columnar overhead doesn't pay off.
- HTAP (hybrid transactional/analytical) systems where the same row is updated AND analyzed (Singlestore, Postgres with extensions).
For 99% of pure analytics: columnar.
Trade-offs: When NOT to use Iceberg?
- Single-engine, single-region, no concurrent writers, no schema evolution → plain Parquet on S3 + Hive Metastore is fine.
- Tables < 1 GB, few writes → not worth metadata overhead.
- Streaming-heavy with sub-minute commits and no compaction strategy → Iceberg metadata explodes.
- Compute fully on a managed warehouse (Snowflake native tables) where you don't need open format.
For most other cases: use Iceberg.
Trade-offs: Lambda architecture in 2026?
What they're testing: knowing why Kappa won mostly, but Lambda still has corners.
Kappa (single streaming pipeline) is the default now: simpler, less code, single source of truth. Use Kappa when:
- Streaming engine handles batch via the same code (Flink, Spark Streaming).
- You're OK with "real-time + late corrections via reprocessing".
Lambda (separate batch + streaming) still justified when:
- Batch is the audit-grade source; streaming is a real-time approximation explicitly labeled.
- Different teams own batch vs streaming with different SLAs.
- The streaming engine can't express the batch-quality computation (rare today).
Trade-offs: Polars or Spark — when each?
See chapter 06, section 4.
- Polars: fits in memory of one machine (incl. its streaming engine for spill), local notebook, ad-hoc work, ETL where Spark cluster startup is overkill.
- Spark: > one machine of data, distributed shuffle/join, integration with Iceberg/Delta at scale, multi-tenant clusters.
Heuristic: if your data fits on a r6i.16xlarge (512 GB RAM), Polars wins on speed and cost. Above that, Spark.
Trade-offs: Build vs buy?
| Layer | Build | Buy |
|---|---|---|
| Orchestration | Airflow (open) | Astronomer, Prefect Cloud, Dagster Cloud |
| Lineage | OpenLineage + custom | Datafold, Atlan, Alation |
| Quality | Great Expectations / Soda + custom alerts | Monte Carlo, Bigeye, Anomalo |
| Catalog | DataHub (open) | Atlan, Alation, Collibra |
| Lakehouse compute | EMR / OSS Spark | Databricks, Snowflake |
| Streaming | Flink on K8s | Confluent Cloud, Decodable, Aiven |
| BI | Superset (open), Lightdash | Looker, Tableau, Power BI, Mode |
Heuristic: buy what's commodity (BI, catalog, lineage), build what's differentiated (custom quality rules, domain transformations). Don't build orchestration.
Behavioural: A time you said no to a stakeholder
What they're testing: judgment, communication, ownership.
Structure (STAR):
- Situation: PM wanted real-time engagement metrics in 1 minute latency for a feature launch.
- Task: I was the DE owner; engineering had to deliver within 4 weeks.
- Action: I quantified the cost (3× current pipeline budget) and complexity (Flink job, new infra, on-call burden), and presented an alternative — 5-minute latency at 1.2× cost. Worked with the PM to confirm that 5-minute was actually acceptable for the use case.
- Result: Shipped at 5-minute latency in 2 weeks, freed the team for higher-priority work, set a precedent for trading latency for cost transparently.
Avoid: cynical "no", or "yes" without surfacing trade-offs.
Behavioural: A pipeline you owned caused a bad metric
What they're testing: ownership, blamelessness, root-cause discipline.
Structure:
- The bug (be specific and technical).
- How it was found (you, monitoring, downstream).
- Immediate response (timeline of mitigation, communication to stakeholders).
- Root cause analysis (5-whys; structural, not personal).
- The fix (immediate + systemic).
- Prevention (test, monitor, contract).
Don't blame teammates, vendors, or upstream. Own it.
Behavioural: How do you decide what NOT to build?
- Will it deliver measurable user value?
- Is the team's time better spent elsewhere?
- Can we buy/use OSS instead?
- Is the demand stable or speculative?
- What's the maintenance cost over 3 years?
Decide explicitly. Document the decision and the alternatives. Revisit when context changes.
Behavioural: How do you onboard the next engineer?
- Day 1: their dev env runs the smallest end-to-end pipeline.
- Week 1: they own a small change behind a flag, reviewed thoroughly.
- Month 1: they own a non-critical pipeline end-to-end, including on-call.
- Month 3: they're contributing to architecture discussions.
Pair on incidents. Have them write/update the runbook. Make them speak in design reviews early.
Closing meta-advice
For the actual interview:
- Slow down before answering. Restate the question to confirm you got it right.
- Ask 1–3 clarifying questions before diving in. Don't interrogate, but don't assume.
- Talk through trade-offs, not just "the right answer". L5 is judgment.
- Use real numbers (data volume, latency, cost) — concrete is more impressive than abstract.
- Go deep when asked, broad when asked. Read the cue.
- It's OK to say "I don't know" — and immediately follow with "but here's how I'd find out".
- End with a sentence about what you'd do next — what you'd validate, monitor, document.
That's how senior engineers think on their feet. Practice it out loud. Good luck.
Observability & Data Quality Scenarios
Design a data-quality monitoring system
Prompt: our data team has 400 tables across bronze/silver/gold. Design a monitoring system that catches data quality issues before stakeholders do.
Strong answer structure:
- Four dimensions per table: freshness (last successful write), volume (rows in last load vs expected), schema (column count, type drift), values (NULL rate, uniqueness on business key, range sanity on numeric columns).
- Thresholds are learned, not hand-set. Compute a 30-day rolling baseline per metric per table; alert on deviations beyond 3 sigma or engineered thresholds. Static thresholds create noise.
- Alert routing follows ownership. Each table has a designated owner; alerts go to that owner's channel, not a global firehose. A global firehose gets muted within a week.
- Severity tiers. Tier 1 (pager): gold tables backing exec dashboards, revenue reports. Tier 2 (on-call): silver tables with downstream consumers. Tier 3 (daily digest): bronze + low-traffic tables.
- Prevention, not just detection. Data contracts at the bronze→silver boundary with enforced schemas. Column-level lineage so a schema change in bronze surfaces to impacted silver/gold owners via PR comment or Slack bot.
Senior signal: name the tension. Strict monitoring creates alert fatigue and slows deploys. Lax monitoring lets issues reach stakeholders. The cheap middle path is: strict on the top 20 business-critical tables, advisory on the rest. Revisit the ratio quarterly.
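The learned-threshold idea reduces to a rolling baseline plus a sigma test; a hedged sketch (function names and the 3-sigma default are illustrative):

```python
from statistics import mean, stdev

def anomalous(history, today, k=3.0):
    """Alert if today's metric deviates more than k sigma from the
    rolling baseline (e.g. the last 30 daily row counts)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > k * sigma

baseline = [100_000, 102_000, 98_000, 101_000, 99_000, 100_500, 99_500]
print(anomalous(baseline, 100_800))  # normal day → False
print(anomalous(baseline, 40_000))   # half the rows missing → True
```

Real systems layer seasonality on top (weekday vs weekend baselines); the sigma test is the core.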
How do you design freshness SLAs?
Freshness SLA ≠ pipeline SLA. Freshness is end-to-end: event emission → table queryable. Pipeline SLA is internal: job start → job success. Conflating them misses the retry tail, the watermark, and the downstream materialization lag.
A proper freshness SLA has four parameters: target (95% of data available within 2 hours of event time), measurement window (rolling 7 days), error budget (5% = 8.4 hours/week can miss target), breach consequence (page on-call when error budget hits 50%). Without all four, an SLA is a wish.
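The error-budget arithmetic is worth having cold; a sketch of the numbers quoted above:

```python
def error_budget_hours(target, window_hours=7 * 24):
    """Hours per measurement window allowed to miss the freshness target."""
    return (1 - target) * window_hours

budget = error_budget_hours(0.95)  # 5% of a 168-hour week
print(round(budget, 1))            # → 8.4 hours/week can miss target
print(round(budget * 0.5, 1))      # → 4.2 hours burned triggers the page
```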
A dashboard shows zero for a KPI that should be 10K
Investigation sequence: (1) is the source pipeline green? (2) is the underlying table populated? (3) is the query filter correctly matching timezone, date bucket, and filter predicates? (4) has the downstream aggregation job run? Most often the cause is at (3) — an upstream deploy changed a column's casing, enum value, or timezone, and the query now matches zero rows silently. The fix at the system level is: schema tests for enum values and a row-count floor check on every aggregation output.
Cloud-Specific Scenarios
Your Snowflake bill tripled last month. What do you check first?
- Query history by warehouse. Which warehouse's credit consumption grew? Sort queries by credit spent; find the top 10.
- Warehouse utilization. Is a warehouse set too large for its workload? Oversized warehouses consume credits on idle time if AUTO_SUSPEND is too long.
- Storage growth. Did a table grow 5x because of a bad MERGE that didn't vacuum? Run TABLE_STORAGE_METRICS.
- Serverless features. Snowpipe, Automatic Clustering, Materialized Views — each has its own credit line that doesn't appear in warehouse consumption. Check AUTOMATIC_CLUSTERING_HISTORY.
- New users or external sharing. A new BI tool integration pulling hourly extracts can double compute silently.
Senior answer: after finding the driver, propose a tag-based cost allocation so the next month's surprise lives inside a named team's budget. "Free compute" disguises real cost.
Your BigQuery slots are exhausted during business hours. What's the playbook?
Three levers. (1) Reservations + assignments: carve slots across workloads so production-critical jobs are insulated from ad-hoc analyst queries. (2) Query tuning: convert full-table scans on wide tables to partitioned/clustered scans; the top three wide-table scans usually account for most of the slot time. (3) Scheduling: move nightly batches out of the 9am–5pm window; use flex slots with off-peak pricing. The mistake is to buy more slots before tuning the top 20 offenders — expensive and doesn't fix the underlying sprawl.
S3 list-objects is slow on a 10 M object prefix
The prefix has become a scan hotspot because listing is O(objects under prefix). Three mitigations: (a) partition by date as a prefix structure (/2026/04/20/), so listing only touches the day; (b) use Iceberg or Delta — they replace list-based discovery with metadata files and scan O(snapshot size), not O(files); (c) S3 Inventory exports a daily manifest of all objects; use it for bulk reconciliation instead of live listing.
ML & Feature Store Scenarios
Design a feature store with online/offline parity
The parity requirement: every feature value used to train a model must be identically computable at serving time. Two architecture patterns.
Pattern A — dual-write from a single computation. A feature pipeline writes to both the offline store (Iceberg / BigQuery) and the online store (Redis / DynamoDB) from the same code path. Parity by construction. Cost: the feature pipeline must run in both batch and stream modes correctly.
Pattern B — online store derived from offline. Compute features in batch into the offline store; periodically load into online. Simpler; online store lags offline by the refresh interval. Acceptable when model serving tolerates stale-by-N-minutes features.
Anti-pattern: two separate pipelines (one batch for training, one streaming for serving) computing "the same" feature with different code. This produces silent drift and model degradation that is nearly impossible to debug.
Your model's offline accuracy is 92% but online performance is 84%. What happened?
Systematic checklist: (1) Feature drift: distribution of input features at serving differs from training. Measure with a statistical test on production vs. training features per day. (2) Training-serving skew: features computed differently in the two environments. Test with a canary — run the same row through both pipelines, compare bit-for-bit. (3) Label leakage in training: the training label was accidentally derivable from a feature that isn't available at serving time. Audit feature timestamps against label timestamps. (4) Target distribution shift: the thing you're predicting has changed since training. Retrain. (5) Infrastructure bugs: online feature lookup returning stale or default values for a subset of users.
The order matters. Start with (5) — fastest to rule in/out, and a surprising fraction of "model issues" are actually serving infra bugs.
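For step (1), one concrete choice of statistical test is the Population Stability Index over binned feature proportions — a pure-Python sketch (PSI is one option among several; the thresholds are a common rule of thumb, not a standard):

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.
    Rule of thumb: < 0.1 stable, 0.1–0.25 moderate drift, > 0.25 significant."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

train = [0.25, 0.25, 0.25, 0.25]      # training-time feature distribution
serve_ok = [0.24, 0.26, 0.25, 0.25]   # serving looks similar
serve_bad = [0.05, 0.10, 0.25, 0.60]  # heavy drift
print(round(psi(train, serve_ok), 4))   # well under 0.1 → stable
print(round(psi(train, serve_bad), 4))  # well over 0.25 → investigate
```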
How would you retrain a production model without breaking serving?
Four stages. (a) Shadow training: new model trains in parallel, scores the same production traffic as the current champion, logs go to an evaluation table. (b) Offline comparison: compute business metrics from the logged shadow decisions vs. champion decisions. (c) Canary rollout: 1% of traffic to the new model; watch key metrics for 24–48 hours. (d) Ramp: 10%, 50%, 100%, with automated rollback if metrics regress. Each stage must have a rollback trigger and a named decision-maker. The biggest failure mode is a candidate who describes this at a high level but can't name the metrics or thresholds that govern each stage gate.
The Prep Program
The previous parts taught you the territory. This one teaches you how to move through it under fire. Interviews are not knowledge tests — they are a performance discipline, and this is the practice plan for that performance.
1. The Four-Week Roadmap
Four weeks is the minimum a working engineer needs to credibly prepare for a senior loop at a top-tier company. Less than that and either the loop itself is easy or you are relying on accumulated muscle memory. The roadmap below assumes ~10 focused hours per week (two weekday evenings + one weekend block). Adjust duration, not phasing.
The single most common preparation mistake is studying more topics instead of fewer topics deeper, with reps. You cannot talk your way through system design if you've never said the words out loud.
Week 1 — Foundations: SQL and Modeling
Goal at end of week: you can write window-function-heavy SQL under time pressure without pausing to look up syntax, and you can sketch a star schema for any business domain in under 10 minutes on a whiteboard.
Day 1 — SQL joins and set operations
- Re-derive when LEFT JOIN and NOT EXISTS give different answers (NULL semantics, duplicate rows). Write both for the same problem.
- Practice: five problems at medium difficulty from a public SQL practice set. Time yourself to 12 minutes each.
- End-of-day drill: explain out loud why COUNT(*) vs COUNT(col) can differ and under what exact conditions.
Day 2 — Window functions
- Rebuild the mental model: frame clauses, partitioning, ordering, RANGE vs ROWS. Know the default frame for each window function by heart.
- Practice: month-over-month revenue with sparse months, top-N per group with ties, running totals with resets, gaps and islands for sessionization.
- End-of-day drill: write percentile_cont by hand using NTILE and interpolation.
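For the percentile drill, a plain-Python reference implementation of the linear-interpolation math behind PERCENTILE_CONT is useful for checking your SQL against (the helper name is illustrative):

```python
def percentile_cont(values, p):
    """Linear-interpolation percentile, the math behind SQL PERCENTILE_CONT."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)  # fractional rank into the sorted list
    lo = int(pos)
    frac = pos - lo
    if lo + 1 >= len(xs):
        return float(xs[lo])
    # interpolate between the two neighboring sorted values
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])

print(percentile_cont([10, 20, 30, 40], 0.5))   # → 25.0 (median of even set)
print(percentile_cont([10, 20, 30, 40], 0.25))  # → 17.5
```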
Day 3 — Advanced SQL patterns
- Cohort retention (signup-week × activity-week matrix). Sessionization with timeout. Last-touch attribution. Point-in-time joins (as-of).
- Practice the canonical cohort retention query end-to-end, starting from raw events.
- End-of-day drill: explain the difference between point-in-time correctness and snapshot correctness in 90 seconds.
Day 4 — Dimensional modeling
- Draw a star schema for three domains from memory: subscription streaming service, two-sided marketplace, advertising network. For each, state the grain of each fact.
- Write DDL and SCD Type-2 MERGE for one customer dimension. Include effective-from / effective-to / current-flag columns.
- End-of-day drill: state in one sentence what conformed dimensions are and why they matter.
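Before writing the MERGE in SQL, it helps to internalise the Type-2 state machine itself. A minimal Python-over-sqlite3 sketch (table and column names here are illustrative, not a prescribed layout): on a changed attribute, expire the current version and open a new one.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer(
    customer_id INTEGER, name TEXT,
    effective_from TEXT, effective_to TEXT, is_current INTEGER)""")

HIGH_DATE = "9999-12-31"   # conventional open-ended effective_to sentinel

def scd2_upsert(con, customer_id, name, as_of):
    """Close the current version on change and open a new one (SCD Type-2)."""
    row = con.execute(
        "SELECT name FROM dim_customer WHERE customer_id = ? AND is_current = 1",
        (customer_id,)).fetchone()
    if row is not None and row[0] == name:
        return  # no attribute change: no new version
    if row is not None:
        # expire the old version as of the change date
        con.execute(
            "UPDATE dim_customer SET effective_to = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1", (as_of, customer_id))
    con.execute("INSERT INTO dim_customer VALUES (?, ?, ?, ?, 1)",
                (customer_id, name, as_of, HIGH_DATE))

scd2_upsert(con, 42, "Acme", "2024-01-01")
scd2_upsert(con, 42, "Acme Corp", "2024-06-01")   # rename opens version 2
history = con.execute(
    "SELECT name, effective_from, effective_to, is_current "
    "FROM dim_customer ORDER BY effective_from").fetchall()
# [('Acme', '2024-01-01', '2024-06-01', 0),
#  ('Acme Corp', '2024-06-01', '9999-12-31', 1)]
```

The lakehouse MERGE version expresses exactly this: one WHEN MATCHED branch to expire, one insert path to open the new version.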
Day 5 — Contracts and NULL semantics
- Re-read Part 01 §12–13. Write a schema contract in the format your target company uses (YAML or JSON Schema).
- Draft a "contract violation" runbook: what the producer does, what the consumer does, where it breaks the pipeline.
- End-of-day drill: pick one table you know well and enumerate every column that could be NULL and what NULL actually means for each (unknown, not-applicable, pending, sentinel).
Weekend block — SQL mock round
- Two hours. Six problems in a notebook. No internet, no autocomplete. Time each to 15 minutes max. Record a screencast of yourself speaking through problem 4 or 5.
- Review the tape: where did you pause? Where did you talk in circles? Where did you jump to a query before stating assumptions? Those are the gaps.
Week 2 — Processing Engines and Distributed Compute
Goal at end of week: you can explain how a Spark job is physically executed from df.join().groupBy().agg().write() down to tasks and files, and you can talk about streaming guarantees without handwaving "exactly once."
Day 6 — Spark from the top down
- Catalyst rule passes. Logical → optimized → physical → RDD. AQE and what it rewrites at runtime.
- Drill: take a 5-line PySpark script and narrate every stage boundary, every shuffle, every file the executor will write.
Day 7 — Shuffle, joins, and skew
- Broadcast vs sort-merge vs shuffle-hash. The exact threshold math. When AQE converts SMJ to BHJ mid-job.
- Skew detection and the salt trick. What "skew join" in AQE actually does under the hood.
- Drill: you have a 10 TB fact joined to a 50 GB dim and jobs time out. Walk through your debugging sequence — what do you look at first and why?
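The threshold math reduces to a small decision rule. This is a deliberately simplified model of the Spark planner's equi-join choice (the 10 MB default mirrors spark.sql.autoBroadcastJoinThreshold; real planning also weighs statistics quality, hints, and AQE's runtime re-optimisation):

```python
def pick_join_strategy(small_side_bytes,
                       broadcast_threshold=10 * 1024 * 1024,
                       prefer_sort_merge=True):
    """Simplified sketch of Spark's equi-join strategy choice."""
    if small_side_bytes <= broadcast_threshold:
        return "broadcast-hash"   # ship the small side to every executor
    if prefer_sort_merge:
        return "sort-merge"       # default when both sides are large
    return "shuffle-hash"         # spark.sql.join.preferSortMergeJoin=false

print(pick_join_strategy(5 * 1024 * 1024))    # broadcast-hash
print(pick_join_strategy(50 * 1024**3))       # sort-merge
```

In the drill above, the 50 GB dim is far over any sane broadcast threshold, which is exactly why "just broadcast it" times out and the conversation moves to skew and bucketing.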
Day 8 — Batch patterns and idempotency
- MERGE semantics on lakehouse tables. The "idempotent over the same interval" proof. Backfill strategies.
- Drill: describe, in order, what happens when you re-run yesterday's job at noon today. Which rows are touched, which partitions are rewritten, how downstream caches invalidate.
Day 9 — Streaming fundamentals
- Event-time vs processing-time. Watermarks. Windows. Late data. Trigger semantics.
- Exactly-once in Kafka + Flink: the two-phase commit sink and what can still go wrong.
- Drill: explain the difference between "no duplicates in output" and "exactly-once processing" in one minute.
Day 10 — Streaming patterns
- Stream-stream joins with watermarks. Stream-table joins. Temporal joins. Enrichment patterns.
- Backpressure, state backends, checkpointing, recovery time objectives.
- Drill: the job restarts and reads 30 minutes of state — what guarantees does the consumer downstream still have?
Weekend block — Engine mock round
- Pick one prompt from Part 08 that involves Spark or streaming. Stand at a whiteboard or a blank page. Talk for 45 minutes out loud. Record it.
- Re-watch. Count filler words. Count how many times you said "it depends" without stating the axis.
Week 3 — System Design, Lakehouse, and Tradeoffs
Goal at end of week: you can scope, decompose, and defend a 45-minute data system design for any prompt, including volumes, latency, failure modes, and cost.
Day 11 — The system-design frame
- The five-step opener: restate, scope, volumes and SLAs, decompose, defend one pick.
- Draft a "volumes cheat sheet" for every scale you might claim — 1M/10M/100M/1B events per day, what each means for Kafka partitions, Spark cluster size, storage cost per month.
Day 12 — Lakehouse table formats in depth
- Iceberg vs Delta vs Hudi: snapshot layout, metadata tree, MERGE internals, compaction behavior, schema evolution rules.
- Catalog architectures: Hive, Glue, REST, Unity, Polaris. Which ones allow cross-engine writes and which don't.
- Drill: why does Iceberg's hidden partitioning avoid the "partition column drift" problem?
Day 13 — System design prompt 1
- "Design a pipeline that ingests 10 M events/day from a mobile app and serves both a near-real-time operational dashboard and a nightly revenue report."
- Solve on paper in 45 minutes. Use the transcript in §5 below as the grading rubric.
Day 14 — System design prompt 2
- "Migrate a 50 TB partitioned Hive table to Iceberg with zero downtime for consumers."
- Solve on paper in 45 minutes. Focus on the migration sequence, consumer cutover, and rollback.
Day 15 — Observability and data quality
- Data-SLA vs system-SLA. Freshness, completeness, validity, uniqueness. Where to emit metrics. When to fail the pipeline vs quarantine vs warn.
- Drill: design the alerting strategy for a pipeline where the input can legitimately drop to zero on some days (e.g., marketing campaigns). How do you avoid noise?
Weekend block — System-design mock round
- Pair with a peer if possible. They ask the prompt, you whiteboard for 45 minutes, they probe. No book, no laptop.
- If solo: pick a prompt, run a 45-minute self-tape, re-watch, grade yourself on the rubric in §8.
Week 4 — Mocks, Behavioral, and Polish
Goal at end of week: three full mock loops completed, eight STAR frames rehearsed, and a one-page "signal deck" of the projects you'll reference.
Day 16 — Behavioral foundations
- The eight STAR frames in §6. Write one story per frame, not more. Aim for 3–4 minutes spoken each.
- Record yourself telling two of them. Watch for mealy-mouthed passive voice. Rewrite.
Day 17 — Mock loop #1 (full)
- 45 min coding/SQL → 45 min Spark/streaming deep dive → 45 min system design → 30 min behavioral. Back to back. Lunch break OK.
- Post-mortem the gaps that same evening. Don't wait a day — decay is fast.
Day 18 — Weakness patch
- Whatever failed in Mock #1, spend today on. Don't study anything new.
Day 19 — Mock loop #2 (full)
- Same structure. Ideally different interviewer if you have one.
Day 20 — Resume drill-down prep
- For every project on your resume, prepare: scale (numbers), your contribution (verbs), the tradeoff you made, the failure mode you avoided, the thing you'd do differently now.
- Four bullets max per project. Practice delivering each in under 90 seconds.
Weekend block — Mock loop #3 + final polish
- Third and last full mock. Fix the one or two tics that have survived this long.
- Day before the real loop: no new study. Light review. Sleep.
Anti-Goals — What Not to Do
- Do not grind leetcode for a DE senior loop. You will be asked at most one lightly algorithmic problem in SQL or Python; topic breadth matters more than graph-traversal fluency.
- Do not read five blog posts per topic. Pick the one canonical source (a section of this guide, a chapter of a book, a primary doc page) and read it three times.
- Do not try to memorize command flags. Memorize the decisions the flags control.
- Do not skip the out-loud practice. Reading a solution and speaking a solution are different skills; the interview only tests the second.
2. How Candidates Lose Offers — The Failure Catalog
This section is uncomfortable on purpose. These are the specific failure modes that show up in post-debrief scorecards, not the abstract "communication issues" that everybody writes about. Most of them are behavioural, not technical — which is precisely why they're survivable if named.
A. Pre-flight and resume failures
A1 — The "resume says senior, answers say mid"
You claim on your resume that you "owned the streaming platform" but when the interviewer asks what Kafka version you ran, what your broker count was, what the retention policy was, and what happened the last time you ran out of disk — you freeze. Scorecard verdict: "Scope inflation, no depth behind the title."
Fix: for every line on the resume, pre-build a three-layer drill-down. Top line, two supporting paragraphs of numbers and verbs, one anecdote of a production incident you debugged.
A2 — The unexplained gap between title and output
Your last role was "Staff Data Engineer" but your project list reads like three-month feature work. Interviewers silently score you down for weak scope signal. You never recover.
Fix: lead with scope numbers — number of downstream consumers, SLAs owned, size of the footprint, org reach. If the numbers are modest, claim the title you can defend, not the title on the offer letter.
B. Coding and SQL round failures
B1 — Query-before-clarify
The prompt is "Find users whose average order value grew MoM." You start typing immediately. You don't ask whether "user" means account or visitor, whether "month" is calendar or rolling, whether deleted users count, how to handle months with zero orders. The interviewer silently writes down "jumped to implementation" and moves the bar up.
Fix: 90 seconds of clarification is never too much. Ask at least three questions before your cursor touches the query. Restate the problem back in your own words first.
B2 — The window-function shortcut
You know that ROW_NUMBER() OVER (PARTITION BY ...) is the trick for "top per group." You reach for it without thinking through whether ties matter, whether NULLs should be included, whether you want dense-rank instead. Your answer is close but subtly wrong on edge cases. Scorecard: "Knows the tools, hasn't internalized the semantics."
Fix: for any window function, say out loud: "the frame is X, ties resolve Y, NULLs sort Z." Every time. Even when obvious.
B3 — Silent scratch-paper math
You stop talking. You do mental math for 45 seconds. The interviewer has no signal on how you think. By the time you speak again they've already decided.
Fix: narrate. "I'm going to try ROW_NUMBER here — wait, ties matter for revenue buckets so RANK would over-count. Let me check…" Even wrong narration beats silent correctness in the first 20 minutes of the round.
B4 — The one-liner trap
You compress a three-CTE problem into a single deeply nested query because "it's more elegant." The interviewer cannot read it in real time and neither can you. You can't debug it when they ask a follow-up. Verdict: "Writes for themselves, not for review."
Fix: default to named CTEs with short, descriptive names. Elegance is for Twitter, not interview rounds.
C. Engine / systems deep-dive failures
C1 — "AQE handles it"
Asked how Spark handles skew, you say "AQE handles it automatically." Asked to go deeper — how does AQE detect skew, what config controls the threshold, what's the trade-off vs broadcast — you have nothing. Verdict: "Knows the keyword, not the mechanism."
Fix: for every feature you name, know the three layers beneath it. Configuration, algorithm, trade-off. If you can't describe all three, don't volunteer the feature.
C2 — Exactly-once handwaving
"We used Kafka exactly-once so there are no duplicates." The interviewer probes: what about the sink? What about application-level retries? What happens during a rebalance mid-commit? You handwave. Verdict: "Treats EOS as a buzzword."
Fix: memorize the two-phase commit sink diagram and the exact failure modes it does and does not cover. Also memorize the phrase "exactly-once effectively" and when it differs from "exactly-once processing."
C3 — Partitioning by hope
Your system design partitions a fact table by "user_id" because partitioning is good. You don't state the cardinality, don't state the access pattern, don't state the partition file count at six-month retention. The interviewer asks "how many files is that" and you guess wildly.
Fix: before you name a partition key, compute the file count out loud. Cardinality × retention / compaction target = files. If the answer is 10 million, change the key.
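One way to make that arithmetic concrete out loud (a toy calculation; the 128 MB compaction target and per-partition data volume are assumptions):

```python
import math

def partition_file_count(key_cardinality, retention_periods,
                         gb_per_partition=0.1, target_file_mb=128):
    """Rough file count for a scheme: partition count x files per partition."""
    partitions = key_cardinality * retention_periods
    files_per_partition = max(1, math.ceil(gb_per_partition * 1024 / target_file_mb))
    return partitions * files_per_partition

# Partitioning by raw user_id: 10 M users x 180 daily periods is hopeless.
print(partition_file_count(10_000_000, 180))   # 1800000000
# Partitioning by day + 64 user-id buckets stays manageable.
print(partition_file_count(64, 180))           # 11520
```

If the first number comes out of your mouth before the interviewer asks "how many files is that," you have flipped the failure mode into a signal.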
D. System design failures
D1 — Scope sprawl
The prompt is "design a pipeline for 10 M events/day." You spend 15 minutes on Kafka producer ack semantics and never get to storage, serving, or cost. Time runs out. Verdict: "Deep in one corner, blank elsewhere."
Fix: budget your 45 minutes. Roughly: 5 min scope, 5 min volumes/SLAs, 15 min happy path, 10 min failure modes, 5 min cost, 5 min trade-offs. Put it on your scratch paper before you start.
D2 — Solutioning before scoping
Interviewer: "Design a pipeline for 10 M events/day." You: "I'd use Kafka, then Spark Streaming, then Iceberg, then Looker." One minute in. You never asked what the events are, what the SLA is, who the consumer is, what the business question is.
Fix: five scoping questions before any box appears on the board. "What's the event schema? What are the downstream consumers? What's the freshness SLA? What's the volume distribution — steady or bursty? What's the retention and access pattern?"
D3 — The "golden path" trap
Your design works perfectly on the happy path. You never mention: what happens when Kafka is down, what happens when the transform fails on one record, what happens when the sink table is being vacuumed, what happens when a consumer is 6 hours behind. Verdict: "Designs for the demo."
Fix: reserve the second half of the round for failure modes. Treat "what happens when X breaks" as first-class content, not a footnote.
D4 — No cost awareness
Your design streams everything with millisecond latency when the business question is a daily dashboard. You triple-replicate data across warehouses "for safety." You never state the monthly cloud bill. Verdict: "Optimizes for the whiteboard, not the business."
Fix: always close the design with a one-line cost estimate — storage per TB, compute per hour, egress if relevant. Even order-of-magnitude is fine. Silence is not.
E. Behavioral failures
E1 — Hero mode
Every story is "I single-handedly…". The interviewer, who works on teams, silently down-scores collaboration. By round four, the hiring committee has "limited team signal" in bold.
Fix: 60/40. In every story, spend 60% on your specific actions and 40% on how you enabled others. Name them.
E2 — STAR drift
You start with Situation → Task, but by minute three you've slid into "and then generally we do X at our company." You lost the interviewer 90 seconds ago. They're waiting for the "R" — what happened, what did you learn.
Fix: write every story on a 4×1 index card. Situation (2 sentences). Task (1 sentence). Action (4 sentences). Result (2 sentences with numbers). Under 4 minutes spoken, always.
E3 — "I would have…"
Asked for a time you failed, you describe what you would have done differently. You never described the actual failure. Verdict: "Can't own an outcome."
Fix: the failure story starts with what actually broke, in one blunt sentence. "We shipped a backfill that doubled the revenue numbers on the executive dashboard for six hours." Then the rest.
E4 — Over-prepared monotone
You've rehearsed the story so many times it sounds canned. The interviewer senses it. Authenticity tanks. Verdict: "Felt like a script."
Fix: rehearse the structure, not the words. Pick three different phrasings of the opening line. Use whichever fits the conversational flow that day.
F. Closing-round and hiring-committee failures
F1 — The "why this company" gap
You've answered everything technically well. The bar raiser asks "why us specifically?" and you recite a line from the website. Verdict: "Strong IC, weak mission fit." That's enough for a committee to pass.
Fix: have three specific reasons ready, at least one of which references something from an engineering blog, an open-source commit, or a recent talk by someone there. Show you actually looked.
F2 — Underselling in the wrap
Asked "is there anything we should know that we haven't asked?" you say "no, I think we covered it." You just gave up your last 90 seconds of scoring time.
Fix: always have one prepared "closing signal" — a specific project they didn't ask about, or a failure recovery story that shows judgment. Rehearsed, short, positive.
F3 — Weak reverse questions
"Do you have questions for me?" You ask about work-from-home policy and team size. Those are HR questions. Engineering interviewers want to see what you'd actually want to know on day one.
Fix: three tiers of questions ready: (a) one about their specific team's current technical challenge, (b) one about how decisions get made when the team disagrees, (c) one about what the interviewer personally wishes was better.
G. The "almost-hires" pattern — why committees pass on strong candidates
The most painful losses are not the obvious fails. They're the loops where every round is "lean hire" but the committee still passes. Three patterns drive this:
- Narrow specialist signal. Every round showed the same skill. Committee can't tell if you'd be useful on a different problem next quarter. Fix: use the four rounds to show four different skills. If round 1 was SQL depth, round 2 should be operations judgment, round 3 should be scope negotiation, round 4 should be mentoring. You choose the stories.
- No "hell yes" round. Four "lean hires" is a statistical hire probability below 50% at most committees. You need one round where someone writes "strong hire, would fight for." Fix: pick the round you expect to do best in and prepare a signature answer — a story or a design move that is unusual enough to be memorable in debrief.
- Level ambiguity. Scorecards split between L4 and L5. The committee defaults down. Fix: in every round, demonstrate at least one behaviour from the level above — ambiguity tolerance in coding, org-level reasoning in design, mentoring framing in behavioural.
3. What Changes at Staff+ and Principal
The interview loop for senior (L5), staff (L6), and principal (L7) uses the same rounds — but the scorecard rubric shifts. Miscalibrating the level you're targeting is one of the most common reasons for "down-leveling" at offer time. What follows is the signal map.
The Level Matrix
| Axis | L4 — Mid | L5 — Senior | L6 — Staff | L7 — Principal |
|---|---|---|---|---|
| Scope | A task inside a feature | A feature end-to-end | A multi-team system | An org-wide technical direction |
| Ambiguity | Told what to do | Told what outcome to deliver | Defines the outcome | Defines which problems are worth solving |
| Tech judgment | Uses existing patterns | Picks the right pattern | Decides when to invent a pattern | Sets the patterns others will use for years |
| Influence | Influences own code reviews | Influences a team | Influences peer teams | Influences the org and sometimes the industry |
| Cost awareness | Aware of query cost | Owns pipeline cost | Owns team budget | Shapes the cost curve of the platform |
| Failure recovery | Fixes their bug | Owns the incident | Changes the system so this class of incident can't recur | Changes the org's practices around failure |
| Mentoring | Receives mentoring | Mentors juniors | Mentors seniors and raises the team's ceiling | Mentors staff engineers and managers |
| Writing | PR descriptions | Design docs for a feature | Strategy docs that reshape decisions | Technical vision papers that set roadmaps |
Staff (L6) Signals
At the staff level, interviews are no longer asking "can you build it" — they're asking "would you be the person others run a hard decision by?" Three signals carry most of the weight.
S1 — Scope negotiation
When given a prompt, staff candidates compress or expand the scope deliberately. Example: the prompt is "design a recommendation pipeline." A senior answers the prompt. A staff engineer says: "The interesting question here isn't the pipeline — it's how we version the feature store, because that's where the ML team and the infra team disagree. Let me scope the answer around that disagreement and mention the pipeline details as they come up."
That move — naming the real problem inside the prompt — is staff-level. It requires judgment about what is worth engineering time in the real world, not just technical skill.
S2 — Cross-team trade-off reasoning
Senior engineers answer "what's the right design." Staff engineers answer "what's the right design given what the adjacent team is doing, what the org's cost posture is, and what we're willing to deprecate in twelve months." Every technical answer has an org-level sentence.
S3 — Mentorship embedded in the story
In behaviourals, staff candidates naturally mention other engineers by role — "the senior on the team," "the junior who joined last quarter," "the staff eng on the adjacent team who had the opposite opinion." Their stories are never about themselves alone. A candidate whose every story uses "I" without ever naming who they worked with reads as individual-contributor strong, team-lead weak.
Principal (L7) Signals
At principal, the interview bar shifts again. The scorecard looks for evidence that you set technical direction across an organization, and that your framing changes how others think — not just what they build.
P1 — Problem selection
Given a system-design prompt, principal candidates often reframe the question. "The prompt asks how to serve 10 M events per day at 200 ms p99 latency. In my experience, at that scale the real question is whether we even need 200 ms — because every order of magnitude cheaper we can make the infrastructure unlocks a different set of products. Let me walk through three designs at three latency points and show the cost curve."
Principal engineers select which problems to solve, not just how to solve them.
P2 — Org-level influence without authority
Principal behaviourals feature stories where the candidate changed the behaviour of people who did not report to them and were not asked to. A principal engineer rarely describes winning an argument — they describe reframing the argument so the outcome was inevitable.
P3 — Narrative ownership
Principal candidates can answer "what's your three-year technical bet?" with specificity. They have a thesis — about where the stack is going, which abstractions are about to break, what the team should be investing in now to be ready. That thesis is testable and defensible.
Common Down-Leveling Traps
- L5 claiming L6. You work on a big team, so you claim staff. But every story is scoped to a feature, not a system. Fix: either claim L5 cleanly and ace it, or find two stories where you owned a cross-team decision, and rehearse them cold.
- L6 claiming L7. You've delivered multi-team systems, but you can't articulate a three-year bet. Fix: prepare a written one-page "technical narrative" before the loop. Even if it never comes up, writing it will sharpen every answer.
- L7 under-demonstrating. You're so deep in the problem that you forget to name the org-level framing. Fix: every answer opens with the framing sentence. "The interesting tension here is between X team's need for Y and Z team's need for W." Then dive in.
4. The SQL Question Bank — Senior Tier
These are the SQL problems that actually separate seniors from mid-level engineers in interview rounds. Each comes with a prompt, the schema, the expected output shape, a walkthrough of the thinking, and a reference solution. The difficulty target is the "hard tier" on public SQL practice platforms — multi-CTE, window-function-heavy, business-grounded, and with at least one non-obvious edge case.
Every solution is written in ANSI-style SQL that runs on Snowflake, BigQuery, Redshift, and Postgres with minor dialect tweaks. Spark SQL equivalents are noted where they diverge.
The fastest way to fail these problems in an interview is to type before you've restated the requirements out loud. Every solution below opens with "Let me restate what I'm solving."
Q1 — Cohort Retention Matrix
Prompt
For a subscription product, build the signup-month × activity-month retention matrix. Each cell should contain the percentage of users who signed up in the row month and were active in the column month. Include month 0 (100% by definition). Go up to month 12. The output should be sorted by signup month descending.
Schema
users(user_id BIGINT, signup_ts TIMESTAMP)
events(user_id BIGINT, event_ts TIMESTAMP, event_name STRING)
-- "active" = at least one event in that calendar month
Why this problem is hard
Three traps: (1) the month-diff must be computed carefully across year boundaries; (2) a naive join will double-count users who had multiple events in a month; (3) cohorts with zero retained users at month N must still appear — you can't filter them out with an inner join.
Walkthrough
- Build the cohort dimension: one row per user, with their signup month truncated to the first of the month.
- Build the activity dimension: distinct (user_id, activity_month) pairs from events.
- Join cohort → activity on user_id, compute month_number = month_diff(activity_month, signup_month).
- Aggregate: for each (signup_month, month_number), count distinct users. Divide by the cohort size to get the retention percentage.
- Pivot into the matrix shape — either with conditional aggregation or with your engine's PIVOT clause.
Reference solution
WITH cohort AS (
SELECT
user_id,
DATE_TRUNC('month', signup_ts) AS signup_month
FROM users
),
cohort_size AS (
SELECT signup_month, COUNT(*) AS cohort_n
FROM cohort
GROUP BY signup_month
),
activity AS (
SELECT DISTINCT
user_id,
DATE_TRUNC('month', event_ts) AS activity_month
FROM events
),
retention AS (
SELECT
c.signup_month,
DATEDIFF('month', c.signup_month, a.activity_month) AS month_n,
COUNT(DISTINCT a.user_id) AS retained
FROM cohort c
JOIN activity a USING (user_id)
WHERE a.activity_month >= c.signup_month
AND DATEDIFF('month', c.signup_month, a.activity_month) <= 12
GROUP BY 1, 2
)
SELECT
r.signup_month,
s.cohort_n,
-- COALESCE to 0 so cells with zero retained users show 0, not NULL (trap 3)
COALESCE(MAX(CASE WHEN month_n = 0 THEN retained END), 0) AS m0,
COALESCE(MAX(CASE WHEN month_n = 1 THEN retained END), 0) AS m1,
COALESCE(MAX(CASE WHEN month_n = 2 THEN retained END), 0) AS m2,
COALESCE(MAX(CASE WHEN month_n = 3 THEN retained END), 0) AS m3,
COALESCE(MAX(CASE WHEN month_n = 6 THEN retained END), 0) AS m6,
COALESCE(MAX(CASE WHEN month_n = 12 THEN retained END), 0) AS m12,
ROUND(100.0 * COALESCE(MAX(CASE WHEN month_n = 1 THEN retained END), 0) / s.cohort_n, 1) AS pct_m1,
ROUND(100.0 * COALESCE(MAX(CASE WHEN month_n = 3 THEN retained END), 0) / s.cohort_n, 1) AS pct_m3,
ROUND(100.0 * COALESCE(MAX(CASE WHEN month_n = 12 THEN retained END), 0) / s.cohort_n, 1) AS pct_m12
FROM retention r
JOIN cohort_size s USING (signup_month)
GROUP BY r.signup_month, s.cohort_n
ORDER BY r.signup_month DESC;
Follow-up probes the interviewer will ask
- "What if we want rolling 30-day retention instead of calendar-month retention?" → replace
DATE_TRUNC('month', …)with a bucket derived fromFLOOR(DATEDIFF('day', signup_ts, event_ts) / 30). - "What if some users have backdated signup timestamps because they were imported?" → add a validity filter on signup_ts and a separate "import" cohort.
- "How would you compute the confidence interval on each retention percentage?" → Wilson interval, or bootstrap, depending on cohort size.
Q2 — Sessionization with Idle Timeout
Prompt
You have raw event logs with user_id and event timestamps. Define a "session" as a contiguous sequence of events from the same user where no gap between consecutive events exceeds 30 minutes. For each session, output: user_id, session_id, session_start, session_end, duration_seconds, event_count. The session_id should be stable across re-runs for the same input.
Schema
events(user_id BIGINT, event_ts TIMESTAMP, event_name STRING)
Why this problem is hard
This is a "gaps and islands" problem in disguise. Two traps: (1) you need to compute the gap against the previous event per user, not globally; (2) the session_id must be deterministic — a naive ROW_NUMBER() changes across runs if ties exist. A typical junior solution uses a self-join which scans O(n²); the senior solution uses a single pass with LAG.
Walkthrough
- Partition by user, order by timestamp, compute LAG(event_ts) to get the previous event's timestamp.
- Flag a new session when the gap > 30 minutes or when there is no previous event (first row per user).
- Take a running SUM of the flag over the partition to assign monotonically increasing session numbers per user.
- Aggregate by (user_id, session_number) for the final output. Build a stable session_id by hashing user_id + session_start.
Reference solution
WITH e AS (
SELECT
user_id,
event_ts,
event_name,
LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_ts
FROM events
),
flagged AS (
SELECT
e.*,
CASE
WHEN prev_ts IS NULL THEN 1
WHEN DATEDIFF('second', prev_ts, event_ts) > 1800 THEN 1
ELSE 0
END AS new_session_flag
FROM e
),
numbered AS (
SELECT
f.*,
SUM(new_session_flag) OVER (
PARTITION BY user_id ORDER BY event_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_number
FROM flagged f
)
SELECT
user_id,
MD5(CAST(user_id AS STRING) || '_' || CAST(MIN(event_ts) AS STRING)) AS session_id,
MIN(event_ts) AS session_start,
MAX(event_ts) AS session_end,
DATEDIFF('second', MIN(event_ts), MAX(event_ts)) AS duration_seconds,
COUNT(*) AS event_count
FROM numbered
GROUP BY user_id, session_number
ORDER BY user_id, session_start;
Follow-up probes
- "How would you make this incremental?" → state the watermark: "I'd only reprocess sessions that extend into the new batch window, which means replaying the tail of each user's most recent session from a checkpoint."
- "What if two events share a timestamp?" → the
ORDER BYmust include a tiebreaker — typicallyevent_id— or sessions become non-deterministic across runs. - "Can this be done in Spark with Structured Streaming?" → yes, using
session_windowdirectly; walk through the watermark and state timeout semantics.
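The LAG / flag / running-sum pipeline above has a direct single-pass translation into plain Python, useful for checking the SQL's output on a small sample (the hash-based session_id is omitted for brevity):

```python
def sessionize(events, timeout_s=1800):
    """events: (user_id, ts_epoch) tuples, pre-sorted by (user_id, ts).
    Returns one (user_id, session_start, session_end, event_count) per session."""
    sessions = []
    prev_user, prev_ts = None, None
    for user_id, ts in events:
        # same flag logic as the SQL: first row per user, or gap > timeout
        if user_id != prev_user or ts - prev_ts > timeout_s:
            sessions.append([user_id, ts, ts, 1])
        else:
            sessions[-1][2] = ts     # extend session_end
            sessions[-1][3] += 1     # bump event_count
        prev_user, prev_ts = user_id, ts
    return [tuple(s) for s in sessions]

events = [(1, 0), (1, 100), (1, 2500), (2, 50)]   # 2500 - 100 > 1800: new session
print(sessionize(events))
# [(1, 0, 100, 2), (1, 2500, 2500, 1), (2, 50, 50, 1)]
```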
Q3 — Last-Touch Attribution With Lookback Window
Prompt
For every purchase, find the last marketing touch (channel + campaign) that occurred within the 7 days preceding the purchase, for the same user. If no touch exists in that window, attribute the purchase to "organic." Output: purchase_id, user_id, purchase_ts, purchase_amount, attribution_channel, attribution_campaign, days_between_touch_and_purchase.
Schema
purchases(purchase_id BIGINT, user_id BIGINT, purchase_ts TIMESTAMP, amount DECIMAL)
touches(user_id BIGINT, touch_ts TIMESTAMP, channel STRING, campaign STRING)
Why this problem is hard
Two traps: (1) this is a range join — every purchase potentially joins to many touches, and a careless query will multiply rows before the rank, producing wrong totals; (2) ties at the same timestamp must be broken deterministically. The correct approach ranks candidates inside a bounded sub-query rather than joining the full cross-product.
Walkthrough
- Join purchases to touches on user_id and the 7-day window. Compute the lag in seconds.
- Rank touches per purchase_id by touch_ts DESC (last-touch wins), with channel alphabetical as a deterministic tiebreaker.
- Keep only rank = 1. Left-join back to the full purchases table so unattributed rows show "organic."
Reference solution
WITH candidates AS (
SELECT
p.purchase_id,
p.user_id,
p.purchase_ts,
p.amount,
t.touch_ts,
t.channel,
t.campaign,
ROW_NUMBER() OVER (
PARTITION BY p.purchase_id
ORDER BY t.touch_ts DESC, t.channel ASC
) AS rn
FROM purchases p
LEFT JOIN touches t
ON t.user_id = p.user_id
AND t.touch_ts <= p.purchase_ts
AND t.touch_ts >= p.purchase_ts - INTERVAL '7 days'
)
SELECT
purchase_id,
user_id,
purchase_ts,
amount AS purchase_amount,
COALESCE(channel, 'organic') AS attribution_channel,
COALESCE(campaign, 'none') AS attribution_campaign,
CASE WHEN touch_ts IS NULL THEN NULL
ELSE DATEDIFF('day', touch_ts, purchase_ts)
END AS days_between_touch_and_purchase
FROM candidates
WHERE rn = 1
ORDER BY purchase_ts;
Follow-up probes
- "How would the query change for first-touch attribution?" → swap
ORDER BY t.touch_ts DESCto ASC, and consider expanding the window to the user's lifetime. - "How would you handle cross-device users?" → introduce an identity table and replace user_id with a resolved identity_id upstream.
- "What does the query cost at 1 B purchases and 10 B touches?" → range joins are expensive; discuss pre-filtering touches by date, using broadcast hash joins if touches fits, and partitioning by user_id on both sides.
Q4 — Month-over-Month Growth With Sparse Months
Prompt
Compute month-over-month revenue growth for every product, for every calendar month in the past 24 months. If a product had no sales in a given month, treat its revenue as zero (not NULL). Output: product_id, month, revenue, prev_month_revenue, mom_growth_pct.
Why this problem is hard
The sparse-months trap. A naive LAG() over the sales table will compare April to the previous row, which might be from February if March had no sales. The answer is wrong and doesn't flag itself. The fix is to generate a dense calendar (product × month) and left-join actual revenue onto it.
Reference solution
WITH months AS (
SELECT DATE_TRUNC('month', d) AS month
FROM GENERATE_SERIES(
DATE_TRUNC('month', CURRENT_DATE - INTERVAL '24 months'),
DATE_TRUNC('month', CURRENT_DATE),
INTERVAL '1 month'
) AS d
),
products AS (SELECT DISTINCT product_id FROM sales),
grid AS (
SELECT p.product_id, m.month
FROM products p CROSS JOIN months m
),
monthly AS (
SELECT
product_id,
DATE_TRUNC('month', sale_ts) AS month,
SUM(amount) AS revenue
FROM sales
GROUP BY 1, 2
),
filled AS (
SELECT
g.product_id,
g.month,
COALESCE(m.revenue, 0) AS revenue
FROM grid g
LEFT JOIN monthly m USING (product_id, month)
)
SELECT
product_id,
month,
revenue,
LAG(revenue) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_revenue,
CASE
WHEN LAG(revenue) OVER (PARTITION BY product_id ORDER BY month) IS NULL THEN NULL
WHEN LAG(revenue) OVER (PARTITION BY product_id ORDER BY month) = 0 AND revenue > 0 THEN 9999.0 -- sentinel: growth from a zero base is undefined
WHEN LAG(revenue) OVER (PARTITION BY product_id ORDER BY month) = 0 THEN 0
ELSE ROUND(
100.0 * (revenue - LAG(revenue) OVER (PARTITION BY product_id ORDER BY month))
/ LAG(revenue) OVER (PARTITION BY product_id ORDER BY month), 2
)
END AS mom_growth_pct
FROM filled
ORDER BY product_id, month;
Follow-up probes
- "How do you handle products that were launched mid-window?" → emit NULL rather than 0 for months before the product's first sale; use a
first_sale_monthCTE. - "The query is slow at 50 M rows. What do you change?" → pre-aggregate sales to daily first; partition-prune on sale_ts; materialize the monthly roll-up as a daily batch job.
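The dense-grid fix is easy to demonstrate in a few lines of Python. This sketch (a hypothetical `fill_months` helper, with months as plain strings for brevity) shows why the naive previous-row comparison is wrong: the grid restores the empty month, so March compares against February's zero instead of skipping back to January:

```python
def fill_months(monthly, months):
    """monthly: {(product, month): revenue}; months: ordered list of every calendar month.

    Returns rows of (product, month, revenue, prev_revenue, growth_pct),
    with sparse months filled as 0 — the moral equivalent of grid + LEFT JOIN + LAG.
    """
    products = {p for p, _ in monthly}
    rows = []
    for p in sorted(products):
        prev = None
        for m in months:
            rev = monthly.get((p, m), 0)  # sparse month -> 0, not skipped
            # growth undefined for the first month and for a zero base
            growth = None if prev in (None, 0) else round(100 * (rev - prev) / prev, 2)
            rows.append((p, m, rev, prev, growth))
            prev = rev
    return rows
```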
Q5 — Top-N Per Group With Ties and Pagination
Prompt
For each region, return the top 3 products by revenue in the current quarter. If two products tie on revenue, break the tie by product name alphabetically. The query must be correct even when a region has fewer than 3 products.
Why it's hard
The window function choice matters: ROW_NUMBER arbitrarily picks a winner on ties, RANK leaves gaps, DENSE_RANK doesn't. For "top 3 with ties resolved deterministically" you almost always want ROW_NUMBER with a deterministic tiebreaker inside the ORDER BY. That's what this question is really testing.
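The difference between the three functions is easiest to see on a small tied list. A minimal sketch (hypothetical `ranks` helper, ranking a list of revenues descending the way the engines would):

```python
def ranks(revenues):
    """Return (value, row_number, rank, dense_rank) per value, descending."""
    out, rank, dense = [], 0, 0
    prev = object()  # sentinel that equals nothing
    for i, r in enumerate(sorted(revenues, reverse=True), start=1):
        if r != prev:
            rank, dense, prev = i, dense + 1, r  # RANK jumps to row position; DENSE_RANK just increments
        out.append((r, i, rank, dense))
    return out
```

On `[100, 90, 90, 80]` this yields row numbers 1–4, ranks 1, 2, 2, 4 (RANK skips 3 after the tie), and dense ranks 1, 2, 2, 3 — which is exactly why "top 3" means different things under each function.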
Reference solution
WITH q AS (
SELECT
region,
product_id,
product_name,
SUM(amount) AS revenue
FROM sales
WHERE sale_ts >= DATE_TRUNC('quarter', CURRENT_DATE)
AND sale_ts < DATE_TRUNC('quarter', CURRENT_DATE + INTERVAL '3 months')
GROUP BY region, product_id, product_name
),
ranked AS (
SELECT
q.*,
ROW_NUMBER() OVER (
PARTITION BY region
ORDER BY revenue DESC, product_name ASC
) AS rn
FROM q
)
SELECT region, product_id, product_name, revenue, rn AS rank_in_region
FROM ranked
WHERE rn <= 3
ORDER BY region, rn;
Follow-up probes
- "What if the business wants all ties included in the top 3?" → switch to
DENSE_RANK()and filter ondense_rank <= 3. - "How would you page this to return rank 4–6 next?" → parametrize the filter; note the performance hit if the engine doesn't push the filter below the window.
Q6 — As-Of / Point-in-Time Join
Prompt
For each order, find the customer's subscription tier as of the order timestamp. A customer's tier can change over time and we have a history table with valid-from / valid-to per tier. Output: order_id, order_ts, customer_id, tier_at_order_time.
Why it's hard
This is a point-in-time join — the most common mistake is to join on "tier where valid_from <= order_ts and valid_to >= order_ts." That works only if the history is gap-free and exactly one row is valid at any moment. Real-world histories have gaps, overlaps from bad backfills, and open-ended current rows (valid_to = NULL). The senior answer handles all three.
Reference solution (gap-tolerant, picks most-recent valid)
WITH tier_history AS (
SELECT
customer_id,
tier,
valid_from,
COALESCE(valid_to, TIMESTAMP '9999-12-31 00:00:00') AS valid_to
FROM customer_tier_history
),
candidates AS (
SELECT
o.order_id,
o.order_ts,
o.customer_id,
h.tier,
h.valid_from,
ROW_NUMBER() OVER (
PARTITION BY o.order_id
ORDER BY h.valid_from DESC
) AS rn
FROM orders o
LEFT JOIN tier_history h
ON h.customer_id = o.customer_id
AND h.valid_from <= o.order_ts
AND h.valid_to > o.order_ts
)
SELECT
order_id,
order_ts,
customer_id,
COALESCE(tier, 'free') AS tier_at_order_time
FROM candidates
WHERE rn = 1
ORDER BY order_ts;
Follow-up probes
- "What do you do if the history has overlaps?" → rank by valid_from DESC, then by loaded_at DESC, and take the most recently loaded record. Flag the overlap to a data-quality dashboard.
- "When would you reach for an engine-native as-of join (e.g. pandas merge_asof, or ps.merge_asof in Spark's pandas API)?" → when the fact side is much larger than the history side and you'd otherwise shuffle both; an as-of join can keep the small history side broadcast.
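The gap-tolerant lookup itself is a binary search. This sketch (hypothetical `tier_at` helper; timestamps simplified to ints, history for one customer sorted by valid_from) matches the SQL's semantics — valid_from inclusive, valid_to exclusive, open-ended current rows as None, and "free" when no interval covers the order:

```python
import bisect

def tier_at(history, order_ts):
    """history: sorted list of (valid_from, valid_to, tier); valid_to may be None (open-ended)."""
    starts = [h[0] for h in history]
    i = bisect.bisect_right(starts, order_ts) - 1  # last interval starting <= order_ts
    while i >= 0:
        valid_from, valid_to, tier = history[i]
        if valid_to is None or valid_to > order_ts:
            return tier
        i -= 1  # overlap-tolerant: walk back past expired rows
    return "free"  # gap (or pre-history): no covering interval
```

The walk-back loop is the Python analogue of ranking covering rows by valid_from DESC and taking rn = 1.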
Q7 — Funnel Analysis With Ordered Steps
Prompt
A signup funnel has four ordered steps: visit → view_plan → start_checkout → complete_signup. For users in the past 30 days, compute (a) how many users reached each step, and (b) the step-to-step conversion rate. A user only "reaches" a step if they also reached every earlier step, and the events must appear in order within a single 24-hour window.
Why it's hard
Ordered funnels with a time constraint cannot be solved with simple COUNT(DISTINCT CASE WHEN …) over the events table — that would count users who did step 3 before step 1 or across weeks. The senior answer uses window functions to enforce ordering within a window, or array_agg with pattern-matching on the sequence.
Reference solution
WITH ordered AS (
SELECT
user_id,
event_name,
event_ts,
MIN(CASE WHEN event_name = 'visit' THEN event_ts END) OVER w AS t_visit,
MIN(CASE WHEN event_name = 'view_plan' THEN event_ts END) OVER w AS t_view,
MIN(CASE WHEN event_name = 'start_checkout' THEN event_ts END) OVER w AS t_start,
MIN(CASE WHEN event_name = 'complete_signup' THEN event_ts END) OVER w AS t_done
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '30 days'
WINDOW w AS (PARTITION BY user_id)
),
per_user AS (
SELECT
user_id,
t_visit,
CASE WHEN t_view > t_visit AND t_view <= t_visit + INTERVAL '24 hours' THEN t_view END AS t_view_ok,
CASE WHEN t_start > t_view AND t_start <= t_visit + INTERVAL '24 hours' THEN t_start END AS t_start_ok,
CASE WHEN t_done > t_start AND t_done <= t_visit + INTERVAL '24 hours' THEN t_done END AS t_done_ok
FROM ordered
WHERE t_visit IS NOT NULL
GROUP BY user_id, t_visit, t_view, t_start, t_done
)
SELECT
COUNT(DISTINCT CASE WHEN t_visit IS NOT NULL THEN user_id END) AS n_visit,
COUNT(DISTINCT CASE WHEN t_view_ok IS NOT NULL THEN user_id END) AS n_view,
COUNT(DISTINCT CASE WHEN t_start_ok IS NOT NULL THEN user_id END) AS n_start,
COUNT(DISTINCT CASE WHEN t_done_ok IS NOT NULL THEN user_id END) AS n_done,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN t_view_ok IS NOT NULL THEN user_id END)
/ NULLIF(COUNT(DISTINCT CASE WHEN t_visit IS NOT NULL THEN user_id END), 0), 2) AS visit_to_view_pct,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN t_start_ok IS NOT NULL THEN user_id END)
/ NULLIF(COUNT(DISTINCT CASE WHEN t_view_ok IS NOT NULL THEN user_id END), 0), 2) AS view_to_start_pct,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN t_done_ok IS NOT NULL THEN user_id END)
/ NULLIF(COUNT(DISTINCT CASE WHEN t_start_ok IS NOT NULL THEN user_id END), 0), 2) AS start_to_done_pct
FROM per_user;
Follow-up probes
- "What if a user can enter the funnel more than once?" → redefine "reaching a step" per funnel-instance, not per user. Add a
funnel_idvia sessionization (reuse Q2's trick). - "Engines vary on ordering guarantees in nested windows. Which engine are you targeting?" → call out Snowflake / BigQuery vs. Spark differences.
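The ordering-plus-deadline rule is worth unit-testing on its own before trusting the SQL. A minimal per-user sketch (hypothetical `furthest_step` helper, assuming the four-step funnel from the prompt):

```python
from datetime import datetime, timedelta

STEPS = ["visit", "view_plan", "start_checkout", "complete_signup"]

def furthest_step(events, deadline=timedelta(hours=24)):
    """events: list of (event_name, ts) for one user.

    Returns the index of the deepest step reached in order within the
    deadline window anchored at the first visit, or -1 if no visit.
    """
    firsts = {}
    for name, ts in events:
        if name in STEPS and (name not in firsts or ts < firsts[name]):
            firsts[name] = ts  # earliest occurrence of each step
    if "visit" not in firsts:
        return -1
    t_visit, prev, reached = firsts["visit"], firsts["visit"], 0
    for i, step in enumerate(STEPS[1:], start=1):
        ts = firsts.get(step)
        # each step must follow the previous step AND land within 24h of the visit
        if ts is None or ts <= prev or ts > t_visit + deadline:
            break
        reached, prev = i, ts
    return reached
```

Counting users per step is then a fold over `furthest_step` per user — a user at step k contributes to every step index up to k, which is the "reached every earlier step" semantics.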
Q8 — Anomaly Detection by Z-Score Per Group
Prompt
For each merchant, flag daily transaction counts in the past 90 days that are more than 3 standard deviations from the merchant's own 90-day mean. Output: merchant_id, txn_date, txn_count, rolling_mean, rolling_stddev, z_score, is_anomaly.
Why it's hard
Three traps: (1) you need a rolling mean and stddev over the last 90 days per merchant, which is a moving window — not just a group aggregate; (2) merchants with fewer than, say, 14 days of history should be excluded or flagged separately because z-scores on small N are meaningless; (3) if you include the current row in the mean you bias the anomaly test.
Reference solution
WITH daily AS (
SELECT merchant_id, DATE(txn_ts) AS txn_date, COUNT(*) AS txn_count
FROM transactions
WHERE txn_ts >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY 1, 2
),
stats AS (
SELECT
merchant_id,
txn_date,
txn_count,
AVG(txn_count) OVER w AS rolling_mean,
STDDEV(txn_count) OVER w AS rolling_stddev,
COUNT(*) OVER w AS n_days
FROM daily
WINDOW w AS (
PARTITION BY merchant_id
ORDER BY txn_date
ROWS BETWEEN 89 PRECEDING AND 1 PRECEDING
)
)
SELECT
merchant_id,
txn_date,
txn_count,
ROUND(rolling_mean, 2) AS rolling_mean,
ROUND(rolling_stddev, 2) AS rolling_stddev,
CASE
WHEN n_days < 14 THEN NULL
WHEN rolling_stddev = 0 THEN NULL
ELSE ROUND((txn_count - rolling_mean) / rolling_stddev, 2)
END AS z_score,
CASE
WHEN n_days < 14 OR rolling_stddev IS NULL OR rolling_stddev = 0 THEN FALSE
WHEN ABS((txn_count - rolling_mean) / rolling_stddev) > 3 THEN TRUE
ELSE FALSE
END AS is_anomaly
FROM stats
ORDER BY merchant_id, txn_date;
Follow-up probes
- "Z-score assumes normality — transaction counts are often Poisson. What would you use instead?" → MAD (median absolute deviation), or a Poisson-based confidence interval, or a seasonally-decomposed residual.
- "How would you productionize this?" → as a materialized view refreshed hourly; alerts go into a low-cardinality table, not straight to PagerDuty; include a 15-minute debounce.
Q9 — Median Per Group Without a MEDIAN Function
Prompt
Compute the median order value per country for the past year. The target engine does not have a MEDIAN function (assume older Postgres). Use window functions and produce deterministic output for both odd and even counts per group.
Reference solution
WITH ordered AS (
SELECT
country,
amount,
ROW_NUMBER() OVER (PARTITION BY country ORDER BY amount ASC, order_id ASC) AS rn_asc,
COUNT(*) OVER (PARTITION BY country) AS n
FROM orders
WHERE order_ts >= CURRENT_DATE - INTERVAL '365 days'
),
medians AS (
SELECT country, amount
FROM ordered
WHERE
(n % 2 = 1 AND rn_asc = (n + 1) / 2)
OR
(n % 2 = 0 AND rn_asc IN (n / 2, n / 2 + 1))
)
SELECT country, AVG(amount) AS median_amount
FROM medians
GROUP BY country
ORDER BY country;
Follow-up probes
- "How would you compute the 95th percentile the same way?" → the general form is rn_asc = CEIL(n * 0.95) for the lower index, with interpolation if you need continuous percentiles.
- "What about approximate percentiles at scale?" → introduce APPROX_PERCENTILE / t-digest and discuss the accuracy / cost trade-off.
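The rank arithmetic in the WHERE clause is compact enough to verify directly. A sketch of the same selection rule in Python (hypothetical `median_by_rank` helper, 1-indexed ranks as in ROW_NUMBER):

```python
def median_by_rank(values):
    """Middle row for odd n; average of the two middle rows for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]            # rn_asc = (n + 1) / 2
    lo, hi = ordered[n // 2 - 1], ordered[n // 2]   # rn_asc IN (n/2, n/2 + 1)
    return (lo + hi) / 2                             # the AVG over the two rows
```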
Q10 — Mutual / Reciprocal Relationships
Prompt
Given a directed follows table (follower_id, followee_id), find all mutual pairs — where A follows B and B follows A — with each pair listed only once, ordered by the smaller user_id first.
Reference solution
SELECT
LEAST(follower_id, followee_id) AS user_a,
GREATEST(follower_id, followee_id) AS user_b
FROM follows f1
WHERE EXISTS (
SELECT 1 FROM follows f2
WHERE f2.follower_id = f1.followee_id
AND f2.followee_id = f1.follower_id
)
AND f1.follower_id < f1.followee_id;
Why the < filter matters
Without f1.follower_id < f1.followee_id, each mutual pair appears twice (once as A→B, once as B→A). The combination of EXISTS and the asymmetric filter keeps each pair exactly once.
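The same dedup trick translates to a one-liner over a set of directed edges. A minimal sketch (hypothetical `mutual_pairs` helper; the set handles duplicate rows in the input):

```python
def mutual_pairs(follows):
    """follows: iterable of (follower_id, followee_id) directed edges."""
    edges = set(follows)
    return sorted(
        (a, b)
        for a, b in edges
        if (b, a) in edges and a < b  # reciprocal, and smaller id first — each pair once
    )
```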
Follow-up probes
- "How do you extend this to find friend-of-a-friend recommendations?" → self-join the mutual-pairs CTE on either user, exclude existing follows, rank by count of common connections.
- "This query is slow at 500 M rows. What do you change?" → cluster
followsbyfollower_id, consider a pre-computedmutualstable refreshed nightly.
5. System Design Transcripts — Bad, Average, Strong
Reading a system-design answer is not the same as hearing one. What follows are four prompts rendered as transcripts at three skill levels each, with a scorecard on why each lands where it does. Read them out loud. The rhythm is part of what you're practicing.
Scenario A — Design an Event Pipeline for 10 M Events/Day
Prompt (verbatim)
Our mobile app sends ~10 M events per day. Product wants both a near-real-time operational dashboard (freshness under 5 minutes) and a nightly business report. Design the pipeline.
Bad answer (what not to do)
"I'd use Kafka, then Spark Structured Streaming, then write to Snowflake. The dashboard reads from Snowflake. The nightly report also reads from Snowflake."
Why it fails: No clarifying questions. No volumes (10 M/day is only the headline — what's the peak multiple of the mean? what's the payload size?). No SLA decomposition (operational dashboards and nightly reports have different correctness tolerances, not just freshness). No failure modes. No cost. The interviewer will push and push, hoping to find ground.
Average answer
"10 M events per day is ~115 events per second average, probably 1,000 per second peak. Events are small, maybe 2 KB each, so ~20 GB per day. I'd use Kafka for ingest with three brokers, retention of 7 days for replay. A Spark Structured Streaming job writes to a bronze Iceberg table every 2 minutes with micro-batch trigger. A second job transforms bronze to silver hourly. The dashboard queries silver via a serverless warehouse. The nightly report runs on gold tables materialized at 2am.
For failure handling, I'd use idempotent writes keyed on event_id. If Spark dies mid-batch, the checkpoint lets it resume. If Kafka is backed up, the dashboard shows stale data but doesn't break."
Why it's only average: correct, complete enough to pass a senior bar, but doesn't surface tension. No mention of schema evolution, late data, or the cost gap between streaming bronze and batching silver. No mention of who owns which SLA when something breaks.
Strong answer (Staff+)
"Three clarifying questions before I design. One: the 5-minute freshness — is that end-to-end from app-emit to dashboard, or from ingestion to dashboard? The difference is a mobile SDK retry tail that can add 30 minutes. Two: is the dashboard aggregated or row-level? Aggregations can tolerate more latency because they're less sensitive to individual drop-outs. Three: who is the 3am owner of this pipeline, and what breakage wakes them up versus opens a ticket?
Assuming end-to-end 5 min, aggregated dashboard, and on-call is a SRE team that wants noise-free alerts — here's the design.
Mobile SDK batches events and posts to a collector behind a CDN. The collector validates schema and writes to Kafka partitioned by user_id for ordering. Retention is 7 days so we can replay a full week. Spark Structured Streaming consumes with a 1-minute trigger, writes to a bronze Iceberg table with hidden partitioning on event_date. That gives us cheap compaction and schema evolution without consumer breakage.
A second streaming job projects bronze into silver with typed columns and business-key resolution. The dashboard queries silver, filtered to the past 24 hours, via a warehouse that auto-scales on idle. For the nightly report, a batch job at 2am reads silver, joins with reference data, and writes a gold table consumed by the BI tool.
Three failure modes worth naming. First: late data. SDK retries can arrive 12 hours after the event. I'd set a watermark of 2 hours on silver and emit late arrivals to a "late" side-table that the nightly batch re-incorporates. The dashboard never sees late data — it's already moved on. Second: schema drift. I'd require the bronze writer to fail-open with a quarantine record whenever an unknown field appears, and alert the producer team, not the consumer team. Third: cost runaway. If the mobile SDK ships a bug that 10x's event volume, Kafka's retention will still hold 7 days, Spark will throttle at its configured max, and the dashboard will stale rather than crash. That's the correct failure mode because a stale dashboard is a pageable event and a crashed cluster is a catastrophe.
One trade-off to flag: we're duplicating storage (bronze + silver + gold). The alternative is materialized views on raw, which saves storage but gives us less control over cost and less visibility into pipeline failures. At 20 GB/day raw, that's only ~7 TB/year; keeping all three tiers costs ~$2,000/year in storage. Cheap. I'd keep the three-tier design. If we were at 10x the scale, the trade-off would reverse."
Why it's strong: opens with scoping, states assumptions before designing, names who pages at 3am (org awareness), discusses failure modes as first-class content, quantifies cost, closes with a trade-off the interviewer didn't ask about. That last move is the staff-level signature — volunteering the tension you chose to live with.
Scenario B — "The Dashboard Numbers Don't Match" (Live Debug)
Prompt (verbatim)
It's Tuesday morning. Two dashboards show different values for the same KPI — yesterday's revenue. The finance team reports $4.2M; the executive dashboard reports $3.9M. Both were "right" last week. Walk me through how you investigate.
Bad answer
"I'd check the queries behind each dashboard and compare them."
Why it fails: this is what the junior tells the senior who then has to do the actual work. The interviewer is testing investigation methodology, not tooling. A one-sentence answer signals "I've never owned an incident like this."
Average answer
"First I'd reproduce both numbers myself against the warehouse — confirm finance's $4.2M and exec's $3.9M are the right numbers from each dashboard's source query. Then I'd inspect the SQL of both queries. Usually the difference is timezone — one uses UTC and one uses local time — or a join filter that includes/excludes refunds.
If SQL is identical, I'd check the source tables — are they reading the same table? If one reads gold.revenue and the other reads silver.orders, those pipelines might have different lag. I'd look at pipeline run logs for yesterday.
Once I find the root cause, I'd document it, fix it, and post a message in the data slack channel with the explanation."
Why it's only average: correct debugging sequence, but no prioritization, no mention of stopping the bleeding while investigating, no consideration of downstream consumers, no ownership of communications beyond "post in slack." This is senior IC thinking, not staff thinking.
Strong answer
"Before I touch anything technical — three moves in parallel. One: I IM both dashboard owners to say 'we see a discrepancy, investigating, will post to #data-incidents within 30 min, please don't publish commentary on these numbers externally until we post.' That buys me a window and avoids the CFO telling the CEO a wrong number at 10am. Two: I note the current values from both dashboards with screenshots — the numbers can drift during investigation. Three: I check if there's an active incident, since discrepancies often correlate with upstream failures.
Now the investigation. I treat this as a funnel: at which layer does the $300K disagreement first appear? I run both queries myself against the warehouse and confirm. Then I go up the pipeline — do the upstream silver tables agree? Do the bronze raw tables agree? Somewhere in that chain the numbers diverge, and that's the layer where the bug lives.
The four most likely root causes in order: timezone boundary (revenue before/after midnight attributes to different days), filter drift (one query includes cancellations, the other excludes), late-arriving data (finance pulled later and saw more rows), and join duplication (a new dim row caused some facts to be double-counted after a recent deploy).
Let's say I find the cause: a dim_store refresh last night added a duplicate mapping for two store_ids, which inflated finance's number. Now the fix. Short-term: I roll back the dim_store snapshot. Medium-term: I fix the dim loader to surface uniqueness violations as a pipeline failure, not a warning. Long-term — and this is what most engineers skip — I want to understand why this wasn't caught by the data-quality monitor. Was the monitor not configured on this table? Was the threshold too wide? That's the conversation that prevents recurrence.
For communications, I'd post to #data-incidents with a four-part summary: impact, root cause, fix, prevention. I'd notify finance and the exec team directly about the corrected number, with the delta explained — because the trust repair is as important as the fix.
One meta-point I'd flag: we shouldn't have two dashboards consuming the same KPI from two different sources. That's the real bug. I'd follow up with a proposal to consolidate onto a single certified source, owned by the data team, with a deprecation path for whichever dashboard loses."
Why it's strong: parallel moves (comms + triage + investigation), explicit stakeholder management, structured root-cause hypothesis, short/medium/long-term fix framing, and the closing move — naming the organizational root cause — is unmistakably staff-level.
Scenario C — Migrate 50 TB Hive Table to Iceberg With Zero Consumer Downtime
Prompt (verbatim)
You have a 50 TB Hive-external table on S3, partitioned by day, with 40+ downstream consumers. Product wants to migrate it to Iceberg so we get time travel, safe schema evolution, and compaction. Design the migration. Zero consumer downtime is a hard requirement.
Bad answer
"I'd use Iceberg's migrate-table procedure, run it overnight, and notify consumers in the morning."
Why it fails: doesn't engage the zero-downtime constraint, doesn't state which tools support both the old and new formats during cutover, doesn't plan rollback, doesn't acknowledge 40+ consumers means 40+ migration coordination problems.
Average answer
"Iceberg has two migration paths — migrate which replaces the Hive table in place, and snapshot which creates a parallel Iceberg table pointing at the same files. I'd use snapshot first so the Hive table stays live.
Then I'd point one pilot consumer at the Iceberg table and validate it for a week. If good, I'd ask each consumer team to update their reader, one per week, until everyone is on the new table. Last step: decommission the Hive table.
For rollback, since the snapshot doesn't modify the underlying files, I can always delete the Iceberg metadata and fall back to Hive."
Why it's only average: correct at a high level, but treats "40 consumers update their reader" as a simple line item. In reality that's the bulk of the migration work and the bulk of the risk. Also doesn't address what happens to writes during the cutover window, when both tables must agree.
Strong answer
"Scoping first. 'Zero consumer downtime' needs unpacking — do we mean zero dashboard gaps, or zero read failures? Those are different SLAs. And 'migrate' — are we snapshotting to keep the same files, or ingesting fresh? If we need time travel from before today, we need snapshot. If not, we can do a parallel rebuild and save ourselves a class of bugs.
Assuming snapshot-based migration with zero read failures as the hard SLA, here's the plan in five phases.
Phase 1 — parallel registry. Run Iceberg's snapshot procedure to create a parallel Iceberg table pointing at the same underlying files. The Hive table is unchanged. No consumer sees anything. This phase is safe to test, revert, and retry.
Phase 2 — dual-write. Every job that writes to the Hive table now also writes to the Iceberg table. I'd rather not do this for 50 TB historically, so dual-write applies to new partitions only. Historical partitions are snapshotted once and read-only. I add a data-quality check that sums both tables nightly and pages on divergence over 0.01%.
Phase 3 — pilot consumers. I identify three low-risk consumers: a BI dashboard with a weekly refresh, a data-science notebook used by one team, and a back-office report. I migrate them first. I give each one two weeks to raise issues. I publish a migration guide with the exact reader config change required — engine, version, one-liner diff.
Phase 4 — tiered rollout. I rank the remaining 37 consumers by three axes: business criticality, reader engine, and how many consumers share that engine. I migrate them in cohorts, one cohort per sprint, with a "flag day" where all readers for a given engine cut over on the same Monday. I require every cohort to explicitly sign off — no passive migration.
Phase 5 — deprecation. Only after 100% of the 40 consumers are on Iceberg do I shut off the Hive metastore entry for the old table. I keep the underlying S3 files for 90 days for forensic rollback, then delete.
Rollback plan, which I want to name explicitly because it's the part that determines whether I'll actually ship this: at every phase, the previous state is still reachable. If Phase 2 dual-write introduces a bug, I disable the Iceberg write and the Hive table is still correct. If Phase 3 pilot fails, consumers swap their reader back. If Phase 4 surfaces a platform-wide bug, we pause and fix before the next cohort. I would not start without these rollback contracts written down.
Three risks worth naming. First: engine support. Not every consumer's reader speaks Iceberg at the version we'd target — the Presto team might be on an older release that doesn't support Iceberg's v2 spec. I need a reader-engine inventory before Phase 1. Second: catalog. If the company uses Glue as the Hive metastore but we want Iceberg on a REST catalog, we're also migrating the catalog. That's a separate project and I'd treat it as such. Third: compaction. Iceberg's advantage is smaller files via compaction, but until we run compaction the new table looks identical to the old. Consumers will not feel the benefit for weeks, which is a narrative risk — leadership might ask "why are we doing this" mid-migration. I'd schedule the first compaction for right after Phase 3 so there's something to show."
Why it's strong: phases are named and sequenced with explicit entry/exit criteria; rollback is treated as first-class, not a footnote; risks are named with mitigations; and the narrative risk in the last paragraph is a staff-level move — the candidate is thinking about the political economy of the migration, not just the technical one.
Scenario D — Real-Time Fraud Detection for Card Transactions
Prompt (verbatim)
Design a real-time fraud detection system for card transactions at a payments processor. Volume: 5,000 TPS steady, 20,000 TPS peak during holidays. Decision budget: sub-300 ms p99 from transaction arrival to approve/decline. The model team will own the model; you own everything else.
Bad answer
"I'd stream transactions through Kafka, call a model service, return the result. If the model is slow I'd add a cache."
Why it fails: no scoping, no SLAs decomposed, no feature-store plan, no fallback, no model team interface, no regulation mention, no audit log. Payments is a compliance-heavy domain and a good answer must signal awareness of that.
Average answer
"Transactions arrive via an API gateway, land in Kafka, and are consumed by a Flink job that enriches them with features — card history, merchant history, geo-velocity — and calls the fraud model. The model returns a score; if it's above a threshold, the transaction is declined. The result is posted back to the gateway within the latency budget.
For features, I'd use a feature store with online and offline tiers. The online tier is Redis or DynamoDB for sub-10ms lookups. The offline tier is a lakehouse table for training.
For the model, I'd support A/B by shadow-scoring with a second model and logging both outputs.
For audit, every decision is logged with the input features, the score, and the threshold applied."
Why it's only average: the shape is right, but the answer is generic — it would describe any ML serving system. Missing: the 300 ms budget decomposed into how many ms for each hop; the failure behavior when the model is unavailable; the regulatory constraint that declines cannot be retried on a different model version; and the interface contract with the model team that prevents them from breaking the latency budget unilaterally.
Strong answer
"A few scoping questions. One: is 300 ms p99 end-to-end from gateway-in to gateway-out, or is it just the decision engine? The answer changes the budget for feature lookup. Two: what's the fallback when the model is unavailable? 'Approve everything' is fraud-friendly; 'decline everything' is customer-friendly. Someone — probably not us — has a business answer. Three: is this the only decision system, or is there a rules engine ahead of or behind us? Interaction with rules is often where the latency budget gets eaten.
Assuming 300 ms end-to-end, deny-on-failure at the rules engine, and we're the single ML decision point — here's the design.
Budget decomposition: 20 ms network gateway→us, 20 ms us→gateway, 50 ms feature lookup, 100 ms model inference, 50 ms auxiliary logging and control plane, 60 ms buffer. That's the shape of the SLA. Anyone who eats into it — a model change that adds 50 ms, a feature that requires a fresh join — has to justify it against that budget.
Transactions land at a regional API gateway that validates, authenticates, and writes to a Kafka topic partitioned by card_hash. A stateless Flink job consumes with per-key ordering guarantees. The job reads online features from Redis (cached card- and merchant-level aggregates refreshed every 5 minutes) and from DynamoDB (longer-tail features). The model is deployed as a gRPC service with per-request timeouts of 150 ms; if it times out, we emit a fallback decision from the rules engine and flag the transaction for review.
The feature store has two sides — online (Redis + DynamoDB) for real-time lookups, offline (Iceberg) for training. The online-offline parity guarantee is critical: every feature must be written through the same code path to both stores so training and serving can't drift. I'd enforce this with a feature-registration contract — model team cannot deploy a feature in prod that isn't registered in the offline store and backfilled.
For model deployment, shadow-mode first: new models score alongside the current champion for a week without affecting decisions. If the shadow passes drift and calibration checks, a canary cohort (say, 1% of traffic) gets the new decisions. Only after a clean 7 days do we ramp to 100%. Every step is automatically rolled back on a pager page for latency regression or fraud-recall regression.
Three regulatory and operational points. First: every decline must be explainable. I'd log the top three feature contributors per decision, sourced from the model's SHAP output, retained for 7 years. That's a compliance minimum in this industry. Second: decisions cannot be retried on a different model version. If we decline, a retry hitting a newer model version that approves would constitute a split decision — a disaster for audit. I'd pin the model version per decision_id and never serve a retry with a different version. Third: the model team owns model quality, I own the system. I'd publish a two-line contract: they guarantee p99 inference under 100 ms at any given version, I guarantee feature freshness within SLA and fallback-on-timeout. Anything outside those contracts goes to a monthly joint review.
Cost: at 5,000 TPS steady with 20,000 peak, you're looking at roughly $50–100K/month just for the online feature store at that request rate, plus model GPU costs. I'd budget accordingly and flag early that if we push sub-100 ms we're in a different cost regime.
Finally, one risk I'd name loudly: holiday peak is 4x steady. If we autoscale on traffic and the model inference is stateful (GPU warm-up takes 2 min), we'll see cold-start latency violations during the ramp. The cleanest mitigation is capacity-plan to peak, accept the cost, and run scheduled scale-up the week before known peaks. Autoscaling alone is the wrong tool here."
Why it's strong: latency budget decomposed, online-offline parity named, shadow → canary → ramp rollout, compliance signals (explainability, model-version pinning, retention period), an explicit contract with the model team, a cost estimate, and the closing "holidays + autoscaling = cold starts" callout — the kind of specific failure-mode naming that separates a staff answer from a merely senior one.
6. Behavioral STAR Frames
Every senior loop includes at least one dedicated behavioral round, and behavioral signal gets probed inside every technical round too. The STAR framework (Situation-Task-Action-Result) is the industry default — but most candidates apply it badly. This section is the correction.
STAR Is Not A Script. It's A Skeleton.
The frame has four parts, each with a strict time budget:
- Situation (2 sentences, ~20 seconds): where and when, role of others, why it mattered. No backstory dumps.
- Task (1 sentence, ~10 seconds): what you specifically were responsible for.
- Action (4–6 sentences, ~90 seconds): what you did, with verbs and decisions. Not what the team did — what you did.
- Result (2 sentences with numbers, ~30 seconds): the outcome with a metric, and what you learned or changed because of it.
Total spoken time: the budgets sum to about two and a half minutes — comfortably under 4 even with elaboration. If you can't tell the story in 4 minutes, either it's two stories or you're hiding something in the actions.
The Eight Canonical Frames
You need one prepared story for each of these frames: written on an index card, rehearsed out loud with a timer.
- Disagreement resolved. A time you disagreed with a senior person and worked through it. Tests influence without authority, emotional regulation.
- Failure owned. A significant technical failure that was your responsibility. Tests ownership and growth mindset.
- Trade-off under pressure. A time you had to ship something imperfect because of a deadline or cost constraint. Tests judgment and pragmatism.
- Influence across teams. A time you got people outside your team to do something they weren't planning to do. Tests scope and communication.
- Mentored someone. A time you materially raised the skill of another engineer. Tests teaching ability — critical at staff+.
- Said no to a stakeholder. A time you declined a reasonable-sounding ask because the engineering cost wasn't worth the value. Tests principled pushback.
- Led through ambiguity. A time the problem was undefined and you had to define it. Tests senior-level scoping.
- Drove alignment. A time multiple teams were working at cross-purposes and you moved them to a shared plan. Tests staff-level scope.
Worked Example — "Disagreement Resolved"
Bad version: "The lead architect wanted to use Kafka, I wanted to use Pulsar. We argued about it and eventually settled on Kafka."
What's wrong: no stakes, no actions, no outcome. Reads like losing the argument and still submitting it as an answer.
Average version: "Our lead architect wanted to standardize on Kafka. I'd used Pulsar at my previous company and thought the geo-replication story was better for our multi-region rollout. I wrote a short doc comparing the two and we discussed it in a design review. We ended up going with Kafka because of team familiarity. I learned that tool familiarity beats feature-set at most decisions."
What's right: structured, correctly humble ending, actual lesson. What's missing: no specific actions, no tension beyond the abstract. Forgettable.
Strong version: "[Situation] At a payments company in 2024, we were about to stand up a new event platform for multi-region fraud detection. Our principal engineer made an early decision on Kafka because the team had just finished a six-month upgrade and felt battle-tested. [Task] I was the tech lead for one of three consumer teams and had four weeks of experience running Pulsar at my previous job. I thought geo-replication would matter more than we were weighting it.
[Action] I didn't push back in the meeting — I wanted to do the work first. I spent two evenings writing a six-page doc comparing the two on five axes: throughput ceiling, ops burden, geo-replication correctness under partition, cost at our projected scale, and team ramp-up. I deliberately included our fraud team's exact SLAs and projected a twelve-month cost envelope. I shared it with the principal engineer first, in a private 1:1, with the explicit opener: 'I'm probably wrong but I want you to check my math before I share this.' He caught one real error in my cost model and reshaped the throughput section. Then we shared the revised doc with the broader group.
[Result] We went with Kafka for the core platform but adopted Pulsar for the cross-region mirror because it met a specific geo-replication need that Kafka didn't cover cleanly. I would have lost a pure-principle argument. The written doc, and asking him to review it first, turned the disagreement from 'who wins' into 'what's right.' Three months later the principal engineer asked me to co-author the next major platform doc. [Lesson] A thing I do now as a default: when I disagree with a senior decision, I write the case as if I'll be proven wrong, and I give the person I disagree with the first review."
Why this version works: specific numbers (six pages, two evenings, five axes, twelve-month envelope), a specific sentence the candidate said out loud ("I'm probably wrong but…"), a nuanced outcome (not a clean win, a compromise that served both concerns), and a closing "what I do now" that signals durable behavioral change.
Worked Example — "Failure Owned"
Weak opening: "I made a mistake once where…"
Strong opening: "In 2023 I pushed a backfill at 3pm on a Thursday that doubled the revenue numbers on the executive dashboard for six hours. The CFO saw it before we caught it."
The strong opening names the impact in the first sentence. No warm-up, no hedging. The rest of the story then earns the right to be heard.
Three rules for failure stories:
- Own it in the first sentence. "I pushed," not "we had an incident where."
- Name the specific decision that failed, not a vague "we didn't have enough tests." What decision — yours — would you make differently?
- The fix is systemic, not just personal. "I added a test" is junior. "I proposed, and we shipped, a pre-deploy data-diff gate that catches this class of error at the CI layer" is senior.
Anti-Patterns — The STAR Failures
- The we-drift. You start with "I" and by minute two you're saying "we" for everything. The interviewer loses track of your specific contribution. Fix: every paragraph of the Action section must contain the word "I" with a verb.
- The hero arc. Every story ends with you being thanked, promoted, or proven right. It rings false. Real senior stories include at least one "what I'd do differently" in the Result.
- The silent co-star. You never name the other humans in the story. Interviewers reading between the lines assume you can't work with people. Fix: name roles (not real names) — "the staff engineer on the adjacent team," "the product manager new to our group."
- The cliffhanger result. Your Result is "and then I changed jobs" or "and then the project got reprioritized." Pick a different story. The Result section must actually resolve.
- The date-less incident. "At a previous company, some time back…" is unanchored and hard to assess. Always anchor in a year and a team size.
The Cross-Round Story Portfolio
Across a 4-round loop, you'll have maybe 3–4 chances to tell a prepared story, plus 2–3 smaller "tell me about a time" asks woven into technical rounds. You want to show different sides — not tell the same story in three different frames.
Map your eight prepared stories to at least four distinct skills: technical depth, cross-team influence, mentoring, and operational judgment. If you look at your portfolio and three of the eight are "I debugged a hard incident," rotate. Variety is what hiring committees reward.
7. Decision Frameworks — The Cheat Sheet
Many interview rounds test whether you can make a reasoned decision between two legitimate options, not whether you know the options exist. These are the most common forks and the axis each hinges on. Memorize the frames, not the conclusions — every real decision depends on context, and the reasoning is what gets scored.
Batch vs Streaming
The decision axis is value of freshness vs cost of continuous compute. If the downstream consumer acts on the data within minutes, streaming. If a human reads it daily or a model retrains weekly, batch. The trap: candidates default to streaming because it sounds modern. Real answer: streaming is roughly 3–10x the ops overhead of batch and should be justified by a concrete business lever, not a feeling.
A clarifying question that works in any round: "Who acts on this data, and how soon after arrival?" If the answer is "a dashboard refreshed at 9am" — batch. If it's "a fraud decision in 200 ms" — streaming. If it's "a data scientist who runs queries ad hoc" — batch with a small streaming tier for the last hour.
Normalize vs Denormalize
The decision axis is update frequency vs read pattern fan-out. Normalize when the source of truth changes often and the consumers are narrow. Denormalize when the source is stable and the consumers are many and varied. In a lakehouse, denormalized wide tables are almost always the right answer for analytics layers: columnar storage makes the read cost of unused columns near-zero, and eliminating joins is the dominant win in the access-pattern trade-off.
Ask: "Is this table read a thousand times for every write?" If yes, denormalize aggressively. "Is the write source a system of record that will audit every change?" If yes, normalize the source and denormalize only the consumer layer.
Build vs Buy
Three questions in order. One: is this a core competency of our business? If yes, build — the control and optionality matter more than the cost. Two: does an off-the-shelf option meet 80% of our needs at 20% of the cost? If yes, buy, and put the 20% engineering savings toward the core. Three: what's the exit cost if we buy? If vendor lock-in on this layer would require a year-long migration, discount the "buy" option heavily.
Avoid the false economy of "we'll build a simple version." Every simple version becomes a complex version in three years with no maintainer. If you wouldn't staff a team of three to own it, don't build it.
Schema-on-Read vs Schema-on-Write
Schema-on-read is attractive in bronze/raw layers where the cost of rejecting a malformed record is higher than the cost of a downstream error. Schema-on-write is right in silver/gold where consumers depend on predictable typing. The senior pattern is schema-on-read at ingest with typed contracts at the bronze→silver transform — reject bad records to a quarantine table, alert the producer, keep the main pipeline flowing.
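The bronze→silver gate above can be sketched in plain Python — a minimal illustration under assumed names (`SILVER_SCHEMA` and `bronze_to_silver` are hypothetical); a real pipeline would do this in a Spark or Flink transform and write the quarantine rows to a table while alerting the producer:

```python
from datetime import date

# Typed contract for the silver layer (hypothetical columns for illustration).
SILVER_SCHEMA = {"order_id": str, "amount": float, "order_date": date}

def bronze_to_silver(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Route records that satisfy the typed contract to silver; quarantine the rest."""
    silver, quarantine = [], []
    for rec in records:
        # A record passes only if every contracted column exists with the right type.
        typed_ok = all(isinstance(rec.get(col), typ) for col, typ in SILVER_SCHEMA.items())
        (silver if typed_ok else quarantine).append(rec)
    return silver, quarantine

silver, quarantine = bronze_to_silver([
    {"order_id": "A1", "amount": 9.99, "order_date": date(2024, 5, 1)},
    {"order_id": "A2", "amount": "9.99", "order_date": date(2024, 5, 1)},  # string amount
    {"order_id": "A3", "amount": 4.50},                                    # missing column
])
assert len(silver) == 1 and len(quarantine) == 2  # bad rows parked, pipeline keeps flowing
```

The key property: a malformed record never stops the batch — it is parked with enough context to alert the producer while the main pipeline keeps flowing.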
Warehouse vs Lakehouse vs Operational Store
Warehouses (Snowflake, BigQuery) optimize for analytics read patterns with tight governance, fast answer times, and high cost-per-GB at scale. Lakehouses (Iceberg/Delta on S3) win when data volume crosses ~100 TB or when the same data serves both analytics and ML training. Operational stores (Postgres, DynamoDB) are for sub-10 ms online lookups, not analytics. The three should coexist; the anti-pattern is forcing one to do another's job because "we already have it."
Spark vs SQL Engine
Spark is the right tool when you're writing non-trivial logic (Python UDFs, iterative algorithms, complex joins over 100 GB+ data). A warehouse SQL engine is the right tool for analytics queries, dimensional joins, and anything a BI tool will render. The overlap zone — moderate-size ETL — goes to whichever platform your team already operates well. Ops maturity beats theoretical fit.
8. The Mock Interview Loop
Mock interviews are the single highest-leverage activity in prep. Most candidates do too few and do them wrong. This section fixes both.
Weekly Cadence
Target three full mocks in weeks 3–4 of the roadmap. Each mock should simulate a real round: 45 minutes, interviewer plays the role faithfully, no pausing to look things up. The post-mock review is as important as the mock itself — budget another 45 minutes to review the recording the same day.
Self-Tape Method (Solo Practice)
When you don't have a partner, record yourself solving a prompt out loud. Set a timer. No pausing. At the end, watch it back with a rubric. Count filler words ("um," "so," "basically"), measure time-to-first-assumption, note every moment you paused mid-sentence. The tape is uncomfortable to watch — that's the point.
Three specific drills:
- The 45-minute system design tape. Pick a prompt from §5. Solve on a whiteboard or blank page. Record. Review for: time budgeting, scoping questions, failure-mode coverage, and the closing trade-off.
- The 20-minute SQL tape. Pick a question from §4. Solve on paper with narration. Review for: clarifying questions asked, narration cadence, and silent math.
- The 4-minute behavioral tape. Pick a frame from §6. Tell the story. Review for: time budget, "I" vs "we" count, date/team-size anchoring.
Pair-Up Protocol
If you're mocking with another candidate, swap roles. You interview first (learning what good questions look like), then you're interviewed. The act of playing interviewer sharpens your own answers more than any study session.
Set rules before you start: interviewer won't help unless the candidate is truly stuck for 3+ minutes, candidate must narrate, no looking up anything. Afterwards, 10 minutes of specific feedback using the rubric below. Vague feedback like "you did great" is worthless.
Feedback Rubric — Five Dimensions, 1–4 Scale
| Dimension | 1 (weak) | 2 (mid) | 3 (senior) | 4 (staff+) |
|---|---|---|---|---|
| Scoping | Jumped to solution | One clarifying question | Restated + 2–3 questions | Named the real problem under the prompt |
| Narration | Silent stretches >30s | Narrated half the time | Narrated continuously | Narrated and invited collaboration |
| Technical depth | Buzzword-level | Knew the mechanism | Knew the mechanism and the trade-off | Invented a framing to compare mechanisms |
| Failure thinking | Happy path only | Named one failure mode | Systematic failure analysis | Named the class of failures + mitigation |
| Close | Trailed off / time ran out | Summarized the answer | Summarized + one trade-off | Surfaced a tension the interviewer didn't ask about |
Score yourself honestly. A 3 across all five dimensions is a clean senior pass. A 4 anywhere is Staff+ signal. A 1 anywhere is a gap that must be closed before the real loop — not after.
9. The Pre-Interview Checklist
This is the list to run through the week before the loop.
Resume
- For every bullet: scale numbers, your verbs, the trade-off, the failure mode, the "would do differently."
- One sentence — just one — that captures your career arc. Interviewers ask this as an opener in 80% of rounds.
- Remove anything you can't defend under drill-down. If a bullet mentions Kafka, you must be ready for 20 minutes on Kafka.
Project Deep-Dives
- Pick your three strongest projects. Prepare a 2-minute pitch and a 10-minute deep-dive for each.
- The deep-dive must include: architecture diagram (drawn from memory, on paper), the decision you'd revisit, the failure you recovered from, the metric you moved.
- Rehearse out loud. Time yourself.
Company-Specific
- Read at least one engineering blog post from the target team. Pick one specific technical claim to reference in the loop.
- Know the team's stack. If they use a specific cloud, brush up on its nuances the night before.
- Have three reverse questions prepared (see §F3 above).
Logistics
- Test your video, mic, screen-share 48 hours out. Do not test them 10 minutes before the round.
- Second monitor — one for the prompt, one for your scratch notes. Do not screen-share the scratch monitor.
- Water, snack, silent phone. The loop is 4+ hours and most candidates crash at hour 3 from caffeine and dehydration.
The Day Before
- No new study. Re-read your own notes.
- Sleep 8 hours. Interview performance drops measurably below 7 hours.
- Plan your morning routine to the minute: wake, shower, coffee, walk, setup, first round. Remove decisions from the morning.
10. Day-Of Playbook
Morning
- Exercise for 20 minutes, even lightly. It measurably improves cognitive performance for the next 4–6 hours.
- Breakfast with protein. Skip the heavy carbs — the crash will hit mid-round.
- Review your one-page "stories portfolio" — the eight STAR frames, one sentence each. Not the full stories; just the index.
15 Minutes Before Each Round
- Stand up, move, breathe. Do not re-read notes. You already know what you know.
- Look at your three reverse questions. Pick which one fits this round.
- Water. Bathroom. Phone on silent.
During The Round
- Open with a warm "thanks for making time, excited to dig in." Not "let's get started" — this is a conversation, not a meeting.
- In the first 2 minutes, say one thing that isn't an answer to a question. It humanizes you. "I noticed you mentioned [X] in your bio — I've been curious about that."
- Narrate your thinking. Silence is a scoring vacuum.
- If you don't know something: "I haven't worked with that specifically, but based on [related thing] I'd guess it works like…" Buy time to think by demonstrating how you'd think.
- Watch the interviewer's body language / tone. If they're restless, compress. If they're leaning in, go deeper.
- Always leave 3 minutes at the end for questions.
Between Rounds
- Stand up, stretch, breathe. Do not ruminate on the previous round. The next interviewer has not talked to the last interviewer yet — it's a clean slate.
- Write one sentence of notes on the previous round for the post-loop debrief, then close the notebook.
- Walk for 3 minutes if time allows. Oxygen and blood flow matter.
Post-Loop
- Write a 30-minute debrief: one page per round, covering what you were asked, what you answered, what you wish you'd answered.
- Send a brief thank-you within 24 hours to each interviewer, referencing one specific thing they said. Generic thank-yous are neutral; specific ones are memorable.
- Don't re-litigate the loop mentally. You're done. Wait.
11. Tricky Behavioral Questions — Model Answers
Managers ask a specific class of behavioral questions that don't fit the neat STAR frames above. They are deliberately probing for the seams: how you handle a peer who won't back down, how you revive an idea after it's been shot down, how you absorb tough feedback without losing your footing. This section gives the specific moves, with worked examples and phrases you can actually say in the room.
The underlying skill these questions test isn't "do you avoid conflict" — it's can you stay in the conversation when it gets uncomfortable without either capitulating or escalating. Most candidates fail by doing one or the other.
11.1 — "Tell me about a time you had a conflict with a peer"
What the interviewer is really probing
Three things at once: (a) can you surface disagreement directly, or do you route around it; (b) do you separate the person from the position; (c) do you walk away with a working relationship intact. The candidates who fail this question do so by describing a conflict where they were entirely right, the other person was entirely wrong, and the outcome was that the other person "came around." That sounds like a win; to experienced interviewers it signals weak empathy and low collaboration.
The move — the four-beat answer
- Name the tension in neutral terms. "We disagreed on whether to refactor the upstream service before building the new feature."
- State your actual conviction — and acknowledge theirs. "I was confident the new feature would accumulate tech debt if we skipped the refactor. My peer was equally confident that the refactor would burn three sprints for marginal gain."
- Describe the move you made that took the heat out. "I proposed we spike the feature without the refactor for one sprint, instrument the specific debt I was worried about, and let the data tell us whether to refactor before continuing."
- Close with the relationship, not just the outcome. "We ended up shipping on a hybrid plan. Three months later he was the first person I asked when a related decision came up."
Sample phrases that land
- "I notice we're solving two different problems — can we agree on which problem we're actually disagreeing about?" This single sentence defuses more engineering conflicts than any other.
- "You might be right, and if you are, here's what I'd want to see in the data to know." Concedes possibility without conceding the decision.
- "Let me restate your position back — tell me if I have it right before I respond." Most conflicts are people talking past each other; this breaks the pattern.
11.2 — "What if the peer is cocky and won't back down?"
What the interviewer is really probing
Whether you escalate at the right threshold. "I went to my manager" too early signals avoidance; "I just kept pushing" signals bad judgment. The answer lives in the middle: you invest meaningfully in the 1:1 path, you collect evidence, and you escalate with a process proposal, not a person complaint.
The move — four stages, named
- Stage 1 — ask for the underlying principle. Cocky disagreement is often surface-level. Ask: "If the data showed X, would your position change?" If the answer is "no, regardless of data" — you now know it's not a technical disagreement, it's identity. Route accordingly.
- Stage 2 — take the work offline. "Let's not solve this in the standup. I'll put the comparison in a doc by Friday, you add your counter, we regroup Monday." The written form neutralizes verbal dominance games.
- Stage 3 — bring a third technical voice. Not your manager. A respected senior outside both of your reporting lines. Frame it as "we both respect X, let's get their read." This is very different from escalation — you're enlarging the decision-making set, not appealing to authority.
- Stage 4 — genuine escalation. If stages 1–3 fail and the decision must be made, surface to both managers together with a written doc that fairly represents both positions. The last move is always process-oriented: "we need a decision, here's the doc with both sides, we need you to adjudicate."
The sentences that work on a cocky peer
- "I hear your conclusion. I don't yet hear the reasoning that got you there — can you walk me through it?" Invites specificity without conceding anything.
- "We've been going in circles on this for 20 minutes. Let's write it up instead." Breaks the verbal dynamic entirely.
- "I may be wrong. Tell me what I'd need to believe to agree with you." Surprisingly disarming; invites them to articulate the model, which often exposes where the disagreement actually lives.
What not to say
- "I'll defer to you on this one." — you just told the interviewer you fold under pressure.
- "He was just being a jerk." — even if true, you've revealed you can't frame people charitably.
- "I went to my manager." — unless you've first described the 1:1 investment. Manager-first is a red flag.
11.3 — "You presented a new idea. One peer is hell-bent on rejecting it, and he has reasons. How do you handle it?"
What's actually being tested
Whether you can distinguish good-faith opposition from obstruction, and whether you can use good-faith opposition to make the idea better instead of defending it. The best idea-pushers treat a rigorous objector as a gift. That mindset is a strong hire signal.
The move — steelman then separate
- Steelman their objection out loud. "Let me restate what I hear you saying, and tell me if it's right. You believe the approach is the wrong move because of [specific reason 1], [specific reason 2], [specific reason 3]. Did I miss anything?" Do not skip this. Most objectors are never accurately restated back — just doing it creates real goodwill.
- Ask what evidence would change their mind. "If [specific outcome], would that address [specific objection]?" Two outcomes: (a) they give you a concrete test, which is a gift — now you can address it; (b) they can't name any evidence that would change their mind, which means the objection isn't actually technical. Now you know what you're working with.
- Separate the decidable from the undecidable. Some parts of the objection can be settled with a prototype, benchmark, or pilot. Others are pure judgment calls where reasonable people differ. Address the decidable ones with actual work; name the judgment calls as judgment calls and propose how the org should decide them (usually: someone with the authority to adjudicate decides, and you commit either way).
- Invite them to co-own the next step. "If we ran the pilot for four weeks with [metric] as the kill criterion, would you be willing to co-design the pilot?" The skeptic who co-designs the pilot becomes your most credible advocate or your most credible killer — either way you're unstuck.
Phrases that change the room
- "You're asking a good question — I don't have the answer yet. Give me two days to come back with data." Protects your credibility better than faking an answer.
- "If we're going to kill this, let's be clear on the criteria — what would have to be true for us to revisit?" Turns a rejection into a conditional rejection, which is recoverable.
- "I'd rather we killed this for the right reason than ship it and regret it." Signals that you're attached to the outcome, not the idea.
The failure mode
The worst thing to do is double down and try to "win" by sheer persistence. Even if you win the meeting, you've spent political capital that took years to accumulate. The senior move is to separate being right from being heard — and to recognize that an idea killed today can be revived in three months with better data.
11.4 — "Tell me about a time you received feedback that was hard to hear"
What the interviewer is probing
Whether you can hold feedback without either (a) performing humility theater or (b) getting defensive. The bad answer: "I got feedback that my communication was too blunt and I realized I should be nicer." The interviewer has heard that 400 times and it lands as canned.
The senior answer shape
Name the feedback specifically. Name your initial reaction honestly — including the part where it stung. Name what you did with it over the next 90 days, with a specific behavior change. Close with what you'd do differently now if you got the same feedback earlier.
"A director told me in my six-month review that I wrote beautiful design docs that no one finished reading — they were too long and too exhaustive for anyone but me. My first reaction, honestly, was defensive: I thought the docs were my strength. I sat with it for a week before I could see that what felt like thoroughness to me felt like friction to readers. I rewrote my last three docs to cap at one page for the executive summary, with a linked appendix for the depth. Response rate on my doc reviews roughly tripled. The feedback I wish I'd gotten six months earlier was the specific diagnosis — 'too long' is easy to hear once, hard to internalize; 'nobody finishes them' forced me to re-examine the format."
11.5 — "Tell me about a decision you made that turned out to be wrong"
The honest answer
Bad answers for this question are versions of "I took a risk that didn't pay off but I'd do it again." That's not a mistake — that's a risk. The interviewer wants an actual wrong call you made.
Good answer shape: name the decision, name your reasoning at the time, name the signal you had that you ignored or under-weighted, name the outcome, name what you do differently now when that signal reappears.
"In 2023 I picked PostgreSQL as the primary store for a new system where the read pattern was going to be 98% aggregated analytics. My reasoning was that PostgreSQL was battle-tested and we had the operational expertise. The signal I under-weighted: our own internal latency requirements said p99 under 200 ms for queries touching 30 days of data, and I did not actually benchmark whether PostgreSQL could hit that with the data we'd be accumulating. Six months in, a BI dashboard that needed to scan 500 M rows was taking 12 seconds. We migrated to a columnar warehouse, which was the right answer from day one. What I do now: if the access pattern is 90%+ analytics at any projected scale, I start from the analytics store and justify the OLTP side, not the other way around."
11.6 — "Describe a situation where you had to push back on a manager"
What's actually being tested
Whether you can disagree with authority without either rolling over or becoming a problem. Most candidates err in one direction or the other — both are disqualifying for staff+.
The shape of the good answer
Your manager asked for something that you had reason to believe was the wrong call. You invested in understanding their reasoning before pushing back. You made your counter-case in writing with specifics, not in a meeting with heat. You accepted the final decision whether it went your way or not, and if it didn't go your way you executed as if it were your idea.
That last sentence — "I executed as if it were my idea" — is what separates senior from staff. Pushing back is an IC skill. Executing the manager's call with your full effort after you pushed back is a leadership skill. Demonstrate both.
11.7 — "What do you do when you realize mid-project that the plan won't work?"
Why this question gets asked
Most real projects do not go to plan. Interviewers want to see whether you escalate early and clearly, or whether you try to hero-ship and only surface the problem once it's catastrophic.
The four-move answer
- Quantify the miss. Don't say "we're behind." Say "we're three weeks behind on an eight-week plan, and the current trajectory puts us at 12 weeks if nothing changes."
- Reassess the goal, not just the plan. Sometimes the goal itself has drifted and the old plan was chasing the wrong thing. Restate what we're actually trying to achieve.
- Present three options with honest trade-offs. "Option A: keep the scope, slip the date. Option B: cut scope X and Y, hit the original date. Option C: add two engineers, hit the original date with full scope. I recommend B because…" Presenting options is respectful; arriving with only one feels like you've already decided.
- Escalate to whoever owns the trade-off. Schedule the decision, don't wait for it to surface. Attach a deadline: "we need a decision by Thursday to preserve optionality on Option C."
11.8 — The meta-pattern these questions share
Across all of the above, the interviewer is testing a single underlying capability: can you hold tension in a working relationship without either collapsing into it or blowing it up? The candidates who score highest on behavioral rounds never have the cleanest stories — they have stories where both people had real points, where the outcome was messy, where the candidate's growth is named specifically, and where the relationship survived and often improved.
Three practices that make this easier to do live in the interview:
- Name the other person charitably before you describe the disagreement. "A senior engineer I respected…" is much stronger framing than "this guy who insisted…". The framing signals to the interviewer what kind of colleague you are.
- Include at least one sentence where you were wrong, misjudged, or had to revise. Stories with a 100% success rate for the narrator are suspect.
- End with what you'd do differently now or how the experience changed a default you operate with. Without that, the story is an anecdote; with it, it's evidence of growth.
Closing Note
Interviews reward a specific kind of preparation — the kind that produces fluent speech under pressure, not encyclopedic knowledge in silence. Everything in this part is aimed at the first, not the second.
If you do the four-week roadmap, run three mocks, prepare eight STAR frames, and memorize the first 30 seconds of your answer to each scenario in §5 — you will walk into the real loop with more readiness than most candidates ever have. That's the bar to aim for.
Good luck.