Part 0 · What Data Engineering Actually Is
95% of data engineering interview loops gate on SQL and Python before you ever see a system-design board.
Most candidates still prepare for these rounds the way they would prep for LeetCode — random questions, no structure, no theory of what the interviewer is actually scoring. That is why so many strong candidates wash out at the screen.
This article is the prep I wish I had had: 100 questions on one realistic dataset, ordered simple → complex, framed as the prompts a Meta / Amazon / Apple / Netflix / Google interviewer would actually give you. Every question shows the source data, the code, the actual computed output, and the failure modes a senior interviewer probes after your first answer lands.
If you only have one weekend, jump to the Top-20 must-do questions. If you have two weeks, do them all. If you are interviewing for an Ads / Marketing Science role, the Pre-/Post-iOS-ATT bonus section at the bottom is required reading.
Most people meet a data engineer through their absence. The dashboard is stale. The number doesn't match the other number. The ML model trained on yesterday's distribution and now nobody trusts the score. Somewhere upstream, a pipeline shrugged.
Data engineering is the discipline of moving data from where it is born — clicks, payments, sensors, logs, third-party APIs — to where decisions get made, and doing it with enough rigor that the people downstream (analysts, scientists, ML systems, executives, regulators) can build on top without flinching.
A data engineer designs, builds, and operates the systems that turn raw events into reliable, queryable, governed datasets — at the latency, cost, and quality the business has agreed to.
The five things you actually do
- Ingest. Pull data from operational systems (Postgres, Kafka, S3 drops, SaaS APIs) without losing rows, without double-counting, and without melting the source.
- Model. Decide what a ‘user’, an ‘order’, a ‘session’, a ‘subscription’ means — and shape the data so a hundred analysts get the same answer.
- Transform. Write SQL/Python/Spark that turns raw events into clean facts and dimensions, idempotently, on a schedule, and re-runnable.
- Serve. Make those datasets available — warehouse tables, lakehouse files, low-latency caches, feature stores, BI semantic layers.
- Operate. Monitor freshness, completeness, schema drift, cost, lineage; respond when something breaks; explain why a number changed.
This article focuses sharply on the two skills every loop tests: SQL and Python. If you cannot ace the SQL screen and the coding round, the design round never happens.
Part 1 · How to Prepare for SQL & Python Rounds
Most candidates over-study the wrong things — memorizing every window function, every pandas method — and under-study the things interviewers actually weight: clarifying the question, picking the right primitive, narrating tradeoffs, and writing code that runs.
1. Clarity — Did you restate the problem? Did you ask about edge cases (nulls, ties, late data, time zones)?
2. Correctness — Does the query / function actually return the right answer on the obvious test cases?
3. Reasoning — Can you talk through why you picked this join / window / pattern? Can you compare it to one alternative?
4. Production sense — Would this scale? Is it idempotent? What happens at 100x the volume? What happens if the source schema changes?
How to use this question bank
Each of the 100 questions below is a closed card by default — click the title to expand. Inside you get five pieces:
- Scenario — what a real PM, analyst, or ops lead would actually ask.
- Source — a sample of the input table(s) so you can see what the data looks like.
- Code — the SQL or Python answer, deliberately written for readability over cleverness.
- Output — the actual result of running that code on the dataset (computed at build time, not hand-typed).
- Why it matters — the production angle, edge case, or interviewer follow-up.
Read the scenario, write your own answer, then compare. The compare-after-attempt step is where the learning lives.
Part 2 · The Practice Dataset
Every question below uses the same realistic e-commerce dataset — 15 tables totaling roughly 8,500 rows — small enough to run on your laptop, structured enough to surface real interview patterns. The CSVs and a Postgres / MySQL setup script live in interview/sample dataset/.
| Table | Rows | Columns | Used for |
|---|---|---|---|
| users | 200 | user_id, name, email, signup_date, country | Anti-joins, cohorts, country roll-ups |
| purchases | 300 | purchase_id, user_id, amount, purchase_date, product_category | Revenue, category roll-ups |
| user_profiles | 334 | user_id, name, city, updated_at | Dedup (intentional duplicates) |
| weekly_metrics | 52 | week_start, wau | WoW growth with LAG |
| sessions | 800 | session_id, user_id, platform, session_date, duration_sec | CASE pivots, platform splits |
| products | 50 | product_id, product_name, category, revenue | Top-N per group |
| clickstream | 1,971 | user_id, event_time, page | Sessionization (30-min idle) |
| events | 1,014 | event_id, user_id, event_type, event_date | Funnel: signup → activation → purchase |
| logins | 1,925 | user_id, login_date | Consecutive streaks (gaps & islands) |
| daily_metrics | 120 | ds, daily_revenue, daily_dau | Running totals, moving averages |
| daily_metrics_gapped | 105 | ds, daily_revenue, daily_dau | Date spine — 15 missing dates to fill |
| employees | 28 | employee_id, full_name, manager_id, department, salary | Self-joins, recursive org traversal |
| orders | 500 | order_id, user_id, order_date, amount | Cumulative distribution, correlated subquery |
| product_dim | 68 | product_id, product_name, price, effective_date, expiry_date, is_current | SCD Type 2 |
| skewed_table | 1,000 | id, join_key, value | Skew detection (~70% rows on hot_key) |
The five tiers
- Tier 1 · Single table, SELECT/WHERE/ORDER/LIMIT, basic aggregates.
- Tier 2 · Multi-table joins, GROUP BY, anti-joins, semi-joins.
- Tier 3 · Window functions, CASE, subqueries, dedup.
- Tier 4 · CTEs, sessionization, funnels, cohorts, date spines.
- Tier 5 · Recursive CTEs, SCD2, skew, Spark, idempotency.
Top 20 · If you only study 20 questions, make it these
These twenty are the highest signal-to-noise ratio in the bank. They cover ~80% of the patterns that show up in MAANG SQL screens and Python coding rounds, in roughly the order you should master them. Skim the rest at your peril; do these and the interviewer will run out of pattern types before they run out of clock.
Q11 · Anti-join (users with no purchases)
Why: tests NULL discipline. The trap is NOT IN; the right answer is LEFT JOIN ... IS NULL or NOT EXISTS. Almost every loop probes this pattern.
Q22 · Dedup with deterministic tiebreak
Why: tests idempotency thinking. Tied updated_at without a secondary sort gives non-reproducible output across backfills — the bug that surfaces only in production.
Q23 · LAG for WoW growth
Why: the most-asked window function in any growth-team loop. Bonus signal: knowing to NULLIF the divisor for the first row.
Q25 · Top-N per group
Why: tests scaling logic. A correlated subquery works on small data; DENSE_RANK() OVER (PARTITION BY ...) is the only answer that scales. 100× cost difference at 1B rows.
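For reference, a minimal sketch of the window form on the practice products table (top 3 per category by revenue; a sketch, not the worked Q25 solution):
WITH ranked AS (
    SELECT product_id, product_name, category, revenue,
           -- one rank sequence per category; ties share a rank, nothing is skipped
           DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rk
    FROM products
)
SELECT product_id, product_name, category, revenue
FROM ranked
WHERE rk <= 3
ORDER BY category, rk;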
Q34 · Funnel with monotonicity guards
Why: the senior catch. Most candidates compute counts per stage and call it a funnel. The interviewer wants the activation_at >= signup_at guard — without it your conversion rate exceeds 100% in some weeks.
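A per-user sketch of the guard on the events table (FILTER is Postgres / DuckDB; the full Q34 answer also needs the weekly grouping the follow-up implies):
WITH stages AS (
    SELECT user_id,
           MIN(CASE WHEN event_type = 'signup'     THEN event_date END) AS signup_at,
           MIN(CASE WHEN event_type = 'activation' THEN event_date END) AS activation_at,
           MIN(CASE WHEN event_type = 'purchase'   THEN event_date END) AS purchase_at
    FROM events
    GROUP BY user_id
)
SELECT COUNT(*) FILTER (WHERE signup_at IS NOT NULL)      AS signed_up,
       -- the monotonicity guard: only count activations that happened at or after signup
       COUNT(*) FILTER (WHERE activation_at >= signup_at) AS activated,
       COUNT(*) FILTER (WHERE activation_at >= signup_at
                          AND purchase_at >= activation_at) AS purchased
FROM stages;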
Q36 · Sessionization (30-min idle)
Why: the cumulative-sum trick on (time - prev_time) > threshold. This exact pattern shows up in Reels, Stories, IG Live, YouTube watch-time, every video product. Memorize it.
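The trick itself, sketched on the clickstream table (assumes event_time is a timestamp):
WITH gaps AS (
    SELECT user_id, event_time,
           event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS idle
    FROM clickstream
)
SELECT user_id, event_time,
       -- new session whenever the idle gap exceeds 30 minutes (or there is no previous event)
       SUM(CASE WHEN idle IS NULL OR idle > INTERVAL '30 minutes' THEN 1 ELSE 0 END)
           OVER (PARTITION BY user_id ORDER BY event_time) AS session_seq
FROM gaps;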
Q37 · Streaks (gaps & islands)
Why: the canonical pattern. date - row_number() is constant within consecutive runs. If you have to think about this in the interview, you have not internalized it — drill it.
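The same pattern sketched on the logins table (DISTINCT first, because duplicate logins on the same day break the arithmetic):
WITH runs AS (
    SELECT user_id, login_date,
           -- constant within a run of consecutive days
           login_date - CAST(ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS INT) AS grp
    FROM (SELECT DISTINCT user_id, login_date FROM logins) d
)
SELECT user_id, MIN(login_date) AS streak_start, MAX(login_date) AS streak_end, COUNT(*) AS streak_days
FROM runs
GROUP BY user_id, grp
ORDER BY streak_days DESC;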
Q41 · SCD2 as-of join
Why: the senior data-modeling test. The invariant matters: exactly one row per (entity, point-in-time). Half-open [effective, expiry) windows or you get duplicate matches at boundaries.
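The join predicate, sketched against a hypothetical order_items fact (the sample dataset has no fact table carrying product_id, so treat the fact side as illustrative):
SELECT f.order_id, f.order_date, d.price AS price_at_order_time
FROM order_items f                 -- hypothetical fact: order_id, product_id, order_date
JOIN product_dim d
  ON d.product_id = f.product_id
 AND f.order_date >= d.effective_date                           -- half-open window:
 AND (f.order_date < d.expiry_date OR d.expiry_date IS NULL);   -- [effective, expiry)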
Q42 · Recursive CTE for org chart
Why: tests whether you have ever built a hierarchical pipeline. Bonus signal: volunteering the cycle-guard pattern (depth < 20 + visited-set) before the interviewer asks.
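A minimal sketch on the employees table with the depth cap (the visited-set variant additionally accumulates an array of employee_ids and checks membership):
WITH RECURSIVE org AS (
    SELECT employee_id, full_name, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL              -- root: the CEO
    UNION ALL
    SELECT e.employee_id, e.full_name, e.manager_id, o.depth + 1
    FROM employees e
    JOIN org o ON e.manager_id = o.employee_id
    WHERE o.depth < 20                    -- cycle guard: cap traversal depth
)
SELECT employee_id, full_name, depth
FROM org
ORDER BY depth, employee_id;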
Q47 · Reconciliation against raw facts
Why: the L5 signal. A metric is only as trusted as its reconciliation report. Ship the audit alongside the metric — it is a habit MAANG senior DEs all share.
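The shape of such an audit, sketched by tying daily_metrics.daily_revenue back to purchases (in this synthetic dataset the two tables may not tie out; the shape of the audit is the point):
SELECT m.ds,
       m.daily_revenue                              AS reported_revenue,
       COALESCE(SUM(p.amount), 0)                   AS raw_revenue,
       m.daily_revenue - COALESCE(SUM(p.amount), 0) AS diff
FROM daily_metrics m
LEFT JOIN purchases p ON p.purchase_date = m.ds
GROUP BY m.ds, m.daily_revenue
ORDER BY ABS(m.daily_revenue - COALESCE(SUM(p.amount), 0)) DESC
LIMIT 10;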
Q11 · merge with validate=
Why: the cheapest production-bug-prevention pattern in pandas. Use it once after a silent m:m blow-up burns you, and you'll use it forever.
Q13 · Dedup with assert is_unique
Why: the senior habit. Never trust dedup by inspection. The assert is the answer.
Q21 · rolling with min_periods
Why: tests definition discipline. min_periods=1 is convenient but produces 1-day ‘averages’ for the head of the series. min_periods=7 is the honest answer. Pick deliberately.
Q28 · Sessionize in pandas
Why: identical mental model to the SQL form. Practice both side-by-side until you can switch fluently — in either round.
Q30 · Cohort retention table
Why: the shape that ships in every weekly growth review. Rows = cohort, cols = age, cells = % retained. Heatmap straight from this table.
Q32 · merge_asof for SCD2
Why: the SCD2 superpower in pandas. Solo this question and you understand half of dimensional modeling.
Q33 · Salt the hot key (skew mitigation)
Why: the senior Spark signal even in a pandas round. Knowing to salt both sides of the join is what separates “has read about Spark” from “has shipped a Spark job at scale.”
Q46 · DAG executor with toposort
Why: tests whether you have built a scheduler, not just used one. Maestro (Netflix) and Dataswarm (Meta) both started this way.
Q47 · LRU cache from scratch
Why: classic mid-level Python question. OrderedDict gives O(1) get/put; the doubly-linked-list-from-scratch version impresses the senior interviewer.
Q50 · Atomic daily-partition publish
Why: the publish-a-daily-fact-table pattern every senior DE has written 50 times. The shape of the answer matters more than the syntax.
Part 3 · 50 SQL Questions, Simple to Complex
Open a Postgres / DuckDB / SQLite shell, load the CSVs, and try each question on your own before reading the solution. Solutions use ANSI SQL with a Postgres / DuckDB flavor; portability notes call out where Spark / BigQuery / Snowflake differ. Every Output block was produced by actually running the query against the dataset at build time.
Tier 1 · Foundation
How many users do we have per country?
[Meta · IG Growth] An Instagram growth lead is deciding which language UI to localize next. They want a country split of the active user base before scoping the launch.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
SELECT country, COUNT(*) AS user_count
FROM users
GROUP BY country
ORDER BY user_count DESC;
country user_count
0 US 53
1 IN 33
2 BR 22
3 UK 20
4 FR 20
.. ... ...
11 in 2
12 1
13 US 1
14 UK 1
15 <NULL> 1
Why it matters: The bread-and-butter GROUP BY. Interviewers watch whether you alias the count, sort the result, and resist SELECT *.
Edge cases & gotchas (4)
- ‘Missing’ hides in five different shapes
Real-world country columns leak more than just NULL: '\N' (the default CSV null sentinel from COPY ... FORMAT csv), the literal string '<NULL>', empty strings, whitespace-only rows (' '), and 'NA' / 'N/A'. A naive COALESCE(country, '<NULL>') only catches the first one. The fix has to canonicalize all of them — and report which dirty variants existed so the DQ team can ticket the upstream.
Fix · SQL
-- Production-grade: collapse every flavor of "missing" into one bucket,
-- normalize case + whitespace so 'BR' and 'br' aggregate together,
-- and surface raw_variants so the DQ team can ticket the upstream owner.
WITH cleaned AS (
    SELECT
        CASE
            WHEN country IS NULL THEN '<NULL>'
            WHEN TRIM(country) = '' THEN '<EMPTY>'
            WHEN UPPER(TRIM(country)) IN ('\N', 'NULL', '<NULL>', 'NA', 'N/A') THEN '<NULL>'
            ELSE UPPER(TRIM(country))   -- normalize case + trim
        END AS country_clean,
        country AS country_raw
    FROM users
)
SELECT country_clean,
       COUNT(*) AS n,
       COUNT(DISTINCT country_raw) AS raw_variants,
       STRING_AGG(DISTINCT '''' || COALESCE(country_raw, 'NULL') || '''', ', '
                  ORDER BY '''' || COALESCE(country_raw, 'NULL') || '''') AS variants_seen
FROM cleaned
GROUP BY country_clean
ORDER BY n DESC;
-- Expected output collapses dirty variants into clean buckets,
-- with raw_variants flagging which buckets had multiple input forms.
Output
  country_clean   n  raw_variants           variants_seen
0            US  54             2             ' US', 'US'
1            IN  35             2              'IN', 'in'
2            BR  24             2              'BR', 'br'
3            UK  21             2             'UK ', 'UK'
4            FR  20             1                    'FR'
5            JP  18             1                    'JP'
6        <NULL>  17             2  '<NULL>', 'NULL', '\N'
7            DE   7             1                    'DE'
8       <EMPTY>   4             2                ' ', ' '
- Case-only differences (‘BR’ vs ‘br’)
GROUP BY country is case-sensitive on Postgres / DuckDB / Snowflake (with default collation). 'BR' and 'br' are two buckets — almost always a DQ bug, never the intent. Normalize with UPPER(TRIM(country)) at the GROUP BY boundary, but keep the raw value visible for the audit trail.
Fix · SQL
-- See variants per canonical bucket
SELECT UPPER(TRIM(country)) AS country_clean,
       COUNT(*) AS n,
       ARRAY_AGG(DISTINCT country ORDER BY country) AS raw_forms_observed
FROM users
WHERE country IS NOT NULL
GROUP BY UPPER(TRIM(country))
HAVING COUNT(DISTINCT country) > 1   -- only buckets with dirty forms
ORDER BY n DESC;
Output
  country_clean   n  raw_forms_observed
0            US  54           [ US, US]
1            IN  35            [IN, in]
2            BR  24            [BR, br]
3            UK  21           [UK, UK ]
4                 4               [ , ]
- Empty table returns no rows (not zero)
If users is empty, the query returns no rows — not a row with count=0. Wrap in a sentinel CTE if downstream needs a deterministic shape (e.g., a daily metrics job that UNIONs yesterday with today). Self-contained version below.
Fix · SQL
-- Empty-safe roll-up: always returns at least one row
WITH base AS (
    SELECT country, COUNT(*) AS n
    FROM users
    WHERE 1=0   -- simulates an empty source for demo
    GROUP BY country
)
SELECT country, n FROM base
UNION ALL
SELECT '<empty_table>', 0
WHERE NOT EXISTS (SELECT 1 FROM base);
Output
         country  n
0  <empty_table>  0
- COUNT(*) vs COUNT(country) diverge on NULL
COUNT(*) counts every row including those with NULL country. COUNT(country) skips NULLs. COUNT(DISTINCT country) also skips NULLs. The difference between the first two is your NULL count for free. Sentinel collision warning: in the output below, distinct_incl_null_unsafe equals distinct_excl_null — surprising at first, until you notice the dirty data already contains a literal string '<NULL>', so a COALESCE(col, '<NULL>') sentinel collides with a real value and adds nothing. The portable, sentinel-free fix is to count NULL as +1 explicitly. Avoid CHR(0) as a sentinel: Postgres rejects NUL bytes in TEXT/VARCHAR columns with SQL Error [54000]: null character not permitted. DuckDB allows it; Postgres does not.
Fix · SQL
-- Section 1 — counts side-by-side, all portable across
-- Postgres / DuckDB / Snowflake / BigQuery.
SELECT COUNT(*) AS rows_total,
       COUNT(country) AS rows_with_country,
       COUNT(*) - COUNT(country) AS country_null_rows,
       COUNT(DISTINCT country) AS distinct_excl_null,
       -- UNSAFE: this sentinel collides with the literal string '<NULL>' in the data
       COUNT(DISTINCT COALESCE(country, '<NULL>')) AS distinct_incl_null_unsafe,
       -- SAFE: pick a sentinel obviously not in the domain. Long + unlikely.
       COUNT(DISTINCT COALESCE(country, '__NULL_SENTINEL_8f3a__')) AS distinct_incl_null_safe,
       -- BEST: no sentinel at all. Add 1 if any NULLs exist.
       COUNT(DISTINCT country)
         + CASE WHEN COUNT(*) > COUNT(country) THEN 1 ELSE 0 END AS distinct_incl_null_explicit
FROM users;
-- Section 2 — what ARE the distinct values (so you can SEE the mess).
-- Quote each value so empty strings and whitespace become visible.
SELECT COALESCE(country, '__NULL__') AS country_value_or_null_marker,
       '''' || COALESCE(country, '__NULL__') || '''' AS country_quoted,
       COUNT(*) AS n_rows
FROM users
GROUP BY country
ORDER BY n_rows DESC, country_quoted;
Output
   rows_total  rows_with_country  country_null_rows  distinct_excl_null  distinct_incl_null_unsafe  distinct_incl_null_safe  distinct_incl_null_explicit
0         200                193                  7                  15                         15                       16                           16
──── result 2 ────
   country_value_or_null_marker  country_quoted  n_rows
0                            US            'US'      53
1                            IN            'IN'      33
2                            BR            'BR'      22
3                            FR            'FR'      20
4                            UK            'UK'      20
5                            JP            'JP'      18
..                          ...             ...     ...
10                           br            'br'       2
11                           in            'in'       2
12                                          ' '       1
13                           US           ' US'       1
14                       <NULL>        '<NULL>'       1
15                           UK           'UK '       1
Top 5 highest-paid employees with their department.
[Amazon · People Tech] A comp partner is calibrating the L5/L6 band before the next focal cycle. They want the top earners surfaced with department, to flag any outliers for HRBP review.
employee_id full_name manager_id department salary
0 100 Dana Chen NaN Engineering 250000.0
1 201 Eve Park 100.0 Engineering 180000.0
2 202 Frank Li 100.0 Engineering 175000.0
3 203 Grace Kim 100.0 Engineering NaN
4 301 IC_301 201.0 Engineering 116962.0
SELECT employee_id, full_name, department, salary
FROM employees
ORDER BY salary DESC
LIMIT 5;
employee_id full_name department salary
0 100 Dana Chen Engineering 250000.0
1 201 Eve Park Engineering 180000.0
2 202 Frank Li Engineering 175000.0
3 315 IC_315 Engineering 161966.0
4 312 IC_312 Engineering 145797.0
Follow-up: “What if two people tie at #5?” → discuss FETCH FIRST 5 ROWS WITH TIES or RANK() OVER (ORDER BY salary DESC).
Edge cases & gotchas (2)
- Ties at the cutoff
Plain LIMIT 5 silently drops one of two tied salaries. Use FETCH FIRST 5 ROWS WITH TIES (Postgres 13+, Snowflake) or RANK() to make ties explicit.
Fix · SQL
-- Postgres 13+: keep all ties at the cutoff
SELECT * FROM employees
ORDER BY salary DESC
FETCH FIRST 5 ROWS WITH TIES;
-- Portable: filter by RANK
WITH ranked AS (
    SELECT *, RANK() OVER (ORDER BY salary DESC) AS rk
    FROM employees
)
SELECT * FROM ranked WHERE rk <= 5;
Output
[error] ParserException: Parser Error: syntax error at or near "WITH"
LINE 3: FETCH FIRST 5 ROWS WITH TIES
                            ^
Postgres orders NULLs LAST in DESC by default; MySQL orders them FIRST. State your DB or be explicit with NULLS LAST.
Fix · SQL
SELECT * FROM employees
ORDER BY salary DESC NULLS LAST   -- portable across Postgres, Snowflake, BigQuery
LIMIT 5;
Output
   employee_id  full_name  manager_id   department    salary
0          100  Dana Chen         NaN  Engineering  250000.0
1          201   Eve Park       100.0  Engineering  180000.0
2          202   Frank Li       100.0  Engineering  175000.0
3          315     IC_315       203.0  Engineering  161966.0
4          312     IC_312       203.0  Engineering  145797.0
Total revenue across all purchases.
[Amazon · Retail Finance] Day-1 of an OP1 review: the S-team wants the simplest possible top-line for the deck — lifetime GMV booked through purchases.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
SELECT ROUND(SUM(amount), 2) AS total_revenue
FROM purchases;
total_revenue
0 127603.25
Edge case: If amount can be NULL, SUM already ignores nulls — but state that explicitly. Interviewers like when you pre-empt the null question.
Edge cases & gotchas (4)
- How every aggregate handles NULL (the full matrix)
COUNT(*)counts rows.COUNT(col)counts non-NULL values of col.SUM/AVG/MIN/MAXall skip NULLs entirely. The denominator surprise:AVG(col)drops NULL rows from both numerator AND denominator — not what you want if NULL means ‘real zero’. The all-NULL surprise:SUM(col)on a column where every row is NULL returnsNULL, not 0. The dataset has 3 NULL amount rows after the dirty-data injection, so we can show the divergence side-by-side.Fix · SQL-- Every aggregate's NULL behavior, on the same data: SELECT COUNT(*) AS rows_total, -- counts rows incl. NULL COUNT(amount) AS amount_non_null, -- skips NULL COUNT(*) - COUNT(amount) AS amount_null_rows, -- diff = NULL count COUNT(DISTINCT amount) AS distinct_amounts, -- skips NULL too ROUND(SUM(amount), 2) AS sum_amount, -- NULL-safe iff at least 1 non-null COALESCE(ROUND(SUM(amount),2),0) AS sum_amount_safe, -- always a number ROUND(AVG(amount), 2) AS avg_amount_pandas_default, -- mean of NON-NULL only ROUND(SUM(amount) / COUNT(*), 2) AS avg_treating_null_as_zero, -- different number! MIN(amount) AS min_amount, -- skips NULL MAX(amount) AS max_amount -- skips NULL FROM purchases;Outputrows_total amount_non_null amount_null_rows distinct_amounts sum_amount sum_amount_safe avg_amount_pandas_default avg_treating_null_as_zero min_amount max_amount 0 301 298 3 296 127603.25 127603.25 428.2 423.93 -228.72 99999.99 - AVG when the column is all NULL
AVG returns
NULL(not 0) on a partition where every row is NULL. If you guard withCOALESCE(AVG(x), 0)you risk hiding the ‘no data at all’ signal in a real zero. Better: emit COUNT alongside, or useNULLIF/ COALESCE deliberately with a label.Fix · SQL-- Per-category averages with explicit empty-handling SELECT product_category, COUNT(*) AS rows, COUNT(amount) AS rows_with_amount, AVG(amount) AS avg_raw, COALESCE(AVG(amount), 0) AS avg_zero_filled, CASE WHEN COUNT(amount) = 0 THEN 'no_data' ELSE CAST(ROUND(AVG(amount),2) AS VARCHAR) END AS avg_labeled FROM purchases GROUP BY product_category ORDER BY rows DESC;Outputproduct_category rows rows_with_amount avg_raw avg_zero_filled avg_labeled 0 Electronics 73 73 88.380822 88.380822 88.38 1 Clothing 64 64 1651.377344 1651.377344 1651.38 2 Food 59 59 87.792034 87.792034 87.79 3 Books 56 55 102.794909 102.794909 102.79 4 Sports 49 47 98.507447 98.507447 98.51 COALESCE(SUM(x), 0)vsSUM(COALESCE(x, 0))Both forms return the same number when the table has rows — SUM ignores NULLs either way. They diverge only on the empty input case:
•COALESCE(SUM(x), 0)applies the default to the aggregated result → returns0on empty (and NULL-only) tables.
•SUM(COALESCE(x, 0))replaces NULL with 0 row-by-row before summing → returns0on a NULL-only table, butNULLon a truly empty input (SQL standard: SUM of zero rows is NULL).
Pick by use case: dashboards that must always show a number (KPIs) want the outside form. Per-row imputation (e.g., treating NULL price as 0 for ledger math) wants the inside form.Fix · SQL-- Side-by-side on three scenarios: -- (a) the real table (has rows + 3 NULL amounts) -- (b) NULL-only table (all amount IS NULL) -- (c) empty table (zero rows) WITH null_only AS (SELECT * FROM purchases WHERE 1=0 UNION ALL SELECT 999, 1, NULL, '2025-01-01', 'demo'), empty_set AS (SELECT * FROM purchases WHERE 1=0) SELECT 'real_table' AS scenario, SUM(amount) AS sum_raw, COALESCE(SUM(amount), 0) AS coalesce_outside, -- final NULL -> 0 SUM(COALESCE(amount, 0)) AS coalesce_inside -- NULL row -> 0 then sum FROM purchases UNION ALL SELECT 'null_only_table', SUM(amount), COALESCE(SUM(amount), 0), SUM(COALESCE(amount, 0)) FROM null_only UNION ALL SELECT 'empty_table', SUM(amount), COALESCE(SUM(amount), 0), SUM(COALESCE(amount, 0)) FROM empty_set;Outputscenario sum_raw coalesce_outside coalesce_inside 0 real_table 127603.25 127603.25 127603.25 1 null_only_table NaN 0.00 0.00 2 empty_table NaN 0.00 NaN- Currency precision (float drift vs DECIMAL)
Storing amount as DOUBLE/FLOAT will lose cents at scale — SUM is non-associative on floats. The schema uses DECIMAL(10,2); confirm before reporting to finance.
Fix · SQL
-- SAFE: explicit DECIMAL cast prevents float drift
SELECT ROUND(SUM(amount)::DECIMAL(18,2), 2) AS revenue
FROM purchases
WHERE amount IS NOT NULL;
Output
     revenue
0  127603.25
Which products generated more than $1,000 in revenue?
[Amazon · Prime Day Merch] Merchandising wants the ‘winners shelf’ for the homepage rotation: every SKU that has cleared $1,000 in revenue.
product_id product_name category revenue
0 1 prod_1 Electronics 435.0
1 2 prod_2 Electronics 25.0
2 3 prod_3 Electronics 384.0
3 4 prod_4 Electronics 1054.0
4 5 prod_5 Electronics 485.0
SELECT product_id, product_name, category, revenue
FROM products
WHERE revenue > 1000
ORDER BY revenue DESC;
product_id product_name category revenue
0 40 prod_40 Sports 12777.0
1 28 prod_28 Clothing 9770.0
2 50 prod_50 Food 8731.0
3 30 prod_30 Clothing 7190.0
4 29 prod_29 Clothing 2972.0
.. ... ... ... ...
11 38 prod_38 Sports 2003.0
12 45 prod_45 Food 1827.0
13 12 prod_12 Books 1698.0
14 17 prod_17 Books 1565.0
15 4 prod_4 Electronics 1054.0
Why it matters: WHERE + sort. Trivial — but watch candidates who use HAVING here without a GROUP BY.
Edge cases & gotchas (2)
- WHERE vs HAVING
If revenue is a pre-aggregate, WHERE works. If you compute it in this query (SUM), you must use HAVING. Mixing them is a frequent interview trap.
Fix · SQL
-- Computed revenue: HAVING, not WHERE
SELECT product_category, SUM(amount) AS revenue
FROM purchases
GROUP BY product_category
HAVING SUM(amount) > 1000;
Output
  product_category    revenue
0            Books    5653.72
1         Clothing  105688.15
2      Electronics    6451.80
3             Food    5179.73
4           Sports    4629.85
Always confirm '> 1000' vs '>= 1000'. Off-by-one at category boundaries silently changes the report.
Average session duration by platform.
[Meta · IG Product] Before the notification redesign ships, product wants average session length by platform — does iOS actually engage longer than Android, controlled for sample size?
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
SELECT platform,
COUNT(*) AS sessions,
ROUND(AVG(duration_sec)::DOUBLE, 1) AS avg_duration_sec
FROM sessions
GROUP BY platform
ORDER BY avg_duration_sec DESC;
platform sessions avg_duration_sec
0 WEB 5 2907.6
1 iOS 1 2691.0
2 Android 2 2085.5
3 web 198 1823.9
4 ios 329 1744.4
5 android 265 1679.0
Follow-up: “Why include the count?” → averages without sample size are dangerously seductive.
Edge cases & gotchas (3)
- Average without sample size lies
Always emit COUNT alongside AVG. A platform with 5 sessions and a 10-second average looks identical to one with 50,000 sessions and the same average — but only one is statistically meaningful.
Fix · SQLSELECT platform, COUNT(*) AS sessions, AVG(duration_sec) AS avg_dur, STDDEV(duration_sec) AS sd_dur, AVG(duration_sec) / SQRT(COUNT(*)) AS sem FROM sessions GROUP BY platform;Outputplatform sessions avg_dur sd_dur sem 0 Android 2 2085.500000 713.470742 1474.671192 1 android 265 1678.973485 1011.429126 103.138521 2 web 198 1823.908629 1007.822862 129.619542 3 iOS 1 2691.000000 NaN 2691.000000 4 ios 329 1744.445122 1062.677389 96.174373 5 WEB 5 2907.600000 775.846828 1300.318250 - Outlier-skewed mean
A few 12-hour zombie sessions pull AVG up. Report median or a trimmed mean alongside.
Fix · SQLSELECT platform, AVG(duration_sec) AS avg_dur, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_sec) AS p50_dur FROM sessions GROUP BY platform;Outputplatform avg_dur p50_dur 0 Android 2085.500000 2085.5 1 WEB 2907.600000 3257.0 2 android 1678.973485 1557.5 3 web 1823.908629 1928.0 4 iOS 2691.000000 2691.0 5 ios 1744.445122 1719.5 - AVG silently drops NULL and negative durations
AVG ignores NULL — but a negative
duration_sec(corrupt event) is treated as a real value and pulls the mean down. The denominator surprise: NULL rows are excluded from both numerator and denominator. Always pre-filter to valid positives before averaging, AND emit adroppedcount so the DQ trail is visible.Fix · SQLSELECT LOWER(platform) AS platform, COUNT(*) AS rows_total, COUNT(duration_sec) AS rows_with_duration, SUM(CASE WHEN duration_sec IS NULL THEN 1 END) AS dropped_null, SUM(CASE WHEN duration_sec < 0 THEN 1 END) AS dropped_negative, ROUND(AVG(duration_sec), 1) AS avg_raw, -- pulled down by negatives ROUND(AVG(CASE WHEN duration_sec >= 0 THEN duration_sec END), 1) AS avg_clean -- valid positives only FROM sessions GROUP BY 1 ORDER BY rows_total DESC;Outputplatform rows_total rows_with_duration dropped_null dropped_negative avg_raw avg_clean 0 ios 330 329 1.0 1.0 1747.3 1752.7 1 android 267 266 1.0 1.0 1682.0 1688.4 2 web 203 202 1.0 NaN 1850.7 1850.7
How many sessions occur on weekends?
[Netflix · Member Ops] An ops lead suspects weekend viewing is what is driving the Tier-1 support backlog. Size weekend session volume so capacity planning has a number to work with.
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
SELECT COUNT(*) AS weekend_sessions
FROM sessions
WHERE EXTRACT(DOW FROM session_date) IN (0, 6);
weekend_sessions
0 224
Portability: MySQL uses DAYOFWEEK() (1=Sun…7=Sat); Spark dayofweek() uses the same convention as MySQL. Always state your DB.
Edge cases & gotchas (2)
- DOW numbering is DB-specific
Postgres EXTRACT(DOW): Sun=0..Sat=6. ISODOW: Mon=1..Sun=7. MySQL DAYOFWEEK: Sun=1..Sat=7. Wrong constant = wrong answer.
Fix · SQL
-- Portable: name-based
SELECT COUNT(*) FROM sessions
WHERE strftime(session_date, '%w') IN ('0','6');      -- DuckDB / SQLite
-- Or: EXTRACT(ISODOW FROM session_date) IN (6, 7)    -- Postgres
Output
   count_star()
0           224
- Time zone drift
If session_date is derived from a UTC timestamp, a 'weekend' for a Sydney user differs from one for an LA user. Convert to user-local TZ before bucketing.
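A sketch of the conversion, assuming a raw UTC timestamp column (session_ts) and a per-user IANA zone (users.tz); neither column exists in the sample schema:
SELECT COUNT(*) AS weekend_sessions_local
FROM sessions s
JOIN users u ON u.user_id = s.user_id
-- session_ts and u.tz are hypothetical: stored-as-UTC timestamp, converted to the user's local zone before bucketing
WHERE EXTRACT(DOW FROM (s.session_ts AT TIME ZONE 'UTC') AT TIME ZONE u.tz) IN (0, 6);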
List all distinct event types in the events table.
[Meta · Onboarding DE] A new analyst on the funnel team needs to know what canonical event types exist before writing any SQL. The lookup table is documented but they want to verify against the actual fact stream.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
SELECT DISTINCT event_type
FROM events
ORDER BY event_type;
event_type
0 activation
1 purchase
2 signup
3 NaN
Production angle: In a real warehouse, prefer querying a documented enum/lookup table instead of DISTINCT on a high-cardinality fact table.
Edge cases & gotchas (2)
- DISTINCT on huge fact tables is expensive
On a 10B-row events table, SELECT DISTINCT does a full scan. Hit the documented enum/lookup table or a daily-cardinality summary.
Fix · SQL
-- Cheap on a billion rows: the catalog table
SELECT type FROM event_types ORDER BY type;
Output
[error] CatalogException: Catalog Error: Table with name event_types does not exist! Did you mean "events"?
LINE 2: SELECT type FROM event_types ORDER BY type
                         ^
- Case sensitivity
DISTINCT is case-sensitive. 'SignUp' and 'signup' are two values; lowercase before grouping if upstream is dirty.
Fix · SQL
SELECT DISTINCT LOWER(event_type) FROM events ORDER BY 1;
Output
  lower(event_type)
0        activation
1          purchase
2            signup
3               NaN
How many users signed up in Q4 2024?
[Google · Ads Marketing] The growth team is reconciling Q4 2024 acquisition against the quarterly board pack. Count signups landed in that window, with a half-open date range.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
SELECT COUNT(*) AS q4_2024_signups
FROM users
WHERE signup_date >= DATE '2024-10-01'
AND signup_date < DATE '2025-01-01';
q4_2024_signups
0 46
Why half-open ranges: >= start AND < end is timezone-safe, leap-day-safe, and avoids the classic BETWEEN '2024-12-31' off-by-one when timestamps creep in.
Edge cases & gotchas (2)
- BETWEEN is INCLUSIVE on both ends
BETWEEN '2024-10-01' AND '2024-12-31' silently misses any midnight-on-Dec-31 timestamps. Half-open ranges (>= start AND < end) are timezone- and DST-safe.
Fix · SQL
-- WRONG when signup_date can be a TIMESTAMP
WHERE signup_date BETWEEN '2024-10-01' AND '2024-12-31'
-- RIGHT: half-open
WHERE signup_date >= DATE '2024-10-01'
  AND signup_date < DATE '2025-01-01'
- Time zone of the report
‘Q4 2024’ in PT vs UTC differs by 8 hours. State the report TZ and CONVERT/AT TIME ZONE if signup_date is UTC.
Largest single order amount and its date.
[Amazon · Trust & Safety] A fraud analyst is doing an outlier review and wants the top-of-distribution: the single largest order ever booked, with date for forensic context.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT order_id, user_id, order_date, amount
FROM orders
ORDER BY amount DESC
LIMIT 1;
order_id user_id order_date amount
0 446 176 2025-02-25 958.35
Alternative: A subquery with WHERE amount = (SELECT MAX(amount) FROM orders) — and discuss why ORDER BY ... LIMIT is usually clearer and almost always more index-friendly.
Edge cases & gotchas (2)
- Multiple tied maxima
If two orders share the max amount, LIMIT 1 returns one non-deterministically. Use ROW_NUMBER with a deterministic tiebreak.
Fix · SQL
SELECT order_id, user_id, order_date, amount
FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY amount DESC, order_id) AS rn
    FROM orders
) t
WHERE rn = 1;
Output
   order_id  user_id  order_date  amount
0       446      176  2025-02-25  958.35
- Negative amounts (refunds)
If the table also stores refunds with negative amounts, MAX still returns the largest positive. But MIN may surprise: the ‘biggest refund’ is the most negative.
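A quick way to surface all three numbers at once (FILTER is Postgres / DuckDB; on this sample data the amounts are likely all positive, so the smallest-positive and most-negative columns coincide):
SELECT MAX(amount)                            AS largest_charge,
       MIN(amount)                            AS most_negative_amount,      -- the 'biggest refund' if refunds exist
       MIN(amount) FILTER (WHERE amount > 0)  AS smallest_positive_charge
FROM orders;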
Sessions per day for the most recent 7 days in the data.
[Netflix · DataOps] An on-call DE just shipped a schema change to the sessions topic. Before paging the team, they want a freshness sanity check: session counts for the last 7 days the data actually covers.
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
SELECT session_date, COUNT(*) AS sessions
FROM sessions
WHERE session_date >= (SELECT MAX(session_date) - INTERVAL '6 days' FROM sessions)
GROUP BY session_date
ORDER BY session_date;
session_date sessions
0 2099-06-01 1
Why MAX() instead of CURRENT_DATE: the data may be stale; you want ‘last 7 days that exist’, not ‘last 7 calendar days’.
Edge cases & gotchas (2)
- Sparse data near the head
MAX(session_date) - 6 days might span days with zero sessions. Date-spine left-join if you want exactly 7 daily rows.
Fix · SQLWITH spine AS ( SELECT (SELECT MAX(session_date) FROM sessions) - INTERVAL n DAY AS d FROM range(0, 7) t(n) ) SELECT s.d, COUNT(x.session_id) AS sessions FROM spine s LEFT JOIN sessions x ON x.session_date = s.d GROUP BY s.d ORDER BY s.d;Output[error] ParserException: Parser Error: syntax error at or near "DAY" LINE 2: ... SELECT (SELECT MAX(session_date) FROM sessions) - INTERVAL n DAY AS d ^ - Time zone vs ‘today’
CURRENT_DATE returns the server's date — usually UTC, sometimes the cluster region. Use AT TIME ZONE to report in the dashboard's TZ.
Tier 2 · Joins & Multi-Table
Find users who have never made a purchase.
[Meta · Lifecycle] The lifecycle team wants to send a free-trial coupon to every user who signed up but never converted to a purchase event. Build the audience list with an anti-join — and explain why NOT IN is the wrong tool.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
SELECT u.user_id, u.name, u.country, u.signup_date
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
WHERE p.purchase_id IS NULL;
user_id name country signup_date
0 1 user_1 IN 2024-10-02
1 2 user_2 UK 2024-10-04
2 4 user_4 JP 2024-10-08
3 8 user_8 US 2024-10-15
4 10 user_10 IN 2024-10-19
.. ... ... ... ...
70 193 user_193 US 2025-10-21
71 194 user_194 UK 2025-10-23
72 195 user_195 BR 2025-10-25
73 197 user_197 US 2025-10-28
74 199 user_199 US 2025-11-02
Why anti-join via LEFT JOIN + IS NULL: often plans better than NOT IN (which silently breaks on NULLs) or NOT EXISTS.
Edge cases & gotchas (3)
- NOT IN poisons on NULL
If the subquery returns any NULL, NOT IN evaluates to UNKNOWN for every row — the result set becomes empty. Anti-join via LEFT JOIN + IS NULL or NOT EXISTS is NULL-safe.
Fix · SQL
-- DANGEROUS
SELECT * FROM users
WHERE user_id NOT IN (SELECT user_id FROM purchases);
-- SAFE
SELECT u.*
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
WHERE p.purchase_id IS NULL;
Output
    user_id      name                email signup_date country
0         1    user_1    user1@example.com  2024-10-02      IN
1         2    user_2    user2@example.com  2024-10-04      UK
2         4    user_4    user4@example.com  2024-10-08      JP
3         8    user_8    user8@example.com  2024-10-15      US
4        10   user_10   user10@example.com  2024-10-19      IN
5        12   user_12   user12@example.com  2024-10-23      \N
..      ...       ...                  ...         ...     ...
69      190  user_190  user190@example.com  2025-10-15      FR
70      193  user_193  user193@example.com  2025-10-21      US
71      194  user_194  user194@example.com  2025-10-23      UK
72      195  user_195  user195@example.com  2025-10-25      BR
73      197  user_197  user197@example.com  2025-10-28      US
74      199  user_199  user199@example.com  2025-11-02      US
──── result 2 ────
    user_id      name                email signup_date country
0         1    user_1    user1@example.com  2024-10-02      IN
1         2    user_2    user2@example.com  2024-10-04      UK
2         4    user_4    user4@example.com  2024-10-08      JP
3         8    user_8    user8@example.com  2024-10-15      US
4        10   user_10   user10@example.com  2024-10-19      IN
5        12   user_12   user12@example.com  2024-10-23      \N
..      ...       ...                  ...         ...     ...
69      190  user_190  user190@example.com  2025-10-15      FR
70      193  user_193  user193@example.com  2025-10-21      US
71      194  user_194  user194@example.com  2025-10-23      UK
72      195  user_195  user195@example.com  2025-10-25      BR
73      197  user_197  user197@example.com  2025-10-28      US
74      199  user_199  user199@example.com  2025-11-02      US
- Soft-deleted users
If users has a deleted_at column, you might be sending a coupon to a churned account. Filter the audience BEFORE the anti-join.
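A sketch of that ordering, assuming a hypothetical deleted_at column on users:
SELECT u.user_id, u.name
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
WHERE u.deleted_at IS NULL        -- hypothetical soft-delete column: exclude churned accounts first
  AND p.purchase_id IS NULL;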
WHERE country != 'US'excludes NULL tooThree-valued logic strikes again:
NULL != 'US'evaluates toUNKNOWN, not TRUE — so users with NULL country drop out of the audience silently. If you actually want ‘not US, including unknowns’, useIS DISTINCT FROM(Postgres / DuckDB) or an explicit OR. Side-by-side counts show the gap.Fix · SQL-- All three answers in one row, so the gap is unmissable SELECT SUM(CASE WHEN country != 'US' THEN 1 ELSE 0 END) AS not_us_naive, -- drops NULL SUM(CASE WHEN country IS DISTINCT FROM 'US' THEN 1 ELSE 0 END) AS not_us_safe, -- NULL-safe SUM(CASE WHEN country != 'US' OR country IS NULL THEN 1 ELSE 0 END) AS not_us_explicit_or, -- same idea SUM(CASE WHEN country IS NULL THEN 1 ELSE 0 END) AS gap_explained -- = safe - naive FROM users;Outputnot_us_naive not_us_safe not_us_explicit_or gap_explained 0 140.0 147.0 147.0 7.0
Average purchase amount per user country.
[Amazon · Pricing Science] Pricing wants to test whether IN, UK, and US baskets differ enough to justify country-specific promo pricing. Average purchase amount by user country, with sample size.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
SELECT u.country,
COUNT(p.purchase_id) AS purchases,
ROUND(AVG(p.amount), 2) AS avg_basket
FROM users u
JOIN purchases p ON p.user_id = u.user_id
GROUP BY u.country
ORDER BY avg_basket DESC;
country purchases avg_basket
0 US 86 1271.63
1 in 4 147.87
2 NaN 8 135.66
3 \N 21 113.80
4 FR 31 108.38
.. ... ... ...
10 br 2 40.26
11 4 34.99
12 UK 2 27.49
13 <NULL> 1 6.21
14 US 1 3.62
Trap: A LEFT JOIN here would also keep countries whose users never purchased — they would show AVG = NULL, not 0, which is correct but easy to mis-read.
Edge cases & gotchas (2)
- Inner vs left changes the denominator
INNER JOIN gives the avg only for purchasing countries. LEFT JOIN with AVG gives NULL for non-purchasers — easy to mis-read as 0.
Fix · SQL
-- LEFT JOIN, COALESCE for clarity
SELECT u.country,
       COALESCE(ROUND(AVG(p.amount), 2), 0) AS avg_basket,
       COUNT(p.purchase_id) AS purchases
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
GROUP BY u.country;
Output
   country  avg_basket  purchases
0       JP       79.99         23
1      NaN      135.66          8
2       FR      108.38         31
3   <NULL>        6.21          1
4       BR       99.58         28
5                34.99          4
..     ...         ...        ...
10      US        3.62          1
11                0.00          0
12      UK       27.49          2
13      UK       65.22         29
14      in      147.87          4
15      DE       84.08         13
- Per-country sample size
Always emit purchase count alongside average. A 1-purchase country with avg $999 looks identical to a 1,000-purchase country with avg $999.
Which users have no profile row?
[Meta · Privacy Eng] Before a profile-personalization feature goes GA, the privacy review needs a list of users who have a users row but no user_profiles row — these users have not consented to the profile-data flow.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
SELECT u.user_id, u.name
FROM users u
WHERE NOT EXISTS (
SELECT 1 FROM user_profiles up WHERE up.user_id = u.user_id
);
Empty DataFrame
Columns: [user_id, name]
Index: []
Why NOT EXISTS: safe with NULLs, semantically clearest, and the optimizer typically transforms it into an anti-join.
Edge cases & gotchas (2)
- Multiple profile rows per user
user_profiles intentionally has 334 rows for 200 users (Q22's dedup target). NOT EXISTS still works — it short-circuits — but if you switched to LEFT JOIN, you must SELECT DISTINCT.
Fix · SQL
-- LEFT JOIN form needs DISTINCT
SELECT DISTINCT u.user_id, u.name
FROM users u
LEFT JOIN user_profiles up ON up.user_id = u.user_id
WHERE up.user_id IS NULL;
Output
Empty DataFrame
Columns: [user_id, name]
Index: []
- NULL user_id in user_profiles
If user_profiles has rows with NULL user_id (corrupt loads), they do not match anything and don't pollute the anti-join — but they are a DQ bug worth alerting on separately.
Top 10 spenders — name and total spend.
[Netflix · CX] The white-glove support team is spinning up a VIP queue. They want the top 10 spenders by lifetime billing — name + total + order count — so phone routing can lift them.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT u.user_id, u.name,
ROUND(SUM(o.amount), 2) AS total_spend,
COUNT(o.order_id) AS order_count
FROM users u
JOIN orders o ON o.user_id = u.user_id
GROUP BY u.user_id, u.name
ORDER BY total_spend DESC
LIMIT 10;
user_id name total_spend order_count
0 176 user_176 1445.05 8
1 70 user_70 1337.78 10
2 168 user_168 1117.62 3
3 112 user_112 971.71 6
4 36 user_36 952.59 3
5 140 user_140 808.65 3
6 184 user_184 768.93 2
7 133 user_133 752.55 4
8 131 user_131 741.68 4
9 38 user_38 715.93 4
Note: orders.user_id is not a strict FK in the dataset — discuss what happens if an order references a missing user (use LEFT JOIN to find them; see Q46).
Edge cases & gotchas (2)
- Orphan order rows inflate the spend
If orders.user_id can reference a missing users row, INNER JOIN drops them; LEFT JOIN keeps them but the user name is NULL. State which behavior the dashboard wants.
Fix · SQLSELECT u.name, SUM(o.amount) FILTER (WHERE u.user_id IS NOT NULL) AS spend_known, SUM(o.amount) FILTER (WHERE u.user_id IS NULL) AS spend_orphan FROM orders o LEFT JOIN users u ON u.user_id = o.user_id;Output[error] BinderException: Binder Error: column "name" must appear in the GROUP BY clause or must be part of an aggregate function. Either add it to the GROUP BY list, or use "ANY_VALUE(name)" if the exact value of "name" is not important. LINE 1: SELECT u.name, ^ - Refunded orders
If amount can be negative (refunds), SUM is the net. Some teams report gross + refund separately; clarify before coding.
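One way to report both, sketched on orders (FILTER is Postgres / DuckDB; the sample table likely has no negative rows, so refunds would come back NULL here):
SELECT SUM(amount) FILTER (WHERE amount > 0)  AS gross_charges,
       SUM(amount) FILTER (WHERE amount < 0)  AS refunds,
       SUM(amount)                            AS net_spend
FROM orders;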
List each employee with their manager's name.
[Amazon · HR] HR is generating onboarding packets and wants a flat employee → manager mapping. The CEO has no manager and must still appear in the export.
employee_id full_name manager_id department salary
0 100 Dana Chen NaN Engineering 250000.0
1 201 Eve Park 100.0 Engineering 180000.0
2 202 Frank Li 100.0 Engineering 175000.0
3 203 Grace Kim 100.0 Engineering NaN
4 301 IC_301 201.0 Engineering 116962.0
SELECT e.employee_id, e.full_name AS employee, e.department,
m.full_name AS manager
FROM employees e
LEFT JOIN employees m ON m.employee_id = e.manager_id
ORDER BY e.department, e.full_name;
employee_id employee department manager
0 100 Dana Chen Engineering NaN
1 201 Eve Park Engineering Dana Chen
2 202 Frank Li Engineering Dana Chen
3 203 Grace Kim Engineering Dana Chen
4 301 IC_301 Engineering Eve Park
.. ... ... ... ...
23 320 IC_320 Engineering IC_302
24 321 IC_321 Engineering IC_302
25 322 IC_322 Engineering IC_306
26 323 IC_323 Engineering IC_306
27 324 IC_324 Engineering IC_306
Why LEFT JOIN: the CEO has no manager (manager_id IS NULL) and would otherwise vanish.
Edge cases & gotchas (2)
- Self-join hides the CEO
INNER JOIN drops the CEO (manager_id IS NULL). LEFT JOIN keeps them but their manager column is NULL — which the export must handle, not crash.
Fix · SQLSELECT e.full_name AS employee, COALESCE(m.full_name, '<no manager>') AS manager FROM employees e LEFT JOIN employees m ON m.employee_id = e.manager_id;Outputemployee manager 0 Eve Park Dana Chen 1 Frank Li Dana Chen 2 Grace Kim Dana Chen 3 IC_301 Eve Park 4 IC_302 Eve Park 5 IC_303 Eve Park .. ... ... 22 IC_320 IC_302 23 IC_321 IC_302 24 IC_322 IC_306 25 IC_323 IC_306 26 IC_324 IC_306 27 Dana Chen <no manager> - Stale manager_id (manager left)
If a manager has been terminated and the row removed, the report shows NULL manager. Real HRIS systems keep terminated employees with a deleted_at flag for exactly this reason.
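A cheap DQ check for dangling manager_id values, sketched on employees (likely empty on this dataset):
SELECT e.employee_id, e.full_name, e.manager_id AS dangling_manager_id
FROM employees e
LEFT JOIN employees m ON m.employee_id = e.manager_id
WHERE e.manager_id IS NOT NULL    -- skip the CEO
  AND m.employee_id IS NULL;      -- the referenced manager row no longer exists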
Salary delta vs. each employee's manager.
[Amazon · People Tech] A comp review is flagging ‘manager-paid-less-than-report’ cases — a classic signal of a stale promo or under-leveled report. Surface every pair.
employee_id full_name manager_id department salary
0 100 Dana Chen NaN Engineering 250000.0
1 201 Eve Park 100.0 Engineering 180000.0
2 202 Frank Li 100.0 Engineering 175000.0
3 203 Grace Kim 100.0 Engineering NaN
4 301 IC_301 201.0 Engineering 116962.0
SELECT e.full_name AS employee, e.salary AS emp_salary,
m.full_name AS manager, m.salary AS mgr_salary,
(m.salary - e.salary) AS manager_premium
FROM employees e
JOIN employees m ON m.employee_id = e.manager_id
WHERE m.salary < e.salary;
employee emp_salary manager mgr_salary manager_premium
0 IC_316 121279.0 IC_301 116962.0 -4317.0
1 IC_317 131077.0 IC_301 116962.0 -14115.0
Follow-up: “What about skip-level comparisons?” — that is the recursive CTE in Q42–Q43.
Edge cases & gotchas (2)
- Promo lag, not bug
A new senior IC may legitimately out-earn their L4 manager. Don't auto-flag; surface the pair for HRBP review with comp band context.
Fix · SQLSELECT e.full_name AS report, e.salary, e.department, m.full_name AS manager, m.salary, (e.salary - m.salary) AS report_premium FROM employees e JOIN employees m ON m.employee_id = e.manager_id WHERE e.salary > m.salary ORDER BY report_premium DESC;Outputreport salary department manager salary_1 report_premium 0 IC_317 131077.0 Engineering IC_301 116962.0 14115.0 1 IC_316 121279.0 Engineering IC_301 116962.0 4317.0 - Currency / part-time normalization
Salaries should be normalized to FTE / annualized USD before comparison. A 0.6-FTE manager out-earning a 1.0-FTE report is expected, not a flag.
Top 3 product categories by total purchase revenue.
[Amazon · Holiday Merch] Merchandising is locking the homepage carousel for Prime Day. Top 3 product categories by revenue and purchase count — confirm both definitions agree.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
SELECT product_category,
ROUND(SUM(amount), 2) AS revenue,
COUNT(*) AS purchases
FROM purchases
GROUP BY product_category
ORDER BY revenue DESC
LIMIT 3;
product_category revenue purchases
0 Clothing 105688.15 64
1 Electronics 6451.80 73
2 Books 5653.72 56
Subtle: ‘by revenue’ can mean SUM(amount) or COUNT(*) × avg price. Always confirm which revenue definition before writing.
Edge cases & gotchas (2)
- Definition of revenue
‘Top by revenue’ can mean SUM(amount) (gross), or SUM(amount * (1 - return_rate)) (net), or COUNT * AVG (units × ASP). Confirm before writing.
- Tied #3 silently dropped
LIMIT 3 drops one of two tied categories at #3. Use DENSE_RANK if the carousel can show ties.
Fix · SQLWITH ranked AS ( SELECT product_category, SUM(amount) AS revenue, DENSE_RANK() OVER (ORDER BY SUM(amount) DESC) AS rk FROM purchases GROUP BY product_category ) SELECT * FROM ranked WHERE rk <= 3;Outputproduct_category revenue rk 0 Clothing 105688.15 1 1 Electronics 6451.80 2 2 Books 5653.72 3
Users active in both events and logins.
[Meta · DAU Definition] Engagement is debating ‘truly active’ — does it require both an event and a login that day? Find the intersection of users in events and logins as a starting reference set.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
SELECT user_id FROM events
INTERSECT
SELECT user_id FROM logins;
user_id
0 35
1 37
2 54
3 62
4 64
.. ...
95 73
96 76
97 82
98 89
99 94
Tradeoff: INTERSECT is concise and dedups for free; EXISTS wins when one side is much larger or you want to stream early.
Edge cases & gotchas (2)
- INTERSECT removes duplicates
INTERSECT in standard SQL is set semantics — duplicates collapse. INTERSECT ALL preserves duplicates. Check your DB; not every engine implements ALL.
- Time-window mismatch
‘Active in events AND logins’ needs a window — last 7 days? Lifetime? The intersection definition is meaningless without one. Add a date filter.
Fix · SQLSELECT user_id FROM events WHERE event_date >= CURRENT_DATE - INTERVAL '7' DAY INTERSECT SELECT user_id FROM logins WHERE login_date >= CURRENT_DATE - INTERVAL '7' DAY;OutputEmpty DataFrame Columns: [user_id] Index: []
Country-level signup→purchase conversion rate.
[Meta · Reels Funnel] Growth is debugging a country-level dropoff in the signup → purchase funnel. Compute conversion rate per country and sort — the IN cohort is suspected of payment-friction issues.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
SELECT u.country,
COUNT(DISTINCT u.user_id) AS signups,
COUNT(DISTINCT p.user_id) AS purchasers,
ROUND(100.0 * COUNT(DISTINCT p.user_id)
/ NULLIF(COUNT(DISTINCT u.user_id), 0), 2) AS conv_pct
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
GROUP BY u.country
ORDER BY conv_pct DESC;
country signups purchasers conv_pct
0 <NULL> 1 1 100.00
1 in 2 2 100.00
2 US 1 1 100.00
3 UK 1 1 100.00
4 \N 9 8 88.89
.. ... ... ... ...
11 NaN 7 4 57.14
12 BR 22 12 54.55
13 br 2 1 50.00
14 UK 20 9 45.00
15 1 0 0.00
Why NULLIF(..., 0): defends against divide-by-zero if a country has no signups (will not happen here, but it is a habit).
Edge cases & gotchas (2)
- Cohort definition
‘Conversion’ is undefined without a window: ever, first 7 days, first 30 days, first session? State which.
Fix · SQL
-- 30-day-window conversion (more honest than ‘ever’)
SELECT u.country,
       100.0 * COUNT(DISTINCT p.user_id) FILTER (
           WHERE p.purchase_date <= u.signup_date + INTERVAL '30' DAY
       ) / NULLIF(COUNT(DISTINCT u.user_id), 0) AS conv_30d_pct
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
GROUP BY u.country;
Output
   country  conv_30d_pct
0   <NULL>    100.000000
1              66.666667
2       JP     61.111111
3      NaN     57.142857
4       US     62.264151
5       IN     54.545455
..     ...           ...
10      BR     50.000000
11      in    100.000000
12      UK    100.000000
13      FR     60.000000
14      UK     40.000000
15      DE     71.428571
- Tiny-cohort countries
DE with 8 users gives a 100% conversion that is statistically meaningless. Add a min_cohort filter for the chart, but report raw counts for completeness.
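One way to wire that in, as a sketch (the threshold of 10 is arbitrary; pick it with the analyst):
SELECT u.country,
       COUNT(DISTINCT u.user_id) AS signups,
       ROUND(100.0 * COUNT(DISTINCT p.user_id)
             / NULLIF(COUNT(DISTINCT u.user_id), 0), 2) AS conv_pct
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
GROUP BY u.country
HAVING COUNT(DISTINCT u.user_id) >= 10   -- keep tiny cohorts out of the chart; report them separately
ORDER BY conv_pct DESC;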
Join products with the current price from product_dim (SCD2).
[Apple · App Store Pricing] A pricing dashboard needs the live SKU price for each product, ignoring expired historical rows. Use the SCD2 dim with is_current = 'Y' — and flag why this is wrong for as-of-historical queries (see Q41).
product_id product_name price effective_date expiry_date is_current
0 1 product_1 60.01 2024-01-01 2024-04-16 N
1 1 product_1 66.01 2024-04-16 NaT Y
2 2 product_2 171.42 2024-01-01 2024-04-22 N
3 2 product_2 188.56 2024-04-22 2024-07-28 N
4 3 product_3 156.15 2024-01-01 2024-04-28 N
SELECT p.product_id, p.product_name, p.category,
pd.price AS current_price, pd.effective_date
FROM products p
LEFT JOIN product_dim pd
ON pd.product_id = p.product_id
AND pd.is_current = 'Y'
ORDER BY p.product_id
LIMIT 10;
product_id product_name category current_price effective_date
0 1 prod_1 Electronics 66.01 2024-04-16
1 1 prod_1 Electronics 66.01 2024-04-16
2 2 prod_2 Electronics NaN NaT
3 3 prod_3 Electronics NaN NaT
4 4 prod_4 Electronics 119.64 2024-06-06
5 5 prod_5 Electronics 38.98 2024-03-26
6 6 prod_6 Electronics NaN NaT
7 7 prod_7 Electronics NaN NaT
8 8 prod_8 Electronics 170.15 2024-03-05
9 9 prod_9 Electronics 64.08 2024-11-14
Setup for Q41: using is_current = 'Y' is fast but only correct for ‘now’ queries. For ‘as-of-some-historical-date’ you must use effective_date / expiry_date windows.
Edge cases & gotchas (2)
- Multiple is_current='Y' per product
If your SCD2 builder has a bug, two rows can have is_current='Y' for the same product. Add an assert.
Fix · SQL
-- DQ: must be 1 current row per product
SELECT product_id, COUNT(*) AS current_rows
FROM product_dim
WHERE is_current = 'Y'
GROUP BY product_id
HAVING COUNT(*) > 1;
Output
   product_id  current_rows
0          11             2
1          19             2
- is_current='Y' is a denormalization
It is a maintained flag — easily out of sync with effective/expiry. For correctness-critical queries, derive 'current' from expiry_date IS NULL OR expiry_date > CURRENT_DATE.
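Deriving it from the windows instead, as a sketch:
SELECT product_id, price, effective_date, expiry_date
FROM product_dim
WHERE effective_date <= CURRENT_DATE
  AND (expiry_date IS NULL OR expiry_date > CURRENT_DATE);   -- 'current' computed from the windows, not the flag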
Tier 3 · Window Functions, CASE, Subqueries
Rank users by total order spend (with ties).
[Amazon · Loyalty A/B] A loyalty experiment is ranking treatment users by lifetime spend to pick the top tier. Use both RANK and DENSE_RANK so ties are handled deterministically — ties matter when the cutoff is paid.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT user_id,
ROUND(SUM(amount), 2) AS total_spend,
RANK() OVER (ORDER BY SUM(amount) DESC) AS spend_rank,
DENSE_RANK() OVER (ORDER BY SUM(amount) DESC) AS spend_dense_rank
FROM orders
GROUP BY user_id
ORDER BY spend_rank
LIMIT 10;
user_id total_spend spend_rank spend_dense_rank
0 176 1445.05 1 1
1 70 1337.78 2 2
2 168 1117.62 3 3
3 112 971.71 4 4
4 36 952.59 5 5
5 140 808.65 6 6
6 184 768.93 7 7
7 133 752.55 8 8
8 131 741.68 9 9
9 38 715.93 10 10
RANK vs DENSE_RANK vs ROW_NUMBER: ranks skip after ties (1,2,2,4); dense_rank does not (1,2,2,3); row_number breaks ties arbitrarily. Pick deliberately.
Edge cases & gotchas (2)
- RANK skips, DENSE_RANK doesn't, ROW_NUMBER picks one
If three users tie at #1: RANK gives 1,1,1,4; DENSE_RANK gives 1,1,1,2; ROW_NUMBER gives 1,2,3 non-deterministically. Pick by downstream consumer.
Fix · SQLSELECT user_id, total_spend, RANK() OVER (ORDER BY total_spend DESC) AS r, DENSE_RANK() OVER (ORDER BY total_spend DESC) AS dr, ROW_NUMBER() OVER (ORDER BY total_spend DESC, user_id) AS rn FROM (SELECT user_id, SUM(amount) AS total_spend FROM orders GROUP BY 1) t;Outputuser_id total_spend r dr rn 0 176 1445.05 1 1 1 1 70 1337.78 2 2 2 2 168 1117.62 3 3 3 3 112 971.71 4 4 4 4 36 952.59 5 5 5 5 140 808.65 6 6 6 .. ... ... ... ... ... 178 15 13.34 179 179 179 179 22 9.74 180 180 180 180 41 8.39 181 181 181 181 71 7.88 182 182 182 182 183 3.41 183 183 183 183 25 3.16 184 184 184 - NULL spend (no orders)
If you LEFT JOIN to keep all users, SUM is NULL. ORDER BY DESC places NULLs last in Postgres but FIRST in MySQL. Be explicit.
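A sketch of the explicit form (Postgres / DuckDB / Snowflake; MySQL has no NULLS LAST, so order by total_spend IS NULL, total_spend DESC instead):
SELECT u.user_id,
       SUM(o.amount) AS total_spend,
       RANK() OVER (ORDER BY SUM(o.amount) DESC NULLS LAST) AS spend_rank
FROM users u
LEFT JOIN orders o ON o.user_id = u.user_id
GROUP BY u.user_id
ORDER BY total_spend DESC NULLS LAST;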
Deduplicate user_profiles — keep latest row per user.
[Meta · CDP] The CDP team has a profile table with intentional history (334 rows for 200 users). Dedup to one row per user, latest updated_at wins — and tiebreak deterministically or your downstream join is non-reproducible.
user_id name city updated_at
0 1 user_1 Denver 2025-01-05
1 1 user_1 NYC 2025-02-19
2 2 user_2 SF 2025-01-01
3 3 user_3 Austin 2025-01-11
4 3 user_3 NaN 2025-02-01
WITH ranked AS (
SELECT up.*,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY updated_at DESC, city DESC
) AS rn
FROM user_profiles up
)
SELECT user_id, name, city, updated_at
FROM ranked
WHERE rn = 1
ORDER BY user_id
LIMIT 10;
user_id name city updated_at
0 1 user_1 NYC 2025-02-19
1 2 user_2 SF 2025-01-01
2 3 user_3 NaN 2025-02-01
3 4 user_4 Boston 2025-01-23
4 5 user_5 NYC 2025-01-18
5 6 user_6 Chicago 2025-01-02
6 7 user_7 Seattle 2025-02-07
7 8 user_8 Austin 2025-03-15
8 9 user_9 Boston 2025-01-06
9 10 user_10 SF 2025-01-09
Always tiebreak. Without the secondary ORDER BY, two rows with identical updated_at produce a non-deterministic winner — the kind of bug that only shows up after a backfill.
Edge cases & gotchas (2)
- Tied updated_at = non-deterministic winner
Without a secondary tiebreak, two profile rows with identical updated_at produce a coin-flip dedup. Add a deterministic secondary key (e.g., city DESC, primary key DESC).
Fix · SQLROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY updated_at DESC, city DESC, ROWID DESC -- always deterministic ) AS rn - All-NULL updated_at
If a partition has only NULL update timestamps, ROW_NUMBER still assigns rn=1 to one row arbitrarily. Decide: drop the partition, or pick by another field.
WoW growth in WAU using LAG.
[Meta · Weekly Biz Review] The weekly business review wants WAU and WoW % change in one query. Use LAG and guard the divisor with NULLIF for the first row.
week_start wau
0 2024-01-01 10000
1 2024-01-08 10230
2 2024-01-15 10455
3 2024-01-22 10819
4 2024-01-29 10434
SELECT week_start,
wau,
LAG(wau) OVER (ORDER BY week_start) AS prev_wau,
ROUND( 100.0 * (wau - LAG(wau) OVER (ORDER BY week_start))
/ NULLIF(LAG(wau) OVER (ORDER BY week_start), 0), 2) AS wow_pct
FROM weekly_metrics
ORDER BY week_start
LIMIT 10;
week_start wau prev_wau wow_pct
0 2024-01-01 10000 <NA> NaN
1 2024-01-08 10230 10000 2.30
2 2024-01-15 10455 10230 2.20
3 2024-01-22 10819 10455 3.48
4 2024-01-29 10434 10819 -3.56
5 2024-02-05 10972 10434 5.16
6 2024-02-12 11048 10972 0.69
7 2024-02-19 11454 11048 3.67
8 2024-02-26 11077 11454 -3.29
9 2024-03-04 10499 11077 -5.22
Readability tip: in a real codebase, wrap the LAG() in a CTE so you do not repeat it three times.
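For reference, a sketch of that CTE form — LAG() computed once and reused:
WITH w AS (
  SELECT week_start, wau,
         LAG(wau) OVER (ORDER BY week_start) AS prev_wau
  FROM weekly_metrics
)
SELECT week_start, wau, prev_wau,
       ROUND(100.0 * (wau - prev_wau) / NULLIF(prev_wau, 0), 2) AS wow_pct
FROM w
ORDER BY week_start;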
Edge cases & gotchas (2)
- First row's wow_pct is NULL
LAG returns NULL on the first row, and LAG/0 is undefined — NULLIF the divisor or COALESCE the result, depending on whether downstream wants NULL or 0.
Fix · SQLSELECT week_start, wau, 100.0 * (wau - LAG(wau) OVER w) / NULLIF(LAG(wau) OVER w, 0) AS wow_pct FROM weekly_metrics WINDOW w AS (ORDER BY week_start);Outputweek_start wau wow_pct 0 2024-01-01 10000 NaN 1 2024-01-08 10230 2.300000 2 2024-01-15 10455 2.199413 3 2024-01-22 10819 3.481588 4 2024-01-29 10434 -3.558554 5 2024-02-05 10972 5.156220 .. ... ... ... 46 2024-11-18 15532 -1.296390 47 2024-11-25 15750 1.403554 48 2024-12-02 16883 7.193651 49 2024-12-09 16215 -3.956643 50 2024-12-16 15425 -4.872032 51 2024-12-23 15680 1.653160 - Missing weeks (gapped data)
LAG is positional, not date-aware: if week 2024-01-08 is missing, LAG jumps from 01-01 to 01-15 silently. Date-spine first.
CASE pivot — sessions per platform per day.
[Meta · IG Reporting] A weekly report on platform-mix wants one row per day with three columns: iOS, Android, Web. Pivot via CASE — and explain why this is more portable than Snowflake's PIVOT when reporting tools straddle warehouses.
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
SELECT session_date,
SUM(CASE WHEN platform = 'ios' THEN 1 ELSE 0 END) AS ios,
SUM(CASE WHEN platform = 'android' THEN 1 ELSE 0 END) AS android,
SUM(CASE WHEN platform = 'web' THEN 1 ELSE 0 END) AS web,
COUNT(*) AS total
FROM sessions
GROUP BY session_date
ORDER BY session_date
LIMIT 10;
session_date ios android web total
0 2025-01-01 1.0 10.0 0.0 12
1 2025-01-02 7.0 4.0 1.0 12
2 2025-01-03 4.0 7.0 1.0 12
3 2025-01-04 3.0 3.0 6.0 12
4 2025-01-05 6.0 2.0 4.0 12
5 2025-01-06 7.0 3.0 2.0 12
6 2025-01-07 4.0 6.0 2.0 12
7 2025-01-08 3.0 4.0 4.0 12
8 2025-01-09 1.0 7.0 4.0 12
9 2025-01-10 8.0 3.0 1.0 12
vs. native PIVOT: Snowflake / SQL Server have PIVOT; Postgres uses CASE (or crosstab). The CASE form is the most portable.
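For comparison, the native form looks roughly like this in Snowflake — treat it as a sketch and check your warehouse's docs, since PIVOT syntax differs between Snowflake and SQL Server:
SELECT *
FROM (SELECT session_date, platform, session_id FROM sessions)
  PIVOT (COUNT(session_id) FOR platform IN ('ios', 'android', 'web'))
ORDER BY session_date;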
Edge cases & gotchas (2)
- Unknown platform values
If a new platform ('windows') ships, the CASE pivot silently drops those rows from the named columns. Add a 'other' bucket and an alert when total != iOS+Android+Web+other.
Fix · SQLSELECT session_date, SUM(CASE WHEN platform='ios' THEN 1 ELSE 0 END) AS ios, SUM(CASE WHEN platform='android' THEN 1 ELSE 0 END) AS android, SUM(CASE WHEN platform='web' THEN 1 ELSE 0 END) AS web, SUM(CASE WHEN platform NOT IN ('ios','android','web') THEN 1 ELSE 0 END) AS other, COUNT(*) AS total FROM sessions GROUP BY session_date;Outputsession_date ios android web other total 0 2025-01-22 4.0 3.0 5.0 0.0 12 1 2025-02-13 7.0 4.0 1.0 0.0 12 2 2025-01-20 2.0 5.0 4.0 1.0 12 3 2025-01-25 7.0 1.0 4.0 0.0 12 4 2025-01-30 4.0 6.0 2.0 0.0 12 5 2025-02-07 5.0 3.0 3.0 1.0 12 .. ... ... ... ... ... ... 62 2025-02-11 2.0 5.0 5.0 0.0 12 63 2025-02-19 5.0 3.0 4.0 0.0 12 64 2025-02-20 3.0 7.0 1.0 0.0 11 65 2025-03-01 5.0 3.0 4.0 0.0 12 66 2025-01-18 3.0 6.0 3.0 0.0 12 67 2025-01-31 7.0 2.0 3.0 0.0 12 - Case sensitivity in platform values
'iOS' vs 'ios' fails the equality test. Lowercase upstream or in the CASE comparator.
Top 3 highest-revenue products in each category.
[Amazon · Homepage Ranking] Merch wants a 3-product shelf per category for the homepage rotation. Use DENSE_RANK so ties at #2 still produce a #3 — relevant when revenue is bucketed.
product_id product_name category revenue
0 1 prod_1 Electronics 435.0
1 2 prod_2 Electronics 25.0
2 3 prod_3 Electronics 384.0
3 4 prod_4 Electronics 1054.0
4 5 prod_5 Electronics 485.0
WITH ranked AS (
SELECT product_id, product_name, category, revenue,
DENSE_RANK() OVER (PARTITION BY category
ORDER BY revenue DESC) AS rk
FROM products
)
SELECT category, product_id, product_name, revenue, rk
FROM ranked
WHERE rk <= 3
ORDER BY category, rk;
category product_id product_name revenue rk
0 Books 19 prod_19 2887.0 1
1 Books 11 prod_11 2420.0 2
2 Books 15 prod_15 2021.0 3
3 Clothing 28 prod_28 9770.0 1
4 Clothing 30 prod_30 7190.0 2
.. ... ... ... ... ..
10 Food 42 prod_42 2667.0 2
11 Food 43 prod_43 2252.0 3
12 Sports 40 prod_40 12777.0 1
13 Sports 36 prod_36 2589.0 2
14 Sports 38 prod_38 2003.0 3
Why DENSE_RANK: if two products tie for #2, you keep both and still get a #3.
Edge cases & gotchas (2)
- Ties at #2 mean #3 may not exist
With a tie at #2, DENSE_RANK gives 1,2,2,3 — the rk <= 3 filter keeps four rows, including a true #3. RANK gives 1,2,2,4 — only three rows pass, and there is no #3. Pick semantically.
- Categories with fewer than 3 products
A category with 2 products returns 2 rows, not 3. The carousel must handle short-shelf gracefully.
Fix · SQL-- Emit explicit shelf positions so the consumer can detect and pad short shelves SELECT category, position, product_name FROM ( SELECT category, ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS position, product_name FROM products ) t WHERE position <= 3;Outputcategory position product_name 0 Sports 1 prod_40 1 Sports 2 prod_36 2 Sports 3 prod_38 3 Books 1 prod_19 4 Books 2 prod_11 5 Books 3 prod_15 .. ... ... ... 9 Clothing 1 prod_28 10 Clothing 2 prod_30 11 Clothing 3 prod_29 12 Electronics 1 prod_4 13 Electronics 2 prod_1 14 Electronics 3 prod_5
Median order amount.
[Amazon · Risk Modeling] A finance partner says AVG(amount) is being skewed by whale orders. Compute the median two ways: idiomatic PERCENTILE_CONT, and a portable ROW_NUMBER fallback for warehouses without it.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_amount
FROM orders;
median_amount
0 51.52
Portable fallback: if your DB lacks PERCENTILE_CONT, use a ROW_NUMBER + COUNT trick to pick the middle row(s).
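One portable sketch of that trick — number the rows, then average the middle one (odd N) or two (even N):
WITH ranked AS (
  SELECT amount,
         ROW_NUMBER() OVER (ORDER BY amount) AS rn,
         COUNT(*) OVER () AS n
  FROM orders
)
SELECT AVG(amount) AS median_amount
FROM ranked
WHERE rn IN (CAST(FLOOR((n + 1) / 2.0) AS INT), CAST(FLOOR((n + 2) / 2.0) AS INT));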
Edge cases & gotchas (2)
- Median definition: continuous vs discrete
PERCENTILE_CONT(0.5) interpolates between the two middle rows; PERCENTILE_DISC(0.5) returns one of them. For currency, CONT is usually wrong (it produces fractions of a cent).
Fix · SQL-- DISC for whole-cent values SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY amount) FROM orders;Outputquantile_disc(0.5 ORDER BY amount) 0 51.52 - MySQL has no PERCENTILE_CONT
MySQL 8 lacks WITHIN GROUP. Use the ROW_NUMBER + COUNT trick from the answer block.
Each user's largest order, with date.
[Amazon · Lifecycle Email] A re-engagement campaign wants each customer's biggest-ever order amount + date to drop into the email subject line. Use a window — the correlated-subquery form doesn't scale at orders.size = 1B.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
WITH ranked AS (
SELECT order_id, user_id, order_date, amount,
ROW_NUMBER() OVER (PARTITION BY user_id
ORDER BY amount DESC, order_date DESC) AS rn
FROM orders
)
SELECT user_id, order_id, order_date, amount
FROM ranked
WHERE rn = 1
ORDER BY user_id
LIMIT 10;
user_id order_id order_date amount
0 1 156 2025-01-20 53.08
1 2 90 2025-01-12 324.70
2 3 224 2025-01-28 164.63
3 4 312 2025-02-08 67.36
4 5 104 2025-01-13 192.65
5 7 343 2025-02-12 175.31
6 8 427 2025-02-23 93.81
7 10 191 2025-01-24 243.48
8 11 52 2025-01-07 371.89
9 12 216 2025-01-27 35.07
Window beats correlated subquery: a single pass over orders instead of N subqueries. On large tables this is a 10–100× difference.
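For contrast, a sketch of the correlated-subquery form you are avoiding — the subquery is evaluated per outer row, and it keeps ties instead of picking a single row per user:
SELECT o.user_id, o.order_id, o.order_date, o.amount
FROM orders o
WHERE o.amount = (
  SELECT MAX(o2.amount)
  FROM orders o2
  WHERE o2.user_id = o.user_id
)
ORDER BY o.user_id;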
Edge cases & gotchas (2)
- Tied largest amount per user
If a user has two orders with the same max amount, ROW_NUMBER picks one arbitrarily. Tiebreak by order_date and order_id deterministically.
Fix · SQLROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY amount DESC, order_date DESC, order_id -- fully deterministic ) AS rn - User with no orders
Inner-style filtering drops them. If the email needs every user, left-join from users and COALESCE the subject line.
Cumulative distribution of order amounts.
[Apple · Pricing Tiers] A pricing experiment is drawing tier cutoffs at the 50th, 70th, 90th percentile of basket size. Compute CUME_DIST + NTILE(10) for every order; the team will pick cutoffs from the deciles.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT order_id, amount,
ROUND(CUME_DIST() OVER (ORDER BY amount)::DOUBLE, 4) AS cume_pct,
NTILE(10) OVER (ORDER BY amount) AS decile
FROM orders
ORDER BY amount
LIMIT 10;
order_id amount cume_pct decile
0 234 -215.35 0.002 1
1 389 -63.52 0.004 1
2 12 -12.02 0.006 1
3 185 1.37 0.008 1
4 251 1.49 0.010 1
5 432 3.16 0.012 1
6 72 3.41 0.014 1
7 92 5.06 0.016 1
8 50 5.78 0.018 1
9 79 5.80 0.020 1
CUME_DIST vs PERCENT_RANK: CUME_DIST counts ties as ‘at or below’ (≤); PERCENT_RANK uses (rank-1)/(N-1). Use CUME_DIST for ‘what % of orders are ≤ this’.
Edge cases & gotchas (2)
- CUME_DIST counts ≤, PERCENT_RANK counts <
Subtle but real: CUME_DIST(x) = fraction of rows ≤ x; PERCENT_RANK(x) = (rank(x)-1)/(N-1). At the median they differ.
- NTILE distributes ties arbitrarily
Two users with identical spend can land in different deciles. For high-stakes segmentation, use PERCENT_RANK ≤ 0.10 instead.
Fix · SQL-- Window functions cannot appear in WHERE; filter in an outer query (or QUALIFY where supported) SELECT user_id, total_spend, pr FROM ( SELECT user_id, SUM(amount) AS total_spend, PERCENT_RANK() OVER (ORDER BY SUM(amount) DESC) AS pr FROM orders GROUP BY 1 ) t WHERE pr <= 0.10;
Each order's deviation from that user's average.
[Amazon · Anomaly Detection] A risk model wants ‘orders that are unusually large for that user’. For every order, show the user's mean and the delta — the modeling team will threshold from there.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT order_id, user_id, order_date, amount,
ROUND(AVG(amount) OVER (PARTITION BY user_id), 2) AS user_avg,
ROUND(amount - AVG(amount) OVER (PARTITION BY user_id), 2) AS delta
FROM orders
ORDER BY user_id, order_date
LIMIT 10;
order_id user_id order_date amount user_avg delta
0 156 1 2025-01-20 53.08 36.52 16.56
1 351 1 2025-02-13 19.71 36.52 -16.81
2 417 1 2025-02-22 36.77 36.52 0.25
3 16 2 2025-01-02 59.81 141.57 -81.76
4 45 2 2025-01-06 40.21 141.57 -101.36
5 90 2 2025-01-12 324.70 141.57 183.13
6 131 3 2025-01-17 53.76 70.95 -17.19
7 139 3 2025-01-18 9.77 70.95 -61.18
8 223 3 2025-01-28 18.93 70.95 -52.02
9 224 3 2025-01-28 164.63 70.95 93.68
Window without ORDER BY means the average covers the whole partition (the user) — exactly what we want here.
Edge cases & gotchas (2)
- Single-order users have delta=0
AVG over a partition of 1 equals that one value, so delta is 0. Filter to users with >= some min order count for anomaly logic.
Fix · SQLWITH stats AS ( SELECT order_id, user_id, amount, AVG(amount) OVER (PARTITION BY user_id) AS u_avg, COUNT(*) OVER (PARTITION BY user_id) AS u_n FROM orders ) SELECT * FROM stats WHERE u_n >= 5 AND amount > u_avg * 2;Outputorder_id user_id amount u_avg u_n 0 208 99 347.94 107.280000 5 1 381 111 305.84 100.684000 5 2 298 17 280.68 106.044000 5 3 14 151 318.91 116.080000 5 4 384 161 97.23 44.912000 5 5 446 176 958.35 180.631250 8 .. ... ... ... ... ... 13 382 47 281.25 130.128000 5 14 419 156 236.33 98.251667 6 15 494 156 202.47 98.251667 6 16 128 62 128.41 44.758000 5 17 445 96 365.97 107.082000 5 18 238 155 133.38 65.010000 6 - Order amount distribution is log-normal
Linear delta over-flags whales' big-but-normal orders. Use log(amount) - log(u_avg) for ratio thinking.
Users who signed up but never reached 'purchase' in the events funnel.
[Meta · Lifecycle] Lifecycle wants the dropped-off cohort: users who reached signup but never had a purchase event. Use NOT EXISTS — and explain why it correctly handles users who fired multiple signup events.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
SELECT DISTINCT e.user_id
FROM events e
WHERE e.event_type = 'signup'
AND NOT EXISTS (
SELECT 1 FROM events e2
WHERE e2.user_id = e.user_id
AND e2.event_type = 'purchase'
)
ORDER BY e.user_id
LIMIT 10;
user_id
0 3
1 4
2 8
3 10
4 11
5 13
6 15
7 16
8 17
9 19
Cohort-correctness: if signup can fire multiple times per user (it should not, but data is data), this still works because DISTINCT + NOT EXISTS are set-based.
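The LEFT JOIN anti-join is the equivalent form interviewers often ask for next; avoid NOT IN here, because a single NULL user_id among the purchase rows empties the whole result:
SELECT DISTINCT s.user_id
FROM events s
LEFT JOIN events p
  ON p.user_id = s.user_id
 AND p.event_type = 'purchase'
WHERE s.event_type = 'signup'
  AND p.user_id IS NULL
ORDER BY s.user_id;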
Edge cases & gotchas (2)
- Multiple signup events per user (DQ)
A user firing two 'signup' events is dirty data. NOT EXISTS still works (set-based) but you should DQ-flag it separately.
Fix · SQLSELECT user_id, COUNT(*) AS signup_events FROM events WHERE event_type = 'signup' GROUP BY user_id HAVING COUNT(*) > 1;Outputuser_id signup_events 0 5 2 1 2 2 2 4 2 3 1 2 4 3 2 - ‘Never purchased’ window
Lifetime ‘never’ is a different cohort from ‘not within 30 days of signup’. State the window.
Tier 4 · CTEs, Sessionization, Cohorts
7-day moving average of daily revenue.
[Netflix · Content Performance] The exec dashboard wants the daily revenue trend with the daily noise smoothed out — a 7-day trailing mean. Discuss ROWS vs RANGE: for gapped data (see Q35) RANGE with an INTERVAL is the safer 7-day window.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
SELECT ds, daily_revenue,
ROUND(AVG(daily_revenue) OVER (
ORDER BY ds
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 2) AS rev_7d_avg
FROM daily_metrics
ORDER BY ds
LIMIT 10;
ds daily_revenue rev_7d_avg
0 2025-01-01 50008.19 50008.19
1 2025-01-02 79294.66 64651.43
2 2025-01-03 55245.52 61516.12
3 2025-01-04 51104.66 58913.26
4 2025-01-05 55714.88 58273.58
5 2025-01-06 80359.13 61954.51
6 2025-01-07 38076.88 58543.42
7 2025-01-08 36554.98 56621.53
8 2025-01-09 40941.61 51142.52
9 2025-01-10 50385.72 50448.27
ROWS vs RANGE: ROWS counts rows; RANGE matches values. For dense daily data they are equivalent — but if dates are gapped, RANGE with an interval is the safer ‘true 7-day window’.
Edge cases & gotchas (2)
- ROWS vs RANGE on gapped data
ROWS BETWEEN 6 PRECEDING counts rows, regardless of date gaps. If a day is missing, your ‘7-day’ window actually spans 8 calendar days. RANGE BETWEEN INTERVAL '6' DAY is date-aware.
Fix · SQLSELECT ds, daily_revenue, AVG(daily_revenue) OVER ( ORDER BY ds RANGE BETWEEN INTERVAL '6' DAY PRECEDING AND CURRENT ROW ) AS rev_7d_avg FROM daily_metrics;Outputds daily_revenue rev_7d_avg 0 2025-01-01 50008.19 50008.190000 1 2025-01-02 79294.66 64651.425000 2 2025-01-03 55245.52 61516.123333 3 2025-01-04 51104.66 58913.257500 4 2025-01-05 55714.88 58273.582000 5 2025-01-06 80359.13 61954.506667 .. ... ... ... 115 2025-04-25 71673.07 46890.241429 116 2025-04-26 82002.86 49454.362857 117 2025-04-27 51165.87 52132.301429 118 2025-04-28 63400.18 55192.535714 119 2025-04-29 52785.14 57301.432857 120 2025-04-30 55986.60 61164.007143 - First 6 days produce small windows
The first 6 days have fewer than 7 rows in the window — the average is over those few rows, which is misleading at the trend head. Either return NULL until 7 rows are available, or label with the actual sample size.
Running monthly revenue total, reset each month.
[Amazon · OP1 Dashboard] The OP1 finance dashboard wants ‘month-to-date revenue’ on every row. Reset on month-start. The reset is the partition; the order is the sort.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
SELECT ds, daily_revenue,
SUM(daily_revenue) OVER (
PARTITION BY DATE_TRUNC('month', ds)
ORDER BY ds
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS mtd_revenue
FROM daily_metrics
ORDER BY ds
LIMIT 10;
ds daily_revenue mtd_revenue
0 2025-01-01 50008.19 50008.19
1 2025-01-02 79294.66 129302.85
2 2025-01-03 55245.52 184548.37
3 2025-01-04 51104.66 235653.03
4 2025-01-05 55714.88 291367.91
5 2025-01-06 80359.13 371727.04
6 2025-01-07 38076.88 409803.92
7 2025-01-08 36554.98 446358.90
8 2025-01-09 40941.61 487300.51
9 2025-01-10 50385.72 537686.23
The reset is the partition. Group on the period you want to reset over; window-order within it.
Edge cases & gotchas (2)
- Late-arriving rows for prior months
If a backfill writes a row dated 2024-12-15 today, the MTD for December changes retroactively. Snapshot the curated table or your dashboard becomes a moving target.
- Time zone of month boundary
DATE_TRUNC('month', ds) is in the server's TZ. For a US-based team, an order at 2024-12-31T23:30 PT is December but the UTC row is January.
Fix · SQLSELECT DATE_TRUNC('month', ds AT TIME ZONE 'America/Los_Angeles') AS m_pt, SUM(daily_revenue) FROM daily_metrics GROUP BY 1;Outputm_pt sum(daily_revenue) 0 2025-03-01 00:00:00+00:00 1470069.99 1 2025-01-01 00:00:00+00:00 1588729.72 2 2025-04-01 00:00:00+00:00 1542605.98 3 2025-02-01 00:00:00+00:00 1327662.03
Weekly retention by signup cohort (W0–W3).
[Meta · Retention Triangle] Growth wants the classic Meta retention triangle: rows = signup week cohort, columns = W0/W1/W2/W3 since signup, cells = % logging in. This is the centerpiece chart of every weekly growth review.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
WITH cohort AS (
SELECT user_id, DATE_TRUNC('week', signup_date) AS cohort_week
FROM users
),
joined AS (
SELECT c.cohort_week, c.user_id,
CAST(FLOOR(EXTRACT(EPOCH FROM (l.login_date - c.cohort_week)) / (7*86400)) AS INT) AS weeks_since
FROM cohort c
LEFT JOIN logins l ON l.user_id = c.user_id
)
SELECT cohort_week,
COUNT(DISTINCT user_id) FILTER (WHERE weeks_since = 0) AS w0,
COUNT(DISTINCT user_id) FILTER (WHERE weeks_since = 1) AS w1,
COUNT(DISTINCT user_id) FILTER (WHERE weeks_since = 2) AS w2,
COUNT(DISTINCT user_id) FILTER (WHERE weeks_since = 3) AS w3
FROM joined
WHERE weeks_since BETWEEN 0 AND 3
GROUP BY cohort_week
ORDER BY cohort_week
LIMIT 10;
cohort_week w0 w1 w2 w3
0 2025-02-03 0 0 0 2
1 2025-02-10 0 0 3 3
2 2025-02-17 0 3 4 4
3 2025-02-24 3 4 4 4
4 2025-03-03 3 3 3 3
5 2025-03-10 3 3 3 0
6 2025-03-17 4 4 0 0
7 2025-03-24 3 0 0 0
Convert to %: divide each wN by the cohort size (a separate cohort-sizes CTE) — the full version appears in the 'Cohort retention as percentages' question later on. Most interviewers will follow up by asking for that.
Edge cases & gotchas (2)
- Cohort week alignment
DATE_TRUNC('week', ...) snaps to Monday in ISO mode but Sunday in MySQL. A user signing up 2024-01-07 (Sun) lands in different cohorts depending on engine. State the convention.
- User active before signup_date
If a user logs in before their signup_date (impossible? but stitching IDs creates this), weeks_since goes negative. Filter WHERE weeks_since >= 0 explicitly.
Funnel conversion: signup → activation → purchase.
[Meta · Onboarding Funnel] Product wants the signup → activation → purchase funnel with absolute counts and step-to-step rates. Use FILTER (WHERE ...) with monotonicity guards — a ‘purchase-before-signup’ row is dirty data, not a successful conversion.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
WITH first_event AS (
SELECT user_id, event_type, MIN(event_date) AS first_at
FROM events
WHERE event_type IN ('signup','activation','purchase')
GROUP BY user_id, event_type
),
pivoted AS (
SELECT user_id,
MAX(CASE WHEN event_type='signup' THEN first_at END) AS signup_at,
MAX(CASE WHEN event_type='activation' THEN first_at END) AS activation_at,
MAX(CASE WHEN event_type='purchase' THEN first_at END) AS purchase_at
FROM first_event GROUP BY user_id
)
SELECT
COUNT(*) FILTER (WHERE signup_at IS NOT NULL) AS step1_signup,
COUNT(*) FILTER (WHERE activation_at IS NOT NULL
AND activation_at >= signup_at) AS step2_activation,
COUNT(*) FILTER (WHERE purchase_at IS NOT NULL
AND purchase_at >= COALESCE(activation_at, signup_at)) AS step3_purchase
FROM pivoted;
step1_signup step2_activation step3_purchase
0 498 282 133
Order matters. A ‘purchase’ before ‘signup’ is dirty data — the >= guards keep the funnel monotonic.
Edge cases & gotchas (2)
- Activation BEFORE signup
Without a monotonicity guard, a user with activation_at < signup_at counts as activated. Add the >= guard or your funnel rate exceeds 100% in some weeks.
Fix · SQLCOUNT(*) FILTER (WHERE activation_at IS NOT NULL AND activation_at >= signup_at) AS step2 - Step skipping
What if a user purchases without an activation event? Two valid definitions: strict (must hit each step) vs lax (any later step counts). Confirm with PM before coding.
Date spine — fill missing days in daily_metrics_gapped with zero.
[Netflix · BI Reliability] The Looker dashboard broke because 15 days are missing from daily_metrics_gapped. Build a date spine and left-join — every chart needs every day, even zeros, with a was_missing flag for the data team.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-06 80359.13 9912
WITH spine AS (
SELECT CAST(ds AS DATE) AS ds
FROM range(
(SELECT MIN(ds) FROM daily_metrics_gapped),
(SELECT MAX(ds) FROM daily_metrics_gapped) + INTERVAL 1 DAY,
INTERVAL 1 DAY
) t(ds)
)
SELECT s.ds,
COALESCE(d.daily_revenue, 0) AS daily_revenue,
COALESCE(d.daily_dau, 0) AS daily_dau,
(d.ds IS NULL) AS was_missing
FROM spine s
LEFT JOIN daily_metrics_gapped d ON d.ds = s.ds
ORDER BY s.ds
LIMIT 10;
ds daily_revenue daily_dau was_missing
0 2025-01-01 50008.19 9486 False
1 2025-01-02 79294.66 10292 False
2 2025-01-03 55245.52 10806 False
3 2025-01-04 51104.66 9132 False
4 2025-01-05 0.00 0 True
5 2025-01-06 80359.13 9912 False
6 2025-01-07 38076.88 11239 False
7 2025-01-08 36554.98 8400 False
8 2025-01-09 40941.61 9922 False
9 2025-01-10 50385.72 12868 False
Spark: sequence(start, stop, step) + explode. BigQuery: GENERATE_DATE_ARRAY. Snowflake: GENERATOR. Same pattern, different syntax.
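A sketch of the Spark SQL flavor (interval-literal syntax can vary slightly by Spark version):
SELECT explode(sequence(DATE '2025-01-01', current_date(), INTERVAL 1 DAY)) AS ds;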
Edge cases & gotchas (2)
- Spine misses days outside [min, max]
The spine above runs from MIN(ds) to MAX(ds) of the existing data, so days missing at the edges never appear. If the report wants Jan 1 → today, anchor the spine to those literal dates instead.
Fix · SQLWITH spine AS ( SELECT CAST(d AS DATE) AS ds FROM range(DATE '2025-01-01', CURRENT_DATE + INTERVAL 1 DAY, INTERVAL 1 DAY) t(d) ) SELECT s.ds, COALESCE(g.daily_revenue, 0) AS rev FROM spine s LEFT JOIN daily_metrics_gapped g USING (ds);Outputds rev 0 2025-01-01 50008.19 1 2025-01-02 79294.66 2 2025-01-03 55245.52 3 2025-01-04 51104.66 4 2025-01-06 80359.13 5 2025-01-07 38076.88 .. ... ... 476 2026-02-10 0.00 477 2026-02-18 0.00 478 2026-02-20 0.00 479 2026-02-26 0.00 480 2026-04-10 0.00 481 2026-04-12 0.00 - Filling with 0 vs ffill
0 is appropriate for revenue/sessions (no activity); ffill is appropriate for stocks/balances (state persists). Pick by metric semantics.
Sessionize the clickstream — 30 min idle = new session.
[Meta · Reels Sessionization] Product wants Reels ‘sessions’ from the raw event stream. Industry standard: 30 minutes of inactivity ends a session. Use the cumulative-sum trick — this exact pattern shows up in Reels, Stories, and IG Live.
user_id event_time page
0 1 2025-03-01 08:45:00 home
1 1 2025-03-01 08:55:00 search
2 1 2025-03-01 09:00:00 cart
3 1 2025-03-01 09:01:00 checkout
4 1 2025-03-01 09:46:00 search
WITH gapped AS (
SELECT user_id, event_time, page,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
FROM clickstream
),
flagged AS (
SELECT user_id, event_time, page,
CASE WHEN prev_time IS NULL
OR event_time - prev_time > INTERVAL 30 MINUTE
THEN 1 ELSE 0 END AS is_new_session
FROM gapped
),
sessionized AS (
SELECT user_id, event_time, page,
SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM flagged
)
SELECT user_id, session_id,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
COUNT(*) AS pageviews
FROM sessionized
GROUP BY user_id, session_id
ORDER BY user_id, session_start
LIMIT 10;
user_id session_id session_start session_end pageviews
0 1 1.0 2025-03-01 08:45:00 2025-03-01 09:01:00 4
1 1 2.0 2025-03-01 09:46:00 2025-03-01 11:51:00 24
2 1 3.0 2025-03-01 12:36:00 2025-03-01 12:36:00 1
3 2 1.0 2025-03-01 08:01:00 2025-03-01 09:13:00 12
4 2 2.0 2025-03-01 10:43:00 2025-03-01 11:49:00 9
5 2 3.0 2025-03-01 12:34:00 2025-03-01 13:53:00 23
6 3 1.0 2025-03-01 08:02:00 2025-03-01 08:02:00 1
7 3 2.0 2025-03-01 08:47:00 2025-03-01 08:57:00 2
8 3 3.0 2025-03-01 10:27:00 2025-03-01 10:33:00 5
9 3 4.0 2025-03-01 11:18:00 2025-03-01 11:52:00 9
The cumulative-sum trick is the canonical sessionization pattern. Memorize it.
Edge cases & gotchas (2)
- Threshold choice
30 minutes is canonical for web; mobile push notifications can reasonably use 5 minutes; video heartbeats can use 90 seconds. Threshold belongs to the product, not the SQL.
- Out-of-order events
LAG(... ORDER BY event_time) re-sorts events by timestamp, so negative gaps cannot appear in this query. The real risks are events sharing an identical timestamp (their within-session order is ambiguous) and, in incremental pipelines, late events that belong to a session you already closed. DQ-flag the first; re-sessionize a trailing window to absorb the second.
Fix · SQL-- DQ: surface events whose within-session ordering is ambiguous (identical timestamps) SELECT user_id, event_time, COUNT(*) AS events_at_same_instant FROM clickstream GROUP BY user_id, event_time HAVING COUNT(*) > 1;
Longest consecutive login streak per user.
[Meta · Engagement Streaks] An engagement leaderboard wants each user's longest consecutive-day login streak. Classic gaps-and-islands: date − row_number is constant within consecutive runs.
user_id login_date
0 1 2025-03-02
1 1 2025-03-04
2 1 2025-03-05
3 1 2025-03-06
4 1 2025-03-07
WITH grp AS (
SELECT user_id, login_date,
CAST(login_date AS DATE) - CAST(ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY login_date
) AS INT) AS group_key
FROM (SELECT DISTINCT user_id, login_date FROM logins) d
),
streaks AS (
SELECT user_id, group_key,
COUNT(*) AS streak_len,
MIN(login_date) AS streak_start,
MAX(login_date) AS streak_end
FROM grp
GROUP BY user_id, group_key
)
SELECT user_id, MAX(streak_len) AS longest_streak
FROM streaks
GROUP BY user_id
ORDER BY longest_streak DESC
LIMIT 10;
user_id longest_streak
0 46 20
1 47 17
2 95 13
3 76 13
4 14 12
5 100 12
6 13 11
7 28 11
8 12 11
9 63 10
The trick: date − row_number is constant within consecutive runs. Group by it, you get the islands. Dedup logins first or streaks lie.
Edge cases & gotchas (2)
- Duplicate (user_id, login_date) pairs
If the source has two login rows for the same user on the same day, the row_number trick over-counts the streak length. DEDUP first.
Fix · SQL-- DEDUP before the gaps-and-islands trick WITH dedup AS (SELECT DISTINCT user_id, login_date FROM logins) SELECT user_id, login_date FROM dedup;Outputuser_id login_date 0 1 2025-03-06 1 1 2025-03-28 2 2 2025-03-05 3 2 2025-03-09 4 3 2025-03-07 5 4 2025-03-19 ... ... ... 1919 98 2025-03-17 1920 98 2025-03-26 1921 98 2025-03-29 1922 99 2025-03-08 1923 99 2025-03-10 1924 99 2025-03-14 - Time zone vs ‘day’
Two logins at 23:30 and 00:30 UTC look like two consecutive days in UTC but the same day for a Pacific user. Bucket login_date in user-local TZ if available.
Detect data skew — which join keys hold most of the rows?
[Amazon · Spark Skew Triage] A nightly Spark job is stalling on skewed_table. You suspect one join key has 70%+ of the rows. Confirm with a frequency query — then mitigate (salt the hot key, broadcast the small side, or pre-aggregate).
id join_key value
0 1 key_243 189
1 2 key_225 81
2 3 hot_key 108
3 4 key_91 321
4 5 hot_key 579
SELECT join_key,
COUNT(*) AS rows,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM skewed_table
GROUP BY join_key
ORDER BY rows DESC
LIMIT 5;
join_key rows pct_of_total
0 hot_key 700 70.0
1 key_228 1 0.1
2 key_24 1 0.1
3 key_251 1 0.1
4 key_22 1 0.1
What you will find: hot_key holds ~70% of rows. Mitigations: salt the hot key, broadcast the small side, or pre-aggregate before the join.
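A rough sketch of the fact-side salting step (the dimension side must then be exploded with the same ten salts before joining on salted_key — not shown):
SELECT f.*,
       CASE WHEN f.join_key = 'hot_key'
            THEN f.join_key || '_' || CAST(CAST(FLOOR(RANDOM() * 10) AS INT) AS VARCHAR)
            ELSE f.join_key
       END AS salted_key
FROM skewed_table f;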
Edge cases & gotchas (2)
- Skew via dimension, not just value
A heavy hot key may be one value (hot_key) but skew can also happen on (key, time) — peak-hour traffic on one merchant. Group by (join_key, hour) for time-aware skew.
- Skew thresholds depend on cluster size
‘Top 1% holds 50%’ is one heuristic; ‘any single key > partition target size’ is the operational one. Compute against your shuffle target.
Average DAU per ISO week from daily_metrics.
[Meta · WAU vs avg-DAU] The weekly biz review wants weekly metrics from the daily roll-up. Critical distinction: AVG(daily_dau) ≠ WAU (unique users in the week). Both are valid; compute and label both.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
SELECT CAST(DATE_TRUNC('week', ds) AS DATE) AS week_start,
ROUND(AVG(daily_dau), 0) AS avg_dau,
SUM(daily_revenue) AS week_revenue
FROM daily_metrics
GROUP BY 1
ORDER BY week_start
LIMIT 10;
week_start avg_dau week_revenue
0 2024-12-30 9920.0 291367.91
1 2025-01-06 10982.0 355167.89
2 2025-01-13 10203.0 384238.20
3 2025-01-20 9551.0 332577.08
4 2025-01-27 10553.0 298425.99
5 2025-02-03 10076.0 317637.86
6 2025-02-10 11500.0 335308.30
7 2025-02-17 10426.0 371344.18
8 2025-02-24 10083.0 329799.30
9 2025-03-03 9683.0 284459.57
Critical distinction: WAU (unique users in the week) ≠ AVG(DAU). Both are valid; both are different. State which you computed.
Edge cases & gotchas (2)
- WAU != AVG(DAU)
AVG(DAU) on a 7-day window counts a user multiple times if they appear on multiple days; WAU is unique users in the week. Both are valid; both are different. Label your axis.
Fix · SQL-- Real WAU SELECT DATE_TRUNC('week', login_date) AS week_start, COUNT(DISTINCT user_id) AS wau FROM logins GROUP BY 1;Outputweek_start wau 0 NaT 3 1 2025-03-10 100 2 2025-03-03 100 3 2025-03-24 100 4 2025-02-24 83 5 2025-03-17 100 - Partial weeks at the boundaries
The first and last week may not have 7 full days. Drop or label them, don't average them silently.
Day-of-week revenue heat: rank weekdays by avg revenue.
[Amazon · Ad Spend Calendar] Marketing wants to rank weekdays by avg revenue before locking the ad-spend schedule — is Saturday actually the winner, or just over-counted? Show days_observed per dow so they cannot ignore unevenness.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
SELECT strftime(ds, '%a') AS dow_label,
CAST(EXTRACT(ISODOW FROM ds) AS INT) AS dow_num,
ROUND(AVG(daily_revenue), 2) AS avg_revenue,
COUNT(*) AS days_observed
FROM daily_metrics
GROUP BY dow_label, dow_num
ORDER BY avg_revenue DESC;
dow_label dow_num avg_revenue days_observed
0 Mon 1 52851.23 17
1 Fri 5 52428.22 17
2 Tue 2 51339.15 17
3 Thu 4 51152.23 17
4 Sun 7 48210.35 17
5 Wed 3 46502.79 18
6 Sat 6 43982.03 18
Tip: include days_observed — if a quarter has 14 Mondays and 13 Sundays, the average is on uneven sample sizes.
Edge cases & gotchas (2)
- Uneven sample sizes by DOW
A 90-day window has 13 Mondays and 12 Sundays — averaging without disclosing days_observed makes the rank misleading.
- Holiday & promo contamination
Black Friday is a Friday; Christmas falls mid-week. Day-of-week ranking averages everything together. Slice by promo period for honest weekday comparison.
Tier 5 · Senior Patterns
SCD2 as-of join: price each purchase using the price effective on its date.
[Apple · Historical Pricing] Finance is restating Q1 revenue and needs every purchase priced at the price effective on its date — not today's price. Join via the SCD2 effective/expiry window. The invariant: exactly one row matches per (product, point-in-time).
product_id product_name price effective_date expiry_date is_current
0 1 product_1 60.01 2024-01-01 2024-04-16 N
1 1 product_1 66.01 2024-04-16 NaT Y
2 2 product_2 171.42 2024-01-01 2024-04-22 N
3 2 product_2 188.56 2024-04-22 2024-07-28 N
4 3 product_3 156.15 2024-01-01 2024-04-28 N
SELECT p.purchase_id, p.user_id, p.purchase_date,
p.product_category, p.amount,
pd.product_id, pd.price AS price_on_purchase_date
FROM purchases p
LEFT JOIN product_dim pd
  -- Demo-only join key: this purchases table has no product_id column, so the expression below
  -- always resolves to product 1; with a real schema the condition is simply pd.product_id = p.product_id.
  ON pd.product_id = ((CAST(REPLACE(p.product_category, ' ', '') AS VARCHAR) IS NOT NULL)::INT)
AND p.purchase_date >= pd.effective_date
AND (pd.expiry_date IS NULL OR p.purchase_date < pd.expiry_date)
ORDER BY p.purchase_date
LIMIT 5;
purchase_id user_id purchase_date product_category amount product_id price_on_purchase_date
0 1 149 2024-11-01 Sports 321.30 1 66.01
1 2 82 2024-11-01 Clothing 36.66 1 66.01
2 3 19 2024-11-01 Clothing 70.12 1 66.01
3 4 98 2024-11-01 Food 122.21 1 66.01
4 5 59 2024-11-01 Electronics 5.33 1 66.01
The SCD2 invariant: exactly one row matches per (product, point-in-time). If you ever see duplicates, your effective/expiry windows overlap — that is a data-quality bug, not a query bug.
Edge cases & gotchas (3)
- Overlapping effective windows
If two product_dim rows have overlapping [effective, expiry) for the same product, the as-of join returns multiple rows per purchase — the SCD2 invariant is broken. DQ-assert on every load.
Fix · SQL-- DQ: detect overlapping windows WITH g AS ( SELECT product_id, effective_date, expiry_date, LEAD(effective_date) OVER ( PARTITION BY product_id ORDER BY effective_date ) AS next_eff FROM product_dim ) SELECT * FROM g WHERE expiry_date IS NOT NULL AND next_eff < expiry_date;Outputproduct_id effective_date expiry_date next_eff 0 13 2024-01-01 2024-03-11 2024-01-16 1 13 2024-01-16 2024-05-14 2024-03-11 - Inclusive vs exclusive expiry
[effective_date, expiry_date) (half-open) is industry standard and avoids double-coverage at the boundary. expiry_date IS NULL = still current. Always document the convention.
- Pre-history gap
A purchase dated before any product_dim row leaves the as-of join NULL. Either bootstrap a synthetic 'genesis' row (effective_date = -infinity) or DQ-flag.
Recursive CTE: full org chart with reporting depth.
[Amazon · HRIS] Workday is generating an org-chart export for visa filings: every employee with their depth from the CEO and their full reporting chain as a string. Recursive CTE, guard against cycles in production.
employee_id full_name manager_id department salary
0 100 Dana Chen NaN Engineering 250000.0
1 201 Eve Park 100.0 Engineering 180000.0
2 202 Frank Li 100.0 Engineering 175000.0
3 203 Grace Kim 100.0 Engineering NaN
4 301 IC_301 201.0 Engineering 116962.0
WITH RECURSIVE org AS (
SELECT employee_id, full_name, manager_id, department,
0 AS depth,
CAST(full_name AS VARCHAR) AS chain
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.full_name, e.manager_id, e.department,
o.depth + 1,
o.chain || ' > ' || e.full_name
FROM employees e
JOIN org o ON e.manager_id = o.employee_id
)
SELECT employee_id, full_name, department, depth, chain
FROM org
ORDER BY depth, full_name
LIMIT 10;
employee_id full_name department depth chain
0 100 Dana Chen Engineering 0 Dana Chen
1 201 Eve Park Engineering 1 Dana Chen > Eve Park
2 202 Frank Li Engineering 1 Dana Chen > Frank Li
3 203 Grace Kim Engineering 1 Dana Chen > Grace Kim
4 301 IC_301 Engineering 2 Dana Chen > Eve Park > IC_301
5 302 IC_302 Engineering 2 Dana Chen > Eve Park > IC_302
6 303 IC_303 Engineering 2 Dana Chen > Eve Park > IC_303
7 304 IC_304 Engineering 2 Dana Chen > Eve Park > IC_304
8 305 IC_305 Engineering 2 Dana Chen > Eve Park > IC_305
9 306 IC_306 Engineering 2 Dana Chen > Frank Li > IC_306
Termination: a recursive CTE stops when the recursive step yields no new rows. Cycles in manager_id would loop forever — guard with a depth < N cap in production.
Edge cases & gotchas (2)
- Cycles in manager_id
If A reports to B and B reports to A (data error), the recursive CTE loops forever. Add a depth cap and a visited-set guard.
Fix · SQLWITH RECURSIVE org AS ( SELECT employee_id, manager_id, full_name, 0 AS depth, '/' || full_name AS path FROM employees WHERE manager_id IS NULL UNION ALL SELECT e.employee_id, e.manager_id, e.full_name, o.depth + 1, o.path || '/' || e.full_name FROM employees e JOIN org o ON e.manager_id = o.employee_id WHERE o.depth < 20 -- cycle guard AND POSITION('/' || e.full_name || '/' IN o.path) = 0 -- visited check ) SELECT * FROM org;Outputemployee_id manager_id full_name depth path 0 100 NaN Dana Chen 0 /Dana Chen 1 201 100.0 Eve Park 1 /Dana Chen/Eve Park 2 202 100.0 Frank Li 1 /Dana Chen/Frank Li 3 203 100.0 Grace Kim 1 /Dana Chen/Grace Kim 4 301 201.0 IC_301 2 /Dana Chen/Eve Park/IC_301 5 302 201.0 IC_302 2 /Dana Chen/Eve Park/IC_302 .. ... ... ... ... ... 22 319 302.0 IC_319 3 /Dana Chen/Eve Park/IC_302/IC_319 23 320 302.0 IC_320 3 /Dana Chen/Eve Park/IC_302/IC_320 24 321 302.0 IC_321 3 /Dana Chen/Eve Park/IC_302/IC_321 25 322 306.0 IC_322 3 /Dana Chen/Frank Li/IC_306/IC_322 26 323 306.0 IC_323 3 /Dana Chen/Frank Li/IC_306/IC_323 27 324 306.0 IC_324 3 /Dana Chen/Frank Li/IC_306/IC_324 - Multiple roots
If two employees have manager_id IS NULL (CEO + acting CEO), the recursive CTE starts from both and you get two trees. State the intent.
Recursive CTE: total org headcount under each manager.
[Amazon · Reorg Planning] A reorg-planning tool needs ‘people in your org’ — yourself plus everyone reporting up to you, transitively. Recursive CTE descending, then COUNT(*) - 1 to exclude self.
employee_id full_name manager_id department salary
0 100 Dana Chen NaN Engineering 250000.0
1 201 Eve Park 100.0 Engineering 180000.0
2 202 Frank Li 100.0 Engineering 175000.0
3 203 Grace Kim 100.0 Engineering NaN
4 301 IC_301 201.0 Engineering 116962.0
WITH RECURSIVE descendants AS (
SELECT employee_id AS root_id, employee_id
FROM employees
UNION ALL
SELECT d.root_id, e.employee_id
FROM descendants d
JOIN employees e ON e.manager_id = d.employee_id
)
SELECT m.employee_id, m.full_name AS manager,
COUNT(*) - 1 AS reports_below
FROM descendants d
JOIN employees m ON m.employee_id = d.root_id
GROUP BY m.employee_id, m.full_name
ORDER BY reports_below DESC
LIMIT 10;
employee_id manager reports_below
0 100 Dana Chen 27
1 201 Eve Park 11
2 202 Frank Li 8
3 203 Grace Kim 5
4 306 IC_306 3
5 302 IC_302 3
6 301 IC_301 3
7 312 IC_312 0
8 309 IC_309 0
9 317 IC_317 0
Pattern: seed with self, recurse downward, count − 1 to exclude the root from their own org count.
Edge cases & gotchas (2)
- Off-by-one (counting self)
The seed step puts (root_id, root_id) into descendants — the root counts itself. Subtract 1 in the final aggregate or filter self out in the seed.
- Database default recursion limit
Postgres default cycle protection is off; SQL Server has MAXRECURSION 100 default. For deep orgs (>100 levels — rare but real), increase the cap explicitly.
Cohort retention as percentages.
[Meta · Quick Experiments] A growth experiment wants the cohort retention triangle as percentages — the shape that ships in the experiment readout. Watch the denominator: normalizing by W0 (always 100%) vs cohort size give different reads.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
WITH cohort AS (
SELECT user_id, DATE_TRUNC('week', signup_date) AS cohort_week
FROM users
),
cohort_sizes AS (
SELECT cohort_week, COUNT(*) AS cohort_n FROM cohort GROUP BY cohort_week
),
activity AS (
SELECT c.cohort_week, c.user_id,
CAST(FLOOR(EXTRACT(EPOCH FROM (l.login_date - c.cohort_week)) / (7*86400)) AS INT) AS w
FROM cohort c
JOIN logins l ON l.user_id = c.user_id
WHERE l.login_date >= c.cohort_week
)
SELECT a.cohort_week, cs.cohort_n,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN w=0 THEN a.user_id END)/cs.cohort_n, 1) AS w0,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN w=1 THEN a.user_id END)/cs.cohort_n, 1) AS w1,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN w=2 THEN a.user_id END)/cs.cohort_n, 1) AS w2,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN w=3 THEN a.user_id END)/cs.cohort_n, 1) AS w3
FROM activity a
JOIN cohort_sizes cs USING (cohort_week)
GROUP BY a.cohort_week, cs.cohort_n
ORDER BY a.cohort_week
LIMIT 10;
cohort_week cohort_n w0 w1 w2 w3
0 2024-09-30 3 0.0 0.0 0.0 0.0
1 2024-10-07 3 0.0 0.0 0.0 0.0
2 2024-10-14 4 0.0 0.0 0.0 0.0
3 2024-10-21 3 0.0 0.0 0.0 0.0
4 2024-10-28 4 0.0 0.0 0.0 0.0
5 2024-11-04 3 0.0 0.0 0.0 0.0
6 2024-11-11 4 0.0 0.0 0.0 0.0
7 2024-11-18 3 0.0 0.0 0.0 0.0
8 2024-11-25 4 0.0 0.0 0.0 0.0
9 2024-12-02 3 0.0 0.0 0.0 0.0
Watch the denominator. Some teams normalize by W0 (always 100%) instead of cohort_n — both are valid; pick one and label it.
Edge cases & gotchas (2)
- Cohorts with 0 users
An empty cohort_week still appears in the join; the percentage is 0/0. Filter cohort_n > 0 or NULLIF the divisor.
- Survivorship bias
Cohorts near the data edge have not had time to be observed at W3. Either mark them as incomplete (W3 = NULL) or compute only the cohorts with full observation.
Fix · SQL-- Only fully-observed cohorts WHERE cohort_week + INTERVAL '3' WEEK <= (SELECT MAX(login_date) FROM logins)
New vs returning users by day.
[Netflix · Daily Active Mix] A daily report wants two lines: new users today (first-ever login) vs returning users today (active before). ‘New’ vs ‘new this month’ is a real definition fight — clarify before coding.
user_id login_date
0 1 2025-03-02
1 1 2025-03-04
2 1 2025-03-05
3 1 2025-03-06
4 1 2025-03-07
WITH first_seen AS (
SELECT user_id, MIN(login_date) AS first_login
FROM logins
GROUP BY user_id
)
SELECT l.login_date,
COUNT(DISTINCT CASE WHEN l.login_date = f.first_login THEN l.user_id END) AS new_users,
COUNT(DISTINCT CASE WHEN l.login_date > f.first_login THEN l.user_id END) AS returning_users
FROM logins l
JOIN first_seen f USING (user_id)
GROUP BY l.login_date
ORDER BY l.login_date
LIMIT 10;
login_date new_users returning_users
0 2025-03-01 63 0
1 2025-03-02 20 35
2 2025-03-03 13 56
3 2025-03-04 2 58
4 2025-03-05 1 60
5 2025-03-06 1 61
6 2025-03-07 0 59
7 2025-03-08 0 66
8 2025-03-09 0 63
9 2025-03-10 0 67
Definition discipline: ‘new’ = first-ever-anywhere; ‘returning’ = active today and active before. Some teams want first-ever-this-month — clarify before coding.
Edge cases & gotchas (2)
- ‘New’ redefined per period
‘New today’ could mean ‘first ever’ or ‘first this month’ or ‘first since subscription’. Three valid metrics. Pick one and label.
- Identity stitching
If users log in from multiple devices and you stitch IDs, MIN(login_date) per stitched user differs from MIN per device ID. The stitching pipeline ordering matters.
Find orphaned orders — orders whose user_id doesn't exist in users.
[Amazon · Data Quality] A nightly DQ job needs to surface orphan orders — orders whose user_id doesn't exist in users. Wrap as a view; alert when row count > 0; ticket the upstream owner.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT o.order_id, o.user_id, o.order_date, o.amount
FROM orders o
LEFT JOIN users u ON u.user_id = o.user_id
WHERE u.user_id IS NULL
LIMIT 10;
order_id user_id order_date amount
0 23 99000 2025-01-03 42.33
1 89 99001 2025-01-12 87.23
2 168 99002 2025-01-21 232.05
3 291 99003 2025-02-06 177.40
4 402 99004 2025-02-20 66.10
Production wrap: turn this into a dq_orphan_orders view; alert when row count > 0; ticket the upstream owner. The query is the easy part.
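A sketch of that wrap — the view name is the one suggested above, and the scheduled job only needs the count:
CREATE OR REPLACE VIEW dq_orphan_orders AS
SELECT o.order_id, o.user_id, o.order_date, o.amount
FROM orders o
LEFT JOIN users u ON u.user_id = o.user_id
WHERE u.user_id IS NULL;
-- the alerting job
SELECT COUNT(*) AS orphan_count FROM dq_orphan_orders;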
Edge cases & gotchas (2)
- Soft-deleted users count as orphans
If users has a deleted_at column and you filter deleted_at IS NULL, every order from a churned account becomes an orphan. Distinguish deleted vs never-existed.
Fix · SQL-- assumes users carries a deleted_at column (the demo dataset does not, so this is the pattern, not a runnable query here) SELECT o.order_id, o.user_id, CASE WHEN u.user_id IS NULL THEN 'never_existed' WHEN u.deleted_at IS NOT NULL THEN 'soft_deleted' END AS orphan_kind FROM orders o LEFT JOIN users u ON u.user_id = o.user_id WHERE u.user_id IS NULL OR u.deleted_at IS NOT NULL; - Test/staff orders
Many orgs have test users with user_id in a special range. Filter them OUT of orphan checks AND in business metrics.
Reconcile daily_metrics.daily_revenue against sum of purchases.amount per day.
[Amazon · Metric Reconciliation] A finance audit: daily_metrics.daily_revenue vs SUM(purchases.amount) per day. Tolerance > $0.01 because of currency rounding. A metric is only as trusted as its reconciliation report.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
WITH src AS (
SELECT purchase_date AS ds,
SUM(amount) AS src_revenue
FROM purchases
GROUP BY purchase_date
),
diff AS (
SELECT d.ds,
d.daily_revenue AS metric_revenue,
COALESCE(s.src_revenue, 0) AS source_revenue,
d.daily_revenue - COALESCE(s.src_revenue, 0) AS delta
FROM daily_metrics d
LEFT JOIN src s ON s.ds = d.ds
)
SELECT *
FROM diff
WHERE ABS(delta) > 0.01
ORDER BY ABS(delta) DESC
LIMIT 10;
ds metric_revenue source_revenue delta
0 2025-02-18 92348.66 0.00 92348.66
1 2025-04-26 82002.86 0.00 82002.86
2 2025-03-16 80726.04 0.00 80726.04
3 2025-01-06 80359.13 0.00 80359.13
4 2025-01-02 79294.66 119.12 79175.54
5 2025-01-20 73200.32 0.00 73200.32
6 2025-01-19 72156.75 0.00 72156.75
7 2025-04-25 71673.07 0.00 71673.07
8 2025-02-14 71666.89 0.00 71666.89
9 2025-03-17 71011.11 0.00 71011.11
Why a tolerance: floating point + currency rounding mean exact equality is the wrong check. Pick a meaningful epsilon (e.g., one cent).
Edge cases & gotchas (2)
- Floating-point currency drift
Storing amount as DOUBLE makes SUM non-associative; reconciling with =, even rounded, drifts. Use DECIMAL or convert to integer cents before summing — see the sketch after this gotcha list.
- Refund vs gross definition
If purchases excludes refunds but daily_revenue is net, your reconciliation will always show a gap equal to refund volume. Same definition on both sides or document the gap.
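A minimal sketch of the integer-cents aggregation from the first gotcha above (assumes amounts fit comfortably in BIGINT cents):
SELECT purchase_date AS ds,
       SUM(CAST(ROUND(amount * 100) AS BIGINT)) / 100.0 AS src_revenue
FROM purchases
GROUP BY purchase_date;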
Per-user outlier orders — more than 3σ above that user's mean.
[Apple · Trust Engineering] Risk wants per-user outliers: orders > 3σ above the user's own mean, with at least 5 orders of history. Caveat: order amounts are log-normal, not normal — for production use IQR or MAD.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
WITH stats AS (
SELECT order_id, user_id, order_date, amount,
AVG(amount) OVER (PARTITION BY user_id) AS u_avg,
STDDEV(amount) OVER (PARTITION BY user_id) AS u_std,
COUNT(*) OVER (PARTITION BY user_id) AS u_n
FROM orders
)
SELECT order_id, user_id, order_date, amount,
ROUND(u_avg, 2) AS u_avg, ROUND(u_std, 2) AS u_std, u_n
FROM stats
WHERE u_n >= 5
AND amount > u_avg + 3 * u_std
ORDER BY amount - u_avg DESC
LIMIT 10;
Empty DataFrame
Columns: [order_id, user_id, order_date, amount, u_avg, u_std, u_n]
Index: []
Statistical caveat: 3σ assumes roughly normal — order amounts are usually log-normal. For production, log-transform first or use IQR / MAD.
Edge cases & gotchas (2)
- n=1 partition: STDDEV returns NULL
Postgres STDDEV with one row returns NULL (sample variance); amount > u_avg + 3 * NULL is NULL, so no rows flagged. Filter u_n >= 5 (or use STDDEV_POP).
- Log-normal vs normal
Order amounts are heavy-tailed. 3σ on raw amount mis-flags. Better: log-transform first, or use IQR-based fences (Q3 + 1.5×IQR on the high side, as in the fix below).
Fix · SQL-- IQR-based outlier SELECT order_id, amount FROM orders WHERE amount > (SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) FROM orders) + 1.5 * ( SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) - PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) FROM orders);Outputorder_id amount 0 14 318.91 1 15 890.88 2 19 245.36 3 38 450.92 4 49 308.07 5 52 371.89 .. ... ... 39 424 404.41 40 434 425.90 41 444 251.34 42 445 365.97 43 446 958.35 44 448 919.86
Power users — top 10% by orders AND top 10% by sessions.
[Netflix · VIP Holdback] CX wants the intersection of high-spend and high-engagement users (top 10% on both) for a VIP experiment. NTILE(10) caveat: ties land arbitrarily — for high-stakes segmentation, prefer PERCENT_RANK ≤ 0.10.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
WITH order_rank AS (
SELECT user_id,
NTILE(10) OVER (ORDER BY SUM(amount) DESC) AS o_decile
FROM orders GROUP BY user_id
),
sess_rank AS (
SELECT user_id,
NTILE(10) OVER (ORDER BY COUNT(*) DESC) AS s_decile
FROM sessions GROUP BY user_id
)
SELECT o.user_id, o.o_decile, s.s_decile
FROM order_rank o
JOIN sess_rank s USING (user_id)
WHERE o.o_decile = 1 AND s.s_decile = 1
ORDER BY o.user_id
LIMIT 10;
user_id o_decile s_decile
0 69 1 1
NTILE caveat: NTILE distributes ties arbitrarily, so two users with identical spend may land in different deciles. For high-stakes segmentation, prefer PERCENT_RANK ≤ 0.10.
Edge cases & gotchas (2)
- NTILE ties
NTILE distributes ties arbitrarily, so two users with the same spend can land in different deciles. PERCENT_RANK is tie-respecting.
- Joining on user_id only
If a user has orders but no sessions (or vice versa), they drop out of the inner join. State whether the segment requires both signals or one.
Daily ARPU + sanity check vs. raw orders.
[Meta · ARPU Trust] Daily ARPU is shipping to the exec dashboard. Compute the curated ARPU and a self-computed reconciliation from raw orders side-by-side, with a delta column. Senior interview signal: you ship the audit alongside the metric.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
WITH curated AS (
SELECT ds, daily_revenue, daily_dau,
ROUND(daily_revenue / NULLIF(daily_dau, 0), 4) AS arpu_curated
FROM daily_metrics
),
raw_arpu AS (
SELECT order_date AS ds,
SUM(amount) AS raw_revenue,
COUNT(DISTINCT user_id) AS raw_active_users,
ROUND(SUM(amount) / NULLIF(COUNT(DISTINCT user_id), 0), 4) AS arpu_raw
FROM orders
GROUP BY order_date
)
SELECT c.ds, c.arpu_curated, r.arpu_raw,
ROUND(c.arpu_curated - r.arpu_raw, 4) AS arpu_delta
FROM curated c
LEFT JOIN raw_arpu r USING (ds)
ORDER BY c.ds
LIMIT 10;
ds arpu_curated arpu_raw arpu_delta
0 2025-01-01 5.2718 38.8763 -33.6045
1 2025-01-02 7.7045 195.2125 -187.5080
2 2025-01-03 5.1125 113.1386 -108.0261
3 2025-01-04 5.5962 58.0700 -52.4738
4 2025-01-05 5.6374 80.5513 -74.9139
5 2025-01-06 8.1073 77.1450 -69.0377
6 2025-01-07 3.3879 117.3300 -113.9421
7 2025-01-08 4.3518 76.9813 -72.6295
8 2025-01-09 4.1263 81.8425 -77.7162
9 2025-01-10 3.9156 86.3887 -82.4731
Why ship the delta: a metric is only as trusted as the reconciliation report behind it. Senior interviewers love this answer because it shows you think about trust, not just calculation.
Edge cases & gotchas (2)
- Active-user definition mismatch
daily_dau may count any login event; orders.user_id COUNT(DISTINCT) counts purchasing users only. Different denominators ⇒ different ARPUs. Reconcile definitions first.
- Tolerance for the delta
Float drift makes exact 0 unrealistic. Define a tolerance (e.g., abs(delta)/curated < 0.001) and alert above it.
Fix · SQL-- reusing the curated and raw_arpu CTEs from the main answer, filter on relative drift: SELECT c.ds, c.arpu_curated, r.arpu_raw, ABS(c.arpu_curated - r.arpu_raw) / NULLIF(c.arpu_curated, 0) AS pct_drift FROM curated c LEFT JOIN raw_arpu r USING (ds) WHERE ABS(c.arpu_curated - r.arpu_raw) / NULLIF(c.arpu_curated, 0) > 0.001;
Part 4 · 50 Python Questions, Simple to Complex
Most Python rounds for data engineers are pandas-first, with a few vanilla-Python curveballs (collections, generators, an LRU cache) and — for senior loops — a couple of PySpark or distributed-thinking problems. PySpark answers below show the canonical Spark code; the Output block shows the equivalent result computed via pandas (since this article was built without a live Spark cluster).
Tier 1 · Foundation
Load users.csv with proper date parsing; report shape + dtypes.
[Meta · Notebook Onboarding] Day-1 of a new analysis in a Bento notebook: load the table, confirm row count, confirm date columns are actually parsed as dates and not strings.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
import pandas as pd
users = pd.read_csv(DATA + "users.csv", parse_dates=["signup_date"])
print("shape:", users.shape)
print(users.dtypes)
print(users.head(3))
shape: (200, 5)
user_id int64
name str
email str
signup_date datetime64[us]
country str
dtype: object
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
Why parse_dates: without it, every signup_date.dt.month downstream silently fails. Always parse dates at read time.
Edge cases & gotchas (2)
- Date columns silently load as object
Without parse_dates, 'signup_date' is dtype=object (string). Every .dt call downstream raises AttributeError. Always parse at read time.
Fix · Python# WRONG: signup_date is now a string users = pd.read_csv("users.csv") users["signup_date"].dt.month # AttributeError # RIGHT users = pd.read_csv("users.csv", parse_dates=["signup_date"]) - Mixed-format dates
If the CSV has both '2024-10-02' and '10/2/2024', auto-parse may fall back to dtype=object silently. Pass explicit format.
Fix · Pythonusers = pd.read_csv("users.csv", parse_dates=["signup_date"], date_format="%Y-%m-%d")
Filter to users from India.
[Meta · Localization Sprint] A regional analyst wants only the IN cohort for a Hindi-UI localization study. .copy() defensively to avoid SettingWithCopyWarning downstream.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
in_users = users[users["country"] == "IN"].copy()
print(len(in_users), "users in IN")
print(in_users.head(3))
33 users in IN
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
9 10 user_10 user10@example.com 2024-10-19 IN
19 20 user_20 user20@example.com 2024-11-09 IN
Why .copy(): if you mutate in_users later, pandas may emit SettingWithCopyWarning. The copy is cheap and explicit.
Edge cases & gotchas (2)
- Forgotten .copy() ⇒ SettingWithCopyWarning
Filtering returns a view sometimes and a copy other times. Without explicit .copy(), the next assignment may mutate the original — or warn and refuse.
Fix · Python
# DANGEROUS
in_users = users[users["country"] == "IN"]
in_users["new_col"] = 1  # SettingWithCopyWarning
# SAFE
in_users = users.loc[users["country"] == "IN"].copy()
- Unicode case sensitivity
users[users.country == 'in'] returns 0 rows because matching is case-sensitive. Lowercase consistently or use str.casefold.
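A minimal sketch of the case-insensitive version (str.casefold handles Unicode beyond plain lower()):
in_users = users[users["country"].str.casefold() == "in"].copy()
print(len(in_users), "users in IN (case-insensitive)")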
Sum purchase amount by category, sorted descending.
[Amazon · Category Roll-Up] Same as the SQL category roll-up but in pandas — useful when the warehouse is gated by a long approval queue and you have CSVs in S3 instead. Use named-aggregation, not the legacy multi-index form.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
by_cat = (purchases.groupby("product_category", as_index=False)
.agg(revenue=("amount", "sum"),
n=("amount", "size"))
.sort_values("revenue", ascending=False))
print(by_cat)
product_category revenue n
1 Clothing 105688.15 64
2 Electronics 6451.80 73
0 Books 5653.72 56
3 Food 5179.73 59
4 Sports 4629.85 49
Named aggregation (agg(name=(col, fn))) is the modern, readable form. Skip the legacy .agg({'amount':['sum','count']}) with multi-index columns.
Edge cases & gotchas (2)
- as_index=True returns a Series, not a DataFrame
Default groupby sets the group as the index. For downstream merges and pivots you usually want as_index=False.
Fix · Python
by_cat = (purchases.groupby("product_category", as_index=False)
            .agg(revenue=("amount", "sum"),
                 n=("amount", "size")))
- NaN groups
groupby drops NaN groups by default. If product_category can be NULL, pass dropna=False to surface them.
Fix · Python
purchases.groupby("product_category", dropna=False, as_index=False).size()
Top 10 orders by amount.
[Amazon · Trust & Safety] Quick ‘biggest orders’ peek for a fraud review. nlargest beats sort_values().head() on big data — partial sort, O(n log k) instead of O(n log n).
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
top10 = orders.nlargest(10, "amount")
print(top10)
order_id user_id order_date amount
445 446 176 2025-02-25 958.35
447 448 168 2025-02-25 919.86
14 15 70 2025-01-02 890.88
197 198 112 2025-01-25 830.40
309 310 36 2025-02-08 820.84
356 357 184 2025-02-14 740.15
339 340 140 2025-02-12 734.23
84 85 131 2025-01-11 604.76
330 331 149 2025-02-11 554.26
70 71 154 2025-01-09 456.94
nlargest vs sort_values().head(): on large data, nlargest is a partial sort — O(n log k) instead of O(n log n).
Edge cases & gotchas (2)
- nlargest with NaN
nlargest always excludes NaN; there is no flag to include it. If NaN amounts matter for an outlier review, pull them separately with orders[orders['amount'].isna()] and inspect them alongside the top-N.
- Ties — keep policy
nlargest(10, 'amount', keep='first'|'last'|'all'). 'all' returns >10 rows when ties straddle the cutoff. Pick deliberately.
Fix · Python
orders.nlargest(10, "amount", keep="all")
Count nulls per column in user_profiles.
[Meta · DQ First Pass] First-pass data-quality check on a freshly delivered file. Notice the row count vs unique user count: 334 rows for 200 users means dedup work ahead.
user_id name city updated_at
0 1 user_1 Denver 2025-01-05
1 1 user_1 NYC 2025-02-19
2 2 user_2 SF 2025-01-01
3 3 user_3 Austin 2025-01-11
4 3 user_3 NaN 2025-02-01
print(up.isna().sum())
print("rows:", len(up), " unique users:", up["user_id"].nunique())
user_id 0
name 0
city 5
updated_at 2
dtype: int64
rows: 334 unique users: 200
Tell the interviewer what you noticed: 334 rows for 200 users → duplicates ahead. (Q13 deals with that.)
Edge cases & gotchas (2)
- isna() vs isnull() vs ==NaN
== NaN always returns False; use isna() / isnull() (synonyms in pandas). For object columns, also check for empty strings.
Fix · Python
# Treat empty strings AS missing
mask_missing = up.isna() | (up.astype(object) == "")
print(mask_missing.sum())
- Dtype-specific NaN
datetime NaT, float NaN, and pd.NA in nullable Int64 are different types — but isna() catches all three.
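A quick self-contained check, using a throwaway demo frame rather than one of the article's tables, showing isna() catching all three flavors:
import pandas as pd
import numpy as np
demo = pd.DataFrame({
    "ts": pd.to_datetime(["2025-01-01", None]),       # datetime NaT
    "x": [1.0, np.nan],                                # float NaN
    "n": pd.array([1, pd.NA], dtype="Int64"),          # nullable-int pd.NA
})
print(demo.isna().sum())  # each column reports exactly one missing value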
Add signup_month (YYYY-MM) to users.
[Meta · Cohort Reporting] A monthly cohort report wants a YYYY-MM column for the chart axis. Use .dt.to_period('M') to preserve order semantics, not strftime (which gives a string that happens to sort right for ISO months only).
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
users["signup_month"] = users["signup_date"].dt.to_period("M").astype(str)
print(users[["user_id", "signup_date", "signup_month"]].head())
user_id signup_date signup_month
0 1 2024-10-02 2024-10
1 2 2024-10-04 2024-10
2 3 2024-10-05 2024-10
3 4 2024-10-08 2024-10
4 5 2024-10-09 2024-10
Why .to_period('M') over .strftime('%Y-%m'): Period preserves order semantics; strftime gives a string that happens to sort correctly for ISO months but is not a real date.
Edge cases & gotchas (2)
- to_period('M').astype(str) loses ordering
Period('2024-12') < Period('2025-01') correctly. ‘2024-12’ < ‘2025-01’ correctly. But ‘2024-12’ < ‘25’ (wrong format) sorts lexically. Stick with Period for sorting; cast to string only for display.
- Time zone of the date
If signup_date is tz-aware UTC, to_period('M') is in UTC. Convert to local TZ first if the cohort is reported locally.
Save filtered IN-only users to Parquet.
[Netflix · Notebook Hand-off] Hand off a sliced frame to a downstream notebook without re-parsing CSV every time. Parquet (snappy) is 5–10× smaller and consumers don't have to guess dtypes.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
in_users = users[users["country"] == "IN"].copy()
out = "/tmp/users_in.parquet"
in_users.to_parquet(out, engine="pyarrow", compression="snappy", index=False)
import os
print("wrote:", out, " bytes:", os.path.getsize(out))
wrote: /tmp/users_in.parquet bytes: 4213
Why Parquet: columnar, typed, ~5–10× smaller than CSV, and consumers do not have to guess dtypes. Snappy = fast decompression, decent ratio.
Edge cases & gotchas (2)
- Parquet dtype mismatch
If different chunks have different dtypes (e.g., user_id sometimes int sometimes float), pyarrow may upcast or fail. Coerce dtypes before writing.
Fix · Python
in_users["user_id"] = in_users["user_id"].astype("int64")
in_users.to_parquet("/tmp/users_in.parquet", engine="pyarrow", compression="snappy", index=False)
- Snappy vs zstd
Snappy = fast read, decent ratio. Zstd = ~30% smaller but slower decode. For analytics on cold-storage, zstd is usually right; for hot tables, snappy.
List unique platforms in sessions.
[Meta · Data Discovery] A new analyst exploring the dataset wants to know what platforms exist. value_counts beats unique() — you get the per-value count for free.
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
print(sessions["platform"].value_counts(dropna=False))
platform
ios 329
android 265
web 198
WEB 5
Android 2
iOS 1
Name: count, dtype: int64
Why value_counts over unique(): you also get the per-value count for free — almost always what you actually want.
Edge cases & gotchas (2)
- value_counts hides NaN
By default value_counts excludes NaN. dropna=False surfaces it as a row — what you want for DQ.
Fix · Python
sessions["platform"].value_counts(dropna=False)
- Inconsistent casing
‘iOS’ vs ‘ios’ counted separately. Lowercase upstream or before the count.
Total revenue across the products table.
[Amazon · Quick GMV] A merchandising sanity check before any ranking. Pitfall: if revenue were stored as string ('1,234'), sum() would concatenate. Confirm dtype on numeric columns.
product_id product_name category revenue
0 1 prod_1 Electronics 435.0
1 2 prod_2 Electronics 25.0
2 3 prod_3 Electronics 384.0
3 4 prod_4 Electronics 1054.0
4 5 prod_5 Electronics 485.0
print("Total revenue:", products["revenue"].sum())
Total revenue: 74831.0
Pitfall: if revenue were stored as string ('1,234'), sum() would concatenate strings. Always confirm dtype on numeric columns.
Edge cases & gotchas (2)
- Numeric stored as string
If revenue loaded as object (e.g., '1,234'), .sum() concatenates strings. Always check dtypes before summing currency.
Fix · Python
# Defensive: strip commas, cast
products["revenue"] = (products["revenue"].astype(str)
                         .str.replace(",", "")
                         .astype("float64"))
- Currency rounding
Float sums of currency drift. For ledger-grade totals, store and sum as integer cents, then divide by 100 at display time.
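A minimal sketch of the integer-cents approach on this table (assuming revenue has no NaN):
cents = (products["revenue"] * 100).round().astype("int64")   # work in integer cents
print("Total revenue:", cents.sum() / 100)                    # convert back only for display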
Distinct users in logins.
[Meta · DAU Snapshot] The simplest engagement metric: how many unique users have ever logged in. At billion-user-scale, nunique is expensive — HyperLogLog trades a tiny error for huge memory wins.
user_id login_date
0 1 2025-03-02
1 1 2025-03-04
2 1 2025-03-05
3 1 2025-03-06
4 1 2025-03-07
print("unique users:", logins["user_id"].nunique())
print("total login rows:", len(logins))
unique users: 100
total login rows: 1935
Memory tip: on huge data, nunique() can be expensive; HyperLogLog (e.g., datasketch) trades a tiny error for big memory wins.
Edge cases & gotchas (2)
- nunique blows up on huge data
On a billion-row table, .nunique() loads all distinct values into memory. HyperLogLog (datasketch) trades a tiny error for huge memory wins.
Fix · Python
from datasketch import HyperLogLog
hll = HyperLogLog()
for u in logins["user_id"]:
    hll.update(str(u).encode())
print("approx unique:", hll.count())
- user_id type coercion
If user_id loaded as float because of NaN rows, ‘1.0’ and ‘1’ are the same value but the float repr can confuse downstream string joins.
Tier 2 · Joins & Multi-Table
Left-join users with purchases.
[Meta · Lifecycle Audience] Build a users-with-purchases table for downstream segmentation; preserve all users even if they never bought. validate="one_to_many" is your insurance against a silent m:m join blowing up the row count.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
merged = users.merge(purchases, on="user_id", how="left",
validate="one_to_many")
print("shape:", merged.shape)
print("users with no purchases:",
merged.loc[merged["purchase_id"].isna(), "user_id"].nunique())
shape: (374, 9)
users with no purchases: 75
validate= earns its keep: it asserts the relationship (1:1, 1:m, m:1, m:m). If your join is unexpectedly many-to-many, this catches it.
Edge cases & gotchas (2)
- Unintended m:m join
Without validate=, a duplicated key on either side multiplies rows silently. Always validate the relationship.
Fix · Python
# Will raise if relationship is unexpectedly m:m
merged = users.merge(purchases, on="user_id", how="left",
                     validate="one_to_many")
- Suffixes for column collisions
If both frames have a 'name' column, pandas appends _x / _y silently — confusing. Set explicit suffixes.
Fix · Python
merged = users.merge(profiles, on="user_id", suffixes=("_user", "_profile"))
Find duplicated user_id rows in user_profiles.
[Meta · CDP Inspection] Before deduping, see the duplicates: their cities, their update dates. keep=False shows all rows in a duplicate group, not just the extras — exactly what you want for inspection.
user_id name city updated_at
0 1 user_1 Denver 2025-01-05
1 1 user_1 NYC 2025-02-19
2 2 user_2 SF 2025-01-01
3 3 user_3 Austin 2025-01-11
4 3 user_3 NaN 2025-02-01
dups = up[up.duplicated(subset=["user_id"], keep=False)]
print("duplicate rows:", len(dups))
print(dups.sort_values(["user_id", "updated_at"]).head(8))
duplicate rows: 240
user_id name city updated_at
0 1 user_1 Denver 2025-01-05
1 1 user_1 NYC 2025-02-19
3 3 user_3 Austin 2025-01-11
4 3 user_3 NaN 2025-02-01
8 7 user_7 NYC 2025-01-17
9 7 user_7 Seattle 2025-02-07
10 8 user_8 Boston 2025-01-19
11 8 user_8 Seattle 2025-02-15
keep=False shows all rows that participate in a duplicate group, not just the extras. That is what you want for inspection.
Edge cases & gotchas (2)
- duplicated default keeps first
duplicated() defaults to keep='first', which marks ONLY the duplicates as True. keep=False marks all members of a duplicate group as True — what you want for inspection.
- Subset matters
duplicated() on the full row vs subset=['user_id'] gives very different counts. Be explicit.
Dedup user_profiles — keep latest updated_at per user.
[Meta · CDP] Produce one row per user with the freshest profile. The assert is_unique at the end is the answer — never trust dedup by inspection in production.
user_id name city updated_at
0 1 user_1 Denver 2025-01-05
1 1 user_1 NYC 2025-02-19
2 2 user_2 SF 2025-01-01
3 3 user_3 Austin 2025-01-11
4 3 user_3 NaN 2025-02-01
up_clean = (up.sort_values(["user_id", "updated_at"],
ascending=[True, False])
.drop_duplicates(subset=["user_id"], keep="first"))
assert up_clean["user_id"].is_unique
print("clean shape:", up_clean.shape)
print(up_clean.head())
clean shape: (200, 4)
user_id name city updated_at
1 1 user_1 NYC 2025-02-19
2 2 user_2 SF 2025-01-01
4 3 user_3 NaN 2025-02-01
5 4 user_4 Boston 2025-01-23
6 5 user_5 NYC 2025-01-18
The assert is the answer. Always prove uniqueness after a dedup; never trust by inspection.
Edge cases & gotchas (2)
- drop_duplicates preserves first sorted row
If you didn't sort by updated_at DESC, drop_duplicates keeps the wrong row silently. Always sort before drop_duplicates.
Fix · Python
up_clean = (up.sort_values(["user_id", "updated_at"], ascending=[True, False])
              .drop_duplicates(subset=["user_id"], keep="first"))
assert up_clean["user_id"].is_unique
- Tied updated_at
If two rows share updated_at, drop_duplicates picks one arbitrarily. Add a deterministic secondary sort key.
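This dataset has no ingestion-sequence column, so here is a sketch that uses the row's original position as a stand-in tiebreaker (in production you would use a load timestamp or source sequence number):
up_clean = (up.reset_index()                     # 'index' = original row position, the tiebreak
              .sort_values(["user_id", "updated_at", "index"],
                           ascending=[True, False, False])
              .drop_duplicates(subset=["user_id"], keep="first")
              .drop(columns="index"))
assert up_clean["user_id"].is_unique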
Spend per country (users + purchases).
[Amazon · Pricing] Spend by user country with purchase count, total revenue, and avg basket. Inner vs left join is a definition choice — state which you picked.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
spend = (users.merge(purchases, on="user_id", how="inner")
.groupby("country", as_index=False)
.agg(purchases=("purchase_id", "count"),
revenue =("amount", "sum"),
avg_basket=("amount", "mean"))
.sort_values("revenue", ascending=False))
print(spend.round(2))
country purchases revenue avg_basket
10 US 86 108088.71 1271.63
6 IN 46 4018.69 89.30
5 FR 31 3251.37 108.38
3 BR 28 2788.20 99.58
11 \N 21 2389.74 113.80
8 UK 29 1891.26 65.22
7 JP 23 1839.74 79.99
4 DE 13 1093.09 84.08
13 in 4 591.46 147.86
0 4 139.98 34.99
12 br 2 80.52 40.26
9 UK 2 54.98 27.49
2 <NULL> 1 6.21 6.21
...
[truncated; 1 more lines]
Inner vs left join: inner gives spend of countries that purchased; left gives spend per country with NaN for non-purchasing countries. Both valid; state your choice.
Edge cases & gotchas (2)
- Inner vs left changes denominators
Inner drops countries with no purchases. Left keeps them with NaN. Pick by the question (‘avg basket among purchasers’ vs ‘avg basket per country, NaN if none’).
- groupby with NaN in key
Default drops NaN groups. dropna=False surfaces them — but merging on NaN keys is generally a DQ bug to fix upstream.
DQ assert — no negative amounts in purchases.
[Amazon · Nightly DQ] A nightly DQ check before publishing the curated revenue table. Failing loudly (raise) is the point — a silent print('warning') in a pipeline becomes a corrupted dashboard tomorrow.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
bad = purchases[purchases["amount"] < 0]
if len(bad):
raise ValueError(f"{len(bad)} negative purchase amounts")
print("OK — no negative amounts; n=", len(purchases))
[error] ValueError: 5 negative purchase amounts
Failing loudly is the point. A silent print('warning') in a pipeline becomes a corrupted dashboard tomorrow.
Edge cases & gotchas (2)
- Soft ‘warning’ vs hard fail
A print() in a pipeline becomes invisible. raise ValueError or log at ERROR level so the orchestrator marks the task failed.
Fix · Python
bad = purchases[purchases["amount"] < 0]
if len(bad):
    raise ValueError(f"DQ failed: {len(bad)} negative amounts")
- Zero-row table is not pass
An empty purchases table satisfies ‘no negative amounts’ trivially — but is itself a DQ failure. Add an assert len(purchases) > 0 (or a freshness check).
Find missing dates in daily_metrics_gapped.
[Netflix · Date-spine DQ] Surface which days are missing before any chart consumer notices. Index.difference beats set(spine) - set(...) on big indexes — pandas keeps order and dtype.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-06 80359.13 9912
spine = pd.date_range(dmg["ds"].min(), dmg["ds"].max(), freq="D")
missing = spine.difference(dmg["ds"])
print(f"{len(missing)} missing days; first 5:")
print(list(missing[:5]))
14 missing days; first 5:
[Timestamp('2025-01-05 00:00:00'), Timestamp('2025-01-25 00:00:00'), Timestamp('2025-02-08 00:00:00'), Timestamp('2025-02-12 00:00:00'), Timestamp('2025-02-13 00:00:00')]
Index.difference beats set(spine) - set(dmg.ds) on big indexes — pandas keeps order and preserves dtype.
Edge cases & gotchas (2)
- Index.difference vs set difference
pd.date_range returns a DatetimeIndex. .difference() preserves dtype and order; set(spine) - set(dmg.ds) returns a Python set (unordered, slower).
- Time of day in ‘ds’
If ds is a Timestamp with non-zero time, the spine (midnight-anchored) doesn't match. Floor with .dt.normalize() first.
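A minimal sketch of that flooring step before rebuilding the spine:
dmg["ds"] = pd.to_datetime(dmg["ds"]).dt.normalize()   # anchor timestamps to midnight
spine = pd.date_range(dmg["ds"].min(), dmg["ds"].max(), freq="D")
missing = spine.difference(dmg["ds"])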
Forward-fill missing daily_dau values after spine join.
[Netflix · Imputation] A dashboard prefers ‘last-known-value’ over zero for short gaps. State the imputation choice loudly — ffill, zero, and interpolate have very different downstream meaning. Never pick silently.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-06 80359.13 9912
spine = pd.date_range(dmg["ds"].min(), dmg["ds"].max(), freq="D")
filled = (pd.DataFrame({"ds": spine})
.merge(dmg, on="ds", how="left")
.sort_values("ds"))
filled["daily_dau"] = filled["daily_dau"].ffill()
print(filled.head(10))
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486.0
1 2025-01-02 79294.66 10292.0
2 2025-01-03 55245.52 10806.0
3 2025-01-04 51104.66 9132.0
4 2025-01-05 NaN 9132.0
5 2025-01-06 80359.13 9912.0
6 2025-01-07 38076.88 11239.0
7 2025-01-08 36554.98 8400.0
8 2025-01-09 40941.61 9922.0
9 2025-01-10 50385.72 12868.0
State the choice: ffill, zero, or interpolate are all valid imputations with very different downstream meaning. Never pick silently.
Edge cases & gotchas (2)
- ffill before any value
If the very first row is NaN, ffill leaves it NaN. Either left-pad with a default or accept the head-NaN semantics.
Fix · Python
filled["daily_dau"] = filled["daily_dau"].ffill().fillna(0)
- ffill across user/group boundaries
Plain .ffill() on a stacked frame leaks values across users. Use .groupby('user_id').ffill() if the data is multi-entity.
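A sketch of the per-group form, assuming a hypothetical stacked frame df with user_id, ds, and value columns:
df = df.sort_values(["user_id", "ds"])
df["value"] = df.groupby("user_id")["value"].ffill()   # the fill never crosses a user boundary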
Z-score outliers in orders.amount.
[Apple · Trust Eng] A first-pass anomaly screen before deeper investigation. 3σ assumes normality; order amounts are usually log-normal — log-transform first or use IQR / MAD in production.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
o = orders.copy()
mu, sd = o["amount"].mean(), o["amount"].std()
o["z"] = (o["amount"] - mu) / sd
outliers = o[o["z"].abs() > 3].sort_values("z", ascending=False)
print(f"{len(outliers)} outliers (|z|>3)")
print(outliers.head().round(2))
9 outliers (|z|>3)
order_id user_id order_date amount z
445 446 176 2025-02-25 958.35 6.96
447 448 168 2025-02-25 919.86 6.65
14 15 70 2025-01-02 890.88 6.42
197 198 112 2025-01-25 830.40 5.93
309 310 36 2025-02-08 820.84 5.85
3σ assumes normality. Order amounts are usually log-normal — log-transform first, or use IQR / robust z-score (median + MAD).
Edge cases & gotchas (2)
- Sample contamination
Computing mean & std INCLUDING the outliers makes the thresholds too lax. Use a robust stat (median + MAD) or iteratively trim before re-computing.
Fix · Python
from scipy import stats
mad = stats.median_abs_deviation(orders["amount"], scale="normal")
median = orders["amount"].median()
robust_z = (orders["amount"] - median) / mad
outliers = orders[robust_z.abs() > 3.5]
- n=1 partition
If you compute z within tiny groups, std is NaN/0. Filter groups with at least N rows before flagging.
Cast clickstream.event_time to datetime; sort by user, time.
[Meta · Reels Sessionization Prep] Pre-sessionization shape: per-user events ordered chronologically. reset_index after sort or your iloc calls downstream lie.
user_id event_time page
0 1 2025-03-01 08:45:00 home
1 1 2025-03-01 08:55:00 search
2 1 2025-03-01 09:00:00 cart
3 1 2025-03-01 09:01:00 checkout
4 1 2025-03-01 09:46:00 search
cs2 = cs.copy()
cs2["event_time"] = pd.to_datetime(cs2["event_time"])
cs2 = cs2.sort_values(["user_id", "event_time"]).reset_index(drop=True)
print(cs2.head(8))
user_id event_time page
0 1 2025-03-01 08:45:00 home
1 1 2025-03-01 08:55:00 search
2 1 2025-03-01 09:00:00 cart
3 1 2025-03-01 09:01:00 checkout
4 1 2025-03-01 09:46:00 search
5 1 2025-03-01 09:56:00 home
6 1 2025-03-01 09:57:00 cart
7 1 2025-03-01 09:59:00 home
Why reset_index: after a sort, the original index is misleading. Reset it before you start using iloc or assigning new columns.
Edge cases & gotchas (2)
- to_datetime fails silently with errors='ignore'
pd.to_datetime(s, errors='ignore') returns the original strings on parse failure — silent dtype regression. Use errors='raise' or 'coerce' (NaT) and DQ-flag.
Fix · Python
cs["event_time"] = pd.to_datetime(cs["event_time"], errors="coerce")
bad = cs["event_time"].isna().sum()
if bad:
    raise ValueError(f"{bad} unparseable event_time rows")
- Unsorted before window ops
Sessionization, gaps-and-islands, all need sorted order. reset_index after sort or downstream iloc lies.
Pivot sessions: rows=date, cols=platform, values=count.
[Meta · Platform Mix Chart] Sessions wide format: rows = date, columns = platform, values = count. fill_value=0 is the small but critical detail — without it, days with no sessions for one platform become NaN and break downstream sums.
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
pivot = (sessions.groupby(["session_date", "platform"]).size()
.unstack(fill_value=0)
.reset_index())
print(pivot.head(10))
platform session_date Android WEB android iOS ios web
0 2025-01-01 0 1 10 0 1 0
1 2025-01-02 0 0 4 0 7 1
2 2025-01-03 0 0 7 0 4 1
3 2025-01-04 0 0 3 0 3 6
4 2025-01-05 0 0 2 0 6 4
5 2025-01-06 0 0 3 0 7 2
6 2025-01-07 0 0 6 0 4 2
7 2025-01-08 1 0 4 0 3 4
8 2025-01-09 0 0 7 0 1 4
9 2025-01-10 0 0 3 0 8 1
fill_value=0 is a small but critical detail — without it, days where one platform had no sessions become NaN, which often breaks downstream sums.
Edge cases & gotchas (2)
- Pivot without fill_value
Days with no sessions on one platform become NaN. Downstream .sum() may then propagate NaN unexpectedly.
- Categorical platforms
If platform is dtype='category' with unobserved categories, the pivot includes empty columns. observed=True (groupby) or .cat.remove_unused_categories() to clean.
Tier 3 · Window Functions, CASE, Subqueries
7-day moving average of daily_revenue.
[Netflix · Trend Smoothing] Smooth the daily noise for an exec chart. min_periods=1 avoids NaN at the start; for ‘official’ 7-day averages you may want min_periods=7 instead — a deliberate choice.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
dm2 = dm.sort_values("ds").copy()
dm2["rev_7d"] = dm2["daily_revenue"].rolling(window=7, min_periods=1).mean()
print(dm2.head(10).round(2))
ds daily_revenue daily_dau rev_7d
0 2025-01-01 50008.19 9486 50008.19
1 2025-01-02 79294.66 10292 64651.42
2 2025-01-03 55245.52 10806 61516.12
3 2025-01-04 51104.66 9132 58913.26
4 2025-01-05 55714.88 9883 58273.58
5 2025-01-06 80359.13 9912 61954.51
6 2025-01-07 38076.88 11239 58543.42
7 2025-01-08 36554.98 8400 56621.53
8 2025-01-09 40941.61 9922 51142.52
9 2025-01-10 50385.72 12868 50448.27
min_periods=1 avoids NaN at the start. For ‘official’ 7-day averages you may want min_periods=7 instead — a deliberate choice.
Edge cases & gotchas (2)
- min_periods choice
min_periods=1 avoids NaN at the head but produces 1-day ‘averages’ for the first 6 rows — misleading. min_periods=7 returns NaN until 7 rows accrue, which is the honest 7-day average.
Fix · Python
dm["rev_7d"] = dm["daily_revenue"].rolling(window=7, min_periods=7).mean()
- Centered vs trailing
rolling(7) is trailing by default. For a smoothed trend chart, rolling(7, center=True) centers — but you cannot use it operationally (uses future data).
Resample daily_metrics to weekly.
[Meta · Weekly Roll-up] Weekly business review wants one-row-per-week from the daily roll-up. Anchor matters: 'W-MON' ends on Monday — your team's week definition (Sun-Sat / Mon-Sun / ISO) is a real source of metric mismatch.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
weekly = (dm.set_index("ds")
.resample("W-MON")
.agg(week_revenue=("daily_revenue", "sum"),
avg_dau =("daily_dau", "mean"))
.reset_index())
print(weekly.head(8).round(2))
ds week_revenue avg_dau
0 2025-01-06 371727.04 9918.50
1 2025-01-13 345131.33 10997.50
2 2025-01-20 387115.95 10138.57
3 2025-01-27 293508.59 9598.29
4 2025-02-03 320906.36 10730.29
5 2025-02-10 308757.49 9861.29
6 2025-02-17 349706.55 11612.57
7 2025-02-24 345706.24 10180.57
Anchor matters: 'W-MON' ends the week on Monday. Your team's week definition (Sun-Sat? Mon-Sun? ISO?) is a real source of metric mismatches across teams.
Edge cases & gotchas (2)
- Resample anchor
'W' defaults to Sunday. 'W-MON' ends on Monday. Mismatch with the warehouse week definition is a common source of metric drift.
- Partial week at the boundaries
First and last week may have <7 days. Resample sums NaN as 0 by default; the first/last week is undercounted. Drop or label.
Fix · Python
weekly = dm.set_index("ds").resample("W-MON").agg(
    week_revenue=("daily_revenue", "sum"),
    days_in_week=("daily_revenue", "size"))
weekly = weekly[weekly["days_in_week"] == 7]  # drop partial weeks
MTD running revenue with month reset.
[Amazon · MTD Dashboard] MTD revenue on every row, reset on month-start. .dt.to_period('M') as a grouper is the pandas equivalent of DATE_TRUNC('month', ds).
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
dm2 = dm.sort_values("ds").copy()
dm2["mtd_revenue"] = (dm2.groupby(dm2["ds"].dt.to_period("M"))["daily_revenue"]
.cumsum())
print(dm2.head(35).tail(10).round(2))
ds daily_revenue daily_dau mtd_revenue
24 2025-01-25 40328.10 13136 1319669.74
25 2025-01-26 43681.34 9966 1363351.08
26 2025-01-27 34131.83 9913 1397482.91
27 2025-01-28 54908.78 10882 1452391.69
28 2025-01-29 57999.53 11443 1510391.22
29 2025-01-30 33800.89 11418 1544192.11
30 2025-01-31 44537.61 11610 1588729.72
31 2025-02-01 33832.95 9053 33832.95
32 2025-02-02 39214.40 9553 73047.35
33 2025-02-03 56612.20 11153 129659.55
The .dt.to_period('M') grouper is the pandas equivalent of DATE_TRUNC('month', ds) — clean and idiomatic.
Edge cases & gotchas (2)
- to_period('M') vs Grouper
.dt.to_period('M') is what you want for month labels. pd.Grouper(freq='M') works but resamples — different semantics if the index isn't sorted.
- Late-arriving rows
MTD that recomputes on every run isn't snapshotted. If a row for Dec 15 lands today, December's MTD numbers change. Snapshot or label as ‘rolling MTD’.
DAU and rolling 7-day WAU from logins.
[Meta · DAU/WAU] Two engagement metrics from one source. Rolling distinct count is hard — pandas has no built-in; the loop is O(N×W). For real scale, deque + counter or HyperLogLog sketch.
user_id login_date
0 1 2025-03-02
1 1 2025-03-04
2 1 2025-03-05
3 1 2025-03-06
4 1 2025-03-07
daily = (logins.groupby("login_date")["user_id"].nunique()
.rename("dau").reset_index()
.sort_values("login_date"))
# trailing-7-day WAU via list comprehension (acceptable for ~120 days)
def trail_unique(d, days=7):
start = d - pd.Timedelta(days=days-1)
mask = (logins["login_date"] >= start) & (logins["login_date"] <= d)
return logins.loc[mask, "user_id"].nunique()
daily["wau_7d"] = [trail_unique(d) for d in daily["login_date"]]
print(daily.tail(8))
login_date dau wau_7d
22 2025-03-23 53 100
23 2025-03-24 69 100
24 2025-03-25 65 100
25 2025-03-26 63 100
26 2025-03-27 68 100
27 2025-03-28 69 100
28 2025-03-29 66 100
29 2025-03-30 60 100
Rolling distinct is hard. Pandas has no built-in; the loop above is O(N×W). For big data, use a deque + counter or HyperLogLog sketch.
Edge cases & gotchas (2)
- Distinct in a rolling window is not a built-in
Pandas .rolling() doesn't support nunique. The naive list-comp is O(N×W) — fine for 365 days, awful for 4 years.
Fix · Python
# Faster: deque + Counter (count each user once per day so adds and evictions match)
from collections import deque, Counter
window, days = deque(), 7
counter = Counter()
result = {}
for d, group in logins.groupby("login_date"):
    day_users = set(group["user_id"])
    for u in day_users:
        counter[u] += 1
    window.append((d, day_users))
    while window and (d - window[0][0]).days >= days:
        old_d, old_users = window.popleft()
        for u in old_users:
            counter[u] -= 1
            if counter[u] == 0:
                del counter[u]
    result[d] = len(counter)
- Stitching identities
If the same human logs in from two devices, raw user_id double-counts in DAU. The identity-graph step belongs upstream.
WoW % change of WAU from weekly_metrics.
[Netflix · WoW Growth] The Netflix weekly business review wants WoW % change of WAU. pct_change() handles the first-row NaN gracefully — don't reinvent LAG with shift unless you need a non-1 lag.
week_start wau
0 2024-01-01 10000
1 2024-01-08 10230
2 2024-01-15 10455
3 2024-01-22 10819
4 2024-01-29 10434
wm2 = wm.sort_values("week_start").copy()
wm2["wow_pct"] = (wm2["wau"].pct_change() * 100).round(2)
print(wm2.head(10))
week_start wau wow_pct
0 2024-01-01 10000 NaN
1 2024-01-08 10230 2.30
2 2024-01-15 10455 2.20
3 2024-01-22 10819 3.48
4 2024-01-29 10434 -3.56
5 2024-02-05 10972 5.16
6 2024-02-12 11048 0.69
7 2024-02-19 11454 3.67
8 2024-02-26 11077 -3.29
9 2024-03-04 10499 -5.22
pct_change handles the NaN on the first row gracefully. Don't reinvent the LAG with shift unless you need a non-1 lag.
Edge cases & gotchas (2)
- First-row pct_change is NaN
.pct_change() can't compute change for the first observation — NaN is correct. Drop or label, don't fillna(0) (which falsely says ‘no growth’).
- Zero baseline
If wau was 0 last week, pct_change is inf. Guard with NULLIF logic or report as 'new growth' separately.
Fix · Python
import numpy as np
wm["wow_pct"] = wm["wau"].pct_change().replace([np.inf, -np.inf], np.nan) * 100
Days between each user's first and last purchase.
[Amazon · Customer Lifespan] Days between each user's first and last purchase — a ‘lifespan’ metric for the lifecycle team. Edge: users with one purchase get 0 days. Decide whether that is ‘0 days’ or ‘censored’ before reporting.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
span = (purchases.groupby("user_id")["purchase_date"]
.agg(first="min", last="max")
.reset_index())
span["days_active"] = (span["last"] - span["first"]).dt.days
print(span.sort_values("days_active", ascending=False).head(8))
user_id first last days_active
91 149 2024-11-01 2024-12-30 59
50 82 2024-11-01 2024-12-28 57
26 45 2024-11-04 2024-12-30 56
105 170 2024-11-05 2024-12-31 56
11 19 2024-11-01 2024-12-27 56
80 127 2024-11-06 2024-12-31 55
111 177 2024-11-07 2024-12-31 54
98 158 2024-11-08 2024-12-30 52
Edge: users with one purchase get 0 days. Decide whether that is ‘0 days active’ or ‘censored’ before reporting.
Edge cases & gotchas (2)
- Single-purchase users
first == last for one-purchase users ⇒ days_active = 0. ‘0 days lifespan’ vs ‘censored’ is a definition fight worth having.
- Time zone of purchase_date
If timestamps are UTC and the user is in HST, ‘day’ boundaries shift. .dt.tz_convert before .dt.date.
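A sketch of the local-day bucketing, assuming purchase_date were a UTC timestamp (in this dataset it is a naive date, so this is illustration only; the zone name is arbitrary):
p = purchases.copy()
p["local_day"] = (p["purchase_date"]
                    .dt.tz_localize("UTC")
                    .dt.tz_convert("Pacific/Honolulu")
                    .dt.normalize())
span_local = p.groupby("user_id")["local_day"].agg(first="min", last="max")
span_local["days_active"] = (span_local["last"] - span_local["first"]).dt.days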
Top 3 products per category by revenue.
[Amazon · Homepage Ranking] Top 3 products per category by revenue, in pandas. groupby + head is the most readable form; avoid apply(lambda d: d.nlargest(3, ...)) — slower and harder to read.
product_id product_name category revenue
0 1 prod_1 Electronics 435.0
1 2 prod_2 Electronics 25.0
2 3 prod_3 Electronics 384.0
3 4 prod_4 Electronics 1054.0
4 5 prod_5 Electronics 485.0
top3 = (products.sort_values(["category", "revenue"], ascending=[True, False])
.groupby("category", as_index=False)
.head(3))
print(top3.head(15))
product_id product_name category revenue
18 19 prod_19 Books 2887.0
10 11 prod_11 Books 2420.0
14 15 prod_15 Books 2021.0
27 28 prod_28 Clothing 9770.0
29 30 prod_30 Clothing 7190.0
28 29 prod_29 Clothing 2972.0
3 4 prod_4 Electronics 1054.0
50 1 prod_1 Electronics 999.0
4 5 prod_5 Electronics 485.0
49 50 prod_50 Food 8731.0
41 42 prod_42 Food 2667.0
42 43 prod_43 Food 2252.0
39 40 prod_40 Sports 12777.0
...
[truncated; 2 more lines]
'groupby + head' is the most readable top-N-per-group pattern in pandas. Avoid the apply(lambda d: d.nlargest(3, ...)) form — slower and harder to read.
Edge cases & gotchas (2)
- Ties at #3
groupby + head(3) picks the first 3 by sort order, silently dropping ties. For shelf rotation that should keep all ties, use rank with method='dense' filtered by <= 3.
Fix · Python
products["rk"] = (products.groupby("category")["revenue"]
                    .rank(method="dense", ascending=False))
top3 = products[products["rk"] <= 3]
- Stable sort across runs
Without a tiebreak in sort_values, two products with identical revenue swap order across runs. Add product_id as secondary.
Sessionize the clickstream — 30-minute idle gap.
[Meta · Reels Sessionization] Sessionize by 30-min idle gap, in pandas. Identical mental model to the SQL form — worth practicing both versions side-by-side until you can switch fluently in either round.
user_id event_time page
0 1 2025-03-01 08:45:00 home
1 1 2025-03-01 08:55:00 search
2 1 2025-03-01 09:00:00 cart
3 1 2025-03-01 09:01:00 checkout
4 1 2025-03-01 09:46:00 search
cs2 = cs.sort_values(["user_id", "event_time"]).copy()
prev = cs2.groupby("user_id")["event_time"].shift()
new_session = (cs2["event_time"] - prev) > pd.Timedelta(minutes=30)
new_session = new_session.fillna(True)
cs2["session_id"] = new_session.groupby(cs2["user_id"]).cumsum()
agg = (cs2.groupby(["user_id", "session_id"])
.agg(start=("event_time", "min"),
end =("event_time", "max"),
pages=("page", "count"))
.reset_index())
print(agg.head(8))
user_id session_id start end pages
0 1 0 2025-03-01 08:45:00 2025-03-01 09:01:00 4
1 1 1 2025-03-01 09:46:00 2025-03-01 11:51:00 24
2 1 2 2025-03-01 12:36:00 2025-03-01 12:36:00 1
3 2 0 2025-03-01 08:01:00 2025-03-01 09:13:00 12
4 2 1 2025-03-01 10:43:00 2025-03-01 11:49:00 9
5 2 2 2025-03-01 12:34:00 2025-03-01 13:53:00 22
6 3 0 2025-03-01 08:02:00 2025-03-01 08:02:00 1
7 3 1 2025-03-01 08:47:00 2025-03-01 08:57:00 2
Identical mental model to SQL. Worth practicing the SQL and pandas versions side by side until you can switch fluently.
Edge cases & gotchas (2)
- Per-user shift leak across users
If you forget the groupby in shift, the previous user's last event becomes the next user's prev_time. Always groupby user.
Fix · Python
# CORRECT: per-user lag
prev = cs.groupby("user_id")["event_time"].shift()
- Threshold per surface
30 minutes is canonical for web; mobile push 5 min; video 90s. The threshold is product knowledge, not a fixed constant.
Build the signup → activation → purchase funnel.
[Meta · Funnel Health] Build the signup → activation → purchase funnel from the events table. Always order-guard — funnels that ignore order count ‘purchase-before-signup’ as a successful conversion. They look great. They are wrong.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
firsts = (ev.groupby(["user_id", "event_type"])["event_date"]
.min().unstack())
step1 = firsts["signup"].notna().sum()
step2 = ((firsts["activation"].notna())
& (firsts["activation"] >= firsts["signup"])).sum()
step3 = ((firsts["purchase"].notna())
& (firsts["purchase"] >= firsts[["signup","activation"]].max(axis=1))
).sum()
print({"signup": int(step1),
"activation": int(step2),
"purchase": int(step3),
"act_rate": round(step2/step1, 3),
"buy_rate": round(step3/max(step2,1), 3)})
{'signup': 498, 'activation': 282, 'purchase': 127, 'act_rate': np.float64(0.566), 'buy_rate': np.float64(0.45)}
Always order-guard. Funnels that ignore event order count ‘purchase before signup’ as a successful conversion. They look great. They are wrong.
Edge cases & gotchas (2)
- User firing ‘signup’ multiple times
Some pipelines emit signup on every onboarding step. groupby user_id + event_type with .min() collapses to first occurrence — what you want.
- Skipped intermediate steps
If a user purchases without an ‘activation’ event, strict ordering excludes them. Decide: strict (must hit each step) or any-later-step counts.
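If you go with the lenient definition, a sketch of the any-later-step variant on the same firsts frame, where a purchase counts as long as it follows signup even when the activation event is missing:
step3_lenient = ((firsts["purchase"].notna())
                 & (firsts["purchase"] >= firsts["signup"])).sum()
print("purchase (lenient):", int(step3_lenient))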
Cohort retention table from users + logins.
[Meta · Retention Triangle] The shape that ships in every weekly growth review: rows = cohort, cols = age in weeks, cells = % retained. Heatmap straight from this table.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
u = users[["user_id", "signup_date"]].copy()
u["cohort_week"] = u["signup_date"].dt.to_period("W").dt.start_time
j = logins.merge(u, on="user_id")
j["weeks_since"] = ((j["login_date"] - j["cohort_week"]).dt.days // 7)
j = j[(j["weeks_since"] >= 0) & (j["weeks_since"] <= 4)]
retention = (j.pivot_table(index="cohort_week", columns="weeks_since",
values="user_id", aggfunc=pd.Series.nunique,
fill_value=0))
sizes = u.groupby("cohort_week")["user_id"].nunique()
retention_pct = retention.div(sizes, axis=0).mul(100).round(1)
print(retention_pct.head(8))
weeks_since 0.0 1.0 2.0 3.0 4.0
cohort_week
2024-09-30 NaN NaN NaN NaN NaN
2024-10-07 NaN NaN NaN NaN NaN
2024-10-14 NaN NaN NaN NaN NaN
2024-10-21 NaN NaN NaN NaN NaN
2024-10-28 NaN NaN NaN NaN NaN
2024-11-04 NaN NaN NaN NaN NaN
2024-11-11 NaN NaN NaN NaN NaN
2024-11-18 NaN NaN NaN NaN NaN
The shape that ships: rows = cohorts, cols = age, cells = % retained. Heatmap straight from this table. (In this sample dataset the early cohorts show NaN because the logins table only begins in March 2025, so their weeks 0-4 predate any observed login; see the censoring gotcha below.)
Edge cases & gotchas (2)
- Cohorts near data edge are right-censored
A signup from last week has at most W0 observed; W2 is literally impossible to compute. Mark as NaN or filter.
Fix · Python
max_week = (logins["login_date"].max() - u["cohort_week"]) // pd.Timedelta(weeks=1)
# Only include cohorts with N+ weeks of full observation
- ISO vs anchored week
.dt.to_period('W') uses Mon–Sun ISO week; the warehouse may use Sun–Sat. Mismatch ⇒ cohorts shifted by one day.
Tier 4 · CTEs, Sessionization, Cohorts
Per-user longest consecutive login streak.
[Meta · Engagement Streaks] Per-user longest consecutive-day login streak, in pandas. Same trick as the SQL version: date − row_number is constant within a consecutive run; group by it.
user_id login_date
0 1 2025-03-02
1 1 2025-03-04
2 1 2025-03-05
3 1 2025-03-06
4 1 2025-03-07
l = (logins.drop_duplicates(["user_id", "login_date"])
.sort_values(["user_id", "login_date"]).copy())
l["rn"] = l.groupby("user_id").cumcount()
l["grp"] = l["login_date"] - pd.to_timedelta(l["rn"], unit="D")
streaks = (l.groupby(["user_id", "grp"])["login_date"]
.agg(["min", "max", "count"])
.reset_index()
.rename(columns={"count": "streak_len"}))
longest = (streaks.groupby("user_id", as_index=False)["streak_len"]
.max()
.sort_values("streak_len", ascending=False))
print(longest.head(8))
user_id streak_len
45 46 20
46 47 17
75 76 13
94 95 13
13 14 12
99 100 12
11 12 11
27 28 11
Same trick as SQL: date − row_number is constant within a consecutive run; group by it.
Edge cases & gotchas (2)
- Duplicate (user, date) pairs poison the streak math
Two login rows for the same user/day inflate row_number, shifting the group_key. drop_duplicates on (user_id, login_date) first.
- Time zone vs ‘day’ bucket
23:30 PT and 00:30 PT are the same day for a Pacific user but different days in UTC. Bucket login_date in user-local TZ.
SCD2 as-of join: price each purchase using the price effective on its date.
[Apple · Historical Pricing] Re-price each purchase at the price effective on its date, in pandas. merge_asof is the SCD2 superpower — it picks the most recent matching row whose effective_date ≤ purchase_date.
product_id product_name price effective_date expiry_date is_current
0 1 product_1 60.01 2024-01-01 2024-04-16 N
1 1 product_1 66.01 2024-04-16 NaT Y
2 2 product_2 171.42 2024-01-01 2024-04-22 N
3 2 product_2 188.56 2024-04-22 2024-07-28 N
4 3 product_3 156.15 2024-01-01 2024-04-28 N
pdim = pd_dim.copy()
pdim["product_id"] = pdim["product_id"].astype("int64")
pdim = pdim.sort_values(["effective_date","product_id"])
p = purchases.copy().sort_values("purchase_date")
p["product_id"] = (p["product_category"].astype("category").cat.codes + 1).astype("int64")
priced = pd.merge_asof(p, pdim,
by="product_id",
left_on="purchase_date",
right_on="effective_date",
direction="backward")
print(priced[["purchase_id","purchase_date","product_id",
"amount","price"]].head(8))
[error] ValueError: Merge keys contain null values on left side
merge_asof is the SCD2 superpower in pandas: it picks the most recent matching row whose effective_date ≤ purchase_date. (The run above errors because some purchase_date values are null; merge_asof needs non-null, sorted keys on both sides, so drop or coerce them before merging.)
Edge cases & gotchas (2)
- merge_asof requires sorted keys
Both sides must be sorted on the on= column. Sort first or merge_asof raises ValueError: right keys must be sorted.
Fix · Python
pdim = pd_dim.sort_values("effective_date")
p = purchases.sort_values("purchase_date")
priced = pd.merge_asof(p, pdim, by="product_id",
                       left_on="purchase_date", right_on="effective_date",
                       direction="backward")
- Tolerance + direction
direction='nearest' can match a future row — wrong for as-of. Use 'backward' (last <= left) and add tolerance=pd.Timedelta(days=N) to bound staleness.
Detect skewed keys; salt the hot key into N buckets.
[Amazon · Spark Skew Mitigation] Detect skewed keys and salt the hot key into N buckets so a downstream Spark join can parallelize. The other side of the join must replicate the small-side rows for each salt bucket — salting only one side breaks the join.
id join_key value
0 1 key_243 189
1 2 key_225 81
2 3 hot_key 108
3 4 key_91 321
4 5 hot_key 579
import random
random.seed(42)
freq = skew["join_key"].value_counts(normalize=True).head()
print("top frequencies:")
print(freq.round(3))
N_SALTS = 8
hot = freq[freq > 0.05].index.tolist()
def salt(k):
return f"{k}__{random.randint(0, N_SALTS-1)}" if k in hot else k
skew2 = skew.copy()
skew2["join_key_salted"] = skew2["join_key"].map(salt)
print()
print("after salting (top 6):")
print(skew2["join_key_salted"].value_counts().head(6))
top frequencies:
join_key
hot_key 0.700
key_243 0.001
key_225 0.001
key_91 0.001
key_137 0.001
Name: proportion, dtype: float64
after salting (top 6):
join_key_salted
hot_key__3 108
hot_key__4 96
hot_key__1 95
...
[truncated; 4 more lines]
The other side of the join must replicate the small-side rows for each salt bucket. Salting only one side breaks the join.
Edge cases & gotchas (2)
- Salting only one side breaks the join
If you salt the fact table but not the dim, no salted key matches anything. The dim must be CROSS JOINed with the salt values.
Fix · Python
salts = pd.DataFrame({"salt": range(N_SALTS)})
dim_salted = dim.merge(salts, how="cross")
dim_salted["join_key_salted"] = (dim_salted["join_key"]
                                   .str.cat(dim_salted["salt"].astype(str), sep="__"))
- Unsalt before final aggregation
After the salted join, the final group-by must un-salt (strip the suffix) to get the real key. Forgetting this step is the most common skew-mitigation bug.
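A sketch of that final un-salt step, assuming joined is the result of the salted join and value is the measure being aggregated:
joined["join_key"] = joined["join_key_salted"].str.replace(r"__\d+$", "", regex=True)  # strip the salt suffix
final = joined.groupby("join_key", as_index=False)["value"].sum()                      # group on the real key again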
Build the org tree from employees as a nested dict.
[Amazon · HRIS JSON Export] An HR portal needs a JSON-friendly nested org tree: each manager → reports → their reports → …. Iterative DFS or bounded recursion — a cycle in manager_id would blow the stack.
employee_id full_name manager_id department salary
0 100 Dana Chen NaN Engineering 250000.0
1 201 Eve Park 100.0 Engineering 180000.0
2 202 Frank Li 100.0 Engineering 175000.0
3 203 Grace Kim 100.0 Engineering NaN
4 301 IC_301 201.0 Engineering 116962.0
children = {}
for _, r in emp.iterrows():
children.setdefault(r["manager_id"], []).append(r["employee_id"])
def build(node_id):
row = emp.loc[emp["employee_id"] == node_id].iloc[0]
return {
"id": int(node_id),
"name": row["full_name"],
"dept": row["department"],
"reports": [build(c) for c in children.get(node_id, [])],
}
roots = emp.loc[emp["manager_id"].isna(), "employee_id"].tolist()
tree = [build(r) for r in roots]
print("root:", tree[0]["name"], "->", len(tree[0]["reports"]), "direct reports")
import json
print(json.dumps(tree[0], default=str, indent=2)[:400], "...")
root: Dana Chen -> 3 direct reports
{
"id": 100,
"name": "Dana Chen",
"dept": "Engineering",
"reports": [
{
"id": 201,
"name": "Eve Park",
"dept": "Engineering",
"reports": [
{
"id": 301,
"name": "IC_301",
...
[truncated; 7 more lines]
Recursion safety: bound the depth (or use iterative DFS) — a cycle in manager_id would blow the stack.
Edge cases & gotchas (2)
- Cycle in manager_id blows the stack
Recursive build() with a cycle recurses until RecursionError. Use iterative DFS or a visited set.
Fix · Python
def build(node_id, visited=None):
    visited = visited or set()
    if node_id in visited:
        raise ValueError(f"cycle at {node_id}")
    visited = visited | {node_id}
    return {...}
- Multiple roots
If two employees have manager_id IS NULL, the tree has two roots. Decide whether the JSON is a list of trees or a synthetic single root with both as children.
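If you choose the synthetic-root route, a one-line sketch on top of the tree list built above (the id 0 / 'ALL' root is made up for illustration):
org = {"id": 0, "name": "ALL", "dept": None, "reports": tree} if len(tree) > 1 else tree[0]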
Median order amount per user.
[Apple · Robust Stats] Median order amount per user — robust to whales. Grouped quantile is built-in; for arbitrary percentiles use .agg(p90=('amount', lambda s: s.quantile(0.9))).
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
med = (orders.groupby("user_id", as_index=False)["amount"]
.quantile(0.5)
.rename(columns={"amount": "median_amount"}))
print(med.sort_values("median_amount", ascending=False).head(8).round(2))
user_id median_amount
73 85 425.90
166 184 384.46
133 149 315.26
88 102 312.58
39 44 275.96
29 34 240.75
137 153 232.63
181 99002 232.05
Grouped quantile is built-in. For arbitrary percentiles use .agg(p90=('amount', lambda s: s.quantile(0.9))).
Edge cases & gotchas (2)
- Quantile interpolation
Default quantile(0.5, interpolation='linear') interpolates between two middle rows — fractional cents result. Use 'lower' / 'higher' / 'midpoint' for currency.
Fix · Python
med = (orders.groupby("user_id")["amount"]
         .quantile(0.5, interpolation="lower"))
- Single-order users
n=1 partitions return that one value as the median. Filter to users with min order count if ‘typical’ basket is the goal.
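A sketch of that pre-filter using a transform-count mask:
n_orders = orders.groupby("user_id")["order_id"].transform("count")
med_multi = (orders[n_orders >= 2]                      # only users with 2+ orders
               .groupby("user_id", as_index=False)["amount"]
               .quantile(0.5, interpolation="lower"))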
Memory-efficient CSV read in chunks.
[Amazon · S3 Streaming] A 50-GB clickstream won't fit in RAM. Stream-aggregate with chunksize to bound memory and usecols to skip unneeded columns at parse time.
user_id event_time page
0 1 2025-03-01 08:45:00 home
1 1 2025-03-01 08:55:00 search
2 1 2025-03-01 09:00:00 cart
3 1 2025-03-01 09:01:00 checkout
4 1 2025-03-01 09:46:00 search
page_counts = pd.Series(dtype=int)
src = DATA + "clickstream.csv"
for chunk in pd.read_csv(src, chunksize=500, usecols=["page"]):
page_counts = page_counts.add(chunk["page"].value_counts(),
fill_value=0)
print(page_counts.sort_values(ascending=False).astype(int).head())
page
checkout 406
search 403
product 393
home 389
cart 377
dtype: int64
Two memory wins: chunksize bounds RAM; usecols skips columns we do not need at parse time.
Edge cases & gotchas (2)
- Loss of dtypes between chunks
If chunk 1 has user_id as int and chunk 2 has it as float (NaNs introduced), concat raises a dtype warning. Pre-declare dtypes via dtype= in read_csv.
Fix · Python
for chunk in pd.read_csv(src, chunksize=200_000,
                         usecols=["page", "user_id"],
                         dtype={"user_id": "Int64", "page": "string"}):
    ...
- Shuffled ordering between chunks
If your aggregation needs ordering (sessionization, gaps), chunking by row count is wrong — chunks aren't time-aligned. Chunk by date instead.
Validate a schema contract for orders.
[Netflix · Schema Contract] A nightly ingest must fail loudly if the upstream schema or data ranges drift. In real production: Great Expectations / pandera / Pydantic; on a whiteboard, the bare-asserts version below.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
def validate_orders(df):
expected_cols = ["order_id", "user_id", "order_date", "amount"]
missing = set(expected_cols) - set(df.columns)
assert not missing, f"missing columns: {missing}"
assert df["order_id"].is_unique, "duplicate order_id"
assert (df["amount"] > 0).all(), "non-positive amounts"
assert df["order_date"].notna().all(), "null order_date"
return True
print("orders schema OK:", validate_orders(orders))
[error] AssertionError: duplicate order_id
In a real job you would use Great Expectations / pandera / Pydantic; the bare-asserts version is what you write on a whiteboard. (And note the run above did fail loudly: this dataset really does contain a duplicate order_id, which is exactly the behavior you want from the check.)
Edge cases & gotchas (2)
- assert is stripped under -O
Python's assert compiles out under -O optimization. Use raise ValueError(...) in production DQ checks.
Fix · Python
if not df["order_id"].is_unique:
    raise ValueError("duplicate order_id")
- Schema vs values
The function checks dtypes and uniqueness. Real DQ also checks value ranges, freshness (max date), volume (row count not 0). Use Great Expectations / pandera for the full set.
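A slightly fuller, still whiteboard-grade sketch that folds in the raise-not-assert advice plus volume and freshness checks (the two-day staleness threshold is arbitrary and meant for a live table):
def validate_orders_full(df, max_staleness_days=2):
    if df.empty:
        raise ValueError("orders is empty")                                # volume
    missing = {"order_id", "user_id", "order_date", "amount"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {missing}")                    # schema
    if not df["order_id"].is_unique:
        raise ValueError("duplicate order_id")                             # uniqueness
    if (df["amount"] <= 0).any():
        raise ValueError("non-positive amounts")                           # value range
    staleness = (pd.Timestamp.today().normalize() - df["order_date"].max()).days
    if staleness > max_staleness_days:
        raise ValueError(f"stale: newest order is {staleness} days old")   # freshness
    return True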
Idempotent upsert: append only new orders to a Parquet table.
[Amazon · Idempotent ETL] A streaming-ish ETL where re-running the same batch must not double-count. Idempotency = primary key + last-write-wins. In a real warehouse: MERGE INTO; in a lakehouse: Delta / Iceberg MERGE.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
import os
out = "/tmp/orders_curated.parquet"
new_batch = orders.head(100)
if os.path.exists(out):
existing = pd.read_parquet(out)
combined = pd.concat([existing, new_batch])
combined = combined.drop_duplicates(subset=["order_id"], keep="last")
else:
combined = new_batch
combined.to_parquet(out, index=False)
print("rows in curated:", len(combined))
rows in curated: 100
Idempotency = primary key + last-write-wins. In a real warehouse this is MERGE INTO; in a lakehouse it is Delta / Iceberg MERGE.
Edge cases & gotchas (2)
- Atomic vs partial writes
to_parquet writing then crashing leaves a half-written file. Stage to a temp path and rename atomically, or use Delta/Iceberg.
Fix · Python
import os
tmp = out + ".tmp"
combined.to_parquet(tmp, index=False)
os.replace(tmp, out)  # atomic on POSIX FS
- Concurrent writers
Two ETL runs writing to the same Parquet at once corrupt each other. Use a per-batch path + a manifest file, or a lakehouse format with optimistic concurrency.
Build a Type-2 SCD from a stream of price updates.
[Apple · Catalog History] Build a Type-2 SCD from a stream of price updates. The invariant: for any product, exactly one row has is_current='Y', and intervals do not overlap. Test both with asserts.
(see linked tables)
updates = pd.DataFrame([
{"product_id": 1, "price": 60.01, "change_date": "2024-01-01"},
{"product_id": 1, "price": 66.01, "change_date": "2024-04-16"},
{"product_id": 2, "price": 171.42, "change_date": "2024-01-01"},
{"product_id": 2, "price": 188.56, "change_date": "2024-04-22"},
{"product_id": 2, "price": 199.99, "change_date": "2024-07-28"},
])
updates["change_date"] = pd.to_datetime(updates["change_date"])
updates = updates.sort_values(["product_id", "change_date"])
updates["effective_date"] = updates["change_date"]
updates["expiry_date"] = (updates.groupby("product_id")["change_date"]
.shift(-1))
updates["is_current"] = np.where(updates["expiry_date"].isna(), "Y", "N")
print(updates)
product_id price change_date effective_date expiry_date is_current
0 1 60.01 2024-01-01 2024-01-01 2024-04-16 N
1 1 66.01 2024-04-16 2024-04-16 NaT Y
2 2 171.42 2024-01-01 2024-01-01 2024-04-22 N
3 2 188.56 2024-04-22 2024-04-22 2024-07-28 N
4 2 199.99 2024-07-28 2024-07-28 NaT Y
The invariant: for any product, exactly one row has is_current='Y', and intervals do not overlap. Test both with asserts.
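A minimal pair of asserts over the updates frame just built (swap in raises for production, per the earlier -O gotcha):
# Invariant 1: exactly one current row per product
current_per_product = (updates["is_current"] == "Y").groupby(updates["product_id"]).sum()
assert (current_per_product == 1).all(), "products with != 1 current row"
# Invariant 2: each row expires exactly when the next one starts (no overlap, no gap)
next_start = updates.groupby("product_id")["effective_date"].shift(-1)
assert ((updates["expiry_date"] == next_start) | next_start.isna()).all(), "window overlap/gap"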
Edge cases & gotchas (2)
- Out-of-order updates
If updates arrive with change_date going backwards, the SCD2 build inserts a row that overlaps prior windows. Sort and validate before assigning effective/expiry.
- Same-day price flip-flops
Two updates with the same change_date for the same product produce a zero-width window — invariant violation. Dedup or flag.
Validate existing product_dim — no overlaps, exactly one current row per product.
[Apple · SCD2 Health Check] SCD2 health checks are first-class data products. Validate product_dim: no overlapping windows, exactly one current row per product. Run on every load and alert when the count regresses.
product_id product_name price effective_date expiry_date is_current
0 1 product_1 60.01 2024-01-01 2024-04-16 N
1 1 product_1 66.01 2024-04-16 NaT Y
2 2 product_2 171.42 2024-01-01 2024-04-22 N
3 2 product_2 188.56 2024-04-22 2024-07-28 N
4 3 product_3 156.15 2024-01-01 2024-04-28 N
def overlap_count(g):
g = g.sort_values("effective_date")
end = g["expiry_date"].fillna(pd.Timestamp("2999-01-01"))
nxt = g["effective_date"].shift(-1)
return (nxt < end).fillna(False).sum()
bad_overlaps = (pd_dim.groupby("product_id", group_keys=False)
.apply(overlap_count).sum())
current_per_product = pd_dim.groupby("product_id")["is_current"].apply(
lambda s: (s == "Y").sum())
print("overlapping intervals:", int(bad_overlaps))
print("products with !=1 current row:",
int((current_per_product != 1).sum()))
overlapping intervals: 11
products with !=1 current row: 10
SCD2 health checks are first-class data products. Run them on every load and alert when they regress.
Edge cases & gotchas (2)
- expiry_date NULL convention
NULL = current is conventional, but the analytic query expiry_date > CURRENT_DATE excludes NULL rows (NULL > anything is NULL). Use COALESCE(expiry_date, '9999-12-31') or test IS NULL explicitly.
- Boundary inclusivity
If [effective, expiry] is closed-closed, a fact dated exactly on the boundary matches BOTH rows. Standard is half-open [effective, expiry). Document and assert.
Fix · Python
# Half-open semantics in the as-of join condition
mask = (fact["dt"] >= dim["effective"]) & (fact["dt"] < dim["expiry"].fillna(pd.Timestamp.max))
Tier 5 · Senior Patterns
PySpark anti-join: users with no purchases.
[Meta · Lifecycle in Spark] Same anti-join as the SQL version, but on a billion-user table where pandas would OOM. left_anti is the canonical Spark form — clearer than the LEFT-JOIN-WHERE-NULL pattern, and the optimizer handles it as a dedicated join type.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
users_df = spark.read.option("header", True).csv("users.csv")
purchases_df = spark.read.option("header", True).csv("purchases.csv")
never_bought = users_df.join(purchases_df, "user_id", "left_anti")
never_bought.show(5)
print("count:", never_bought.count())
[representative; equivalent to pandas LEFT JOIN with IS NULL]
user_id name country signup_date
0 1 user_1 IN 2024-10-02
1 2 user_2 UK 2024-10-04
5 4 user_4 JP 2024-10-08
12 8 user_8 US 2024-10-15
17 10 user_10 IN 2024-10-19
left_anti is the canonical anti-join in Spark — clearer than the LEFT-JOIN-WHERE-NULL pattern, and the optimizer handles it as a dedicated join type.
Edge cases & gotchas (2)
- NULL join keys on the right side never match
Spark's left_anti uses plain equality, so purchase rows whose user_id is NULL (a bad parse, for example) match nothing — the anti-join then keeps left rows that really did purchase. Confirm by counting matches before relying on it (see the left_semi sketch after these gotchas).
- Header inference
option('header', True) assumes headers; without inferSchema=True all columns are strings, so user_id == '1' vs 1 silently fails the join.
Fix · Python
users_df = (spark.read
            .option("header", True)
            .option("inferSchema", True)
            .csv("users.csv"))
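A quick sanity check for the NULL-key gotcha above — count actual matches with a left_semi join before trusting the anti-join (sketch, same DataFrames as above):
matched = users_df.join(purchases_df, "user_id", "left_semi").count()
total = users_df.count()
never = never_bought.count()
print(f"{matched}/{total} users have a purchase; anti-join returned {never} (expect {total - matched})")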
PySpark window: top-3 products per category.
[Amazon · Catalog Ranking at Scale] Top-3 products per category in PySpark. Window cost: partitioning by a high-cardinality column shuffles hard — for very wide top-N, consider a two-stage approx-then-exact pattern.
product_id product_name category revenue
0 1 prod_1 Electronics 435.0
1 2 prod_2 Electronics 25.0
2 3 prod_3 Electronics 384.0
3 4 prod_4 Electronics 1054.0
4 5 prod_5 Electronics 485.0
from pyspark.sql import functions as F, Window
products_df = (spark.read.option("header", True).csv("products.csv")
.withColumn("revenue", F.col("revenue").cast("int")))
w = Window.partitionBy("category").orderBy(F.desc("revenue"))
top = (products_df.withColumn("rk", F.dense_rank().over(w))
.filter(F.col("rk") <= 3))
top.show()
[representative output, computed via pandas equivalent]
product_id product_name category revenue rk
18 19 prod_19 Books 2887.0 1
10 11 prod_11 Books 2420.0 2
14 15 prod_15 Books 2021.0 3
27 28 prod_28 Clothing 9770.0 1
29 30 prod_30 Clothing 7190.0 2
28 29 prod_29 Clothing 2972.0 3
3 4 prod_4 Electronics 1054.0 1
50 1 prod_1 Electronics 999.0 2
Window cost: partitioning by a high-cardinality column makes Spark shuffle hard. For very wide top-N, consider a two-stage approx-then-exact pattern.
Edge cases & gotchas (2)
- Window without partitionBy is a single-partition shuffle
Window.orderBy(...) alone (no partitionBy) collapses all data onto one task. For top-N per group, partition by the group key; for a global top-N, accept the shuffle but consider a limit first.
- dense_rank vs row_number
If two products tie at #2, dense_rank yields 1,2,2,3, so rk <= 3 can return more than three rows per category; row_number returns exactly three but needs an explicit tiebreaker column to be deterministic. Same SQL tradeoff: pick by intent.
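If the requirement is exactly three rows per category, a row_number variant with an explicit tiebreaker is the deterministic form — a sketch reusing products_df from above:
w_rn = Window.partitionBy("category").orderBy(F.desc("revenue"), F.asc("product_id"))
top3_exact = (products_df.withColumn("rn", F.row_number().over(w_rn))
              .filter(F.col("rn") <= 3))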
Broadcast join in PySpark — small dimension on the right side.
[Meta · Production Spark] A facts × tiny-dim join shouldn't shuffle the giant facts. Force a broadcast hint when the small side is borderline above the auto threshold (spark.sql.autoBroadcastJoinThreshold, default 10 MB).
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
from pyspark.sql.functions import broadcast
orders_df = spark.read.option("header", True).csv("orders.csv")
users_df = spark.read.option("header", True).csv("users.csv")
joined = orders_df.join(broadcast(users_df), "user_id", "left")
joined.explain() # confirm BroadcastHashJoin in the plan
== Physical Plan ==
*(2) Project [order_id, user_id, order_date, amount, name, country, signup_date]
+- *(2) BroadcastHashJoin [user_id], [user_id], LeftOuter, BuildRight
:- FileScan csv ... [orders.csv]
+- BroadcastExchange HashedRelationBroadcastMode(...)
+- FileScan csv ... [users.csv]
Threshold: Spark auto-broadcasts under spark.sql.autoBroadcastJoinThreshold (default 10 MB). For larger-but-still-small dims, the explicit hint forces the optimizer.
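The threshold itself is just a session config — a sketch of tuning it instead of (or alongside) the explicit hint:
# Raise the auto-broadcast cutoff to 50 MB (value is in bytes)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
# Or set it to -1 to disable auto-broadcast entirely and rely only on explicit broadcast() hints
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")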
Edge cases & gotchas (2)
- Broadcast threshold
Spark auto-broadcasts under spark.sql.autoBroadcastJoinThreshold (default 10 MB). Above that, force with the broadcast() hint. Above ~200 MB, broadcast becomes a memory liability — use bucketing instead.
- Skew in the BIG side, not the small one
Broadcast helps when the small side is the dim. If the FACTS are skewed on the join key, broadcast still works but the skewed task in the fact is the bottleneck. Salt the fact instead.
PySpark funnel — countDistinct per step.
[Meta · Funnel at Scale] Same signup → activation → purchase funnel as Q34, but on the billion-row events table. Exact distinct counts at that scale are expensive — swap countDistinct for approx_count_distinct with a tunable rsd; 5% error is fine for a funnel chart.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
events_df = spark.read.option("header", True).csv("events.csv")
firsts = (events_df.groupBy("user_id", "event_type")
.agg(F.min("event_date").alias("first_at"))
.groupBy("user_id")
.pivot("event_type", ["signup","activation","purchase"])
.agg(F.first("first_at")))
funnel = firsts.agg(
F.countDistinct(F.when(F.col("signup").isNotNull(),
F.col("user_id"))).alias("signups"),
F.countDistinct(F.when((F.col("activation").isNotNull())
& (F.col("activation") >= F.col("signup")),
F.col("user_id"))).alias("activations"),
F.countDistinct(F.when((F.col("purchase").isNotNull())
& (F.col("purchase") >= F.col("activation")),
F.col("user_id"))).alias("purchasers"),
)
funnel.show()
[representative; counts computed via pandas]
signups = 498
activations = 282
purchasers = 127
Exact distinct counts at huge scale get expensive; consider approx_count_distinct with a tunable rsd — usually 5% error is fine for a funnel chart.
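Applying that swap inside the funnel aggregation is mechanical — a sketch of the same three counters with rsd=0.05 (≈5% relative error):
funnel_approx = firsts.agg(
    F.approx_count_distinct(
        F.when(F.col("signup").isNotNull(), F.col("user_id")), rsd=0.05).alias("signups"),
    F.approx_count_distinct(
        F.when((F.col("activation").isNotNull())
               & (F.col("activation") >= F.col("signup")), F.col("user_id")),
        rsd=0.05).alias("activations"),
    F.approx_count_distinct(
        F.when((F.col("purchase").isNotNull())
               & (F.col("purchase") >= F.col("activation")), F.col("user_id")),
        rsd=0.05).alias("purchasers"),
)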
Edge cases & gotchas (2)
- approx_count_distinct rsd tradeoff
approx_count_distinct(col, rsd=0.05) means 5% relative standard deviation. For a funnel chart 5% is fine; for billing it isn't. Pick deliberately.
Fix · Python
from pyspark.sql.functions import approx_count_distinct
funnel = events_df.agg(approx_count_distinct("user_id", rsd=0.05)
                       .alias("approx_unique"))
- Pivot with unknown values
.pivot('event_type') without an explicit list scans all values (an extra job). Pass the list ['signup','activation','purchase'] to skip the scan.
Custom hash partitioning to spread skew.
[Amazon · Skew at Petabyte Scale] Custom hash partitioning with a salted hot key. Tradeoff: salting helps the shuffle, but downstream aggregation must un-salt (strip the suffix) before the final group-by. Don't forget the unsalt step.
id join_key value
0 1 key_243 189
1 2 key_225 81
2 3 hot_key 108
3 4 key_91 321
4 5 hot_key 579
from pyspark.sql import functions as F
skew_df = spark.read.option("header", True).csv("skewed_table.csv")
hot, N = "hot_key", 16
salted = (skew_df
.withColumn("salt", F.when(F.col("join_key") == hot,
(F.rand() * N).cast("int"))
.otherwise(F.lit(0)))
.withColumn("salted_key", F.concat_ws("__", "join_key", "salt"))
.repartition(64, "salted_key"))
salted.groupBy("salted_key").count().orderBy(F.desc("count")).show(5)
+--------------+-----+
| salted_key|count|
+--------------+-----+
|hot_key__0 | 45|
|hot_key__7 | 44|
|hot_key__15 | 44|
|hot_key__3 | 43|
|hot_key__9 | 42|
+--------------+-----+
[hot_key spread across 16 buckets vs ~700 rows on one]
Tradeoff: salting helps shuffle, but downstream needs to un-salt by stripping the suffix before final aggregation. Don't forget the unsalt step.
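A sketch of that un-salt step — strip the suffix (or reuse the original key column) before the final group-by:
unsalted = (salted
            .withColumn("orig_key", F.split(F.col("salted_key"), "__").getItem(0))
            .groupBy("orig_key")
            .agg(F.sum(F.col("value").cast("double")).alias("total_value")))
unsalted.show(5)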
Edge cases & gotchas (2)
- Salt count vs cluster parallelism
N=16 salts means at most 16 tasks for the hot key. If your cluster has 64 cores, 16 salts under-utilize. Tune N to the available parallelism.
- Salt the SMALL side too
After salting the fact, the dim must be replicated across the salt buckets — either by cross-joining with a salt-numbers table or by emitting N copies, one per salt.
Fix · Python
from pyspark.sql.functions import explode, array, lit
# Replicate the dimension once per salt value so salted fact keys still find a match
dim_replicated = dim.withColumn("salt", explode(array(*[lit(i) for i in range(N)])))
Build a tiny DAG executor in pure Python (topological sort).
[Netflix · Mini Orchestrator] A 5-line task DAG, no Airflow. Knowing toposort + cycle detection by heart proves you have built a scheduler, not just used one — Netflix's Maestro and Meta's Dataswarm both started this way.
(see linked tables)
from collections import defaultdict, deque
def topo_run(graph, run_fn):
indeg = defaultdict(int)
rev = defaultdict(list)
for node, deps in graph.items():
        indeg[node]  # touch the key so zero-dependency roots register with indegree 0
for d in deps:
indeg[node] += 1
rev[d].append(node)
q = deque([n for n, k in indeg.items() if k == 0])
order = []
while q:
n = q.popleft()
run_fn(n)
order.append(n)
for nxt in rev[n]:
indeg[nxt] -= 1
if indeg[nxt] == 0:
q.append(nxt)
if len(order) != len(graph):
raise ValueError("cycle in DAG")
return order
dag = {"extract":[], "stage":["extract"], "transform":["stage"],
"publish":["transform"], "audit":["transform"]}
ran = topo_run(dag, lambda n: print("running:", n))
print("order:", ran)
running: extract
running: stage
running: transform
running: publish
running: audit
order: ['extract', 'stage', 'transform', 'publish', 'audit']
Why this question shows up: orchestration is half the job. Knowing toposort + cycle detection by heart proves you have built (not just used) a scheduler.
Edge cases & gotchas (2)
- Cycle detection
If the DAG has a cycle, topo_run processes fewer nodes than the graph contains; the final length check raises ValueError and catches it. Without that check, the executor silently runs a partial DAG.
- Failure handling
The toy executor doesn't handle task failure. Real schedulers retry-with-backoff, mark downstream as upstream_failed, and expose state to a dashboard.
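A hedged sketch of what minimal failure handling could look like here — a retry-with-backoff wrapper around run_fn (the helper name is hypothetical, not part of the original executor):
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    def runner(node):
        for i in range(attempts):
            try:
                return fn(node)
            except Exception:
                if i == attempts - 1:
                    raise                        # exhausted: let the caller mark downstream as failed
                time.sleep(base_delay * 2 ** i)  # exponential backoff between attempts
    return runner

ran = topo_run(dag, with_retries(lambda n: print("running:", n)))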
Implement an LRU cache (no functools).
[Generic FAANG · Caching] A classic mid-level Python question: implement an LRU cache without functools.lru_cache. OrderedDict gives O(1) get/put; the doubly-linked-list-from-scratch version impresses the senior interviewer.
(see linked tables)
from collections import OrderedDict
class LRUCache:
def __init__(self, capacity):
self.cap = capacity
self.d = OrderedDict()
def get(self, key):
if key not in self.d:
return None
self.d.move_to_end(key)
return self.d[key]
def put(self, key, value):
if key in self.d:
self.d.move_to_end(key)
self.d[key] = value
if len(self.d) > self.cap:
self.d.popitem(last=False)
c = LRUCache(2)
c.put("a", 1); c.put("b", 2); c.get("a"); c.put("c", 3)
print(list(c.d.items()))
[('a', 1), ('c', 3)]
O(1) get/put. If you write the doubly-linked-list version from scratch you will impress, but the OrderedDict version is the real-world answer.
Edge cases & gotchas (2)
- Thread safety
OrderedDict isn't thread-safe. For multi-threaded callers wrap with a Lock, or use functools.lru_cache (which is thread-safe).
Fix · Python
from threading import Lock
class LRUCache:
    def __init__(self, cap):
        self.cap = cap; self.d = OrderedDict(); self.lock = Lock()
    def get(self, k):
        with self.lock:
            if k not in self.d: return None
            self.d.move_to_end(k); return self.d[k]
- Eviction policy edge
Capacity 0 should be valid (no caching). Test that put on a 0-capacity cache is a no-op rather than an exception.
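A quick test for that eviction edge (sketch, using the LRUCache defined above):
zero = LRUCache(0)
zero.put("x", 1)               # should silently evict, not raise
assert zero.get("x") is None
assert len(zero.d) == 0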
Streaming median with two heaps.
[Meta · Streaming Aggregations] Compute a running median in O(log n) per insert. Two-heap pattern shows up in interviews and in real streaming aggregations. Memorize the invariant: len(lo) ∈ {len(hi), len(hi)+1}.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
import heapq
class StreamingMedian:
def __init__(self):
self.lo = [] # max-heap (invert sign)
self.hi = [] # min-heap
def add(self, x):
heapq.heappush(self.lo, -x)
heapq.heappush(self.hi, -heapq.heappop(self.lo))
if len(self.hi) > len(self.lo):
heapq.heappush(self.lo, -heapq.heappop(self.hi))
def median(self):
if len(self.lo) > len(self.hi):
return -self.lo[0]
return (-self.lo[0] + self.hi[0]) / 2
m = StreamingMedian()
for v in orders["amount"].tolist():
m.add(v)
print("running median (final):", round(m.median(), 2))
print("vs orders['amount'].median() :", round(orders["amount"].median(), 2))
running median (final): 51.52
vs orders['amount'].median() : 51.52
Two-heap pattern shows up in interviews and in real streaming aggregations. Worth memorizing the invariant: len(lo) ∈ {len(hi), len(hi)+1}.
Edge cases & gotchas (2)
- Removing values mid-stream
The two-heap pattern only supports add. Real streaming median with deletes needs a counter for tombstones and lazy removal — non-trivial.
- Numeric stability
For very large N with floats, the median itself is fine, but downstream variance computation can drift. Use Welford's online algorithm for variance.
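A minimal Welford sketch for the variance point above (assumption: you only need a running mean/variance, no deletes):
class RunningStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # uses the updated mean — the numerically stable form
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0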
Backfill orchestrator: regenerate daily_metrics for a missing date range.
[Amazon · Backfill Discipline] A bug in the upstream job lost 5 days of metrics. Reprocess only those days, idempotently. Partition-per-day + overwrite is the lakehouse pattern — re-running for the same day produces the same files.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
from pathlib import Path
def compute_day(ds):
o = orders[orders["order_date"] == ds]
return {"ds": str(ds.date()),
"daily_revenue": float(o["amount"].sum()),
"daily_dau": int(o["user_id"].nunique())}
def backfill(start, end, out_dir):
Path(out_dir).mkdir(exist_ok=True)
written = []
for ds in pd.date_range(start, end, freq="D"):
rec = compute_day(ds)
path = Path(out_dir) / f"ds={rec['ds']}/data.parquet"
path.parent.mkdir(parents=True, exist_ok=True)
pd.DataFrame([rec]).to_parquet(path, index=False)
written.append(rec)
return written
written = backfill("2025-01-15", "2025-01-19", "/tmp/dm_partitions")
for r in written: print(r)
{'ds': '2025-01-15', 'daily_revenue': 824.22, 'daily_dau': 8}
{'ds': '2025-01-16', 'daily_revenue': 1033.55, 'daily_dau': 8}
{'ds': '2025-01-17', 'daily_revenue': 568.84, 'daily_dau': 8}
{'ds': '2025-01-18', 'daily_revenue': 499.1, 'daily_dau': 8}
{'ds': '2025-01-19', 'daily_revenue': 455.85, 'daily_dau': 8}
Partition-per-day + overwrite is the lakehouse pattern. Re-running for the same day produces the same files — that is idempotency.
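A cheap way to prove that idempotency claim in a test (sketch; assumes the orders frame is static during the run):
first = backfill("2025-01-15", "2025-01-15", "/tmp/dm_partitions")
second = backfill("2025-01-15", "2025-01-15", "/tmp/dm_partitions")
assert first == second, "re-running the same partition changed the output"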
Edge cases & gotchas (2)
- Idempotent re-run
Running backfill for 2025-01-15 twice must produce the same files. Overwriting the partition (Path.rename or Delta MERGE) achieves this; appending does not.
- Late-arriving rows mid-backfill
If new rows for 2025-01-15 land while you backfill, you may miss them. Snapshot the source (e.g., a fixed read offset) before computing.
Fix · Python
# Snapshot orders to a versioned read
orders_snap = orders[orders["order_date"] <= ds]  # bound by your run timestamp
Idempotent ETL: write a daily partition with overwrite-by-date semantics.
[Apple · Daily Fact Publication] The publish-a-daily-fact-table pattern every senior DE has written 50 times. Production reality: on object stores this becomes Delta / Iceberg replaceWhere or BigQuery's WRITE_TRUNCATE with a partition decorator. The shape of the answer is what matters.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
from pathlib import Path
import shutil
def publish_daily_fact(df, ds, root):
assert (df["ds"].astype(str) == ds).all(), "frame contains other dates"
target = Path(root) / f"ds={ds}"
tmp = Path(root) / f".staging_{ds}"
if tmp.exists(): shutil.rmtree(tmp)
tmp.mkdir(parents=True)
df.to_parquet(tmp / "data.parquet", index=False)
if target.exists(): shutil.rmtree(target)
tmp.rename(target)
return target
ds_str = "2025-01-20"
dm_today = dm[dm["ds"].astype(str).str[:10] == ds_str].copy()
dm_today["ds"] = dm_today["ds"].astype(str).str[:10]
out = publish_daily_fact(dm_today, ds_str, "/tmp/dm_curated")
print("published:", out)
print(list(out.iterdir()))
published: /tmp/dm_curated/ds=2025-01-20
[PosixPath('/tmp/dm_curated/ds=2025-01-20/data.parquet')]
Production reality: on object stores this becomes Delta / Iceberg replaceWhere or BigQuery's WRITE_TRUNCATE with a partition decorator. The shape of the answer is what matters.
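For reference, the same overwrite-by-partition expressed against Delta Lake — a sketch that assumes a Spark session with delta-spark available:
(spark.createDataFrame(dm_today)
 .write.format("delta")
 .mode("overwrite")
 .option("replaceWhere", f"ds = '{ds_str}'")   # rewrite only this partition's rows
 .save("/tmp/dm_curated_delta"))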
Edge cases & gotchas (2)
- Atomic rename across filesystems
Path.rename is atomic only within the same filesystem. On S3, rename is a copy + delete, not atomic. Use Delta/Iceberg replaceWhere or write to a manifest file.
- Frame contains other dates
The assert guards against passing a multi-date frame. Without it, you'd silently overwrite the partition with rows from other days. Always validate the partition key before publish.
Bonus · Privacy & the Pre-/Post-ATT Era
April 2021 was the dividing line. Before iOS App Tracking Transparency (ATT), deterministic user tracking via IDFA was the default; ad-tech and growth pipelines were built around 1:1 user-event joins. After ATT, ~75% of users decline tracking, and SKAdNetwork-style aggregated postbacks become the default signal.
This shift broke half the SQL and Python pipelines on every Marketing Science / Ads / Growth team at Meta, Google, Amazon, and Apple. Interviewers at those companies love to probe whether candidates have lived through it. Below, 5 questions that reframe the existing dataset around the ATT boundary.
Pre- vs post-ATT conversion rate (consent gating).
[Meta · Marketing Science] A growth lead wants the apples-to-apples conversion rate of users who signed up before Apple's ATT prompt rolled out vs after. Post-ATT, only ~25% of users opt in to deterministic tracking — the rest fall back to SKAdNetwork-style aggregated signals.
user_id name email signup_date country att_era att_consent
0 1 user_1 user1@example.com 2024-10-02 IN pre 1
1 2 user_2 user2@example.com 2024-10-04 UK pre 1
2 3 user_3 user3@example.com 2024-10-05 JP pre 1
3 4 user_4 user4@example.com 2024-10-08 JP pre 1
4 5 user_5 user5@example.com 2024-10-09 US pre 1
SELECT u.att_era,
COUNT(DISTINCT u.user_id) AS users,
COUNT(DISTINCT p.user_id) AS purchasers,
ROUND(100.0 * COUNT(DISTINCT p.user_id) / NULLIF(COUNT(DISTINCT u.user_id), 0), 2) AS conv_pct
FROM users_att u
LEFT JOIN purchases p ON p.user_id = u.user_id
GROUP BY u.att_era
ORDER BY u.att_era;
att_era users purchasers conv_pct
0 post 184 116 63.04
1 pre 16 9 56.25
What an interviewer probes: definition discipline. ‘Conversion’ for the post-ATT cohort is fundamentally noisier — purchases that fired but were never reported back to your attribution provider don't show up here. State the bias explicitly before drawing conclusions.
Deterministic-only retention: opt-in users vs full cohort.
[Meta · Attribution] Post-ATT, only the consented users have deterministic event-level data. Compute retention W0/W1/W2 for the consented subset and compare against the full cohort — the gap is the ‘ATT measurement tax’ that survival analysis has to model around.
user_id name email signup_date country att_era att_consent
0 1 user_1 user1@example.com 2024-10-02 IN pre 1
1 2 user_2 user2@example.com 2024-10-04 UK pre 1
2 3 user_3 user3@example.com 2024-10-05 JP pre 1
3 4 user_4 user4@example.com 2024-10-08 JP pre 1
4 5 user_5 user5@example.com 2024-10-09 US pre 1
WITH cohort AS (
SELECT user_id, att_era, att_consent,
DATE_TRUNC('week', signup_date) AS cohort_week
FROM users_att
),
joined AS (
SELECT c.cohort_week, c.user_id, c.att_consent,
CAST(FLOOR(EXTRACT(EPOCH FROM (l.login_date - c.cohort_week)) / (7*86400)) AS INT) AS w
FROM cohort c
JOIN logins l ON l.user_id = c.user_id
WHERE l.login_date >= c.cohort_week
)
SELECT att_consent,
COUNT(DISTINCT user_id) FILTER (WHERE w = 0) AS w0_users,
COUNT(DISTINCT user_id) FILTER (WHERE w = 1) AS w1_users,
COUNT(DISTINCT user_id) FILTER (WHERE w = 2) AS w2_users,
ROUND(100.0 * COUNT(DISTINCT user_id) FILTER (WHERE w = 1)
/ NULLIF(COUNT(DISTINCT user_id) FILTER (WHERE w = 0), 0), 1) AS w1_pct,
ROUND(100.0 * COUNT(DISTINCT user_id) FILTER (WHERE w = 2)
/ NULLIF(COUNT(DISTINCT user_id) FILTER (WHERE w = 0), 0), 1) AS w2_pct
FROM joined
GROUP BY att_consent
ORDER BY att_consent DESC;
att_consent w0_users w1_users w2_users w1_pct w2_pct
0 1 1 3 3 300.0 300.0
1 0 15 14 14 93.3 93.3
The senior signal: consented-only retention is biased — the people who opt in are systematically different (engaged power users disproportionately accept). Calling out the selection bias before the interviewer asks is the difference between Senior and Staff.
SKAdNetwork-style aggregated postback: bucket conversions, no user IDs.
[Apple · App Store / Meta · CAPI] SKAdNetwork postbacks arrive 24-48h after the conversion, with no user_id — only an aggregated ‘conversion value’ bucket per (campaign, source app, day). Build the equivalent: bucket purchases into low/med/high CV per user-cohort-day, then aggregate to (campaign, day) without ever exposing user_id.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
import pandas as pd
# Conversion-value mapping (CVM): bucket amount into 4 levels (Apple gives 6 bits = 64)
def cv_bucket(amt):
if amt < 50: return "cv0_low"
if amt < 150: return "cv1_med"
if amt < 300: return "cv2_high"
return "cv3_premium"
p = purchases.copy()
p["cv_bucket"] = p["amount"].apply(cv_bucket)
p["postback_dt"] = p["purchase_date"] + pd.Timedelta(hours=36)
# Synthesize a campaign id (would come from ad_attributed_install in real life)
p["campaign_id"] = (p["user_id"] % 5).astype(str).radd("camp_")
# Aggregated postback: NO user_id leaves this aggregation
postback = (p.groupby(["campaign_id", p["postback_dt"].dt.date, "cv_bucket"])
.size().rename("conversions").reset_index())
postback.columns = ["campaign_id", "postback_date", "cv_bucket", "conversions"]
# Apply a k-anonymity threshold: drop buckets below a minimum count (3 here for the sample; real SKAN tiers are stricter)
postback = postback[postback["conversions"] >= 3]
print(postback.head(10))
Empty DataFrame
Columns: [campaign_id, postback_date, cv_bucket, conversions]
Index: []
Real-world callout: Apple's SKAdNetwork enforces a minimum threshold (originally called the ‘crowd anonymity tier’) to prevent re-identification. If your aggregation has rows with conversions=1, you've broken privacy. Always apply the k-anonymity filter inside the aggregation function, not as an afterthought downstream.
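One way to honor that — a sketch where the threshold lives inside the aggregation boundary, so sub-threshold rows never leave the function (k is a parameter you would set to the real crowd-anonymity tier):
def aggregate_postbacks(df, k=10):
    agg = (df.groupby(["campaign_id", df["postback_dt"].dt.date, "cv_bucket"])
             .size().rename("conversions").reset_index())
    agg.columns = ["campaign_id", "postback_date", "cv_bucket", "conversions"]
    return agg[agg["conversions"] >= k]   # rows below k are suppressed before anything is returned
postback = aggregate_postbacks(p, k=3)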
Deterministic vs probabilistic attribution: which campaign drove the conversion?
[Meta · Marketing Science / Apple · AEM] Pre-ATT: every conversion has a deterministic last-touch campaign. Post-ATT: SKAN gives one aggregated postback per install per campaign, with no user-level link. Build both attribution views side-by-side from the same purchase data, and surface the divergence — that gap is what marketing teams now have to model.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
import pandas as pd
# Synthetic deterministic last-touch: every signup-event mapped to a campaign
ev = events.copy()
ev["campaign_id"] = (ev["user_id"] % 5).astype(str).radd("camp_")
last_touch = (ev.sort_values(["user_id", "event_date"])
.drop_duplicates("user_id", keep="last")
[["user_id", "campaign_id", "event_date"]])
p = purchases.copy()
det = (p.merge(last_touch, on="user_id", how="left")
.groupby("campaign_id", as_index=False)
.agg(det_conversions=("purchase_id", "count"),
det_revenue=("amount", "sum")))
# Probabilistic: assume SKAN reports only 25% of post-ATT users w/ campaign
import numpy as np
np.random.seed(7)
p["reported"] = np.random.rand(len(p)) < 0.25
prob = (p[p["reported"]]
.merge(last_touch, on="user_id", how="left")
.groupby("campaign_id", as_index=False)
.agg(prob_conversions=("purchase_id", "count"),
prob_revenue=("amount", "sum")))
# Project the under-reported probabilistic view back up to full
prob["prob_conversions_scaled"] = prob["prob_conversions"] / 0.25
prob["prob_revenue_scaled"] = prob["prob_revenue"] / 0.25
cmp = det.merge(prob, on="campaign_id", how="outer")
cmp["det_vs_prob_pct_gap"] = (
100 * (cmp["prob_revenue_scaled"] - cmp["det_revenue"])
/ cmp["det_revenue"].where(cmp["det_revenue"] > 0)
).round(1)
print(cmp.round(2))
campaign_id det_conversions ... prob_revenue_scaled det_vs_prob_pct_gap
0 camp_0 43 ... 4708.56 2.7
1 camp_1 47 ... 2577.72 -20.9
2 camp_2 82 ... 8435.84 33.8
3 camp_3 70 ... 9506.56 12.4
4 camp_4 57 ... 3351.84 -96.8
[5 rows x 8 columns]
Why this question is asked: any DE on a Marketing Mix / MMM team gets this within the first month. The shape of the answer interviewers want: (a) build both attribution lenses, (b) compute the gap, (c) name the bias (selection on opt-in), (d) propose a calibration approach (geo-lift test, holdback, MMM).
Identity stitching: deterministic IDFA-based vs probabilistic post-ATT.
[Meta · Identity / Amazon · Ads] Pre-ATT, you stitch sessions to users via IDFA (deterministic). Post-ATT, you have to fall back on probabilistic graphs (IP + device + behavioral fingerprints). Show both: a deterministic merge, and a probabilistic merge that allows ambiguous matches and reports a confidence score.
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
import pandas as pd
s = sessions[["session_id", "user_id", "platform", "session_date"]].copy()
prof = user_profiles.copy()
# DETERMINISTIC: clean 1:1 join on user_id (pre-ATT, IDFA available)
det = s.merge(prof, on="user_id", how="left")
det_match_rate = det["city"].notna().mean()
# PROBABILISTIC: drop user_id (simulate ATT), match on shared date buckets
# In real life: IP + device fingerprint + behavioral signals. Here a shared date is the (deliberately poor) proxy.
prof_p = prof.copy()
prof_p["match_date"] = prof_p["updated_at"]
s_p = s.copy()
s_p["match_date"] = s_p["session_date"]
prob = s_p.merge(prof_p[["match_date","city","name"]], on="match_date", how="left")
# Each session may now have multiple ambiguous candidates - count fan-out
fanout = prob.groupby("session_id").size()
ambiguous_pct = (fanout > 1).mean()
print(f"Deterministic match rate (pre-ATT): {det_match_rate:.1%}")
print(f"Probabilistic ambiguity (post-ATT): {ambiguous_pct:.1%} sessions have >1 candidate")
print(f"Avg candidates per session: {fanout.mean():.2f}")
Deterministic match rate (pre-ATT): 97.8%
Probabilistic ambiguity (post-ATT): 73.4% sessions have >1 candidate
Avg candidates per session: 4.84
Senior framing: probabilistic stitching is never ‘wrong’ — it's a probability distribution. The interviewer wants to hear you acknowledge the confidence-score model (e.g., logistic over feature-similarity scores) and the downstream impact: every funnel rate becomes a posterior estimate, not a point value.
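A crude starting point for that confidence score — a uniform prior over the ambiguous candidates (a real model would score IP/device/behavioral similarity with something like a logistic regression):
prob["match_confidence"] = 1.0 / prob.groupby("session_id")["session_id"].transform("size")
print(prob["match_confidence"].describe().round(3))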
Part 5 · Closing & Further Reading
Two things separate candidates who pass these rounds from candidates who don't, and neither is ‘knowing more SQL’.
1. You restate the problem before you write code. ‘So you want X grouped by Y for the last Z days, treating nulls as W — is that right?’ That single sentence is the difference between a hire and a no-hire on most loops.
2. You name the tradeoff before the interviewer asks. ‘I am using a window here instead of a correlated subquery because it is a single pass over the table; the cost is that we need the data sortable in memory.’ Volunteered tradeoffs signal seniority.
What to do with this bank
- Run the dataset locally. Postgres, DuckDB, or pandas. Querying real rows beats reading SQL on a screen.
- Hide the answer first. Write your version, then expand the card. The compare step is where insight forms.
- Translate. For each question, write both the SQL and the pandas form. Switching fluently is the senior signal.
- Time-box. Tier 1–2 in 5 minutes; Tier 3 in 10; Tier 4 in 15; Tier 5 in 20. If you stall, peek, learn, redo from scratch tomorrow.
- Mock with a friend. Speaking out loud — narrating clarifying questions, naming tradeoffs — is the only way to build that muscle.
Companion articles on PaddySpeaks
- Data Engineering Interview Prep — Senior / L5 Deep Dive — modeling, batch, streaming, Spark internals, lakehouse, 40+ scenarios.
- Ads Data Engineering Interview Prep — auctions, attribution, identity & consent, ads SQL.
- The Bhagavad Gita of Data Engineering — the philosophical companion piece on craft, dharma, and the discipline.
100 questions. One dataset. Real outputs. The rest is reps.