Part 0 · What Data Engineering Actually Is
95% of data engineering interview loops gate on SQL and Python before you ever see a system-design board.
Most candidates still prepare for these rounds the way they would prep for LeetCode — random questions, no structure, no theory of what the interviewer is actually scoring. That is why so many strong candidates wash out at the screen.
This article is the prep I wish I had had: 100 questions on one realistic dataset, ordered simple → complex, framed as the prompts a Meta / Amazon / Apple / Netflix / Google interviewer would actually give you. Every question shows the source data, the code, the actual computed output, and the failure modes a senior interviewer probes after your first answer lands.
If you only have one weekend, jump to the Top-20 must-do questions. If you have two weeks, do them all. If you are interviewing for an Ads / Marketing Science role, the Pre-/Post-iOS-ATT bonus section at the bottom is required reading.
Most people meet a data engineer through their absence. The dashboard is stale. The number doesn't match the other number. The ML model trained on yesterday's distribution and now nobody trusts the score. Somewhere upstream, a pipeline shrugged.
Data engineering is the discipline of moving data from where it is born — clicks, payments, sensors, logs, third-party APIs — to where decisions get made, and doing it with enough rigor that the people downstream (analysts, scientists, ML systems, executives, regulators) can build on top without flinching.
A data engineer designs, builds, and operates the systems that turn raw events into reliable, queryable, governed datasets — at the latency, cost, and quality the business has agreed to.
The five things you actually do
- Ingest. Pull data from operational systems (Postgres, Kafka, S3 drops, SaaS APIs) without losing rows, without double-counting, and without melting the source.
- Model. Decide what a ‘user’, an ‘order’, a ‘session’, a ‘subscription’ means — and shape the data so a hundred analysts get the same answer.
- Transform. Write SQL/Python/Spark that turns raw events into clean facts and dimensions, idempotently, on a schedule, and re-runnable.
- Serve. Make those datasets available — warehouse tables, lakehouse files, low-latency caches, feature stores, BI semantic layers.
- Operate. Monitor freshness, completeness, schema drift, cost, lineage; respond when something breaks; explain why a number changed.
This article focuses sharply on the two skills every loop tests: SQL and Python. If you cannot ace the SQL screen and the coding round, the design round never happens.
Part 1 · How to Prepare for SQL & Python Rounds
Most candidates over-study the wrong things — memorizing every window function, every pandas method — and under-study the things interviewers actually weight: clarifying the question, picking the right primitive, narrating tradeoffs, and writing code that runs.
1. Clarity — Did you restate the problem? Did you ask about edge cases (nulls, ties, late data, time zones)?
2. Correctness — Does the query / function actually return the right answer on the obvious test cases?
3. Reasoning — Can you talk through why you picked this join / window / pattern? Can you compare it to one alternative?
4. Production sense — Would this scale? Is it idempotent? What happens at 100x the volume? What happens if the source schema changes?
How to use this question bank
Each of the 100 questions below is a closed card by default — click the title to expand. Inside you get five pieces:
- Scenario — what a real PM, analyst, or ops lead would actually ask.
- Source — a sample of the input table(s) so you can see what the data looks like.
- Code — the SQL or Python answer, deliberately written for readability over cleverness.
- Output — the actual result of running that code on the dataset (computed at build time, not hand-typed).
- Why it matters — the production angle, edge case, or interviewer follow-up.
Read the scenario, write your own answer, then compare. The compare-after-attempt step is where the learning lives.
Part 2 · The Practice Dataset
Every question below uses the same realistic e-commerce dataset — 15 tables totaling roughly 8,500 rows — small enough to run on your laptop, structured enough to surface real interview patterns. The CSVs and a Postgres / MySQL setup script live in interview/sample dataset/.
| Table | Rows | Columns | Used for |
|---|---|---|---|
| users | 200 | user_id, name, email, signup_date, country | Anti-joins, cohorts, country roll-ups |
| purchases | 300 | purchase_id, user_id, amount, purchase_date, product_category | Revenue, category roll-ups |
| user_profiles | 334 | user_id, name, city, updated_at | Dedup (intentional duplicates) |
| weekly_metrics | 52 | week_start, wau | WoW growth with LAG |
| sessions | 800 | session_id, user_id, platform, session_date, duration_sec | CASE pivots, platform splits |
| products | 50 | product_id, product_name, category, revenue | Top-N per group |
| clickstream | 1,971 | user_id, event_time, page | Sessionization (30-min idle) |
| events | 1,014 | event_id, user_id, event_type, event_date | Funnel: signup → activation → purchase |
| logins | 1,925 | user_id, login_date | Consecutive streaks (gaps & islands) |
| daily_metrics | 120 | ds, daily_revenue, daily_dau | Running totals, moving averages |
| daily_metrics_gapped | 105 | ds, daily_revenue, daily_dau | Date spine — 15 missing dates to fill |
| employees | 28 | employee_id, full_name, manager_id, department, salary | Self-joins, recursive org traversal |
| orders | 500 | order_id, user_id, order_date, amount | Cumulative distribution, correlated subquery |
| product_dim | 68 | product_id, product_name, price, effective_date, expiry_date, is_current | SCD Type 2 |
| skewed_table | 1,000 | id, join_key, value | Skew detection (~70% rows on hot_key) |
The five tiers
- Tier 1 · Single table, SELECT/WHERE/ORDER/LIMIT, basic aggregates.
- Tier 2 · Multi-table joins, GROUP BY, anti-joins, semi-joins.
- Tier 3 · Window functions, CASE, subqueries, dedup.
- Tier 4 · CTEs, sessionization, funnels, cohorts, date spines.
- Tier 5 · Recursive CTEs, SCD2, skew, Spark, idempotency.
Top 20 · If you only study 20 questions, make it these
These twenty are the highest signal-to-noise ratio in the bank. They cover ~80% of the patterns that show up in MAANG SQL screens and Python coding rounds, in roughly the order you should master them. Skim the rest at your peril; do these and the interviewer will run out of pattern types before they run out of clock.
Q11 · Anti-join (users with no purchases)
Why: tests NULL discipline. The trap is NOT IN; the right answer is LEFT JOIN ... IS NULL or NOT EXISTS. Almost every loop probes this pattern.
Q22 · Dedup with deterministic tiebreak
Why: tests idempotency thinking. Tied updated_at without a secondary sort gives non-reproducible output across backfills — the bug that surfaces only in production.
Q23 · LAG for WoW growth
Why: the most-asked window function in any growth-team loop. Bonus signal: knowing to NULLIF the divisor for the first row.
Q25 · Top-N per group
Why: tests scaling logic. A correlated subquery works on small data; DENSE_RANK() OVER (PARTITION BY ...) is the only answer that scales. 100× cost difference at 1B rows.
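For reference, a minimal sketch of the window form on the practice products table (top 3 per category by revenue; a sketch, not the worked Q25 solution):
WITH ranked AS (
    SELECT product_id, product_name, category, revenue,
           -- one rank sequence per category; ties share a rank, nothing is skipped
           DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rk
    FROM products
)
SELECT product_id, product_name, category, revenue
FROM ranked
WHERE rk <= 3
ORDER BY category, rk;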
Q34 · Funnel with monotonicity guards
Why: the senior catch. Most candidates compute counts per stage and call it a funnel. The interviewer wants the activation_at >= signup_at guard — without it your conversion rate exceeds 100% in some weeks.
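A per-user sketch of the guard on the events table (FILTER is Postgres / DuckDB; the full Q34 answer also needs the weekly grouping the follow-up implies):
WITH stages AS (
    SELECT user_id,
           MIN(CASE WHEN event_type = 'signup'     THEN event_date END) AS signup_at,
           MIN(CASE WHEN event_type = 'activation' THEN event_date END) AS activation_at,
           MIN(CASE WHEN event_type = 'purchase'   THEN event_date END) AS purchase_at
    FROM events
    GROUP BY user_id
)
SELECT COUNT(*) FILTER (WHERE signup_at IS NOT NULL)      AS signed_up,
       -- the monotonicity guard: only count activations that happened at or after signup
       COUNT(*) FILTER (WHERE activation_at >= signup_at) AS activated,
       COUNT(*) FILTER (WHERE activation_at >= signup_at
                          AND purchase_at >= activation_at) AS purchased
FROM stages;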
Q36 · Sessionization (30-min idle)
Why: the cumulative-sum trick on (time - prev_time) > threshold. This exact pattern shows up in Reels, Stories, IG Live, YouTube watch-time, every video product. Memorize it.
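The trick itself, sketched on the clickstream table (assumes event_time is a timestamp):
WITH gaps AS (
    SELECT user_id, event_time,
           event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS idle
    FROM clickstream
)
SELECT user_id, event_time,
       -- new session whenever the idle gap exceeds 30 minutes (or there is no previous event)
       SUM(CASE WHEN idle IS NULL OR idle > INTERVAL '30 minutes' THEN 1 ELSE 0 END)
           OVER (PARTITION BY user_id ORDER BY event_time) AS session_seq
FROM gaps;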
Q37 · Streaks (gaps & islands)
Why: the canonical pattern. date - row_number() is constant within consecutive runs. If you have to think about this in the interview, you have not internalized it — drill it.
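The same pattern sketched on the logins table (DISTINCT first, because duplicate logins on the same day break the arithmetic):
WITH runs AS (
    SELECT user_id, login_date,
           -- constant within a run of consecutive days
           login_date - CAST(ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS INT) AS grp
    FROM (SELECT DISTINCT user_id, login_date FROM logins) d
)
SELECT user_id, MIN(login_date) AS streak_start, MAX(login_date) AS streak_end, COUNT(*) AS streak_days
FROM runs
GROUP BY user_id, grp
ORDER BY streak_days DESC;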
Q41 · SCD2 as-of join
Why: the senior data-modeling test. The invariant matters: exactly one row per (entity, point-in-time). Half-open [effective, expiry) windows or you get duplicate matches at boundaries.
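The join predicate, sketched against a hypothetical order_items fact (the sample dataset has no fact table carrying product_id, so treat the fact side as illustrative):
SELECT f.order_id, f.order_date, d.price AS price_at_order_time
FROM order_items f                 -- hypothetical fact: order_id, product_id, order_date
JOIN product_dim d
  ON d.product_id = f.product_id
 AND f.order_date >= d.effective_date                           -- half-open window:
 AND (f.order_date < d.expiry_date OR d.expiry_date IS NULL);   -- [effective, expiry)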
Q42 · Recursive CTE for org chart
Why: tests whether you have ever built a hierarchical pipeline. Bonus signal: volunteering the cycle-guard pattern (depth < 20 + visited-set) before the interviewer asks.
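A minimal sketch on the employees table with the depth cap (the visited-set variant additionally accumulates an array of employee_ids and checks membership):
WITH RECURSIVE org AS (
    SELECT employee_id, full_name, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL              -- root: the CEO
    UNION ALL
    SELECT e.employee_id, e.full_name, e.manager_id, o.depth + 1
    FROM employees e
    JOIN org o ON e.manager_id = o.employee_id
    WHERE o.depth < 20                    -- cycle guard: cap traversal depth
)
SELECT employee_id, full_name, depth
FROM org
ORDER BY depth, employee_id;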
Q47 · Reconciliation against raw facts
Why: the L5 signal. A metric is only as trusted as its reconciliation report. Ship the audit alongside the metric — it is a habit MAANG senior DEs all share.
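The shape of such an audit, sketched by tying daily_metrics.daily_revenue back to purchases (in this synthetic dataset the two tables may not tie out; the shape of the audit is the point):
SELECT m.ds,
       m.daily_revenue                              AS reported_revenue,
       COALESCE(SUM(p.amount), 0)                   AS raw_revenue,
       m.daily_revenue - COALESCE(SUM(p.amount), 0) AS diff
FROM daily_metrics m
LEFT JOIN purchases p ON p.purchase_date = m.ds
GROUP BY m.ds, m.daily_revenue
ORDER BY ABS(m.daily_revenue - COALESCE(SUM(p.amount), 0)) DESC
LIMIT 10;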
Q11 · merge with validate=
Why: the cheapest production-bug-prevention pattern in pandas. Use it once after a silent m:m blow-up burns you, and you'll use it forever.
Q13 · Dedup with assert is_unique
Why: the senior habit. Never trust dedup by inspection. The assert is the answer.
Q21 · rolling with min_periods
Why: tests definition discipline. min_periods=1 is convenient but produces 1-day ‘averages’ for the head of the series. min_periods=7 is the honest answer. Pick deliberately.
Q28 · Sessionize in pandas
Why: identical mental model to the SQL form. Practice both side-by-side until you can switch fluently — in either round.
Q30 · Cohort retention table
Why: the shape that ships in every weekly growth review. Rows = cohort, cols = age, cells = % retained. Heatmap straight from this table.
Q32 · merge_asof for SCD2
Why: the SCD2 superpower in pandas. Solo this question and you understand half of dimensional modeling.
Q33 · Salt the hot key (skew mitigation)
Why: the senior Spark signal even in a pandas round. Knowing to salt both sides of the join is what separates “has read about Spark” from “has shipped a Spark job at scale.”
Q46 · DAG executor with toposort
Why: tests whether you have built a scheduler, not just used one. Maestro (Netflix) and Dataswarm (Meta) both started this way.
Q47 · LRU cache from scratch
Why: classic mid-level Python question. OrderedDict gives O(1) get/put; the doubly-linked-list-from-scratch version impresses the senior interviewer.
Q50 · Atomic daily-partition publish
Why: the publish-a-daily-fact-table pattern every senior DE has written 50 times. The shape of the answer matters more than the syntax.
Part 3 · 50 SQL Questions, Simple to Complex
Open a Postgres / DuckDB / SQLite shell, load the CSVs, and try each question on your own before reading the solution. Solutions use ANSI SQL with a Postgres / DuckDB flavor; portability notes call out where Spark / BigQuery / Snowflake differ. Every Output block was produced by actually running the query against the dataset at build time.
Tier 1 · Foundation
How many users do we have per country?
[Meta · IG Growth] An Instagram growth lead is deciding which language UI to localize next. They want a country split of the active user base before scoping the launch.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
SELECT country, COUNT(*) AS user_count
FROM users
GROUP BY country
ORDER BY user_count DESC;
country user_count
0 US 53
1 IN 33
2 BR 22
3 UK 20
4 FR 20
.. ... ...
11 in 2
12 1
13 US 1
14 UK 1
15 <NULL> 1
Why it matters: The bread-and-butter GROUP BY. Interviewers watch whether you alias the count, sort the result, and resist SELECT *.
Edge cases & gotchas (4)
- ‘Missing’ hides in five different shapes
Real-world country columns leak more than just NULL: '\N' (the default CSV null sentinel from COPY ... FORMAT csv), the literal string '<NULL>', empty strings, whitespace-only rows (' '), and 'NA' / 'N/A'. A naive COALESCE(country, '<NULL>') only catches the first one. The fix has to canonicalize all of them — and report which dirty variants existed so the DQ team can ticket the upstream.
Fix · SQL
-- Production-grade: collapse every flavor of "missing" into one bucket,
-- normalize case + whitespace so 'BR' and 'br' aggregate together,
-- and surface raw_variants so the DQ team can ticket the upstream owner.
WITH cleaned AS (
    SELECT
        CASE
            WHEN country IS NULL THEN '<NULL>'
            WHEN TRIM(country) = '' THEN '<EMPTY>'
            WHEN UPPER(TRIM(country)) IN ('\N', 'NULL', '<NULL>', 'NA', 'N/A') THEN '<NULL>'
            ELSE UPPER(TRIM(country))   -- normalize case + trim
        END AS country_clean,
        country AS country_raw
    FROM users
)
SELECT country_clean,
       COUNT(*) AS n,
       COUNT(DISTINCT country_raw) AS raw_variants,
       STRING_AGG(DISTINCT '''' || COALESCE(country_raw, 'NULL') || '''', ', '
                  ORDER BY '''' || COALESCE(country_raw, 'NULL') || '''') AS variants_seen
FROM cleaned
GROUP BY country_clean
ORDER BY n DESC;
-- Expected output collapses dirty variants into clean buckets,
-- with raw_variants flagging which buckets had multiple input forms.
Output
  country_clean   n  raw_variants           variants_seen
0            US  54             2             ' US', 'US'
1            IN  35             2              'IN', 'in'
2            BR  24             2              'BR', 'br'
3            UK  21             2             'UK ', 'UK'
4            FR  20             1                    'FR'
5            JP  18             1                    'JP'
6        <NULL>  17             2  '<NULL>', 'NULL', '\N'
7            DE   7             1                    'DE'
8       <EMPTY>   4             2                ' ', ' '
- Case-only differences (‘BR’ vs ‘br’)
GROUP BY country is case-sensitive on Postgres / DuckDB / Snowflake (with default collation). 'BR' and 'br' are two buckets — almost always a DQ bug, never the intent. Normalize with UPPER(TRIM(country)) at the GROUP BY boundary, but keep the raw value visible for the audit trail.
Fix · SQL
-- See variants per canonical bucket
SELECT UPPER(TRIM(country)) AS country_clean,
       COUNT(*) AS n,
       ARRAY_AGG(DISTINCT country ORDER BY country) AS raw_forms_observed
FROM users
WHERE country IS NOT NULL
GROUP BY UPPER(TRIM(country))
HAVING COUNT(DISTINCT country) > 1   -- only buckets with dirty forms
ORDER BY n DESC;
Output
  country_clean   n  raw_forms_observed
0            US  54           [ US, US]
1            IN  35            [IN, in]
2            BR  24            [BR, br]
3            UK  21           [UK, UK ]
4                 4               [ , ]
- Empty table returns no rows (not zero)
If users is empty, the query returns no rows — not a row with count=0. Wrap in a sentinel CTE if downstream needs a deterministic shape (e.g., a daily metrics job that UNIONs yesterday with today). Self-contained version below.
Fix · SQL
-- Empty-safe roll-up: always returns at least one row
WITH base AS (
    SELECT country, COUNT(*) AS n
    FROM users
    WHERE 1=0   -- simulates an empty source for demo
    GROUP BY country
)
SELECT country, n FROM base
UNION ALL
SELECT '<empty_table>', 0
WHERE NOT EXISTS (SELECT 1 FROM base);
Output
         country  n
0  <empty_table>  0
- COUNT(*) vs COUNT(country) diverge on NULL
COUNT(*) counts every row including those with NULL country. COUNT(country) skips NULLs. COUNT(DISTINCT country) also skips NULLs. The difference between the first two is your NULL count for free. Sentinel collision warning: in the output below, distinct_incl_null_unsafe equals distinct_excl_null — surprising at first, until you notice the dirty data already contains a literal string '<NULL>', so a COALESCE(col, '<NULL>') sentinel collides with a real value and adds nothing. The portable, sentinel-free fix is to count NULL as +1 explicitly. Avoid CHR(0) as a sentinel: Postgres rejects NUL bytes in TEXT/VARCHAR columns with SQL Error [54000]: null character not permitted. DuckDB allows it; Postgres does not.
Fix · SQL
-- Section 1 — counts side-by-side, all portable across
-- Postgres / DuckDB / Snowflake / BigQuery.
SELECT COUNT(*) AS rows_total,
       COUNT(country) AS rows_with_country,
       COUNT(*) - COUNT(country) AS country_null_rows,
       COUNT(DISTINCT country) AS distinct_excl_null,
       -- UNSAFE: this sentinel collides with the literal string '<NULL>' in the data
       COUNT(DISTINCT COALESCE(country, '<NULL>')) AS distinct_incl_null_unsafe,
       -- SAFE: pick a sentinel obviously not in the domain. Long + unlikely.
       COUNT(DISTINCT COALESCE(country, '__NULL_SENTINEL_8f3a__')) AS distinct_incl_null_safe,
       -- BEST: no sentinel at all. Add 1 if any NULLs exist.
       COUNT(DISTINCT country)
         + CASE WHEN COUNT(*) > COUNT(country) THEN 1 ELSE 0 END AS distinct_incl_null_explicit
FROM users;
-- Section 2 — what ARE the distinct values (so you can SEE the mess).
-- Quote each value so empty strings and whitespace become visible.
SELECT COALESCE(country, '__NULL__') AS country_value_or_null_marker,
       '''' || COALESCE(country, '__NULL__') || '''' AS country_quoted,
       COUNT(*) AS n_rows
FROM users
GROUP BY country
ORDER BY n_rows DESC, country_quoted;
Output
   rows_total  rows_with_country  country_null_rows  distinct_excl_null  distinct_incl_null_unsafe  distinct_incl_null_safe  distinct_incl_null_explicit
0         200                193                  7                  15                         15                       16                           16
──── result 2 ────
   country_value_or_null_marker  country_quoted  n_rows
0                            US            'US'      53
1                            IN            'IN'      33
2                            BR            'BR'      22
3                            FR            'FR'      20
4                            UK            'UK'      20
5                            JP            'JP'      18
..                          ...             ...     ...
10                           br            'br'       2
11                           in            'in'       2
12                                          ' '       1
13                           US           ' US'       1
14                       <NULL>        '<NULL>'       1
15                           UK           'UK '       1
Top 5 highest-paid employees with their department.
[Amazon · People Tech] A comp partner is calibrating the L5/L6 band before the next focal cycle. They want the top earners surfaced with department, to flag any outliers for HRBP review.
employee_id full_name manager_id department salary
0 100 Dana Chen NaN Engineering 250000.0
1 201 Eve Park 100.0 Engineering 180000.0
2 202 Frank Li 100.0 Engineering 175000.0
3 203 Grace Kim 100.0 Engineering NaN
4 301 IC_301 201.0 Engineering 116962.0
SELECT employee_id, full_name, department, salary
FROM employees
ORDER BY salary DESC
LIMIT 5;
employee_id full_name department salary
0 100 Dana Chen Engineering 250000.0
1 201 Eve Park Engineering 180000.0
2 202 Frank Li Engineering 175000.0
3 315 IC_315 Engineering 161966.0
4 312 IC_312 Engineering 145797.0
Follow-up: “What if two people tie at #5?” → discuss FETCH FIRST 5 ROWS WITH TIES or RANK() OVER (ORDER BY salary DESC).
Edge cases & gotchas (2)
- Ties at the cutoff
Plain LIMIT 5 silently drops one of two tied salaries. Use FETCH FIRST 5 ROWS WITH TIES (Postgres 13+, Snowflake) or RANK() to make ties explicit.
Fix · SQL
-- Postgres 13+: keep all ties at the cutoff
SELECT * FROM employees
ORDER BY salary DESC
FETCH FIRST 5 ROWS WITH TIES;
-- Portable: filter by RANK
WITH ranked AS (
    SELECT *, RANK() OVER (ORDER BY salary DESC) AS rk
    FROM employees
)
SELECT * FROM ranked WHERE rk <= 5;
Output
[error] ParserException: Parser Error: syntax error at or near "WITH"
LINE 3: FETCH FIRST 5 ROWS WITH TIES
                            ^
Postgres orders NULLs LAST in DESC by default; MySQL orders them FIRST. State your DB or be explicit with NULLS LAST.
Fix · SQL
SELECT * FROM employees
ORDER BY salary DESC NULLS LAST   -- portable across Postgres, Snowflake, BigQuery
LIMIT 5;
Output
   employee_id  full_name  manager_id   department    salary
0          100  Dana Chen         NaN  Engineering  250000.0
1          201   Eve Park       100.0  Engineering  180000.0
2          202   Frank Li       100.0  Engineering  175000.0
3          315     IC_315       203.0  Engineering  161966.0
4          312     IC_312       203.0  Engineering  145797.0
Total revenue across all purchases.
[Amazon · Retail Finance] Day-1 of an OP1 review: the S-team wants the simplest possible top-line for the deck — lifetime GMV booked through purchases.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
SELECT ROUND(SUM(amount), 2) AS total_revenue
FROM purchases;
total_revenue
0 127603.25
Edge case: If amount can be NULL, SUM already ignores nulls — but state that explicitly. Interviewers like when you pre-empt the null question.
Edge cases & gotchas (4)
- How every aggregate handles NULL (the full matrix)
COUNT(*)counts rows.COUNT(col)counts non-NULL values of col.SUM/AVG/MIN/MAXall skip NULLs entirely. The denominator surprise:AVG(col)drops NULL rows from both numerator AND denominator — not what you want if NULL means ‘real zero’. The all-NULL surprise:SUM(col)on a column where every row is NULL returnsNULL, not 0. The dataset has 3 NULL amount rows after the dirty-data injection, so we can show the divergence side-by-side.Fix · SQL-- Every aggregate's NULL behavior, on the same data: SELECT COUNT(*) AS rows_total, -- counts rows incl. NULL COUNT(amount) AS amount_non_null, -- skips NULL COUNT(*) - COUNT(amount) AS amount_null_rows, -- diff = NULL count COUNT(DISTINCT amount) AS distinct_amounts, -- skips NULL too ROUND(SUM(amount), 2) AS sum_amount, -- NULL-safe iff at least 1 non-null COALESCE(ROUND(SUM(amount),2),0) AS sum_amount_safe, -- always a number ROUND(AVG(amount), 2) AS avg_amount_pandas_default, -- mean of NON-NULL only ROUND(SUM(amount) / COUNT(*), 2) AS avg_treating_null_as_zero, -- different number! MIN(amount) AS min_amount, -- skips NULL MAX(amount) AS max_amount -- skips NULL FROM purchases;Outputrows_total amount_non_null amount_null_rows distinct_amounts sum_amount sum_amount_safe avg_amount_pandas_default avg_treating_null_as_zero min_amount max_amount 0 301 298 3 296 127603.25 127603.25 428.2 423.93 -228.72 99999.99 - AVG when the column is all NULL
AVG returns
NULL(not 0) on a partition where every row is NULL. If you guard withCOALESCE(AVG(x), 0)you risk hiding the ‘no data at all’ signal in a real zero. Better: emit COUNT alongside, or useNULLIF/ COALESCE deliberately with a label.Fix · SQL-- Per-category averages with explicit empty-handling SELECT product_category, COUNT(*) AS rows, COUNT(amount) AS rows_with_amount, AVG(amount) AS avg_raw, COALESCE(AVG(amount), 0) AS avg_zero_filled, CASE WHEN COUNT(amount) = 0 THEN 'no_data' ELSE CAST(ROUND(AVG(amount),2) AS VARCHAR) END AS avg_labeled FROM purchases GROUP BY product_category ORDER BY rows DESC;Outputproduct_category rows rows_with_amount avg_raw avg_zero_filled avg_labeled 0 Electronics 73 73 88.380822 88.380822 88.38 1 Clothing 64 64 1651.377344 1651.377344 1651.38 2 Food 59 59 87.792034 87.792034 87.79 3 Books 56 55 102.794909 102.794909 102.79 4 Sports 49 47 98.507447 98.507447 98.51 COALESCE(SUM(x), 0)vsSUM(COALESCE(x, 0))Both forms return the same number when the table has rows — SUM ignores NULLs either way. They diverge only on the empty input case:
•COALESCE(SUM(x), 0)applies the default to the aggregated result → returns0on empty (and NULL-only) tables.
•SUM(COALESCE(x, 0))replaces NULL with 0 row-by-row before summing → returns0on a NULL-only table, butNULLon a truly empty input (SQL standard: SUM of zero rows is NULL).
Pick by use case: dashboards that must always show a number (KPIs) want the outside form. Per-row imputation (e.g., treating NULL price as 0 for ledger math) wants the inside form.Fix · SQL-- Side-by-side on three scenarios: -- (a) the real table (has rows + 3 NULL amounts) -- (b) NULL-only table (all amount IS NULL) -- (c) empty table (zero rows) WITH null_only AS (SELECT * FROM purchases WHERE 1=0 UNION ALL SELECT 999, 1, NULL, '2025-01-01', 'demo'), empty_set AS (SELECT * FROM purchases WHERE 1=0) SELECT 'real_table' AS scenario, SUM(amount) AS sum_raw, COALESCE(SUM(amount), 0) AS coalesce_outside, -- final NULL -> 0 SUM(COALESCE(amount, 0)) AS coalesce_inside -- NULL row -> 0 then sum FROM purchases UNION ALL SELECT 'null_only_table', SUM(amount), COALESCE(SUM(amount), 0), SUM(COALESCE(amount, 0)) FROM null_only UNION ALL SELECT 'empty_table', SUM(amount), COALESCE(SUM(amount), 0), SUM(COALESCE(amount, 0)) FROM empty_set;Outputscenario sum_raw coalesce_outside coalesce_inside 0 real_table 127603.25 127603.25 127603.25 1 null_only_table NaN 0.00 0.00 2 empty_table NaN 0.00 NaN- Currency precision (float drift vs DECIMAL)
Storing amount as DOUBLE/FLOAT will lose cents at scale — SUM is non-associative on floats. The schema uses DECIMAL(10,2); confirm before reporting to finance.
Fix · SQL
-- SAFE: explicit DECIMAL cast prevents float drift
SELECT ROUND(SUM(amount)::DECIMAL(18,2), 2) AS revenue
FROM purchases
WHERE amount IS NOT NULL;
Output
     revenue
0  127603.25
Which products generated more than $1,000 in revenue?
[Amazon · Prime Day Merch] Merchandising wants the ‘winners shelf’ for the homepage rotation: every SKU that has cleared $1,000 in revenue.
product_id product_name category revenue
0 1 prod_1 Electronics 435.0
1 2 prod_2 Electronics 25.0
2 3 prod_3 Electronics 384.0
3 4 prod_4 Electronics 1054.0
4 5 prod_5 Electronics 485.0
SELECT product_id, product_name, category, revenue
FROM products
WHERE revenue > 1000
ORDER BY revenue DESC;
product_id product_name category revenue
0 40 prod_40 Sports 12777.0
1 28 prod_28 Clothing 9770.0
2 50 prod_50 Food 8731.0
3 30 prod_30 Clothing 7190.0
4 29 prod_29 Clothing 2972.0
.. ... ... ... ...
11 38 prod_38 Sports 2003.0
12 45 prod_45 Food 1827.0
13 12 prod_12 Books 1698.0
14 17 prod_17 Books 1565.0
15 4 prod_4 Electronics 1054.0
Why it matters: WHERE + sort. Trivial — but watch candidates who use HAVING here without a GROUP BY.
Edge cases & gotchas (2)
- WHERE vs HAVING
If revenue is a pre-aggregate, WHERE works. If you compute it in this query (SUM), you must use HAVING. Mixing them is a frequent interview trap.
Fix · SQL
-- Computed revenue: HAVING, not WHERE
SELECT product_category, SUM(amount) AS revenue
FROM purchases
GROUP BY product_category
HAVING SUM(amount) > 1000;
Output
  product_category    revenue
0            Books    5653.72
1         Clothing  105688.15
2      Electronics    6451.80
3             Food    5179.73
4           Sports    4629.85
Always confirm '> 1000' vs '>= 1000'. Off-by-one at category boundaries silently changes the report.
Average session duration by platform.
[Meta · IG Product] Before the notification redesign ships, product wants average session length by platform — does iOS actually engage longer than Android, controlled for sample size?
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
SELECT platform,
COUNT(*) AS sessions,
ROUND(AVG(duration_sec)::DOUBLE, 1) AS avg_duration_sec
FROM sessions
GROUP BY platform
ORDER BY avg_duration_sec DESC;
platform sessions avg_duration_sec
0 WEB 5 2907.6
1 iOS 1 2691.0
2 Android 2 2085.5
3 web 198 1823.9
4 ios 329 1744.4
5 android 265 1679.0
Follow-up: “Why include the count?” → averages without sample size are dangerously seductive.
Edge cases & gotchas (3)
- Average without sample size lies
Always emit COUNT alongside AVG. A platform with 5 sessions and a 10-second average looks identical to one with 50,000 sessions and the same average — but only one is statistically meaningful.
Fix · SQLSELECT platform, COUNT(*) AS sessions, AVG(duration_sec) AS avg_dur, STDDEV(duration_sec) AS sd_dur, AVG(duration_sec) / SQRT(COUNT(*)) AS sem FROM sessions GROUP BY platform;Outputplatform sessions avg_dur sd_dur sem 0 Android 2 2085.500000 713.470742 1474.671192 1 android 265 1678.973485 1011.429126 103.138521 2 web 198 1823.908629 1007.822862 129.619542 3 iOS 1 2691.000000 NaN 2691.000000 4 ios 329 1744.445122 1062.677389 96.174373 5 WEB 5 2907.600000 775.846828 1300.318250 - Outlier-skewed mean
A few 12-hour zombie sessions pull AVG up. Report median or a trimmed mean alongside.
Fix · SQLSELECT platform, AVG(duration_sec) AS avg_dur, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_sec) AS p50_dur FROM sessions GROUP BY platform;Outputplatform avg_dur p50_dur 0 Android 2085.500000 2085.5 1 WEB 2907.600000 3257.0 2 android 1678.973485 1557.5 3 web 1823.908629 1928.0 4 iOS 2691.000000 2691.0 5 ios 1744.445122 1719.5 - AVG silently drops NULL and negative durations
AVG ignores NULL — but a negative
duration_sec(corrupt event) is treated as a real value and pulls the mean down. The denominator surprise: NULL rows are excluded from both numerator and denominator. Always pre-filter to valid positives before averaging, AND emit adroppedcount so the DQ trail is visible.Fix · SQLSELECT LOWER(platform) AS platform, COUNT(*) AS rows_total, COUNT(duration_sec) AS rows_with_duration, SUM(CASE WHEN duration_sec IS NULL THEN 1 END) AS dropped_null, SUM(CASE WHEN duration_sec < 0 THEN 1 END) AS dropped_negative, ROUND(AVG(duration_sec), 1) AS avg_raw, -- pulled down by negatives ROUND(AVG(CASE WHEN duration_sec >= 0 THEN duration_sec END), 1) AS avg_clean -- valid positives only FROM sessions GROUP BY 1 ORDER BY rows_total DESC;Outputplatform rows_total rows_with_duration dropped_null dropped_negative avg_raw avg_clean 0 ios 330 329 1.0 1.0 1747.3 1752.7 1 android 267 266 1.0 1.0 1682.0 1688.4 2 web 203 202 1.0 NaN 1850.7 1850.7
How many sessions occur on weekends?
[Netflix · Member Ops] An ops lead suspects weekend viewing is what is driving the Tier-1 support backlog. Size weekend session volume so capacity planning has a number to work with.
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
SELECT COUNT(*) AS weekend_sessions
FROM sessions
WHERE EXTRACT(DOW FROM session_date) IN (0, 6);
weekend_sessions
0 224
Portability: MySQL uses DAYOFWEEK() (1=Sun…7=Sat); Spark dayofweek() uses the same convention as MySQL. Always state your DB.
Edge cases & gotchas (2)
- DOW numbering is DB-specific
Postgres EXTRACT(DOW): Sun=0..Sat=6. ISODOW: Mon=1..Sun=7. MySQL DAYOFWEEK: Sun=1..Sat=7. Wrong constant = wrong answer.
Fix · SQL
-- Portable: name-based
SELECT COUNT(*) FROM sessions
WHERE strftime(session_date, '%w') IN ('0','6');      -- DuckDB / SQLite
-- Or: EXTRACT(ISODOW FROM session_date) IN (6, 7)    -- Postgres
Output
   count_star()
0           224
- Time zone drift
If session_date is derived from a UTC timestamp, a 'weekend' for a Sydney user differs from one for an LA user. Convert to user-local TZ before bucketing.
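A sketch of the conversion, assuming a raw UTC timestamp column (session_ts) and a per-user IANA zone (users.tz); neither column exists in the sample schema:
SELECT COUNT(*) AS weekend_sessions_local
FROM sessions s
JOIN users u ON u.user_id = s.user_id
-- session_ts and u.tz are hypothetical: stored-as-UTC timestamp, converted to the user's local zone before bucketing
WHERE EXTRACT(DOW FROM (s.session_ts AT TIME ZONE 'UTC') AT TIME ZONE u.tz) IN (0, 6);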
List all distinct event types in the events table.
[Meta · Onboarding DE] A new analyst on the funnel team needs to know what canonical event types exist before writing any SQL. The lookup table is documented but they want to verify against the actual fact stream.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
SELECT DISTINCT event_type
FROM events
ORDER BY event_type;
event_type
0 activation
1 purchase
2 signup
3 NaN
Production angle: In a real warehouse, prefer querying a documented enum/lookup table instead of DISTINCT on a high-cardinality fact table.
Edge cases & gotchas (2)
- DISTINCT on huge fact tables is expensive
On a 10B-row events table, SELECT DISTINCT does a full scan. Hit the documented enum/lookup table or a daily-cardinality summary.
Fix · SQL
-- Cheap on a billion rows: the catalog table
SELECT type FROM event_types ORDER BY type;
Output
[error] CatalogException: Catalog Error: Table with name event_types does not exist! Did you mean "events"?
LINE 2: SELECT type FROM event_types ORDER BY type
                         ^
- Case sensitivity
DISTINCT is case-sensitive. 'SignUp' and 'signup' are two values; lowercase before grouping if upstream is dirty.
Fix · SQL
SELECT DISTINCT LOWER(event_type) FROM events ORDER BY 1;
Output
  lower(event_type)
0        activation
1          purchase
2            signup
3               NaN
How many users signed up in Q4 2024?
[Google · Ads Marketing] The growth team is reconciling Q4 2024 acquisition against the quarterly board pack. Count signups landed in that window, with a half-open date range.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
SELECT COUNT(*) AS q4_2024_signups
FROM users
WHERE signup_date >= DATE '2024-10-01'
AND signup_date < DATE '2025-01-01';
q4_2024_signups
0 46
Why half-open ranges: >= start AND < end is timezone-safe, leap-day-safe, and avoids the classic BETWEEN '2024-12-31' off-by-one when timestamps creep in.
Edge cases & gotchas (2)
- BETWEEN is INCLUSIVE on both ends
BETWEEN '2024-10-01' AND '2024-12-31' silently misses any midnight-on-Dec-31 timestamps. Half-open ranges (>= start AND < end) are timezone- and DST-safe.
Fix · SQL
-- WRONG when signup_date can be a TIMESTAMP
WHERE signup_date BETWEEN '2024-10-01' AND '2024-12-31'
-- RIGHT: half-open
WHERE signup_date >= DATE '2024-10-01'
  AND signup_date < DATE '2025-01-01'
- Time zone of the report
‘Q4 2024’ in PT vs UTC differs by 8 hours. State the report TZ and CONVERT/AT TIME ZONE if signup_date is UTC.
Largest single order amount and its date.
[Amazon · Trust & Safety] A fraud analyst is doing an outlier review and wants the top-of-distribution: the single largest order ever booked, with date for forensic context.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT order_id, user_id, order_date, amount
FROM orders
ORDER BY amount DESC
LIMIT 1;
order_id user_id order_date amount
0 446 176 2025-02-25 958.35
Alternative: A subquery with WHERE amount = (SELECT MAX(amount) FROM orders) — and discuss why ORDER BY ... LIMIT is usually clearer and almost always more index-friendly.
Edge cases & gotchas (2)
- Multiple tied maxima
If two orders share the max amount, LIMIT 1 returns one non-deterministically. Use ROW_NUMBER with a deterministic tiebreak.
Fix · SQL
SELECT order_id, user_id, order_date, amount
FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY amount DESC, order_id) AS rn
    FROM orders
) t
WHERE rn = 1;
Output
   order_id  user_id  order_date  amount
0       446      176  2025-02-25  958.35
- Negative amounts (refunds)
If the table also stores refunds with negative amounts, MAX still returns the largest positive. But MIN may surprise: the ‘biggest refund’ is the most negative.
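A quick way to surface all three numbers at once (FILTER is Postgres / DuckDB; on this sample data the amounts are likely all positive, so the smallest-positive and most-negative columns coincide):
SELECT MAX(amount)                            AS largest_charge,
       MIN(amount)                            AS most_negative_amount,      -- the 'biggest refund' if refunds exist
       MIN(amount) FILTER (WHERE amount > 0)  AS smallest_positive_charge
FROM orders;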
Sessions per day for the most recent 7 days in the data.
[Netflix · DataOps] An on-call DE just shipped a schema change to the sessions topic. Before paging the team, they want a freshness sanity check: session counts for the last 7 days the data actually covers.
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
SELECT session_date, COUNT(*) AS sessions
FROM sessions
WHERE session_date >= (SELECT MAX(session_date) - INTERVAL '6 days' FROM sessions)
GROUP BY session_date
ORDER BY session_date;
session_date sessions
0 2099-06-01 1
Why MAX() instead of CURRENT_DATE: the data may be stale; you want ‘last 7 days that exist’, not ‘last 7 calendar days’.
Edge cases & gotchas (2)
- Sparse data near the head
MAX(session_date) - 6 days might span days with zero sessions. Date-spine left-join if you want exactly 7 daily rows.
Fix · SQLWITH spine AS ( SELECT (SELECT MAX(session_date) FROM sessions) - INTERVAL n DAY AS d FROM range(0, 7) t(n) ) SELECT s.d, COUNT(x.session_id) AS sessions FROM spine s LEFT JOIN sessions x ON x.session_date = s.d GROUP BY s.d ORDER BY s.d;Output[error] ParserException: Parser Error: syntax error at or near "DAY" LINE 2: ... SELECT (SELECT MAX(session_date) FROM sessions) - INTERVAL n DAY AS d ^ - Time zone vs ‘today’
CURRENT_DATE returns the server's date — usually UTC, sometimes the cluster region. Use AT TIME ZONE to report in the dashboard's TZ.
Tier 2 · Joins & Multi-Table
Find users who have never made a purchase.
[Meta · Lifecycle] The lifecycle team wants to send a free-trial coupon to every user who signed up but never converted to a purchase event. Build the audience list with an anti-join — and explain why NOT IN is the wrong tool.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
SELECT u.user_id, u.name, u.country, u.signup_date
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
WHERE p.purchase_id IS NULL;
user_id name country signup_date
0 1 user_1 IN 2024-10-02
1 2 user_2 UK 2024-10-04
2 4 user_4 JP 2024-10-08
3 8 user_8 US 2024-10-15
4 10 user_10 IN 2024-10-19
.. ... ... ... ...
70 193 user_193 US 2025-10-21
71 194 user_194 UK 2025-10-23
72 195 user_195 BR 2025-10-25
73 197 user_197 US 2025-10-28
74 199 user_199 US 2025-11-02
Why anti-join via LEFT JOIN + IS NULL: often plans better than NOT IN (which silently breaks on NULLs) or NOT EXISTS.
Edge cases & gotchas (3)
- NOT IN poisons on NULL
If the subquery returns any NULL, NOT IN evaluates to UNKNOWN for every row — the result set becomes empty. Anti-join via LEFT JOIN + IS NULL or NOT EXISTS is NULL-safe.
Fix · SQL
-- DANGEROUS
SELECT * FROM users
WHERE user_id NOT IN (SELECT user_id FROM purchases);
-- SAFE
SELECT u.*
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
WHERE p.purchase_id IS NULL;
Output
    user_id      name                email signup_date country
0         1    user_1    user1@example.com  2024-10-02      IN
1         2    user_2    user2@example.com  2024-10-04      UK
2         4    user_4    user4@example.com  2024-10-08      JP
3         8    user_8    user8@example.com  2024-10-15      US
4        10   user_10   user10@example.com  2024-10-19      IN
5        12   user_12   user12@example.com  2024-10-23      \N
..      ...       ...                  ...         ...     ...
69      190  user_190  user190@example.com  2025-10-15      FR
70      193  user_193  user193@example.com  2025-10-21      US
71      194  user_194  user194@example.com  2025-10-23      UK
72      195  user_195  user195@example.com  2025-10-25      BR
73      197  user_197  user197@example.com  2025-10-28      US
74      199  user_199  user199@example.com  2025-11-02      US
──── result 2 ────
    user_id      name                email signup_date country
0         1    user_1    user1@example.com  2024-10-02      IN
1         2    user_2    user2@example.com  2024-10-04      UK
2         4    user_4    user4@example.com  2024-10-08      JP
3         8    user_8    user8@example.com  2024-10-15      US
4        10   user_10   user10@example.com  2024-10-19      IN
5        12   user_12   user12@example.com  2024-10-23      \N
..      ...       ...                  ...         ...     ...
69      190  user_190  user190@example.com  2025-10-15      FR
70      193  user_193  user193@example.com  2025-10-21      US
71      194  user_194  user194@example.com  2025-10-23      UK
72      195  user_195  user195@example.com  2025-10-25      BR
73      197  user_197  user197@example.com  2025-10-28      US
74      199  user_199  user199@example.com  2025-11-02      US
- Soft-deleted users
If users has a deleted_at column, you might be sending a coupon to a churned account. Filter the audience BEFORE the anti-join.
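A sketch of that ordering, assuming a hypothetical deleted_at column on users:
SELECT u.user_id, u.name
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
WHERE u.deleted_at IS NULL        -- hypothetical soft-delete column: exclude churned accounts first
  AND p.purchase_id IS NULL;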
WHERE country != 'US'excludes NULL tooThree-valued logic strikes again:
NULL != 'US'evaluates toUNKNOWN, not TRUE — so users with NULL country drop out of the audience silently. If you actually want ‘not US, including unknowns’, useIS DISTINCT FROM(Postgres / DuckDB) or an explicit OR. Side-by-side counts show the gap.Fix · SQL-- All three answers in one row, so the gap is unmissable SELECT SUM(CASE WHEN country != 'US' THEN 1 ELSE 0 END) AS not_us_naive, -- drops NULL SUM(CASE WHEN country IS DISTINCT FROM 'US' THEN 1 ELSE 0 END) AS not_us_safe, -- NULL-safe SUM(CASE WHEN country != 'US' OR country IS NULL THEN 1 ELSE 0 END) AS not_us_explicit_or, -- same idea SUM(CASE WHEN country IS NULL THEN 1 ELSE 0 END) AS gap_explained -- = safe - naive FROM users;Outputnot_us_naive not_us_safe not_us_explicit_or gap_explained 0 140.0 147.0 147.0 7.0
Average purchase amount per user country.
[Amazon · Pricing Science] Pricing wants to test whether IN, UK, and US baskets differ enough to justify country-specific promo pricing. Average purchase amount by user country, with sample size.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
SELECT u.country,
COUNT(p.purchase_id) AS purchases,
ROUND(AVG(p.amount), 2) AS avg_basket
FROM users u
JOIN purchases p ON p.user_id = u.user_id
GROUP BY u.country
ORDER BY avg_basket DESC;
country purchases avg_basket
0 US 86 1271.63
1 in 4 147.87
2 NaN 8 135.66
3 \N 21 113.80
4 FR 31 108.38
.. ... ... ...
10 br 2 40.26
11 4 34.99
12 UK 2 27.49
13 <NULL> 1 6.21
14 US 1 3.62
Trap: A LEFT JOIN here would also keep countries whose users never purchased — they would show AVG = NULL, not 0, which is correct but easy to mis-read.
Edge cases & gotchas (2)
- Inner vs left changes the denominator
INNER JOIN gives the avg only for purchasing countries. LEFT JOIN with AVG gives NULL for non-purchasers — easy to mis-read as 0.
Fix · SQL
-- LEFT JOIN, COALESCE for clarity
SELECT u.country,
       COALESCE(ROUND(AVG(p.amount), 2), 0) AS avg_basket,
       COUNT(p.purchase_id) AS purchases
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
GROUP BY u.country;
Output
   country  avg_basket  purchases
0       JP       79.99         23
1      NaN      135.66          8
2       FR      108.38         31
3   <NULL>        6.21          1
4       BR       99.58         28
5                34.99          4
..     ...         ...        ...
10      US        3.62          1
11                0.00          0
12      UK       27.49          2
13      UK       65.22         29
14      in      147.87          4
15      DE       84.08         13
- Per-country sample size
Always emit purchase count alongside average. A 1-purchase country with avg $999 looks identical to a 1,000-purchase country with avg $999.
Which users have no profile row?
[Meta · Privacy Eng] Before a profile-personalization feature goes GA, the privacy review needs a list of users who have a users row but no user_profiles row — these users have not consented to the profile-data flow.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
SELECT u.user_id, u.name
FROM users u
WHERE NOT EXISTS (
SELECT 1 FROM user_profiles up WHERE up.user_id = u.user_id
);
Empty DataFrame
Columns: [user_id, name]
Index: []
Why NOT EXISTS: safe with NULLs, semantically clearest, and the optimizer typically transforms it into an anti-join.
Edge cases & gotchas (2)
- Multiple profile rows per user
user_profiles intentionally has 334 rows for 200 users (Q22's dedup target). NOT EXISTS still works — it short-circuits — but if you switched to LEFT JOIN, you must SELECT DISTINCT.
Fix · SQL
-- LEFT JOIN form needs DISTINCT
SELECT DISTINCT u.user_id, u.name
FROM users u
LEFT JOIN user_profiles up ON up.user_id = u.user_id
WHERE up.user_id IS NULL;
Output
Empty DataFrame
Columns: [user_id, name]
Index: []
- NULL user_id in user_profiles
If user_profiles has rows with NULL user_id (corrupt loads), they do not match anything and don't pollute the anti-join — but they are a DQ bug worth alerting on separately.
Top 10 spenders — name and total spend.
[Netflix · CX] The white-glove support team is spinning up a VIP queue. They want the top 10 spenders by lifetime billing — name + total + order count — so phone routing can lift them.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT u.user_id, u.name,
ROUND(SUM(o.amount), 2) AS total_spend,
COUNT(o.order_id) AS order_count
FROM users u
JOIN orders o ON o.user_id = u.user_id
GROUP BY u.user_id, u.name
ORDER BY total_spend DESC
LIMIT 10;
user_id name total_spend order_count
0 176 user_176 1445.05 8
1 70 user_70 1337.78 10
2 168 user_168 1117.62 3
3 112 user_112 971.71 6
4 36 user_36 952.59 3
5 140 user_140 808.65 3
6 184 user_184 768.93 2
7 133 user_133 752.55 4
8 131 user_131 741.68 4
9 38 user_38 715.93 4
Note: orders.user_id is not a strict FK in the dataset — discuss what happens if an order references a missing user (use LEFT JOIN to find them; see Q46).
Edge cases & gotchas (2)
- Orphan order rows inflate the spend
If orders.user_id can reference a missing users row, INNER JOIN drops them; LEFT JOIN keeps them but the user name is NULL. State which behavior the dashboard wants.
Fix · SQLSELECT u.name, SUM(o.amount) FILTER (WHERE u.user_id IS NOT NULL) AS spend_known, SUM(o.amount) FILTER (WHERE u.user_id IS NULL) AS spend_orphan FROM orders o LEFT JOIN users u ON u.user_id = o.user_id;Output[error] BinderException: Binder Error: column "name" must appear in the GROUP BY clause or must be part of an aggregate function. Either add it to the GROUP BY list, or use "ANY_VALUE(name)" if the exact value of "name" is not important. LINE 1: SELECT u.name, ^ - Refunded orders
If amount can be negative (refunds), SUM is the net. Some teams report gross + refund separately; clarify before coding.
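One way to report both, sketched on orders (FILTER is Postgres / DuckDB; the sample table likely has no negative rows, so refunds would come back NULL here):
SELECT SUM(amount) FILTER (WHERE amount > 0)  AS gross_charges,
       SUM(amount) FILTER (WHERE amount < 0)  AS refunds,
       SUM(amount)                            AS net_spend
FROM orders;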
List each employee with their manager's name.
[Amazon · HR] HR is generating onboarding packets and wants a flat employee → manager mapping. The CEO has no manager and must still appear in the export.
employee_id full_name manager_id department salary
0 100 Dana Chen NaN Engineering 250000.0
1 201 Eve Park 100.0 Engineering 180000.0
2 202 Frank Li 100.0 Engineering 175000.0
3 203 Grace Kim 100.0 Engineering NaN
4 301 IC_301 201.0 Engineering 116962.0
SELECT e.employee_id, e.full_name AS employee, e.department,
m.full_name AS manager
FROM employees e
LEFT JOIN employees m ON m.employee_id = e.manager_id
ORDER BY e.department, e.full_name;
employee_id employee department manager
0 100 Dana Chen Engineering NaN
1 201 Eve Park Engineering Dana Chen
2 202 Frank Li Engineering Dana Chen
3 203 Grace Kim Engineering Dana Chen
4 301 IC_301 Engineering Eve Park
.. ... ... ... ...
23 320 IC_320 Engineering IC_302
24 321 IC_321 Engineering IC_302
25 322 IC_322 Engineering IC_306
26 323 IC_323 Engineering IC_306
27 324 IC_324 Engineering IC_306
Why LEFT JOIN: the CEO has no manager (manager_id IS NULL) and would otherwise vanish.
Edge cases & gotchas (2)
- Self-join hides the CEO
INNER JOIN drops the CEO (manager_id IS NULL). LEFT JOIN keeps them but their manager column is NULL — which the export must handle, not crash.
Fix · SQLSELECT e.full_name AS employee, COALESCE(m.full_name, '<no manager>') AS manager FROM employees e LEFT JOIN employees m ON m.employee_id = e.manager_id;Outputemployee manager 0 Eve Park Dana Chen 1 Frank Li Dana Chen 2 Grace Kim Dana Chen 3 IC_301 Eve Park 4 IC_302 Eve Park 5 IC_303 Eve Park .. ... ... 22 IC_320 IC_302 23 IC_321 IC_302 24 IC_322 IC_306 25 IC_323 IC_306 26 IC_324 IC_306 27 Dana Chen <no manager> - Stale manager_id (manager left)
If a manager has been terminated and the row removed, the report shows NULL manager. Real HRIS systems keep terminated employees with a deleted_at flag for exactly this reason.
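A cheap DQ check for dangling manager_id values, sketched on employees (likely empty on this dataset):
SELECT e.employee_id, e.full_name, e.manager_id AS dangling_manager_id
FROM employees e
LEFT JOIN employees m ON m.employee_id = e.manager_id
WHERE e.manager_id IS NOT NULL    -- skip the CEO
  AND m.employee_id IS NULL;      -- the referenced manager row no longer exists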
Salary delta vs. each employee's manager.
[Amazon · People Tech] A comp review is flagging ‘manager-paid-less-than-report’ cases — a classic signal of a stale promo or under-leveled report. Surface every pair.
employee_id full_name manager_id department salary
0 100 Dana Chen NaN Engineering 250000.0
1 201 Eve Park 100.0 Engineering 180000.0
2 202 Frank Li 100.0 Engineering 175000.0
3 203 Grace Kim 100.0 Engineering NaN
4 301 IC_301 201.0 Engineering 116962.0
SELECT e.full_name AS employee, e.salary AS emp_salary,
m.full_name AS manager, m.salary AS mgr_salary,
(m.salary - e.salary) AS manager_premium
FROM employees e
JOIN employees m ON m.employee_id = e.manager_id
WHERE m.salary < e.salary;
employee emp_salary manager mgr_salary manager_premium
0 IC_316 121279.0 IC_301 116962.0 -4317.0
1 IC_317 131077.0 IC_301 116962.0 -14115.0
Follow-up: “What about skip-level comparisons?” — that is the recursive CTE in Q42–Q43.
Edge cases & gotchas (2)
- Promo lag, not bug
A new senior IC may legitimately out-earn their L4 manager. Don't auto-flag; surface the pair for HRBP review with comp band context.
Fix · SQLSELECT e.full_name AS report, e.salary, e.department, m.full_name AS manager, m.salary, (e.salary - m.salary) AS report_premium FROM employees e JOIN employees m ON m.employee_id = e.manager_id WHERE e.salary > m.salary ORDER BY report_premium DESC;Outputreport salary department manager salary_1 report_premium 0 IC_317 131077.0 Engineering IC_301 116962.0 14115.0 1 IC_316 121279.0 Engineering IC_301 116962.0 4317.0 - Currency / part-time normalization
Salaries should be normalized to FTE / annualized USD before comparison. A 0.6-FTE manager out-earning a 1.0-FTE report is expected, not a flag.
Top 3 product categories by total purchase revenue.
[Amazon · Holiday Merch] Merchandising is locking the homepage carousel for Prime Day. Top 3 product categories by revenue and purchase count — confirm both definitions agree.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
SELECT product_category,
ROUND(SUM(amount), 2) AS revenue,
COUNT(*) AS purchases
FROM purchases
GROUP BY product_category
ORDER BY revenue DESC
LIMIT 3;
product_category revenue purchases
0 Clothing 105688.15 64
1 Electronics 6451.80 73
2 Books 5653.72 56
Subtle: ‘by revenue’ can mean SUM(amount) or COUNT(*) × avg price. Always confirm which revenue definition before writing.
Edge cases & gotchas (2)
- Definition of revenue
‘Top by revenue’ can mean SUM(amount) (gross), or SUM(amount * (1 - return_rate)) (net), or COUNT * AVG (units × ASP). Confirm before writing.
- Tied #3 silently dropped
LIMIT 3 drops one of two tied categories at #3. Use DENSE_RANK if the carousel can show ties.
Fix · SQLWITH ranked AS ( SELECT product_category, SUM(amount) AS revenue, DENSE_RANK() OVER (ORDER BY SUM(amount) DESC) AS rk FROM purchases GROUP BY product_category ) SELECT * FROM ranked WHERE rk <= 3;Outputproduct_category revenue rk 0 Clothing 105688.15 1 1 Electronics 6451.80 2 2 Books 5653.72 3
Users active in both events and logins.
[Meta · DAU Definition] Engagement is debating ‘truly active’ — does it require both an event and a login that day? Find the intersection of users in events and logins as a starting reference set.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
SELECT user_id FROM events
INTERSECT
SELECT user_id FROM logins;
user_id
0 35
1 37
2 54
3 62
4 64
.. ...
95 73
96 76
97 82
98 89
99 94
Tradeoff: INTERSECT is concise and dedups for free; EXISTS wins when one side is much larger or you want to stream early.
Edge cases & gotchas (2)
- INTERSECT removes duplicates
INTERSECT in standard SQL is set semantics — duplicates collapse. INTERSECT ALL preserves duplicates. Check your DB; not every engine implements ALL.
- Time-window mismatch
‘Active in events AND logins’ needs a window — last 7 days? Lifetime? The intersection definition is meaningless without one. Add a date filter.
Fix · SQLSELECT user_id FROM events WHERE event_date >= CURRENT_DATE - INTERVAL '7' DAY INTERSECT SELECT user_id FROM logins WHERE login_date >= CURRENT_DATE - INTERVAL '7' DAY;OutputEmpty DataFrame Columns: [user_id] Index: []
Country-level signup→purchase conversion rate.
[Meta · Reels Funnel] Growth is debugging a country-level dropoff in the signup → purchase funnel. Compute conversion rate per country and sort — the IN cohort is suspected of payment-friction issues.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
SELECT u.country,
COUNT(DISTINCT u.user_id) AS signups,
COUNT(DISTINCT p.user_id) AS purchasers,
ROUND(100.0 * COUNT(DISTINCT p.user_id)
/ NULLIF(COUNT(DISTINCT u.user_id), 0), 2) AS conv_pct
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
GROUP BY u.country
ORDER BY conv_pct DESC;
country signups purchasers conv_pct
0 <NULL> 1 1 100.00
1 in 2 2 100.00
2 US 1 1 100.00
3 UK 1 1 100.00
4 \N 9 8 88.89
.. ... ... ... ...
11 NaN 7 4 57.14
12 BR 22 12 54.55
13 br 2 1 50.00
14 UK 20 9 45.00
15 1 0 0.00
Why NULLIF(..., 0): defends against divide-by-zero if a country has no signups (will not happen here, but it is a habit).
Edge cases & gotchas (2)
- Cohort definition
‘Conversion’ is undefined without a window: ever, first 7 days, first 30 days, first session? State which.
Fix · SQL
-- 30-day-window conversion (more honest than ‘ever’)
SELECT u.country,
       100.0 * COUNT(DISTINCT p.user_id) FILTER (
           WHERE p.purchase_date <= u.signup_date + INTERVAL '30' DAY
       ) / NULLIF(COUNT(DISTINCT u.user_id), 0) AS conv_30d_pct
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
GROUP BY u.country;
Output
   country  conv_30d_pct
0   <NULL>    100.000000
1              66.666667
2       JP     61.111111
3      NaN     57.142857
4       US     62.264151
5       IN     54.545455
..     ...           ...
10      BR     50.000000
11      in    100.000000
12      UK    100.000000
13      FR     60.000000
14      UK     40.000000
15      DE     71.428571
- Tiny-cohort countries
DE with 8 users gives a 100% conversion that is statistically meaningless. Add a min_cohort filter for the chart, but report raw counts for completeness.
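One way to wire that in, as a sketch (the threshold of 10 is arbitrary; pick it with the analyst):
SELECT u.country,
       COUNT(DISTINCT u.user_id) AS signups,
       ROUND(100.0 * COUNT(DISTINCT p.user_id)
             / NULLIF(COUNT(DISTINCT u.user_id), 0), 2) AS conv_pct
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
GROUP BY u.country
HAVING COUNT(DISTINCT u.user_id) >= 10   -- keep tiny cohorts out of the chart; report them separately
ORDER BY conv_pct DESC;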
Join products with the current price from product_dim (SCD2).
[Apple · App Store Pricing] A pricing dashboard needs the live SKU price for each product, ignoring expired historical rows. Use the SCD2 dim with is_current = 'Y' — and flag why this is wrong for as-of-historical queries (see Q41).
product_id product_name price effective_date expiry_date is_current
0 1 product_1 60.01 2024-01-01 2024-04-16 N
1 1 product_1 66.01 2024-04-16 NaT Y
2 2 product_2 171.42 2024-01-01 2024-04-22 N
3 2 product_2 188.56 2024-04-22 2024-07-28 N
4 3 product_3 156.15 2024-01-01 2024-04-28 N
SELECT p.product_id, p.product_name, p.category,
pd.price AS current_price, pd.effective_date
FROM products p
LEFT JOIN product_dim pd
ON pd.product_id = p.product_id
AND pd.is_current = 'Y'
ORDER BY p.product_id
LIMIT 10;
product_id product_name category current_price effective_date
0 1 prod_1 Electronics 66.01 2024-04-16
1 1 prod_1 Electronics 66.01 2024-04-16
2 2 prod_2 Electronics NaN NaT
3 3 prod_3 Electronics NaN NaT
4 4 prod_4 Electronics 119.64 2024-06-06
5 5 prod_5 Electronics 38.98 2024-03-26
6 6 prod_6 Electronics NaN NaT
7 7 prod_7 Electronics NaN NaT
8 8 prod_8 Electronics 170.15 2024-03-05
9 9 prod_9 Electronics 64.08 2024-11-14
Setup for Q41: using is_current = 'Y' is fast but only correct for ‘now’ queries. For ‘as-of-some-historical-date’ you must use effective_date / expiry_date windows.
Edge cases & gotchas (2)
- Multiple is_current='Y' per product
If your SCD2 builder has a bug, two rows can have is_current='Y' for the same product. Add an assert.
Fix · SQL
-- DQ: must be 1 current row per product
SELECT product_id, COUNT(*) AS current_rows
FROM product_dim
WHERE is_current = 'Y'
GROUP BY product_id
HAVING COUNT(*) > 1;
Output
   product_id  current_rows
0          11             2
1          19             2
- is_current='Y' is a denormalization
It is a maintained flag — easily out of sync with effective/expiry. For correctness-critical queries, derive 'current' from expiry_date IS NULL OR expiry_date > CURRENT_DATE.
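Deriving it from the windows instead, as a sketch:
SELECT product_id, price, effective_date, expiry_date
FROM product_dim
WHERE effective_date <= CURRENT_DATE
  AND (expiry_date IS NULL OR expiry_date > CURRENT_DATE);   -- 'current' computed from the windows, not the flag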
Tier 3 · Window Functions, CASE, Subqueries
Rank users by total order spend (with ties).
[Amazon · Loyalty A/B] A loyalty experiment is ranking treatment users by lifetime spend to pick the top tier. Use both RANK and DENSE_RANK so ties are handled deterministically — ties matter when the cutoff is paid.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT user_id,
ROUND(SUM(amount), 2) AS total_spend,
RANK() OVER (ORDER BY SUM(amount) DESC) AS spend_rank,
DENSE_RANK() OVER (ORDER BY SUM(amount) DESC) AS spend_dense_rank
FROM orders
GROUP BY user_id
ORDER BY spend_rank
LIMIT 10;
user_id total_spend spend_rank spend_dense_rank
0 176 1445.05 1 1
1 70 1337.78 2 2
2 168 1117.62 3 3
3 112 971.71 4 4
4 36 952.59 5 5
5 140 808.65 6 6
6 184 768.93 7 7
7 133 752.55 8 8
8 131 741.68 9 9
9 38 715.93 10 10
RANK vs DENSE_RANK vs ROW_NUMBER: ranks skip after ties (1,2,2,4); dense_rank does not (1,2,2,3); row_number breaks ties arbitrarily. Pick deliberately.
Edge cases & gotchas (2)
- RANK skips, DENSE_RANK doesn't, ROW_NUMBER picks one
If three users tie at #1: RANK gives 1,1,1,4; DENSE_RANK gives 1,1,1,2; ROW_NUMBER gives 1,2,3 non-deterministically. Pick by downstream consumer.
Fix · SQLSELECT user_id, total_spend, RANK() OVER (ORDER BY total_spend DESC) AS r, DENSE_RANK() OVER (ORDER BY total_spend DESC) AS dr, ROW_NUMBER() OVER (ORDER BY total_spend DESC, user_id) AS rn FROM (SELECT user_id, SUM(amount) AS total_spend FROM orders GROUP BY 1) t;Outputuser_id total_spend r dr rn 0 176 1445.05 1 1 1 1 70 1337.78 2 2 2 2 168 1117.62 3 3 3 3 112 971.71 4 4 4 4 36 952.59 5 5 5 5 140 808.65 6 6 6 .. ... ... ... ... ... 178 15 13.34 179 179 179 179 22 9.74 180 180 180 180 41 8.39 181 181 181 181 71 7.88 182 182 182 182 183 3.41 183 183 183 183 25 3.16 184 184 184 - NULL spend (no orders)
If you LEFT JOIN to keep all users, SUM is NULL. ORDER BY DESC places NULLs last in Postgres but FIRST in MySQL. Be explicit.
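A sketch of the explicit form (Postgres / DuckDB / Snowflake; MySQL has no NULLS LAST, so order by total_spend IS NULL, total_spend DESC instead):
SELECT u.user_id,
       SUM(o.amount) AS total_spend,
       RANK() OVER (ORDER BY SUM(o.amount) DESC NULLS LAST) AS spend_rank
FROM users u
LEFT JOIN orders o ON o.user_id = u.user_id
GROUP BY u.user_id
ORDER BY total_spend DESC NULLS LAST;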
Deduplicate user_profiles — keep latest row per user.
[Meta · CDP] The CDP team has a profile table with intentional history (334 rows for 200 users). Dedup to one row per user, latest updated_at wins — and tiebreak deterministically or your downstream join is non-reproducible.
user_id name city updated_at
0 1 user_1 Denver 2025-01-05
1 1 user_1 NYC 2025-02-19
2 2 user_2 SF 2025-01-01
3 3 user_3 Austin 2025-01-11
4 3 user_3 NaN 2025-02-01
WITH ranked AS (
SELECT up.*,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY updated_at DESC, city DESC
) AS rn
FROM user_profiles up
)
SELECT user_id, name, city, updated_at
FROM ranked
WHERE rn = 1
ORDER BY user_id
LIMIT 10;
user_id name city updated_at
0 1 user_1 NYC 2025-02-19
1 2 user_2 SF 2025-01-01
2 3 user_3 NaN 2025-02-01
3 4 user_4 Boston 2025-01-23
4 5 user_5 NYC 2025-01-18
5 6 user_6 Chicago 2025-01-02
6 7 user_7 Seattle 2025-02-07
7 8 user_8 Austin 2025-03-15
8 9 user_9 Boston 2025-01-06
9 10 user_10 SF 2025-01-09
Always tiebreak. Without the secondary ORDER BY, two rows with identical updated_at produce a non-deterministic winner — the kind of bug that only shows up after a backfill.
Edge cases & gotchas (2)
- Tied updated_at = non-deterministic winner
Without a secondary tiebreak, two profile rows with identical updated_at produce a coin-flip dedup. Add a deterministic secondary key (e.g., city DESC, primary key DESC).
Fix · SQLROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY updated_at DESC, city DESC, ROWID DESC -- always deterministic ) AS rn - All-NULL updated_at
If a partition has only NULL update timestamps, ROW_NUMBER still assigns rn=1 to one row arbitrarily. Decide: drop the partition, or pick by another field.
WoW growth in WAU using LAG.
[Meta · Weekly Biz Review] The weekly business review wants WAU and WoW % change in one query. Use LAG and guard the divisor with NULLIF for the first row.
week_start wau
0 2024-01-01 10000
1 2024-01-08 10230
2 2024-01-15 10455
3 2024-01-22 10819
4 2024-01-29 10434
SELECT week_start,
wau,
LAG(wau) OVER (ORDER BY week_start) AS prev_wau,
ROUND( 100.0 * (wau - LAG(wau) OVER (ORDER BY week_start))
/ NULLIF(LAG(wau) OVER (ORDER BY week_start), 0), 2) AS wow_pct
FROM weekly_metrics
ORDER BY week_start
LIMIT 10;
week_start wau prev_wau wow_pct
0 2024-01-01 10000 <NA> NaN
1 2024-01-08 10230 10000 2.30
2 2024-01-15 10455 10230 2.20
3 2024-01-22 10819 10455 3.48
4 2024-01-29 10434 10819 -3.56
5 2024-02-05 10972 10434 5.16
6 2024-02-12 11048 10972 0.69
7 2024-02-19 11454 11048 3.67
8 2024-02-26 11077 11454 -3.29
9 2024-03-04 10499 11077 -5.22
Readability tip: in a real codebase, wrap the LAG() in a CTE so you do not repeat it three times.
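For reference, a sketch of that CTE form — LAG() computed once and reused:
WITH w AS (
  SELECT week_start, wau,
         LAG(wau) OVER (ORDER BY week_start) AS prev_wau
  FROM weekly_metrics
)
SELECT week_start, wau, prev_wau,
       ROUND(100.0 * (wau - prev_wau) / NULLIF(prev_wau, 0), 2) AS wow_pct
FROM w
ORDER BY week_start;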
Edge cases & gotchas (2)
- First row's wow_pct is NULL
LAG returns NULL on the first row, and LAG/0 is undefined — NULLIF the divisor or COALESCE the result, depending on whether downstream wants NULL or 0.
Fix · SQLSELECT week_start, wau, 100.0 * (wau - LAG(wau) OVER w) / NULLIF(LAG(wau) OVER w, 0) AS wow_pct FROM weekly_metrics WINDOW w AS (ORDER BY week_start);Outputweek_start wau wow_pct 0 2024-01-01 10000 NaN 1 2024-01-08 10230 2.300000 2 2024-01-15 10455 2.199413 3 2024-01-22 10819 3.481588 4 2024-01-29 10434 -3.558554 5 2024-02-05 10972 5.156220 .. ... ... ... 46 2024-11-18 15532 -1.296390 47 2024-11-25 15750 1.403554 48 2024-12-02 16883 7.193651 49 2024-12-09 16215 -3.956643 50 2024-12-16 15425 -4.872032 51 2024-12-23 15680 1.653160 - Missing weeks (gapped data)
LAG is positional, not date-aware: if week 2024-01-08 is missing, LAG jumps from 01-01 to 01-15 silently. Date-spine first.
CASE pivot — sessions per platform per day.
[Meta · IG Reporting] A weekly report on platform-mix wants one row per day with three columns: iOS, Android, Web. Pivot via CASE — and explain why this is more portable than Snowflake's PIVOT when reporting tools straddle warehouses.
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
SELECT session_date,
SUM(CASE WHEN platform = 'ios' THEN 1 ELSE 0 END) AS ios,
SUM(CASE WHEN platform = 'android' THEN 1 ELSE 0 END) AS android,
SUM(CASE WHEN platform = 'web' THEN 1 ELSE 0 END) AS web,
COUNT(*) AS total
FROM sessions
GROUP BY session_date
ORDER BY session_date
LIMIT 10;
session_date ios android web total
0 2025-01-01 1.0 10.0 0.0 12
1 2025-01-02 7.0 4.0 1.0 12
2 2025-01-03 4.0 7.0 1.0 12
3 2025-01-04 3.0 3.0 6.0 12
4 2025-01-05 6.0 2.0 4.0 12
5 2025-01-06 7.0 3.0 2.0 12
6 2025-01-07 4.0 6.0 2.0 12
7 2025-01-08 3.0 4.0 4.0 12
8 2025-01-09 1.0 7.0 4.0 12
9 2025-01-10 8.0 3.0 1.0 12
vs. native PIVOT: Snowflake / SQL Server have PIVOT; Postgres uses CASE (or crosstab). The CASE form is the most portable.
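For comparison, the native form looks roughly like this in Snowflake — treat it as a sketch and check your warehouse's docs, since PIVOT syntax differs between Snowflake and SQL Server:
SELECT *
FROM (SELECT session_date, platform, session_id FROM sessions)
  PIVOT (COUNT(session_id) FOR platform IN ('ios', 'android', 'web'))
ORDER BY session_date;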
Edge cases & gotchas (2)
- Unknown platform values
If a new platform ('windows') ships, the CASE pivot silently drops those rows from the named columns. Add a 'other' bucket and an alert when total != iOS+Android+Web+other.
Fix · SQLSELECT session_date, SUM(CASE WHEN platform='ios' THEN 1 ELSE 0 END) AS ios, SUM(CASE WHEN platform='android' THEN 1 ELSE 0 END) AS android, SUM(CASE WHEN platform='web' THEN 1 ELSE 0 END) AS web, SUM(CASE WHEN platform NOT IN ('ios','android','web') THEN 1 ELSE 0 END) AS other, COUNT(*) AS total FROM sessions GROUP BY session_date;Outputsession_date ios android web other total 0 2025-01-22 4.0 3.0 5.0 0.0 12 1 2025-02-13 7.0 4.0 1.0 0.0 12 2 2025-01-20 2.0 5.0 4.0 1.0 12 3 2025-01-25 7.0 1.0 4.0 0.0 12 4 2025-01-30 4.0 6.0 2.0 0.0 12 5 2025-02-07 5.0 3.0 3.0 1.0 12 .. ... ... ... ... ... ... 62 2025-02-11 2.0 5.0 5.0 0.0 12 63 2025-02-19 5.0 3.0 4.0 0.0 12 64 2025-02-20 3.0 7.0 1.0 0.0 11 65 2025-03-01 5.0 3.0 4.0 0.0 12 66 2025-01-18 3.0 6.0 3.0 0.0 12 67 2025-01-31 7.0 2.0 3.0 0.0 12 - Case sensitivity in platform values
'iOS' vs 'ios' fails the equality test. Lowercase upstream or in the CASE comparator.
Top 3 highest-revenue products in each category.
[Amazon · Homepage Ranking] Merch wants a 3-product shelf per category for the homepage rotation. Use DENSE_RANK so ties at #2 still produce a #3 — relevant when revenue is bucketed.
product_id product_name category revenue
0 1 prod_1 Electronics 435.0
1 2 prod_2 Electronics 25.0
2 3 prod_3 Electronics 384.0
3 4 prod_4 Electronics 1054.0
4 5 prod_5 Electronics 485.0
WITH ranked AS (
SELECT product_id, product_name, category, revenue,
DENSE_RANK() OVER (PARTITION BY category
ORDER BY revenue DESC) AS rk
FROM products
)
SELECT category, product_id, product_name, revenue, rk
FROM ranked
WHERE rk <= 3
ORDER BY category, rk;
category product_id product_name revenue rk
0 Books 19 prod_19 2887.0 1
1 Books 11 prod_11 2420.0 2
2 Books 15 prod_15 2021.0 3
3 Clothing 28 prod_28 9770.0 1
4 Clothing 30 prod_30 7190.0 2
.. ... ... ... ... ..
10 Food 42 prod_42 2667.0 2
11 Food 43 prod_43 2252.0 3
12 Sports 40 prod_40 12777.0 1
13 Sports 36 prod_36 2589.0 2
14 Sports 38 prod_38 2003.0 3
Why DENSE_RANK: if two products tie for #2, you keep both and still get a #3.
Edge cases & gotchas (2)
- Ties at #2 mean #3 may not exist
With a tie at #2, DENSE_RANK gives 1,2,2,3 — the rk <= 3 filter keeps four rows, including a true #3. RANK gives 1,2,2,4 — only three rows pass, and there is no #3. Pick semantically.
- Categories with fewer than 3 products
A category with 2 products returns 2 rows, not 3. The carousel must handle short-shelf gracefully.
Fix · SQL-- Emit explicit shelf positions so the consumer can detect and pad short shelves SELECT category, position, product_name FROM ( SELECT category, ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS position, product_name FROM products ) t WHERE position <= 3;Outputcategory position product_name 0 Sports 1 prod_40 1 Sports 2 prod_36 2 Sports 3 prod_38 3 Books 1 prod_19 4 Books 2 prod_11 5 Books 3 prod_15 .. ... ... ... 9 Clothing 1 prod_28 10 Clothing 2 prod_30 11 Clothing 3 prod_29 12 Electronics 1 prod_4 13 Electronics 2 prod_1 14 Electronics 3 prod_5
Median order amount.
[Amazon · Risk Modeling] A finance partner says AVG(amount) is being skewed by whale orders. Compute the median two ways: idiomatic PERCENTILE_CONT, and a portable ROW_NUMBER fallback for warehouses without it.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_amount
FROM orders;
median_amount
0 51.52
Portable fallback: if your DB lacks PERCENTILE_CONT, use a ROW_NUMBER + COUNT trick to pick the middle row(s).
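One portable sketch of that trick — number the rows, then average the middle one (odd N) or two (even N):
WITH ranked AS (
  SELECT amount,
         ROW_NUMBER() OVER (ORDER BY amount) AS rn,
         COUNT(*) OVER () AS n
  FROM orders
)
SELECT AVG(amount) AS median_amount
FROM ranked
WHERE rn IN (CAST(FLOOR((n + 1) / 2.0) AS INT), CAST(FLOOR((n + 2) / 2.0) AS INT));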
Edge cases & gotchas (2)
- Median definition: continuous vs discrete
PERCENTILE_CONT(0.5) interpolates between the two middle rows; PERCENTILE_DISC(0.5) returns one of them. For currency, CONT is usually wrong (it produces fractions of a cent).
Fix · SQL-- DISC for whole-cent values SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY amount) FROM orders;Outputquantile_disc(0.5 ORDER BY amount) 0 51.52 - MySQL has no PERCENTILE_CONT
MySQL 8 lacks WITHIN GROUP. Use the ROW_NUMBER + COUNT trick from the answer block.
Each user's largest order, with date.
[Amazon · Lifecycle Email] A re-engagement campaign wants each customer's biggest-ever order amount + date to drop into the email subject line. Use a window — the correlated-subquery form doesn't scale at orders.size = 1B.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
WITH ranked AS (
SELECT order_id, user_id, order_date, amount,
ROW_NUMBER() OVER (PARTITION BY user_id
ORDER BY amount DESC, order_date DESC) AS rn
FROM orders
)
SELECT user_id, order_id, order_date, amount
FROM ranked
WHERE rn = 1
ORDER BY user_id
LIMIT 10;
user_id order_id order_date amount
0 1 156 2025-01-20 53.08
1 2 90 2025-01-12 324.70
2 3 224 2025-01-28 164.63
3 4 312 2025-02-08 67.36
4 5 104 2025-01-13 192.65
5 7 343 2025-02-12 175.31
6 8 427 2025-02-23 93.81
7 10 191 2025-01-24 243.48
8 11 52 2025-01-07 371.89
9 12 216 2025-01-27 35.07
Window beats correlated subquery: a single pass over orders instead of N subqueries. On large tables this is a 10–100× difference.
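For contrast, a sketch of the correlated-subquery form you are avoiding — the subquery is evaluated per outer row, and it keeps ties instead of picking a single row per user:
SELECT o.user_id, o.order_id, o.order_date, o.amount
FROM orders o
WHERE o.amount = (
  SELECT MAX(o2.amount)
  FROM orders o2
  WHERE o2.user_id = o.user_id
)
ORDER BY o.user_id;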
Edge cases & gotchas (2)
- Tied largest amount per user
If a user has two orders with the same max amount, ROW_NUMBER picks one arbitrarily. Tiebreak by order_date and order_id deterministically.
Fix · SQLROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY amount DESC, order_date DESC, order_id -- fully deterministic ) AS rn - User with no orders
Inner-style filtering drops them. If the email needs every user, left-join from users and COALESCE the subject line.
Cumulative distribution of order amounts.
[Apple · Pricing Tiers] A pricing experiment is drawing tier cutoffs at the 50th, 70th, 90th percentile of basket size. Compute CUME_DIST + NTILE(10) for every order; the team will pick cutoffs from the deciles.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT order_id, amount,
ROUND(CUME_DIST() OVER (ORDER BY amount)::DOUBLE, 4) AS cume_pct,
NTILE(10) OVER (ORDER BY amount) AS decile
FROM orders
ORDER BY amount
LIMIT 10;
order_id amount cume_pct decile
0 234 -215.35 0.002 1
1 389 -63.52 0.004 1
2 12 -12.02 0.006 1
3 185 1.37 0.008 1
4 251 1.49 0.010 1
5 432 3.16 0.012 1
6 72 3.41 0.014 1
7 92 5.06 0.016 1
8 50 5.78 0.018 1
9 79 5.80 0.020 1
CUME_DIST vs PERCENT_RANK: CUME_DIST counts ties as ‘at or below’ (≤); PERCENT_RANK uses (rank-1)/(N-1). Use CUME_DIST for ‘what % of orders are ≤ this’.
Edge cases & gotchas (2)
- CUME_DIST counts ≤, PERCENT_RANK counts <
Subtle but real: CUME_DIST(x) = fraction of rows ≤ x; PERCENT_RANK(x) = (rank(x)-1)/(N-1). At the median they differ.
- NTILE distributes ties arbitrarily
Two users with identical spend can land in different deciles. For high-stakes segmentation, use PERCENT_RANK ≤ 0.10 instead.
Fix · SQL-- Window functions cannot appear in WHERE; filter in an outer query (or QUALIFY where supported) SELECT user_id, total_spend, pr FROM ( SELECT user_id, SUM(amount) AS total_spend, PERCENT_RANK() OVER (ORDER BY SUM(amount) DESC) AS pr FROM orders GROUP BY 1 ) t WHERE pr <= 0.10;
Each order's deviation from that user's average.
[Amazon · Anomaly Detection] A risk model wants ‘orders that are unusually large for that user’. For every order, show the user's mean and the delta — the modeling team will threshold from there.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT order_id, user_id, order_date, amount,
ROUND(AVG(amount) OVER (PARTITION BY user_id), 2) AS user_avg,
ROUND(amount - AVG(amount) OVER (PARTITION BY user_id), 2) AS delta
FROM orders
ORDER BY user_id, order_date
LIMIT 10;
order_id user_id order_date amount user_avg delta
0 156 1 2025-01-20 53.08 36.52 16.56
1 351 1 2025-02-13 19.71 36.52 -16.81
2 417 1 2025-02-22 36.77 36.52 0.25
3 16 2 2025-01-02 59.81 141.57 -81.76
4 45 2 2025-01-06 40.21 141.57 -101.36
5 90 2 2025-01-12 324.70 141.57 183.13
6 131 3 2025-01-17 53.76 70.95 -17.19
7 139 3 2025-01-18 9.77 70.95 -61.18
8 223 3 2025-01-28 18.93 70.95 -52.02
9 224 3 2025-01-28 164.63 70.95 93.68
Window without ORDER BY means the average covers the whole partition (the user) — exactly what we want here.
Edge cases & gotchas (2)
- Single-order users have delta=0
AVG over a partition of 1 equals that one value, so delta is 0. Filter to users with >= some min order count for anomaly logic.
Fix · SQLWITH stats AS ( SELECT order_id, user_id, amount, AVG(amount) OVER (PARTITION BY user_id) AS u_avg, COUNT(*) OVER (PARTITION BY user_id) AS u_n FROM orders ) SELECT * FROM stats WHERE u_n >= 5 AND amount > u_avg * 2;Outputorder_id user_id amount u_avg u_n 0 208 99 347.94 107.280000 5 1 381 111 305.84 100.684000 5 2 298 17 280.68 106.044000 5 3 14 151 318.91 116.080000 5 4 384 161 97.23 44.912000 5 5 446 176 958.35 180.631250 8 .. ... ... ... ... ... 13 382 47 281.25 130.128000 5 14 419 156 236.33 98.251667 6 15 494 156 202.47 98.251667 6 16 128 62 128.41 44.758000 5 17 445 96 365.97 107.082000 5 18 238 155 133.38 65.010000 6 - Order amount distribution is log-normal
Linear delta over-flags whales' big-but-normal orders. Use log(amount) - log(u_avg) for ratio thinking.
Users who signed up but never reached 'purchase' in the events funnel.
[Meta · Lifecycle] Lifecycle wants the dropped-off cohort: users who reached signup but never had a purchase event. Use NOT EXISTS — and explain why it correctly handles users who fired multiple signup events.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
SELECT DISTINCT e.user_id
FROM events e
WHERE e.event_type = 'signup'
AND NOT EXISTS (
SELECT 1 FROM events e2
WHERE e2.user_id = e.user_id
AND e2.event_type = 'purchase'
)
ORDER BY e.user_id
LIMIT 10;
user_id
0 3
1 4
2 8
3 10
4 11
5 13
6 15
7 16
8 17
9 19
Cohort-correctness: if signup can fire multiple times per user (it should not, but data is data), this still works because DISTINCT + NOT EXISTS are set-based.
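The LEFT JOIN anti-join is the equivalent form interviewers often ask for next; avoid NOT IN here, because a single NULL user_id among the purchase rows empties the whole result:
SELECT DISTINCT s.user_id
FROM events s
LEFT JOIN events p
  ON p.user_id = s.user_id
 AND p.event_type = 'purchase'
WHERE s.event_type = 'signup'
  AND p.user_id IS NULL
ORDER BY s.user_id;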
Edge cases & gotchas (2)
- Multiple signup events per user (DQ)
A user firing two 'signup' events is dirty data. NOT EXISTS still works (set-based) but you should DQ-flag it separately.
Fix · SQLSELECT user_id, COUNT(*) AS signup_events FROM events WHERE event_type = 'signup' GROUP BY user_id HAVING COUNT(*) > 1;Outputuser_id signup_events 0 5 2 1 2 2 2 4 2 3 1 2 4 3 2 - ‘Never purchased’ window
Lifetime ‘never’ is a different cohort from ‘not within 30 days of signup’. State the window.
Tier 4 · CTEs, Sessionization, Cohorts
7-day moving average of daily revenue.
[Netflix · Content Performance] The exec dashboard wants the daily revenue trend with the daily noise smoothed out — a 7-day trailing mean. Discuss ROWS vs RANGE: for gapped data (see Q35) RANGE with an INTERVAL is the safer 7-day window.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
SELECT ds, daily_revenue,
ROUND(AVG(daily_revenue) OVER (
ORDER BY ds
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 2) AS rev_7d_avg
FROM daily_metrics
ORDER BY ds
LIMIT 10;
ds daily_revenue rev_7d_avg
0 2025-01-01 50008.19 50008.19
1 2025-01-02 79294.66 64651.43
2 2025-01-03 55245.52 61516.12
3 2025-01-04 51104.66 58913.26
4 2025-01-05 55714.88 58273.58
5 2025-01-06 80359.13 61954.51
6 2025-01-07 38076.88 58543.42
7 2025-01-08 36554.98 56621.53
8 2025-01-09 40941.61 51142.52
9 2025-01-10 50385.72 50448.27
ROWS vs RANGE: ROWS counts rows; RANGE matches values. For dense daily data they are equivalent — but if dates are gapped, RANGE with an interval is the safer ‘true 7-day window’.
Edge cases & gotchas (2)
- ROWS vs RANGE on gapped data
ROWS BETWEEN 6 PRECEDING counts rows, regardless of date gaps. If a day is missing, your ‘7-day’ window actually spans 8 calendar days. RANGE BETWEEN INTERVAL '6' DAY is date-aware.
Fix · SQLSELECT ds, daily_revenue, AVG(daily_revenue) OVER ( ORDER BY ds RANGE BETWEEN INTERVAL '6' DAY PRECEDING AND CURRENT ROW ) AS rev_7d_avg FROM daily_metrics;Outputds daily_revenue rev_7d_avg 0 2025-01-01 50008.19 50008.190000 1 2025-01-02 79294.66 64651.425000 2 2025-01-03 55245.52 61516.123333 3 2025-01-04 51104.66 58913.257500 4 2025-01-05 55714.88 58273.582000 5 2025-01-06 80359.13 61954.506667 .. ... ... ... 115 2025-04-25 71673.07 46890.241429 116 2025-04-26 82002.86 49454.362857 117 2025-04-27 51165.87 52132.301429 118 2025-04-28 63400.18 55192.535714 119 2025-04-29 52785.14 57301.432857 120 2025-04-30 55986.60 61164.007143 - First 6 days produce small windows
The first 6 days have fewer than 7 rows in the window — the average is over those few rows, which is misleading at the trend head. Either return NULL until 7 rows are available, or label with the actual sample size.
Running monthly revenue total, reset each month.
[Amazon · OP1 Dashboard] The OP1 finance dashboard wants ‘month-to-date revenue’ on every row. Reset on month-start. The reset is the partition; the order is the sort.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
SELECT ds, daily_revenue,
SUM(daily_revenue) OVER (
PARTITION BY DATE_TRUNC('month', ds)
ORDER BY ds
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS mtd_revenue
FROM daily_metrics
ORDER BY ds
LIMIT 10;
ds daily_revenue mtd_revenue
0 2025-01-01 50008.19 50008.19
1 2025-01-02 79294.66 129302.85
2 2025-01-03 55245.52 184548.37
3 2025-01-04 51104.66 235653.03
4 2025-01-05 55714.88 291367.91
5 2025-01-06 80359.13 371727.04
6 2025-01-07 38076.88 409803.92
7 2025-01-08 36554.98 446358.90
8 2025-01-09 40941.61 487300.51
9 2025-01-10 50385.72 537686.23
The reset is the partition. Group on the period you want to reset over; window-order within it.
Edge cases & gotchas (2)
- Late-arriving rows for prior months
If a backfill writes a row dated 2024-12-15 today, the MTD for December changes retroactively. Snapshot the curated table or your dashboard becomes a moving target.
- Time zone of month boundary
DATE_TRUNC('month', ds) is in the server's TZ. For a US-based team, an order at 2024-12-31T23:30 PT is December but the UTC row is January.
Fix · SQLSELECT DATE_TRUNC('month', ds AT TIME ZONE 'America/Los_Angeles') AS m_pt, SUM(daily_revenue) FROM daily_metrics GROUP BY 1;Outputm_pt sum(daily_revenue) 0 2025-03-01 00:00:00+00:00 1470069.99 1 2025-01-01 00:00:00+00:00 1588729.72 2 2025-04-01 00:00:00+00:00 1542605.98 3 2025-02-01 00:00:00+00:00 1327662.03
Weekly retention by signup cohort (W0–W3).
[Meta · Retention Triangle] Growth wants the classic Meta retention triangle: rows = signup week cohort, columns = W0/W1/W2/W3 since signup, cells = % logging in. This is the centerpiece chart of every weekly growth review.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
WITH cohort AS (
SELECT user_id, DATE_TRUNC('week', signup_date) AS cohort_week
FROM users
),
joined AS (
SELECT c.cohort_week, c.user_id,
CAST(FLOOR(EXTRACT(EPOCH FROM (l.login_date - c.cohort_week)) / (7*86400)) AS INT) AS weeks_since
FROM cohort c
LEFT JOIN logins l ON l.user_id = c.user_id
)
SELECT cohort_week,
COUNT(DISTINCT user_id) FILTER (WHERE weeks_since = 0) AS w0,
COUNT(DISTINCT user_id) FILTER (WHERE weeks_since = 1) AS w1,
COUNT(DISTINCT user_id) FILTER (WHERE weeks_since = 2) AS w2,
COUNT(DISTINCT user_id) FILTER (WHERE weeks_since = 3) AS w3
FROM joined
WHERE weeks_since BETWEEN 0 AND 3
GROUP BY cohort_week
ORDER BY cohort_week
LIMIT 10;
cohort_week w0 w1 w2 w3
0 2025-02-03 0 0 0 2
1 2025-02-10 0 0 3 3
2 2025-02-17 0 3 4 4
3 2025-02-24 3 4 4 4
4 2025-03-03 3 3 3 3
5 2025-03-10 3 3 3 0
6 2025-03-17 4 4 0 0
7 2025-03-24 3 0 0 0
Convert to %: divide each wN by the cohort size (a separate cohort-sizes CTE) — the full version appears in the 'Cohort retention as percentages' question later on. Most interviewers will follow up by asking for that.
Edge cases & gotchas (2)
- Cohort week alignment
DATE_TRUNC('week', ...) snaps to Monday in ISO mode but Sunday in MySQL. A user signing up 2024-01-07 (Sun) lands in different cohorts depending on engine. State the convention.
- User active before signup_date
If a user logs in before their signup_date (impossible? but stitching IDs creates this), weeks_since goes negative. Filter WHERE weeks_since >= 0 explicitly.
Funnel conversion: signup → activation → purchase.
[Meta · Onboarding Funnel] Product wants the signup → activation → purchase funnel with absolute counts and step-to-step rates. Use FILTER (WHERE ...) with monotonicity guards — a ‘purchase-before-signup’ row is dirty data, not a successful conversion.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
WITH first_event AS (
SELECT user_id, event_type, MIN(event_date) AS first_at
FROM events
WHERE event_type IN ('signup','activation','purchase')
GROUP BY user_id, event_type
),
pivoted AS (
SELECT user_id,
MAX(CASE WHEN event_type='signup' THEN first_at END) AS signup_at,
MAX(CASE WHEN event_type='activation' THEN first_at END) AS activation_at,
MAX(CASE WHEN event_type='purchase' THEN first_at END) AS purchase_at
FROM first_event GROUP BY user_id
)
SELECT
COUNT(*) FILTER (WHERE signup_at IS NOT NULL) AS step1_signup,
COUNT(*) FILTER (WHERE activation_at IS NOT NULL
AND activation_at >= signup_at) AS step2_activation,
COUNT(*) FILTER (WHERE purchase_at IS NOT NULL
AND purchase_at >= COALESCE(activation_at, signup_at)) AS step3_purchase
FROM pivoted;
step1_signup step2_activation step3_purchase
0 498 282 133
Order matters. A ‘purchase’ before ‘signup’ is dirty data — the >= guards keep the funnel monotonic.
Edge cases & gotchas (2)
- Activation BEFORE signup
Without a monotonicity guard, a user with activation_at < signup_at counts as activated. Add the >= guard or your funnel rate exceeds 100% in some weeks.
Fix · SQLCOUNT(*) FILTER (WHERE activation_at IS NOT NULL AND activation_at >= signup_at) AS step2 - Step skipping
What if a user purchases without an activation event? Two valid definitions: strict (must hit each step) vs lax (any later step counts). Confirm with PM before coding.
Date spine — fill missing days in daily_metrics_gapped with zero.
[Netflix · BI Reliability] The Looker dashboard broke because 15 days are missing from daily_metrics_gapped. Build a date spine and left-join — every chart needs every day, even zeros, with a was_missing flag for the data team.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-06 80359.13 9912
WITH spine AS (
SELECT CAST(ds AS DATE) AS ds
FROM range(
(SELECT MIN(ds) FROM daily_metrics_gapped),
(SELECT MAX(ds) FROM daily_metrics_gapped) + INTERVAL 1 DAY,
INTERVAL 1 DAY
) t(ds)
)
SELECT s.ds,
COALESCE(d.daily_revenue, 0) AS daily_revenue,
COALESCE(d.daily_dau, 0) AS daily_dau,
(d.ds IS NULL) AS was_missing
FROM spine s
LEFT JOIN daily_metrics_gapped d ON d.ds = s.ds
ORDER BY s.ds
LIMIT 10;
ds daily_revenue daily_dau was_missing
0 2025-01-01 50008.19 9486 False
1 2025-01-02 79294.66 10292 False
2 2025-01-03 55245.52 10806 False
3 2025-01-04 51104.66 9132 False
4 2025-01-05 0.00 0 True
5 2025-01-06 80359.13 9912 False
6 2025-01-07 38076.88 11239 False
7 2025-01-08 36554.98 8400 False
8 2025-01-09 40941.61 9922 False
9 2025-01-10 50385.72 12868 False
Spark: sequence(start, stop, step) + explode. BigQuery: GENERATE_DATE_ARRAY. Snowflake: GENERATOR. Same pattern, different syntax.
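A sketch of the Spark SQL flavor (interval-literal syntax can vary slightly by Spark version):
SELECT explode(sequence(DATE '2025-01-01', current_date(), INTERVAL 1 DAY)) AS ds;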
Edge cases & gotchas (2)
- Spine misses days outside [min, max]
The spine above runs from MIN(ds) to MAX(ds) of the existing data, so days missing at the edges never appear. If the report wants Jan 1 → today, anchor the spine to those literal dates instead.
Fix · SQLWITH spine AS ( SELECT CAST(d AS DATE) AS ds FROM range(DATE '2025-01-01', CURRENT_DATE + INTERVAL 1 DAY, INTERVAL 1 DAY) t(d) ) SELECT s.ds, COALESCE(g.daily_revenue, 0) AS rev FROM spine s LEFT JOIN daily_metrics_gapped g USING (ds);Outputds rev 0 2025-01-01 50008.19 1 2025-01-02 79294.66 2 2025-01-03 55245.52 3 2025-01-04 51104.66 4 2025-01-06 80359.13 5 2025-01-07 38076.88 .. ... ... 476 2026-02-10 0.00 477 2026-02-18 0.00 478 2026-02-20 0.00 479 2026-02-26 0.00 480 2026-04-10 0.00 481 2026-04-12 0.00 - Filling with 0 vs ffill
0 is appropriate for revenue/sessions (no activity); ffill is appropriate for stocks/balances (state persists). Pick by metric semantics.
Sessionize the clickstream — 30 min idle = new session.
[Meta · Reels Sessionization] Product wants Reels ‘sessions’ from the raw event stream. Industry standard: 30 minutes of inactivity ends a session. Use the cumulative-sum trick — this exact pattern shows up in Reels, Stories, and IG Live.
user_id event_time page
0 1 2025-03-01 08:45:00 home
1 1 2025-03-01 08:55:00 search
2 1 2025-03-01 09:00:00 cart
3 1 2025-03-01 09:01:00 checkout
4 1 2025-03-01 09:46:00 search
WITH gapped AS (
SELECT user_id, event_time, page,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
FROM clickstream
),
flagged AS (
SELECT user_id, event_time, page,
CASE WHEN prev_time IS NULL
OR event_time - prev_time > INTERVAL 30 MINUTE
THEN 1 ELSE 0 END AS is_new_session
FROM gapped
),
sessionized AS (
SELECT user_id, event_time, page,
SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM flagged
)
SELECT user_id, session_id,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
COUNT(*) AS pageviews
FROM sessionized
GROUP BY user_id, session_id
ORDER BY user_id, session_start
LIMIT 10;
user_id session_id session_start session_end pageviews
0 1 1.0 2025-03-01 08:45:00 2025-03-01 09:01:00 4
1 1 2.0 2025-03-01 09:46:00 2025-03-01 11:51:00 24
2 1 3.0 2025-03-01 12:36:00 2025-03-01 12:36:00 1
3 2 1.0 2025-03-01 08:01:00 2025-03-01 09:13:00 12
4 2 2.0 2025-03-01 10:43:00 2025-03-01 11:49:00 9
5 2 3.0 2025-03-01 12:34:00 2025-03-01 13:53:00 23
6 3 1.0 2025-03-01 08:02:00 2025-03-01 08:02:00 1
7 3 2.0 2025-03-01 08:47:00 2025-03-01 08:57:00 2
8 3 3.0 2025-03-01 10:27:00 2025-03-01 10:33:00 5
9 3 4.0 2025-03-01 11:18:00 2025-03-01 11:52:00 9
The cumulative-sum trick is the canonical sessionization pattern. Memorize it.
Edge cases & gotchas (2)
- Threshold choice
30 minutes is canonical for web; mobile push notifications can reasonably use 5 minutes; video heartbeats can use 90 seconds. Threshold belongs to the product, not the SQL.
- Out-of-order events
LAG(... ORDER BY event_time) re-sorts events by timestamp, so negative gaps cannot appear in this query. The real risks are events sharing an identical timestamp (their within-session order is ambiguous) and, in incremental pipelines, late events that belong to a session you already closed. DQ-flag the first; re-sessionize a trailing window to absorb the second.
Fix · SQL-- DQ: surface events whose within-session ordering is ambiguous (identical timestamps) SELECT user_id, event_time, COUNT(*) AS events_at_same_instant FROM clickstream GROUP BY user_id, event_time HAVING COUNT(*) > 1;
Longest consecutive login streak per user.
[Meta · Engagement Streaks] An engagement leaderboard wants each user's longest consecutive-day login streak. Classic gaps-and-islands: date − row_number is constant within consecutive runs.
user_id login_date
0 1 2025-03-02
1 1 2025-03-04
2 1 2025-03-05
3 1 2025-03-06
4 1 2025-03-07
WITH grp AS (
SELECT user_id, login_date,
CAST(login_date AS DATE) - CAST(ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY login_date
) AS INT) AS group_key
FROM (SELECT DISTINCT user_id, login_date FROM logins) d
),
streaks AS (
SELECT user_id, group_key,
COUNT(*) AS streak_len,
MIN(login_date) AS streak_start,
MAX(login_date) AS streak_end
FROM grp
GROUP BY user_id, group_key
)
SELECT user_id, MAX(streak_len) AS longest_streak
FROM streaks
GROUP BY user_id
ORDER BY longest_streak DESC
LIMIT 10;
user_id longest_streak
0 46 20
1 47 17
2 95 13
3 76 13
4 14 12
5 100 12
6 13 11
7 28 11
8 12 11
9 63 10
The trick: date − row_number is constant within consecutive runs. Group by it, you get the islands. Dedup logins first or streaks lie.
Edge cases & gotchas (2)
- Duplicate (user_id, login_date) pairs
If the source has two login rows for the same user on the same day, the row_number trick over-counts the streak length. DEDUP first.
Fix · SQL-- DEDUP before the gaps-and-islands trick WITH dedup AS (SELECT DISTINCT user_id, login_date FROM logins) SELECT user_id, login_date FROM dedup;Outputuser_id login_date 0 1 2025-03-06 1 1 2025-03-28 2 2 2025-03-05 3 2 2025-03-09 4 3 2025-03-07 5 4 2025-03-19 ... ... ... 1919 98 2025-03-17 1920 98 2025-03-26 1921 98 2025-03-29 1922 99 2025-03-08 1923 99 2025-03-10 1924 99 2025-03-14 - Time zone vs ‘day’
Two logins at 23:30 and 00:30 UTC look like two consecutive days in UTC but the same day for a Pacific user. Bucket login_date in user-local TZ if available.
Detect data skew — which join keys hold most of the rows?
[Amazon · Spark Skew Triage] A nightly Spark job is stalling on skewed_table. You suspect one join key has 70%+ of the rows. Confirm with a frequency query — then mitigate (salt the hot key, broadcast the small side, or pre-aggregate).
id join_key value
0 1 key_243 189
1 2 key_225 81
2 3 hot_key 108
3 4 key_91 321
4 5 hot_key 579
SELECT join_key,
COUNT(*) AS rows,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM skewed_table
GROUP BY join_key
ORDER BY rows DESC
LIMIT 5;
join_key rows pct_of_total
0 hot_key 700 70.0
1 key_228 1 0.1
2 key_24 1 0.1
3 key_251 1 0.1
4 key_22 1 0.1
What you will find: hot_key holds ~70% of rows. Mitigations: salt the hot key, broadcast the small side, or pre-aggregate before the join.
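A rough sketch of the fact-side salting step (the dimension side must then be exploded with the same ten salts before joining on salted_key — not shown):
SELECT f.*,
       CASE WHEN f.join_key = 'hot_key'
            THEN f.join_key || '_' || CAST(CAST(FLOOR(RANDOM() * 10) AS INT) AS VARCHAR)
            ELSE f.join_key
       END AS salted_key
FROM skewed_table f;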
Edge cases & gotchas (2)
- Skew via dimension, not just value
A heavy hot key may be one value (hot_key) but skew can also happen on (key, time) — peak-hour traffic on one merchant. Group by (join_key, hour) for time-aware skew.
- Skew thresholds depend on cluster size
‘Top 1% holds 50%’ is one heuristic; ‘any single key > partition target size’ is the operational one. Compute against your shuffle target.
Average DAU per ISO week from daily_metrics.
[Meta · WAU vs avg-DAU] The weekly biz review wants weekly metrics from the daily roll-up. Critical distinction: AVG(daily_dau) ≠ WAU (unique users in the week). Both are valid; compute and label both.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
SELECT CAST(DATE_TRUNC('week', ds) AS DATE) AS week_start,
ROUND(AVG(daily_dau), 0) AS avg_dau,
SUM(daily_revenue) AS week_revenue
FROM daily_metrics
GROUP BY 1
ORDER BY week_start
LIMIT 10;
week_start avg_dau week_revenue
0 2024-12-30 9920.0 291367.91
1 2025-01-06 10982.0 355167.89
2 2025-01-13 10203.0 384238.20
3 2025-01-20 9551.0 332577.08
4 2025-01-27 10553.0 298425.99
5 2025-02-03 10076.0 317637.86
6 2025-02-10 11500.0 335308.30
7 2025-02-17 10426.0 371344.18
8 2025-02-24 10083.0 329799.30
9 2025-03-03 9683.0 284459.57
Critical distinction: WAU (unique users in the week) ≠ AVG(DAU). Both are valid; both are different. State which you computed.
Edge cases & gotchas (2)
- WAU != AVG(DAU)
AVG(DAU) on a 7-day window counts a user multiple times if they appear on multiple days; WAU is unique users in the week. Both are valid; both are different. Label your axis.
Fix · SQL-- Real WAU SELECT DATE_TRUNC('week', login_date) AS week_start, COUNT(DISTINCT user_id) AS wau FROM logins GROUP BY 1;Outputweek_start wau 0 NaT 3 1 2025-03-10 100 2 2025-03-03 100 3 2025-03-24 100 4 2025-02-24 83 5 2025-03-17 100 - Partial weeks at the boundaries
The first and last week may not have 7 full days. Drop or label them, don't average them silently.
Day-of-week revenue heat: rank weekdays by avg revenue.
[Amazon · Ad Spend Calendar] Marketing wants to rank weekdays by avg revenue before locking the ad-spend schedule — is Saturday actually the winner, or just over-counted? Show days_observed per dow so they cannot ignore unevenness.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
SELECT strftime(ds, '%a') AS dow_label,
CAST(EXTRACT(ISODOW FROM ds) AS INT) AS dow_num,
ROUND(AVG(daily_revenue), 2) AS avg_revenue,
COUNT(*) AS days_observed
FROM daily_metrics
GROUP BY dow_label, dow_num
ORDER BY avg_revenue DESC;
dow_label dow_num avg_revenue days_observed
0 Mon 1 52851.23 17
1 Fri 5 52428.22 17
2 Tue 2 51339.15 17
3 Thu 4 51152.23 17
4 Sun 7 48210.35 17
5 Wed 3 46502.79 18
6 Sat 6 43982.03 18
Tip: include days_observed — if a quarter has 14 Mondays and 13 Sundays, the average is on uneven sample sizes.
Edge cases & gotchas (2)
- Uneven sample sizes by DOW
A 90-day window has 13 Mondays and 12 Sundays — averaging without disclosing days_observed makes the rank misleading.
- Holiday & promo contamination
Black Friday is a Friday; Christmas falls mid-week. Day-of-week ranking averages everything together. Slice by promo period for honest weekday comparison.
Tier 5 · Senior Patterns
SCD2 as-of join: price each purchase using the price effective on its date.
[Apple · Historical Pricing] Finance is restating Q1 revenue and needs every purchase priced at the price effective on its date — not today's price. Join via the SCD2 effective/expiry window. The invariant: exactly one row matches per (product, point-in-time).
product_id product_name price effective_date expiry_date is_current
0 1 product_1 60.01 2024-01-01 2024-04-16 N
1 1 product_1 66.01 2024-04-16 NaT Y
2 2 product_2 171.42 2024-01-01 2024-04-22 N
3 2 product_2 188.56 2024-04-22 2024-07-28 N
4 3 product_3 156.15 2024-01-01 2024-04-28 N
SELECT p.purchase_id, p.user_id, p.purchase_date,
p.product_category, p.amount,
pd.product_id, pd.price AS price_on_purchase_date
FROM purchases p
LEFT JOIN product_dim pd
  -- Demo-only join key: this purchases table has no product_id column, so the expression below
  -- always resolves to product 1; with a real schema the condition is simply pd.product_id = p.product_id.
  ON pd.product_id = ((CAST(REPLACE(p.product_category, ' ', '') AS VARCHAR) IS NOT NULL)::INT)
AND p.purchase_date >= pd.effective_date
AND (pd.expiry_date IS NULL OR p.purchase_date < pd.expiry_date)
ORDER BY p.purchase_date
LIMIT 5;
purchase_id user_id purchase_date product_category amount product_id price_on_purchase_date
0 1 149 2024-11-01 Sports 321.30 1 66.01
1 2 82 2024-11-01 Clothing 36.66 1 66.01
2 3 19 2024-11-01 Clothing 70.12 1 66.01
3 4 98 2024-11-01 Food 122.21 1 66.01
4 5 59 2024-11-01 Electronics 5.33 1 66.01
The SCD2 invariant: exactly one row matches per (product, point-in-time). If you ever see duplicates, your effective/expiry windows overlap — that is a data-quality bug, not a query bug.
Edge cases & gotchas (3)
- Overlapping effective windows
If two product_dim rows have overlapping [effective, expiry) for the same product, the as-of join returns multiple rows per purchase — the SCD2 invariant is broken. DQ-assert on every load.
Fix · SQL-- DQ: detect overlapping windows WITH g AS ( SELECT product_id, effective_date, expiry_date, LEAD(effective_date) OVER ( PARTITION BY product_id ORDER BY effective_date ) AS next_eff FROM product_dim ) SELECT * FROM g WHERE expiry_date IS NOT NULL AND next_eff < expiry_date;Outputproduct_id effective_date expiry_date next_eff 0 13 2024-01-01 2024-03-11 2024-01-16 1 13 2024-01-16 2024-05-14 2024-03-11 - Inclusive vs exclusive expiry
[effective_date, expiry_date) (half-open) is industry standard and avoids double-coverage at the boundary. expiry_date IS NULL = still current. Always document the convention.
- Pre-history gap
A purchase dated before any product_dim row leaves the as-of join NULL. Either bootstrap a synthetic 'genesis' row (effective_date = -infinity) or DQ-flag.
Recursive CTE: full org chart with reporting depth.
[Amazon · HRIS] Workday is generating an org-chart export for visa filings: every employee with their depth from the CEO and their full reporting chain as a string. Recursive CTE, guard against cycles in production.
employee_id full_name manager_id department salary
0 100 Dana Chen NaN Engineering 250000.0
1 201 Eve Park 100.0 Engineering 180000.0
2 202 Frank Li 100.0 Engineering 175000.0
3 203 Grace Kim 100.0 Engineering NaN
4 301 IC_301 201.0 Engineering 116962.0
WITH RECURSIVE org AS (
SELECT employee_id, full_name, manager_id, department,
0 AS depth,
CAST(full_name AS VARCHAR) AS chain
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.full_name, e.manager_id, e.department,
o.depth + 1,
o.chain || ' > ' || e.full_name
FROM employees e
JOIN org o ON e.manager_id = o.employee_id
)
SELECT employee_id, full_name, department, depth, chain
FROM org
ORDER BY depth, full_name
LIMIT 10;
employee_id full_name department depth chain
0 100 Dana Chen Engineering 0 Dana Chen
1 201 Eve Park Engineering 1 Dana Chen > Eve Park
2 202 Frank Li Engineering 1 Dana Chen > Frank Li
3 203 Grace Kim Engineering 1 Dana Chen > Grace Kim
4 301 IC_301 Engineering 2 Dana Chen > Eve Park > IC_301
5 302 IC_302 Engineering 2 Dana Chen > Eve Park > IC_302
6 303 IC_303 Engineering 2 Dana Chen > Eve Park > IC_303
7 304 IC_304 Engineering 2 Dana Chen > Eve Park > IC_304
8 305 IC_305 Engineering 2 Dana Chen > Eve Park > IC_305
9 306 IC_306 Engineering 2 Dana Chen > Frank Li > IC_306
Termination: a recursive CTE stops when the recursive step yields no new rows. Cycles in manager_id would loop forever — guard with a depth < N cap in production.
Edge cases & gotchas (2)
- Cycles in manager_id
If A reports to B and B reports to A (data error), the recursive CTE loops forever. Add a depth cap and a visited-set guard.
Fix · SQLWITH RECURSIVE org AS ( SELECT employee_id, manager_id, full_name, 0 AS depth, '/' || full_name AS path FROM employees WHERE manager_id IS NULL UNION ALL SELECT e.employee_id, e.manager_id, e.full_name, o.depth + 1, o.path || '/' || e.full_name FROM employees e JOIN org o ON e.manager_id = o.employee_id WHERE o.depth < 20 -- cycle guard AND POSITION('/' || e.full_name || '/' IN o.path) = 0 -- visited check ) SELECT * FROM org;Outputemployee_id manager_id full_name depth path 0 100 NaN Dana Chen 0 /Dana Chen 1 201 100.0 Eve Park 1 /Dana Chen/Eve Park 2 202 100.0 Frank Li 1 /Dana Chen/Frank Li 3 203 100.0 Grace Kim 1 /Dana Chen/Grace Kim 4 301 201.0 IC_301 2 /Dana Chen/Eve Park/IC_301 5 302 201.0 IC_302 2 /Dana Chen/Eve Park/IC_302 .. ... ... ... ... ... 22 319 302.0 IC_319 3 /Dana Chen/Eve Park/IC_302/IC_319 23 320 302.0 IC_320 3 /Dana Chen/Eve Park/IC_302/IC_320 24 321 302.0 IC_321 3 /Dana Chen/Eve Park/IC_302/IC_321 25 322 306.0 IC_322 3 /Dana Chen/Frank Li/IC_306/IC_322 26 323 306.0 IC_323 3 /Dana Chen/Frank Li/IC_306/IC_323 27 324 306.0 IC_324 3 /Dana Chen/Frank Li/IC_306/IC_324 - Multiple roots
If two employees have manager_id IS NULL (CEO + acting CEO), the recursive CTE starts from both and you get two trees. State the intent.
Recursive CTE: total org headcount under each manager.
[Amazon · Reorg Planning] A reorg-planning tool needs ‘people in your org’ — yourself plus everyone reporting up to you, transitively. Recursive CTE descending, then COUNT(*) - 1 to exclude self.
employee_id full_name manager_id department salary
0 100 Dana Chen NaN Engineering 250000.0
1 201 Eve Park 100.0 Engineering 180000.0
2 202 Frank Li 100.0 Engineering 175000.0
3 203 Grace Kim 100.0 Engineering NaN
4 301 IC_301 201.0 Engineering 116962.0
WITH RECURSIVE descendants AS (
SELECT employee_id AS root_id, employee_id
FROM employees
UNION ALL
SELECT d.root_id, e.employee_id
FROM descendants d
JOIN employees e ON e.manager_id = d.employee_id
)
SELECT m.employee_id, m.full_name AS manager,
COUNT(*) - 1 AS reports_below
FROM descendants d
JOIN employees m ON m.employee_id = d.root_id
GROUP BY m.employee_id, m.full_name
ORDER BY reports_below DESC
LIMIT 10;
employee_id manager reports_below
0 100 Dana Chen 27
1 201 Eve Park 11
2 202 Frank Li 8
3 203 Grace Kim 5
4 306 IC_306 3
5 302 IC_302 3
6 301 IC_301 3
7 312 IC_312 0
8 309 IC_309 0
9 317 IC_317 0
Pattern: seed with self, recurse downward, count − 1 to exclude the root from their own org count.
Edge cases & gotchas (2)
- Off-by-one (counting self)
The seed step puts (root_id, root_id) into descendants — the root counts itself. Subtract 1 in the final aggregate or filter self out in the seed.
- Database default recursion limit
Postgres default cycle protection is off; SQL Server has MAXRECURSION 100 default. For deep orgs (>100 levels — rare but real), increase the cap explicitly.
Cohort retention as percentages.
[Meta · Quick Experiments] A growth experiment wants the cohort retention triangle as percentages — the shape that ships in the experiment readout. Watch the denominator: normalizing by W0 (always 100%) vs cohort size give different reads.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
WITH cohort AS (
SELECT user_id, DATE_TRUNC('week', signup_date) AS cohort_week
FROM users
),
cohort_sizes AS (
SELECT cohort_week, COUNT(*) AS cohort_n FROM cohort GROUP BY cohort_week
),
activity AS (
SELECT c.cohort_week, c.user_id,
CAST(FLOOR(EXTRACT(EPOCH FROM (l.login_date - c.cohort_week)) / (7*86400)) AS INT) AS w
FROM cohort c
JOIN logins l ON l.user_id = c.user_id
WHERE l.login_date >= c.cohort_week
)
SELECT a.cohort_week, cs.cohort_n,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN w=0 THEN a.user_id END)/cs.cohort_n, 1) AS w0,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN w=1 THEN a.user_id END)/cs.cohort_n, 1) AS w1,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN w=2 THEN a.user_id END)/cs.cohort_n, 1) AS w2,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN w=3 THEN a.user_id END)/cs.cohort_n, 1) AS w3
FROM activity a
JOIN cohort_sizes cs USING (cohort_week)
GROUP BY a.cohort_week, cs.cohort_n
ORDER BY a.cohort_week
LIMIT 10;
cohort_week cohort_n w0 w1 w2 w3
0 2024-09-30 3 0.0 0.0 0.0 0.0
1 2024-10-07 3 0.0 0.0 0.0 0.0
2 2024-10-14 4 0.0 0.0 0.0 0.0
3 2024-10-21 3 0.0 0.0 0.0 0.0
4 2024-10-28 4 0.0 0.0 0.0 0.0
5 2024-11-04 3 0.0 0.0 0.0 0.0
6 2024-11-11 4 0.0 0.0 0.0 0.0
7 2024-11-18 3 0.0 0.0 0.0 0.0
8 2024-11-25 4 0.0 0.0 0.0 0.0
9 2024-12-02 3 0.0 0.0 0.0 0.0
Watch the denominator. Some teams normalize by W0 (always 100%) instead of cohort_n — both are valid; pick one and label it.
Edge cases & gotchas (2)
- Cohorts with 0 users
An empty cohort_week still appears in the join; the percentage is 0/0. Filter cohort_n > 0 or NULLIF the divisor.
- Survivorship bias
Cohorts near the data edge have not had time to be observed at W3. Either mark them as incomplete (W3 = NULL) or compute only the cohorts with full observation.
Fix · SQL-- Only fully-observed cohorts WHERE cohort_week + INTERVAL '3' WEEK <= (SELECT MAX(login_date) FROM logins)
New vs returning users by day.
[Netflix · Daily Active Mix] A daily report wants two lines: new users today (first-ever login) vs returning users today (active before). ‘New’ vs ‘new this month’ is a real definition fight — clarify before coding.
user_id login_date
0 1 2025-03-02
1 1 2025-03-04
2 1 2025-03-05
3 1 2025-03-06
4 1 2025-03-07
WITH first_seen AS (
SELECT user_id, MIN(login_date) AS first_login
FROM logins
GROUP BY user_id
)
SELECT l.login_date,
COUNT(DISTINCT CASE WHEN l.login_date = f.first_login THEN l.user_id END) AS new_users,
COUNT(DISTINCT CASE WHEN l.login_date > f.first_login THEN l.user_id END) AS returning_users
FROM logins l
JOIN first_seen f USING (user_id)
GROUP BY l.login_date
ORDER BY l.login_date
LIMIT 10;
login_date new_users returning_users
0 2025-03-01 63 0
1 2025-03-02 20 35
2 2025-03-03 13 56
3 2025-03-04 2 58
4 2025-03-05 1 60
5 2025-03-06 1 61
6 2025-03-07 0 59
7 2025-03-08 0 66
8 2025-03-09 0 63
9 2025-03-10 0 67
Definition discipline: ‘new’ = first-ever-anywhere; ‘returning’ = active today and active before. Some teams want first-ever-this-month — clarify before coding.
Edge cases & gotchas (2)
- ‘New’ redefined per period
‘New today’ could mean ‘first ever’ or ‘first this month’ or ‘first since subscription’. Three valid metrics. Pick one and label.
- Identity stitching
If users log in from multiple devices and you stitch IDs, MIN(login_date) per stitched user differs from MIN per device ID. The stitching pipeline ordering matters.
Find orphaned orders — orders whose user_id doesn't exist in users.
[Amazon · Data Quality] A nightly DQ job needs to surface orphan orders — orders whose user_id doesn't exist in users. Wrap as a view; alert when row count > 0; ticket the upstream owner.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
SELECT o.order_id, o.user_id, o.order_date, o.amount
FROM orders o
LEFT JOIN users u ON u.user_id = o.user_id
WHERE u.user_id IS NULL
LIMIT 10;
order_id user_id order_date amount
0 23 99000 2025-01-03 42.33
1 89 99001 2025-01-12 87.23
2 168 99002 2025-01-21 232.05
3 291 99003 2025-02-06 177.40
4 402 99004 2025-02-20 66.10
Production wrap: turn this into a dq_orphan_orders view; alert when row count > 0; ticket the upstream owner. The query is the easy part.
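A sketch of that wrap — the view name is the one suggested above, and the scheduled job only needs the count:
CREATE OR REPLACE VIEW dq_orphan_orders AS
SELECT o.order_id, o.user_id, o.order_date, o.amount
FROM orders o
LEFT JOIN users u ON u.user_id = o.user_id
WHERE u.user_id IS NULL;
-- the alerting job
SELECT COUNT(*) AS orphan_count FROM dq_orphan_orders;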
Edge cases & gotchas (2)
- Soft-deleted users count as orphans
If users has a deleted_at column and you filter deleted_at IS NULL, every order from a churned account becomes an orphan. Distinguish deleted vs never-existed.
Fix · SQL-- assumes users carries a deleted_at column (the demo dataset does not, so this is the pattern, not a runnable query here) SELECT o.order_id, o.user_id, CASE WHEN u.user_id IS NULL THEN 'never_existed' WHEN u.deleted_at IS NOT NULL THEN 'soft_deleted' END AS orphan_kind FROM orders o LEFT JOIN users u ON u.user_id = o.user_id WHERE u.user_id IS NULL OR u.deleted_at IS NOT NULL; - Test/staff orders
Many orgs have test users with user_id in a special range. Filter them OUT of orphan checks AND in business metrics.
Reconcile daily_metrics.daily_revenue against sum of purchases.amount per day.
[Amazon · Metric Reconciliation] A finance audit: daily_metrics.daily_revenue vs SUM(purchases.amount) per day. Tolerance > $0.01 because of currency rounding. A metric is only as trusted as its reconciliation report.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
WITH src AS (
SELECT purchase_date AS ds,
SUM(amount) AS src_revenue
FROM purchases
GROUP BY purchase_date
),
diff AS (
SELECT d.ds,
d.daily_revenue AS metric_revenue,
COALESCE(s.src_revenue, 0) AS source_revenue,
d.daily_revenue - COALESCE(s.src_revenue, 0) AS delta
FROM daily_metrics d
LEFT JOIN src s ON s.ds = d.ds
)
SELECT *
FROM diff
WHERE ABS(delta) > 0.01
ORDER BY ABS(delta) DESC
LIMIT 10;
ds metric_revenue source_revenue delta
0 2025-02-18 92348.66 0.00 92348.66
1 2025-04-26 82002.86 0.00 82002.86
2 2025-03-16 80726.04 0.00 80726.04
3 2025-01-06 80359.13 0.00 80359.13
4 2025-01-02 79294.66 119.12 79175.54
5 2025-01-20 73200.32 0.00 73200.32
6 2025-01-19 72156.75 0.00 72156.75
7 2025-04-25 71673.07 0.00 71673.07
8 2025-02-14 71666.89 0.00 71666.89
9 2025-03-17 71011.11 0.00 71011.11
Why a tolerance: floating point + currency rounding mean exact equality is the wrong check. Pick a meaningful epsilon (e.g., one cent).
Edge cases & gotchas (2)
- Floating-point currency drift
Storing amount as DOUBLE makes SUM non-associative; reconciling with =, even rounded, drifts. Use DECIMAL or convert to integer cents before summing — see the sketch after this gotcha list.
- Refund vs gross definition
If purchases excludes refunds but daily_revenue is net, your reconciliation will always show a gap equal to refund volume. Same definition on both sides or document the gap.
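A minimal sketch of the integer-cents aggregation from the first gotcha above (assumes amounts fit comfortably in BIGINT cents):
SELECT purchase_date AS ds,
       SUM(CAST(ROUND(amount * 100) AS BIGINT)) / 100.0 AS src_revenue
FROM purchases
GROUP BY purchase_date;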
Per-user outlier orders — more than 3σ above that user's mean.
[Apple · Trust Engineering] Risk wants per-user outliers: orders > 3σ above the user's own mean, with at least 5 orders of history. Caveat: order amounts are log-normal, not normal — for production use IQR or MAD.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
WITH stats AS (
SELECT order_id, user_id, order_date, amount,
AVG(amount) OVER (PARTITION BY user_id) AS u_avg,
STDDEV(amount) OVER (PARTITION BY user_id) AS u_std,
COUNT(*) OVER (PARTITION BY user_id) AS u_n
FROM orders
)
SELECT order_id, user_id, order_date, amount,
ROUND(u_avg, 2) AS u_avg, ROUND(u_std, 2) AS u_std, u_n
FROM stats
WHERE u_n >= 5
AND amount > u_avg + 3 * u_std
ORDER BY amount - u_avg DESC
LIMIT 10;
Empty DataFrame
Columns: [order_id, user_id, order_date, amount, u_avg, u_std, u_n]
Index: []
Statistical caveat: 3σ assumes roughly normal — order amounts are usually log-normal. For production, log-transform first or use IQR / MAD.
Edge cases & gotchas (2)
- n=1 partition: STDDEV returns NULL
Postgres STDDEV with one row returns NULL (sample variance); amount > u_avg + 3 * NULL is NULL, so no rows flagged. Filter u_n >= 5 (or use STDDEV_POP).
- Log-normal vs normal
Order amounts are heavy-tailed. 3σ on raw amount mis-flags. Better: log-transform first, or use IQR-based fences (Q3 + 1.5×IQR on the high side, as in the fix below).
Fix · SQL-- IQR-based outlier SELECT order_id, amount FROM orders WHERE amount > (SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) FROM orders) + 1.5 * ( SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) - PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) FROM orders);Outputorder_id amount 0 14 318.91 1 15 890.88 2 19 245.36 3 38 450.92 4 49 308.07 5 52 371.89 .. ... ... 39 424 404.41 40 434 425.90 41 444 251.34 42 445 365.97 43 446 958.35 44 448 919.86
Power users — top 10% by orders AND top 10% by sessions.
[Netflix · VIP Holdback] CX wants the intersection of high-spend and high-engagement users (top 10% on both) for a VIP experiment. NTILE(10) caveat: ties land arbitrarily — for high-stakes segmentation, prefer PERCENT_RANK ≤ 0.10.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
WITH order_rank AS (
SELECT user_id,
NTILE(10) OVER (ORDER BY SUM(amount) DESC) AS o_decile
FROM orders GROUP BY user_id
),
sess_rank AS (
SELECT user_id,
NTILE(10) OVER (ORDER BY COUNT(*) DESC) AS s_decile
FROM sessions GROUP BY user_id
)
SELECT o.user_id, o.o_decile, s.s_decile
FROM order_rank o
JOIN sess_rank s USING (user_id)
WHERE o.o_decile = 1 AND s.s_decile = 1
ORDER BY o.user_id
LIMIT 10;
user_id o_decile s_decile
0 69 1 1
NTILE caveat: NTILE distributes ties arbitrarily, so two users with identical spend may land in different deciles. For high-stakes segmentation, prefer PERCENT_RANK ≤ 0.10.
Edge cases & gotchas (2)
- NTILE ties
NTILE distributes ties arbitrarily, so two users with the same spend can land in different deciles. PERCENT_RANK is tie-respecting.
- Joining on user_id only
If a user has orders but no sessions (or vice versa), they drop out of the inner join. State whether the segment requires both signals or one.
Daily ARPU + sanity check vs. raw orders.
[Meta · ARPU Trust] Daily ARPU is shipping to the exec dashboard. Compute the curated ARPU and a self-computed reconciliation from raw orders side-by-side, with a delta column. Senior interview signal: you ship the audit alongside the metric.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
WITH curated AS (
SELECT ds, daily_revenue, daily_dau,
ROUND(daily_revenue / NULLIF(daily_dau, 0), 4) AS arpu_curated
FROM daily_metrics
),
raw_arpu AS (
SELECT order_date AS ds,
SUM(amount) AS raw_revenue,
COUNT(DISTINCT user_id) AS raw_active_users,
ROUND(SUM(amount) / NULLIF(COUNT(DISTINCT user_id), 0), 4) AS arpu_raw
FROM orders
GROUP BY order_date
)
SELECT c.ds, c.arpu_curated, r.arpu_raw,
ROUND(c.arpu_curated - r.arpu_raw, 4) AS arpu_delta
FROM curated c
LEFT JOIN raw_arpu r USING (ds)
ORDER BY c.ds
LIMIT 10;
ds arpu_curated arpu_raw arpu_delta
0 2025-01-01 5.2718 38.8763 -33.6045
1 2025-01-02 7.7045 195.2125 -187.5080
2 2025-01-03 5.1125 113.1386 -108.0261
3 2025-01-04 5.5962 58.0700 -52.4738
4 2025-01-05 5.6374 80.5513 -74.9139
5 2025-01-06 8.1073 77.1450 -69.0377
6 2025-01-07 3.3879 117.3300 -113.9421
7 2025-01-08 4.3518 76.9813 -72.6295
8 2025-01-09 4.1263 81.8425 -77.7162
9 2025-01-10 3.9156 86.3887 -82.4731
Why ship the delta: a metric is only as trusted as the reconciliation report behind it. Senior interviewers love this answer because it shows you think about trust, not just calculation.
Edge cases & gotchas (2)
- Active-user definition mismatch
daily_dau may count any login event; orders.user_id COUNT(DISTINCT) counts purchasing users only. Different denominators ⇒ different ARPUs. Reconcile definitions first.
- Tolerance for the delta
Float drift makes exact 0 unrealistic. Define a tolerance (e.g., abs(delta)/curated < 0.001) and alert above it.
Fix · SQL-- reusing the curated and raw_arpu CTEs from the main answer, filter on relative drift: SELECT c.ds, c.arpu_curated, r.arpu_raw, ABS(c.arpu_curated - r.arpu_raw) / NULLIF(c.arpu_curated, 0) AS pct_drift FROM curated c LEFT JOIN raw_arpu r USING (ds) WHERE ABS(c.arpu_curated - r.arpu_raw) / NULLIF(c.arpu_curated, 0) > 0.001;
Part 4 · 50 Python Questions, Simple to Complex
Most Python rounds for data engineers are pandas-first, with a few vanilla-Python curveballs (collections, generators, an LRU cache) and — for senior loops — a couple of PySpark or distributed-thinking problems. PySpark answers below show the canonical Spark code; the Output block shows the equivalent result computed via pandas (since this article was built without a live Spark cluster).
Tier 1 · Foundation
Load users.csv with proper date parsing; report shape + dtypes.
[Meta · Notebook Onboarding] Day-1 of a new analysis in a Bento notebook: load the table, confirm row count, confirm date columns are actually parsed as dates and not strings.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
import pandas as pd
users = pd.read_csv(DATA + "users.csv", parse_dates=["signup_date"])
print("shape:", users.shape)
print(users.dtypes)
print(users.head(3))
shape: (200, 5)
user_id int64
name str
email str
signup_date datetime64[us]
country str
dtype: object
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
Why parse_dates: without it, every signup_date.dt.month downstream silently fails. Always parse dates at read time.
Edge cases & gotchas (2)
- Date columns silently load as object
Without parse_dates, 'signup_date' is dtype=object (string). Every .dt call downstream raises AttributeError. Always parse at read time.
Fix · Python# WRONG: signup_date is now a string users = pd.read_csv("users.csv") users["signup_date"].dt.month # AttributeError # RIGHT users = pd.read_csv("users.csv", parse_dates=["signup_date"]) - Mixed-format dates
If the CSV has both '2024-10-02' and '10/2/2024', auto-parse may fall back to dtype=object silently. Pass explicit format.
Fix · Pythonusers = pd.read_csv("users.csv", parse_dates=["signup_date"], date_format="%Y-%m-%d")
Filter to users from India.
[Meta · Localization Sprint] A regional analyst wants only the IN cohort for a Hindi-UI localization study. .copy() defensively to avoid SettingWithCopyWarning downstream.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
in_users = users[users["country"] == "IN"].copy()
print(len(in_users), "users in IN")
print(in_users.head(3))
33 users in IN
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
9 10 user_10 user10@example.com 2024-10-19 IN
19 20 user_20 user20@example.com 2024-11-09 IN
Why .copy(): if you mutate in_users later, pandas may emit SettingWithCopyWarning. The copy is cheap and explicit.
Edge cases & gotchas (2)
- Forgotten .copy() ⇒ SettingWithCopyWarning
Filtering returns a view sometimes and a copy other times. Without explicit .copy(), the next assignment may mutate the original — or warn and refuse.
Fix · Python
# DANGEROUS
in_users = users[users["country"] == "IN"]
in_users["new_col"] = 1  # SettingWithCopyWarning
# SAFE
in_users = users.loc[users["country"] == "IN"].copy()
- Unicode case sensitivity
users[users.country == 'in'] returns 0 rows because matching is case-sensitive. Lowercase consistently or use str.casefold.
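A minimal sketch of the case-insensitive version (str.casefold handles Unicode beyond plain lower()):
in_users = users[users["country"].str.casefold() == "in"].copy()
print(len(in_users), "users in IN (case-insensitive)")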
Sum purchase amount by category, sorted descending.
[Amazon · Category Roll-Up] Same as the SQL category roll-up but in pandas — useful when the warehouse is gated by a long approval queue and you have CSVs in S3 instead. Use named-aggregation, not the legacy multi-index form.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
by_cat = (purchases.groupby("product_category", as_index=False)
.agg(revenue=("amount", "sum"),
n=("amount", "size"))
.sort_values("revenue", ascending=False))
print(by_cat)
product_category revenue n
1 Clothing 105688.15 64
2 Electronics 6451.80 73
0 Books 5653.72 56
3 Food 5179.73 59
4 Sports 4629.85 49
Named aggregation (agg(name=(col, fn))) is the modern, readable form. Skip the legacy .agg({'amount':['sum','count']}) with multi-index columns.
Edge cases & gotchas (2)
- as_index=True returns a Series, not a DataFrame
Default groupby sets the group as the index. For downstream merges and pivots you usually want as_index=False.
Fix · Python
by_cat = (purchases.groupby("product_category", as_index=False)
            .agg(revenue=("amount", "sum"),
                 n=("amount", "size")))
- NaN groups
groupby drops NaN groups by default. If product_category can be NULL, pass dropna=False to surface them.
Fix · Python
purchases.groupby("product_category", dropna=False, as_index=False).size()
Top 10 orders by amount.
[Amazon · Trust & Safety] Quick ‘biggest orders’ peek for a fraud review. nlargest beats sort_values().head() on big data — partial sort, O(n log k) instead of O(n log n).
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
top10 = orders.nlargest(10, "amount")
print(top10)
order_id user_id order_date amount
445 446 176 2025-02-25 958.35
447 448 168 2025-02-25 919.86
14 15 70 2025-01-02 890.88
197 198 112 2025-01-25 830.40
309 310 36 2025-02-08 820.84
356 357 184 2025-02-14 740.15
339 340 140 2025-02-12 734.23
84 85 131 2025-01-11 604.76
330 331 149 2025-02-11 554.26
70 71 154 2025-01-09 456.94
nlargest vs sort_values().head(): on large data, nlargest is a partial sort — O(n log k) instead of O(n log n).
Edge cases & gotchas (2)
- nlargest with NaN
nlargest always excludes NaN; there is no flag to include it. If NaN amounts matter for an outlier review, pull them separately with orders[orders['amount'].isna()] and inspect them alongside the top-N.
- Ties — keep policy
nlargest(10, 'amount', keep='first'|'last'|'all'). 'all' returns >10 rows when ties straddle the cutoff. Pick deliberately.
Fix · Python
orders.nlargest(10, "amount", keep="all")
Count nulls per column in user_profiles.
[Meta · DQ First Pass] First-pass data-quality check on a freshly delivered file. Notice the row count vs unique user count: 334 rows for 200 users means dedup work ahead.
user_id name city updated_at
0 1 user_1 Denver 2025-01-05
1 1 user_1 NYC 2025-02-19
2 2 user_2 SF 2025-01-01
3 3 user_3 Austin 2025-01-11
4 3 user_3 NaN 2025-02-01
print(up.isna().sum())
print("rows:", len(up), " unique users:", up["user_id"].nunique())
user_id 0
name 0
city 5
updated_at 2
dtype: int64
rows: 334 unique users: 200
Tell the interviewer what you noticed: 334 rows for 200 users → duplicates ahead. (Q13 deals with that.)
Edge cases & gotchas (2)
- isna() vs isnull() vs ==NaN
== NaN always returns False; use isna() / isnull() (synonyms in pandas). For object columns, also check for empty strings.
Fix · Python
# Treat empty strings AS missing
mask_missing = up.isna() | (up.astype(object) == "")
print(mask_missing.sum())
- Dtype-specific NaN
datetime NaT, float NaN, and pd.NA in nullable Int64 are different types — but isna() catches all three.
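A quick self-contained check, using a throwaway demo frame rather than one of the article's tables, showing isna() catching all three flavors:
import pandas as pd
import numpy as np
demo = pd.DataFrame({
    "ts": pd.to_datetime(["2025-01-01", None]),       # datetime NaT
    "x": [1.0, np.nan],                                # float NaN
    "n": pd.array([1, pd.NA], dtype="Int64"),          # nullable-int pd.NA
})
print(demo.isna().sum())  # each column reports exactly one missing value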
Add signup_month (YYYY-MM) to users.
[Meta · Cohort Reporting] A monthly cohort report wants a YYYY-MM column for the chart axis. Use .dt.to_period('M') to preserve order semantics, not strftime (which gives a string that happens to sort right for ISO months only).
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
users["signup_month"] = users["signup_date"].dt.to_period("M").astype(str)
print(users[["user_id", "signup_date", "signup_month"]].head())
user_id signup_date signup_month
0 1 2024-10-02 2024-10
1 2 2024-10-04 2024-10
2 3 2024-10-05 2024-10
3 4 2024-10-08 2024-10
4 5 2024-10-09 2024-10
Why .to_period('M') over .strftime('%Y-%m'): Period preserves order semantics; strftime gives a string that happens to sort correctly for ISO months but is not a real date.
Edge cases & gotchas (2)
- to_period('M').astype(str) loses ordering
Period('2024-12') < Period('2025-01') correctly. ‘2024-12’ < ‘2025-01’ correctly. But ‘2024-12’ < ‘25’ (wrong format) sorts lexically. Stick with Period for sorting; cast to string only for display.
- Time zone of the date
If signup_date is tz-aware UTC, to_period('M') is in UTC. Convert to local TZ first if the cohort is reported locally.
Save filtered IN-only users to Parquet.
[Netflix · Notebook Hand-off] Hand off a sliced frame to a downstream notebook without re-parsing CSV every time. Parquet (snappy) is 5–10× smaller and consumers don't have to guess dtypes.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
in_users = users[users["country"] == "IN"].copy()
out = "/tmp/users_in.parquet"
in_users.to_parquet(out, engine="pyarrow", compression="snappy", index=False)
import os
print("wrote:", out, " bytes:", os.path.getsize(out))
wrote: /tmp/users_in.parquet bytes: 4213
Why Parquet: columnar, typed, ~5–10× smaller than CSV, and consumers do not have to guess dtypes. Snappy = fast decompression, decent ratio.
Edge cases & gotchas (2)
- Parquet dtype mismatch
If different chunks have different dtypes (e.g., user_id sometimes int sometimes float), pyarrow may upcast or fail. Coerce dtypes before writing.
Fix · Python
in_users["user_id"] = in_users["user_id"].astype("int64")
in_users.to_parquet("/tmp/users_in.parquet", engine="pyarrow", compression="snappy", index=False)
- Snappy vs zstd
Snappy = fast read, decent ratio. Zstd = ~30% smaller but slower decode. For analytics on cold-storage, zstd is usually right; for hot tables, snappy.
List unique platforms in sessions.
[Meta · Data Discovery] A new analyst exploring the dataset wants to know what platforms exist. value_counts beats unique() — you get the per-value count for free.
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
print(sessions["platform"].value_counts(dropna=False))
platform
ios 329
android 265
web 198
WEB 5
Android 2
iOS 1
Name: count, dtype: int64
Why value_counts over unique(): you also get the per-value count for free — almost always what you actually want.
Edge cases & gotchas (2)
- value_counts hides NaN
By default value_counts excludes NaN. dropna=False surfaces it as a row — what you want for DQ.
Fix · Python
sessions["platform"].value_counts(dropna=False)
- Inconsistent casing
‘iOS’ vs ‘ios’ counted separately. Lowercase upstream or before the count.
Total revenue across the products table.
[Amazon · Quick GMV] A merchandising sanity check before any ranking. Pitfall: if revenue were stored as string ('1,234'), sum() would concatenate. Confirm dtype on numeric columns.
product_id product_name category revenue
0 1 prod_1 Electronics 435.0
1 2 prod_2 Electronics 25.0
2 3 prod_3 Electronics 384.0
3 4 prod_4 Electronics 1054.0
4 5 prod_5 Electronics 485.0
print("Total revenue:", products["revenue"].sum())
Total revenue: 74831.0
Pitfall: if revenue were stored as string ('1,234'), sum() would concatenate strings. Always confirm dtype on numeric columns.
Edge cases & gotchas (2)
- Numeric stored as string
If revenue loaded as object (e.g., '1,234'), .sum() concatenates strings. Always check dtypes before summing currency.
Fix · Python
# Defensive: strip commas, cast
products["revenue"] = (products["revenue"].astype(str)
                         .str.replace(",", "")
                         .astype("float64"))
- Currency rounding
Float sums of currency drift. For ledger-grade totals, store and sum as integer cents, then divide by 100 at display time.
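A minimal sketch of the integer-cents approach on this table (assuming revenue has no NaN):
cents = (products["revenue"] * 100).round().astype("int64")   # work in integer cents
print("Total revenue:", cents.sum() / 100)                    # convert back only for display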
Distinct users in logins.
[Meta · DAU Snapshot] The simplest engagement metric: how many unique users have ever logged in. At billion-user-scale, nunique is expensive — HyperLogLog trades a tiny error for huge memory wins.
user_id login_date
0 1 2025-03-02
1 1 2025-03-04
2 1 2025-03-05
3 1 2025-03-06
4 1 2025-03-07
print("unique users:", logins["user_id"].nunique())
print("total login rows:", len(logins))
unique users: 100
total login rows: 1935
Memory tip: on huge data, nunique() can be expensive; HyperLogLog (e.g., datasketch) trades a tiny error for big memory wins.
Edge cases & gotchas (2)
- nunique blows up on huge data
On a billion-row table, .nunique() loads all distinct values into memory. HyperLogLog (datasketch) trades a tiny error for huge memory wins.
Fix · Python
from datasketch import HyperLogLog
hll = HyperLogLog()
for u in logins["user_id"]:
    hll.update(str(u).encode())
print("approx unique:", hll.count())
- user_id type coercion
If user_id loaded as float because of NaN rows, ‘1.0’ and ‘1’ are the same value but the float repr can confuse downstream string joins.
Tier 2 · Joins & Multi-Table
Left-join users with purchases.
[Meta · Lifecycle Audience] Build a users-with-purchases table for downstream segmentation; preserve all users even if they never bought. validate="one_to_many" is your insurance against a silent m:m join blowing up the row count.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
merged = users.merge(purchases, on="user_id", how="left",
validate="one_to_many")
print("shape:", merged.shape)
print("users with no purchases:",
merged.loc[merged["purchase_id"].isna(), "user_id"].nunique())
shape: (374, 9)
users with no purchases: 75
validate= earns its keep: it asserts the relationship (1:1, 1:m, m:1, m:m). If your join is unexpectedly many-to-many, this catches it.
Edge cases & gotchas (2)
- Unintended m:m join
Without validate=, a duplicated key on either side multiplies rows silently. Always validate the relationship.
Fix · Python
# Will raise if relationship is unexpectedly m:m
merged = users.merge(purchases, on="user_id", how="left",
                     validate="one_to_many")
- Suffixes for column collisions
If both frames have a 'name' column, pandas appends _x / _y silently — confusing. Set explicit suffixes.
Fix · Python
merged = users.merge(profiles, on="user_id", suffixes=("_user", "_profile"))
Find duplicated user_id rows in user_profiles.
[Meta · CDP Inspection] Before deduping, see the duplicates: their cities, their update dates. keep=False shows all rows in a duplicate group, not just the extras — exactly what you want for inspection.
user_id name city updated_at
0 1 user_1 Denver 2025-01-05
1 1 user_1 NYC 2025-02-19
2 2 user_2 SF 2025-01-01
3 3 user_3 Austin 2025-01-11
4 3 user_3 NaN 2025-02-01
dups = up[up.duplicated(subset=["user_id"], keep=False)]
print("duplicate rows:", len(dups))
print(dups.sort_values(["user_id", "updated_at"]).head(8))
duplicate rows: 240
user_id name city updated_at
0 1 user_1 Denver 2025-01-05
1 1 user_1 NYC 2025-02-19
3 3 user_3 Austin 2025-01-11
4 3 user_3 NaN 2025-02-01
8 7 user_7 NYC 2025-01-17
9 7 user_7 Seattle 2025-02-07
10 8 user_8 Boston 2025-01-19
11 8 user_8 Seattle 2025-02-15
keep=False shows all rows that participate in a duplicate group, not just the extras. That is what you want for inspection.
Edge cases & gotchas (2)
- duplicated default keeps first
duplicated() defaults to keep='first', which marks ONLY the duplicates as True. keep=False marks all members of a duplicate group as True — what you want for inspection.
- Subset matters
duplicated() on the full row vs subset=['user_id'] gives very different counts. Be explicit.
Dedup user_profiles — keep latest updated_at per user.
[Meta · CDP] Produce one row per user with the freshest profile. The assert is_unique at the end is the answer — never trust dedup by inspection in production.
user_id name city updated_at
0 1 user_1 Denver 2025-01-05
1 1 user_1 NYC 2025-02-19
2 2 user_2 SF 2025-01-01
3 3 user_3 Austin 2025-01-11
4 3 user_3 NaN 2025-02-01
up_clean = (up.sort_values(["user_id", "updated_at"],
ascending=[True, False])
.drop_duplicates(subset=["user_id"], keep="first"))
assert up_clean["user_id"].is_unique
print("clean shape:", up_clean.shape)
print(up_clean.head())
clean shape: (200, 4)
user_id name city updated_at
1 1 user_1 NYC 2025-02-19
2 2 user_2 SF 2025-01-01
4 3 user_3 NaN 2025-02-01
5 4 user_4 Boston 2025-01-23
6 5 user_5 NYC 2025-01-18
The assert is the answer. Always prove uniqueness after a dedup; never trust by inspection.
Edge cases & gotchas (2)
- drop_duplicates preserves first sorted row
If you didn't sort by updated_at DESC, drop_duplicates keeps the wrong row silently. Always sort before drop_duplicates.
Fix · Python
up_clean = (up.sort_values(["user_id", "updated_at"], ascending=[True, False])
              .drop_duplicates(subset=["user_id"], keep="first"))
assert up_clean["user_id"].is_unique
- Tied updated_at
If two rows share updated_at, drop_duplicates picks one arbitrarily. Add a deterministic secondary sort key.
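This dataset has no ingestion-sequence column, so here is a sketch that uses the row's original position as a stand-in tiebreaker (in production you would use a load timestamp or source sequence number):
up_clean = (up.reset_index()                     # 'index' = original row position, the tiebreak
              .sort_values(["user_id", "updated_at", "index"],
                           ascending=[True, False, False])
              .drop_duplicates(subset=["user_id"], keep="first")
              .drop(columns="index"))
assert up_clean["user_id"].is_unique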
Spend per country (users + purchases).
[Amazon · Pricing] Spend by user country with purchase count, total revenue, and avg basket. Inner vs left join is a definition choice — state which you picked.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
spend = (users.merge(purchases, on="user_id", how="inner")
.groupby("country", as_index=False)
.agg(purchases=("purchase_id", "count"),
revenue =("amount", "sum"),
avg_basket=("amount", "mean"))
.sort_values("revenue", ascending=False))
print(spend.round(2))
country purchases revenue avg_basket
10 US 86 108088.71 1271.63
6 IN 46 4018.69 89.30
5 FR 31 3251.37 108.38
3 BR 28 2788.20 99.58
11 \N 21 2389.74 113.80
8 UK 29 1891.26 65.22
7 JP 23 1839.74 79.99
4 DE 13 1093.09 84.08
13 in 4 591.46 147.86
0 4 139.98 34.99
12 br 2 80.52 40.26
9 UK 2 54.98 27.49
2 <NULL> 1 6.21 6.21
...
[truncated; 1 more lines]
Inner vs left join: inner gives spend of countries that purchased; left gives spend per country with NaN for non-purchasing countries. Both valid; state your choice.
Edge cases & gotchas (2)
- Inner vs left changes denominators
Inner drops countries with no purchases. Left keeps them with NaN. Pick by the question (‘avg basket among purchasers’ vs ‘avg basket per country, NaN if none’).
- groupby with NaN in key
Default drops NaN groups. dropna=False surfaces them — but merging on NaN keys is generally a DQ bug to fix upstream.
DQ assert — no negative amounts in purchases.
[Amazon · Nightly DQ] A nightly DQ check before publishing the curated revenue table. Failing loudly (raise) is the point — a silent print('warning') in a pipeline becomes a corrupted dashboard tomorrow.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
bad = purchases[purchases["amount"] < 0]
if len(bad):
raise ValueError(f"{len(bad)} negative purchase amounts")
print("OK — no negative amounts; n=", len(purchases))
[error] ValueError: 5 negative purchase amounts
Failing loudly is the point. A silent print('warning') in a pipeline becomes a corrupted dashboard tomorrow.
Edge cases & gotchas (2)
- Soft ‘warning’ vs hard fail
A print() in a pipeline becomes invisible. raise ValueError or log at ERROR level so the orchestrator marks the task failed.
Fix · Python
bad = purchases[purchases["amount"] < 0]
if len(bad):
    raise ValueError(f"DQ failed: {len(bad)} negative amounts")
- Zero-row table is not pass
An empty purchases table satisfies ‘no negative amounts’ trivially — but is itself a DQ failure. Add an assert len(purchases) > 0 (or a freshness check).
Find missing dates in daily_metrics_gapped.
[Netflix · Date-spine DQ] Surface which days are missing before any chart consumer notices. Index.difference beats set(spine) - set(...) on big indexes — pandas keeps order and dtype.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-06 80359.13 9912
spine = pd.date_range(dmg["ds"].min(), dmg["ds"].max(), freq="D")
missing = spine.difference(dmg["ds"])
print(f"{len(missing)} missing days; first 5:")
print(list(missing[:5]))
14 missing days; first 5:
[Timestamp('2025-01-05 00:00:00'), Timestamp('2025-01-25 00:00:00'), Timestamp('2025-02-08 00:00:00'), Timestamp('2025-02-12 00:00:00'), Timestamp('2025-02-13 00:00:00')]
Index.difference beats set(spine) - set(dmg.ds) on big indexes — pandas keeps order and preserves dtype.
Edge cases & gotchas (2)
- Index.difference vs set difference
pd.date_range returns a DatetimeIndex. .difference() preserves dtype and order; set(spine) - set(dmg.ds) returns a Python set (unordered, slower).
- Time of day in ‘ds’
If ds is a Timestamp with non-zero time, the spine (midnight-anchored) doesn't match. Floor with .dt.normalize() first.
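A minimal sketch of that flooring step before rebuilding the spine:
dmg["ds"] = pd.to_datetime(dmg["ds"]).dt.normalize()   # anchor timestamps to midnight
spine = pd.date_range(dmg["ds"].min(), dmg["ds"].max(), freq="D")
missing = spine.difference(dmg["ds"])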
Forward-fill missing daily_dau values after spine join.
[Netflix · Imputation] A dashboard prefers ‘last-known-value’ over zero for short gaps. State the imputation choice loudly — ffill, zero, and interpolate have very different downstream meaning. Never pick silently.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-06 80359.13 9912
spine = pd.date_range(dmg["ds"].min(), dmg["ds"].max(), freq="D")
filled = (pd.DataFrame({"ds": spine})
.merge(dmg, on="ds", how="left")
.sort_values("ds"))
filled["daily_dau"] = filled["daily_dau"].ffill()
print(filled.head(10))
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486.0
1 2025-01-02 79294.66 10292.0
2 2025-01-03 55245.52 10806.0
3 2025-01-04 51104.66 9132.0
4 2025-01-05 NaN 9132.0
5 2025-01-06 80359.13 9912.0
6 2025-01-07 38076.88 11239.0
7 2025-01-08 36554.98 8400.0
8 2025-01-09 40941.61 9922.0
9 2025-01-10 50385.72 12868.0
State the choice: ffill, zero, or interpolate are all valid imputations with very different downstream meaning. Never pick silently.
Edge cases & gotchas (2)
- ffill before any value
If the very first row is NaN, ffill leaves it NaN. Either left-pad with a default or accept the head-NaN semantics.
Fix · Python
filled["daily_dau"] = filled["daily_dau"].ffill().fillna(0)
- ffill across user/group boundaries
Plain .ffill() on a stacked frame leaks values across users. Use .groupby('user_id').ffill() if the data is multi-entity.
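A sketch of the per-group form, assuming a hypothetical stacked frame df with user_id, ds, and value columns:
df = df.sort_values(["user_id", "ds"])
df["value"] = df.groupby("user_id")["value"].ffill()   # the fill never crosses a user boundary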
Z-score outliers in orders.amount.
[Apple · Trust Eng] A first-pass anomaly screen before deeper investigation. 3σ assumes normality; order amounts are usually log-normal — log-transform first or use IQR / MAD in production.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
o = orders.copy()
mu, sd = o["amount"].mean(), o["amount"].std()
o["z"] = (o["amount"] - mu) / sd
outliers = o[o["z"].abs() > 3].sort_values("z", ascending=False)
print(f"{len(outliers)} outliers (|z|>3)")
print(outliers.head().round(2))
9 outliers (|z|>3)
order_id user_id order_date amount z
445 446 176 2025-02-25 958.35 6.96
447 448 168 2025-02-25 919.86 6.65
14 15 70 2025-01-02 890.88 6.42
197 198 112 2025-01-25 830.40 5.93
309 310 36 2025-02-08 820.84 5.85
3σ assumes normality. Order amounts are usually log-normal — log-transform first, or use IQR / robust z-score (median + MAD).
Edge cases & gotchas (2)
- Sample contamination
Computing mean & std INCLUDING the outliers makes the thresholds too lax. Use a robust stat (median + MAD) or iteratively trim before re-computing.
Fix · Python
from scipy import stats
mad = stats.median_abs_deviation(orders["amount"], scale="normal")
median = orders["amount"].median()
robust_z = (orders["amount"] - median) / mad
outliers = orders[robust_z.abs() > 3.5]
- n=1 partition
If you compute z within tiny groups, std is NaN/0. Filter groups with at least N rows before flagging.
Cast clickstream.event_time to datetime; sort by user, time.
[Meta · Reels Sessionization Prep] Pre-sessionization shape: per-user events ordered chronologically. reset_index after sort or your iloc calls downstream lie.
user_id event_time page
0 1 2025-03-01 08:45:00 home
1 1 2025-03-01 08:55:00 search
2 1 2025-03-01 09:00:00 cart
3 1 2025-03-01 09:01:00 checkout
4 1 2025-03-01 09:46:00 search
cs2 = cs.copy()
cs2["event_time"] = pd.to_datetime(cs2["event_time"])
cs2 = cs2.sort_values(["user_id", "event_time"]).reset_index(drop=True)
print(cs2.head(8))
user_id event_time page
0 1 2025-03-01 08:45:00 home
1 1 2025-03-01 08:55:00 search
2 1 2025-03-01 09:00:00 cart
3 1 2025-03-01 09:01:00 checkout
4 1 2025-03-01 09:46:00 search
5 1 2025-03-01 09:56:00 home
6 1 2025-03-01 09:57:00 cart
7 1 2025-03-01 09:59:00 home
Why reset_index: after a sort, the original index is misleading. Reset it before you start using iloc or assigning new columns.
Edge cases & gotchas (2)
- to_datetime fails silently with errors='ignore'
pd.to_datetime(s, errors='ignore') returns the original strings on parse failure — silent dtype regression. Use errors='raise' or 'coerce' (NaT) and DQ-flag.
Fix · Python
cs["event_time"] = pd.to_datetime(cs["event_time"], errors="coerce")
bad = cs["event_time"].isna().sum()
if bad:
    raise ValueError(f"{bad} unparseable event_time rows")
- Unsorted before window ops
Sessionization, gaps-and-islands, all need sorted order. reset_index after sort or downstream iloc lies.
Pivot sessions: rows=date, cols=platform, values=count.
[Meta · Platform Mix Chart] Sessions wide format: rows = date, columns = platform, values = count. fill_value=0 is the small but critical detail — without it, days with no sessions for one platform become NaN and break downstream sums.
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
pivot = (sessions.groupby(["session_date", "platform"]).size()
.unstack(fill_value=0)
.reset_index())
print(pivot.head(10))
platform session_date Android WEB android iOS ios web
0 2025-01-01 0 1 10 0 1 0
1 2025-01-02 0 0 4 0 7 1
2 2025-01-03 0 0 7 0 4 1
3 2025-01-04 0 0 3 0 3 6
4 2025-01-05 0 0 2 0 6 4
5 2025-01-06 0 0 3 0 7 2
6 2025-01-07 0 0 6 0 4 2
7 2025-01-08 1 0 4 0 3 4
8 2025-01-09 0 0 7 0 1 4
9 2025-01-10 0 0 3 0 8 1
fill_value=0 is a small but critical detail — without it, days where one platform had no sessions become NaN, which often breaks downstream sums.
Edge cases & gotchas (2)
- Pivot without fill_value
Days with no sessions on one platform become NaN. Downstream .sum() may then propagate NaN unexpectedly.
- Categorical platforms
If platform is dtype='category' with unobserved categories, the pivot includes empty columns. observed=True (groupby) or .cat.remove_unused_categories() to clean.
Tier 3 · Window Functions, CASE, Subqueries
7-day moving average of daily_revenue.
[Netflix · Trend Smoothing] Smooth the daily noise for an exec chart. min_periods=1 avoids NaN at the start; for ‘official’ 7-day averages you may want min_periods=7 instead — a deliberate choice.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
dm2 = dm.sort_values("ds").copy()
dm2["rev_7d"] = dm2["daily_revenue"].rolling(window=7, min_periods=1).mean()
print(dm2.head(10).round(2))
ds daily_revenue daily_dau rev_7d
0 2025-01-01 50008.19 9486 50008.19
1 2025-01-02 79294.66 10292 64651.42
2 2025-01-03 55245.52 10806 61516.12
3 2025-01-04 51104.66 9132 58913.26
4 2025-01-05 55714.88 9883 58273.58
5 2025-01-06 80359.13 9912 61954.51
6 2025-01-07 38076.88 11239 58543.42
7 2025-01-08 36554.98 8400 56621.53
8 2025-01-09 40941.61 9922 51142.52
9 2025-01-10 50385.72 12868 50448.27
min_periods=1 avoids NaN at the start. For ‘official’ 7-day averages you may want min_periods=7 instead — a deliberate choice.
Edge cases & gotchas (2)
- min_periods choice
min_periods=1 avoids NaN at the head but produces 1-day ‘averages’ for the first 6 rows — misleading. min_periods=7 returns NaN until 7 rows accrue, which is the honest 7-day average.
Fix · Python
dm["rev_7d"] = dm["daily_revenue"].rolling(window=7, min_periods=7).mean()
- Centered vs trailing
rolling(7) is trailing by default. For a smoothed trend chart, rolling(7, center=True) centers — but you cannot use it operationally (uses future data).
Resample daily_metrics to weekly.
[Meta · Weekly Roll-up] Weekly business review wants one-row-per-week from the daily roll-up. Anchor matters: 'W-MON' ends on Monday — your team's week definition (Sun-Sat / Mon-Sun / ISO) is a real source of metric mismatch.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
weekly = (dm.set_index("ds")
.resample("W-MON")
.agg(week_revenue=("daily_revenue", "sum"),
avg_dau =("daily_dau", "mean"))
.reset_index())
print(weekly.head(8).round(2))
ds week_revenue avg_dau
0 2025-01-06 371727.04 9918.50
1 2025-01-13 345131.33 10997.50
2 2025-01-20 387115.95 10138.57
3 2025-01-27 293508.59 9598.29
4 2025-02-03 320906.36 10730.29
5 2025-02-10 308757.49 9861.29
6 2025-02-17 349706.55 11612.57
7 2025-02-24 345706.24 10180.57
Anchor matters: 'W-MON' ends the week on Monday. Your team's week definition (Sun-Sat? Mon-Sun? ISO?) is a real source of metric mismatches across teams.
Edge cases & gotchas (2)
- Resample anchor
'W' defaults to Sunday. 'W-MON' ends on Monday. Mismatch with the warehouse week definition is a common source of metric drift.
- Partial week at the boundaries
First and last week may have <7 days. Resample sums NaN as 0 by default; the first/last week is undercounted. Drop or label.
Fix · Python
weekly = dm.set_index("ds").resample("W-MON").agg(
    week_revenue=("daily_revenue", "sum"),
    days_in_week=("daily_revenue", "size"))
weekly = weekly[weekly["days_in_week"] == 7]  # drop partial weeks
MTD running revenue with month reset.
[Amazon · MTD Dashboard] MTD revenue on every row, reset on month-start. .dt.to_period('M') as a grouper is the pandas equivalent of DATE_TRUNC('month', ds).
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
dm2 = dm.sort_values("ds").copy()
dm2["mtd_revenue"] = (dm2.groupby(dm2["ds"].dt.to_period("M"))["daily_revenue"]
.cumsum())
print(dm2.head(35).tail(10).round(2))
ds daily_revenue daily_dau mtd_revenue
24 2025-01-25 40328.10 13136 1319669.74
25 2025-01-26 43681.34 9966 1363351.08
26 2025-01-27 34131.83 9913 1397482.91
27 2025-01-28 54908.78 10882 1452391.69
28 2025-01-29 57999.53 11443 1510391.22
29 2025-01-30 33800.89 11418 1544192.11
30 2025-01-31 44537.61 11610 1588729.72
31 2025-02-01 33832.95 9053 33832.95
32 2025-02-02 39214.40 9553 73047.35
33 2025-02-03 56612.20 11153 129659.55
The .dt.to_period('M') grouper is the pandas equivalent of DATE_TRUNC('month', ds) — clean and idiomatic.
Edge cases & gotchas (2)
- to_period('M') vs Grouper
.dt.to_period('M') is what you want for month labels. pd.Grouper(freq='M') works but resamples — different semantics if the index isn't sorted.
- Late-arriving rows
MTD that recomputes on every run isn't snapshotted. If a row for Dec 15 lands today, December's MTD numbers change. Snapshot or label as ‘rolling MTD’.
DAU and rolling 7-day WAU from logins.
[Meta · DAU/WAU] Two engagement metrics from one source. Rolling distinct count is hard — pandas has no built-in; the loop is O(N×W). For real scale, deque + counter or HyperLogLog sketch.
user_id login_date
0 1 2025-03-02
1 1 2025-03-04
2 1 2025-03-05
3 1 2025-03-06
4 1 2025-03-07
daily = (logins.groupby("login_date")["user_id"].nunique()
.rename("dau").reset_index()
.sort_values("login_date"))
# trailing-7-day WAU via list comprehension (acceptable for ~120 days)
def trail_unique(d, days=7):
start = d - pd.Timedelta(days=days-1)
mask = (logins["login_date"] >= start) & (logins["login_date"] <= d)
return logins.loc[mask, "user_id"].nunique()
daily["wau_7d"] = [trail_unique(d) for d in daily["login_date"]]
print(daily.tail(8))
login_date dau wau_7d
22 2025-03-23 53 100
23 2025-03-24 69 100
24 2025-03-25 65 100
25 2025-03-26 63 100
26 2025-03-27 68 100
27 2025-03-28 69 100
28 2025-03-29 66 100
29 2025-03-30 60 100
Rolling distinct is hard. Pandas has no built-in; the loop above is O(N×W). For big data, use a deque + counter or HyperLogLog sketch.
Edge cases & gotchas (2)
- Distinct in a rolling window is not a built-in
Pandas .rolling() doesn't support nunique. The naive list-comp is O(N×W) — fine for 365 days, awful for 4 years.
Fix · Python
# Faster: deque + Counter (count each user once per day so adds and evictions match)
from collections import deque, Counter
window, days = deque(), 7
counter = Counter()
result = {}
for d, group in logins.groupby("login_date"):
    day_users = set(group["user_id"])
    for u in day_users:
        counter[u] += 1
    window.append((d, day_users))
    while window and (d - window[0][0]).days >= days:
        old_d, old_users = window.popleft()
        for u in old_users:
            counter[u] -= 1
            if counter[u] == 0:
                del counter[u]
    result[d] = len(counter)
- Stitching identities
If the same human logs in from two devices, raw user_id double-counts in DAU. The identity-graph step belongs upstream.
WoW % change of WAU from weekly_metrics.
[Netflix · WoW Growth] The Netflix weekly business review wants WoW % change of WAU. pct_change() handles the first-row NaN gracefully — don't reinvent LAG with shift unless you need a non-1 lag.
week_start wau
0 2024-01-01 10000
1 2024-01-08 10230
2 2024-01-15 10455
3 2024-01-22 10819
4 2024-01-29 10434
wm2 = wm.sort_values("week_start").copy()
wm2["wow_pct"] = (wm2["wau"].pct_change() * 100).round(2)
print(wm2.head(10))
week_start wau wow_pct
0 2024-01-01 10000 NaN
1 2024-01-08 10230 2.30
2 2024-01-15 10455 2.20
3 2024-01-22 10819 3.48
4 2024-01-29 10434 -3.56
5 2024-02-05 10972 5.16
6 2024-02-12 11048 0.69
7 2024-02-19 11454 3.67
8 2024-02-26 11077 -3.29
9 2024-03-04 10499 -5.22
pct_change handles the NaN on the first row gracefully. Don't reinvent the LAG with shift unless you need a non-1 lag.
Edge cases & gotchas (2)
- First-row pct_change is NaN
.pct_change() can't compute change for the first observation — NaN is correct. Drop or label, don't fillna(0) (which falsely says ‘no growth’).
- Zero baseline
If wau was 0 last week, pct_change is inf. Guard with NULLIF logic or report as 'new growth' separately.
Fix · Python
import numpy as np
wm["wow_pct"] = wm["wau"].pct_change().replace([np.inf, -np.inf], np.nan) * 100
Days between each user's first and last purchase.
[Amazon · Customer Lifespan] Days between each user's first and last purchase — a ‘lifespan’ metric for the lifecycle team. Edge: users with one purchase get 0 days. Decide whether that is ‘0 days’ or ‘censored’ before reporting.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
span = (purchases.groupby("user_id")["purchase_date"]
.agg(first="min", last="max")
.reset_index())
span["days_active"] = (span["last"] - span["first"]).dt.days
print(span.sort_values("days_active", ascending=False).head(8))
user_id first last days_active
91 149 2024-11-01 2024-12-30 59
50 82 2024-11-01 2024-12-28 57
26 45 2024-11-04 2024-12-30 56
105 170 2024-11-05 2024-12-31 56
11 19 2024-11-01 2024-12-27 56
80 127 2024-11-06 2024-12-31 55
111 177 2024-11-07 2024-12-31 54
98 158 2024-11-08 2024-12-30 52
Edge: users with one purchase get 0 days. Decide whether that is ‘0 days active’ or ‘censored’ before reporting.
Edge cases & gotchas (2)
- Single-purchase users
first == last for one-purchase users ⇒ days_active = 0. ‘0 days lifespan’ vs ‘censored’ is a definition fight worth having.
- Time zone of purchase_date
If timestamps are UTC and the user is in HST, ‘day’ boundaries shift. .dt.tz_convert before .dt.date.
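A sketch of the local-day bucketing, assuming purchase_date were a UTC timestamp (in this dataset it is a naive date, so this is illustration only; the zone name is arbitrary):
p = purchases.copy()
p["local_day"] = (p["purchase_date"]
                    .dt.tz_localize("UTC")
                    .dt.tz_convert("Pacific/Honolulu")
                    .dt.normalize())
span_local = p.groupby("user_id")["local_day"].agg(first="min", last="max")
span_local["days_active"] = (span_local["last"] - span_local["first"]).dt.days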
Top 3 products per category by revenue.
[Amazon · Homepage Ranking] Top 3 products per category by revenue, in pandas. groupby + head is the most readable form; avoid apply(lambda d: d.nlargest(3, ...)) — slower and harder to read.
product_id product_name category revenue
0 1 prod_1 Electronics 435.0
1 2 prod_2 Electronics 25.0
2 3 prod_3 Electronics 384.0
3 4 prod_4 Electronics 1054.0
4 5 prod_5 Electronics 485.0
top3 = (products.sort_values(["category", "revenue"], ascending=[True, False])
.groupby("category", as_index=False)
.head(3))
print(top3.head(15))
product_id product_name category revenue
18 19 prod_19 Books 2887.0
10 11 prod_11 Books 2420.0
14 15 prod_15 Books 2021.0
27 28 prod_28 Clothing 9770.0
29 30 prod_30 Clothing 7190.0
28 29 prod_29 Clothing 2972.0
3 4 prod_4 Electronics 1054.0
50 1 prod_1 Electronics 999.0
4 5 prod_5 Electronics 485.0
49 50 prod_50 Food 8731.0
41 42 prod_42 Food 2667.0
42 43 prod_43 Food 2252.0
39 40 prod_40 Sports 12777.0
...
[truncated; 2 more lines]
'groupby + head' is the most readable top-N-per-group pattern in pandas. Avoid the apply(lambda d: d.nlargest(3, ...)) form — slower and harder to read.
Edge cases & gotchas (2)
- Ties at #3
groupby + head(3) picks the first 3 by sort order, silently dropping ties. For shelf rotation that should keep all ties, use rank with method='dense' filtered by <= 3.
Fix · Python
products["rk"] = (products.groupby("category")["revenue"]
                    .rank(method="dense", ascending=False))
top3 = products[products["rk"] <= 3]
- Stable sort across runs
Without a tiebreak in sort_values, two products with identical revenue swap order across runs. Add product_id as secondary.
Sessionize the clickstream — 30-minute idle gap.
[Meta · Reels Sessionization] Sessionize by 30-min idle gap, in pandas. Identical mental model to the SQL form — worth practicing both versions side-by-side until you can switch fluently in either round.
user_id event_time page
0 1 2025-03-01 08:45:00 home
1 1 2025-03-01 08:55:00 search
2 1 2025-03-01 09:00:00 cart
3 1 2025-03-01 09:01:00 checkout
4 1 2025-03-01 09:46:00 search
cs2 = cs.sort_values(["user_id", "event_time"]).copy()
prev = cs2.groupby("user_id")["event_time"].shift()
new_session = (cs2["event_time"] - prev) > pd.Timedelta(minutes=30)
new_session = new_session.fillna(True)
cs2["session_id"] = new_session.groupby(cs2["user_id"]).cumsum()
agg = (cs2.groupby(["user_id", "session_id"])
.agg(start=("event_time", "min"),
end =("event_time", "max"),
pages=("page", "count"))
.reset_index())
print(agg.head(8))
user_id session_id start end pages
0 1 0 2025-03-01 08:45:00 2025-03-01 09:01:00 4
1 1 1 2025-03-01 09:46:00 2025-03-01 11:51:00 24
2 1 2 2025-03-01 12:36:00 2025-03-01 12:36:00 1
3 2 0 2025-03-01 08:01:00 2025-03-01 09:13:00 12
4 2 1 2025-03-01 10:43:00 2025-03-01 11:49:00 9
5 2 2 2025-03-01 12:34:00 2025-03-01 13:53:00 22
6 3 0 2025-03-01 08:02:00 2025-03-01 08:02:00 1
7 3 1 2025-03-01 08:47:00 2025-03-01 08:57:00 2
Identical mental model to SQL. Worth practicing the SQL and pandas versions side by side until you can switch fluently.
Edge cases & gotchas (2)
- Per-user shift leak across users
If you forget the groupby in shift, the previous user's last event becomes the next user's prev_time. Always groupby user.
Fix · Python
# CORRECT: per-user lag
prev = cs.groupby("user_id")["event_time"].shift()
- Threshold per surface
30 minutes is canonical for web; mobile push 5 min; video 90s. The threshold is product knowledge, not a fixed constant.
Build the signup → activation → purchase funnel.
[Meta · Funnel Health] Build the signup → activation → purchase funnel from the events table. Always order-guard — funnels that ignore order count ‘purchase-before-signup’ as a successful conversion. They look great. They are wrong.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
firsts = (ev.groupby(["user_id", "event_type"])["event_date"]
.min().unstack())
step1 = firsts["signup"].notna().sum()
step2 = ((firsts["activation"].notna())
& (firsts["activation"] >= firsts["signup"])).sum()
step3 = ((firsts["purchase"].notna())
& (firsts["purchase"] >= firsts[["signup","activation"]].max(axis=1))
).sum()
print({"signup": int(step1),
"activation": int(step2),
"purchase": int(step3),
"act_rate": round(step2/step1, 3),
"buy_rate": round(step3/max(step2,1), 3)})
{'signup': 498, 'activation': 282, 'purchase': 127, 'act_rate': np.float64(0.566), 'buy_rate': np.float64(0.45)}
Always order-guard. Funnels that ignore event order count ‘purchase before signup’ as a successful conversion. They look great. They are wrong.
Edge cases & gotchas (2)
- User firing ‘signup’ multiple times
Some pipelines emit signup on every onboarding step. groupby user_id + event_type with .min() collapses to first occurrence — what you want.
- Skipped intermediate steps
If a user purchases without an ‘activation’ event, strict ordering excludes them. Decide: strict (must hit each step) or any-later-step counts.
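If you go with the lenient definition, a sketch of the any-later-step variant on the same firsts frame, where a purchase counts as long as it follows signup even when the activation event is missing:
step3_lenient = ((firsts["purchase"].notna())
                 & (firsts["purchase"] >= firsts["signup"])).sum()
print("purchase (lenient):", int(step3_lenient))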
Cohort retention table from users + logins.
[Meta · Retention Triangle] The shape that ships in every weekly growth review: rows = cohort, cols = age in weeks, cells = % retained. Heatmap straight from this table.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
u = users[["user_id", "signup_date"]].copy()
u["cohort_week"] = u["signup_date"].dt.to_period("W").dt.start_time
j = logins.merge(u, on="user_id")
j["weeks_since"] = ((j["login_date"] - j["cohort_week"]).dt.days // 7)
j = j[(j["weeks_since"] >= 0) & (j["weeks_since"] <= 4)]
retention = (j.pivot_table(index="cohort_week", columns="weeks_since",
values="user_id", aggfunc=pd.Series.nunique,
fill_value=0))
sizes = u.groupby("cohort_week")["user_id"].nunique()
retention_pct = retention.div(sizes, axis=0).mul(100).round(1)
print(retention_pct.head(8))
weeks_since 0.0 1.0 2.0 3.0 4.0
cohort_week
2024-09-30 NaN NaN NaN NaN NaN
2024-10-07 NaN NaN NaN NaN NaN
2024-10-14 NaN NaN NaN NaN NaN
2024-10-21 NaN NaN NaN NaN NaN
2024-10-28 NaN NaN NaN NaN NaN
2024-11-04 NaN NaN NaN NaN NaN
2024-11-11 NaN NaN NaN NaN NaN
2024-11-18 NaN NaN NaN NaN NaN
The shape that ships: rows = cohorts, cols = age, cells = % retained. Heatmap straight from this table. (In this sample dataset the early cohorts show NaN because the logins table only begins in March 2025, so their weeks 0-4 predate any observed login; see the censoring gotcha below.)
Edge cases & gotchas (2)
- Cohorts near data edge are right-censored
A signup from last week has at most W0 observed; W2 is literally impossible to compute. Mark as NaN or filter.
Fix · Python
max_week = (logins["login_date"].max() - u["cohort_week"]) // pd.Timedelta(weeks=1)
# Only include cohorts with N+ weeks of full observation
- ISO vs anchored week
.dt.to_period('W') uses Mon–Sun ISO week; the warehouse may use Sun–Sat. Mismatch ⇒ cohorts shifted by one day.
Tier 4 · CTEs, Sessionization, Cohorts
Per-user longest consecutive login streak.
[Meta · Engagement Streaks] Per-user longest consecutive-day login streak, in pandas. Same trick as the SQL version: date − row_number is constant within a consecutive run; group by it.
user_id login_date
0 1 2025-03-02
1 1 2025-03-04
2 1 2025-03-05
3 1 2025-03-06
4 1 2025-03-07
l = (logins.drop_duplicates(["user_id", "login_date"])
.sort_values(["user_id", "login_date"]).copy())
l["rn"] = l.groupby("user_id").cumcount()
l["grp"] = l["login_date"] - pd.to_timedelta(l["rn"], unit="D")
streaks = (l.groupby(["user_id", "grp"])["login_date"]
.agg(["min", "max", "count"])
.reset_index()
.rename(columns={"count": "streak_len"}))
longest = (streaks.groupby("user_id", as_index=False)["streak_len"]
.max()
.sort_values("streak_len", ascending=False))
print(longest.head(8))
user_id streak_len
45 46 20
46 47 17
75 76 13
94 95 13
13 14 12
99 100 12
11 12 11
27 28 11
Same trick as SQL: date − row_number is constant within a consecutive run; group by it.
Edge cases & gotchas (2)
- Duplicate (user, date) pairs poison the streak math
Two login rows for the same user/day inflate row_number, shifting the group_key. drop_duplicates on (user_id, login_date) first.
- Time zone vs ‘day’ bucket
23:30 PT and 00:30 PT are the same day for a Pacific user but different days in UTC. Bucket login_date in user-local TZ.
SCD2 as-of join: price each purchase using the price effective on its date.
[Apple · Historical Pricing] Re-price each purchase at the price effective on its date, in pandas. merge_asof is the SCD2 superpower — it picks the most recent matching row whose effective_date ≤ purchase_date.
product_id product_name price effective_date expiry_date is_current
0 1 product_1 60.01 2024-01-01 2024-04-16 N
1 1 product_1 66.01 2024-04-16 NaT Y
2 2 product_2 171.42 2024-01-01 2024-04-22 N
3 2 product_2 188.56 2024-04-22 2024-07-28 N
4 3 product_3 156.15 2024-01-01 2024-04-28 N
pdim = pd_dim.copy()
pdim["product_id"] = pdim["product_id"].astype("int64")
pdim = pdim.sort_values(["effective_date","product_id"])
p = purchases.copy().sort_values("purchase_date")
p["product_id"] = (p["product_category"].astype("category").cat.codes + 1).astype("int64")
priced = pd.merge_asof(p, pdim,
by="product_id",
left_on="purchase_date",
right_on="effective_date",
direction="backward")
print(priced[["purchase_id","purchase_date","product_id",
"amount","price"]].head(8))
[error] ValueError: Merge keys contain null values on left side
merge_asof is the SCD2 superpower in pandas: it picks the most recent matching row whose effective_date ≤ purchase_date. (The run above errors because some purchase_date values are null; merge_asof needs non-null, sorted keys on both sides, so drop or coerce them before merging.)
Edge cases & gotchas (2)
- merge_asof requires sorted keys
Both sides must be sorted on the on= column. Sort first or merge_asof raises ValueError: right keys must be sorted.
Fix · Python
pdim = pd_dim.sort_values("effective_date")
p = purchases.sort_values("purchase_date")
priced = pd.merge_asof(p, pdim, by="product_id",
                       left_on="purchase_date", right_on="effective_date",
                       direction="backward")
- Tolerance + direction
direction='nearest' can match a future row — wrong for as-of. Use 'backward' (last <= left) and add tolerance=pd.Timedelta(days=N) to bound staleness.
Detect skewed keys; salt the hot key into N buckets.
[Amazon · Spark Skew Mitigation] Detect skewed keys and salt the hot key into N buckets so a downstream Spark join can parallelize. The other side of the join must replicate the small-side rows for each salt bucket — salting only one side breaks the join.
id join_key value
0 1 key_243 189
1 2 key_225 81
2 3 hot_key 108
3 4 key_91 321
4 5 hot_key 579
import random
random.seed(42)
freq = skew["join_key"].value_counts(normalize=True).head()
print("top frequencies:")
print(freq.round(3))
N_SALTS = 8
hot = freq[freq > 0.05].index.tolist()
def salt(k):
return f"{k}__{random.randint(0, N_SALTS-1)}" if k in hot else k
skew2 = skew.copy()
skew2["join_key_salted"] = skew2["join_key"].map(salt)
print()
print("after salting (top 6):")
print(skew2["join_key_salted"].value_counts().head(6))
top frequencies:
join_key
hot_key 0.700
key_243 0.001
key_225 0.001
key_91 0.001
key_137 0.001
Name: proportion, dtype: float64
after salting (top 6):
join_key_salted
hot_key__3 108
hot_key__4 96
hot_key__1 95
...
[truncated; 4 more lines]
The other side of the join must replicate the small-side rows for each salt bucket. Salting only one side breaks the join.
Edge cases & gotchas (2)
- Salting only one side breaks the join
If you salt the fact table but not the dim, no salted key matches anything. The dim must be CROSS JOINed with the salt values.
Fix · Python
salts = pd.DataFrame({"salt": range(N_SALTS)})
dim_salted = dim.merge(salts, how="cross")
dim_salted["join_key_salted"] = (dim_salted["join_key"]
                                   .str.cat(dim_salted["salt"].astype(str), sep="__"))
- Unsalt before final aggregation
After the salted join, the final group-by must un-salt (strip the suffix) to get the real key. Forgetting this step is the most common skew-mitigation bug.
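A sketch of that final un-salt step, assuming joined is the result of the salted join and value is the measure being aggregated:
joined["join_key"] = joined["join_key_salted"].str.replace(r"__\d+$", "", regex=True)  # strip the salt suffix
final = joined.groupby("join_key", as_index=False)["value"].sum()                      # group on the real key again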
Build the org tree from employees as a nested dict.
[Amazon · HRIS JSON Export] An HR portal needs a JSON-friendly nested org tree: each manager → reports → their reports → …. Iterative DFS or bounded recursion — a cycle in manager_id would blow the stack.
employee_id full_name manager_id department salary
0 100 Dana Chen NaN Engineering 250000.0
1 201 Eve Park 100.0 Engineering 180000.0
2 202 Frank Li 100.0 Engineering 175000.0
3 203 Grace Kim 100.0 Engineering NaN
4 301 IC_301 201.0 Engineering 116962.0
children = {}
for _, r in emp.iterrows():
children.setdefault(r["manager_id"], []).append(r["employee_id"])
def build(node_id):
row = emp.loc[emp["employee_id"] == node_id].iloc[0]
return {
"id": int(node_id),
"name": row["full_name"],
"dept": row["department"],
"reports": [build(c) for c in children.get(node_id, [])],
}
roots = emp.loc[emp["manager_id"].isna(), "employee_id"].tolist()
tree = [build(r) for r in roots]
print("root:", tree[0]["name"], "->", len(tree[0]["reports"]), "direct reports")
import json
print(json.dumps(tree[0], default=str, indent=2)[:400], "...")
root: Dana Chen -> 3 direct reports
{
"id": 100,
"name": "Dana Chen",
"dept": "Engineering",
"reports": [
{
"id": 201,
"name": "Eve Park",
"dept": "Engineering",
"reports": [
{
"id": 301,
"name": "IC_301",
...
[truncated; 7 more lines]
Recursion safety: bound the depth (or use iterative DFS) — a cycle in manager_id would blow the stack.
Edge cases & gotchas (2)
- Cycle in manager_id blows the stack
Recursive build() with a cycle recurses until RecursionError. Use iterative DFS or a visited set.
Fix · Python
def build(node_id, visited=None):
    visited = visited or set()
    if node_id in visited:
        raise ValueError(f"cycle at {node_id}")
    visited = visited | {node_id}
    return {...}
- Multiple roots
If two employees have manager_id IS NULL, the tree has two roots. Decide whether the JSON is a list of trees or a synthetic single root with both as children.
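If you choose the synthetic-root route, a one-line sketch on top of the tree list built above (the id 0 / 'ALL' root is made up for illustration):
org = {"id": 0, "name": "ALL", "dept": None, "reports": tree} if len(tree) > 1 else tree[0]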
Median order amount per user.
[Apple · Robust Stats] Median order amount per user — robust to whales. Grouped quantile is built-in; for arbitrary percentiles use .agg(p90=('amount', lambda s: s.quantile(0.9))).
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
med = (orders.groupby("user_id", as_index=False)["amount"]
.quantile(0.5)
.rename(columns={"amount": "median_amount"}))
print(med.sort_values("median_amount", ascending=False).head(8).round(2))
user_id median_amount
73 85 425.90
166 184 384.46
133 149 315.26
88 102 312.58
39 44 275.96
29 34 240.75
137 153 232.63
181 99002 232.05
Grouped quantile is built-in. For arbitrary percentiles use .agg(p90=('amount', lambda s: s.quantile(0.9))).
Edge cases & gotchas (2)
- Quantile interpolation
Default quantile(0.5, interpolation='linear') interpolates between two middle rows — fractional cents result. Use 'lower' / 'higher' / 'midpoint' for currency.
Fix · Python
med = (orders.groupby("user_id")["amount"]
         .quantile(0.5, interpolation="lower"))
- Single-order users
n=1 partitions return that one value as the median. Filter to users with min order count if ‘typical’ basket is the goal.
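A sketch of that pre-filter using a transform-count mask:
n_orders = orders.groupby("user_id")["order_id"].transform("count")
med_multi = (orders[n_orders >= 2]                      # only users with 2+ orders
               .groupby("user_id", as_index=False)["amount"]
               .quantile(0.5, interpolation="lower"))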
Memory-efficient CSV read in chunks.
[Amazon · S3 Streaming] A 50-GB clickstream won't fit in RAM. Stream-aggregate with chunksize to bound memory and usecols to skip unneeded columns at parse time.
user_id event_time page
0 1 2025-03-01 08:45:00 home
1 1 2025-03-01 08:55:00 search
2 1 2025-03-01 09:00:00 cart
3 1 2025-03-01 09:01:00 checkout
4 1 2025-03-01 09:46:00 search
page_counts = pd.Series(dtype=int)
src = DATA + "clickstream.csv"
for chunk in pd.read_csv(src, chunksize=500, usecols=["page"]):
page_counts = page_counts.add(chunk["page"].value_counts(),
fill_value=0)
print(page_counts.sort_values(ascending=False).astype(int).head())
page
checkout 406
search 403
product 393
home 389
cart 377
dtype: int64
Two memory wins: chunksize bounds RAM; usecols skips columns we do not need at parse time.
Edge cases & gotchas (2)
- Loss of dtypes between chunks
If chunk 1 has user_id as int and chunk 2 has it as float (NaNs introduced), concat raises a dtype warning. Pre-declare dtypes via dtype= in read_csv.
Fix · Python
for chunk in pd.read_csv(src, chunksize=200_000,
                         usecols=["page", "user_id"],
                         dtype={"user_id": "Int64", "page": "string"}):
    ...
- Shuffled ordering between chunks
If your aggregation needs ordering (sessionization, gaps), chunking by row count is wrong — chunks aren't time-aligned. Chunk by date instead.
Validate a schema contract for orders.
[Netflix · Schema Contract] A nightly ingest must fail loudly if the upstream schema or data ranges drift. In real production: Great Expectations / pandera / Pydantic; on a whiteboard, the bare-asserts version below.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
def validate_orders(df):
expected_cols = ["order_id", "user_id", "order_date", "amount"]
missing = set(expected_cols) - set(df.columns)
assert not missing, f"missing columns: {missing}"
assert df["order_id"].is_unique, "duplicate order_id"
assert (df["amount"] > 0).all(), "non-positive amounts"
assert df["order_date"].notna().all(), "null order_date"
return True
print("orders schema OK:", validate_orders(orders))
[error] AssertionError: duplicate order_id
In a real job you would use Great Expectations / pandera / Pydantic; the bare-asserts version is what you write on a whiteboard. (And note the run above did fail loudly: this dataset really does contain a duplicate order_id, which is exactly the behavior you want from the check.)
Edge cases & gotchas (2)
- assert is stripped under -O
Python's assert compiles out under -O optimization. Use raise ValueError(...) in production DQ checks.
Fix · Python
if not df["order_id"].is_unique:
    raise ValueError("duplicate order_id")
- Schema vs values
The function checks dtypes and uniqueness. Real DQ also checks value ranges, freshness (max date), volume (row count not 0). Use Great Expectations / pandera for the full set.
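A slightly fuller, still whiteboard-grade sketch that folds in the raise-not-assert advice plus volume and freshness checks (the two-day staleness threshold is arbitrary and meant for a live table):
def validate_orders_full(df, max_staleness_days=2):
    if df.empty:
        raise ValueError("orders is empty")                                # volume
    missing = {"order_id", "user_id", "order_date", "amount"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {missing}")                    # schema
    if not df["order_id"].is_unique:
        raise ValueError("duplicate order_id")                             # uniqueness
    if (df["amount"] <= 0).any():
        raise ValueError("non-positive amounts")                           # value range
    staleness = (pd.Timestamp.today().normalize() - df["order_date"].max()).days
    if staleness > max_staleness_days:
        raise ValueError(f"stale: newest order is {staleness} days old")   # freshness
    return True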
Idempotent upsert: append only new orders to a Parquet table.
[Amazon · Idempotent ETL] A streaming-ish ETL where re-running the same batch must not double-count. Idempotency = primary key + last-write-wins. In a real warehouse: MERGE INTO; in a lakehouse: Delta / Iceberg MERGE.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
import os
out = "/tmp/orders_curated.parquet"
new_batch = orders.head(100)
if os.path.exists(out):
existing = pd.read_parquet(out)
combined = pd.concat([existing, new_batch])
combined = combined.drop_duplicates(subset=["order_id"], keep="last")
else:
combined = new_batch
combined.to_parquet(out, index=False)
print("rows in curated:", len(combined))
rows in curated: 100
Idempotency = primary key + last-write-wins. In a real warehouse this is MERGE INTO; in a lakehouse it is Delta / Iceberg MERGE.
Edge cases & gotchas (2)
- Atomic vs partial writes
to_parquet writing then crashing leaves a half-written file. Stage to a temp path and rename atomically, or use Delta/Iceberg.
Fix · Python
import os
tmp = out + ".tmp"
combined.to_parquet(tmp, index=False)
os.replace(tmp, out)  # atomic on POSIX FS
- Concurrent writers
Two ETL runs writing to the same Parquet at once corrupt each other. Use a per-batch path + a manifest file, or a lakehouse format with optimistic concurrency.
Build a Type-2 SCD from a stream of price updates.
[Apple · Catalog History] Build a Type-2 SCD from a stream of price updates. The invariant: for any product, exactly one row has is_current='Y', and intervals do not overlap. Test both with asserts.
(see linked tables)
updates = pd.DataFrame([
{"product_id": 1, "price": 60.01, "change_date": "2024-01-01"},
{"product_id": 1, "price": 66.01, "change_date": "2024-04-16"},
{"product_id": 2, "price": 171.42, "change_date": "2024-01-01"},
{"product_id": 2, "price": 188.56, "change_date": "2024-04-22"},
{"product_id": 2, "price": 199.99, "change_date": "2024-07-28"},
])
updates["change_date"] = pd.to_datetime(updates["change_date"])
updates = updates.sort_values(["product_id", "change_date"])
updates["effective_date"] = updates["change_date"]
updates["expiry_date"] = (updates.groupby("product_id")["change_date"]
.shift(-1))
updates["is_current"] = np.where(updates["expiry_date"].isna(), "Y", "N")
print(updates)
product_id price change_date effective_date expiry_date is_current
0 1 60.01 2024-01-01 2024-01-01 2024-04-16 N
1 1 66.01 2024-04-16 2024-04-16 NaT Y
2 2 171.42 2024-01-01 2024-01-01 2024-04-22 N
3 2 188.56 2024-04-22 2024-04-22 2024-07-28 N
4 2 199.99 2024-07-28 2024-07-28 NaT Y
The invariant: for any product, exactly one row has is_current='Y', and intervals do not overlap. Test both with asserts.
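A minimal pair of asserts over the updates frame just built (swap in raises for production, per the earlier -O gotcha):
# Invariant 1: exactly one current row per product
current_per_product = (updates["is_current"] == "Y").groupby(updates["product_id"]).sum()
assert (current_per_product == 1).all(), "products with != 1 current row"
# Invariant 2: each row expires exactly when the next one starts (no overlap, no gap)
next_start = updates.groupby("product_id")["effective_date"].shift(-1)
assert ((updates["expiry_date"] == next_start) | next_start.isna()).all(), "window overlap/gap"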
Edge cases & gotchas (2)
- Out-of-order updates
If updates arrive with change_date going backwards, the SCD2 build inserts a row that overlaps prior windows. Sort and validate before assigning effective/expiry.
- Same-day price flip-flops
Two updates with the same change_date for the same product produce a zero-width window — invariant violation. Dedup or flag.
Validate existing product_dim — no overlaps, exactly one current row per product.
[Apple · SCD2 Health Check] SCD2 health checks are first-class data products. Validate product_dim: no overlapping windows, exactly one current row per product. Run on every load and alert when the count regresses.
product_id product_name price effective_date expiry_date is_current
0 1 product_1 60.01 2024-01-01 2024-04-16 N
1 1 product_1 66.01 2024-04-16 NaT Y
2 2 product_2 171.42 2024-01-01 2024-04-22 N
3 2 product_2 188.56 2024-04-22 2024-07-28 N
4 3 product_3 156.15 2024-01-01 2024-04-28 N
def overlap_count(g):
g = g.sort_values("effective_date")
end = g["expiry_date"].fillna(pd.Timestamp("2999-01-01"))
nxt = g["effective_date"].shift(-1)
return (nxt < end).fillna(False).sum()
bad_overlaps = (pd_dim.groupby("product_id", group_keys=False)
.apply(overlap_count).sum())
current_per_product = pd_dim.groupby("product_id")["is_current"].apply(
lambda s: (s == "Y").sum())
print("overlapping intervals:", int(bad_overlaps))
print("products with !=1 current row:",
int((current_per_product != 1).sum()))
overlapping intervals: 11
products with !=1 current row: 10
SCD2 health checks are first-class data products. Run them on every load and alert when they regress.
Edge cases & gotchas (2)
- expiry_date NULL convention
NULL = current is conventional, but the analytic query expiry_date > CURRENT_DATE excludes NULL rows (NULL > anything is NULL). Use COALESCE(expiry_date, '9999-12-31') or test IS NULL explicitly.
- Boundary inclusivity
If [effective, expiry] is closed-closed, a fact dated exactly on the boundary matches BOTH rows. Standard is half-open [effective, expiry). Document and assert.
Fix · Python
# Half-open semantics in the as-of join condition
mask = (fact["dt"] >= dim["effective"]) & (fact["dt"] < dim["expiry"].fillna(pd.Timestamp.max))
Tier 5 · Senior Patterns
PySpark anti-join: users with no purchases.
[Meta · Lifecycle in Spark] Same anti-join as the SQL version, but on a billion-user table where pandas would OOM. left_anti is the canonical Spark form — clearer than the LEFT-JOIN-WHERE-NULL pattern, and the optimizer handles it as a dedicated join type.
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
users_df = spark.read.option("header", True).csv("users.csv")
purchases_df = spark.read.option("header", True).csv("purchases.csv")
never_bought = users_df.join(purchases_df, "user_id", "left_anti")
never_bought.show(5)
print("count:", never_bought.count())
[representative; equivalent to pandas LEFT JOIN with IS NULL]
user_id name country signup_date
0 1 user_1 IN 2024-10-02
1 2 user_2 UK 2024-10-04
5 4 user_4 JP 2024-10-08
12 8 user_8 US 2024-10-15
17 10 user_10 IN 2024-10-19
left_anti is the canonical anti-join in Spark — clearer than the LEFT-JOIN-WHERE-NULL pattern, and the optimizer handles it as a dedicated join type.
Edge cases & gotchas (2)
- NULL join keys on the right side never match
Spark's left_anti uses plain equality, so purchase rows whose user_id is NULL (a bad parse, for example) match nothing — the anti-join then keeps left rows that really did purchase. Confirm by counting matches before relying on it (see the left_semi sketch after these gotchas).
- Header inference
option('header', True) assumes headers; without inferSchema=True all columns are strings, so user_id == '1' vs 1 silently fails the join.
Fix · Python
users_df = (spark.read
            .option("header", True)
            .option("inferSchema", True)
            .csv("users.csv"))
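A quick sanity check for the NULL-key gotcha above — count actual matches with a left_semi join before trusting the anti-join (sketch, same DataFrames as above):
matched = users_df.join(purchases_df, "user_id", "left_semi").count()
total = users_df.count()
never = never_bought.count()
print(f"{matched}/{total} users have a purchase; anti-join returned {never} (expect {total - matched})")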
PySpark window: top-3 products per category.
[Amazon · Catalog Ranking at Scale] Top-3 products per category in PySpark. Window cost: partitioning by a high-cardinality column shuffles hard — for very wide top-N, consider a two-stage approx-then-exact pattern.
product_id product_name category revenue
0 1 prod_1 Electronics 435.0
1 2 prod_2 Electronics 25.0
2 3 prod_3 Electronics 384.0
3 4 prod_4 Electronics 1054.0
4 5 prod_5 Electronics 485.0
from pyspark.sql import functions as F, Window
products_df = (spark.read.option("header", True).csv("products.csv")
.withColumn("revenue", F.col("revenue").cast("int")))
w = Window.partitionBy("category").orderBy(F.desc("revenue"))
top = (products_df.withColumn("rk", F.dense_rank().over(w))
.filter(F.col("rk") <= 3))
top.show()
[representative output, computed via pandas equivalent]
product_id product_name category revenue rk
18 19 prod_19 Books 2887.0 1
10 11 prod_11 Books 2420.0 2
14 15 prod_15 Books 2021.0 3
27 28 prod_28 Clothing 9770.0 1
29 30 prod_30 Clothing 7190.0 2
28 29 prod_29 Clothing 2972.0 3
3 4 prod_4 Electronics 1054.0 1
50 1 prod_1 Electronics 999.0 2
Window cost: partitioning by a high-cardinality column makes Spark shuffle hard. For very wide top-N, consider a two-stage approx-then-exact pattern.
Edge cases & gotchas (2)
- Window without partitionBy is a single-partition shuffle
Window.orderBy(...) alone (no partitionBy) collapses all data onto one task. For top-N per group, partition by the group key; for a global top-N, accept the shuffle but consider a limit first.
- dense_rank vs row_number
If two products tie at #2, dense_rank yields 1,2,2,3, so rk <= 3 can return more than three rows per category; row_number returns exactly three but needs an explicit tiebreaker column to be deterministic. Same SQL tradeoff: pick by intent.
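If the requirement is exactly three rows per category, a row_number variant with an explicit tiebreaker is the deterministic form — a sketch reusing products_df from above:
w_rn = Window.partitionBy("category").orderBy(F.desc("revenue"), F.asc("product_id"))
top3_exact = (products_df.withColumn("rn", F.row_number().over(w_rn))
              .filter(F.col("rn") <= 3))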
Broadcast join in PySpark — small dimension on the right side.
[Meta · Production Spark] A facts × tiny-dim join shouldn't shuffle the giant facts. Force a broadcast hint when the small side is borderline above the auto threshold (spark.sql.autoBroadcastJoinThreshold, default 10 MB).
user_id name email signup_date country
0 1 user_1 user1@example.com 2024-10-02 IN
1 2 user_2 user2@example.com 2024-10-04 UK
2 3 user_3 user3@example.com 2024-10-05 JP
3 4 user_4 user4@example.com 2024-10-08 JP
4 5 user_5 user5@example.com 2024-10-09 US
from pyspark.sql.functions import broadcast
orders_df = spark.read.option("header", True).csv("orders.csv")
users_df = spark.read.option("header", True).csv("users.csv")
joined = orders_df.join(broadcast(users_df), "user_id", "left")
joined.explain() # confirm BroadcastHashJoin in the plan
== Physical Plan ==
*(2) Project [order_id, user_id, order_date, amount, name, country, signup_date]
+- *(2) BroadcastHashJoin [user_id], [user_id], LeftOuter, BuildRight
:- FileScan csv ... [orders.csv]
+- BroadcastExchange HashedRelationBroadcastMode(...)
+- FileScan csv ... [users.csv]
Threshold: Spark auto-broadcasts under spark.sql.autoBroadcastJoinThreshold (default 10 MB). For larger-but-still-small dims, the explicit hint forces the optimizer.
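The threshold itself is just a session config — a sketch of tuning it instead of (or alongside) the explicit hint:
# Raise the auto-broadcast cutoff to 50 MB (value is in bytes)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
# Or set it to -1 to disable auto-broadcast entirely and rely only on explicit broadcast() hints
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")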
Edge cases & gotchas (2)
- Broadcast threshold
Spark auto-broadcasts under spark.sql.autoBroadcastJoinThreshold (default 10 MB). Above that, force with the broadcast() hint. Above ~200 MB, broadcast becomes a memory liability — use bucketing instead.
- Skew in the BIG side, not the small one
Broadcast helps when the small side is the dim. If the FACTS are skewed on the join key, broadcast still works but the skewed task in the fact is the bottleneck. Salt the fact instead.
PySpark funnel — countDistinct per step.
[Meta · Funnel at Scale] Same signup → activation → purchase funnel as Q34, but on the billion-row events table. Exact distinct counts at that scale are expensive — swap countDistinct for approx_count_distinct with a tunable rsd; 5% error is fine for a funnel chart.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
events_df = spark.read.option("header", True).csv("events.csv")
firsts = (events_df.groupBy("user_id", "event_type")
.agg(F.min("event_date").alias("first_at"))
.groupBy("user_id")
.pivot("event_type", ["signup","activation","purchase"])
.agg(F.first("first_at")))
funnel = firsts.agg(
F.countDistinct(F.when(F.col("signup").isNotNull(),
F.col("user_id"))).alias("signups"),
F.countDistinct(F.when((F.col("activation").isNotNull())
& (F.col("activation") >= F.col("signup")),
F.col("user_id"))).alias("activations"),
F.countDistinct(F.when((F.col("purchase").isNotNull())
& (F.col("purchase") >= F.col("activation")),
F.col("user_id"))).alias("purchasers"),
)
funnel.show()
[representative; counts computed via pandas]
signups = 498
activations = 282
purchasers = 127
Exact distinct counts at huge scale get expensive; consider approx_count_distinct with a tunable rsd — usually 5% error is fine for a funnel chart.
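Applying that swap inside the funnel aggregation is mechanical — a sketch of the same three counters with rsd=0.05 (≈5% relative error):
funnel_approx = firsts.agg(
    F.approx_count_distinct(
        F.when(F.col("signup").isNotNull(), F.col("user_id")), rsd=0.05).alias("signups"),
    F.approx_count_distinct(
        F.when((F.col("activation").isNotNull())
               & (F.col("activation") >= F.col("signup")), F.col("user_id")),
        rsd=0.05).alias("activations"),
    F.approx_count_distinct(
        F.when((F.col("purchase").isNotNull())
               & (F.col("purchase") >= F.col("activation")), F.col("user_id")),
        rsd=0.05).alias("purchasers"),
)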
Edge cases & gotchas (2)
- approx_count_distinct rsd tradeoff
approx_count_distinct(col, rsd=0.05) means 5% relative standard deviation. For a funnel chart 5% is fine; for billing it isn't. Pick deliberately.
Fix · Python
from pyspark.sql.functions import approx_count_distinct
funnel = events_df.agg(approx_count_distinct("user_id", rsd=0.05)
                       .alias("approx_unique"))
- Pivot with unknown values
.pivot('event_type') without an explicit list scans all values (an extra job). Pass the list ['signup','activation','purchase'] to skip the scan.
Custom hash partitioning to spread skew.
[Amazon · Skew at Petabyte Scale] Custom hash partitioning with a salted hot key. Tradeoff: salting helps the shuffle, but downstream aggregation must un-salt (strip the suffix) before the final group-by. Don't forget the unsalt step.
id join_key value
0 1 key_243 189
1 2 key_225 81
2 3 hot_key 108
3 4 key_91 321
4 5 hot_key 579
from pyspark.sql import functions as F
skew_df = spark.read.option("header", True).csv("skewed_table.csv")
hot, N = "hot_key", 16
salted = (skew_df
.withColumn("salt", F.when(F.col("join_key") == hot,
(F.rand() * N).cast("int"))
.otherwise(F.lit(0)))
.withColumn("salted_key", F.concat_ws("__", "join_key", "salt"))
.repartition(64, "salted_key"))
salted.groupBy("salted_key").count().orderBy(F.desc("count")).show(5)
+--------------+-----+
| salted_key|count|
+--------------+-----+
|hot_key__0 | 45|
|hot_key__7 | 44|
|hot_key__15 | 44|
|hot_key__3 | 43|
|hot_key__9 | 42|
+--------------+-----+
[hot_key spread across 16 buckets vs ~700 rows on one]
Tradeoff: salting helps shuffle, but downstream needs to un-salt by stripping the suffix before final aggregation. Don't forget the unsalt step.
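A sketch of that un-salt step — strip the suffix (or reuse the original key column) before the final group-by:
unsalted = (salted
            .withColumn("orig_key", F.split(F.col("salted_key"), "__").getItem(0))
            .groupBy("orig_key")
            .agg(F.sum(F.col("value").cast("double")).alias("total_value")))
unsalted.show(5)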
Edge cases & gotchas (2)
- Salt count vs cluster parallelism
N=16 salts means at most 16 tasks for the hot key. If your cluster has 64 cores, 16 salts under-utilize. Tune N to the available parallelism.
- Salt the SMALL side too
After salting the fact, the dim must be replicated across the salt buckets — either by cross-joining with a salt-numbers table or by emitting N copies, one per salt.
Fix · Python
from pyspark.sql.functions import explode, array, lit
# Replicate the dimension once per salt value so salted fact keys still find a match
dim_replicated = dim.withColumn("salt", explode(array(*[lit(i) for i in range(N)])))
Build a tiny DAG executor in pure Python (topological sort).
[Netflix · Mini Orchestrator] A 5-line task DAG, no Airflow. Knowing toposort + cycle detection by heart proves you have built a scheduler, not just used one — Netflix's Maestro and Meta's Dataswarm both started this way.
(see linked tables)
from collections import defaultdict, deque
def topo_run(graph, run_fn):
indeg = defaultdict(int)
rev = defaultdict(list)
for node, deps in graph.items():
        indeg[node]  # touch the key so zero-dependency roots register with indegree 0
for d in deps:
indeg[node] += 1
rev[d].append(node)
q = deque([n for n, k in indeg.items() if k == 0])
order = []
while q:
n = q.popleft()
run_fn(n)
order.append(n)
for nxt in rev[n]:
indeg[nxt] -= 1
if indeg[nxt] == 0:
q.append(nxt)
if len(order) != len(graph):
raise ValueError("cycle in DAG")
return order
dag = {"extract":[], "stage":["extract"], "transform":["stage"],
"publish":["transform"], "audit":["transform"]}
ran = topo_run(dag, lambda n: print("running:", n))
print("order:", ran)
running: extract
running: stage
running: transform
running: publish
running: audit
order: ['extract', 'stage', 'transform', 'publish', 'audit']
Why this question shows up: orchestration is half the job. Knowing toposort + cycle detection by heart proves you have built (not just used) a scheduler.
Edge cases & gotchas (2)
- Cycle detection
If the DAG has a cycle, topo_run processes fewer nodes than the graph contains; the final length check raises ValueError and catches it. Without that check, the executor silently runs a partial DAG.
- Failure handling
The toy executor doesn't handle task failure. Real schedulers retry-with-backoff, mark downstream as upstream_failed, and expose state to a dashboard.
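A hedged sketch of what minimal failure handling could look like here — a retry-with-backoff wrapper around run_fn (the helper name is hypothetical, not part of the original executor):
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    def runner(node):
        for i in range(attempts):
            try:
                return fn(node)
            except Exception:
                if i == attempts - 1:
                    raise                        # exhausted: let the caller mark downstream as failed
                time.sleep(base_delay * 2 ** i)  # exponential backoff between attempts
    return runner

ran = topo_run(dag, with_retries(lambda n: print("running:", n)))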
Implement an LRU cache (no functools).
[Generic FAANG · Caching] A classic mid-level Python question: implement an LRU cache without functools.lru_cache. OrderedDict gives O(1) get/put; the doubly-linked-list-from-scratch version impresses the senior interviewer.
(see linked tables)
from collections import OrderedDict
class LRUCache:
def __init__(self, capacity):
self.cap = capacity
self.d = OrderedDict()
def get(self, key):
if key not in self.d:
return None
self.d.move_to_end(key)
return self.d[key]
def put(self, key, value):
if key in self.d:
self.d.move_to_end(key)
self.d[key] = value
if len(self.d) > self.cap:
self.d.popitem(last=False)
c = LRUCache(2)
c.put("a", 1); c.put("b", 2); c.get("a"); c.put("c", 3)
print(list(c.d.items()))
[('a', 1), ('c', 3)]
O(1) get/put. If you write the doubly-linked-list version from scratch you will impress, but the OrderedDict version is the real-world answer.
Edge cases & gotchas (2)
- Thread safety
OrderedDict isn't thread-safe. For multi-threaded callers wrap with a Lock, or use functools.lru_cache (which is thread-safe).
Fix · Python
from threading import Lock
class LRUCache:
    def __init__(self, cap):
        self.cap = cap; self.d = OrderedDict(); self.lock = Lock()
    def get(self, k):
        with self.lock:
            if k not in self.d: return None
            self.d.move_to_end(k); return self.d[k]
- Eviction policy edge
Capacity 0 should be valid (no caching). Test that put on a 0-capacity cache is a no-op rather than an exception.
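A quick test for that eviction edge (sketch, using the LRUCache defined above):
zero = LRUCache(0)
zero.put("x", 1)               # should silently evict, not raise
assert zero.get("x") is None
assert len(zero.d) == 0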
Streaming median with two heaps.
[Meta · Streaming Aggregations] Compute a running median in O(log n) per insert. Two-heap pattern shows up in interviews and in real streaming aggregations. Memorize the invariant: len(lo) ∈ {len(hi), len(hi)+1}.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
import heapq
class StreamingMedian:
def __init__(self):
self.lo = [] # max-heap (invert sign)
self.hi = [] # min-heap
def add(self, x):
heapq.heappush(self.lo, -x)
heapq.heappush(self.hi, -heapq.heappop(self.lo))
if len(self.hi) > len(self.lo):
heapq.heappush(self.lo, -heapq.heappop(self.hi))
def median(self):
if len(self.lo) > len(self.hi):
return -self.lo[0]
return (-self.lo[0] + self.hi[0]) / 2
m = StreamingMedian()
for v in orders["amount"].tolist():
m.add(v)
print("running median (final):", round(m.median(), 2))
print("vs orders['amount'].median() :", round(orders["amount"].median(), 2))
running median (final): 51.52
vs orders['amount'].median() : 51.52
Two-heap pattern shows up in interviews and in real streaming aggregations. Worth memorizing the invariant: len(lo) ∈ {len(hi), len(hi)+1}.
Edge cases & gotchas (2)
- Removing values mid-stream
The two-heap pattern only supports add. Real streaming median with deletes needs a counter for tombstones and lazy removal — non-trivial.
- Numeric stability
For very large N with floats, the median itself is fine, but downstream variance computation can drift. Use Welford's online algorithm for variance.
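A minimal Welford sketch for the variance point above (assumption: you only need a running mean/variance, no deletes):
class RunningStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # uses the updated mean — the numerically stable form
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0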
Backfill orchestrator: regenerate daily_metrics for a missing date range.
[Amazon · Backfill Discipline] A bug in the upstream job lost 5 days of metrics. Reprocess only those days, idempotently. Partition-per-day + overwrite is the lakehouse pattern — re-running for the same day produces the same files.
order_id user_id order_date amount
0 1 57 2025-01-01 19.64
1 2 128 2025-01-01 66.85
2 3 151 2025-01-01 11.51
3 4 150 2025-01-01 56.81
4 5 196 2025-01-01 13.16
from pathlib import Path
def compute_day(ds):
o = orders[orders["order_date"] == ds]
return {"ds": str(ds.date()),
"daily_revenue": float(o["amount"].sum()),
"daily_dau": int(o["user_id"].nunique())}
def backfill(start, end, out_dir):
Path(out_dir).mkdir(exist_ok=True)
written = []
for ds in pd.date_range(start, end, freq="D"):
rec = compute_day(ds)
path = Path(out_dir) / f"ds={rec['ds']}/data.parquet"
path.parent.mkdir(parents=True, exist_ok=True)
pd.DataFrame([rec]).to_parquet(path, index=False)
written.append(rec)
return written
written = backfill("2025-01-15", "2025-01-19", "/tmp/dm_partitions")
for r in written: print(r)
{'ds': '2025-01-15', 'daily_revenue': 824.22, 'daily_dau': 8}
{'ds': '2025-01-16', 'daily_revenue': 1033.55, 'daily_dau': 8}
{'ds': '2025-01-17', 'daily_revenue': 568.84, 'daily_dau': 8}
{'ds': '2025-01-18', 'daily_revenue': 499.1, 'daily_dau': 8}
{'ds': '2025-01-19', 'daily_revenue': 455.85, 'daily_dau': 8}
Partition-per-day + overwrite is the lakehouse pattern. Re-running for the same day produces the same files — that is idempotency.
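A cheap way to prove that idempotency claim in a test (sketch; assumes the orders frame is static during the run):
first = backfill("2025-01-15", "2025-01-15", "/tmp/dm_partitions")
second = backfill("2025-01-15", "2025-01-15", "/tmp/dm_partitions")
assert first == second, "re-running the same partition changed the output"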
Edge cases & gotchas (2)
- Idempotent re-run
Running backfill for 2025-01-15 twice must produce the same files. Overwriting the partition (Path.rename or Delta MERGE) achieves this; appending does not.
- Late-arriving rows mid-backfill
If new rows for 2025-01-15 land while you backfill, you may miss them. Snapshot the source (e.g., a fixed read offset) before computing.
Fix · Python
# Snapshot orders to a versioned read
orders_snap = orders[orders["order_date"] <= ds]  # bound by your run timestamp
Idempotent ETL: write a daily partition with overwrite-by-date semantics.
[Apple · Daily Fact Publication] The publish-a-daily-fact-table pattern every senior DE has written 50 times. Production reality: on object stores this becomes Delta / Iceberg replaceWhere or BigQuery's WRITE_TRUNCATE with a partition decorator. The shape of the answer is what matters.
ds daily_revenue daily_dau
0 2025-01-01 50008.19 9486
1 2025-01-02 79294.66 10292
2 2025-01-03 55245.52 10806
3 2025-01-04 51104.66 9132
4 2025-01-05 55714.88 9883
from pathlib import Path
import shutil
def publish_daily_fact(df, ds, root):
assert (df["ds"].astype(str) == ds).all(), "frame contains other dates"
target = Path(root) / f"ds={ds}"
tmp = Path(root) / f".staging_{ds}"
if tmp.exists(): shutil.rmtree(tmp)
tmp.mkdir(parents=True)
df.to_parquet(tmp / "data.parquet", index=False)
if target.exists(): shutil.rmtree(target)
tmp.rename(target)
return target
ds_str = "2025-01-20"
dm_today = dm[dm["ds"].astype(str).str[:10] == ds_str].copy()
dm_today["ds"] = dm_today["ds"].astype(str).str[:10]
out = publish_daily_fact(dm_today, ds_str, "/tmp/dm_curated")
print("published:", out)
print(list(out.iterdir()))
published: /tmp/dm_curated/ds=2025-01-20
[PosixPath('/tmp/dm_curated/ds=2025-01-20/data.parquet')]
Production reality: on object stores this becomes Delta / Iceberg replaceWhere or BigQuery's WRITE_TRUNCATE with a partition decorator. The shape of the answer is what matters.
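For reference, the same overwrite-by-partition expressed against Delta Lake — a sketch that assumes a Spark session with delta-spark available:
(spark.createDataFrame(dm_today)
 .write.format("delta")
 .mode("overwrite")
 .option("replaceWhere", f"ds = '{ds_str}'")   # rewrite only this partition's rows
 .save("/tmp/dm_curated_delta"))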
Edge cases & gotchas (2)
- Atomic rename across filesystems
Path.rename is atomic only within the same filesystem. On S3, rename is a copy + delete, not atomic. Use Delta/Iceberg replaceWhere or write to a manifest file.
- Frame contains other dates
The assert guards against passing a multi-date frame. Without it, you'd silently overwrite the partition with rows from other days. Always validate the partition key before publish.
Bonus · Privacy & the Pre-/Post-ATT Era
April 2021 was the dividing line. Before iOS App Tracking Transparency (ATT), deterministic user tracking via IDFA was the default; ad-tech and growth pipelines were built around 1:1 user-event joins. After ATT, ~75% of users decline tracking, and SKAdNetwork-style aggregated postbacks become the default signal.
This shift broke half the SQL and Python pipelines on every Marketing Science / Ads / Growth team at Meta, Google, Amazon, and Apple. Interviewers at those companies love to probe whether candidates have lived through it. Below, 5 questions that reframe the existing dataset around the ATT boundary.
Pre- vs post-ATT conversion rate (consent gating).
[Meta · Marketing Science] A growth lead wants the apples-to-apples conversion rate of users who signed up before Apple's ATT prompt rolled out vs after. Post-ATT, only ~25% of users opt in to deterministic tracking — the rest fall back to SKAdNetwork-style aggregated signals.
user_id name email signup_date country att_era att_consent
0 1 user_1 user1@example.com 2024-10-02 IN pre 1
1 2 user_2 user2@example.com 2024-10-04 UK pre 1
2 3 user_3 user3@example.com 2024-10-05 JP pre 1
3 4 user_4 user4@example.com 2024-10-08 JP pre 1
4 5 user_5 user5@example.com 2024-10-09 US pre 1
SELECT u.att_era,
COUNT(DISTINCT u.user_id) AS users,
COUNT(DISTINCT p.user_id) AS purchasers,
ROUND(100.0 * COUNT(DISTINCT p.user_id) / NULLIF(COUNT(DISTINCT u.user_id), 0), 2) AS conv_pct
FROM users_att u
LEFT JOIN purchases p ON p.user_id = u.user_id
GROUP BY u.att_era
ORDER BY u.att_era;
att_era users purchasers conv_pct
0 post 184 116 63.04
1 pre 16 9 56.25
What an interviewer probes: definition discipline. ‘Conversion’ for the post-ATT cohort is fundamentally noisier — purchases that fired but were never reported back to your attribution provider don't show up here. State the bias explicitly before drawing conclusions.
Deterministic-only retention: opt-in users vs full cohort.
[Meta · Attribution] Post-ATT, only the consented users have deterministic event-level data. Compute retention W0/W1/W2 for the consented subset and compare against the full cohort — the gap is the ‘ATT measurement tax’ that survival analysis has to model around.
user_id name email signup_date country att_era att_consent
0 1 user_1 user1@example.com 2024-10-02 IN pre 1
1 2 user_2 user2@example.com 2024-10-04 UK pre 1
2 3 user_3 user3@example.com 2024-10-05 JP pre 1
3 4 user_4 user4@example.com 2024-10-08 JP pre 1
4 5 user_5 user5@example.com 2024-10-09 US pre 1
WITH cohort AS (
SELECT user_id, att_era, att_consent,
DATE_TRUNC('week', signup_date) AS cohort_week
FROM users_att
),
joined AS (
SELECT c.cohort_week, c.user_id, c.att_consent,
CAST(FLOOR(EXTRACT(EPOCH FROM (l.login_date - c.cohort_week)) / (7*86400)) AS INT) AS w
FROM cohort c
JOIN logins l ON l.user_id = c.user_id
WHERE l.login_date >= c.cohort_week
)
SELECT att_consent,
COUNT(DISTINCT user_id) FILTER (WHERE w = 0) AS w0_users,
COUNT(DISTINCT user_id) FILTER (WHERE w = 1) AS w1_users,
COUNT(DISTINCT user_id) FILTER (WHERE w = 2) AS w2_users,
ROUND(100.0 * COUNT(DISTINCT user_id) FILTER (WHERE w = 1)
/ NULLIF(COUNT(DISTINCT user_id) FILTER (WHERE w = 0), 0), 1) AS w1_pct,
ROUND(100.0 * COUNT(DISTINCT user_id) FILTER (WHERE w = 2)
/ NULLIF(COUNT(DISTINCT user_id) FILTER (WHERE w = 0), 0), 1) AS w2_pct
FROM joined
GROUP BY att_consent
ORDER BY att_consent DESC;
att_consent w0_users w1_users w2_users w1_pct w2_pct
0 1 1 3 3 300.0 300.0
1 0 15 14 14 93.3 93.3
The senior signal: consented-only retention is biased — the people who opt in are systematically different (engaged power users disproportionately accept). Calling out the selection bias before the interviewer asks is the difference between Senior and Staff.
SKAdNetwork-style aggregated postback: bucket conversions, no user IDs.
[Apple · App Store / Meta · CAPI] SKAdNetwork postbacks arrive 24-48h after the conversion, with no user_id — only an aggregated ‘conversion value’ bucket per (campaign, source app, day). Build the equivalent: bucket purchases into low/med/high CV per user-cohort-day, then aggregate to (campaign, day) without ever exposing user_id.
purchase_id user_id amount purchase_date product_category
0 1 149 321.30 2024-11-01 Sports
1 2 82 36.66 2024-11-01 Clothing
2 3 19 70.12 2024-11-01 Clothing
3 4 98 122.21 2024-11-01 Food
4 5 59 5.33 2024-11-01 Electronics
import pandas as pd
# Conversion-value mapping (CVM): bucket amount into 4 levels (Apple gives 6 bits = 64)
def cv_bucket(amt):
if amt < 50: return "cv0_low"
if amt < 150: return "cv1_med"
if amt < 300: return "cv2_high"
return "cv3_premium"
p = purchases.copy()
p["cv_bucket"] = p["amount"].apply(cv_bucket)
p["postback_dt"] = p["purchase_date"] + pd.Timedelta(hours=36)
# Synthesize a campaign id (would come from ad_attributed_install in real life)
p["campaign_id"] = (p["user_id"] % 5).astype(str).radd("camp_")
# Aggregated postback: NO user_id leaves this aggregation
postback = (p.groupby(["campaign_id", p["postback_dt"].dt.date, "cv_bucket"])
.size().rename("conversions").reset_index())
postback.columns = ["campaign_id", "postback_date", "cv_bucket", "conversions"]
# Apply a k-anonymity threshold: drop buckets below a minimum count (3 here for the sample; real SKAN tiers are stricter)
postback = postback[postback["conversions"] >= 3]
print(postback.head(10))
Empty DataFrame
Columns: [campaign_id, postback_date, cv_bucket, conversions]
Index: []
Real-world callout: Apple's SKAdNetwork enforces a minimum threshold (originally called the ‘crowd anonymity tier’) to prevent re-identification. If your aggregation has rows with conversions=1, you've broken privacy. Always apply the k-anonymity filter inside the aggregation function, not as an afterthought downstream.
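One way to honor that — a sketch where the threshold lives inside the aggregation boundary, so sub-threshold rows never leave the function (k is a parameter you would set to the real crowd-anonymity tier):
def aggregate_postbacks(df, k=10):
    agg = (df.groupby(["campaign_id", df["postback_dt"].dt.date, "cv_bucket"])
             .size().rename("conversions").reset_index())
    agg.columns = ["campaign_id", "postback_date", "cv_bucket", "conversions"]
    return agg[agg["conversions"] >= k]   # rows below k are suppressed before anything is returned
postback = aggregate_postbacks(p, k=3)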
Deterministic vs probabilistic attribution: which campaign drove the conversion?
[Meta · Marketing Science / Apple · AEM] Pre-ATT: every conversion has a deterministic last-touch campaign. Post-ATT: SKAN gives one aggregated postback per install per campaign, with no user-level link. Build both attribution views side-by-side from the same purchase data, and surface the divergence — that gap is what marketing teams now have to model.
event_id user_id event_type event_date
0 1 1 signup 2025-01-04
1 2 1 activation 2025-02-08
2 3 1 purchase 2025-03-13
3 4 2 signup 2025-02-17
4 5 2 activation 2025-02-24
import pandas as pd
# Synthetic deterministic last-touch: every signup-event mapped to a campaign
ev = events.copy()
ev["campaign_id"] = (ev["user_id"] % 5).astype(str).radd("camp_")
last_touch = (ev.sort_values(["user_id", "event_date"])
.drop_duplicates("user_id", keep="last")
[["user_id", "campaign_id", "event_date"]])
p = purchases.copy()
det = (p.merge(last_touch, on="user_id", how="left")
.groupby("campaign_id", as_index=False)
.agg(det_conversions=("purchase_id", "count"),
det_revenue=("amount", "sum")))
# Probabilistic: assume SKAN reports only 25% of post-ATT users w/ campaign
import numpy as np
np.random.seed(7)
p["reported"] = np.random.rand(len(p)) < 0.25
prob = (p[p["reported"]]
.merge(last_touch, on="user_id", how="left")
.groupby("campaign_id", as_index=False)
.agg(prob_conversions=("purchase_id", "count"),
prob_revenue=("amount", "sum")))
# Project the under-reported probabilistic view back up to full
prob["prob_conversions_scaled"] = prob["prob_conversions"] / 0.25
prob["prob_revenue_scaled"] = prob["prob_revenue"] / 0.25
cmp = det.merge(prob, on="campaign_id", how="outer")
cmp["det_vs_prob_pct_gap"] = (
100 * (cmp["prob_revenue_scaled"] - cmp["det_revenue"])
/ cmp["det_revenue"].where(cmp["det_revenue"] > 0)
).round(1)
print(cmp.round(2))
campaign_id det_conversions ... prob_revenue_scaled det_vs_prob_pct_gap
0 camp_0 43 ... 4708.56 2.7
1 camp_1 47 ... 2577.72 -20.9
2 camp_2 82 ... 8435.84 33.8
3 camp_3 70 ... 9506.56 12.4
4 camp_4 57 ... 3351.84 -96.8
[5 rows x 8 columns]
Why this question is asked: any DE on a Marketing Mix / MMM team gets this within the first month. The shape of the answer interviewers want: (a) build both attribution lenses, (b) compute the gap, (c) name the bias (selection on opt-in), (d) propose a calibration approach (geo-lift test, holdback, MMM).
Identity stitching: deterministic IDFA-based vs probabilistic post-ATT.
[Meta · Identity / Amazon · Ads] Pre-ATT, you stitch sessions to users via IDFA (deterministic). Post-ATT, you have to fall back on probabilistic graphs (IP + device + behavioral fingerprints). Show both: a deterministic merge, and a probabilistic merge that allows ambiguous matches and reports a confidence score.
session_id user_id platform session_date duration_sec
0 1 51 android 2025-01-01 388.0
1 2 73 android 2025-01-01 167.0
2 3 32 android 2025-01-01 2493.0
3 4 75 android 2025-01-01 1881.0
4 5 8 android 2025-01-01 1217.0
import pandas as pd
s = sessions[["session_id", "user_id", "platform", "session_date"]].copy()
prof = user_profiles.copy()
# DETERMINISTIC: clean 1:1 join on user_id (pre-ATT, IDFA available)
det = s.merge(prof, on="user_id", how="left")
det_match_rate = det["city"].notna().mean()
# PROBABILISTIC: drop user_id (simulate ATT), match on shared date buckets
# In real life: IP + device fingerprint + behavioral signals. Here a shared date is the (deliberately poor) proxy.
prof_p = prof.copy()
prof_p["match_date"] = prof_p["updated_at"]
s_p = s.copy()
s_p["match_date"] = s_p["session_date"]
prob = s_p.merge(prof_p[["match_date","city","name"]], on="match_date", how="left")
# Each session may now have multiple ambiguous candidates - count fan-out
fanout = prob.groupby("session_id").size()
ambiguous_pct = (fanout > 1).mean()
print(f"Deterministic match rate (pre-ATT): {det_match_rate:.1%}")
print(f"Probabilistic ambiguity (post-ATT): {ambiguous_pct:.1%} sessions have >1 candidate")
print(f"Avg candidates per session: {fanout.mean():.2f}")
Deterministic match rate (pre-ATT): 97.8%
Probabilistic ambiguity (post-ATT): 73.4% sessions have >1 candidate
Avg candidates per session: 4.84
Senior framing: probabilistic stitching is never ‘wrong’ — it's a probability distribution. The interviewer wants to hear you acknowledge the confidence-score model (e.g., logistic over feature-similarity scores) and the downstream impact: every funnel rate becomes a posterior estimate, not a point value.
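A crude starting point for that confidence score — a uniform prior over the ambiguous candidates (a real model would score IP/device/behavioral similarity with something like a logistic regression):
prob["match_confidence"] = 1.0 / prob.groupby("session_id")["session_id"].transform("size")
print(prob["match_confidence"].describe().round(3))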
Part 5 · Closing & Further Reading
Two things separate candidates who pass these rounds from candidates who don't, and neither is ‘knowing more SQL’.
1. You restate the problem before you write code. ‘So you want X grouped by Y for the last Z days, treating nulls as W — is that right?’ That single sentence is the difference between a hire and a no-hire on most loops.
2. You name the tradeoff before the interviewer asks. ‘I am using a window here instead of a correlated subquery because it is a single pass over the table; the cost is that we need the data sortable in memory.’ Volunteered tradeoffs signal seniority.
What to do with this bank
- Run the dataset locally. Postgres, DuckDB, or pandas. Querying real rows beats reading SQL on a screen.
- Hide the answer first. Write your version, then expand the card. The compare step is where insight forms.
- Translate. For each question, write both the SQL and the pandas form. Switching fluently is the senior signal.
- Time-box. Tier 1–2 in 5 minutes; Tier 3 in 10; Tier 4 in 15; Tier 5 in 20. If you stall, peek, learn, redo from scratch tomorrow.
- Mock with a friend. Speaking out loud — narrating clarifying questions, naming tradeoffs — is the only way to build that muscle.
Companion articles on PaddySpeaks
- Data Engineering Interview Prep — Senior / L5 Deep Dive — modeling, batch, streaming, Spark internals, lakehouse, 40+ scenarios.
- Ads Data Engineering Interview Prep — auctions, attribution, identity & consent, ads SQL.
- The Bhagavad Gita of Data Engineering — the philosophical companion piece on craft, dharma, and the discipline.
100 questions. One dataset. Real outputs. The rest is reps.