Primer · Ads 101
Ads 101 · the 60-second world.
The interview is a story about ads data. If the data world is new, the story will be hard to follow. This primer gives you the three things you need before Act 1: how the money actually moves, the five companies that shape the entire industry, and the eleven specific ways ads data rots in production. Fifteen minutes. Then Raj.
ad_rank = bid × predicted_action_rate × quality_factor. The ranker is usually a deep-learning model that predicts pCTR and pConvert from hundreds of features (user history, creative embeddings, context). A second-price auction (GSP / VCG variants) clears: highest ad_rank wins, pays just enough to beat the second-highest ad_rank. Loser logs are written too — they are precious for ranker retraining. Privacy budgets weight the final score on iOS / Privacy Sandbox paths.| rank | advertiser | bid | × pAction | × quality | = ad_rank | paid |
|---|---|---|---|---|---|---|
| 1 | Nike · running shoe | $2.40 | 0.034 | 0.92 | 0.0751 | $1.92 |
| 2 | Asics · marathon gel | $2.10 | 0.029 | 0.94 | 0.0572 | — |
| 3 | Strava · Premium | $1.80 | 0.028 | 0.81 | 0.0408 | — |
| 4 | Garmin · watch | $3.20 | 0.011 | 0.72 | 0.0253 | — |
| 5 | Local gym · signup | $0.90 | 0.021 | 0.88 | 0.0166 | — |
ad_rank of 0.0572 at Nike's quality-adjusted rate. That's the second-price savings ($0.48) that makes advertisers trust the auction.is_valid = FALSE on the impression. Everything downstream uses this stamp.ad_id, user_id, and timestamp. For video: pause, resume, skip, complete all stream back. For CTV there is usually no click — just exposure and completion. For every event, the platform re-resolves identity against the graph, applies consent rules (GDPR, CCPA, iOS ATT), and may drop or aggregate the event to satisfy privacy requirements.event_time + attribution_time) you cannot reproduce last week's dashboard and the CFO's question "what did we tell the board on the 30th" becomes unanswerable.is_valid=FALSE column); post-impression forensics by a vendor like DoubleVerify or HUMAN catches the rest — that's the ivt_credits_usd column on the Finance ledger that lags 30 days.event_time + attribution_time — so every historical number is reproducible.fx_rate_to_usd snapshotted at event time — reproducible. Finance applies an end-of-month revaluation — fx_revaluation_usd. Both are correct; they answer different questions.campaign_id; the dim table has the new name. Dashboards straddle.Now the room.
You know how a single ad happens. You know which five companies wrote the rules. You know the eleven ways the data lies. That is what the interview assumes you walked in with.
Raj is across the desk. The Slack storm is already three minutes old. Scroll.
Five Rooms · By Company
The same round · different walls.
The main article runs Raj at StreamCo because you can only be in one room at a time. If you are interviewing at Meta or Google or Amazon or Apple instead, the opening ninety seconds rhyme but do not repeat. The interviewer is different. The Slack storm is different. The first question names a pain that is specific to that company's stack, model, and business. The shape of the test is identical.
"Quality Score moved from 8 to 3 overnight on a high-spend keyword. The ranker is a black box to you — you don't have read access to the model. Walk me through how you debug this, and what you tell the advertiser in ninety minutes."
"Sponsored Products auction logs are clean on our side. Retail's order-level pipeline is clean on their side. They disagree by $1.8M on a $47M day. The join is somewhere in the middle. Walk me through how you find the seam, and what you tell Retail Finance by end of day."
customer_id on the ads side equals customer_id on the retail side. Historically they had different namespaces; the lookup table is the seam, and it lags 6 hours on Prime Day."You're the tech lead. It's 09:47. The director will DM you in thirty seconds. What do you say, and why?"
"Advertiser says SKAN postbacks are broken. You've verified the pipeline is healthy. The 41% number is real; the 94% number was also real six months ago. What happened in between, and how do you explain it to the developer in writing by end of day?"
| Company | Primary metrics | Primary dimensions | Attribution model | Canonical use cases |
|---|---|---|---|---|
| MetaFeed / Reels / Stories |
CPM CPC CPA ROAS Video 3s / 75% / thru CTR On-Meta vs off-Meta conv
|
Campaign → Ad set → Ad Objective (awareness / traffic / conversions / sales) Placement (Feed / Reels / Stories / Messenger) Audience (Lookalike / Interest / Custom) Device (iOS vs Android — matters massively post-ATT) |
7-day click + 1-day view default. iOS uses AEM — aggregated, up to 72h delay, privacy-budget-gated. Android still deterministic. CAPI (server) + Pixel (client) must be deduped via event_id.
|
DTC conversion optimization, app installs, Lookalike scaling, retargeting via Custom Audiences, brand awareness on Reels, sales via Advantage+ Shopping. |
| GoogleSearch / YouTube / Display |
Quality Score Ad Rank CPC CPA ROAS Impression share Abs Top IS View-through conv
|
Campaign → Ad group → Ad / Keyword Match type (exact / phrase / broad) Network (Search / Display / YouTube / Shopping) Device / Geography / Time-of-day Audience (In-market / Affinity / Custom segments) |
Data-Driven Attribution (DDA) default as of 2023, 30-day click window for Search, 1-day view for Display/Video, cross-device via Google account sign-in. Privacy Sandbox migrating Display away from 3P cookies. | High-intent Search for bottom-funnel, YouTube pre/mid-roll for awareness, Display retargeting, Performance Max for auto-optimized full-funnel, Shopping for e-comm. |
| AmazonSponsored / DSP / Retail |
ACOS TACOS New-to-brand % SP impressions SB clicks DSP CPM Retail attributed sales
|
Campaign → Ad group → Keyword / ASIN Placement (Top of search / Product page / Rest) Match type (broad / phrase / exact / auto) Customer segment (Brand-new-to-brand / Repeat) Category / Subcategory / ASIN |
14-day purchase window (Sponsored Products default), 7-day view / 14-day click (DSP), deterministic because Amazon owns the purchase event. Retail ↔ Ads ID reconciliation is the seam — 6-hour lag on Prime Day. | SKU-level sales via Sponsored Products, brand discovery via Sponsored Brands, off-Amazon retargeting via DSP, Prime Day amplification, Fire TV / Twitch CTV, subscribe-and-save retention. |
| Netflix · StreamCoCTV · this article's focus |
CPM Completion rate Reach Frequency Unique households Attributed off-platform conv On-screen time
|
Campaign → Creative Household segment (subscriber tier / geo / genre affinity) Content genre / specific title Daypart (prime / late night / weekend) Device type (Smart TV / Mobile / Tablet) |
Exposure → off-platform conversion via Xandr postbacks. Nielsen-panel reach & frequency. Partner SSP postbacks on 14-day lag. No click attribution — most CTV surfaces can't be clicked. Identity is the Netflix account + household stitch. | Direct-to-advertiser CPM buys, programmatic CTV via Xandr exchange, brand-lift studies, TV-style reach & frequency planning, content-adjacency targeting (show-level). |
| AppleApp Store · SKAN |
SKAN installs Conversion value distribution TTR (tap-through) Search Ads impressions CPP (cost per purchase)
|
Campaign (SKAN limit: 100 per app) Keyword (Search Ads) Creative set / Asset Country / region (privacy threshold varies by scale) Time-of-day / iOS version |
SKAdNetwork postbacks, delayed 24–48h + randomization window. Conversion value encoded in 6 bits = 64 values. Privacy thresholds suppress postbacks when cohort sub-populations fall below the crowd-anonymity floor. Never user-level. | App install campaigns (primary), Search Ads for App Store discovery, Apple News / Stocks for awareness, privacy-first attribution for regulated industries (finance / health). |
| Platform | Layer · what it does | Meta | Amazon | Netflix | Apple | |
|---|---|---|---|---|---|---|
| KafkaLinkedIn OSS | Ingest · durable, replayable event log. Topics per event type. Schema registry in front. | Internal fork; trillions of events/day in the ads path. Produces bid logs, impressions, conversions. | Rarely — Google runs proprietary pub-sub (Colossus-backed). Kafka appears at acquired units (Looker, Fitbit). | MSK (managed Kafka) for Sponsored Ads auction logs. Kinesis dominates elsewhere in AWS. | Primary ingest for ads events. This article's architecture lands on Kafka in Act 3 Layer 5. | Internal pub-sub (Apple's own). Kafka not public-facing at scale. |
| KinesisAWS managed | Managed streaming · zero-ops ingest. Better than Kafka on small teams; worse on replay >7 days. | — | — | Default everywhere except where replay >7d is required. Firehose → S3 for cheap retention. | Considered; rejected because 28-day attribution replay needs Kafka's topic retention. | — |
| FlinkAlibaba-originated | Stream processing · stateful. Windowed aggregations, pattern detection, streaming joins with exactly-once. | Stream processing for attribution + fraud detection. Heavy Flink shop. | Dataflow (proprietary, Apache Beam OSS) is Google's Flink-equivalent. Same category, different implementation. | Flink Kinesis Data Analytics for near-real-time pacing and IVT pre-bid flagging. | Flink on Kubernetes for streaming identity-graph resolution + ROAS guardrails. | Internal stream processor; Flink not public. Postback batching happens in batch, not stream. |
| SparkDatabricks stewardship | Batch + structured streaming · large-scale ETL, ML feature pipelines, reprocessing. | PyTorch on Spark for ranking-model feature generation. | Used throughout ads ETL + ranker training. Dremel/BigQuery supplant for ad-hoc. | EMR Spark for nightly reconciliation + backfills. Retail-scale. | Iceberg-on-Spark for the silver/gold batch layer. Attribution backfills after graph upgrades. | Internal; SKAN postback aggregation is Spark-equivalent. |
| IcebergNetflix OSS | Table format · ACID on a data lake. Schema evolution, time-travel, hidden partitioning, snapshot isolation. | Growing — being evaluated as the warehouse-lake format across ads. | BigQuery is the warehouse; Iceberg appears at BQ-external-table adoption sites. | Iceberg-on-Glue for Sponsored Ads fact tables. Schema evolution without downtime is the killer feature. | Origin shop. Iceberg was born here. Every gold table in this article's architecture is Iceberg. | Internal columnar format; not Iceberg. |
| Delta LakeDatabricks OSS | Table format · ACID like Iceberg, tighter Databricks coupling, slightly different partition model. | — | — | Some adoption on Databricks-hosted workloads. Not Retail-primary. | — | — |
| HudiUber OSS | Table format · optimized for upserts and late-arriving data. Strong at CDC ingestion. | — | — | — | Considered for Finance ledger CDC; lost to Iceberg for consistency with rest of stack. | — |
| BigQueryGoogle proprietary | Serverless warehouse · petabyte SQL, minute-to-minute billing, external table federation. | — | Primary analytics engine across Ads. Dremel is the guts. | — | — | — |
| SnowflakeSaaS warehouse | Warehouse · multi-cloud, separation of storage and compute, zero-copy clones. | — | — | Retail Finance reporting; not the ads primary. Ads leans on Redshift + Iceberg. | BI + exec dashboards. Ads core pipelines are Iceberg + Spark; Snowflake is the consumer tier. | — |
| Presto / TrinoMeta OSS | Query engine · federated SQL across warehouses and lakes. Interactive analytics. | Origin shop. Presto was built at Meta. Primary interactive SQL. | Trino/Athena analogues live alongside BigQuery. | Athena (managed Trino) on S3. Interactive ad-hoc on raw logs. | Trino on Kubernetes for ad-hoc SQL on Iceberg tables. | — |
| dbtSaaS transform | Transform layer · SQL-native DAG, tests, lineage, metric YAMLs, CI for transformations. | Internal analogue (not public dbt). Meta has its own SQL-DAG tooling. | Internal tooling. dbt appears in partner-team adoption. | Retail Finance on dbt; ads analytics also. Core ads pipelines are custom. | Primary semantic layer in this article's architecture (Act 3 Layer 2). Metric YAMLs + CI tests. | — |
| AirflowAirbnb OSS | Orchestration · scheduled DAGs for batch jobs, backfills, dependency management. | Internal scheduler (Dataswarm). Airflow in peripheral teams. | Composer (managed Airflow) on GCP; also internal schedulers. | MWAA for batch ETL orchestration. Step Functions for event-driven pipelines. | Maestro — Netflix-built orchestrator, public in 2024. Replaces Airflow at scale. | Internal scheduler. |
| Druid / PinotMetamarkets / LinkedIn OSS | OLAP · low-latency slice-and-dice on high-cardinality time-series. Dashboards over billions of rows. | Presto + internal columnar. Pinot appears in adjacent teams. | Proprietary. | Druid for real-time advertiser dashboards. Pinot under evaluation. | Druid for real-time ROAS dashboards and advertiser self-serve. Heavy Druid shop. | — |
| Xandr · DV360Ad-stack vendors | Programmatic SSP/DSP stack · ad serving, auction-running, postbacks for partner attribution. | Owns full stack in-house (Meta Ads Manager). | Owns full stack (Google Ad Manager + DV360). | Owns full stack (Amazon DSP). | Xandr partnership is the backbone of CTV monetization. Postbacks are the 14-day lag in Act 2. | Apple Search Ads is proprietary; SKAdNetwork is the SSP-equivalent framework. |
Whichever company you are interviewing at, the main arc (Brief + Acts 1–7 + Appendix) plays at StreamCo because that is the one you can sit inside for forty-five minutes end-to-end. Read it all. Then:
- If you're interviewing at Meta — re-read Act 5 (Everything Breaks) substituting AEM postback delay for identity graph upgrade. The shape of the debug is identical; the specific column is different.
- If you're interviewing at Google — Act 3 (Architecture) is still canonical; the four revenue tables become four Quality Score input streams. Act 4 (Defense) around auction-log vs offline-log reconciliation lands identically.
- If you're interviewing at Amazon — Act 2 (reconciliation) is the money act. Substitute Retail ledger for Finance ledger, purchase-attribution window for partner-postback lag. Everything else holds.
- If you're interviewing at Netflix · StreamCo — read straight through. You are the main character.
- If you're interviewing at Apple — Act 5 is still a measurement-change, not a bug — the classification move is the same. Substitute SKAN privacy threshold for identity graph version. The paragraph you write to the developer is longer because the math is harder to explain.
Now keep scrolling. The Brief is next. Raj is waiting.
Question Bank · soft → hard
What they actually ask.
No interviewer opens with "design a SKAN postback aggregator." They start soft — your last pipeline, your hardest metric — and build up. By question twenty they're testing real debugging instincts. By question thirty they're testing how you talk to a CFO with incomplete data. This bank mirrors that arc. Read it top to bottom; it is the realistic difficulty gradient you will actually walk through.
- Tell me about an ads or measurement pipeline you owned end-to-end.
- What broke in that system, and how did you detect it?
- What was the hardest metric to make trustworthy?
- What is the difference between an impression, a click, and a conversion?
- When can impressions increase but revenue stay flat?
- What does CTR actually hide?
- How does attribution differ between onsite ads vs external ads?
- When would you trust last-click attribution — and when not?
- What are failure modes of attribution systems?
- How do you define an "impression" on a TV where clicks don't exist?
- If a user watches an ad on TV and converts on mobile, how would you measure it?
- What makes TV ads harder to measure than mobile ads?
- How would you attribute a conversion across devices without a deterministic ID?
- What are tradeoffs between accuracy and coverage in cross-device attribution?
- How do you avoid double-counting conversions?
- Explain how a second-price auction works in ads.
- Why might CPC increase even if demand is unchanged?
- What inputs influence auction outcomes in practice?
- Impressions up 30%, clicks flat — what could explain it?
- Conversions drop 10% after a UI change — what are three non-bug reasons?
- Revenue steady, but CTR drops — why?
- CPC jumps from $1.20 → $1.80 overnight. What do you check first?
- Spend drops 40% but traffic is stable — where could the issue be?
- Two dashboards show different revenue for the same campaign — how do you debug?
- Where can data loss happen in an ads pipeline?
- How do you design idempotent pipelines for event ingestion?
- What's the hardest part of reconciling ads data across systems?
- "Why did ROAS drop?" — answer in 30 seconds.
- "Can we trust this metric?" — what do you say?
- "Are we losing money right now?" — how do you respond with incomplete data?
- Why might ads revenue not match billing data?
- Explain ACOS vs TACOS and when each matters.
- Where would a $100K discrepancy hide in ads vs retail systems?
customer_id ≠ Retail customer_id) plus attribution-window edges (returns posted after the 14-day window).- How do you detect fraud or invalid traffic?
- What happens if event timestamps are delayed or out of order?
- How would you backfill missing conversion data?
Read top to bottom once. Pick the three weakest tiers. Practice answering them out loud, with a timer. Tier 1 in two minutes; tier 13 in five. If a question makes you reach for a definition, that's the one to drill — not the ones you can recite.
The interview will not ask all 39. It will ask 4–6, drawn from tiers 1–8 most of the time, with one tier-10 communication question dropped in around minute 50. The acts you scroll into next are the round playing out at full speed against this bank.
Now scroll. The Brief is next.
The Brief · 00:00 · problem drops
You are interviewing.
The rest of this article is not an article. It is the next forty-five minutes of your life.
- The next thirty seconds. A live reply to the director before he re-reads the CFO's message.
- The reconciliation. Four definitions of revenue, one query, one paragraph, one number.
- The architecture. A platform that makes this morning impossible by Friday of next week.
- The defense. Six volleys from Raj, each one a failure pattern he has watched candidates lose on.
- A page at 02:17 AM. ROAS up 18% overnight. Finance sees it at 9:00.
- A meeting at 08:30 AM the next day. The CFO. Two dashboards. Seventeen minutes.
- The last four minutes of the round. Raj's meta-question — the one that's actually scored.
- The appendix. Seven drills. If you can't solve them cold, none of the above carries in the room.
- Read linearly. The sections are hours of one day, not topics. Skipping to SQL is skipping to minute 35 of a 45-minute conversation.
- Read every "What you say" card out loud. If you can't land the line at conversational pace, you can't land it in the room.
- Solve the drills before reading the answers. The appendix is not a reference. It is a pretest.
- Every sidebar marked "The Landmine" is a specific way candidates lose this round. They do not lose on knowledge. They lose on instincts. The landmines are the instincts.
Thirty minutes. CFO on Zoom in twenty-five. What do you check first?
Before you keep reading — answer out loud. Not the "right" answer. Your answer. Commit to it. Write it in the margin if you have to. What do you type into the director's DM in the next thirty seconds?
Pick one. Now. Then scroll. Act 1 grades you.
Raj's screen: #incidents-ads-de. The CFO's message is already three minutes old. She wants an answer by 10:30. You have not drawn a single box yet. Scroll.
Round 0 + 1 · 0–5 min · Drop + Clarify
ad_rank wins, pays $0.01 more than the runner-up's ad_rank at the winner's quality factor — say $1.92. The winner saves $0.48; the auction is incentive-compatible." Can you do the arithmetic out loud? This separates "I've read about ads" from "I've shipped ads."That's it. No scale number. No latency target. No source-of-truth hint. No mention of IVT, attribution, identity, Finance, or the CFO. Just the sentence above, and a marker being pushed across the desk toward you.
Raj is not waiting for an architecture diagram. He is watching the next five seconds. Do you panic, or do you structure?
You are in Round 1.
The interviewer is Raj — Staff Engineer, ads platform team at StreamCo. He has been in the chair on your side of this conversation four times in the past two weeks. He has seen enough candidates freeze on this round to know that the freeze is the thing being scored.
Design a unified ads metrics platform at StreamCo scale. 260M subscribers. ~2B impressions/day. Four downstream consumers: Finance, Marketing, Advertisers, Sellers. Four weeks to ship v1.
"You're the tech lead. It's 09:47. The director will DM you in thirty seconds. What do you say, and why?"
In fifteen seconds, say your answer out loud. Not the "right" answer. Your answer — what you would actually type into the director's DM right now, with Raj watching, with the clock at 00:33.
No peeking. Commit. Then read A / B / C and find out whether the interview would have continued.
If you could not produce a sentence — that is the signal. The three candidates Raj has seen freeze on this exact second lost the round here, before A / B / C existed. The round is scored on whether the reply came, not whether it was perfect.
A"Both are right. They measure different things."
Probably correct. Also premature — no receipts. Reads as hedging to a CFO. Cost: low if right, catastrophic if slightly wrong.
B"Give me until 10:40. I'll reconcile and come back with a recommendation."
Buys 53 minutes. Commits to a specific deliverable. Authority through specificity. Cost: the whole morning goes to this. Accepted.
C"Dashboard is the source of truth. $8.2M."
Fast, confident, wrong. If Finance arrives independently at $9.1M in the close, the data team is gone for two quarters. Cost: career-ending in the medium term.
"Until 10:40. There are at least four definitions of 'ad revenue' at StreamCo scale. I reconcile, I come back with one number and one recommendation. That's the first thirty seconds. The rest depends on which definition the dashboard is actually computing."
Raj nods once. Timer reads 4:31. He says:
Round 1 · 5–15 min · Clarify or Fail
Four numbers, one word.
SELECT SUM(revenue) FROM ... before asking which revenue, which table, which window. The query they ship reconciles nothing because it was answering a question nobody asked. Clarification is the round. Query is scaffolding.Raj is watching. Forty minutes of round left. He asked for the reconciliation. You produce it out loud, in SQL, while naming the four revenues StreamCo actually runs on — because at this scale they do not collapse into one.
You begin. Speaking to Raj, keyboard in your lap, you say:
Raj nods. You draw four boxes on the sketch pad you have shared to your screen.
You turn back to Raj and continue:
Raj pulls up the shared editor on his side. You start typing.
The query finishes at 10:21. You screenshot the result. Raj watches you do it without comment. You then write the paragraph the director has been waiting for.
You close the editor. You look up at Raj. The timer reads 21:14.
Raj reads the paragraph twice. He says:
He changes his screen back to the whiteboard you had drawn 17 minutes ago. The half-finished architecture is still there — Kafka at the left, a few scattered boxes, the words attribution underlined twice. He looks at it. He looks back at you.
The whiteboard is yours. The incident is your brief. The clock keeps running.
"Sales Ops pulled revenue by Quality Score tier for the 11:00 forecast call. It does not match what the ads-revenue pipeline is emitting. Reconcile. You have thirty minutes and the Sales VP joins at 10:50."
"The Ads ledger posted $47.2M for Q2. Retail ledger posted $45.4M. Seven figures of difference, same quarter, same customers. Before you write SQL — what are the four definitions of 'ad-attributed sale' we could be measuring?"
"The advertiser's MMP (AppsFlyer) reports 10,000 installs attributed to our Search Ads campaign. Our SKAN postbacks show 7,200. The developer thinks we're losing data. Name four reasons the two numbers can legitimately disagree before you accept the 'bug' framing."
Round 2 · 15–30 min · Design
Build a platform that makes this morning impossible.
Clock reads 21:14. Raj wipes the whiteboard. He does not say a word. He caps the marker, sets it down where you can reach it, and leans back. This is the part of the round where most candidates start drawing boxes and arrows. This morning is fresh in your head. You do not reach for the marker yet.
"Now design the platform so September doesn't happen again. Four weeks. v1. 260M subscribers, ~2B impressions/day, four downstream consumers. I want layers, I want contracts, I want the one thing in your design that kills today's bug. Go."
Before you pick up the marker you say one sentence. It is the sentence that frames every box you are about to draw. You want this sentence in Raj's head before a single rectangle appears on the board, because every architectural decision you make after this is a consequence of it.
"The enemy is not latency. It is not cost. It is silent label drift between audiences who each think they are reading the same number. Every layer I draw has one job: make that drift impossible, or make it obvious within five minutes."
Now you draw. Not a wiring diagram — a layer stack. Five layers, top to bottom. Each layer does one thing. Each layer has a contract with the layer above and below. You draw them as stacked rectangles.
fact_impression, fact_click, fact_conversion. Currency normalised to USD at event time. IVT flag materialised here. Identity graph join resolved here, once, with lineage columns.gold.revenue_spend, gold.revenue_attributed, gold.revenue_billed, gold.revenue_settled. Each has a definition header row, a grain contract, and an owning team.gold.revenue. That name is reserved and will never be created. Anyone who writes SELECT revenue FROM … gets a query-parser error with a link to the definitions page.revenue_type column. Defaults will be chosen. The default becomes the bug.The stack is on the board. Raj has not interrupted. You have four minutes of air, so you spend them on the three contracts that make the stack real. Without them the boxes are furniture.
event_time (when it happened) and attribution_time (when we credited it). We never UPDATE. Late postbacks arrive as new rows. Historical dashboards are reproducible at any prior attribution_time.gold.* table is a definition row: metric_name, grain, currency_basis, attribution_model, audience, owner_team, ticket_for_changes. Schema migration to change any of these requires a signed change request from the owning team.attribution_time cohort — the old numbers are preserved, the new numbers are labelled.The SQL shape of the append-only attribution table is the single hardest thing to get right, so you write it on the board:
-- gold.revenue_attributed · append-only, bi-temporal
CREATE TABLE gold.revenue_attributed (
conversion_id UUID NOT NULL,
advertiser_id BIGINT NOT NULL,
campaign_id BIGINT NOT NULL,
event_time TIMESTAMP NOT NULL, -- when the conversion happened
attribution_time TIMESTAMP NOT NULL, -- when we credited it
attribution_model VARCHAR(32) NOT NULL, -- last_click_7d, mta_linear, ...
identity_graph_version VARCHAR(16) NOT NULL, -- pinned for reproducibility
credited_usd NUMERIC(18,4) NOT NULL,
is_superseded BOOLEAN NOT NULL DEFAULT FALSE,
PRIMARY KEY (conversion_id, attribution_time)
)
PARTITION BY RANGE (attribution_time);
-- "what did we show Finance on the 10:37 email on 2025-09-30?"
SELECT SUM(credited_usd)
FROM gold.revenue_attributed
WHERE event_time >= '2025-09-01' AND event_time < '2025-10-01'
AND attribution_time <= '2025-09-30 10:37:00'
AND is_superseded = FALSE;
That last query is the one that makes the platform trustworthy. Any number Finance ever saw is reproducible by plugging the attribution_time ceiling into the same query. "What did we tell you, and when?" becomes a SQL answer instead of a Slack archaeology project.
Raj picks up the marker. Not to draw. To point.
"Design the serving-log ↔ offline-log reconciliation platform for Search. Auction decisions are logged one place, conversions another, no user ID in between on iOS. How does Quality Score ground-truth get built?"
"Design the Retail ↔ Ads attribution bridge. Two identity spaces. Prime Day at 10× scale. Finance sign-off by Friday. Show me the schemas that survive the customer-ID reconciliation lag."
"Design the SKAN postback aggregator that produces advertiser-facing install counts from Apple's 6-bit, delayed, rate-limited, privacy-thresholded stream — without making the numbers look fabricated. And make it explainable in a single paragraph to a developer."
Round 3 · 30–40 min · Defend
Every box gets punched. You defend or you fold.
Clock reads 24:03. Raj has the marker. He is not drawing — he is pointing. He will spend the next nineteen minutes trying to break the architecture you just put on the board. Each challenge is a real bet he has watched candidates lose. You have one correct shape of answer for every question: name the alternative, name the trade, name the harm the trade prevents.
There is no clean way to script this. So here's how it actually goes: six volleys, a clock, and a running counter of whether you are converting pressure into signal or leaking it.
gold.revenue with a revenue_type column? Same data, half the joins."is_superseded filter. This is over-engineered for a four-week v1."attribution_time month and cold-tier anything >90 days to S3 Glacier IR. Cuts hot-storage 80%. On the planner: we add a materialised view gold.revenue_attributed_current that filters is_superseded = FALSE once; consumers hit the view. Keeps the principle, hides the complexity. The one thing I won't do is UPDATE — that's the bug we're here to prevent."gold.revenue_settled — in production by next Friday, with the semantic binding and the definition header. That's the one you need for the close. The other three follow on a published four-week schedule. You get the number you need in seven days and a plan you can show the board for the rest."is_superseded filter. Backfills become a versioning job, not a rewrite.Clock reads 43:22. Raj caps the marker. The forty-five-minute round ends in eighty-eight seconds. You think you are done. You are not.
"BigQuery runs us $40M/year at current scale. Why are you not on Dremel internally? And why Spark for ranker features when TensorFlow Extended exists inside the walls? Cost is my first question; correctness is the constraint."
"Why Redshift and not Snowflake? Retail Finance lives on Snowflake. Your dual-book reconciliation is a cross-team join every day. You just added two hops and a service account. Justify or re-draw."
"Why Spark for SKAN aggregation instead of an internal Apple-built aggregator? Compliance will ask, privacy-review will ask, and both rooms will require a written answer. Walk me through the approval path before you walk me through the stack."
Round 5 · 50–55 min · System Breaks
Black Friday. 2:17 AM. ROAS up 18%.
The platform shipped in late October. The CFO's Friday dashboard has run clean for ten weeks. You are asleep. PagerDuty does not care.
The guardrail is your own. You wrote it in week six, after Raj's 2:17 AM question in the interview turned out to be a real hour at this job. The alert fires on WoW ROAS delta >4% across the top-five advertisers. Eighteen percent is not a drift. Eighteen percent is a bug, a miracle, or a fraud. Finance sees this number at 09:00. You have roughly thirty minutes before the director wakes up and sixty before that meeting starts.
The thing about debugging your own platform is that you know exactly where to look. The architecture is not a mystery. The bug will be in exactly one of four places, and the platform was designed to tell you which one.
bronze.fact_impression row count by hour. Compare to 7-day median. Expected hourly volume at 02:00 PST is ~72M. Actual: 71.4M. Ruled out.attribution_time stamps. Revenue doubles on the same conversions.gold.revenue_attributed — count attribution_time BETWEEN '2025-11-29' AND '2025-11-30' where event_time < '2025-11-01'. Expected: <1%. Actual: 0.4%. Ruled out.identity_graph_version distinct values in the last 24h. Expected: 1. Actual: 2 — v2025.10 flipped to v2025.11 at 20:04 PST. Hot.H3 is hot. You write the query your architecture made cheap:
-- ROAS decomposed by identity graph version, last 24h
SELECT
identity_graph_version,
COUNT(DISTINCT conversion_id) AS n_conversions,
SUM(credited_usd) AS attributed_usd,
SUM(credited_usd) / NULLIF(s.spend_usd, 0) AS roas
FROM gold.revenue_attributed r
JOIN gold.revenue_spend s USING (advertiser_id)
WHERE r.advertiser_id IN (/* top 5 */)
AND r.attribution_time >= NOW() - INTERVAL '24 hours'
AND s.spend_date = CURRENT_DATE
GROUP BY identity_graph_version, s.spend_usd
ORDER BY identity_graph_version;
| identity_graph_version | n_conversions | attributed_usd | ROAS |
|---|---|---|---|
v2025.10 | 142,318 | $1.82M | 3.04 |
v2025.11 | 168,904 | $2.19M | 3.59 |
v2025.10, watch ROAS settle back to +3%, close the page, sleep. Thirty-eight minutes from page to all-clear. Heroic. Also wrong.You do not revert. You do not go back to bed. You write the paragraph.
v2025.11 at 20:04 PST, which resolves cross-device journeys the previous graph missed — +26,586 conversions credited, +$370K attributed, same spend. The platform is doing the right thing. Recommendation: do not revert. For Finance's 9 AM view, I'm publishing gold.revenue_attributed_v10only as a parallel table so the November close runs on the old graph; December onwards runs on v11, which I'll flag in the board prep as a known measurement improvement worth ~5% reattributed revenue. I'll draft an advertiser-facing note by end of day Friday. The alert was correct to fire — an 18% WoW lift should always wake someone up."03:21 AM. You close the laptop. You lie down. You stare at the ceiling for forty minutes. You are still awake at 4:02 when the 4:00 AM batch runs green. You already know the thing the last hour actually tested, and it is not the thing Raj asked.
"04:33 AM ET. A Floodlight tag rollout to 40% of Display advertisers started double-firing last night. Reported revenue is up 8% in advertiser dashboards. Thirty minutes before the daily exec check-in. Real lift or double-count?"
"Prime Day · 18:30 UTC. DSP impression counts spiked 30% in the last ten minutes but Sponsored Ads auction logs did not. Is it real inventory or a pipeline divergence? Retail Finance asks in fifteen minutes — they have a $3M anomaly flag open already."
"03:12 AM PT after an iOS 17.4 release. SKAN postbacks are dropping from previously healthy apps. Is it the OS, a privacy-threshold change, or a bug in our decoder? You have two hours before developers open tickets at scale."
Round 6 · 55–60 min · Recovery
08:29 AM. The CFO's office. Two dashboards, one conversation.
Five hours after the page, you walk into Amelia's office with a laptop and two browser tabs open. One is the dashboard the platform ships to her every morning. The other is a parallel view you built at 03:40 AM to explain tonight. The meeting is seventeen minutes. You do not open your laptop for the first three.
Amelia, the director, and one board-prep analyst. Two screens. Black coffee on the table. No deck. She says: "Walk me through this like I'm the advertiser, not the CFO. I already trust the number — I need to explain it to someone who doesn't."
The meeting is not about the platform. It is about whether the platform can narrate itself. If the dashboard cannot answer "why is tonight's number what it is" without you in the room, the platform is not done, no matter how many tests pass. So you open the laptop. You start with the picture she sees every morning.
The CFO's morning view · conversion funnel
This is what Amelia sees at 07:45 AM, every day, on her phone. One chart, top of the dashboard. Five steps. It answers: how did the whole ad tier perform yesterday, shape-wise. It does not answer a question about dollars — that's the second section below. This chart answers whether the shape of the day was normal.
The funnel · from impression to repurchase
From impression to return purchase — live drop-off
The first thing you point at on the screen is the drop-off between two specific steps. Not the overall number. The shape. Shape reveals measurement changes before totals do.
The tab control at the top is the feature that matters this morning. Switching attribution window from 7-day to 1-day to 28-day shifts the tail of the funnel. Last night's identity-graph change affects 28-day attribution most — by design, more cross-device journeys close within 28 days than within 1. Shape tells the story without dollars.
ROAS by attribution window · the reason you're here
Same campaign. Same reality. Five different ROAS numbers.
This is the chart you built overnight. It is not in Amelia's normal view. You pulled it forward because the only way to explain tonight's +18% is to show ROAS across attribution windows side by side, with the identity-graph version as a visible cohort.
You say one sentence while it loads: "The 28-day window is where the graph change shows up. 7-day barely moves. 1-day is flat. The platform is measuring more accurately — not more optimistically." Amelia looks at the 28-day bar for thirty seconds. She asks one question. It is the question every CFO asks eventually, and the platform has to have a clean answer.
You were ready for this one at 03:40 AM. You built the parallel table for exactly this question.
v2025.11 closes at $9.31M instead of $9.12M. That $190K is real revenue; we under-attributed it at close. I'd recommend we don't restate the Q3 number — the $9.12M was correct under the measurement we had — but we flag the $190K in the board footnote as an identified improvement, and the advertisers affected get a measurement-update note from the partnerships team. I've drafted both. They're in your inbox."She reads the email for ninety seconds. She does not say anything while she reads. When she looks up, the first thing she says is not about the numbers.
You answer honestly.
She nods once. The meeting ends at 08:42. The board prep gets the $9.31M footnote. The advertisers get the note on Tuesday. Nobody mentions the word "incident."
The advertiser view · install-to-event funnel
After the CFO meeting you walk down to the advertiser-success team. They have their own dashboard — the one the platform ships to every self-serve advertiser. This is a different surface for a different audience, and the question it answers is different too. Is my campaign working, in my vocabulary.
The difference between this view and the CFO view is the thing that took you three months to learn at StreamCo: audiences do not translate. Amelia does not want to see "D30 retention by cohort." An advertiser does not want to see "settled revenue, USD basis." The same ad platform produces both surfaces from the same silver tables, bound through the semantic layer, and neither audience is ever confused about which number is which — because the audience tag on the metric is part of the query plan.
The sales-cycle view · lead funnel + cycle time
The third surface is for advertisers running lead-generation campaigns. The shape of a lead campaign is not a funnel; it is a funnel with a time axis, because the bottleneck is days-to-close, not percent-converting. A lead campaign with a 40% MQL-to-close that takes 90 days is worse than a 25% MQL-to-close that takes 14 days, and no one-number ROAS captures this.
This dashboard is the one advertisers screenshot and send to their CFOs. The platform's job is to make sure they don't have to re-label anything before they paste it into the deck — the axis labels, the window, the attribution model are all on the chart itself.
There is one more dashboard below — the campaign-level drilldown. It is a tool, not a narrative surface. You include it because every advertiser eventually asks "but what about my campaign," and the platform needs a no-ceremony way to answer that.
Campaign drilldown · the header card
Period: Apr 1 – Apr 22, 2026
Single campaign, scorecard at the top, trend below, benchmark row for context. Self-serve advertisers bounce between this and the funnel chart above; that pattern drove the layout decisions.
"09:30 call with the Top-20 Sales exec. Explain the Floodlight double-fire incident in business terms for a non-engineer, then walk through the three safeguards you added overnight. The exec represents $400M of annual spend. Pick your three."
"11:00 sync with Retail Finance, Amazon DSP leadership, and Sponsored Ads PM. Three dashboards. Dual-books reconciliation. Walk all three orgs through the corrected numbers and the dual-book reconciliation addendum your team will publish by Friday."
"10:00 Developer Relations call. The top-tier fintech developer is on the line with their head of growth. Explain the SKAN postback anomaly, the iOS 17.4 interaction, and the two lines of config that would prevent recurrence — without referencing any Apple-internal infra by name."
Dashboard · Growth Accounting
Where does MAU actually come from — and where does it leak?
Activet = Retained + New + Resurrected · Activet-1 = Retained + Churned. Net = New + Resurrected − Churned.Dashboard · Where the Money Is Left on the Table
$142,300 gross spend → $89,720 effective spend. Where does the other $52,580 go?
Dashboard · Campaign Improvement Opportunities
Six moves you can make next. Bubble size = monthly $ recovery.
Round 7 · final minute · Signal
The last four minutes.
Clock reads 43:22. Raj caps the marker for the second time. The 2:17 AM story is on the whiteboard. The CFO meeting is on the whiteboard. There are no more scenarios. You think the round is done. Raj leans forward, not back, which is how you know it is not.
"One more. Quick one. You've built the platform. You've defended it. You've debugged it at two in the morning. You've walked the CFO through it. Look at what you did tonight — 02:17 to 02:58 — and tell me what you think this interview was actually testing."
You do not answer for six seconds. This is not thinking time. This is the moment the round pivots, and you need the pivot in your own head before you put it into his.
- Whether you can design an ads platform at StreamCo scale.
- Whether you can defend architectural choices under pushback.
- Whether you can debug production incidents.
- Whether you can explain numbers to executives.
- Whether you know enough SQL, enough Kafka, enough dbt.
- Whether you know when not to trust your own pipeline.
- Whether you can tell a bug from a correction at 2:43 AM.
- Whether you'll refuse a default when defaults are the enemy.
- Whether you'll write a paragraph before you write a query.
- Whether you'll hold a principle when a tactic is easier.
You say it as one sentence because the question asked for one.
"The interview wasn't testing whether I can build pipelines. It was testing whether I know when not to trust them — including the ones I built myself."
Raj does not nod. He also does not write anything down. The absence is the signal. He reaches for the laptop on the side of the desk and closes it — not yours, his. The round is effectively over. He spends the remaining thirty seconds on one thing.
He lets the last principle sit for eight seconds. Then he stands up, which is the exit move he has made on every good candidate in the last year. The interview ends in an unremarkable sentence.
You do not ask what aligned means. You already know. The round scored you on the only thing that matters at this level: whether you know the enemy in your own system. You named it in Act 1 — silent label drift between audiences who think they are reading the same number — and every act after that was one more test of whether you'd forget it under pressure.
You did not forget it. The offer comes three days later.
Ads data engineering at this scale is not about moving data. It is about the architecture that keeps four audiences from telling each other different truths in the same word.
Raj did not hire you because your Kafka topics were elegant. He hired you because when the number moved at 02:17, you knew which number had moved and why — and when the CFO asked at 08:30, you answered with a paragraph instead of a query. The platform is a consequence of those two sentences.
— The senior DE you interviewed as, quieter than the one you walked in being.
Scroll back to The Brief. Look at the $0.9M gap. You know what to say now.
That sentence is the whole interview.
"What's the one decision you made in the last forty-five minutes that you would change if you did the round again? Not the rehearsed answer. The actual one."
"What's the one decision you made today that you are certain about? Explain why that certainty survives the next 18 months of this platform getting bigger."
"What do you hope we do not hire you for?"
Round 4 · Appendix · Deep Dive · SQL
The coding bank.
Seven drills. Twelve minutes of SQL each. Twenty minutes of Python each. If you can solve them without reading the answer first, the narrative half of this article will hold under real pressure. If you cannot, work the drills until you can — the rest of the article is scaffolding around this set.
Close the answer
<details>. Set a timer. Solve on paper or in a blank editor. Do not check the reference until the timer runs out, even if you are stuck. Then compare — not to see if you got the same answer, but to see what edge cases your solution missed.
The bank · seven drills
Given silver.fact_impression (spend_cents, is_valid, fx_rate_to_usd, impression_ts) and finance.ad_revenue_ledger (month, invoiced_usd, refunds_usd, ivt_credits_usd, partner_attribution_adjustment_usd). Produce one row reconciling pipeline spend to Finance settled revenue for September 2025 with an unexplained_gap_usd column.
Edge cases to catch: timezone of impression_ts (is Sep the PST month or UTC month?), is_valid = FALSE treatment, signs of refunds and credits (subtract) vs adjustments (add).
Show the reference solution
WITH pipeline AS (
SELECT
SUM(spend_cents * fx_rate_to_usd) / 100.0 AS spend_revenue_usd,
SUM(CASE WHEN is_valid = FALSE
THEN spend_cents * fx_rate_to_usd ELSE 0 END) / 100.0
AS pipeline_ivt_flagged_usd
FROM silver.fact_impression
WHERE impression_ts >= TIMESTAMP '2025-09-01 00:00:00 America/Los_Angeles'
AND impression_ts < TIMESTAMP '2025-10-01 00:00:00 America/Los_Angeles'
),
ledger AS (
SELECT
SUM(invoiced_usd) - SUM(refunds_usd) - SUM(ivt_credits_usd)
+ SUM(partner_attribution_adjustment_usd) AS settled_revenue_usd,
SUM(refunds_usd) AS finance_refunds_usd,
SUM(ivt_credits_usd) AS finance_ivt_credits_usd,
SUM(partner_attribution_adjustment_usd) AS finance_partner_adj_usd
FROM finance.ad_revenue_ledger
WHERE month = DATE '2025-09-01'
)
SELECT p.*, l.*,
(l.settled_revenue_usd - p.spend_revenue_usd)
- (p.pipeline_ivt_flagged_usd - l.finance_ivt_credits_usd)
+ l.finance_refunds_usd
- l.finance_partner_adj_usd AS unexplained_gap_usd
FROM pipeline p CROSS JOIN ledger l;
What junior solutions miss: timezone boundary — the Finance ledger's "September" is PST, not UTC. Missing this puts 3 hours of Oct 1 UTC impressions in the pipeline side and leaves the gap unexplained by 0.5%.
Table gold.revenue_attributed is append-only with event_time, attribution_time, credited_usd, is_superseded. Reproduce exactly the September total Finance saw at 10:37 AM on 2025-09-30, regardless of any rows inserted afterwards.
Edge: is_superseded as of that moment, not now. You cannot trust today's is_superseded flag.
Show the reference solution
-- Sum rows that (a) existed by 10:37 and (b) were not superseded by any row
-- also created by 10:37.
WITH snapshot AS (
SELECT *
FROM gold.revenue_attributed
WHERE attribution_time <= TIMESTAMP '2025-09-30 10:37:00'
)
SELECT SUM(r.credited_usd) AS finance_view_1037
FROM snapshot r
WHERE NOT EXISTS (
SELECT 1 FROM snapshot s2
WHERE s2.conversion_id = r.conversion_id
AND s2.attribution_time > r.attribution_time
);
Why this is the senior answer: naïvely filtering WHERE is_superseded = FALSE uses today's supersession flags. A row that was current at 10:37 but was superseded three days later would be (wrongly) excluded from the snapshot.
Given fact_click (user_id, campaign_id, click_ts) and fact_conversion (user_id, conversion_ts, conversion_usd), credit each conversion to the latest click within 7 days preceding it. Un-attributable conversions (no click in window) return NULL for campaign.
Edge: ties on click_ts (same second), and conversions with zero matching clicks still need to appear in output (LEFT JOIN, not INNER).
Show the reference solution
SELECT
c.user_id,
c.conversion_ts,
c.conversion_usd,
last_click.campaign_id AS attributed_campaign
FROM fact_conversion c
LEFT JOIN LATERAL (
SELECT campaign_id
FROM fact_click k
WHERE k.user_id = c.user_id
AND k.click_ts <= c.conversion_ts
AND k.click_ts > c.conversion_ts - INTERVAL '7 days'
ORDER BY k.click_ts DESC, k.campaign_id DESC -- deterministic tiebreak
LIMIT 1
) last_click ON TRUE;
Why the tiebreak column matters: two clicks at the same second from different campaigns would non-deterministically attribute without the secondary ORDER BY. Candidates who skip this ship a test suite that passes locally and fails flakily in CI.
Implement should_serve(user_id, now_ts) -> bool backed by Redis. Cap: at most 5 impressions per user in any rolling 24h window. Redis ops available: ZADD, ZCOUNT, ZREMRANGEBYSCORE, EXPIRE.
Edge: race between two ad servers for the same user at the same instant — without a transaction, both decide "yes, serve." And: memory bloat if old entries are never trimmed.
Show the reference solution
import time
import redis
r = redis.Redis()
WINDOW_SEC = 24 * 3600
CAP = 5
def should_serve(user_id: str, now_ts: int | None = None) -> bool:
now = now_ts if now_ts is not None else int(time.time())
key = f"freq:{user_id}"
cutoff = now - WINDOW_SEC
# Atomic check-then-add in a pipeline with WATCH to avoid the race.
with r.pipeline() as pipe:
while True:
try:
pipe.watch(key)
pipe.zremrangebyscore(key, 0, cutoff)
count = pipe.zcount(key, cutoff, "+inf")
if count >= CAP:
pipe.unwatch()
return False
pipe.multi()
pipe.zadd(key, {f"{now}:{user_id}:{count}": now})
pipe.expire(key, WINDOW_SEC + 60) # trim memory
pipe.execute()
return True
except redis.WatchError:
continue # raced with another server; retry
What junior solutions miss: (1) the WATCH/MULTI transaction — without it, concurrent ad servers both read count = 4 and both insert, landing at 6. (2) the EXPIRE — without it, inactive users accumulate forever and the Redis cluster OOMs at 3 AM.
Given two iterables of user_id (possibly 10M+ each, cannot fit both in memory). Compute Jaccard similarity |A ∩ B| / |A ∪ B|. Approximation acceptable if it beats exact at scale.
Edge: duplicates within each iterable (same user appearing twice) should not double-count. Memory is the constraint, not time.
Show the reference solution
# Exact — for datasets that fit in memory
def jaccard_exact(a, b):
sa, sb = set(a), set(b)
inter = len(sa & sb)
union = len(sa | sb)
return inter / union if union else 0.0
# Approximate — MinHash LSH for the 10M+ case
from datasketch import MinHash
def jaccard_minhash(iterable_a, iterable_b, num_perm=256):
m_a, m_b = MinHash(num_perm=num_perm), MinHash(num_perm=num_perm)
for u in iterable_a: m_a.update(str(u).encode())
for u in iterable_b: m_b.update(str(u).encode())
return m_a.jaccard(m_b) # approx ±1% at num_perm=256
Why the approximate version is the senior answer: at 10M per segment, set() costs ~800MB per side. Two segments ≈ 1.6GB resident before you've started computing. MinHash is O(n) streaming with 256 × 8-byte fingerprints per segment — 2KB, regardless of cardinality. Cite the precision/recall trade explicitly in the answer.
Impression events arrive on a Kafka topic with impression_id, event_ts. Consumer restarts replay the last 10 minutes. Write a consumer that produces exactly-once writes to the warehouse, without external transactions.
Edge: impression_id collisions across days (IDs only unique-per-day), clock skew between producers, and the consumer crashing mid-batch.
Show the reference solution
from datetime import datetime
SEEN_TTL_SEC = 86400 # impression_id uniqueness window
class Dedupe:
def __init__(self, redis_client):
self.r = redis_client
def key(self, event_ts: int, impression_id: str) -> str:
# Day-bucket prevents cross-day collision.
day = datetime.utcfromtimestamp(event_ts).strftime("%Y%m%d")
return f"seen:{day}:{impression_id}"
def process(self, event) -> bool:
k = self.key(event["event_ts"], event["impression_id"])
# SETNX returns 1 iff we're the first writer. TTL ensures eventual
# memory release; slightly longer than the replay window protects
# against consumer restart within the same bucket.
if self.r.set(k, b"1", nx=True, ex=SEEN_TTL_SEC):
write_to_warehouse(event)
return True
return False # duplicate; skip
def write_to_warehouse(event):
# Idempotent upsert in the warehouse too — belt and braces.
warehouse.upsert(
table="silver.fact_impression",
key=("impression_id", "event_ts"),
row=event,
)
What junior solutions miss: (1) day-bucketing the dedup key (IDs collide across days at some providers), (2) warehouse-side UPSERT as a second line of defence — the Redis check is fast but not durable across region failover, and exactly-once in ads-DE is always defence-in-depth.
Given fact_daily_active (user_id, active_date), produce a daily growth-accounting table with columns date, new, retained, resurrected, churned. Definitions: new = first ever active date; retained = active yesterday and today; resurrected = active today, not yesterday, but active in the last 28 days; churned = active yesterday, not today.
Edge: dates with zero activity shouldn't silently drop. Use a calendar spine.
Show the reference solution
WITH first_seen AS (
SELECT user_id, MIN(active_date) AS first_date FROM fact_daily_active GROUP BY user_id
),
today AS (
SELECT a.active_date, a.user_id,
EXISTS (SELECT 1 FROM fact_daily_active y
WHERE y.user_id = a.user_id
AND y.active_date = a.active_date - INTERVAL '1 day') AS active_yday,
EXISTS (SELECT 1 FROM fact_daily_active r
WHERE r.user_id = a.user_id
AND r.active_date BETWEEN a.active_date - INTERVAL '28 days'
AND a.active_date - INTERVAL '2 days') AS active_last_28
FROM fact_daily_active a
)
SELECT d.date,
COUNT(*) FILTER (WHERE fs.first_date = d.date) AS new_users,
COUNT(*) FILTER (WHERE fs.first_date < d.date AND t.active_yday) AS retained,
COUNT(*) FILTER (WHERE fs.first_date < d.date AND NOT t.active_yday
AND t.active_last_28) AS resurrected,
(SELECT COUNT(*) FROM fact_daily_active y
WHERE y.active_date = d.date - INTERVAL '1 day'
AND NOT EXISTS (SELECT 1 FROM fact_daily_active z
WHERE z.user_id = y.user_id AND z.active_date = d.date))
AS churned
FROM (SELECT generate_series(DATE '2025-01-01', DATE '2025-12-31', INTERVAL '1 day')::date AS date) d
LEFT JOIN today t ON t.active_date = d.date
LEFT JOIN first_seen fs ON fs.user_id = t.user_id
GROUP BY d.date
ORDER BY d.date;
Sanity check: retained + resurrected + new = DAU(today), and DAU(yday) - churned = retained. If those identities don't hold, the SQL is wrong.
Tech round drill sprint · 8 SQL + 8 Python · L3 → L5
The seven detail-drills above are the deep work. Below are sixteen short sprints — the patterns the SQL/Python rounds actually test, framed against ads scenarios. Each one names the L-tier, the underlying pattern, and a reference solution with one line on what junior solutions miss. Read them as a warm-up rotation — answer mentally, then check.
ROW_NUMBER() OVER (PARTITION BY ...)fact_campaign_daily(advertiser_id, campaign_id, spend_usd, day), return the top 3 spending campaigns per advertiser for the last 7 days. Tie-break: lower campaign_id wins."Senior solution + what junior misses
SELECT advertiser_id, campaign_id, total_spend
FROM (
SELECT advertiser_id, campaign_id,
SUM(spend_usd) AS total_spend,
ROW_NUMBER() OVER (
PARTITION BY advertiser_id
ORDER BY SUM(spend_usd) DESC, campaign_id ASC
) AS rk
FROM fact_campaign_daily
WHERE day >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY advertiser_id, campaign_id
) t
WHERE rk <= 3;
Junior misses: the secondary ORDER BY campaign_id. Without it, ties are non-deterministic and the result flickers between runs. Senior writes the deterministic tiebreak before the interviewer asks.
LEFT JOIN ... WHERE r.id IS NULLdim.advertiser) who served zero impressions yesterday. Active means dim.advertiser.is_active = TRUE."Senior solution + what junior misses
SELECT a.advertiser_id, a.name
FROM dim.advertiser a
LEFT JOIN silver.fact_impression i
ON i.advertiser_id = a.advertiser_id
AND i.impression_ts >= CURRENT_DATE - INTERVAL '1 day'
AND i.impression_ts < CURRENT_DATE
WHERE a.is_active = TRUE
AND i.advertiser_id IS NULL;
Junior misses: putting the date filter in the WHERE instead of the ON. That converts the LEFT JOIN to an inner join and drops every advertiser without an impression — exactly the rows you wanted. The filter belongs on the join, not after it.
ROW_NUMBER() ... ORDER BY ingested_ts DESCbronze.raw_impression may contain the same impression_id twice (consumer replay). Keep only the latest ingested_ts per impression_id."Senior solution + what junior misses
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY impression_id
ORDER BY ingested_ts DESC
) AS rn
FROM bronze.raw_impression
)
SELECT * EXCEPT (rn) FROM ranked WHERE rn = 1;
Junior misses: using DISTINCT ON in Postgres-only flavour, or GROUP BY impression_id + MAX(ingested_ts) which loses the rest of the columns. ROW_NUMBER keeps the whole row and works on every dialect.
LAG() over partitioned weekly aggregateswow_pct column showing change from the prior week."Senior solution + what junior misses
WITH weekly AS (
SELECT advertiser_id,
DATE_TRUNC('week', event_time) AS wk,
SUM(credited_usd) AS rev
FROM gold.revenue_attributed
WHERE event_time >= CURRENT_DATE - INTERVAL '84 days'
GROUP BY 1, 2
)
SELECT advertiser_id, wk, rev,
LAG(rev) OVER (PARTITION BY advertiser_id ORDER BY wk) AS prev_rev,
(rev - LAG(rev) OVER (PARTITION BY advertiser_id ORDER BY wk))
/ NULLIF(LAG(rev) OVER (PARTITION BY advertiser_id ORDER BY wk), 0) AS wow_pct
FROM weekly
ORDER BY advertiser_id, wk;
Junior misses: the NULLIF(prev_rev, 0) guard. A new advertiser whose first week is week N has prev_rev = 0; without the guard the query divides by zero and silently fails or returns Inf. Senior writes the guard reflexively.
fact_revenue(advertiser_id, surface, day, revenue_usd) where surface ∈ (feed, reels, stories, ctv), return one row per (advertiser_id, day) with one column per surface."Senior solution + what junior misses
SELECT advertiser_id, day,
SUM(CASE WHEN surface = 'feed' THEN revenue_usd ELSE 0 END) AS feed_usd,
SUM(CASE WHEN surface = 'reels' THEN revenue_usd ELSE 0 END) AS reels_usd,
SUM(CASE WHEN surface = 'stories' THEN revenue_usd ELSE 0 END) AS stories_usd,
SUM(CASE WHEN surface = 'ctv' THEN revenue_usd ELSE 0 END) AS ctv_usd
FROM fact_revenue
GROUP BY advertiser_id, day;
Junior misses: using FILTER (WHERE surface = ...) Postgres-specific syntax which is cleaner but won't run on Spark/BigQuery/Snowflake. CASE is portable. Always ask which engine before reaching for FILTER or PIVOT.
LAG + cumulative SUM over windowfact_impression rows into sessions where a session ends after 30 minutes of inactivity. Output: user_id, session_id, start_ts, end_ts, impression_count."Senior solution + what junior misses
WITH gaps AS (
SELECT user_id, impression_ts,
CASE WHEN impression_ts - LAG(impression_ts)
OVER (PARTITION BY user_id ORDER BY impression_ts)
> INTERVAL '30 minutes' THEN 1 ELSE 0 END AS is_new_session
FROM silver.fact_impression
),
sessions AS (
SELECT user_id, impression_ts,
SUM(is_new_session) OVER (
PARTITION BY user_id ORDER BY impression_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_id
FROM gaps
)
SELECT user_id, session_id,
MIN(impression_ts) AS start_ts,
MAX(impression_ts) AS end_ts,
COUNT(*) AS impression_count
FROM sessions
GROUP BY user_id, session_id;
Junior misses: the unbounded-preceding window frame. Without it the cumulative SUM resets per row and you get session_ids that don't accumulate. The frame is the whole point.
NOT EXISTS for snapshot isolationgold.revenue_attributed is append-only with event_time + attribution_time. Reproduce exactly the September total Finance saw at 10:37 AM on 2025-09-30, regardless of any rows inserted afterwards."Senior solution + what junior misses
WITH snap AS (
SELECT *
FROM gold.revenue_attributed
WHERE attribution_time <= TIMESTAMP '2025-09-30 10:37:00'
AND event_time >= '2025-09-01' AND event_time < '2025-10-01'
)
SELECT SUM(credited_usd) AS finance_view_1037
FROM snap r
WHERE NOT EXISTS (
SELECT 1 FROM snap s2
WHERE s2.conversion_id = r.conversion_id
AND s2.attribution_time > r.attribution_time
);
Junior misses: filtering on today's is_superseded flag instead of computing supersession at the snapshot moment. A row that was current at 10:37 but was superseded three days later would be wrongly excluded. Bi-temporal questions test whether you understand "as-of-X" vs "current."
dim.advertiser_rate stores agency-rate history with (advertiser_id, rate_pct, effective_from, effective_to). Join it to fact_impression so each impression picks up the rate that was active at impression_ts."Senior solution + what junior misses
SELECT i.*,
r.rate_pct
FROM silver.fact_impression i
JOIN dim.advertiser_rate r
ON r.advertiser_id = i.advertiser_id
AND i.impression_ts >= r.effective_from
AND (i.impression_ts < r.effective_to OR r.effective_to IS NULL);
Junior misses: the OR effective_to IS NULL branch for the currently-active row. SCD2 conventionally stores open intervals as effective_to = NULL; without the OR, every impression for an advertiser whose rate is current gets dropped from the join.
collections.defaultdict · streaming aggregation without pandas(impression_ts, spend_cents) tuples (could be 100M rows), return total spend per hour. Cannot use pandas. Memory must stay bounded by the number of distinct hours, not row count."Senior solution + what junior misses
from collections import defaultdict
from datetime import datetime
def spend_by_hour(rows):
out = defaultdict(int)
for ts, cents in rows:
# Truncate to hour (cheaper than .replace if ts is already ISO string)
hour = ts.replace(minute=0, second=0, microsecond=0) if isinstance(ts, datetime) else ts[:13]
out[hour] += cents
return dict(out)
Junior misses: using a regular dict with if key not in out: out[key] = 0. Works, slower, and easier to typo. defaultdict(int) is the idiom every senior reaches for here. Bonus: it's also thread-safer if the loop ever moves to producer/consumer.
Senior solution + what junior misses
import time, random, functools
def retry(attempts=5, base=1.0, max_delay=30.0):
def decorator(fn):
@functools.wraps(fn)
def wrapper(*args, **kwargs):
last_exc = None
for i in range(attempts):
try:
return fn(*args, **kwargs)
except Exception as e:
last_exc = e
if i == attempts - 1:
break
sleep = min(base * (2 ** i), max_delay)
sleep += random.uniform(0, sleep * 0.1) # 10% jitter
time.sleep(sleep)
raise last_exc
return wrapper
return decorator
@retry(attempts=5)
def fetch_ssp_postback(url): ...
Junior misses: jitter. Without it, a thousand workers retrying the same SSP at the same rhythm DDOS the SSP and you get banned. Even 10% jitter de-syncs the herd. Senior writes it before being asked.
collections.Counter · streaming top-Kimpression_id, advertiser_id rows, return the top 10 advertisers by impression count. Single pass, bounded memory."Senior solution + what junior misses
from collections import Counter
def top_10_advertisers(rows):
c = Counter(advertiser_id for _, advertiser_id in rows)
return c.most_common(10)
Junior misses: reaching for sorted(...)[:10] on a 50M-element dict — that's O(n log n) when most_common(k) is O(n log k) using a heap internally. At ad scale the heap path is 5–8× faster.
yield · constant-memory streamingSenior solution + what junior misses
import gzip, json
from itertools import islice
def chunked_jsonl(stream, chunk_size=10_000):
"""Stream JSONL records in chunks. Constant memory."""
while True:
chunk = list(islice(stream, chunk_size))
if not chunk:
return
yield [json.loads(line) for line in chunk]
# Usage:
with gzip.open('s3://path/impressions.jsonl.gz', 'rt') as f:
for batch in chunked_jsonl(f):
process(batch)
Junior misses: calling f.readlines() first — that materialises the whole file in memory and dies on a 50GB input. islice on the file object reads lazily. Senior reaches for the lazy iterator on instinct.
collections.deque · monotonic queue · O(n)Senior solution + what junior misses
from collections import deque
def rolling_max_roas(stream, window_seconds=300):
"""Monotonic deque: indices ordered by ts, values monotonically decreasing."""
dq = deque() # holds (ts, roas)
for ts, roas in stream:
# Drop items outside the window (left of dq)
while dq and dq[0][0] < ts - window_seconds:
dq.popleft()
# Drop items smaller than current roas (right of dq) — they can never be max again
while dq and dq[-1][1] <= roas:
dq.pop()
dq.append((ts, roas))
yield ts, dq[0][1] # dq[0] is always the current window's max
Junior misses: implementing this with max(values_in_window) per row — that's O(n·k) and at 2B impressions/day per advertiser is the difference between 30 seconds and 30 hours. The monotonic deque makes it O(n) amortised.
can_proceed(advertiser_id) → bool."Senior solution + what junior misses
import time
from dataclasses import dataclass
@dataclass
class Bucket:
tokens: float
last_refill: float
class RateLimiter:
def __init__(self, rate=100, burst=200):
self.rate = rate # tokens per second
self.burst = burst # max tokens
self.buckets: dict[str, Bucket] = {}
def can_proceed(self, advertiser_id: str) -> bool:
now = time.monotonic()
b = self.buckets.get(advertiser_id) or Bucket(self.burst, now)
# Refill
elapsed = now - b.last_refill
b.tokens = min(self.burst, b.tokens + elapsed * self.rate)
b.last_refill = now
if b.tokens >= 1:
b.tokens -= 1
self.buckets[advertiser_id] = b
return True
self.buckets[advertiser_id] = b
return False
Junior misses: using time.time() instead of time.monotonic(). time.time() can go backwards on NTP adjustments — a candidate who hasn't lost a Saturday to that bug uses it; a senior reaches for monotonic().
{job: [upstream_jobs]} describing an ads-ETL DAG, return a valid execution order. Raise CycleError if there's a cycle."Senior solution + what junior misses
from collections import defaultdict, deque
class CycleError(ValueError): pass
def topo_sort(deps: dict[str, list[str]]) -> list[str]:
in_deg = defaultdict(int)
children = defaultdict(list)
nodes = set(deps)
for job, parents in deps.items():
for p in parents:
children[p].append(job)
in_deg[job] += 1
nodes.add(p)
q = deque(n for n in nodes if in_deg[n] == 0)
order = []
while q:
n = q.popleft()
order.append(n)
for child in children[n]:
in_deg[child] -= 1
if in_deg[child] == 0:
q.append(child)
if len(order) != len(nodes):
raise CycleError(f"cycle involving {nodes - set(order)}")
return order
Junior misses: tracking nodes separately from deps.keys(). A leaf job that's only ever a parent (never a key) is missing from deps — without adding parents to nodes the in-degree count is wrong and the cycle check produces false positives. Subtle and common.
asyncio.gather · bounded concurrency · partial-failure toleranceSenior solution + what junior misses
import asyncio
import aiohttp
async def fetch_one(session, ssp, sem):
async with sem: # bounded concurrency
try:
async with session.get(ssp.url, timeout=10) as resp:
resp.raise_for_status()
return ('ok', ssp.name, await resp.json())
except Exception as e:
return ('err', ssp.name, repr(e))
async def fetch_all_postbacks(ssps):
sem = asyncio.Semaphore(4)
async with aiohttp.ClientSession() as session:
results = await asyncio.gather(
*(fetch_one(session, s, sem) for s in ssps),
return_exceptions=False, # we already wrapped in fetch_one
)
successes = [r for r in results if r[0] == 'ok']
failures = [r for r in results if r[0] == 'err']
return successes, failures
Junior misses: the Semaphore. Without it, all 8 fire in parallel and the smallest SSP returns 429s. Senior solution has bounded concurrency + per-task exception capture so one bad SSP does not poison the batch.
Three things separate junior from senior in tech rounds: (1) deterministic edge-case handling on day one of writing the query (the secondary ORDER BY, the NULLIF, the OR effective_to IS NULL); (2) reaching for the idiom rather than recreating the algorithm (Counter, defaultdict, asyncio.gather, monotonic deque); (3) naming the production failure mode that the obvious solution would hit (DDOS without jitter, OOM without lazy iter, false cycle from missing nodes, NTP drift from time.time). The drills above all hide one of those three signals. Practise until you write the senior version on first pass.
If you can do these from a blank buffer, under timer, without reaching for the docs, the round will play out the way the seven acts did.
Seven drills, seven acts, one lesson. The lesson is the one Raj was testing in Act 7: the job is mostly the things you don't build. The drills are the things you need to be fluent in so you can spend the round thinking about the things that don't make it into code. Walk in warm.