Design the data backend for a For-You-Page: billions of impressions a day, sub-200ms ranking, and a model that trains on its own output. One prompt, one vicious feedback loop, and a single decision that marks a senior answer: hold back 5% of traffic so you can know at all whether the algorithm caused anything. A complete working-through of data flow, schema, streaming Python, off-policy evaluation, the causal-lift SQL, and the dashboard that watches for drift.
Every ML engineer who works on ranking eventually meets this question. It sounds like "build a recommender." It is really "build a recommender that can tell whether it is any good" — and those are not the same system.
"Design the data model behind a For-You-Page recommendation feed. It serves billions of impressions a day and the ranker trains on the engagement it generates. How do you model it without producing a biased feedback loop — and how do you know if a new model is actually better?"
The For-You-Page is the most-watched ML system on the planet, and the trap is not scale — it is the snake eating its tail. The ranker boosts videos that already get watched; users see them more; they get watched more; they are boosted more. New videos and new creators starve in the dark. This is echo-chamber bias, and it is not a tuning problem you smooth away later — it is baked into the data the moment you log only what the algorithm chose to show. Every engagement metric you compute is confounded by the ranker's own selection. "This video got an 80% completion rate" tells you nothing, because the ranker only showed it to the people it already believed would finish it.
A weak answer logs impressions and engagements and calls it a recommender. A strong answer notices the missing table — the one that holds the videos the algorithm did not choose — and the missing column — the prediction the model made before it knew the outcome. Without those two, the system can attribute, but it can never measure causality, and a recommender that cannot measure its own causal lift cannot safely improve. So before any boxes and arrows, the working frame for the whole session:
Scope is the first scored dimension, and most candidates skip it. State what you build, what you ignore, and the numbers that shape every later choice. Out of scope here, said explicitly: the candidate-generation and ranking model internals (treated as a callable service that returns scored videos), the video CDN and transcoding stack, trust-and-safety classification, and the recommendation UI. In scope: how an impression becomes a row, how the exploration slice is carved and logged, how the predicted score is frozen for off-policy evaluation, how new rankers roll out under shadow traffic, and how cold-start videos escape the starvation trap.
Then the envelope math, volunteered rather than extracted. TikTok-shaped numbers:
| Quantity | Estimate | Consequence |
|---|---|---|
| Daily active users | 1,000,000,000 | Sets the dimension cardinality, not the fact size |
| Impressions/day | ≈ 100 B | The firehose — cannot land full-fidelity in the warehouse |
| Serving latency budget | < 200 ms | Score is computed and frozen on the hot path, logged async |
| Exploration slice | 5% ≈ 5 B impressions/day | The control group that shapes the whole architecture |
| Analytics sample | 1% of impressions | Tens of TB/day even sampled; full data in feature store |
| Concurrent ranker versions | 5–10 | Forces model_version on every score row |
| Raw event retention | 30 days | PII; aggregate to durable rollups, expire the raw |
Notice what the numbers say and what they do not. The hundred-billion firehose dictates the sampling and storage tiering — but the row that shapes the architecture, not just the bill, is the 5% exploration slice. Five billion deliberately un-ranked impressions a day is the price of being able to answer the only question that matters: did the algorithm cause the engagement, or did it just take credit for what was going to happen anyway? The rest of this article follows that question.
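The arithmetic behind the table is short enough to check by hand. A minimal sketch, using only the estimates above (the constant names are illustrative, and the per-second figure is a flat daily average that ignores peak-hour skew):

```python
# Envelope arithmetic from the estimates in the table; nothing here is measured.
IMPRESSIONS_PER_DAY = 100_000_000_000  # ~100 B impressions/day
EXPLORATION_PCT = 5                    # control slice, percent of impressions
ANALYTICS_SAMPLE_PCT = 1               # analytics sample, percent of impressions
SECONDS_PER_DAY = 86_400

# Integer math keeps the results exact.
exploration_per_day = IMPRESSIONS_PER_DAY * EXPLORATION_PCT // 100
sampled_rows_per_day = IMPRESSIONS_PER_DAY * ANALYTICS_SAMPLE_PCT // 100

# Flat average; real traffic peaks well above this.
avg_impressions_per_second = IMPRESSIONS_PER_DAY / SECONDS_PER_DAY

print(f"exploration/day : {exploration_per_day:,}")
print(f"sampled rows/day: {sampled_rows_per_day:,}")
print(f"avg writes/sec  : {avg_impressions_per_second:,.0f}")
```

Even as a flat average, that is more than a million impression writes per second, which is why the log write has to stay off the response path.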
One feed, two regimes. The serving path splits at the source: 95% of impressions are ranked by the model, 5% are randomized — and both carry a frozen prediction downstream, so that weeks later the analytics layer can subtract one from the other and recover causal truth.
Three properties of this picture do most of the interview's work. First, the split happens at the mixer and is recorded: every impression carries a source tag — ranked or exploration — so the analytics layer can separate the two regimes without guessing. Second, the prediction is frozen on the hot path: the ranker writes predicted_watch_seconds and its model_version the instant the video is shown, before any outcome exists, which is the only moment the prediction is uncontaminated by what later happened. Third, the dashed lines are the whole point — the exploration log and the frozen scores flow into an offline evaluation that compares the algorithm's engagement against the random baseline, and that subtraction, not any online metric, is what licenses promoting a new model.
The firehose may be sampled; the control group may never be. Dropping 99% of impressions for analytics is fine: a 1% sample of a hundred billion is still a billion rows, statistically lavish. But the 5% exploration slice is logged at full fidelity and protected from any optimization that would shrink it, because the instant the control group is itself selected by the ranker, it stops being a control group and every lift number becomes a lie. Sample the firehose freely; never let an A/B test or a latency tweak quietly poison the exploration arm.
The schema falls out of the causality question. The impression is the spine; the engagement hangs off it by id; the exploration event is the control arm at the same grain; the ranker score is the frozen prediction. And because engagement is high-PII, the identity lives behind a token in a separate vault.
The impression fact is the most-written table on Earth, so its design is about discipline under sampling. One row per video shown to a user, stamped with the serving context the analytics layer will need: the feed position, the model_version that ranked it, the source that says whether this slot was ranked or exploration, and the sampled_pct that lets a downstream query reweight a 1% sample back to population scale. Crucially the user is keyed by a user_id_token, never the raw id — identity is borrowed, not stored.
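As a rough sketch of that row and of the reweighting it enables: the field names `impression_id`, `video_id`, and `feed_position` are illustrative, and treating `sampled_pct` as a percentage (1.0 meaning a 1% sample) is an assumption, since the text does not pin down its units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impression:
    """One row per video shown to a user (illustrative shape, not a fixed schema)."""
    impression_id: str
    user_id_token: str   # token only; the raw user id never lands in this table
    video_id: str
    feed_position: int
    model_version: str
    source: str          # "ranked" or "exploration"
    sampled_pct: float   # assumed: percent of traffic this row represents


def population_weight(row: Impression) -> float:
    """A row kept at p% sampling stands in for 100/p real impressions."""
    if not 0 < row.sampled_pct <= 100:
        raise ValueError(f"sampled_pct out of range: {row.sampled_pct}")
    return 100.0 / row.sampled_pct


def estimated_population_count(rows) -> float:
    """Scale a sampled set of rows back up to an estimated population count."""
    return sum(population_weight(r) for r in rows)
```

Carrying the weight on the row, rather than as a global constant, lets a 1% analytics sample and a full-fidelity exploration log share one table without a query having to know which is which.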
Engagement is a separate fact, joined to the impression by impression_id. Keeping it separate matters: an impression with no engagement row is a swipe-away, and that absence is itself a signal the ranker learns from. The watched-seconds column is the regression target the model predicts; the booleans are the classification targets. And like the spine, it is keyed by token, because this is the table a privacy review will scrutinize hardest.
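The join that turns absence into a label might look like the sketch below. It assumes rows arrive as plain dicts and that a missing engagement row can be scored as zero watched seconds; both are simplifications for illustration, and the function name is hypothetical.

```python
def label_impressions(impressions, engagements):
    """Left-join engagements onto impressions by impression_id.

    impressions: iterable of dicts with at least "impression_id"
    engagements: iterable of dicts with "impression_id" and "watched_seconds"
    An impression with no matching engagement row is labeled a swipe-away.
    """
    by_id = {e["impression_id"]: e for e in engagements}
    labeled = []
    for imp in impressions:
        eng = by_id.get(imp["impression_id"])
        labeled.append({
            **imp,
            "swiped_away": eng is None,
            # Assumption: no engagement row means zero watch time.
            "watched_seconds": eng["watched_seconds"] if eng else 0.0,
        })
    return labeled
```

Because engagement can trail the impression, a join like this is only trustworthy once the engagement window has closed; run too early, it would mislabel slow-arriving engagements as swipe-aways.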
Two more facts, and they are the ones the naive design forgets. The exploration event records the 5% of impressions that were random or holdout — the counterfactual baseline that answers "if we showed this to a thousand random users, how many would engage?" The ranker-score fact freezes the model's prediction at impression-time, stamped with version, so that "was the model right?" becomes a join between a prediction made in the past and an outcome known in the present. Both are append-only; neither is ever back-filled with hindsight.
And the vault — deliberately apart. The token that keys every fact above maps to a real user id only inside vault.user_id_map, an encrypted store with its own access pool that the analytics warehouse cannot read. Raw events carry a 30-day retention; the aggregated rollups (per-user-per-day-per-cluster) are de-identified enough to keep indefinitely. The separation is not decoration — it is what lets the firehose be useful to ML without being a standing liability to privacy.
The whole correctness of this system lives in one rule about time: the model's prediction is written before the model can possibly know whether it was right. Freeze the score at impression-time, stamp it with a version, and every later question about model quality becomes a deterministic join across time instead of an irreproducible guess.
Why this is the invariant, and not a nice-to-have. Models are versioned and they change daily; today's recommendation was made by ranker v117, and replaying the same impression through v118 produces a different number. If you do not capture v117's prediction at the moment it was made, you lose it forever — there is no way to reconstruct what the deployed model believed about a video it has long since stopped seeing. Storing the score freezes the test. "Was the model right?" becomes comparing a prediction-then to an outcome-now, and "is the new model better?" becomes scoring it on the same impressions retroactively, with no online experiment required.
Read the lifecycle left to right and notice where the freeze sits: between SCORED and SHOWN, before the outcome exists. The score is captured in the same step as the impression, even though the log write itself is asynchronous; the engagement arrives seconds to hours later; the evaluation happens days later, offline. Because the prediction was frozen, the gap between SHOWN and EVALUATED can be arbitrarily long and the comparison stays honest. A video shown ten million times with a high predicted score and a low actual watch-time is not noise. It is the cleanest possible diagnostic of model drift, and you only have it because the prediction was stamped before the truth came in.
The is_shadow flag and the composite key (impression_id, model_version) are doing the quiet work. A shadow model writes a prediction for an impression it did not serve — same impression id, different version, is_shadow = true — so the head-to-head above can include a model that has never touched a real user. That is the whole game: you compare v119 to v118 on identical impressions, offline, before v119 is ever allowed to influence a single feed.
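One way to sketch a head-to-head over identical impressions, assuming score rows keyed by (impression_id, model_version) and a lookup of realized watch time. Mean absolute error is a stand-in metric chosen for brevity here, not necessarily the one a production evaluation would use, and `head_to_head` is a hypothetical name.

```python
from collections import defaultdict


def head_to_head(scores, outcomes):
    """Compare model versions on the impressions every one of them scored.

    scores:   iterable of dicts with "impression_id", "model_version",
              "predicted_watch_seconds" (shadow and live rows alike)
    outcomes: dict mapping impression_id -> actual watched seconds
    Returns {model_version: mean absolute error}, or {} if nothing is shared.
    """
    by_version = defaultdict(dict)
    for s in scores:
        by_version[s["model_version"]][s["impression_id"]] = s["predicted_watch_seconds"]

    # Restrict to impressions with a known outcome AND a score from every version,
    # so no model is graded on an easier set than another.
    shared = set(outcomes)
    for preds in by_version.values():
        shared &= set(preds)
    if not by_version or not shared:
        return {}

    return {
        version: sum(abs(preds[i] - outcomes[i]) for i in shared) / len(shared)
        for version, preds in by_version.items()
    }
```

The intersection step carries the weight: a shadow model that scored extra impressions gets no credit for them, and the incumbent is never compared on traffic the candidate did not see.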
Three programs carry the system: the mixer that carves the exploration slice and freezes the score, the logger that joins engagement back to its impression, and the cold-start governor that forces new videos into the light. Each is small; the judgment is in what they refuse to do.
This runs on the serving hot path, under the 200ms budget, so it does the minimum and logs asynchronously. It assigns each feed slot a source — ranked or exploration — by a stable hash, freezes the chosen ranker's prediction into the score buffer the instant the slot is filled, and never blocks the response on the log write. The carve is deterministic per (user, request) so the exploration assignment is reproducible and auditable, not a coin flip the system forgets.
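A minimal sketch of the carve and the freeze, under stated assumptions: SHA-256 as the stable hash, the slot index folded into the hash key, and an in-process bounded queue standing in for the score buffer. The function names and the buffer are hypothetical; a real service would hand rows to a durable async log.

```python
import hashlib
import queue

EXPLORATION_PCT = 5  # percent of slots held out from the ranker

# Stand-in for the async score buffer; a background writer (not shown) drains it.
score_buffer = queue.Queue(maxsize=10_000)


def assign_source(user_id_token: str, request_id: str, slot: int) -> str:
    """Deterministically tag a feed slot as 'ranked' or 'exploration'.

    The same (user, request, slot) always lands in the same bucket, so the
    assignment can be recomputed later during an audit.
    """
    key = f"{user_id_token}:{request_id}:{slot}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    return "exploration" if bucket < EXPLORATION_PCT else "ranked"


def freeze_score(impression_id: str, model_version: str,
                 predicted_watch_seconds: float, source: str) -> bool:
    """Enqueue the prediction without ever blocking the response."""
    row = {
        "impression_id": impression_id,
        "model_version": model_version,
        "predicted_watch_seconds": predicted_watch_seconds,
        "source": source,
    }
    try:
        score_buffer.put_nowait(row)
        return True
    except queue.Full:
        # Sketch only: a real system must count these drops and alert,
        # since silently losing exploration rows would bias the control arm.
        return False
```

A built-in `hash()` would not do here: Python salts string hashes per process, so the assignment would stop being reproducible across servers and restarts.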
One carve-out, always stated: the mixer never lets an experiment touch the exploration arm. A/B tests and feature flags partition the ranked 95%; the 5% control group is held constant across all of them, because a control group that is itself being experimented on is no longer a control group. This single discipline is what keeps every lift number downstream meaningful.
Aggregation here is not a roll-up — it is a causal estimate. The exploration arm gives the counterfactual; the ranked arm gives the observed; lift is the difference, normalized. This is the slow, offline computation that decides whether a model ships, and it is the reason the 5% exists.
The intuition first, because it is the thing an interviewer wants to hear out loud. Suppose "videos liked by similar users" get clicked 80% of the time. Is the algorithm good, or were those videos going to win regardless of who surfaced them? The ranked data cannot tell you — it only ever showed those videos to people it already believed would click. The exploration arm can: show a thousand random users that same video and measure how many engage with no algorithmic selection at all. That random click-through is the counterfactual baseline. Model lift = (algorithmic CTR − exploration CTR) ÷ exploration CTR. Without exploration you can compute attribution; only with it can you compute causation. This is off-policy evaluation, and naming it is half the points.
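Written out, with a purely hypothetical baseline to make the arithmetic concrete: the 80% is the figure from the example above, while the 50% exploration CTR is invented for illustration and is not a measurement.

```latex
\[
\text{lift}
  = \frac{\mathrm{CTR}_{\text{algorithmic}} - \mathrm{CTR}_{\text{exploration}}}
         {\mathrm{CTR}_{\text{exploration}}},
\qquad\text{e.g.}\qquad
\frac{0.80 - 0.50}{0.50} = 0.60
\]
```

Under those assumed numbers the ranker would be credited with a 60% relative lift. Had the random baseline also come in at 80%, the lift would be zero and the algorithm would be taking credit for nothing.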
The shadow-traffic rollout is the operational half of the same idea, and it is why model_version sits on every score row. A new ranker first runs as shadow: it scores the same impressions the live model serves, writes its predictions with is_shadow = true, and influences nothing. If its frozen predictions correlate better with actual outcomes than the incumbent's — measured by the §04 head-to-head on identical impressions — it graduates to a 1% A/B, then ramps to 100%. Five to ten versions write to the same table concurrently, joined by version for evaluation; promotion is a data decision, not a deploy decision, and rollback is just routing traffic back to the previous version that is still scoring in shadow.
The four facts are where the system explains itself. Three queries an interviewer loves, because each one carries a classic pattern on its back — the causal-lift join, position-bias as a window, and the cohort retention that exposes echo-chamber narrowing.
The single most important query in the system, and the one that separates a senior answer. It joins the ranked arm to the exploration arm per video cluster and computes lift as the normalized difference. A positive lift means the algorithm genuinely surfaced engagement the random baseline would not have; a lift near zero means the ranker is taking credit for videos that were going to win anyway.
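The same join can be sketched in Python for readers who want to see its shape before the SQL. The row layout (`cluster_id`, `source`, `engaged`, optional `weight`) is an assumption, with `weight` carrying the sampling reweight for rows that come from the 1% sample; `lift_by_cluster` is a hypothetical name.

```python
from collections import defaultdict


def lift_by_cluster(rows):
    """Per-cluster causal lift: (ranked rate - exploration rate) / exploration rate.

    rows: iterable of dicts with "cluster_id", "source" ("ranked" or
          "exploration"), "engaged" (bool) and optional "weight" (default 1.0).
    Returns {cluster_id: lift}, with None where no usable baseline exists.
    """
    # Per cluster and arm: [weighted engagements, weighted impressions]
    totals = defaultdict(lambda: {"ranked": [0.0, 0.0], "exploration": [0.0, 0.0]})
    for r in rows:
        w = r.get("weight", 1.0)
        arm = totals[r["cluster_id"]][r["source"]]
        arm[0] += w if r["engaged"] else 0.0
        arm[1] += w

    result = {}
    for cluster, arms in totals.items():
        (r_eng, r_n), (e_eng, e_n) = arms["ranked"], arms["exploration"]
        if r_n == 0 or e_n == 0 or e_eng == 0:
            result[cluster] = None  # missing arm or zero baseline: lift undefined
            continue
        ranked_rate = r_eng / r_n
        baseline_rate = e_eng / e_n
        result[cluster] = (ranked_rate - baseline_rate) / baseline_rate
    return result
```

Returning `None` rather than zero for a cluster with no exploration traffic is deliberate: "no baseline" and "no lift" are different findings, and collapsing them would hide exactly the clusters the control arm is failing to cover.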
A video at the top of the feed gets watched more because it is at the top, not only because it is good. To learn an unbiased signal, you measure the baseline engagement per position and discount accordingly. The pattern is a window over position — average engagement at each slot, then express each impression's outcome relative to its slot's norm.
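In plain Python the window reduces to two passes: one to build the per-position norm, one to express each row against it. A sketch under assumed column names (`position`, `watched_seconds`), with `position_normalized` as a hypothetical helper:

```python
from collections import defaultdict


def position_normalized(rows):
    """Express each impression's watch time relative to its slot's average.

    rows: iterable of dicts with "position" and "watched_seconds".
    Returns new dicts with "position_baseline" and "relative_engagement"
    (None when the slot's baseline is zero).
    """
    rows = list(rows)  # iterated twice

    # Pass 1: average watched seconds at each feed position.
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[r["position"]] += r["watched_seconds"]
        counts[r["position"]] += 1
    baseline = {p: sums[p] / counts[p] for p in sums}

    # Pass 2: each outcome relative to the norm for its slot.
    out = []
    for r in rows:
        base = baseline[r["position"]]
        rel = r["watched_seconds"] / base if base > 0 else None
        out.append({**r, "position_baseline": base, "relative_engagement": rel})
    return out
```

Where the baseline rows come from matters. Computed over ranked traffic, the slot average mixes position with the ranker's own choice of what to place there, so feeding this function exploration-arm rows is the cleaner option.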
The failure the whole design fights, made visible. Track each user cohort's content diversity over their lifetime — are they being shown a narrowing set of clusters as the ranker learns them? This is a cohort retention shape applied to diversity instead of revenue: anchor on the user's first week, measure distinct clusters per week as the cohort ages, and watch for the collapse.
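A small sketch of that cohort shape, assuming impression rows carry a `cluster_id` and a calendar `day`, that a cohort is the Monday-anchored week of a user's first impression, and that only users active in a given week are averaged. All three are illustrative choices, as is the function name.

```python
from collections import defaultdict
from datetime import date, timedelta


def diversity_by_cohort_age(impressions):
    """Average distinct clusters per active user, by cohort and weeks since first seen.

    impressions: iterable of dicts with "user_id_token", "cluster_id",
                 "day" (datetime.date).
    Returns {(cohort_week_start, weeks_since_first): avg distinct clusters}.
    """
    impressions = list(impressions)  # iterated twice

    # Anchor each user on the first day they appear.
    first_day = {}
    for imp in impressions:
        user = imp["user_id_token"]
        if user not in first_day or imp["day"] < first_day[user]:
            first_day[user] = imp["day"]

    # Distinct clusters each user saw in each week of their lifetime.
    seen = defaultdict(set)
    for imp in impressions:
        user = imp["user_id_token"]
        age_weeks = (imp["day"] - first_day[user]).days // 7
        seen[(user, age_weeks)].add(imp["cluster_id"])

    # Roll users up into cohorts keyed by the Monday of their first week.
    cells = defaultdict(list)
    for (user, age_weeks), clusters in seen.items():
        cohort = first_day[user] - timedelta(days=first_day[user].weekday())
        cells[(cohort, age_weeks)].append(len(clusters))

    return {cell: sum(c) / len(c) for cell, c in cells.items()}
```

A healthy cohort holds roughly flat as it ages; the signature to watch for is a steady slide in the later weeks, which is the narrowing the exploration slice exists to counteract.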
A senior design ends with observability, because every safeguard above is invisible without it. The recsys dashboard watches three things at once: is the live model causally lifting, is the candidate model safe to promote, and is the feedback loop narrowing the world.
Read the amber tiles together and the dashboard narrates the central tension of the whole problem: the model is winning on its own terms, with lift up and the shadow model ready to promote, while content diversity sags and the creator Gini climbs. The feedback loop is tightening even as every serving metric looks excellent. That is exactly why the 5% and the diversity cohort exist: they are the only instruments that see the system narrowing the world while it congratulates itself on engagement.
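For reference, the creator Gini tile can be computed with the standard sorted-rank formula. The sketch below assumes the input is a non-negative count per creator (impressions per creator, say); which quantity the dashboard actually feeds it is not specified here.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 is perfectly even,
    values near 1 mean a few creators hold almost everything.

    Uses G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    over values sorted ascending with ranks i = 1..n.
    """
    xs = sorted(float(v) for v in values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention for this sketch: nothing to distribute
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n
```

Four creators with equal impressions give 0.0; one creator holding everything among four gives 0.75, the maximum for that population size. A climbing value week over week is the creator-side view of the same narrowing the diversity cohort shows on the user side.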
Strip the video details away and the question was testing five judgments, each of which generalizes far beyond recommendation: