The Recommendation Problem

§ 01 — THE QUESTION: The system that grades itself

Every ML engineer who works on ranking eventually meets this question. It sounds like "build a recommender." It is really "build a recommender that can tell whether it is any good" — and those are not the same system.

Interview Prompt

"Design the data model behind a For-You-Page recommendation feed. It serves billions of impressions a day and the ranker trains on the engagement it generates. How do you model it without producing a biased feedback loop — and how do you know if a new model is actually better?"

LEVEL · SENIOR / STAFF | DURATION · 45 MIN | FORMAT · WHITEBOARD

The For-You-Page is the most-watched ML system on the planet, and the trap is not scale — it is the snake eating its tail. The ranker boosts videos that already get watched; users see them more; they get watched more; they are boosted more. New videos and new creators starve in the dark. This is echo-chamber bias, and it is not a tuning problem you smooth away later — it is baked into the data the moment you log only what the algorithm chose to show. Every engagement metric you compute is confounded by the ranker's own selection. "This video got an 80% completion rate" tells you nothing, because the ranker only showed it to the people it already believed would finish it.

A weak answer logs impressions and engagements and calls it a recommender. A strong answer notices the missing table — the one that holds the videos the algorithm did not choose — and the missing column — the prediction the model made before it knew the outcome. Without those two, the system can attribute, but it can never measure causality, and a recommender that cannot measure its own causal lift cannot safely improve. So before any boxes and arrows, the working frame for the whole session:

THE LOGGING LOOP: Impressions & engagement. One row per video shown, billions a day, joined to likes / shares / watch-seconds / swipe-aways by impression id. The firehose — sampled to 1% for analytics, full-fidelity in the feature store. This is what every naive design captures, and on its own it is a hall of mirrors.
THE COUNTERFACTUAL: Exploration events. 5% of traffic deliberately randomized or drawn from a holdout pool — videos the ranker would not have chosen. The control group that makes lift computable. Non-negotiable: without it you have attribution, not causation.
THE FROZEN PREDICTION: Ranker scores at impression-time. The model's predicted engagement, stamped with its version, written the instant the video is shown — before the outcome exists. This is what makes "was the model right?" answerable retroactively, and what lets five model versions be compared on the same impressions.

You cannot improve what you only observe through its own choices. The 5% you throw away on random videos is the only 5% that tells you the truth.

Scoping out loud

Scope is the first scored dimension, and most candidates skip it. State what you build, what you ignore, and the numbers that shape every later choice. Out of scope here, said explicitly: the candidate-generation and ranking model internals (treated as a callable service that returns scored videos), the video CDN and transcoding stack, trust-and-safety classification, and the recommendation UI. In scope: how an impression becomes a row, how the exploration slice is carved and logged, how the predicted score is frozen for off-policy evaluation, how new rankers roll out under shadow traffic, and how cold-start videos escape the starvation trap.

Then the envelope math, volunteered rather than extracted. TikTok-shaped numbers:

Quantity	Estimate	Consequence
Daily active users	1,000,000,000	Sets the dimension cardinality, not the fact size
Impressions/day	≈ 100 B	The firehose — cannot land full-fidelity in the warehouse
Serving latency budget	< 200 ms	Score is computed and frozen on the hot path, logged async
Exploration slice	5% ≈ 5 B impressions/day	The control group that shapes the whole architecture
Analytics sample	1% of impressions	Tens of TB/day even sampled; full data in feature store
Concurrent ranker versions	5–10	Forces model_version on every score row
Raw event retention	30 days	PII; aggregate to durable rollups, expire the raw

Notice what the numbers say and what they do not. The hundred-billion firehose dictates the sampling and storage tiering — but the row that shapes the architecture, not just the bill, is the 5% exploration slice. Five billion deliberately un-ranked impressions a day is the price of being able to answer the only question that matters: did the algorithm cause the engagement, or did it just take credit for what was going to happen anyway? The rest of this article follows that question.
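
The arithmetic behind the table fits in a few lines, and volunteering it is cheap. A minimal sketch using only the estimates above; the per-second figure assumes traffic spread evenly across the day, which a real feed never is, so read it as a floor on peak load rather than a measurement.

```python
# Back-of-envelope check of the scoping table. Inputs are the article's
# estimates; the even spread over 86,400 seconds is a simplifying assumption.
IMPRESSIONS_PER_DAY = 100_000_000_000   # ~100 B impressions/day
EXPLORE_PCT = 5                         # exploration slice, percent
ANALYTICS_SAMPLE_PCT = 1                # analytics sample, percent
SECONDS_PER_DAY = 86_400

explore_per_day = IMPRESSIONS_PER_DAY * EXPLORE_PCT // 100
sampled_per_day = IMPRESSIONS_PER_DAY * ANALYTICS_SAMPLE_PCT // 100
avg_per_second = IMPRESSIONS_PER_DAY // SECONDS_PER_DAY

print(f"exploration impressions/day: {explore_per_day:,}")
print(f"analytics rows/day:          {sampled_per_day:,}")
print(f"average impressions/second:  {avg_per_second:,}")
```

Even averaged flat, the feed runs at more than a million impressions a second, which is why the score is frozen on the hot path and the write is left to an asynchronous log.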

§ 02 — DATA FLOW: Following an impression through the building

One feed, two regimes. The serving path splits at the source: 95% of impressions are ranked by the model, 5% are randomized — and both carry a frozen prediction downstream, so that weeks later the analytics layer can subtract one from the other and recover causal truth.

FIG. 1 — End-to-end flow. The serving split is the spine; the exploration log is the control group; the frozen score is the time capsule that makes evaluation possible.

Three properties of this picture do most of the interview's work. First, the split happens at the mixer and is recorded: every impression carries a source tag — ranked or exploration — so the analytics layer can separate the two regimes without guessing. Second, the prediction is frozen on the hot path: the ranker writes predicted_watch_seconds and its model_version the instant the video is shown, before any outcome exists, which is the only moment the prediction is uncontaminated by what later happened. Third, the dashed lines are the whole point — the exploration log and the frozen scores flow into an offline evaluation that compares the algorithm's engagement against the random baseline, and that subtraction, not any online metric, is what licenses promoting a new model.

The Failure Philosophy, In One Rule

The firehose may be sampled; the control group may never be. Dropping 99% of impressions for analytics is fine: a 1% sample of a hundred billion is still a billion rows, statistically lavish. But the 5% exploration slice is logged at full fidelity and protected from any optimization that would shrink it, because the instant the control group is itself selected by the ranker, it stops being a control group and every lift number becomes a lie. Sample the firehose freely; never let an A/B test or a latency tweak quietly poison the exploration arm.
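
One way to keep that promise honest is an automated guard on the exploration share. The sketch below is an assumption layered on the design, not part of it: it treats each slot as an independent Bernoulli draw and flags any window whose observed share strays more than a chosen number of standard errors from 5%. The function name and the z = 4 threshold are hypothetical.

```python
import math

def exploration_coverage_ok(n_total: int, n_explore: int,
                            target: float = 0.05, z: float = 4.0) -> bool:
    """True if the observed exploration share is within z standard errors
    of the target rate, under an independent-Bernoulli approximation."""
    if n_total <= 0:
        return False                      # no traffic means nothing to vouch for
    observed = n_explore / n_total
    stderr = math.sqrt(target * (1.0 - target) / n_total)
    return abs(observed - target) <= z * stderr

print(exploration_coverage_ok(1_000_000, 50_000))   # share exactly on target
print(exploration_coverage_ok(1_000_000, 45_000))   # slice quietly shrunk to 4.5%
```

A guard like this would sit next to the Exploration Coverage tile: a slice that drifts from 5.0% to 4.5% looks small on a chart and is far outside sampling noise at this volume.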

§ 03 — DATA MODEL: Four facts, and a vault

The schema falls out of the causality question. The impression is the spine; the engagement hangs off it by id; the exploration event is the control arm at the same grain; the ranker score is the frozen prediction. And because engagement is high-PII, the identity lives behind a token in a separate vault.

The spine — one row per video shown

The impression fact is the most-written table on Earth, so its design is about discipline under sampling. One row per video shown to a user, stamped with the serving context the analytics layer will need: the feed position, the model_version that ranked it, the source that says whether this slot was ranked or exploration, and the sampled_pct that lets a downstream query reweight a 1% sample back to population scale. Crucially the user is keyed by a user_id_token, never the raw id — identity is borrowed, not stored.

DDL · THE SPINE — IMPRESSION FACT

-- One row per video shown. Billions/day; sampled to 1% for analytics,
-- full-fidelity in the feature store. Partitioned by shown_at (hour),
-- clustered on user_id_token. The 'source' column is the bias firewall.
CREATE TABLE fct_impression (
    impression_id   BIGINT       PRIMARY KEY,   -- snowflake: time-ordered
    user_id_token   BYTEA        NOT NULL,       -- tokenized, NOT the raw user_id
    video_id        BIGINT       NOT NULL,
    shown_at        TIMESTAMPTZ  NOT NULL,
    position_in_feed SMALLINT    NOT NULL,       -- position bias lives here
    model_version   TEXT         NOT NULL,       -- which ranker chose this (v117, v118…)
    source          TEXT         NOT NULL
                    CHECK (source IN ('ranked','exploration','cold_start')),
    sampled_pct     NUMERIC(6,4) NOT NULL        -- sampling fraction: 1.0000 = every row kept, 0.0100 = 1% sample; weight = 1 / sampled_pct
);
CREATE INDEX idx_imp_video ON fct_impression (video_id, shown_at DESC);
CREATE INDEX idx_imp_source ON fct_impression (source, shown_at DESC);  -- carve the 5%
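
The reweighting that sampled_pct enables can be sketched in a few lines. This assumes the column stores a fraction (1.0 for a fully kept row, 0.01 for a 1% sample), which is how the DDL comment reads; the dataclass and function names are illustrative, not part of the schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SampledImpression:
    impression_id: int
    source: str          # 'ranked' | 'exploration' | 'cold_start'
    sampled_pct: float   # fraction kept: 1.0 = every row, 0.01 = 1% sample

def estimated_population_count(rows, source: str) -> float:
    """Each surviving row stands in for 1 / sampled_pct original rows."""
    return sum(1.0 / r.sampled_pct for r in rows if r.source == source)

rows = [
    SampledImpression(1, "ranked", 0.01),        # survivor of the 1% sample
    SampledImpression(2, "ranked", 0.01),
    SampledImpression(3, "exploration", 1.0),    # control arm, kept in full
]
print(estimated_population_count(rows, "ranked"))
print(estimated_population_count(rows, "exploration"))
```

Two sampled ranked rows represent roughly two hundred served impressions, while the exploration row represents only itself. Mixing the two arms without the weight would overstate the control group's share by two orders of magnitude.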

The outcome — engagement, joined by id

Engagement is a separate fact, joined to the impression by impression_id. Keeping it separate matters: an impression with no engagement row is a swipe-away, and that absence is itself a signal the ranker learns from. The watched-seconds column is the regression target the model predicts; the booleans are the classification targets. And like the spine, it is keyed by token, because this is the table a privacy review will scrutinize hardest.

DDL · THE OUTCOME — ENGAGEMENT FACT

CREATE TABLE fct_engagement (
    impression_id   BIGINT       NOT NULL REFERENCES fct_impression,
    user_id_token   BYTEA        NOT NULL,
    liked           BOOLEAN      NOT NULL DEFAULT false,
    shared          BOOLEAN      NOT NULL DEFAULT false,
    commented       BOOLEAN      NOT NULL DEFAULT false,
    watched_seconds NUMERIC(8,3) NOT NULL DEFAULT 0,   -- the regression target
    completed       BOOLEAN      NOT NULL DEFAULT false,
    skipped         BOOLEAN      NOT NULL DEFAULT false,
    engaged_at      TIMESTAMPTZ  NOT NULL DEFAULT now(),
    PRIMARY KEY (impression_id)
);
-- No engagement row for an impression = a swipe-away. The ABSENCE is data.

The control arm and the frozen prediction

Two more facts, and they are the ones the naive design forgets. The exploration event records the 5% of impressions that were random or holdout — the counterfactual baseline that answers "if we showed this to a thousand random users, how many would engage?" The ranker-score fact freezes the model's prediction at impression-time, stamped with version, so that "was the model right?" becomes a join between a prediction made in the past and an outcome known in the present. Both are append-only; neither is ever back-filled with hindsight.

DDL · THE CONTROL ARM + THE TIME CAPSULE

-- The 5%: videos the ranker would NOT have chosen. The control group.
CREATE TABLE fct_exploration_event (
    impression_id   BIGINT       PRIMARY KEY REFERENCES fct_impression,
    user_id_token   BYTEA        NOT NULL,
    video_id        BIGINT       NOT NULL,
    explore_kind    TEXT         NOT NULL
                    CHECK (explore_kind IN ('uniform_random','holdout_pool')),
    shown_at        TIMESTAMPTZ  NOT NULL,
    engaged         BOOLEAN      NOT NULL DEFAULT false,
    watched_seconds NUMERIC(8,3) NOT NULL DEFAULT 0
);

-- The frozen prediction. Written ON THE HOT PATH, before the outcome
-- exists. Five-to-ten model versions write here concurrently; the version
-- column is what lets us replay v117 vs v118 on the SAME impressions.
CREATE TABLE fct_ranker_score (
    impression_id          BIGINT  NOT NULL REFERENCES fct_impression,
    model_version          TEXT    NOT NULL,
    predicted_watch_seconds NUMERIC(8,3) NOT NULL,  -- e.g. 18.300
    predicted_engage_prob  NUMERIC(6,5) NOT NULL,
    is_shadow              BOOLEAN NOT NULL DEFAULT false,   -- scored but not served
    PRIMARY KEY (impression_id, model_version)        -- one row per (imp × model)
);
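
The append-only rule on the score fact can be made concrete with a small in-memory stand-in. This is a sketch under assumptions: the class and method names are hypothetical, and a real store would enforce the same rule through the composite primary key rather than a Python dict.

```python
class FrozenScoreStore:
    """In-memory stand-in for fct_ranker_score: append-only, one row per
    (impression_id, model_version), never overwritten with hindsight."""

    def __init__(self):
        self._rows = {}

    def freeze(self, impression_id, model_version,
               predicted_watch_seconds, predicted_engage_prob,
               is_shadow=False):
        key = (impression_id, model_version)
        if key in self._rows:
            # A second write for the same key would be a back-fill: refuse it.
            raise ValueError(f"score already frozen for {key}")
        self._rows[key] = {
            "predicted_watch_seconds": predicted_watch_seconds,
            "predicted_engage_prob": predicted_engage_prob,
            "is_shadow": is_shadow,
        }

    def get(self, impression_id, model_version):
        return self._rows[(impression_id, model_version)]

store = FrozenScoreStore()
store.freeze(1001, "v118", 18.3, 0.62)                   # live prediction
store.freeze(1001, "v119", 15.1, 0.58, is_shadow=True)   # shadow, same impression
```

Two versions coexist on one impression because the key is the pair, and neither can be revised once the outcome is known.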

And the vault — deliberately apart. The token that keys every fact above maps to a real user id only inside vault.user_id_map, an encrypted store with its own access pool that the analytics warehouse cannot read. Raw events carry a 30-day retention; the aggregated rollups (per-user-per-day-per-cluster) are de-identified enough to keep indefinitely. The separation is not decoration — it is what lets the firehose be useful to ML without being a standing liability to privacy.
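
The article does not say how the token is produced, so the following is one plausible construction and nothing more: a keyed HMAC over the raw id, with the key held only by the vault service. The function name and key handling are assumptions; the vault would still keep the token-to-id mapping in vault.user_id_map.

```python
import hashlib
import hmac

def user_id_token(raw_user_id: int, vault_key: bytes) -> bytes:
    """Derive a stable token from a raw user id using a secret key.
    Without the key, the token cannot be recomputed from a guessed id."""
    message = str(raw_user_id).encode("utf-8")
    return hmac.new(vault_key, message, hashlib.sha256).digest()

demo_key = b"demo-key-held-only-by-the-vault"   # placeholder, never hard-code a real key
token = user_id_token(123456789, demo_key)
print(len(token), token.hex()[:16])
```

The 32-byte digest fits the BYTEA column, and the same user always maps to the same token, so joins across the fact tables work without the warehouse ever seeing a raw id.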

§ 04 — THE INVARIANT: The prediction is frozen before the outcome

The whole correctness of this system lives in one rule about time: the model's prediction is written before the model can possibly know whether it was right. Freeze the score at impression-time, stamp it with a version, and every later question about model quality becomes a deterministic join across time instead of an irreproducible guess.

Why this is the invariant, and not a nice-to-have. Models are versioned and they change daily; today's recommendation was made by ranker v117, and replaying the same impression through v118 produces a different number. If you do not capture v117's prediction at the moment it was made, you lose it forever — there is no way to reconstruct what the deployed model believed about a video it has long since stopped seeing. Storing the score freezes the test. "Was the model right?" becomes comparing a prediction-then to an outcome-now, and "is the new model better?" becomes scoring it on the same impressions retroactively, with no online experiment required.

CANDIDATE → SCORED v118 → SHOWN @ pos → ENGAGED / SKIPPED → EVALUATED

Read the lifecycle left to right and notice where the freeze sits: between SCORED and SHOWN, before the outcome exists. The score row is emitted in the same step as the impression (the log write itself stays asynchronous, off the response path); the engagement arrives seconds-to-hours later; the evaluation happens days later, offline. Because the prediction was frozen, the gap between SHOWN and EVALUATED can be arbitrarily long and the comparison stays honest. A video shown ten million times with a high predicted score and a low actual watch-time is not noise; it is the cleanest possible diagnostic of model drift, and you only have it because the prediction was stamped before the truth came in.

SQL · THE FROZEN TEST — PREDICTED-THEN vs OUTCOME-NOW

-- Per model version: how far did predicted watch-time miss the actual?
-- This is only answerable because the score was frozen at impression-time.
-- Live and shadow rows are both included, so a shadow version is measured on
-- impressions it scored but never served. For a strict head-to-head, restrict
-- to impression_ids that every compared version has scored.
SELECT
    r.model_version,
    r.is_shadow,
    count(*)                                              AS n,
    round(avg(r.predicted_watch_seconds), 3)             AS avg_predicted,
    round(avg(coalesce(e.watched_seconds, 0)), 3)        AS avg_actual,
    round(avg(abs(r.predicted_watch_seconds
                  - coalesce(e.watched_seconds, 0))), 3) AS mae,   -- mean absolute error
    round(corr(r.predicted_watch_seconds,
               coalesce(e.watched_seconds, 0))::numeric, 4) AS rank_signal   -- Pearson correlation
FROM       fct_ranker_score r
LEFT JOIN  fct_engagement   e USING (impression_id)   -- LEFT: no engagement row = swipe-away = 0 s
GROUP BY   r.model_version, r.is_shadow
ORDER BY   mae ASC;     -- lower MAE = predictions closer to actual watch-time

The is_shadow flag and the composite key (impression_id, model_version) are doing the quiet work. A shadow model writes a prediction for an impression it did not serve — same impression id, different version, is_shadow = true — so the head-to-head above can include a model that has never touched a real user. That is the whole game: you compare v119 to v118 on identical impressions, offline, before v119 is ever allowed to influence a single feed.
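
"Identical impressions" is worth pinning down, because a version-by-version average over whatever each model happened to score is not a head-to-head. A minimal Python sketch of the stricter comparison, assuming scores arrive as plain tuples and that a missing outcome means a swipe-away with zero watch-time; the function name is illustrative.

```python
def head_to_head_mae(scores, actual_watch):
    """scores: iterable of (impression_id, model_version, predicted_seconds).
    actual_watch: dict of impression_id -> watched seconds; a missing id is
    treated as a swipe-away (0 seconds).
    Returns {model_version: MAE} over impressions scored by EVERY version."""
    by_version = {}
    for imp_id, version, predicted in scores:
        by_version.setdefault(version, {})[imp_id] = predicted
    if not by_version:
        return {}
    common = set.intersection(*(set(p) for p in by_version.values()))
    if not common:
        return {}
    return {
        version: sum(abs(preds[i] - actual_watch.get(i, 0.0)) for i in common)
                 / len(common)
        for version, preds in by_version.items()
    }

scores = [
    (1, "v118", 10.0), (2, "v118", 20.0), (3, "v118", 30.0),   # live
    (1, "v119", 12.0), (2, "v119", 5.0),                       # shadow
]
actual = {1: 12.0}            # impression 2 has no row: a swipe-away
result = head_to_head_mae(scores, actual)
print(result)
```

Impression 3 is dropped because v119 never scored it, so both versions are judged on impressions 1 and 2 only. In this toy data the shadow model wins because it expected the swipe-away the live model did not.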

Freeze the prediction before the outcome, or you can never prove the model improved. A score written after the fact is a memory; a score written before it is evidence.
RECSYS RULE Nº 1

§ 05 — INGESTION & STREAMS: Python on the feed

Three programs carry the system: the mixer that carves the exploration slice and freezes the score, the logger that joins engagement back to its impression, and the cold-start governor that forces new videos into the light. Each is small; the judgment is in what they refuse to do.

1 · The mixer — carve the 5%, freeze the score

This runs on the serving hot path, under the 200 ms budget, so it does the minimum and logs asynchronously. It assigns each feed slot a source (ranked or exploration) by a stable hash, freezes the chosen ranker's prediction into the score buffer the instant the slot is filled, and never blocks the response on the log write. The carve is deterministic per (user, request, slot), so the exploration assignment is reproducible and auditable, not a coin flip the system forgets.

PYTHON · FEED MIXER — DETERMINISTIC SPLIT + FROZEN SCORE

import hashlib

EXPLORE_RATE = 0.05        # the control arm — guarded, never optimized away

class FeedMixer:
    """Fills feed slots from the ranker, but routes a deterministic 5%
    to uniform-random exploration. The split is a STABLE hash, not a
    coin flip: the same (user, request, slot) always lands the same way,
    so the exploration assignment is reproducible for audit."""

    def __init__(self, ranker, random_pool, log):
        self.ranker, self.random_pool, self.log = ranker, random_pool, log

    def fill_slot(self, user, request_id, slot_idx) -> dict:
        if self._is_exploration(user.id, request_id, slot_idx):
            video = self.random_pool.sample()              # the ranker did NOT choose this
            source, version = "exploration", "none"
            score = None                                  # no prediction to freeze
        else:
            video, score = self.ranker.top(user)           # returns (video, prediction)
            source, version = "ranked", self.ranker.version
        imp_id = new_snowflake()                           # assumed helper: returns a time-ordered 64-bit id
        # Log async — NEVER block the 200ms response on the write:
        self.log.emit_impression(imp_id, user.token, video.id,
                                 slot_idx, version, source)
        if score is not None:
            self.log.freeze_score(imp_id, version,            # the time capsule
                                  score.watch_seconds, score.engage_prob)
        return {"impression_id": imp_id, "video": video}

    def _is_exploration(self, uid, req, slot) -> bool:
        h = hashlib.blake2b(f"{uid}:{req}:{slot}".encode(), digest_size=8)
        # byteorder is passed explicitly: it is a required argument before Python 3.11
        return (int.from_bytes(h.digest(), "big") % 10000) / 10000 < EXPLORE_RATE

One carve-out, always stated: the mixer never lets an experiment touch the exploration arm. A/B tests and feature flags partition the ranked 95%; the 5% control group is held constant across all of them, because a control group that is itself being experimented on is no longer a control group. This single discipline is what keeps every lift number downstream meaningful.
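
Both halves of that discipline, a reproducible carve and experiments that never touch it, can be checked in isolation. The sketch below restates the mixer's hash as a free function and adds a hypothetical experiment_arm helper that is not in the mixer above; the two-arm default and all names are assumptions for illustration.

```python
import hashlib

EXPLORE_RATE = 0.05

def _bucket(key: str, buckets: int = 10_000) -> int:
    """Stable bucket for a string key; the same key always lands the same way."""
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % buckets

def is_exploration(user_id, request_id, slot_idx, rate=EXPLORE_RATE) -> bool:
    return _bucket(f"{user_id}:{request_id}:{slot_idx}") < round(rate * 10_000)

def experiment_arm(user_id, experiment, source, arms=("control", "treatment")):
    """Experiments partition ranked slots only; exploration slots get no arm."""
    if source != "ranked":
        return None
    return arms[_bucket(f"{experiment}:{user_id}", buckets=len(arms))]

def observed_explore_rate(n_users=5_000, slots_per_request=10) -> float:
    """Empirical share of slots routed to exploration over synthetic traffic."""
    hits = sum(is_exploration(u, f"req-{u}", s)
               for u in range(n_users) for s in range(slots_per_request))
    return hits / (n_users * slots_per_request)

print(f"observed exploration share: {observed_explore_rate():.4f}")
print(experiment_arm(42, "new-ui-flag", "exploration"))   # no arm is assigned
```

Because the experiment hash is keyed separately from the exploration hash, adding or removing an experiment cannot move a slot in or out of the control arm.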

2 · The engagement logger — join the outcome to the impression

PYTHON · ENGAGEMENT LOGGER — IDEMPOTENT BY IMPRESSION

async def log_engagement(store, ev) -> None:
    """Engagement arrives on a separate stream, seconds-to-hours after
    the impression. Idempotent on impression_id: a video the user likes,
    then shares, then finishes is ONE evolving row, not three. The client
    sends the impression_id it was served, so the join is exact."""
    await store.upsert_engagement(
        impression_id   = ev.impression_id,        # the join key, from the served slot
        user_id_token   = ev.user_token,           # tokenized, never raw id
        liked           = ev.liked,
        shared          = ev.shared,
        commented       = ev.commented,
        watched_seconds = ev.watched_seconds,       # last-write-wins; final dwell
        completed       = ev.watched_seconds >= ev.video_duration * 0.95,
        skipped         = ev.watched_seconds < 1.0)
    # Note: we do NOT manufacture a row for impressions with no engagement.
    # The MISSING row is the swipe-away signal — absence carries information,
    # and a LEFT JOIN at analysis time recovers it precisely.
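
What "one evolving row" means depends on the merge policy inside the store, which the logger leaves unspecified. One possible policy, sketched synchronously and in memory: boolean flags are OR-merged so a later event cannot un-like a video (an assumption), while watched_seconds stays last-write-wins as the logger's comment states. The class is a hypothetical stand-in, not the real store.

```python
class EngagementStore:
    """In-memory stand-in for fct_engagement, keyed by impression_id."""

    FLAGS = ("liked", "shared", "commented")

    def __init__(self):
        self.rows = {}

    def upsert_engagement(self, impression_id, *, watched_seconds, **flags):
        row = self.rows.setdefault(
            impression_id,
            {"liked": False, "shared": False, "commented": False,
             "watched_seconds": 0.0})
        for name in self.FLAGS:
            # Assumption: a flag, once set, is never cleared by a later event.
            row[name] = row[name] or bool(flags.get(name, False))
        row["watched_seconds"] = watched_seconds   # last-write-wins
        return row

store = EngagementStore()
store.upsert_engagement(7, watched_seconds=3.0, liked=True)
store.upsert_engagement(7, watched_seconds=12.5, shared=True)
store.upsert_engagement(7, watched_seconds=12.5, shared=True)   # replayed event
print(store.rows)
```

Replaying the last event leaves the row unchanged, which is the property that lets the engagement stream be delivered at-least-once without inflating counts.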

3 · The cold-start governor — force new videos into the light

PYTHON · COLD-START — FORCED EXPLORATION WITH A COUNTDOWN

def cold_start_decision(video, lifecycle) -> dict:
    """New videos have no engagement history, so the ranker cannot score
    them — and if left to the ranker they never get shown, never earn
    history, and starve. Two levers break the loop: a forced-exploration
    budget, and inherited priors as a fallback feature."""
    if lifecycle.bootstrap_pool_remaining > 0:
        # Guarantee the first ~1000 impressions regardless of score, so the
        # video earns real engagement data instead of dying in the dark.
        lifecycle.bootstrap_pool_remaining -= 1          # counted down per impression
        return {"force_show": True, "source": "cold_start"}
    # Pool spent: fall back to the creator's and sound's priors as features
    # so the ranker has SOMETHING to go on until first-party signal arrives.
    return {"force_show": False,
            "prior_engage_rate": video.creator.recent_engagement_rate,
            "prior_sound_rate":  video.sound.recent_engagement_rate}

§ 06 — OFF-POLICY EVALUATION: Subtracting the bias out

Aggregation here is not a roll-up — it is a causal estimate. The exploration arm gives the counterfactual; the ranked arm gives the observed; lift is the difference, normalized. This is the slow, offline computation that decides whether a model ships, and it is the reason the 5% exists.

The intuition first, because it is the thing an interviewer wants to hear out loud. Suppose "videos liked by similar users" get clicked 80% of the time. Is the algorithm good, or were those videos going to win regardless of who surfaced them? The ranked data cannot tell you — it only ever showed those videos to people it already believed would click. The exploration arm can: show a thousand random users that same video and measure how many engage with no algorithmic selection at all. That random click-through is the counterfactual baseline. Model lift = (algorithmic CTR − exploration CTR) ÷ exploration CTR. Without exploration you can compute attribution; only with it can you compute causation. This is off-policy evaluation, and naming it is half the points.

PYTHON · OFF-POLICY LIFT — SKETCH OF THE ESTIMATION JOB

# Compare what the algorithm achieved against the random baseline,
# per video cluster. The exploration arm is the control; the ranked arm
# is the treatment. Inverse-sampling weights (1 / sampled_pct) undo the 1% sampling.

ranked = (impressions
    .filter(lambda i: i.source == "ranked")
    .join(engagement, on="impression_id", how="left")   # left: swipe-aways count as 0
    .with_weight(lambda i: 1.0 / i.sampled_pct)         # reweight the 1% sample
    .key_by(lambda i: i.video_cluster)
    .aggregate(ctr=mean("engaged"), watch=mean("watched_seconds")))

explore = (exploration_events
    .key_by(lambda e: e.video_cluster)
    .aggregate(ctr=mean("engaged"), watch=mean("watched_seconds")))

def causal_lift(cluster, ranked_ctr, explore_ctr):
    if explore_ctr <= 0:                       # no baseline → no causal claim
        return None
    return (ranked_ctr - explore_ctr) / explore_ctr   # the only honest lift

lift = (ranked.join(explore, on="video_cluster")
    .map(causal_lift)
    .sink(model_lift_table))   # read by the promote/rollback gate — the ship decision
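
The dataflow above is a sketch against an unnamed pipeline API, so here is the same estimate in plain Python that can be run as-is. The input shapes (tuples of impression id, cluster, and weight, plus a set of engaged ids) are assumptions chosen for brevity; the lift formula is the one stated in the text.

```python
from collections import defaultdict

def ctr_by_cluster(impressions, engaged_ids):
    """impressions: iterable of (impression_id, video_cluster, weight).
    engaged_ids: ids with any engagement; an absent id counts as 0."""
    num, den = defaultdict(float), defaultdict(float)
    for imp_id, cluster, weight in impressions:
        den[cluster] += weight
        if imp_id in engaged_ids:
            num[cluster] += weight
    return {c: num[c] / den[c] for c in den if den[c] > 0}

def causal_lift(ranked_ctr, explore_ctr):
    if explore_ctr <= 0:          # no baseline, so no causal claim
        return None
    return (ranked_ctr - explore_ctr) / explore_ctr

def lift_by_cluster(ranked, explore, engaged_ids):
    r = ctr_by_cluster(ranked, engaged_ids)
    e = ctr_by_cluster(explore, engaged_ids)
    # Only clusters present in BOTH arms can be compared.
    return {c: causal_lift(r[c], e[c]) for c in r.keys() & e.keys()}

ranked = [(i, "cooking", 100.0) for i in (1, 2, 3, 4)]      # 1% sample, weight 100
explore = [(i, "cooking", 1.0) for i in (11, 12, 13, 14)]   # full fidelity, weight 1
explore += [(21, "chess", 1.0)]                             # no ranked counterpart
engaged = {1, 2, 3, 11}
print(lift_by_cluster(ranked, explore, engaged))
```

In the toy data the ranked arm engages on three of four impressions and the random arm on one of four, so the lift is (0.75 − 0.25) ÷ 0.25. The chess cluster is silently excluded because it has no ranked impressions to compare against.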

The shadow-traffic rollout is the operational half of the same idea, and it is why model_version sits on every score row. A new ranker first runs as shadow: it scores the same impressions the live model serves, writes its predictions with is_shadow = true, and influences nothing. If its frozen predictions correlate better with actual outcomes than the incumbent's — measured by the §04 head-to-head on identical impressions — it graduates to a 1% A/B, then ramps to 100%. Five to ten versions write to the same table concurrently, joined by version for evaluation; promotion is a data decision, not a deploy decision, and rollback is just routing traffic back to the previous version that is still scoring in shadow.
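
"Promotion is a data decision" can be read as a small state machine over the ramp. The sketch below is one interpretation, not the article's gate: it uses MAE from the §04 comparison as the evidence, and the stage names, the minimum impression count, and the improvement margin are all hypothetical.

```python
RAMP_STAGES = ("shadow", "ab_1pct", "full")

def next_stage(stage, candidate_mae, incumbent_mae, n_impressions,
               min_impressions=1_000_000, min_improvement=0.0):
    """Advance one stage only when the candidate beats the incumbent's MAE
    on enough shared impressions; otherwise stay where it is."""
    if stage not in RAMP_STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    enough_data = n_impressions >= min_impressions
    better = (incumbent_mae - candidate_mae) > min_improvement
    if not (enough_data and better):
        return stage
    idx = RAMP_STAGES.index(stage)
    return RAMP_STAGES[min(idx + 1, len(RAMP_STAGES) - 1)]

# Illustrative inputs: a shadow candidate at 3.2 s MAE against a live 4.1 s.
print(next_stage("shadow", 3.2, 4.1, n_impressions=5_000_000))
print(next_stage("shadow", 3.2, 4.1, n_impressions=10_000))   # not enough evidence
```

The gate never skips a stage and never moves on thin data, which keeps the ramp a sequence of small, reversible steps.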

A model earns production by beating the incumbent on the same impressions it never served. Ship on evidence frozen in the log, not on a hunch ramped to a hundred percent.
RECSYS RULE Nº 2 — SHADOW BEFORE SERVE

§ 07 — ANALYTICS SQL: Interrogating the loop

The four facts are where the system explains itself. Three queries an interviewer loves, because each one carries a classic pattern on its back — the causal-lift join, position-bias as a window, and the cohort retention that exposes echo-chamber narrowing.

Causal lift — the treatment-minus-control join

The single most important query in the system, and the one that distinguishes a senior answer. It joins the ranked arm to the exploration arm per video cluster and computes lift as the normalized difference. A positive lift means the algorithm genuinely surfaced engagement the random baseline would not have; a lift near zero means the ranker is taking credit for videos that were going to win anyway.

SQL · MODEL LIFT — ALGORITHM vs EXPLORATION BASELINE

WITH ranked AS (
    SELECT v.video_cluster,
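           -- coalesce is required: after a LEFT JOIN a swipe-away has NULL
           -- watched_seconds, and avg() skips NULLs instead of counting them as 0.
           avg((coalesce(e.watched_seconds, 0) > 0)::int)  AS algo_ctr,
           avg(coalesce(e.watched_seconds, 0))             AS algo_watch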
    FROM   fct_impression i
    JOIN   dim_video v USING (video_id)
    LEFT JOIN fct_engagement e USING (impression_id)   -- LEFT: swipe-away = 0
    WHERE  i.source = 'ranked'
    GROUP BY v.video_cluster
),
explore AS (
    SELECT v.video_cluster,
           avg(x.engaged::int)   AS base_ctr,
           avg(x.watched_seconds) AS base_watch
    FROM   fct_exploration_event x
    JOIN   dim_video v USING (video_id)
    GROUP BY v.video_cluster
)
SELECT r.video_cluster, r.algo_ctr, e.base_ctr,
       round((r.algo_ctr - e.base_ctr)
             / nullif(e.base_ctr, 0), 3)  AS causal_lift
FROM   ranked r
JOIN   explore e USING (video_cluster)
ORDER BY causal_lift DESC;
-- lift > 0: the algorithm caused engagement. lift ≈ 0: it only attributed it.

Position bias — discount engagement by slot with a window

A video at the top of the feed gets watched more because it is at the top, not only because it is good. To learn an unbiased signal, you measure the baseline engagement per position and discount accordingly. The pattern is a per-slot baseline (computed here as a grouped CTE rather than a literal window function): average engagement at each slot, then express each impression's outcome relative to its slot's norm.

SQL · POSITION-NORMALIZED ENGAGEMENT — WINDOW BY SLOT

-- coalesce throughout: after a LEFT JOIN a swipe-away has NULL watched_seconds,
-- and avg() would skip it rather than count it as zero engagement.
WITH per_slot AS (
    SELECT i.position_in_feed,
           avg((coalesce(e.watched_seconds, 0) > 0)::int)  AS slot_ctr
    FROM   fct_impression i
    LEFT JOIN fct_engagement e USING (impression_id)
    WHERE  i.source = 'exploration'          -- measure bias on UNranked slots
    GROUP BY i.position_in_feed
)
SELECT v.video_id,
       count(*)                                            AS impressions,
       avg((coalesce(e.watched_seconds, 0) > 0)::int)      AS raw_ctr,
       avg((coalesce(e.watched_seconds, 0) > 0)::int
           / nullif(s.slot_ctr, 0))                       AS position_adj_ctr
FROM   fct_impression i
JOIN   per_slot s USING (position_in_feed)
LEFT JOIN fct_engagement e USING (impression_id)
JOIN   dim_video v USING (video_id)
GROUP BY v.video_id
HAVING count(*) > 1000
ORDER BY position_adj_ctr DESC;
-- A video that beats its slot's baseline is genuinely good, not just well-placed.
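
The same normalization in plain Python makes the effect visible on a handful of rows. A sketch under assumptions: the row shapes are invented for the example, and slots without a positive baseline are skipped to mirror the nullif in the SQL.

```python
from collections import defaultdict

def slot_baselines(exploration_rows):
    """exploration_rows: iterable of (position, engaged). Returns {position: ctr}."""
    shown, hits = defaultdict(int), defaultdict(int)
    for position, engaged in exploration_rows:
        shown[position] += 1
        hits[position] += int(engaged)
    return {p: hits[p] / shown[p] for p in shown}

def position_adjusted_ctr(rows, baselines):
    """rows: iterable of (video_id, position, engaged).
    Returns {video_id: mean(engaged / slot_ctr)}; rows whose slot has no
    positive baseline are skipped."""
    total, count = defaultdict(float), defaultdict(int)
    for video_id, position, engaged in rows:
        base = baselines.get(position, 0.0)
        if base <= 0:
            continue
        total[video_id] += int(engaged) / base
        count[video_id] += 1
    return {v: total[v] / count[v] for v in count}

baselines = slot_baselines(
    [(0, True), (0, True), (0, False), (0, False),      # top slot: 50% baseline
     (5, True), (5, False), (5, False), (5, False)])    # slot 5:   25% baseline
rows = [("A", 0, True), ("A", 0, False),                # raw CTR 0.5 at the top
        ("B", 5, True), ("B", 5, False)]                # raw CTR 0.5 further down
adjusted = position_adjusted_ctr(rows, baselines)
print(adjusted)
```

Both videos have the same raw CTR, yet B earns it from a slot where engagement is half as likely, so its adjusted score is twice A's.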

Echo-chamber narrowing — cohort diversity retention

The failure the whole design fights, made visible. Track each user cohort's content diversity over their lifetime — are they being shown a narrowing set of clusters as the ranker learns them? This is a cohort retention shape applied to diversity instead of revenue: anchor on the user's first week, measure distinct clusters per week as the cohort ages, and watch for the collapse.

SQL · DIVERSITY DECAY BY USER COHORT — RETENTION SHAPE

WITH user_week AS (
    SELECT i.user_id_token,
           date_trunc('week', min(i.shown_at) OVER
                      (PARTITION BY i.user_id_token))        AS cohort_week,
           date_trunc('week', i.shown_at)                    AS active_week,
           v.video_cluster
    FROM   fct_impression i
    JOIN   dim_video v USING (video_id)
    WHERE  i.source = 'ranked'
)
SELECT cohort_week,
       (active_week::date - cohort_week::date) / 7            AS weeks_in,   -- whole weeks since cohort start; EXTRACT(week ...) is not defined for intervals
       count(DISTINCT user_id_token)                          AS users,
       round(avg(cluster_cnt), 1)                              AS avg_distinct_clusters
FROM (
    SELECT user_id_token, cohort_week, active_week,
           count(DISTINCT video_cluster) AS cluster_cnt
    FROM   user_week
    GROUP BY user_id_token, cohort_week, active_week
) w
GROUP BY cohort_week, weeks_in
ORDER BY cohort_week, weeks_in;
-- avg_distinct_clusters falling week-over-week = the echo chamber tightening.
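
The cohort shape is easier to trust after seeing it on six rows. A minimal Python sketch of the same computation, assuming impressions arrive as (user token, date, cluster) tuples and that weeks start on Monday, as date_trunc('week', ...) does in PostgreSQL; the function names are illustrative.

```python
from collections import defaultdict
from datetime import date, timedelta

def week_start(d: date) -> date:
    return d - timedelta(days=d.weekday())     # Monday of d's week

def diversity_by_weeks_in(impressions):
    """impressions: iterable of (user_token, shown_date, video_cluster).
    Returns {(cohort_week, weeks_in): average distinct clusters per active user}."""
    impressions = list(impressions)
    first_seen = {}
    for user, shown, _ in impressions:
        first_seen[user] = min(first_seen.get(user, shown), shown)
    clusters = defaultdict(set)                # (user, cohort, weeks_in) -> clusters
    for user, shown, cluster in impressions:
        cohort = week_start(first_seen[user])
        weeks_in = (week_start(shown) - cohort).days // 7
        clusters[(user, cohort, weeks_in)].add(cluster)
    per_cell = defaultdict(list)
    for (_, cohort, weeks_in), seen in clusters.items():
        per_cell[(cohort, weeks_in)].append(len(seen))
    return {cell: sum(v) / len(v) for cell, v in per_cell.items()}

rows = [
    ("u1", date(2024, 1, 1), "cooking"), ("u1", date(2024, 1, 1), "chess"),
    ("u1", date(2024, 1, 3), "travel"),  ("u1", date(2024, 1, 9), "cooking"),
    ("u2", date(2024, 1, 2), "cooking"), ("u2", date(2024, 1, 2), "chess"),
    ("u2", date(2024, 1, 10), "cooking"), ("u2", date(2024, 1, 10), "chess"),
]
result = diversity_by_weeks_in(rows)
print(result)
```

In this toy cohort the average falls from 2.5 clusters in week 0 to 1.5 in week 1, driven entirely by u1 collapsing onto a single cluster.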

§ 08 — THE DASHBOARD: Watching the loop for drift

A senior design ends with observability, because every safeguard above is invisible without it. The recsys dashboard watches three things at once: is the live model causally lifting, is the candidate model safe to promote, and is the feedback loop narrowing the world.

CAUSAL HEALTH: model lift (ranked CTR over exploration baseline), exploration coverage (is the 5% intact?), prediction calibration MAE on the live version — falling lift means the ranker is attributing, not causing.
ROLLOUT SAFETY: shadow vs live MAE on identical impressions, A/B ramp stage, prediction drift (a high-score / low-watch divergence), versions in flight — the gate that decides promote or roll back.
LOOP HEALTH: content diversity per cohort over time, cold-start escape rate (new videos that earned real signal), creator gini (is engagement concentrating?) — the early warning that the system is eating itself.
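
Of the loop-health instruments, the creator gini is the least self-explanatory. The article does not define how the tile is computed, so this is the standard Gini coefficient applied to per-creator engagement totals, offered as an assumption about what such a tile would measure.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0.0 means engagement is spread
    evenly across creators; values near 1.0 mean a few creators hold nearly all."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1) / n

print(gini([10, 10, 10, 10]))    # perfectly even
print(gini([0, 0, 0, 40]))       # one creator takes everything
```

With only four creators the maximum is 0.75, not 1.0; the ceiling approaches 1 as the population grows, which is worth remembering when comparing gini values across cohorts of different size.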

FYP Ranker Ops — Global · WED 14:20 UTC · LIVE v118 · SHADOW v119 · 60s REFRESH

Tile	Value
Model Lift (live)	+34%
Exploration Coverage	5.0%
Calibration MAE	4.1s
Shadow MAE (v119)	3.2s
Cold-Start Escape	82%
Content Diversity	6.4
Prediction Drift	0.7%
Creator Gini	0.71
A/B Ramp	[not shown]
Versions In Flight	[not shown]

Chart: Lift vs Calibration over 24h, live v118 (causal lift ↑ good, MAE ↓ good)

FIG. 2 — The story a healthy-but-watched system tells: live lift positive and rising, exploration locked at exactly 5.0%, shadow v119 already beating live on calibration — and the amber tiles, diversity and creator gini, drifting the wrong way as the loop quietly concentrates attention.

Read the amber tiles together and the dashboard narrates the central tension of the whole problem: the model is winning on its own terms — lift up, shadow ready to promote — while content diversity sags and the creator gini climbs, the feedback loop tightening even as every serving metric looks excellent. That is exactly why the 5% and the diversity cohort exist: they are the only instruments that see the system narrowing the world while it congratulates itself on engagement.

§ 09 — THE RUBRIC: What was actually being tested

Strip the video details away and the question was testing five judgments, each of which generalizes far beyond recommendation:

CONFOUNDING: Seeing that a system which logs only its own choices can never measure causality — and carving a deliberate control group out of production to recover it. The 5% is an epistemics decision, not a feature.
TEMPORAL HONESTY: Freezing the prediction before the outcome exists, stamped with its version, so model quality is a reproducible join across time rather than an irretrievable guess.
SAFE ITERATION: Shipping models on evidence, not hope: shadow-score the same impressions, compare offline, promote through a ramp, roll back by routing. Promotion is a data decision.
FEEDBACK AWARENESS: Designing against the loop the system creates — cold-start budgets so new entrants escape starvation, diversity instruments so the echo chamber is visible before it is irreversible.
PRIVACY POSTURE: Treating the firehose as a liability as well as an asset: tokens not raw ids, an isolated vault, short raw retention with de-identified rollups kept long.

The algorithm proposes; the 5% disposes of the illusion. A recommender that cannot subtract its own bias is not learning; it is only agreeing with itself, louder each day.
CLOSING ARGUMENT