Model the analytics backend for a global streaming service. Fifteen billion play events a day, a content tree four levels deep, and a behavior — bingewatching — that is never recorded and must be derived. The one decision that separates a senior answer: a play event is not allowed to wait for its metadata. A complete working-through — data flow, schema, the defensive join, streaming Python, the nightly derivation, and the dashboard.
It sounds like a reporting task. It is really a question about time — three different definitions of it, fighting over the same fact table.
"Design the data model behind a streaming service like Netflix — play events, series and episodes, and bingewatching. Two hundred and fifty million subscribers, fifteen billion plays a day. How would you scope it?"
The trap is that all three sub-problems hide inside the word "play." A play event is dense, immutable, and arrives in a torrent. But the moment you ask anything interesting of it — what was watched, by whom, as part of what behavior — you discover three separate clocks. Content metadata propagates on its own schedule, regionally, over minutes to hours. The content hierarchy must roll up without double-counting, so that a subscriber who watches eight episodes is one series-bingewatcher and not eight series-views. And bingewatching itself is not a column you can write; it is a shape in the play stream that only exists once you have stitched a night's worth of sessions together. A junior answer reaches for a star schema and a risk_score-shaped is_binge flag. A senior answer notices that the fact arrives before its dimension, that the dimension changes underneath the fact, and that the behavior is downstream of both.
So before any boxes and arrows, the working frame for the session — three structural challenges that the schema has to hold simultaneously:
Scope is the first scored dimension, and most candidates skip it. State what you are building and what you are deliberately ignoring. Out of scope, said explicitly: the recommendation model itself (a consumer of these facts, not part of this design), the CDN and adaptive-bitrate delivery path, billing and payments, and the encode pipeline. In scope: the play fact and its companions, the content and subscriber dimensions, the late-arriving-dimension defense, the bingewatch and streak derivations, and the analytics surface those feed. One caveat, volunteered: the design should not preclude the password-sharing analysis that Trust & Safety will ask for later — which means the account and the profile must both live on every play row from day one.
Then the envelope math, offered rather than extracted. Netflix-shaped numbers:
| Quantity | Estimate | Consequence |
|---|---|---|
| Subscribers | 250,000,000 | Up to 5 profiles each — the dual-key population |
| Play events | ≈ 15 B / day → ~175 K/s | The number that shapes the whole architecture |
| Heartbeats (every 60 s) | ~120 B / day | 8× the plays — split to its own table, downsample |
| Launch-night metadata lag | seconds to hours | The orphan window every join must survive |
| Orphan-rate SLO | < 0.5% | Above 1% pages the data-platform on-call |
| Binge derivation | nightly batch | Materialized so analyst queries never re-derive |
| Raw plays, 5-yr title history | tens of PB | Episode-only fact + bridge — never backfill on re-title |
Notice the asymmetry. The heartbeat stream is an order of magnitude heavier than the play stream, and the play stream is itself heavy enough that no per-event synchronous lookup against a regional dimension can be on its critical path. That single fact dictates the join strategy, the partition layout, and the failure philosophy. The rest of this article follows the play.
One pipeline, two clocks. Plays land immediately and unconditionally; metadata propagates on its own schedule and is reconciled at read time, never at write time. The fact never waits for the dimension.
Three properties of this picture do most of the interview's work. First, the collector validates shape and nothing else — it never calls the content service to confirm the episode exists, because a regional metadata miss at 175 K events per second would either block the firehose or drop the launch. Second, the fact stores episode_id and only episode_id — the immutable identity — never the title, never the series name; every human-readable attribute is resolved by joining the dimension at query time. Third, the dashed line is dashed on purpose: the marriage of fact and dimension is deferred to the read path, where a missing or stale title degrades into a labeled placeholder rather than a dropped row.
The fact plane never blocks on the metadata plane. When the dimension is missing, the play is still written, still counted, and still attributed — to a searchable placeholder ("Pending Sync") that the BI tool can filter on. The orphan is visible, measurable against a < 0.5% SLO, and self-heals the instant metadata propagates, because the join re-resolves on the next query. The failure mode is a labeled unknown, never a silent zero.
The schema falls out of two questions: what is immutable, and what changes underneath us. Immutable identity goes on the fact. Mutable, re-titleable, late-arriving description goes in SCD2 dimensions, resolved at read time.
Four dimensions, each pointing at its parent: dim_franchise → dim_series → dim_season → dim_episode. The titles are SCD2 — Marvel acquires a show and renames it; a series gets re-genre'd — so each carries effective_from and effective_to. The fact never stores the title; it stores the leaf identity and walks up the tree at query time.
Netflix's identity model is unusual and load-bearing: one billing account, up to five profiles. Billing, plan tier, and payment events are account-grain — they cross profile boundaries. Every play, search, browse, and rating is profile-grain. The rule that makes everything downstream tractable is that both keys live on every play row — never just one. Roll up "minutes per account" and you simply sum without re-joining; ask "which profiles on this account watched from incompatible geographies" and the password-sharing analysis is a GROUP BY, not a join puzzle.
The grain of fct_play_event is one row per playback session start. It is append-only and partitioned by play_date so that launch night is one hot partition and last year is cold Parquet. Critically it carries account_id and profile_id, the bare episode_id (no resolved title), and the wall-clock played_at — which is the timestamp every temporal join will pivot on. Quality-of-experience telemetry — rebuffering, bitrate, startup latency — is an order of magnitude heavier, so it splits into fct_play_heartbeat and is downsampled in the warehouse. The two derived facts, fct_binge_session and fct_series_streak, are computed nightly and materialized so that analyst queries read them rather than re-deriving them from fifteen billion rows.
One bridge completes the model. Temporal correctness — attributing a play to the title that was current at the time of the event, not now — is handled by an SCD2 bridge keyed on episode_id with effective_from/effective_to. It is the same SCD2 mechanism the dimensions use, exposed as a thin join surface for the read path.
Everything in this design protects one sentence: every play is counted exactly once, attributed to the version of the truth that was current when it happened. Two defenses enforce it — a defensive join for lateness, and a temporal join for change.
The lifecycle of a single play is short but it crosses both clocks. It is collected, streamed, and written with episode_id intact. At that moment the episode's dimension row may not exist in this region yet. Later — seconds, or hours — it propagates. Later still the show may be re-titled. The invariant says none of these later events is allowed to change the count or corrupt the attribution of the play that already happened.
The defensive read is where correctness lives. It is three moves working as one. LEFT JOIN keeps every play, orphan or not. ROW_NUMBER() … rn = 1 picks the most-recent SCD2 version of the dimension when several exist. COALESCE with a searchable placeholder turns a missing title into a visible, filterable "Pending Sync" rather than a NULL that silently vanishes from a GROUP BY. This is the make-or-break pattern of the whole system, and an interviewer is watching for it by name.
The cousin of lateness is change, and it is handled by the same machinery pointed the other way. A user watches an episode offline on the subway; the device uploads the play four hours later, by which time the dimension has moved on. The temporal join binds the play to the version that was effective at played_at, not the version that is current now.
Three programs carry the event plane: the collector that refuses to resolve titles, the consumer that lands plays idempotently, and the orphan monitor that turns the SLO into a number. Each is small; the judgment is in what they refuse to do.
The collector's one temptation is to be helpful: to look up the episode and stamp the title onto the event before it streams, so downstream joins are simpler. It must refuse. A synchronous content-service call at 175 K events per second couples the firehose to a regional cache that, by design, is sometimes behind. The collector validates that the event has the fields it needs — both keys, an episode_id, a timestamp — and streams it. Resolution is somebody else's job, later, where lateness is cheap.
One distinction, always stated: a malformed event — missing a required field — is dropped loudly and counted as a reject, because it can never be made correct. An orphan event — well-formed but pointing at a title that hasn't propagated — is kept, because it will become correct on its own. The collector handles the first; the read path handles the second. Conflating them is how a launch-night metadata lag turns into thrown-away analytics.
The defensive join keeps orphans countable; this job makes the count actionable. It measures the fraction of recent plays whose episode_id has no resolvable dimension version, compares it to the SLO, and pages only on a rate — never a raw count, because the raw count spikes mechanically every launch night and then drains as metadata catches up. The signal is the persistence of orphans, not their existence.
Bingewatching is not an event; it is a pattern in the play stream. There are two definitions, they answer different questions, and the schema must carry both without one re-writing the other. The craft is in the sessionization — and in the discipline of materializing nightly.
The two definitions are not interchangeable. Per-session binge — three or more episodes of the same series within twenty-four hours, with gaps under thirty minutes — is the signal the auto-play algorithm tunes against: it captures a continuous sitting. Per-day binge — any three-plus episodes of the same series in a calendar day — is the marketing rollup: "what percentage of subscribers binged this month." Force them into one definition and you mislead one stakeholder to please the other. So both are derived facts, computed nightly, materialized so analysts never re-derive from the raw firehose.
The per-session derivation is a sessionization — the canonical gaps-and-islands move applied to time. Partition the same-series plays by (account_id, profile_id, series_id), look back one row with LAG, and start a new session every time the gap exceeds thirty minutes. A running sum of those session-start flags becomes the session id; group, count, and keep only the sittings of three or more.
The per-day rollup is a different shape entirely — and notably simpler, which is itself the point: it counts distinct episodes of a series within a single play_date, no gap logic, because the marketing question does not care whether the sitting was continuous. Two derivations, two grains, one fact stream feeding both.
The fact stream and its derivations are where the system explains itself. Three queries an interviewer loves, because each one carries a classic pattern on its back: SCD2 as-of attribution, conditional rollup over a content tree, and gaps-and-islands across consecutive days.
"Top binged series this week" requires going from episode to series without letting a re-title rewrite history. Store only episode_id on the fact; join up through the SCD2 hierarchy with a temporal predicate so each play attributes to the title that was current when it happened. The distinct-viewer count keys on account_id || '|' || series_id so eight episodes by one household are one viewer of one series — the grain discipline made literal.
"Watched at least one episode every day for five days" is the loyalty signal content acquisition uses for renewal decisions. It is a different shape from binge: same household across consecutive days, not many episodes in one sitting. The pattern is gaps-and-islands by date — subtract a dense row number from the date, and every run of consecutive days collapses to a constant island id.
Because both keys live on every play, the password-sharing question that Trust & Safety will eventually ask is a conditional aggregation, not a join expedition. Per account, count distinct profiles, distinct countries, and distinct devices in a window; the accounts whose plays span geographically incompatible patterns are the candidates. The dual key — not a clever model — is what makes this cheap.
A senior design ends with observability, because every defensive choice above is invisible without it. The dashboard watches two things at once: the health of the pipeline (is the orphan rate draining?) and the shape of the behavior (did the season get binged?).
Read the amber tiles together and the dashboard narrates the launch from §02: fifteen billion plays a day means the first hour outruns the regional metadata fan-out, so orphans rise while propagation lag is still draining — exactly the designed race. The tell that it is healthy and not broken is the reject rate pinned at one in ten thousand: every play is landing, none malformed, the orphans are well-formed rows wearing "Pending Sync" until their titles arrive. And the season is being devoured four episodes at a time. That is what a designed degradation looks like from the operator's chair on the biggest night of the quarter.
Strip the show away and the question was testing five judgments, each of which generalizes far beyond streaming: