Model Spotify's listening history so that one fact serves recommendations and pays four rights holders per stream — under a pool where the per-stream rate doesn't exist until the month closes. One decision separates the senior answer: the rate is a stored fact, not a running number. A full working-through — data flow, schema, the 30-second boundary, the SCD2 rights bridge, the period-close pipeline, and the SQL that splits a fraction of a cent four ways.
It sounds like an analytics question — store the plays, count them. It is really an accounting question wearing a product's clothes, and the moment money attaches to an event, every modeling shortcut becomes a future lawsuit.
"Model Spotify's listening history to power recommendations and Year in Review, and to pay per-stream royalties to multiple rights holders — artist, label, publisher, songwriter — under the pool model, where each country-period's revenue is divided by its total qualified streams. How would you scope it?"
The trap is to hear one verb — "store the plays" — and reach for one table. But a single play is consumed by two organizations that disagree about what it even is. To Product, a play is behavioral signal: every tap counts, and a three-second skip is the most useful data point of all, because it teaches the recommender what not to do. To Finance, that same skip does not exist: industry custom pays royalty only on streams of thirty seconds or more, so a skip is a non-event with a dollar value of exactly zero. The fact table has to satisfy both readings without forking into two sources of truth that drift apart by the next quarter.
Then the pool model adds a genuinely strange dependency. A stream's payout is not a price you can look up when it happens. It is revenue ÷ total qualified streams in the country-period — and the denominator is unknowable until the period ends and the last late-arriving offline play has landed. The rate a stream earns in April is only computable in May. That single fact reorders the whole architecture.
Scope is the first thing scored, and it is mostly about what you refuse to build. In scope: the play event, the rights bridge, the pool close, the attribution output, and the queries each stakeholder lives on. Explicitly out of scope, said aloud: the recommender model itself (it consumes the play fact, it does not live here), the audio pipeline, subscriber billing internals (we take collected revenue as an input), and tax withholding on payouts. One deliberate non-goal worth naming: I am not hard-coding pro-rata. The schema must also express user-centric payout (UCPS) — where a subscriber's fee flows only to artists they personally played — without a rewrite, because that is the live policy debate in the industry and an interviewer will probe it.
Then the envelope math, volunteered rather than dragged out. Spotify-shaped numbers:
| Quantity | Estimate | Consequence |
|---|---|---|
| Monthly active listeners | ~600 M | Sets the play-event firehose and the user dimension size |
| Plays per day | ≈ 4 B | Append-only event grain; partition by play_date or it is unqueryable |
| Qualified-stream fraction | ~55–70% | The flag that splits Product's denominator from Finance's |
| Rights holders per track | ~4 | Attribution rows ≈ 4 × qualified plays — linear, but 4× the firehose |
| Country pools / month | ~180 | Each freezes one revenue figure and one per-stream rate at close |
| Payout pool share of revenue | ≈ 70% | The ~30% Spotify retains; pool = collected revenue × 0.70 |
| Late offline streams | days after play | Attribution must key on month-of-event, never month-of-ingest |
Notice the asymmetry. The play firehose is enormous and cheap to be wrong about — a dropped skip costs nobody anything. The pool table is tiny, a few thousand rows a year, and catastrophic to be wrong about, because every payout in the country divides by its numbers. That inversion — vast-but-forgiving against tiny-but-sacred — is the spine of everything below. The architecture follows the dollars, not the bytes.
Two tempos share one spine. The hot path is a continuous play stream that fans out to recommendations the instant it lands. The cold path is a once-a-month batch that freezes a pool, joins a temporal bridge, and mints payouts that never change again.
Three properties of this picture carry the interview. First, the hot path never waits for money: a play reaches the recommender in milliseconds and lands in fct_play_events with its qualified flag already computed, so Product is never blocked on Finance's monthly cadence. Second, the cold path is a pure function of frozen inputs — collected revenue, the qualified-stream count, and the rights bridge as it stood at play time — which is what lets an auditor replay any month from scratch and land on the identical number. Third, fraud is a side-flag, never a delete: the anti-fraud Spark job sets is_fraudulent on offending rows so the close excludes them, but the original play survives, because "what was claimed versus what was paid" is itself a reportable, litigable fact.
The hot path may forget; the cold path may never lie. A lost or duplicated play barely moves a recommendation and is invisible against billions of events, so the stream is at-least-once and best-effort. A wrong payout is a breach of a rights contract, so the close is deterministic, idempotent, and reproducible from immutable inputs. When the two philosophies conflict, the money wins — the recommender can tolerate noise the ledger cannot.
The schema falls out of who owns what. Product owns the event grain. Finance owns three derived facts: the pool, the attribution, and the rights bridge that decides who gets paid. The cleverness is concentrated in two columns — a generated boolean and a pair of effective dates.
A play is a small, immutable record at colossal volume. The decisive choices are the generated is_royalty_eligible column — the thirty-second boundary computed once at write so no downstream query can forget it — and context_type, which tells the recommender which surface drove the play. Note what is not here: no payout amount. A play does not know what it earns until its country-period closes, so storing a dollar figure on this row would be a lie with a timestamp.
That one generated column does heavy lifting. Product queries the table raw — every row is signal. Finance adds WHERE is_royalty_eligible AND NOT is_fraudulent and is instantly looking at the payable universe, from the same source of truth, with no parallel "finance plays" table to drift out of sync. The thirty-second rule lives in the schema, not in a hundred analysts' memories.
Here is where most candidates lose the plot. The split of a stream — how much to artist, label, publisher, songwriter — is not a property of the track. It is a property of the deal that was in force when the stream happened. Deals renegotiate mid-quarter; a label might cede 5% to an artist effective the 15th. So the bridge is Slowly-Changing-Dimension Type 2: every share is bounded by effective_from and effective_to, and history is never overwritten.
The pool table is the smallest and most important object in the design. It exists to make the denominator durable. At month-close the job computes pool_usd as collected revenue times the payout share (~70%), counts qualified non-fraud streams, divides, and writes per_stream_rate_usd — then sets frozen_at and never touches the row again. The attribution table is the output: one defensible line per (play × holder), produced in a single bulk write.
One subtlety worth saying out loud: the attribution row stores share_pct as a snapshot, not a foreign key back into the bridge. The bridge is the source of truth at compute time, but freezing the resolved share onto the payout row means a future bridge correction — even a legitimate one — cannot silently restate a payout that has already been remitted and reported to a tax authority. The line item is a photograph, not a live view.
Every correctness property of this system reduces to one discipline: the per-stream rate is computed once, from frozen inputs, after the period is sealed — and never again. Get the timing of that single division right and the rest is bookkeeping.
The pool model has a chicken-and-egg shape that breaks naive designs. You cannot pay a stream when it happens, because its rate is pool ÷ total_qualified_plays, and the total is unknown until the last play of the period — including offline plays uploaded days late — has arrived. A junior model computes a "running rate" daily and pays against it; by month-end that rate has drifted, early payouts are wrong, and Finance is reconciling against a number that was never real. The senior move is to refuse to compute the rate until the denominator is final, and to make the period's transition through close an explicit, observable lifecycle.
OPEN — plays land freely against the month-of-event. CUTOFF — a grace window closes (say, period_end + 7 days) to admit late offline uploads that still belong to the period. RECONCILING — fraud flags are finalized, revenue is confirmed, the qualified count is locked. FROZEN — the single division runs, per_stream_rate_usd and frozen_at are written, and the row becomes immutable. PAID — attribution is generated and remitted. A late chargeback or a discovered fraud cluster after FROZEN is never an edit to this period; it is a correcting entry in a subsequent period's pool, exactly as a double-entry ledger appends a reversal rather than rewriting history.
The division itself is trivial; the discipline is in everything that must be true before it runs. Below is the atomic close — note it reads only frozen, period-bounded inputs and writes the rate exactly once, guarded so it can never double-run.
Once that row is frozen, the per-stream rate is a constant the rest of the pipeline multiplies against. Attribution becomes a join: each qualified play, times its as-of-play shares, times the frozen rate. The hard part is over before the first dollar is split.
Three small programs carry the hot path: the ingest that normalizes a play and stamps the boundary, the consumer that fans plays out without blocking, and the fraud pass that flags without deleting. As always, the judgment is in what each refuses to do.
The single most consequential rule in ingestion is invisible: trust the play's own clock, not the wall clock at the server. Offline plays — a subway commute cached on the phone — arrive in a batch days later but belong to the month they actually happened. If ingest stamped played_at with arrival time, those streams would attribute to the wrong pool and quietly corrupt two months' payouts at once. The client's timestamp is authoritative; the server only validates it is sane.
The play_id is minted on the client, which makes the stream idempotent: a phone that retries an upload after a flaky connection produces the same play_id, and the lake's de-duplication on that key collapses the retry. At-least-once delivery plus a client key equals effectively-once payouts — the only acceptable guarantee when each row is money.
The comment at the bottom is the carve-out an interviewer is listening for. The temptation, having a play in hand, is to look up "the rate" and write a payout immediately. There is no rate yet. Resisting that — keeping the hot path entirely free of money — is what keeps the two tempos from contaminating each other.
The monthly close is the system's heartbeat. It freezes a pool, resolves every qualified play against the rights bridge as it stood at play time, multiplies by the frozen rate, and writes a few billion immutable line items in one bulk pass. Run twice, it produces the identical result.
The close is deliberately a batch, not a stream, because it requires a property streams cannot give: a sealed input. Only after CUTOFF and RECONCILING is the qualified count final and the revenue confirmed; only then can the rate be honest. The whole job is structured as a pure function of frozen inputs so that auditors, finance, and a future re-run all land on the same dollar.
The attribution generation itself is the join the whole schema was built to make cheap. It reads the frozen rate, fans each eligible play across its rights holders using the bridge bounded by play time, and computes the per-line payout. The half-open interval predicate is what makes a play at the exact renegotiation instant resolve to one deal, never two.
That last rule is the senior flourish. Because the pool, the rate, and the rights bridge are all durable facts rather than logic baked into a pipeline, switching from pro-rata (a stream earns rate × share) to user-centric payout (a subscriber's fee splits only among artists they played) is a different aggregation over identical inputs — no migration, no restatement of closed months. The schema deliberately does not know which distribution algorithm it serves, which is exactly what lets the business change its mind without rewriting history.
The frozen facts are where the system explains itself to three different audiences. Three queries an interviewer loves, because each carries a named pattern on its back: an as-of-time temporal join, conditional aggregation, and window-ranked cohort share.
This is the defensible line item: for one track in one period, who earned what, using the splits that were in force when each play happened. The pattern is the as-of join — a range predicate against the bridge's half-open validity interval. It is the answer to "pay me as of the deal that was active when my song was streamed," which is the only answer that survives a courtroom.
Product's question: is the recommender any good? Skip rate by context_type answers it, and the pattern is conditional aggregation — counting a sub-population with FILTER rather than a second pass. A high skip rate on algorithmic is a recommender-quality alarm; the same rate on radio is expected lean-back behavior. The thirty-second flag does double duty: a non-qualified play is a skip.
The pool model's central equity question: how concentrated are payouts? This ranks holders by earnings within a period and uses a window to compute each one's running share of the total — the gateway to "the top 1% of artists take N% of the pool," which is the headline pro-rata's critics aim at and the chart UCPS exists to change. The pattern is the windowed cumulative share, a cohort-concentration staple.
A royalty system is judged on a few days a month, when a period closes and money moves. The dashboard watches two things at once: the always-on play firehose, and the once-a-month close that must reconcile to the penny before a single payout is remitted.
Read the amber tiles together and the dashboard narrates the close from §06. Revenue firmed up, nudging the rate; the last twelve countries are still admitting late offline plays before they freeze; and the UCPS simulation quietly shows the long tail would gain twelve percent if the policy switched — the schema's whole point, surfaced as a number a product executive can act on. Recon drift flat near zero is the load-bearing tile: nothing gets paid until that reads green.
Strip away the music and the question was probing five judgments, each of which generalizes far past streaming royalties: