One human sees an ad on a phone, opens the site on a laptop, and buys on a tablet a week later. A platform must then decide which touch earned the sale, honestly, across devices, while Apple has cut the signal that used to make it easy. A complete walkthrough: the identity graph, the SCD2 device bridge, match-at-read attribution, parallel windows, the SKAN carve-out, holdout lift, and the dashboard.
This prompt looks like a join problem and is in fact an identity problem wearing a join's clothes: the three events you must connect were never stamped with the same key, and platform policy and privacy law now forbid you from minting one.
"Design a model for an ad platform where a user sees an ad on phone, opens the advertiser's site on laptop later, and converts on tablet a week after. Support 1/7/28-day click and view-through windows, ATT/SKAdNetwork iOS constraints, and incrementality lift. How do you scope it?"
The trap is to assume the three events share a user id. They do not. A phone impression, a laptop visit, and a tablet purchase arrive as three rows with three different device tokens and, on iOS after Apple's App Tracking Transparency, quite possibly no user identifier at all. The only thing that can connect them is an identity graph: a model that says, with some confidence, that these three devices belong to one person. And confidence is the word that makes this a senior-level question. A login is certainty; a shared household IP is a guess; behavioral overlap is a softer guess still. A system that pretends a 0.88 probabilistic link is the same as a 1.0 deterministic one is lying to the advertiser about why they should believe the number.
A weak answer hard-codes attribution into the pipeline — last-click, one window — and silently drops everything iOS won't let it see. A strong answer notices two things: that the identity graph must be versioned, because devices come and go, and that attribution must be computed at read time, because the advertiser will ask for last-click and data-driven and three windows over the same data. So before any tables, the frame for the session:
Scope is the first scored dimension, so name it. In scope: the touch/conversion facts, the identity graph and its versioned device bridge, match-at-read attribution with parallel click and view windows, the SKAN aggregate path, and incrementality via holdouts. Out of scope, said explicitly: the identity resolution algorithm itself (the probabilistic matching that decides which devices link and at what confidence — treated as an upstream producer of bridge_identity_devices), ad serving and the auction, creative storage, and billing rails. The caveat: the model must not preclude a future deterministic signal (a publisher login, a clean-room match) upgrading a link's confidence — so confidence is a column on the bridge, never a hard-coded constant.
Then the envelope math, volunteered. Cross-device-ads numbers at platform scale:
| Quantity | Estimate | Consequence |
|---|---|---|
| Ad events / day | ~10,000,000,000 | Impressions + clicks across all surfaces — the touch firehose |
| Devices per identity | 2.6 avg | The bridge fans every conversion out to several touchpoints |
| iOS opt-out share | ~75% of iOS | The row that forces the aggregate carve-out, not a filter |
| Conversion-to-touch lag | up to 28 days | Windows span 1/7/28 days × click vs view — read-time math |
| Attribution models live | last-click · MTA · DDA | All run in parallel; advertiser picks at query time |
| Link confidence range | 0.88 – 1.00 | Probabilistic to deterministic — weights, not booleans |
| Holdout share | ~1% of users | Sized for statistical power on lift, not derived from campaigns |
Read the table and the architecture is half-decided. The iOS opt-out row dictates that a privacy wall is a structural fact, not a query predicate — three-quarters of iOS will never have a user-level identity to join. The lag and the parallel-models rows dictate match-at-read: with windows up to 28 days and several models live at once, attribution cannot be frozen at ingest. And the confidence range dictates that the bridge carries a weight, because the difference between a login and a household IP is the difference between a fact and an inference. The rest of this article follows the identity.
One graph, two privacy paths. Opted-in touches and conversions fan into a single identity through the versioned bridge; iOS opt-outs and SKAN postbacks travel a walled aggregate lane that never touches a person.
Three properties of this picture do most of the interview's work. First, the identity is resolved as-of the event's timestamp through the SCD2 bridge — the phone impression links to whatever identity owned that device on April 25, not whatever owns it today — so a re-sold phone never mis-credits a stranger. Second, attribution is a plane unto itself: the conversion is paired with its eligible touchpoints and a run writes credit as rows, so windows and models multiply without ever rebuilding the facts. Third, the wall is a structural line, not a predicate — SKAN postbacks live in a fact that has no identity column to join on, so an aggregate-only measurement cannot accidentally be de-anonymized by a careless query.
Carry the confidence; never collapse it. A deterministic link (a login) and a probabilistic one (a household IP) are different kinds of truth, and the bridge stores both as a number between 0 and 1 with the method that produced it. Downstream, last-click can ignore the weight, but multi-touch and data-driven models read it — a tablet linked at 0.88 contributes less certain credit than a phone linked at 1.0. The system never launders a guess into a fact; the moment a link cannot be made at all — an iOS opt-out with a null identity — the event is routed to the aggregate path, counted, and never guessed into a person.
The schema falls out of the identity question. A graph of people; an SCD2 bridge that links devices to people with confidence; immutable touch and conversion facts; a derived, append-only attribution fact; and a SKAN fact behind a wall, with no key that could reach a person.
The heart of the model is the bridge. One row per (identity, device) link, each carrying a link_confidence (1.0 deterministic, below that probabilistic), the link_method that produced it, and the SCD2 validity window — because devices come and go, and an attribution must read the graph as it was, not as it is. The identity itself is a thin row: a person token and the strongest resolution method that vouches for them.
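The as-of read described above can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: `device_id`, `effective_from`, the `resolve_as_of` helper, the sample rows, and the year 2024 are all assumptions made for the example; only `link_confidence`, `link_method`, and `effective_to` are named in the text.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class BridgeRow:
    identity_id: str
    device_id: str                    # assumed column name
    link_confidence: float            # 1.0 deterministic, below that probabilistic
    link_method: str                  # illustrative values: "login", "household_ip"
    effective_from: datetime          # assumed column name
    effective_to: Optional[datetime]  # None means the link is still open


def resolve_as_of(bridge: List[BridgeRow], device_id: str,
                  event_ts: datetime) -> Optional[BridgeRow]:
    """Return the link that was valid for device_id at event_ts, or None."""
    for row in bridge:
        if row.device_id != device_id:
            continue
        still_valid = row.effective_to is None or event_ts < row.effective_to
        if row.effective_from <= event_ts and still_valid:
            return row
    return None


# Hypothetical history: the phone changes owner on May 1.
bridge = [
    BridgeRow("id_A", "phone_1", 1.0, "login",
              datetime(2024, 1, 1), datetime(2024, 5, 1)),
    BridgeRow("id_B", "phone_1", 0.88, "household_ip",
              datetime(2024, 5, 1), None),
]

april_link = resolve_as_of(bridge, "phone_1", datetime(2024, 4, 25))
june_link = resolve_as_of(bridge, "phone_1", datetime(2024, 6, 1))
```

An impression on April 25 resolves to the earlier owner and one in June to the later owner, so a re-run over April never credits the person who holds the phone today.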
A single touch fact carries both impressions and clicks, event-typed, because attribution treats them as the same kind of thing: eligible touchpoints differing only in whether they open a click window or a view window. The decisive column is identity_id, which is nullable: on iOS opt-out it is null by platform policy, and that null is the schema's way of saying "this touch can only ever be counted in aggregate." The conversion is the advertiser's reported outcome, tied to an identity when one is known.
Attribution is the match grain — one row per (conversion × eligible touchpoint × window × model × run), partitioned by attribution_run_id. The same conversion appears under click_7d / last-click with full credit and under view_28d / MTA with a fraction, all in one run. SKAN is its own fact with a deliberately impoverished schema: a campaign, a coarse conversion-value bucket, a postback count — and crucially no identity column and no event id, so the privacy wall is enforced by the absence of a join key, not by a query author's discipline.
The remaining dimensions are conformed and SCD2 where history matters. dim_campaigns and dim_creatives version budget, objective, and placement. dim_attribution_models is the keystone of re-attribution: the model is a versioned row, so "last-click vs MTA vs DDA" is a join, never an ETL fork. And dim_user_holdouts is the one dimension that is deliberately not derived: holdout assignment lives in its own SCD2 table, keyed per advertiser × period, so a campaign re-org cannot silently dissolve the control group an incrementality study depends on.
The correctness of the whole model lives in three rules. The conservation rule: per conversion, per window, per model, credit sums to one. The as-of rule: identity is resolved at the event's timestamp, through the versioned bridge. The wall rule: the aggregate path has no key to a person, by construction.
Cross-device attribution divides one sale's value among touches on different devices, so the same dangers as any ledger apply — double-credit, lost fractions — plus one unique to identity: crediting the wrong person because a device changed hands. The first guard is arithmetic and per-conversion: within one window under one model in one run, fractional_credit sums to exactly 1.0 (or 0.0 when no touch falls in window — unattributed, a valid recorded answer). This is RULE Nº 1, made executable, and in multi-touch models the credits are shaped by each link's confidence before they are normalized back to one.
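One way to make the conservation rule executable is sketched below. The weighting scheme is an assumption for illustration: each touch's base weight is multiplied by its link confidence and the results are normalized back to one. The text says only that credits are "shaped by" confidence, so the exact formula and the helper names here are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def confidence_weighted_credits(
        touches: List[Tuple[str, float, float]]) -> Dict[str, float]:
    """touches: (touch_id, base_weight, link_confidence) per eligible touch.

    Assumed scheme: raw weight = base_weight * link_confidence,
    then normalize so the credits sum to 1.0.
    """
    raw = {touch_id: base * conf for touch_id, base, conf in touches}
    total = sum(raw.values())
    if total == 0:
        return {}  # no eligible touch: the conversion is recorded as unattributed
    return {touch_id: weight / total for touch_id, weight in raw.items()}


def conserves_credit(credits: Dict[str, float]) -> bool:
    """True when credit sums to 1.0, or to 0.0 for an unattributed conversion."""
    total = sum(credits.values())
    return math.isclose(total, 1.0, rel_tol=1e-9) or total == 0.0


# Equal base weights; the tablet link is probabilistic at 0.88.
credits = confidence_weighted_credits([
    ("phone_impression", 1.0, 1.0),
    ("laptop_visit", 1.0, 1.0),
    ("tablet_click", 1.0, 0.88),
])
unattributed = confidence_weighted_credits([])
```

A check like `conserves_credit` would run per conversion, per window, per model, per run; any row group that fails it is a defect in the run, not a rounding nuisance.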
Two branches leave this spine. A touch whose identity cannot be resolved skips straight to the aggregate path — SKAN / AGGREGATE, never reaching the eligible set. And from CONVERSION, a re-run re-enters ATTRIBUTED under a new attribution_run_id — the only state allowed to repeat, because re-attributing when a model upgrades is the entire reason for match-at-read. Every other transition is immutable.
The third rule is the wall, and it is the one most candidates state as a WHERE clause and most senior engineers state as a schema. Because fct_skan_postbacks has no identity and no event id, there is no query — careless, malicious, or accidental — that can join a SKAN postback back to a person. The aggregate path can only ever be grouped by campaign and day. This is the difference between a privacy policy and a privacy guarantee: a policy is a promise a query might break; a guarantee is a key that does not exist. The opted-in graph and the aggregate lane share a warehouse and share nothing else.
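The "key that does not exist" can be demonstrated with an in-memory SQLite database. This is a sketch under stated assumptions: `fct_skan_postbacks` is the table named in the text, but `fct_touches`, `postback_date`, and the other column names are hypothetical stand-ins chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fct_touches (            -- assumed name for the user-level touch fact
    touch_id    TEXT PRIMARY KEY,
    identity_id TEXT,                 -- nullable: NULL for iOS opt-outs
    campaign_id TEXT NOT NULL,
    event_type  TEXT NOT NULL
);
CREATE TABLE fct_skan_postbacks (     -- no identity column, no event id
    campaign_id             TEXT NOT NULL,
    postback_date           TEXT NOT NULL,
    conversion_value_bucket INTEGER NOT NULL,
    postback_count          INTEGER NOT NULL
);
""")

# Column names of the aggregate fact, read from the catalog.
skan_columns = [row[1] for row in
                conn.execute("PRAGMA table_info(fct_skan_postbacks)")]


def try_join_to_person(connection: sqlite3.Connection) -> str:
    """Attempt the forbidden join; the schema, not the author, rejects it."""
    try:
        connection.execute("""
            SELECT t.identity_id
            FROM fct_skan_postbacks AS s
            JOIN fct_touches AS t ON s.identity_id = t.identity_id
        """)
        return "joined"
    except sqlite3.OperationalError as exc:
        return f"rejected: {exc}"


join_result = try_join_to_person(conn)
```

The query fails at prepare time because the column is absent, which is the point: there is nothing for a reviewer to catch, because the statement cannot be compiled.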
Three programs carry the write path: the touch router that decides graph-or-aggregate at the door, the bridge maintainer that versions a link without ever losing history, and the holdout guard that keeps the control group sacred. Each is small; the judgment is in what they refuse to do.
Every incoming touch must be sorted onto exactly one path. An opted-in event resolves its device to an identity through the bridge and lands in the user-level fact; an iOS opt-out lands with a null identity, destined only for aggregate rollups. The router refuses the one thing that would breach the wall: it never invents an identity for an opt-out, and it never lets an unresolved opted-in device silently borrow a stranger's identity.
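A minimal sketch of that routing decision follows. The `Touch` shape, the `route_touch` function, and the path labels `"user_level"` and `"aggregate"` are hypothetical names for illustration; the resolver is passed in as a plain callable so that identity resolution stays an upstream concern.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Touch:
    touch_id: str
    device_id: str
    opted_in: bool
    identity_id: Optional[str] = None


def route_touch(touch: Touch,
                resolve_device: Callable[[str], Optional[str]]
                ) -> Tuple[str, Touch]:
    """Return (path, touch) where path is 'user_level' or 'aggregate'."""
    if not touch.opted_in:
        # Opt-out: never consult the graph, never invent an identity.
        return "aggregate", replace(touch, identity_id=None)
    identity_id = resolve_device(touch.device_id)
    if identity_id is None:
        # Opted in but unresolved: no borrowed identity, aggregate only.
        return "aggregate", replace(touch, identity_id=None)
    return "user_level", replace(touch, identity_id=identity_id)


# Hypothetical resolver backed by a dict; a real one would read the bridge as-of.
known_devices = {"phone_1": "id_A"}

opted_in = route_touch(Touch("t1", "phone_1", True), known_devices.get)
opted_out = route_touch(Touch("t2", "phone_1", False), known_devices.get)
unresolved = route_touch(Touch("t3", "tablet_9", True), known_devices.get)
```

Note that the opt-out case returns before the resolver is called, even though the device is known to the graph; the refusal is in the control flow rather than in a later filter.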
When the upstream resolver reports that a device's link has changed — a phone sold, a confidence upgraded by a fresh login — the maintainer expires the old row and opens a new one. It refuses to mutate history: the previous link is closed with an effective_to, not deleted, so an attribution run over last month still reads last month's graph. This is the SCD2 discipline that makes "as-of" possible.
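The expire-and-open step might look like the sketch below. It assumes, for simplicity, that a device has at most one open link at a time; the `Link` shape, `effective_from`, and `apply_link_change` are illustrative names, and the dates are invented.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Link:
    identity_id: str
    device_id: str
    link_confidence: float
    link_method: str
    effective_from: datetime
    effective_to: Optional[datetime] = None  # None means the link is open


def apply_link_change(bridge: List[Link], new_link: Link) -> List[Link]:
    """Close the device's open link at the change time and append the new one.

    Returns a new list. No existing row is deleted, and only the
    effective_to of the previously open row is set.
    """
    updated = []
    for row in bridge:
        if row.device_id == new_link.device_id and row.effective_to is None:
            row = replace(row, effective_to=new_link.effective_from)
        updated.append(row)
    updated.append(new_link)
    return updated


before = [Link("id_A", "phone_1", 1.0, "login", datetime(2024, 1, 1))]
sold_on = datetime(2024, 5, 1)
after = apply_link_change(
    before, Link("id_B", "phone_1", 0.88, "household_ip", sold_on))
```

The input list is left untouched and the old row survives with its identity and confidence intact, which is what lets a run over last month read last month's graph.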
Incrementality is only real if the holdout never sees an ad. The guard sits in the serving request path and refuses to let a held-out identity be targeted — and, just as importantly, it ensures held-out users who convert anyway are counted as conversions without ad events, which is exactly what makes the lift math valid. It refuses to derive holdout membership from campaign state; membership is read from the dedicated dimension, so a re-org cannot leak the control group into treatment.
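The guard's membership check can be sketched as a lookup against the holdout table. `dim_user_holdouts` is the name used in this article, but the row shape below (`advertiser_id`, `effective_from`, `effective_to`) and the helper names are assumptions for the example.

```python
from datetime import date
from typing import Iterable, NamedTuple, Optional


class HoldoutRow(NamedTuple):
    identity_id: str
    advertiser_id: str
    effective_from: date
    effective_to: Optional[date]  # None means the assignment is still active


def is_held_out(rows: Iterable[HoldoutRow], identity_id: str,
                advertiser_id: str, on_day: date) -> bool:
    """Membership is read from the holdout rows, never inferred from campaigns."""
    for row in rows:
        if row.identity_id != identity_id or row.advertiser_id != advertiser_id:
            continue
        active = row.effective_to is None or on_day < row.effective_to
        if row.effective_from <= on_day and active:
            return True
    return False


def may_serve_ad(rows: Iterable[HoldoutRow], identity_id: str,
                 advertiser_id: str, on_day: date) -> bool:
    """The serving path asks this before targeting an identity."""
    return not is_held_out(rows, identity_id, advertiser_id, on_day)


# Hypothetical assignment: id_H is held out for adv_1 during the second quarter.
dim_user_holdouts = [
    HoldoutRow("id_H", "adv_1", date(2024, 4, 1), date(2024, 7, 1)),
]
```

The holdout is scoped per advertiser and period, so the same identity can be in the control group for one advertiser and fully targetable for another.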
One carve-out, always stated: none of these programs computes attribution. The router sorts, the maintainer versions, the guard protects the experiment — but credit is never assigned at ingest. That refusal is the architecture: because nothing here freezes a touch into an attributed outcome, the measurement layer is free to re-decide every conversion under any model, over any window, as many times as the advertiser asks.
Two derived layers carry the slow loop. The attribution run materializes every window and model in parallel as an append-only batch. The lift calculation compares treatment against holdout — the one number that survives the death of cross-device signal, because it needs no identity match at all.
The attribution run's defining trait is that it produces many answers at once. A single run sweeps the live windows — click_1d, click_7d, view_28d — and the live models, writing a row for every combination that attributes. The same conversion legitimately appears a dozen times: once per window-model pair it qualifies for. This is not duplication; it is the product surface. The advertiser whose funnel closes in a day reads click_1d; the one selling a considered purchase reads view_28d; the platform reports all of them, and the latest run wins on read while every prior run remains for audit.
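A toy version of the sweep shows how one conversion fans out into several rows. Everything concrete here is an assumption for illustration: the two model functions (a linear split stands in for MTA), the rule that view windows consider only impressions, the row field names, and the sample timestamps.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import product

# window name -> (eligible event type, lookback in days)
WINDOWS = {
    "click_1d": ("click", 1),
    "click_7d": ("click", 7),
    "view_28d": ("impression", 28),
}


def last_click(eligible):
    """All credit to the latest eligible touch in the window."""
    return {eligible[-1]["touch_id"]: 1.0}


def linear_mta(eligible):
    """Equal credit to every eligible touch (a simple stand-in for MTA)."""
    share = 1.0 / len(eligible)
    return {t["touch_id"]: share for t in eligible}


MODELS = {"last_click": last_click, "linear_mta": linear_mta}


def attribution_run(run_id, conversion, touches):
    """Emit one row per (conversion, touch, window, model) that attributes."""
    rows = []
    for (window, (event_type, days)), (model, assign) in product(
            WINDOWS.items(), MODELS.items()):
        window_start = conversion["ts"] - timedelta(days=days)
        eligible = sorted(
            (t for t in touches
             if t["event_type"] == event_type
             and window_start <= t["ts"] <= conversion["ts"]),
            key=lambda t: t["ts"],
        )
        if not eligible:
            continue  # nothing in window: no rows for this combination
        for touch_id, credit in assign(eligible).items():
            rows.append({
                "attribution_run_id": run_id,
                "conversion_id": conversion["conversion_id"],
                "touch_id": touch_id,
                "window_type": window,
                "model": model,
                "fractional_credit": credit,
            })
    return rows


conversion = {"conversion_id": "c1", "ts": datetime(2024, 5, 2, 12)}
touches = [
    {"touch_id": "imp_old", "event_type": "impression", "ts": datetime(2024, 4, 20, 9)},
    {"touch_id": "imp_phone", "event_type": "impression", "ts": datetime(2024, 4, 25, 9)},
    {"touch_id": "click_laptop", "event_type": "click", "ts": datetime(2024, 4, 28, 9)},
]
rows = attribution_run("run_1", conversion, touches)

credit_sums = defaultdict(float)
for r in rows:
    credit_sums[(r["window_type"], r["model"])] += r["fractional_credit"]
```

In this sample the click falls outside the one-day window, so `click_1d` produces no rows, while the other two windows each attribute under both models and each combination's credit sums to one.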
The lift calculation is the slow loop's other half, and it is the strategic counterweight to the entire identity edifice. Attribution asks "which touch caused this sale?" — a question that gets harder every year as signal erodes. Incrementality asks a different, more robust question: "how many more sales happened because of the ads?" — answered by comparing the conversion rate of treated users against a holdout that saw nothing. It needs no cross-device match, no window, no model; it survives ATT untouched, because a holdout user who never saw an ad is a clean counterfactual regardless of how many devices they own.
The facts are where the system explains itself — once identity is resolved. Three queries an interviewer loves, because each answers a different stakeholder and each carries a classic pattern on its back.
The advertiser's first question: how many conversions fall inside each window, so they can pick the one that matches their funnel. Because every window's rows coexist in the attribution fact, this is a GROUP BY window_type — the same conversions, sliced by horizon. The pattern is dimensional comparison over a derived fact.
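The window comparison can be sketched against an in-memory SQLite table. The table name `fct_attribution`, the `model` column, and the sample rows are hypothetical; the sketch also assumes run ids sort so that `MAX` picks the latest run, which a real system would replace with a run timestamp or sequence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fct_attribution (
    attribution_run_id TEXT,
    conversion_id      TEXT,
    touch_id           TEXT,
    window_type        TEXT,
    model              TEXT,
    fractional_credit  REAL
)
""")
conn.executemany(
    "INSERT INTO fct_attribution VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("run_2", "c1", "t1", "click_1d", "last_click", 1.0),
        ("run_2", "c1", "t1", "click_7d", "last_click", 1.0),
        ("run_2", "c2", "t2", "click_7d", "last_click", 1.0),
        ("run_2", "c1", "t0", "view_28d", "last_click", 1.0),
        ("run_2", "c2", "t3", "view_28d", "last_click", 1.0),
        ("run_2", "c3", "t4", "view_28d", "last_click", 1.0),
        ("run_1", "c1", "t1", "click_7d", "last_click", 1.0),  # older run, kept for audit
    ],
)

# Latest run wins on read; earlier runs stay in the table untouched.
WINDOW_COMPARISON = """
SELECT window_type,
       COUNT(DISTINCT conversion_id) AS conversions,
       SUM(fractional_credit)        AS credit
FROM fct_attribution
WHERE attribution_run_id = (SELECT MAX(attribution_run_id) FROM fct_attribution)
  AND model = 'last_click'
GROUP BY window_type
ORDER BY window_type
"""
by_window = conn.execute(WINDOW_COMPARISON).fetchall()
```

Because credit conserves per conversion, the credit column equals the conversion count for every window, which doubles as a cheap sanity check on the run.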
The measurement team's health query, and the one that watches the graph erode. What share of conversions threaded back to at least one ad event via identity? The pattern is a left anti-join expressed as a coverage ratio: the anti-join finds the conversions with no matching touch, and coverage is the remainder, the conversions the graph could connect divided by total conversions. Its decline is the leading indicator of signal loss.
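A sketch of the coverage query, again on in-memory SQLite. Table and column names other than `identity_id` are assumptions, the rows are invented, and the sketch ignores attribution windows for brevity. It counts the matched side with `EXISTS`; the anti-join form would count the complement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fct_conversions (conversion_id TEXT PRIMARY KEY, identity_id TEXT);
CREATE TABLE fct_touches     (touch_id TEXT PRIMARY KEY, identity_id TEXT);

INSERT INTO fct_conversions VALUES
    ('c1', 'id_A'), ('c2', 'id_B'), ('c3', NULL), ('c4', 'id_C');
INSERT INTO fct_touches VALUES
    ('t1', 'id_A'), ('t2', 'id_A'), ('t3', 'id_B'), ('t4', NULL);
""")

COVERAGE = """
SELECT COUNT(*) AS total_conversions,
       SUM(CASE WHEN EXISTS (
               SELECT 1 FROM fct_touches AS t
               WHERE t.identity_id = c.identity_id
           ) THEN 1 ELSE 0 END) AS matched_conversions
FROM fct_conversions AS c
"""
total_conversions, matched_conversions = conn.execute(COVERAGE).fetchone()
coverage = matched_conversions / total_conversions
```

A conversion with a NULL identity can never match, since NULL compares equal to nothing, so opt-out conversions lower coverage automatically instead of needing a special case.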
The growth team's defining query, and the one that needs no identity match to be valid. Compute the conversion rate of treated users beside the holdout's, and the difference is lift. The pattern is a partitioned-cohort comparison — two arms, one metric — whose significance is a Welch's t-test on the two proportions.
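The two-arm comparison reduces to a few lines of arithmetic. As a simplification, this sketch swaps in a pooled two-proportion z-test for the Welch's t-test named above; with samples this large the two typically lead to the same conclusion. The function name and the counts are invented for illustration and are not results from any real campaign.

```python
import math


def lift_with_z_test(treated_conversions: int, treated_users: int,
                     holdout_conversions: int, holdout_users: int) -> dict:
    """Absolute lift in percentage points plus a two-sided z-test p-value."""
    rate_treated = treated_conversions / treated_users
    rate_holdout = holdout_conversions / holdout_users

    # Pooled rate under the null hypothesis that ads changed nothing.
    pooled = ((treated_conversions + holdout_conversions)
              / (treated_users + holdout_users))
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / treated_users + 1 / holdout_users))
    z = (rate_treated - rate_holdout) / std_err

    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "lift_points": (rate_treated - rate_holdout) * 100,
        "z": z,
        "p_value": p_value,
    }


# Invented counts: 12.3% treated vs 8.2% holdout, with a ~1% holdout arm.
result = lift_with_z_test(
    treated_conversions=121_770, treated_users=990_000,
    holdout_conversions=820, holdout_users=10_000,
)
```

Nothing in the calculation touches a device or an identity link: the inputs are four counts per arm, which is why the number survives when cross-device matching degrades.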
A senior design ends with observability, because every confidence weight and privacy wall above is invisible without it. The dashboard watches three things — coverage of the graph, the shape of attribution, and lift — and one tile watches the wall.
Read the amber tiles together and the dashboard narrates the post-ATT world from the operator's chair: the graph is going quieter (match rate falling, opt-out at 75%), conversions arrive later (lag past five days), so the platform increasingly reports through the walled aggregate and trusts incrementality, which holds at +4.1 points with real significance. The single most important tile is the boring one: credit-sum errors at zero. Coverage is a battle you fight every quarter; credit conservation is a law you never break — and the wall is a guarantee you do not have to defend, because it has no door.
Strip the devices away and the question was testing five judgments, each of which generalizes far beyond ad attribution: