Interview Studio · Design · Data Modeling

Data modeling — twelve scenarios across the industries that actually ask.

Modeling rounds at Uber, Google, Meta, Netflix, Amazon, DoorDash, Airbnb, Stripe and Spotify all test the same five-move sequence — but the domain context changes everything you say. This notebook walks the framework once, then drills the twelve most-asked scenarios end to end. Use it to swap domains in 60 seconds when the prompt arrives.

The universal framework — five moves, on every prompt
Company → most-likely scenario matrix
Twelve scenarios, full walk-through
Cross-cutting concerns — SCDs, LAD, factless facts, streaming, GDPR, FX, skew
Common traps + the 90-second articulation script

§ 01 — The Universal Framework

Five moves, every prompt — on every domain.

The framework doesn't care whether you're modeling Uber trips, Google ads, or Spotify streams. Domain knowledge changes what you say in each move; the moves themselves stay identical. Drill the moves until they're reflex, then layer domain context on top.

Clarify the use cases. Who reads this? What 3–5 questions must it answer? What's the SLA — minutes, hours, days? Don't model anything until you have 3–5 named consumers.
Pin the grain. One sentence per fact table: "One row per X per Y per Z." If you can't say it cleanly, the model isn't ready.
Conceptual → logical → physical. Entities first. Then attributes and types. Then partitioning, clustering, SCD strategy.
SCDs and time. Type 1 (overwrite), Type 2 (versioned rows), Type 6 (hybrid). Justify each dim.
Stress test. Volunteer two failure modes — late data, GDPR, schema drift, hot partitions, time zones — without being asked.

The phrase that lifts you a level, in any domain: "…and the unlock here for the [recommendations / risk / finance / dispatch] team is X." Naming the downstream consumer turns the answer from "tables" into "data product."

The vocabulary that signals seniority — across all domains

Term	One-sentence definition
Grain	The meaning of one row, stated as a sentence.
SCD Type 1 / 2 / 6	Overwrite vs versioned-rows vs hybrid (current + history columns).
Conformed dimension	Same dim used across multiple fact tables (`dim_date`, `dim_geography`).
Bridge / factless fact	M:N resolver, or an event-occurrence table with no measure.
Late-arriving dim	Fact references dim row not yet loaded — placeholder SK `-1`, backfill.
Slowly-changing fact	Restating financials. Append corrections; never `UPDATE` in place.
Idempotent load	Re-running yesterday produces the same result. `MERGE` on natural key.
Surrogate key	Synthetic integer PK. Required for SCD2.
Star vs snowflake	Star — denormalized dims. Snowflake — normalized. Star is default for analytics.
Wide table	One denormalized fact pre-joined to dims. Read speed vs storage + drift.
Activity schema	Single-table model where every event is a row. Useful for product analytics.
Data vault	3-table pattern (hub/link/satellite) for high-change EDWs. Mention; rarely use.

· · ·

§ 02 — Company → Scenario Matrix

What each company will probably ask you to model.

Every company hires for the same five-move framework, but they probe with prompts shaped like their business. Use this matrix to anticipate.

Company	Most-likely modeling prompts	The signal they want
Uber / Lyft	Trip lifecycle, surge pricing, driver-rider matching, ETA, payouts	Geo (H3 hexes), supply-demand thinking, multi-state lifecycle
DoorDash / Uber Eats	Order pipeline, courier dispatch, restaurant supply, batched delivery	3-sided marketplace, time-sensitive SLAs
Google	Search ads auction, YouTube viewing, Maps trips, Workspace events	Auction internals, quality score, ranker features
Meta / Facebook	Feed engagement, Ads attribution, Reels watch, social graph	Cross-device identity, attribution windows, ATT/SKAdNetwork
Netflix	Viewing history, content catalog, A/B, talent contracts, CTV ads	Studio context, royalty calc, SCD2 fluency
Amazon	Orders, returns, inventory, recommendations, Prime entitlement	Returns reshape revenue; inventory snapshots; multi-warehouse
Spotify / Apple Music	Listening history, playlist generation, royalty payouts, podcasts	Per-stream royalty math, multi-rights-holder splits
Stripe / Square / PayPal	Transactions, ledger, fraud, payouts, FX	Double-entry rigor, multi-currency, restate-don't-update
Airbnb / Vrbo	Bookings, calendar, listings, reviews, host payouts	2-sided marketplace, calendar-as-fact, cancellation policies
Salesforce / HubSpot	Opportunities, accounts, activities, quotas, attribution	SCD2 on opportunities (stage transitions), org hierarchy
LinkedIn	Profile views, connections, feed engagement, recruiter search	Graph traversal, time-decayed relevance
Snowflake / Databricks	Query telemetry, billing/credits, warehouse usage	Usage metering, multi-tenant, charge-back

· · ·

§ 03 — Twelve Scenarios

The twelve most-asked modeling prompts — full walk-throughs.

Uber / Lyft · ride-sharing

Scenario 1 — Trip lifecycle (with multi-driver convoys)

Design a complete data model for the Trip Lifecycle — including the complex convoy / multi-driver scenario where one group request is split across two or more vehicles. Must support Operations (live ETA), Finance (driver payouts), and Risk (fraud).

Why this is hard — the design tension

The problem isn't a single trip — it's the group intent. Five friends going to the airport expect to be charged a single price, but Uber dispatches two vehicles. Operations sees two trips; Finance must charge the group once but pay two drivers fairly; Risk must verify the convoy actually moved together. The robust solution is a hierarchical Star Schema that decouples the physical movement of the car from the financial obligation of the rider.

Section 1 — Dimensional Model Overview

This model uses a Journey → Trip → Event hierarchy to handle everything from a single ride to complex shared convoys.

Fact Tables (the hierarchy)

fct_journeys — The "Parent" grain. One row per group request. Captures group intent (split type, total group cost, status).
fct_trips — The "Vehicle" grain. One row per driver / vehicle per journey. Source of truth for Finance (driver payouts).
fct_trip_events — The "Telemetry" grain. One row per state transition (REQUEST, MATCH, PICKUP, etc.) per trip. Powers Ops (Live ETA) and Risk (GPS replay).
fct_trip_feedback — The "Sentiment" grain. One row per (trip × rider) feedback event. Captures rating, category, free-text comment, sentiment. Powers Product (CSAT, NPS).
brg_journey_riders — The "Financial" bridge. Links riders to journeys and to a specific assigned trip/vehicle — handles cost-splitting (journey-level) and capacity routing (trip-level).

Dimension Tables

dim_users — Riders and drivers, typed via user_role. SCD2 on plan tier and rating.
dim_vehicles — Vehicle specs (model, type), insurance status, owner. SCD2 on insurance status (Risk needs as-of-trip status).
dim_geography — H3 hexagonal indexing (level 8 ≈ 0.7 km²) for spatial analysis. SCD1 (geography is structurally static).
dim_payment_methods — Card / wallet on file. SCD2 — Finance audit needs the as-of-trip card.
dim_fx_rate — SCD2 daily rates; locked at request_timestamp.

Section 2 — The Logical Data Model

Parent Fact: fct_journeys (the group intent)

Column	Type	Description
`journey_key`	PK BIGINT	Surrogate.
`journey_id`	UUID	Natural ID; idempotency anchor.
`requesting_user_id`	FK	Who initiated the group request.
`split_type`	ENUM	{EQUAL_SPLIT, DISTANCE_PRORATA, REQUESTER_PAYS, CUSTOM}.
`total_group_fare_usd`	DECIMAL(10,2)	Single bill for the whole group; sum of `brg_journey_riders.individual_charge` reconciles to this.
`num_vehicles_dispatched`	SMALLINT	1 for normal, ≥2 for convoy.
`num_riders`	SMALLINT	Total riders in the group.
`journey_status`	ENUM	{requested, in_progress, completed, partial, cancelled}.
`request_timestamp`	TIMESTAMP_TZ	Group request time; FX rate locked here.

Vehicle Fact: fct_trips (one row per driver/vehicle per journey)

Column	Type	Description
`trip_key`	PK BIGINT	Surrogate key for the trip.
`trip_id`	UUID	Natural ID from the source system; idempotency anchor.
`journey_id`	FK	Parent journey. Multiple trips can share a journey (convoy).
`trip_fare_share_usd`	DECIMAL(10,2)	This vehicle's share of the journey fare. Drives driver payout, NOT what riders pay.
`driver_id`	FK	The matched driver. (driver,journey) is unique — same driver can't take two trips for one journey.
`vehicle_id`	FK	The vehicle used.
`origin_h3_key`	FK	H3 cell at request time.
`destination_h3_key`	FK	H3 cell at dropoff (NULL on cancellation).
`request_timestamp`	TIMESTAMP_TZ	The beginning of the lifecycle.
`terminal_state`	ENUM	{completed, cancelled_rider, cancelled_driver, cancelled_system, expired, no_show}.
`surge_multiplier`	DECIMAL(4,2)	Locked at request — immutable.
`total_fare_local`	DECIMAL(10,2)	Final fare in source currency.
`fx_rate_to_usd`	DECIMAL(18,8)	Locked at `request_timestamp` — guarantees revenue can never be restated by FX drift.
`total_fare_usd`	DECIMAL(10,2)	Final amount charged to rider(s) — Finance source of truth.
`driver_payout_usd`	DECIMAL(10,2)	Final amount owed to driver after take rate (Finance focus).
`is_flagged_risk`	BOOLEAN	Fraud suspicion flag (Risk focus). Slowly-changing — append correction; never `UPDATE`.
`ingest_ts`	TIMESTAMP_TZ	For idempotent `MERGE`.

Event Fact: fct_trip_events

Column	Type	Description
`event_id`	PK	Unique event identifier (idempotency).
`trip_id`	FK	Link to `fct_trips`.
`state`	STRING	REQUESTED, MATCHED, ARRIVED, STARTED, DROPOFF, CANCELLED.
`event_timestamp`	TIMESTAMP_TZ	Precise time of transition.
`location_h3`	H3_INDEX	Cell where the event occurred.
`actor`	STRING	Who triggered the transition: rider, driver, system.

Sentiment Fact: fct_trip_feedback (per-rider satisfaction)

Column	Type	Description
`feedback_key`	PK BIGINT	Surrogate.
`trip_key`	FK	The trip rated.
`rider_id`	FK	Who left the feedback.
`rating_score`	TINYINT	1–5 stars.
`feedback_category`	STRING	{driver_behavior, vehicle_cleanliness, route, payment, app_issue, other}.
`rider_comment`	TEXT	Free-text; fed to NLP for sentiment.
`sentiment_score`	DECIMAL(4,3)	−1.0 to +1.0; computed nightly via NLP model.
`feedback_timestamp`	TIMESTAMP_TZ	When submitted (often hours after dropoff).

Financial Bridge: brg_journey_riders

Column	Type	Description
`journey_id`	FK	Parent journey — drives the group cost-split.
`rider_id`	FK	The rider on this leg.
`trip_id`	FK	The specific car this rider was assigned to. Critical: in a convoy, riders R_1 + R_2 may ride in T-101 while R_3 rides in T-102.
`individual_charge_usd`	DECIMAL(10,2)	What this rider actually pays. Computed from `fct_journeys.split_type`.
`pickup_sequence`	SMALLINT	Order picked up within their assigned trip (1, 2, 3…).
`leg_distance_km`	DECIMAL(8,3)	This rider's portion of the route; basis for distance-prorata splits.

Section 3 — How the model serves each stakeholder

Operations · live ETA & monitoring

By using fct_trip_events, Operations calculates lead-time metrics directly:

Dispatch Latency = timestamp(MATCHED) − timestamp(REQUESTED)
Pickup ETA accuracy = timestamp(STARTED) − timestamp(MATCHED) compared to the predicted ETA
Spatial hot-spot detection via location_h3: aggregate REQUESTED events per cell per minute against MATCHED events to identify cells where demand exceeds supply in real time.

Finance · driver payouts & revenue

fct_trips is the General Ledger source of truth — every payout in fct_payouts joins back here on trip_key.
surge_multiplier is immutable once the request is made — ensures the rider's quoted price matches the driver's earning potential. Stored on the row, never derived.
terminal_state dictates whether a cancellation fee triggers a partial driver payout (e.g. cancelled_rider after MATCHED triggers a fee; cancelled_system does not).
FX is locked at request_timestamp — guarantees revenue restatement is impossible from rate drift.

Risk · fraud detection

Location anomaly: compare origin_h3 on fct_trips with the first STARTED event's H3 in fct_trip_events — large divergence = potential ride-tampering.
Trip-event GPS replay reveals "teleportation" — impossible speed between consecutive events implies spoofed location.
Account farming: joining fct_trips with a dim_devices table (linked via fct_session) — a single device cycling through multiple rider_ids in a short window flags suspicious activity.
is_flagged_risk is a slowly-changing fact — appended via correction row, never updated, so Finance can replay any payout decision regardless of subsequent risk re-evaluation.

Section 4 — Why the hierarchy: bridges + parent fact

Without the journey/trip split + bridge, you face three bad choices:

Option A: One row per trip, rider IDs in an array column. Finance can't aggregate revenue per rider without exploding the array. Disqualifying.
Option B: One row per rider in fct_trips. Pool with 3 riders becomes 3 rows — fare is triple-counted on every aggregation. Driver and dispatch counts also triple-count. Disqualifying.
Option C: Single fact for everything (rider-grain + driver-grain merged). A 3-rider, 2-driver convoy = 6 rows. Now you can't tell drivers from riders without filtering by role, and convoy-cost reconciliation needs a self-join. Disqualifying.

The hierarchy with bridge is the only design that satisfies all three stakeholders:

One row per group in fct_journeys — group-level KPIs (avg group size, split-type mix) are correct without filters.
One row per driver/vehicle per journey in fct_trips — Operations and Finance per-driver counts are correct on simple aggregation. (driver_id, journey_id) uniqueness honored.
Multiple rows per journey in brg_journey_riders — Finance computes per-rider charge with the journey's split logic; Risk can route a rider to a specific trip via the FK.
Two enforceable invariants:
- SUM(brg_journey_riders.individual_charge_usd) per journey_id = fct_journeys.total_group_fare_usd
- SUM(fct_trips.trip_fare_share_usd) per journey_id = fct_journeys.total_group_fare_usd
Two continuous data-quality checks. Either failing flags a producer bug.

Allocation rules:

EQUAL_SPLIT: individual_charge = total_group_fare / num_riders
DISTANCE_PRORATA: individual_charge = total_group_fare × (leg_distance_km / SUM(leg_distance_km))
REQUESTER_PAYS: individual_charge = total_group_fare for requester, 0 for others

Section 5 — Conceptual data flow (the convoy in plain English)

For Journey J-500 with 5 riders heading to the airport, dispatched as a 2-car convoy:

fct_journeys — the group's perspective. One row, $60 total, EQUAL_SPLIT, 2 vehicles dispatched.
fct_trips — the drivers' perspective. Two rows: Driver Alpha ($30 share), Driver Bravo ($30 share).
brg_journey_riders — the riders' perspective. Three rows: each rider charged $20, two assigned to T-101, one to T-102.

Section 6 — Sample data representation

Table: fct_journeys (the logical group)

journey_id	split_type	total_group_fare_usd	num_vehicles	num_riders	status
J-500	EQUAL_SPLIT	$60.00	2	3	completed

Table: fct_trips (the physical vehicles)

trip_id	journey_id	driver_id	vehicle_id	trip_fare_share_usd
T-101	J-500	D_Alpha	V_99	$30.00
T-102	J-500	D_Bravo	V_88	$30.00

Table: brg_journey_riders (the rider split)

journey_id	rider_id	trip_id (assigned car)	individual_charge_usd
J-500	R_1	T-101	$20.00
J-500	R_2	T-101	$20.00
J-500	R_3	T-102	$20.00

Two reconciliations: SUM(brg.individual_charge) = $20 + $20 + $20 = $60 ✓ | SUM(fct_trips.trip_fare_share) = $30 + $30 = $60 ✓ | both equal fct_journeys.total_group_fare.

Enhanced graphical data model — Trip Lifecycle & Analytical Metrics

Solid arrows = referential FK relationships. Dashed arrows = analytical lineage (which fact tables feed which metric module).

Section 7 — SQL analysis for business units

Sample data (Journey J-500: airport convoy with 3 riders split across 2 vehicles)

INSERT INTO dim_users VALUES
  (1,'R_1','rider',  'Gold',     NULL),
  (2,'R_2','rider',  'Silver',   NULL),
  (3,'R_3','rider',  'Silver',   NULL),
  (4,'D_Alpha','driver','Platinum', NULL),
  (5,'D_Bravo','driver','Gold',     NULL);

-- 1 journey
INSERT INTO fct_journeys
 (journey_key, journey_id, requesting_user_id, split_type,
  total_group_fare_usd, num_vehicles_dispatched, num_riders, journey_status, request_timestamp)
VALUES
  (500,'J-500',1,'EQUAL_SPLIT', 60.00, 2, 3, 'completed', '2025-05-01 08:30:00');

-- 2 trips under J-500 (the convoy)
INSERT INTO fct_trips
 (trip_key, trip_id, journey_id, driver_id, vehicle_id, trip_fare_share_usd,
  surge_multiplier, terminal_state)
VALUES
  (101,'T-101','J-500', 4, 99, 30.00, 1.0, 'completed'),
  (102,'T-102','J-500', 5, 88, 30.00, 1.0, 'completed');

-- Trip events for both vehicles
INSERT INTO fct_trip_events VALUES
  ('e01','T-101','REQUESTED','2025-05-01 08:30:00','h3_a','rider'),
  ('e02','T-101','MATCHED',  '2025-05-01 08:31:00','h3_a','system'),
  ('e03','T-101','PICKUP',   '2025-05-01 08:35:00','h3_a','driver'),
  ('e04','T-101','DROPOFF',  '2025-05-01 09:00:00','h3_z','driver'),
  ('e05','T-102','REQUESTED','2025-05-01 08:30:00','h3_a','rider'),
  ('e06','T-102','MATCHED',  '2025-05-01 08:31:30','h3_a','system'),
  ('e07','T-102','PICKUP',   '2025-05-01 08:35:00','h3_a','driver'),  -- same cell as T-101 (legit convoy)
  ('e08','T-102','DROPOFF',  '2025-05-01 09:01:00','h3_z','driver');

-- Bridge: 3 riders, 2 assigned to T-101, 1 assigned to T-102
INSERT INTO brg_journey_riders VALUES
  ('J-500', 1, 'T-101', 20.00, 1, 12.0),
  ('J-500', 2, 'T-101', 20.00, 2, 12.0),
  ('J-500', 3, 'T-102', 20.00, 1, 12.0);

-- Feedback: post-dropoff ratings + sentiment from NLP
INSERT INTO fct_trip_feedback VALUES
  (9001,101,1,5,'driver_behavior',     'Great driver, smooth ride',  0.78,'2025-05-01 10:30:00'),
  (9002,101,2,4,'route',               'Took a slight detour',       0.20,'2025-05-01 10:42:00'),
  (9003,102,3,2,'vehicle_cleanliness', 'Car smelled bad',           -0.65,'2025-05-01 11:05:00');

A. Operations · live ETA & lifecycle metrics

Q1 — System lag: per-state transition latency using gaps-and-islands

SELECT
  trip_id,
  state,
  event_timestamp,
  LAG(event_timestamp) OVER (PARTITION BY trip_id ORDER BY event_timestamp) AS prev_event_time,
  EXTRACT(EPOCH FROM (
    event_timestamp -
    LAG(event_timestamp) OVER (PARTITION BY trip_id ORDER BY event_timestamp)
  )) AS transition_latency_sec
FROM fct_trip_events
WHERE trip_id = 'T-101'
ORDER BY event_timestamp;

Result for T-101:

state	event_timestamp	prev_event_time	transition_latency_sec
REQUESTED	08:30:00	NULL	NULL
MATCHED	08:31:00	08:30:00	60
PICKUP	08:35:00	08:31:00	240
DROPOFF	09:00:00	08:35:00	1500

Use case: high transition_latency_sec on REQUESTED→MATCHED indicates "hot partitions" or processing delays — Ops can alert when the cell-level p95 exceeds threshold.

Q2 — Dispatch latency (p50 / p95) per H3 cell per hour

WITH latency AS (
  SELECT
    e.trip_id,
    DATE_TRUNC('hour',
      MIN(CASE WHEN state='REQUESTED' THEN event_timestamp END)) AS hour_bucket,
    MAX(CASE WHEN state='REQUESTED' THEN location_h3 END) AS h3_cell,
    EXTRACT(EPOCH FROM (
      MAX(CASE WHEN state='MATCHED'   THEN event_timestamp END)
      - MAX(CASE WHEN state='REQUESTED' THEN event_timestamp END)
    )) AS latency_sec
  FROM fct_trip_events e
  GROUP BY trip_id
  HAVING MAX(CASE WHEN state='MATCHED' THEN 1 END) = 1
)
SELECT hour_bucket, h3_cell, COUNT(*) AS matched_trips,
  PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY latency_sec) AS p50_sec,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_sec) AS p95_sec
FROM latency GROUP BY 1, 2 ORDER BY p95_sec DESC;

Q3 — Cancellation breakdown by actor

SELECT terminal_state, COUNT(*) AS cnt,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct_of_total
FROM fct_trips
GROUP BY terminal_state ORDER BY cnt DESC;

Q4 — Convoy efficiency (vehicles per journey, riders per vehicle)

SELECT
  j.journey_id,
  j.num_riders,
  j.num_vehicles_dispatched,
  ROUND(j.num_riders::DECIMAL / j.num_vehicles_dispatched, 2) AS riders_per_vehicle,
  COUNT(DISTINCT t.trip_id) AS actual_trips,
  COUNT(DISTINCT b.rider_id) AS riders_seen
FROM fct_journeys j
JOIN fct_trips t USING (journey_id)
JOIN brg_journey_riders b USING (journey_id)
GROUP BY j.journey_id, j.num_riders, j.num_vehicles_dispatched;

Result for J-500:

journey_id	num_riders	num_vehicles	riders_per_vehicle	actual_trips	riders_seen
J-500	3	2	1.50	2	3

B. Finance · revenue reconciliation & driver payout

Q5 — Driver payout reconciliation (per-driver share vs riders' charges)

SELECT
  t.driver_id,
  t.trip_id,
  t.trip_fare_share_usd  AS revenue_basis,
  SUM(b.individual_charge_usd) AS total_collected_from_riders,
  COUNT(b.rider_id)            AS riders_in_this_car
FROM fct_trips t
JOIN brg_journey_riders b ON t.trip_id = b.trip_id
GROUP BY t.driver_id, t.trip_id, t.trip_fare_share_usd
ORDER BY t.trip_id;

Result for J-500:

driver_id	trip_id	revenue_basis	total_collected_from_riders	riders_in_car
D_Alpha (4)	T-101	$30.00	$40.00	2
D_Bravo (5)	T-102	$30.00	$20.00	1

Critical insight: Driver Alpha's car collected $40 from riders but Alpha is paid only on $30 revenue_basis (the journey was split equally between drivers). The $10 gap goes to the platform as the "convoy coordination" fee. Without this two-grain design, you couldn't represent it.

Q6 — Pool fare allocation per rider (single-vehicle Pool, distance-prorata)

-- Same query, useful when split_type = DISTANCE_PRORATA
SELECT
  b.journey_id,
  b.rider_id,
  b.trip_id,
  b.pickup_sequence,
  b.leg_distance_km,
  ROUND(100.0 * b.leg_distance_km
        / SUM(b.leg_distance_km) OVER (PARTITION BY b.journey_id), 1) AS distance_share_pct,
  b.individual_charge_usd,
  j.total_group_fare_usd
FROM brg_journey_riders b
JOIN fct_journeys j USING (journey_id)
WHERE b.journey_id = 'J-500'
ORDER BY b.trip_id, b.pickup_sequence;

Q7 — Twin invariant data-quality check (continuous DQ)

-- Invariant 1: SUM(brg.individual_charge) per journey = fct_journeys.total_group_fare
-- Invariant 2: SUM(fct_trips.trip_fare_share) per journey = fct_journeys.total_group_fare
SELECT
  j.journey_id,
  j.total_group_fare_usd                            AS reported_total,
  ROUND(SUM(DISTINCT t.trip_fare_share_usd), 2)     AS sum_trip_shares,
  (SELECT ROUND(SUM(b.individual_charge_usd), 2)
     FROM brg_journey_riders b WHERE b.journey_id = j.journey_id)
                                                    AS sum_rider_charges,
  CASE
    WHEN j.total_group_fare_usd != SUM(DISTINCT t.trip_fare_share_usd) THEN 'FAIL_TRIP_INVARIANT'
    WHEN j.total_group_fare_usd != (SELECT SUM(individual_charge_usd)
                                    FROM brg_journey_riders WHERE journey_id = j.journey_id)
                                    THEN 'FAIL_RIDER_INVARIANT'
    ELSE 'PASS'
  END AS dq_status
FROM fct_journeys j
JOIN fct_trips t USING (journey_id)
GROUP BY j.journey_id, j.total_group_fare_usd;
-- Any FAIL row = producer regression. Required for SOX / audit.

C. Risk · fraud detection

Q8 — Phantom Convoy detection (drivers claim convoy but aren't actually together)

-- Risk pattern: two drivers booked into the same journey but their PICKUP H3 cells
-- are far apart at the same moment. Either a producer bug or a fraud ring.
SELECT
  e1.trip_id        AS car_1,
  e2.trip_id        AS car_2,
  e1.location_h3    AS loc_1,
  e2.location_h3    AS loc_2,
  EXTRACT(EPOCH FROM ABS(e1.event_timestamp - e2.event_timestamp)) AS pickup_time_gap_sec
FROM fct_trip_events e1
JOIN fct_trips t1 ON e1.trip_id = t1.trip_id
JOIN fct_trips t2 ON t1.journey_id = t2.journey_id AND t1.trip_id < t2.trip_id
JOIN fct_trip_events e2 ON e2.trip_id = t2.trip_id
WHERE e1.state = 'PICKUP' AND e2.state = 'PICKUP'
  AND e1.location_h3 != e2.location_h3                       -- not in same H3 cell
  AND ABS(EXTRACT(EPOCH FROM (e1.event_timestamp - e2.event_timestamp))) < 600;  -- within 10 min
-- Any returned row = an investigation item.

For J-500 sample data, both cars pickup at h3_a within 0 seconds of each other → no rows returned → legit convoy.

Q9 — Multi-rider capacity check (over-capacity vehicle)

SELECT
  t.trip_id,
  v.model,
  v.capacity         AS vehicle_capacity,
  COUNT(b.rider_id)  AS passenger_count
FROM fct_trips t
JOIN brg_journey_riders b ON t.trip_id = b.trip_id
JOIN dim_vehicles      v ON t.vehicle_id = v.vehicle_key
GROUP BY t.trip_id, v.model, v.capacity
HAVING COUNT(b.rider_id) > v.capacity;
-- Off-app riders being charged = either producer bug or driver fraud.

Q10 — Location anomaly: request vs first-pickup divergence

WITH first_pickup AS (
  SELECT trip_id, location_h3 AS pickup_h3
  FROM fct_trip_events
  WHERE state = 'STARTED'
)
SELECT
  t.trip_key, t.trip_id, t.origin_h3_key, fp.pickup_h3,
  CASE WHEN t.origin_h3_key <> fp.pickup_h3 THEN 'INVESTIGATE' ELSE 'OK' END AS risk_flag
FROM fct_trips t
LEFT JOIN first_pickup fp USING (trip_id)
WHERE t.terminal_state = 'completed';

D. Product · CSAT, sentiment & growth accounting

Q11 — Trip rating distribution (1–5 stars) per driver, last 28 days

SELECT
  t.driver_id,
  COUNT(f.feedback_key)                                            AS total_ratings,
  ROUND(AVG(f.rating_score), 2)                                    AS avg_rating,
  COUNT(*) FILTER (WHERE f.rating_score <= 2)                      AS low_ratings,
  ROUND(100.0 * COUNT(*) FILTER (WHERE f.rating_score <= 2)
        / NULLIF(COUNT(*), 0), 1)                                  AS pct_low_ratings
FROM fct_trips t
JOIN fct_trip_feedback f ON f.trip_key = t.trip_key
WHERE f.feedback_timestamp >= CURRENT_DATE - INTERVAL '28 days'
GROUP BY t.driver_id
ORDER BY avg_rating ASC;
-- Drivers with pct_low_ratings > 5% are flagged for retraining or offboarding.

Q12 — Rolling 7d & 28d CSAT (% of ratings ≥ 4)

WITH daily_ratings AS (
  SELECT DATE(feedback_timestamp) AS d,
         COUNT(*)                                  AS n_ratings,
         COUNT(*) FILTER (WHERE rating_score >= 4) AS n_satisfied
  FROM fct_trip_feedback
  GROUP BY 1
)
SELECT d,
  ROUND(100.0 * SUM(n_satisfied) OVER (ORDER BY d ROWS  6 PRECEDING)
              / NULLIF(SUM(n_ratings) OVER (ORDER BY d ROWS  6 PRECEDING), 0), 1) AS csat_7d,
  ROUND(100.0 * SUM(n_satisfied) OVER (ORDER BY d ROWS 27 PRECEDING)
              / NULLIF(SUM(n_ratings) OVER (ORDER BY d ROWS 27 PRECEDING), 0), 1) AS csat_28d
FROM daily_ratings
ORDER BY d;

Q13 — Sentiment by feedback category (NLP signal × free-text)

SELECT
  feedback_category,
  COUNT(*)                                AS n_comments,
  ROUND(AVG(sentiment_score), 3)          AS avg_sentiment,
  ROUND(AVG(rating_score), 2)             AS avg_stars,
  COUNT(*) FILTER (WHERE sentiment_score < -0.3) AS n_strongly_negative
FROM fct_trip_feedback
WHERE feedback_timestamp >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY feedback_category
ORDER BY avg_sentiment ASC;
-- Surfaces what's driving the dissatisfaction — vehicle, route, app, payment, driver.

Q14 — Growth accounting: L1 / L7 / L28 active rider counts

WITH user_activity AS (
  SELECT b.rider_id,
    MAX(j.request_timestamp) AS last_active_ts
  FROM brg_journey_riders b
  JOIN fct_journeys j USING (journey_id)
  GROUP BY b.rider_id
)
SELECT
  COUNT(*) FILTER (WHERE last_active_ts >= CURRENT_TIMESTAMP - INTERVAL '1 day')   AS l1_active,
  COUNT(*) FILTER (WHERE last_active_ts >= CURRENT_TIMESTAMP - INTERVAL '7 days')  AS l7_active,
  COUNT(*) FILTER (WHERE last_active_ts >= CURRENT_TIMESTAMP - INTERVAL '28 days') AS l28_active,
  COUNT(*)                                                                          AS total_riders
FROM user_activity;
-- Standard L1/L7/L28 active KPIs from Reforge growth-accounting framework.

Worked numerical example — Convoy J-500 (EQUAL_SPLIT, $60 group fare)

5 friends call a group ride. 3 are billable riders (R_1, R_2, R_3); the others are children covered under R_1's account. Group fare quoted: $60.00.
System dispatches 2 vehicles because the group exceeds single-car capacity → fct_journeys.num_vehicles_dispatched = 2.
Driver Alpha's car (T-101) takes 2 riders (R_1, R_2). Driver Bravo's car (T-102) takes 1 rider (R_3).
Per-rider charge (EQUAL_SPLIT): $60 / 3 = $20.00 per rider. Stored in brg_journey_riders.individual_charge_usd.
Per-driver share: $60 / 2 = $30.00 per driver. Stored in fct_trips.trip_fare_share_usd.
Invariant 1 (rider side): SUM(brg.individual_charge) = $20 × 3 = $60 ✓
Invariant 2 (driver side): SUM(fct_trips.trip_fare_share) = $30 × 2 = $60 ✓
Note the asymmetry: Driver Alpha's car collected $40 in charges from his 2 riders, but Alpha's payout basis is $30 — the platform retains $10 as a convoy-coordination fee. Driver Bravo's car collected $20 from his 1 rider but is paid $30 — the platform covers the $10 gap. This is fair: the journey was equally split between drivers regardless of which car each rider sat in. Without the two-grain design (trip share AND rider charge), you couldn't represent it.

Section 8 — Why this works for the convoy scenario

Unique driver accounts. By placing driver_id on fct_trips (not fct_journeys), the model honors the rule that a (driver, journey) combination is unique — same driver can't take two trips for one journey, but two drivers can share one journey.
Cost flexibility. The bridge supports multi-destination and DISTANCE_PRORATA logic. If Rider 1 exits at Mile 5 and Rider 2 at Mile 10, individual_charge_usd is computed pro-rata using leg_distance_km captured from fct_trip_events — same model, different split_type.
Auditability. Finance traces every cent from a rider's wallet through brg_journey_riders directly to the specific driver's fct_trips payout. The two continuous invariants (Q7) catch producer bugs before they hit a quarterly SOX report.
Scalability. "Pool for 4", "Bus", "Wedding party of 12" — no schema migration. Just more rows in brg_journey_riders; the journey grain is unchanged.
Eliminates granularity mismatch. The 3-tier hierarchy (journey, trip, event) lets each stakeholder query at their natural grain without filters or self-joins.

Section 9 — Business model evolution: single-sided → two-sided → three-sided → driverless

The Uber data model isn't static — it's shaped by the platform's market structure, and that structure is shifting. As Uber harnesses platform leverage, AI, and driverless technologies, the rider may eventually hail a ride that involves no human driver at all. A senior architect should be able to describe how the data model morphs at each platform shape, and what the driverless future demands.

The driverless future — what changes when the human driver disappears?

As autonomous vehicles (AVs) become a real share of the fleet, the data model has to absorb three structural shifts:

The driver dimension shrinks. dim_users becomes mostly riders. Drivers don't go away entirely — they become safety operators / remote-fleet teleoperators — but most rows in fct_trips won't have a human driver_id.
The vehicle dimension explodes. dim_vehicles evolves into dim_av_fleet with software version, sensor calibration, ODD (Operational Design Domain) restrictions, last safety-test pass, etc. — all SCD2 because they drift.
A whole new fact table appears for safety telemetry. Every disengagement, every remote-intervention, every edge-case recognition gets logged. Regulators require it; insurance audits depend on it.

Q15 — AV safety metric: miles per disengagement (the core regulator KPI)

WITH av_miles AS (
  SELECT
    a.software_version,
    SUM(t.trip_distance_km) * 0.621371 AS total_miles
  FROM fct_trips t
  JOIN dim_av_fleet a ON t.av_key = a.av_key
  WHERE t.terminal_state = 'completed'
    AND t.request_timestamp >= CURRENT_DATE - INTERVAL '90 days'
  GROUP BY a.software_version
),
disengagements AS (
  SELECT
    a.software_version,
    COUNT(*) FILTER (WHERE s.event_type = 'DISENGAGEMENT') AS n_disengagements,
    COUNT(*) FILTER (WHERE s.event_type = 'REMOTE_INTERVENTION') AS n_interventions
  FROM fct_av_safety_event s
  JOIN dim_av_fleet a ON s.av_key = a.av_key
  WHERE s.event_timestamp >= CURRENT_DATE - INTERVAL '90 days'
  GROUP BY a.software_version
)
SELECT
  m.software_version,
  ROUND(m.total_miles, 0)                               AS total_miles,
  d.n_disengagements,
  d.n_interventions,
  ROUND(m.total_miles / NULLIF(d.n_disengagements, 0))  AS miles_per_disengagement,
  ROUND(m.total_miles / NULLIF(d.n_interventions, 0))   AS miles_per_intervention
FROM av_miles m JOIN disengagements d USING (software_version)
ORDER BY miles_per_disengagement DESC;
-- Higher MPD = safer software. Required for NHTSA 2027 quarterly safety filings.

Q16 — Software version A/B: did v3.2 reduce disengagement rate vs v3.1?

WITH per_version AS (
  SELECT a.software_version,
    SUM(t.trip_distance_km * 0.621371)                                  AS miles,
    COUNT(*) FILTER (WHERE s.event_type = 'DISENGAGEMENT')              AS disengagements
  FROM fct_trips t
  JOIN dim_av_fleet a              ON t.av_key = a.av_key
  LEFT JOIN fct_av_safety_event s  ON s.trip_id = t.trip_id
  WHERE a.software_version IN ('v3.1', 'v3.2')
  GROUP BY a.software_version
)
SELECT
  software_version,
  ROUND(disengagements::DECIMAL / miles * 1000000, 2) AS disengagements_per_mil_miles
FROM per_version;
-- Compare across versions; statistical-significance test runs in the model card.

Senior framing for the AV future. "The data model survives the driverless transition because the grains are right. fct_trips doesn't care if the driver is a human or a piece of software — it cares about request → completion and the fare share. What changes is the vehicle dimension (now SCD2 on software version) and the safety fact (regulator-mandated). The Journey → Trip → Event hierarchy is platform-shape-agnostic — it works for single-sided fleets, two-sided marketplaces, three-sided delivery, and driverless robotaxis without re-architecting."

Section 10 — Working with a real Uber dataset (148K bookings · Delhi NCR · 2024)

To make the practical SQL drillable on real data, the model maps cleanly to a public Kaggle-derived dataset: github.com/hannahsutton1/uber.data — 148,770 booking rows from Delhi NCR for calendar year 2024. Repo includes raw + cleaned CSVs plus the cleaning Jupyter notebooks.

Source schema (single wide table, 21 columns)

Date, Time, booking_ID, Booking_Status, Customer_ID, Vehicle_Type,
Pickup_Location, Drop_Location, Avg_VTAT, Avg_CTAT,
Cancelled_Rides_by_Customer, Reason_for_cancelling_by_Customer,
Cancelled_Rides_by_Driver, Driver_Cancellation_Reason,
Incomplete_Rides, Incomplete_Rides_Reason,
Booking_Value, Ride_Distance, Driver_Ratings, Customer_Rating, Payment_Method

Glossary: VTAT = vehicle-to-pickup minutes (driver arrival time). CTAT = customer-to-completion minutes (in-trip duration). Booking_Value = fare in INR. Ride_Distance in km. Ratings on a 1–5 scale.

Source → Target mapping (ETL plan)

Source column(s)	Target table.column	Notes
`booking_ID`	`fct_trips.trip_id`	Natural key; idempotency anchor.
`Date + Time`	`fct_trips.request_timestamp`	Concat to TIMESTAMP_TZ (Asia/Kolkata).
`Customer_ID`	`fct_trips.rider_id` + `dim_users`	Single-rider per booking.
`Vehicle_Type`	`dim_vehicles.type`	Enum {Auto, Go Mini, Go Sedan, Premier Sedan, UberXL, Bike, eBike}.
`Pickup_Location, Drop_Location`	`fct_trips.origin_h3, destination_h3`	Geocode locations → H3 cells (post-load enrichment).
`Booking_Status`	`fct_trips.terminal_state`	Map: {Completed, Cancelled by Customer, Cancelled by Driver, Incomplete, No Driver Found}.
`Avg_VTAT`	`fct_trip_events` (derived: MATCHED → PICKUP latency)	Reverse-engineer the events table from VTAT/CTAT.
`Avg_CTAT`	`fct_trip_events` (derived: PICKUP → DROPOFF latency)
`Booking_Value`	`fct_trips.total_fare_local` (currency = INR)	FX to USD via `dim_fx_rate`.
`Ride_Distance`	`fct_trips.trip_distance_km`
`Driver_Ratings, Customer_Rating`	`fct_trip_feedback.rating_score` (typed by rater)	Two rows per trip (driver→rider, rider→driver).
`Reason_for_cancelling_by_Customer`	`fct_trip_feedback.feedback_category`	For cancelled trips.
`Driver_Cancellation_Reason`	`fct_trips.cancellation_reason`	Free-text → categorical NLP at ETL.
`Payment_Method`	`dim_payment_methods`	Enum {Cash, UPI, Debit Card, Credit Card, Wallet}.

Real sample rows (5 representative bookings from the dataset)

-- Loaded into fct_trips after ETL transformation:
INSERT INTO fct_trips (trip_id, customer_id, vehicle_type, pickup_loc, drop_loc,
                       request_ts, terminal_state, vtat_min, ctat_min,
                       booking_value_inr, ride_distance_km, payment_method)
VALUES
  ('CNR5884300','CID1982111','eBike',         'Palam Vihar',  'Jhilmil',
    '2024-03-23 12:29:38','no_driver_found',     NULL,  NULL,    NULL,   NULL,  NULL),
  ('CNR8494506','CID9202816','Auto',          'Khandsa',      'Malviya Nagar',
    '2024-08-23 08:56:10','completed',           13.4,  25.8,    627.00, 13.58, 'Debit Card'),
  ('CNR8906825','CID2610914','Premier Sedan', 'Central Secretariat','Inderlok',
    '2024-10-21 17:17:25','completed',           13.1,  28.5,    416.00, 34.02, 'UPI'),
  ('CNR1950162','CID9933542','Bike',          'Ghitorni Village','Khan Market',
    '2024-09-16 22:08:00','completed',            5.3,  19.6,    737.00, 48.21, 'UPI'),
  ('CNR1326809','CID4604802','Go Sedan',      'Shastri Nagar','Gurgaon Sector 56',
    '2024-11-29 18:01:39','incomplete',           4.9,  14.0,    237.00,  5.73, 'UPI');

Q17 — Vehicle type mix & cancellation rate (real-data analysis)

SELECT
  vehicle_type,
  COUNT(*)                                                                      AS bookings,
  COUNT(*) FILTER (WHERE terminal_state = 'completed')                          AS completed,
  COUNT(*) FILTER (WHERE terminal_state LIKE 'cancelled%')                      AS cancelled,
  COUNT(*) FILTER (WHERE terminal_state = 'no_driver_found')                    AS no_driver,
  ROUND(100.0 * COUNT(*) FILTER (WHERE terminal_state = 'completed')
        / COUNT(*), 1)                                                          AS completion_pct,
  ROUND(AVG(booking_value_inr) FILTER (WHERE terminal_state = 'completed'), 0)  AS avg_fare_inr,
  ROUND(AVG(ride_distance_km)  FILTER (WHERE terminal_state = 'completed'), 1)  AS avg_distance_km
FROM fct_trips
WHERE DATE_TRUNC('year', request_ts) = '2024-01-01'
GROUP BY vehicle_type
ORDER BY bookings DESC;

Typical result on the 148K dataset:

vehicle_type	bookings	completed	completion_pct	avg_fare_inr	avg_distance_km
Auto	~37,400	~23,200	62%	~410	~22.5
Go Sedan	~31,800	~19,800	62%	~520	~24.1
Premier Sedan	~21,500	~13,300	62%	~610	~24.0
Bike	~18,900	~11,700	62%	~390	~22.7
eBike	~14,600	~9,000	62%	~395	~22.8
Go Mini	~12,300	~7,600	62%	~440	~23.0
UberXL	~12,200	~7,500	62%	~720	~24.5

Insight: completion rate is ~62% across all vehicle types — the bottleneck isn't the vehicle, it's cancellation behavior (next query).

Q18 — Top 10 routes by booking volume (Pickup → Drop)

SELECT
  pickup_loc,
  drop_loc,
  COUNT(*)                                                                       AS trips,
  ROUND(AVG(booking_value_inr) FILTER (WHERE terminal_state = 'completed'), 0)   AS avg_fare,
  ROUND(AVG(ride_distance_km)  FILTER (WHERE terminal_state = 'completed'), 1)   AS avg_dist,
  ROUND(100.0 * COUNT(*) FILTER (WHERE terminal_state = 'completed')
        / COUNT(*), 1)                                                           AS completion_pct
FROM fct_trips
GROUP BY pickup_loc, drop_loc
ORDER BY trips DESC
LIMIT 10;

Q19 — Cancellation reason breakdown (driver vs customer)

WITH driver_cancels AS (
  SELECT 'driver' AS cancelled_by, driver_cancellation_reason AS reason, COUNT(*) AS n
  FROM fct_trips
  WHERE terminal_state = 'cancelled_driver' AND driver_cancellation_reason IS NOT NULL
  GROUP BY driver_cancellation_reason
),
customer_cancels AS (
  SELECT 'customer' AS cancelled_by, reason_for_cancelling_by_customer AS reason, COUNT(*) AS n
  FROM fct_trips
  WHERE terminal_state = 'cancelled_rider' AND reason_for_cancelling_by_customer IS NOT NULL
  GROUP BY reason_for_cancelling_by_customer
)
SELECT cancelled_by, reason, n,
  ROUND(100.0 * n / SUM(n) OVER (PARTITION BY cancelled_by), 1) AS pct_within_actor
FROM (SELECT * FROM driver_cancels UNION ALL SELECT * FROM customer_cancels) u
ORDER BY cancelled_by, n DESC;

Use case: if "Vehicle Breakdown" tops the driver list, fleet maintenance owns the fix; if "Driver took too long" tops the customer list, dispatch tuning owns it.

Q20 — Hourly demand pattern + completion rate (peak-hour stress)

SELECT
  EXTRACT(HOUR FROM request_ts) AS hour_of_day,
  COUNT(*) AS bookings,
  COUNT(*) FILTER (WHERE terminal_state = 'no_driver_found')          AS no_driver_found,
  ROUND(100.0 * COUNT(*) FILTER (WHERE terminal_state = 'no_driver_found')
        / COUNT(*), 2)                                                AS no_driver_pct,
  ROUND(100.0 * COUNT(*) FILTER (WHERE terminal_state = 'completed')
        / COUNT(*), 1)                                                AS completion_pct,
  ROUND(AVG(vtat_min) FILTER (WHERE terminal_state = 'completed'), 1) AS avg_vtat_min
FROM fct_trips
GROUP BY EXTRACT(HOUR FROM request_ts)
ORDER BY hour_of_day;
-- The hours where no_driver_pct spikes are the operational gaps where supply incentives matter most.

Q21 — Payment method mix by vehicle tier (premium vs economy)

SELECT
  CASE WHEN vehicle_type IN ('Premier Sedan','UberXL') THEN 'premium'
       WHEN vehicle_type IN ('Go Sedan','Go Mini')      THEN 'mid'
       ELSE                                                  'economy'
  END                                                  AS tier,
  payment_method,
  COUNT(*)                                             AS trips,
  ROUND(100.0 * COUNT(*) /
        SUM(COUNT(*)) OVER (PARTITION BY
          CASE WHEN vehicle_type IN ('Premier Sedan','UberXL') THEN 'premium'
               WHEN vehicle_type IN ('Go Sedan','Go Mini')      THEN 'mid'
               ELSE                                                  'economy' END), 1) AS pct_within_tier
FROM fct_trips
WHERE terminal_state = 'completed' AND payment_method IS NOT NULL
GROUP BY tier, payment_method
ORDER BY tier, trips DESC;
-- UPI dominates economy; cards skew premium. Drives partnership and incentive design.

What this dataset does not support (caveats — for honest interviews)

No driver_id, no vehicle_id. Only Vehicle_Type as a categorical. Driver-side analytics (utilization, payouts) need a synthetic driver_id assignment or a different dataset.
No surge multiplier, no fare breakdown. Only the final Booking_Value. Surge analysis (Scenario 2) needs synthetic data.
No lat/lon, no H3 cells. Only Delhi NCR neighborhood names. Geocode at ETL using h3-py or Mapbox.
No journey or multi-leg concept. One rider per booking — convoy / Pool / shared rides are not in scope. Use synthetic data for those examples.
No GPS trace. fct_trip_events can be reverse-engineered from VTAT/CTAT into approximate REQUESTED, MATCHED, PICKUP, DROPOFF timestamps but the geographic events (location_h3 per state) are unavailable.

Section 11 — Mass-departure scale: 20K+ people, road blocks & growth accounting

An interview prompt sometimes raises the stakes: "Architect the system to handle a 20,000-person stadium event — accidents, road blocks, supply/demand volatility, all at once." The existing Journey → Trip → Event hierarchy holds, but three additional concerns must be modeled explicitly: infrastructure as a fact (SCD2 geography), growth accounting on both sides of the marketplace, and point-in-time financial integrity under chaos.

11.1 — Re-stated grain hierarchy under stress

Table	Grain	Purpose
`fct_journeys`	One row per group request	Captures intent (e.g. a group of 6 leaving the stadium together).
`fct_trips`	One row per (driver/vehicle × journey)	Vehicle-grain truth for driver payouts & insurance.
`fct_trip_events`	One row per state transition / telemetry sample	Tracks Matchup & Pickup latencies + GPS replays.
`brg_journey_riders`	One row per (rider × journey)	Financial bridge — many-to-many cost splits.
`dim_geography` (SCD2)	One row per H3 cell × state-change	NEW emphasis: infrastructure as a versioned dim. Tracks "as-of-trip" road status (one-way, accident, blocked).

11.2 — Growth accounting on both sides of the marketplace

To know whether a 20K-person event is being fueled by new users or resurrected ones (and which drivers are churning), every user is segmented daily.

Segment	Definition	What it tells the business
New	First-time activity in this window.	Acquisition is working.
Retained	Active in previous period AND active now.	Core ecosystem is healthy.
Resurrected	Active now but inactive in the previous period.	Re-engagement campaign / event pulled them back.
Churned	Active in previous period, did not show up.	Supply gap risk — especially among drivers.

Q22 — L1 / L7 / L28 active status per user (daily refresh)

SELECT
  user_id,
  user_role,                                                                -- 'rider' | 'driver'
  MAX(CASE WHEN activity_date >= CURRENT_DATE - 1  THEN 1 ELSE 0 END) AS is_L1,
  MAX(CASE WHEN activity_date >= CURRENT_DATE - 7  THEN 1 ELSE 0 END) AS is_L7,
  MAX(CASE WHEN activity_date >= CURRENT_DATE - 28 THEN 1 ELSE 0 END) AS is_L28
FROM dw_user_daily_activity
GROUP BY user_id, user_role;

Q23 — Growth segmentation: New / Retained / Resurrected / Churned (daily diff)

WITH today AS (
  SELECT DISTINCT user_id, user_role FROM dw_user_daily_activity
  WHERE activity_date = CURRENT_DATE - 1                                    -- yesterday's activity
),
prior AS (
  SELECT DISTINCT user_id, user_role FROM dw_user_daily_activity
  WHERE activity_date BETWEEN CURRENT_DATE - 8 AND CURRENT_DATE - 2          -- prior 7-day window
),
ever AS (
  SELECT DISTINCT user_id, user_role FROM dw_user_daily_activity
  WHERE activity_date < CURRENT_DATE - 8                                    -- before prior window
)
SELECT user_role,
  COUNT(*) FILTER (WHERE t.user_id IS NOT NULL AND p.user_id IS NULL  AND e.user_id IS NULL)  AS new_users,
  COUNT(*) FILTER (WHERE t.user_id IS NOT NULL AND p.user_id IS NOT NULL)                      AS retained,
  COUNT(*) FILTER (WHERE t.user_id IS NOT NULL AND p.user_id IS NULL  AND e.user_id IS NOT NULL) AS resurrected,
  COUNT(*) FILTER (WHERE t.user_id IS NULL     AND p.user_id IS NOT NULL)                      AS churned
FROM today t
FULL OUTER JOIN prior p USING (user_id, user_role)
LEFT  JOIN ever  e USING (user_id, user_role)
GROUP BY user_role;
-- Driver "churned" spike before a stadium event = supply crisis incoming. Marketing fires re-engagement push.

11.3 — Mass departure with road blocks: SCD2 geography in action

When an accident or police closure occurs during the stadium emptying out, dim_geography records the change as a new SCD2 row. Old route remains queryable; new route status is "as-of-trip".

geo_key	h3_index	route_status	valid_from	valid_to
7001	Stadium_Hex	Normal	2026-05-07 08:00	2026-05-07 22:00
7002	Stadium_Hex	Accident / Blocked	2026-05-07 22:00	2026-05-08 02:00
7003	Stadium_Hex	Normal	2026-05-08 02:00	9999-12-31

Q24 — Correlate pickup latency to road status (was the spike supply or infrastructure?)

SELECT
  g.h3_index,
  g.route_status,
  COUNT(t.trip_id)                                                                AS total_requests,
  ROUND(AVG(EXTRACT(EPOCH FROM (e_pic.event_timestamp - e_req.event_timestamp))), 1) AS avg_pickup_latency_sec,
  ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (e_pic.event_timestamp - e_req.event_timestamp))), 1) AS p95_pickup_latency_sec
FROM fct_trips t
JOIN fct_trip_events e_req ON t.trip_id = e_req.trip_id AND e_req.state = 'REQUESTED'
JOIN fct_trip_events e_pic ON t.trip_id = e_pic.trip_id AND e_pic.state = 'PICKUP'
JOIN dim_geography  g
  ON t.pickup_h3_cell = g.h3_index
 AND e_req.event_timestamp >= g.valid_from
 AND e_req.event_timestamp <  g.valid_to                                          -- as-of-request status
GROUP BY g.h3_index, g.route_status
ORDER BY p95_pickup_latency_sec DESC;
-- "Accident / Blocked" rows with high p95 = blame infrastructure, not driver supply.
-- "Normal" rows with high p95 = blame supply, surge it up.

11.4 — Financial & risk integrity under chaos

Even when a 20K event triggers cascading surges + accidents + cancellations, three SCD2 rules guarantee restate-ability:

FX rate locked at request_timestamp. dim_fx_rate is SCD2; the rate that was current at the moment the journey was requested is the rate the rider is charged in USD, regardless of how the rate moves while the trip is in progress. Prevents arbitrage and revenue restatement.
Insurance status as-of accident timestamp. dim_vehicles SCD2 on insurance. If a crash happens at 22:14, Risk queries dim_vehicles with valid_from ≤ 22:14 < valid_to — exact policy in effect to the millisecond.
Surge multiplier locked in fct_trips at journey request. Even if surge spikes from 1.5× to 3.0× while the rider is being matched, the rider pays the multiplier that was active at request. Stored on the row, not derived.

Q25 — Insurance audit: was the vehicle insured at the moment of the accident?

SELECT
  t.trip_id,
  t.vehicle_id,
  e.event_timestamp                                                  AS accident_ts,
  v.insurance_policy_id,
  v.insurance_status,
  v.valid_from                                                       AS policy_active_from,
  v.valid_to                                                         AS policy_active_to,
  CASE WHEN v.insurance_status = 'active' THEN 'COVERED' ELSE 'GAP' END AS audit_verdict
FROM fct_trip_events e
JOIN fct_trips        t ON e.trip_id = t.trip_id
JOIN dim_vehicles     v ON t.vehicle_id = v.vehicle_id
                      AND e.event_timestamp >= v.valid_from
                      AND e.event_timestamp <  v.valid_to
WHERE e.state = 'ACCIDENT_REPORTED';
-- Required for insurance claims + regulator audits. Single-row defensible answer.

11.5 — Scalability rules under load (the engineering details)

Rule	Why
Infinity date `9999-12-31` instead of `NULL` for `valid_to`	Range joins like `e_req.ts BETWEEN g.valid_from AND g.valid_to` are far more performant than `... AND (g.valid_to IS NULL OR g.valid_to > e_req.ts)`. Optimizers handle BETWEEN cleanly; OR + IS NULL kills predicate pushdown.
H3 sharding by zoom level — L8 (~0.7 km²) for cities, L10 (~15K m²) for stadium event zones	Stadium hexes get 100× the requests/min of suburb hexes. Going to L10 inside venues distributes the load across many smaller hexes — the "hot partition" stops being the entire stadium and becomes per-gate.
Surge locking in `fct_trips.surge_multiplier_at_request`	Rider's quoted price = driver's earning potential = revenue booked. All three numbers must agree forever. Locked at journey-request, never recomputed.
Per-partition watermarks on `fct_trip_events`	One stadium-area producer with bad clock skew shouldn't stall the entire event-time pipeline. Per-partition watermark contains the damage to its own H3 cell.
Pre-event materialization of `dim_geography` for the venue area	Police-coordinated road closures are known hours in advance. Pre-write the SCD2 transitions so they're queryable from minute zero of the event.

Senior framing for mass-event scale. "The Journey → Trip → Event hierarchy doesn't change at 20K riders — what changes is the load on three specific SCD2 dimensions: dim_geography for road state, dim_vehicles for insurance, dim_fx_rate for currency. Get those three SCD2s right, use infinity-dates instead of NULLs, and shard H3 to L10 inside the venue, and the rest of the model behaves identically — surge locked, payouts reconciled, growth segmented. The chaos is contained at the dimension layer, not the fact layer."

Senior framing. "The Journey → Trip → Event hierarchy is the unlock — Operations runs latency metrics off fct_trip_events, Finance reconciles fares against two invariants (rider-side and driver-side), and Risk replays GPS sequences for phantom-convoy detection. Three stakeholders, three grains, one source of truth — and two invariants on the financial bridge keep producer bugs from ever reaching the audit."

Uber · supply/demand · dynamic pricing

Scenario 2 — Surge pricing & supply/demand state

Design a model for Uber's surge engine that captures the supply/demand state per geography per minute, the surge decisions made, and the actual outcome of those decisions — supporting the live engine, post-hoc effectiveness analysis, capacity planning, and pricing-team experimentation.

Why this is hard — the design tension

Surge isn't one fact — it's a causal chain: state observed → model invoked → multiplier set → riders react → drivers reposition → state changes. Each link needs its own grain or you lose the ability to evaluate the engine. The model also has to replay history with the surge model that was active then, not today's — pricing teams A/B test new models monthly. And the geographic skew is brutal: midtown Manhattan on New Year's Eve has 100× the rows of suburban Queens.

Section 1 — Dimensional Model Overview

Star Schema with a State → Decision → Outcome causal chain.

Fact Tables

fct_geo_state_minute — The "Observation" grain. One row per (H3 cell × minute). Captures open requests, available drivers, ratio, ETA, weather, event signals.
fct_surge_decision — The "Decision" grain. One row per surge-multiplier change per cell. Captures which model version made the call and on what input features.
fct_surge_outcome — The "Effectiveness" grain. One row per surge decision joined to its post-decision N-minute outcome (did supply rebalance? did the price stick?).
brg_zone_cells — Hierarchy bridge. Many H3 cells roll up to a surge zone (operational region used by the pricing team).

Dimension Tables

dim_h3_cell — H3 L8 hexagons. SCD1 (geometry is static).
dim_surge_zone — Operational regions (e.g. "SF Downtown", "JFK Airport"). SCD2 — boundaries get re-drawn.
dim_surge_model — Versioned ML/heuristic models. SCD2 — drives the replay capability.
dim_weather — Weather state per cell per hour (rain, snow, temperature). SCD1 snapshot.
dim_event — Concerts, sports games, conferences. SCD1 (each event is a row, not a state).

Section 2 — The Logical Data Model

Observation Fact: fct_geo_state_minute

Column	Type	Description
`h3_key`	FK	H3 cell observed.
`minute_bucket_ts`	TIMESTAMP_TZ	Truncated to the minute.
`open_requests`	INT	Riders waiting in this cell.
`available_drivers`	INT	On-app, not currently on a trip.
`driver_to_request_ratio`	DECIMAL(8,3)	Computed for convenience.
`predicted_eta_sec`	INT	From the ETA service at this moment.
`weather_key`	FK	Weather conditions in this cell.
`event_keys`	ARRAY<FK>	Active events influencing this cell.

Decision Fact: fct_surge_decision

Column	Type	Description
`decision_key`	PK	Surrogate.
`h3_key`	FK	Cell the decision applies to.
`decision_ts`	TIMESTAMP_TZ	When the new multiplier became active.
`previous_multiplier`	DECIMAL(4,2)	Just before this decision.
`new_multiplier`	DECIMAL(4,2)	The applied surge factor (1.0 = no surge).
`model_key`	FK	Which `dim_surge_model` version made the call.
`input_features_json`	JSON	Frozen features the model saw — required for replay.
`decision_reason_code`	STRING	{LOW_SUPPLY, HIGH_DEMAND, EVENT_PROXIMITY, WEATHER_SHOCK, MANUAL_OVERRIDE}.

Outcome Fact: fct_surge_outcome

Column	Type	Description
`decision_key`	FK	The surge decision.
`n_minutes_after`	SMALLINT	5, 10, 15-min windows post-decision.
`delta_drivers`	INT	Net change in available drivers.
`delta_requests`	INT	Net change in open requests.
`completed_trips`	INT	Trips actually completed in this window.
`abandoned_requests`	INT	Riders who saw the surge and didn't book.
`realized_revenue_usd`	DECIMAL	Revenue earned from trips in this window.

Section 3 — How the model serves each stakeholder

The Surge Engine · live

Reads fct_geo_state_minute for the latest minute, joins dim_surge_model to get the active model, computes a multiplier, writes fct_surge_decision.
Round-trip target: under 500 ms per cell. Reads come from a Redis snapshot of the latest minute; writes append to Kafka, then to the warehouse.

Pricing Team · effectiveness analysis

Did the surge work? A surge "worked" if (delta_drivers > 0 within 10 min) AND (abandoned_requests stayed flat). Both signals come from fct_surge_outcome.
Pricing elasticity: for each multiplier band, what % of riders abandoned? Plot abandonment vs multiplier — the inflection point is the pain threshold.
Model A/B: two model versions live in dim_surge_model; route a % of cells to each. Compare realized_revenue_usd + abandoned_requests per model.

Capacity Planning · forecasting

Train demand forecast on historical fct_geo_state_minute. MAPE measured against later observed minutes.
Identify chronic supply gaps — cells where driver_to_request_ratio is below 0.5 for >30% of peak hours. These get long-term incentive programs.

Section 4 — Why versioning the surge model matters (and why `input_features_json` is mandatory)

Without a versioned dim_surge_model + frozen input features, you cannot replay history. Three failure modes show up the moment the pricing team A/B tests:

Failure 1: model drift hides a regression. If today's model retroactively scores yesterday's input, a buggy current-model version "explains" yesterday's outcome cleanly — even when the actual yesterday's model produced a wildly different result.
Failure 2: feature drift. A weather data source upgrades from "rain_mm" to "precip_intensity" mid-week. Without frozen input_features_json, you can't tell whether yesterday's surge was driven by rain or by something else.
Failure 3: dispute resolution. A regulator asks "why was this rider charged 3.2× on this date?" — only frozen features + the SCD2 model row gives you a defensible answer.

The contract: every fct_surge_decision row carries a deterministic, replay-safe snapshot of the inputs. Same inputs + same model_key → identical multiplier, every time, forever.

Section 5 — Conceptual data flow (the State → Decision → Outcome chain)

Enhanced graphical data model — Surge Pricing & Effectiveness

Section 7 — SQL analysis for business units

Sample data (a single H3 cell, 5 minutes of supply/demand state and 1 surge decision)

INSERT INTO fct_geo_state_minute VALUES
  ('h3_sf_market', '2025-05-01 19:00:00',  8, 12, 1.50, 240, 'clear', NULL),
  ('h3_sf_market', '2025-05-01 19:01:00', 14,  7, 0.50, 360, 'clear', ARRAY['concert_chase_center']),
  ('h3_sf_market', '2025-05-01 19:02:00', 22,  5, 0.23, 540, 'clear', ARRAY['concert_chase_center']),
  ('h3_sf_market', '2025-05-01 19:03:00', 19,  9, 0.47, 410, 'clear', ARRAY['concert_chase_center']),
  ('h3_sf_market', '2025-05-01 19:04:00', 16, 14, 0.88, 280, 'clear', ARRAY['concert_chase_center']);

-- Pricing engine fires a surge decision at 19:02 when ratio crashes:
INSERT INTO fct_surge_decision VALUES
  (5001, 'h3_sf_market', 'model_v3.2', '2025-05-01 19:02:00',
   1.0, 2.5,
   '{"ratio":0.23,"eta_sec":540,"event":"concert_chase_center","prev_minute_ratio":0.50}',
   'LOW_SUPPLY');

-- Outcome computed nightly (10-min window post-decision)
INSERT INTO fct_surge_outcome VALUES
  (5001, 10, +9, -3, 14, 5, 1240.00);

A. Pricing engine · live decisions

Q1 — Cells currently below ratio 0.5 (live surge candidates)

WITH latest_state AS (
  SELECT h3_key, MAX(minute_bucket_ts) AS latest_ts
  FROM fct_geo_state_minute
  WHERE minute_bucket_ts >= NOW() - INTERVAL '2 minutes'
  GROUP BY h3_key
)
SELECT s.h3_key,
       s.driver_to_request_ratio,
       s.open_requests, s.available_drivers,
       s.predicted_eta_sec,
       w.precip_intensity, e.event_name
FROM fct_geo_state_minute s
JOIN latest_state l USING (h3_key)
LEFT JOIN dim_weather w ON s.weather_key = w.weather_key
LEFT JOIN dim_event   e ON e.event_key = ANY(s.event_keys)
WHERE s.minute_bucket_ts = l.latest_ts
  AND s.driver_to_request_ratio < 0.5
ORDER BY s.driver_to_request_ratio ASC;

B. Pricing team · effectiveness analysis

Q2 — Surge effectiveness: % of decisions that pulled supply within 10 min

SELECT
  d.model_key,
  COUNT(*)                                                 AS n_decisions,
  COUNT(*) FILTER (WHERE o.delta_drivers > 0)              AS pulled_supply,
  ROUND(100.0 * COUNT(*) FILTER (WHERE o.delta_drivers > 0)
        / COUNT(*), 1)                                     AS effectiveness_pct,
  ROUND(AVG(o.delta_drivers), 1)                           AS avg_delta_drivers,
  ROUND(AVG(o.abandoned_requests), 1)                      AS avg_abandoned
FROM fct_surge_decision d
JOIN fct_surge_outcome  o USING (decision_key)
WHERE o.n_minutes_after = 10
  AND d.decision_ts >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY d.model_key
ORDER BY effectiveness_pct DESC;

Q3 — Pricing elasticity: abandonment by multiplier band

SELECT
  CASE
    WHEN d.new_multiplier <  1.5 THEN '1.0-1.5x'
    WHEN d.new_multiplier <  2.0 THEN '1.5-2.0x'
    WHEN d.new_multiplier <  3.0 THEN '2.0-3.0x'
    ELSE                              '3.0x+'
  END                                              AS multiplier_band,
  COUNT(*)                                         AS n_decisions,
  ROUND(AVG(o.abandoned_requests), 1)              AS avg_abandoned,
  ROUND(AVG(o.completed_trips), 1)                 AS avg_completed,
  ROUND(100.0 * AVG(o.abandoned_requests)
        / NULLIF(AVG(o.abandoned_requests + o.completed_trips), 0), 1) AS abandonment_pct,
  ROUND(AVG(o.realized_revenue_usd), 2)            AS avg_revenue_usd
FROM fct_surge_decision d
JOIN fct_surge_outcome  o USING (decision_key)
WHERE o.n_minutes_after = 10
GROUP BY multiplier_band
ORDER BY multiplier_band;
-- The inflection point in abandonment_pct is the rider pain threshold.

Q4 — Model A/B comparison: revenue per cell-minute under v3.1 vs v3.2

SELECT
  d.model_key,
  COUNT(*)                                          AS n_decisions,
  ROUND(SUM(o.realized_revenue_usd), 2)             AS total_revenue,
  ROUND(SUM(o.realized_revenue_usd) / COUNT(*), 2)  AS rev_per_decision,
  ROUND(AVG(o.completed_trips), 2)                  AS avg_completed,
  ROUND(AVG(o.abandoned_requests), 2)               AS avg_abandoned
FROM fct_surge_decision d
JOIN fct_surge_outcome  o USING (decision_key)
WHERE d.decision_ts >= CURRENT_DATE - INTERVAL '14 days'
  AND d.model_key IN ('model_v3.1', 'model_v3.2')
  AND o.n_minutes_after = 10
GROUP BY d.model_key;
-- Statistical significance test runs in the model card; this is the headline.

C. Capacity planning · forecast accuracy & chronic gaps

Q5 — Demand forecast MAPE per cell (predicted vs actual)

SELECT
  h3_key,
  COUNT(*)                                                                            AS n_minutes,
  ROUND(AVG(ABS(predicted_demand - open_requests)
            / NULLIF(open_requests, 0)) * 100, 2)                                     AS mape_pct,
  ROUND(AVG(predicted_demand - open_requests), 2)                                     AS avg_bias,
  COUNT(*) FILTER (WHERE predicted_demand < open_requests)                            AS underforecast_min
FROM fct_geo_state_minute
WHERE minute_bucket_ts >= CURRENT_DATE - INTERVAL '7 days'
  AND open_requests > 0
GROUP BY h3_key
HAVING AVG(ABS(predicted_demand - open_requests) / NULLIF(open_requests, 0)) > 0.20
ORDER BY mape_pct DESC;
-- Cells with MAPE > 20% need feature engineering attention.

Q6 — Chronic supply gaps (cells under-supplied during peak hours regularly)

SELECT
  h3_key,
  COUNT(*)                                                            AS peak_minutes,
  COUNT(*) FILTER (WHERE driver_to_request_ratio < 0.5)                AS under_supplied_min,
  ROUND(100.0 * COUNT(*) FILTER (WHERE driver_to_request_ratio < 0.5)
        / COUNT(*), 1)                                                 AS pct_under_supplied
FROM fct_geo_state_minute
WHERE minute_bucket_ts >= CURRENT_DATE - INTERVAL '30 days'
  AND EXTRACT(HOUR FROM minute_bucket_ts) BETWEEN 17 AND 21
GROUP BY h3_key
HAVING COUNT(*) FILTER (WHERE driver_to_request_ratio < 0.5) * 1.0 / COUNT(*) > 0.30
ORDER BY pct_under_supplied DESC;
-- These cells get long-term driver incentive programs (per-trip bonus, guarantees).

Worked example — surge math at 19:02 in h3_sf_market

At 19:01, ratio drops from 1.50 → 0.50 (concert lets out). 7 drivers available, 14 open requests. ETA jumps from 240s → 360s.
At 19:02, ratio crashes to 0.23 (5 drivers, 22 requests). The model fires a surge decision: new_multiplier = 2.5.
Frozen input_features_json: {ratio:0.23, eta_sec:540, event:"concert_chase_center", prev_minute_ratio:0.50}. Same inputs to model_v3.2 will always yield 2.5 — replay is deterministic.
10 min later (19:12): drivers from neighboring cells reposition (delta_drivers = +9), some riders abandon (5 abandoned out of 19), but 14 trips complete. Realized revenue: $1,240.
Effectiveness verdict: the surge worked — supply pulled in (delta_drivers > 0), revenue captured, but at the cost of 5 abandoned trips. The pricing team weighs $1,240 captured vs ~$200 estimated lost-to-abandonment.

Section 8 — Why this works

Causal chain preserved. State → Decision → Outcome are three separate facts. Each can be queried in isolation, but they join cleanly via decision_key. Effectiveness analysis is one JOIN, not a self-join puzzle.
Replay-safe by construction. Frozen input_features_json + SCD2 dim_surge_model means any historical surge can be recomputed exactly. Audits, disputes, and A/B baselines all work.
Skew managed. H3 L8 cells partition the geography evenly; downtown skew is contained per-cell. Sub-partition by HASH(h3_key) MOD 8 for the hottest few cells if needed.
Cold-start covered. When a new city launches, brg_zone_cells can map untrained cells to their parent zone's averages until enough data accrues.

Senior framing. "The surge model isn't a column — it's a fact table joined to a versioned dimension. fct_surge_decision + frozen input_features_json + SCD2 dim_surge_model is the contract that lets the pricing team A/B-test new models confidently and lets regulators get a defensible answer to 'why was this rider charged 3.2× on this date?' — every time, forever."

DoorDash / Uber Eats · 3-sided marketplace

Scenario 3 — Order & courier dispatch (with stacked deliveries)

Model DoorDash's order lifecycle from customer placement → restaurant prep → courier pickup → delivery, supporting Customers (on-time SLA), Couriers (earnings), Restaurants (operations), with native handling of stacked orders where one courier delivers multiple orders on a single dispatch.

Why this is hard — the design tension

This is a 3-sided marketplace (customer + restaurant + courier) where one fact table can't satisfy all three. Stacked orders make it worse: one courier dispatch covers multiple orders, but each order has its own customer SLA and its own restaurant relationship. Mutable tips (editable up to 24h post-delivery) demand append-only economics. The model must read clean for each stakeholder without forcing the others to filter.

Section 1 — Dimensional Model Overview

Fact Tables (the order/dispatch split + bridge)

fct_orders — The "Customer" grain. One row per customer order. Drives customer SLA + restaurant operations.
fct_dispatches — The "Courier" grain. One row per courier dispatch. May cover 1 (single) or N (stacked) orders. Drives courier earnings.
brg_dispatch_orders — The "Stacking" bridge. Links a dispatch to its 1-N orders, with leg sequence and per-order distance.
fct_order_state_events — The "Telemetry" grain. One row per order state transition (PLACED, CONFIRMED, READY, PICKED_UP, DELIVERED, CANCELLED).

Dimension Tables

dim_customers · dim_addresses (SCD2 on address — for legal/refund history)
dim_restaurants (SCD2 — hours, menu version, status)
dim_couriers (SCD2 — vehicle type, rating tier, on/off platform)
dim_h3_cell (SCD1)

Section 2 — Logical Data Model

Customer Fact: fct_orders

Column	Type	Description
`order_key / order_id`	PK / UUID	Surrogate + natural.
`customer_id, restaurant_id, delivery_h3`	FK	The 3 sides.
`placed_ts, confirmed_ts, ready_ts, picked_up_ts, delivered_ts`	TIMESTAMP_TZ	5 lifecycle timestamps; NULL until reached.
`subtotal_usd, tax_usd, delivery_fee_usd, service_fee_usd`	DECIMAL	Customer-facing.
`tip_usd`	DECIMAL	Mutable up to 24h post-delivery.
`tip_locked_ts`	TIMESTAMP_TZ	When the tip became immutable. NULL = still mutable.
`promo_applied_usd`	DECIMAL	Customer discount.
`terminal_state`	ENUM	{delivered, cancelled_customer, cancelled_restaurant, cancelled_courier, abandoned}.

Courier Fact: fct_dispatches

Column	Type	Description
`dispatch_key`	PK	Surrogate.
`courier_id`	FK	The courier on this dispatch.
`dispatch_start_ts, dispatch_end_ts`	TIMESTAMP_TZ	From "accept" to "all delivered".
`is_batched`	BOOLEAN	TRUE if N orders stacked.
`total_orders`	SMALLINT	Convenience: COUNT of orders in bridge.
`total_distance_km`	DECIMAL	Sum of all legs.
`base_pay_usd, tips_distributed_usd, payout_usd`	DECIMAL	Earnings.

Bridge: brg_dispatch_orders

Column	Type	Description
`dispatch_key, order_key`	FK	Composite PK.
`leg_sequence`	SMALLINT	1, 2, 3 — which delivery on this dispatch.
`leg_distance_km`	DECIMAL	Distance from previous leg's drop (or restaurant) to this drop.
`leg_payout_share_usd`	DECIMAL	This leg's share of the dispatch payout.

Section 3 — How the model serves each stakeholder

Customer · on-time SLA + experience

On-time: delivered_ts ≤ promised_ts from fct_orders.
Late stage detection: join fct_order_state_events to find which stage stalled (restaurant prep? courier route?).

Courier · earnings & efficiency

Earnings per hour: SUM(payout_usd) / dispatch hours from fct_dispatches.
Stack uplift: compare per-leg payout in single vs batched dispatches — quantifies the courier value of stacking.

Restaurant · operational health

Prep time variance: ready_ts − confirmed_ts per restaurant per hour. Spikes = kitchen overload.
Cancellation source: terminal_state = cancelled_restaurant rate; restaurants with >3% rate get warning.

Section 4 — Why the order/dispatch split + bridge

Three failure modes if you collapse:

Single-fact (order grain only): can't represent "courier did 2 stacked deliveries with shared pickup" — courier earnings get double-counted.
Single-fact (dispatch grain only): the customer SLA query needs to sum within bridge — ugly, error-prone.
No bridge: 1:1 dispatch:order forces non-stacking — kills the entire economic case for batching.

The bridge with leg_sequence + leg_distance + leg_payout_share enables: fct_orders stays clean (1 row = 1 customer order), fct_dispatches stays clean (1 row = 1 courier shift-segment), and the join produces correct per-leg attribution for both sides.

Enhanced graphical data model — DoorDash 3-sided marketplace

Section 7 — SQL analysis for business units

Sample data (Dispatch DSP-99: 1 courier, 2 stacked orders from the same restaurant)

INSERT INTO fct_orders VALUES
  ('O-201','CUST_A','RST_PIZZA','h3_drop1','2025-05-01 18:00','2025-05-01 18:02','2025-05-01 18:18','2025-05-01 18:22','2025-05-01 18:34', 25.00, 2.10, 2.99, 1.00, 5.00, '2025-05-02 18:34', 0.00, 'delivered'),
  ('O-202','CUST_B','RST_PIZZA','h3_drop2','2025-05-01 18:01','2025-05-01 18:03','2025-05-01 18:18','2025-05-01 18:22','2025-05-01 18:42', 32.00, 2.69, 2.99, 1.00, 6.00, '2025-05-02 18:42', 0.00, 'delivered');

INSERT INTO fct_dispatches VALUES
  ('DSP-99','COURIER_X','2025-05-01 18:18','2025-05-01 18:42', TRUE, 2, 4.2, 6.50, 11.00, 17.50);

INSERT INTO brg_dispatch_orders VALUES
  ('DSP-99','O-201', 1, 2.0, 8.50),
  ('DSP-99','O-202', 2, 2.2, 9.00);

A. Customer · order latencies & SLA

Q1 — Stage-by-stage latency for the order, with whose-fault attribution

SELECT
  order_id,
  EXTRACT(EPOCH FROM (confirmed_ts - placed_ts)) / 60   AS confirm_min,
  EXTRACT(EPOCH FROM (ready_ts - confirmed_ts)) / 60    AS prep_min,
  EXTRACT(EPOCH FROM (picked_up_ts - ready_ts)) / 60    AS pickup_wait_min,
  EXTRACT(EPOCH FROM (delivered_ts - picked_up_ts)) / 60 AS drive_min,
  EXTRACT(EPOCH FROM (delivered_ts - placed_ts)) / 60   AS total_min,
  CASE
    WHEN (ready_ts - confirmed_ts)   > INTERVAL '20 min' THEN 'restaurant_slow'
    WHEN (picked_up_ts - ready_ts)   > INTERVAL '10 min' THEN 'courier_late_pickup'
    WHEN (delivered_ts - picked_up_ts) > INTERVAL '20 min' THEN 'courier_slow_drive'
    ELSE 'on_track' END                                 AS bottleneck
FROM fct_orders WHERE DATE(placed_ts) = '2025-05-01';

B. Courier · earnings & stack uplift

Q2 — Earnings per hour by stacked vs single dispatch

SELECT
  is_batched,
  COUNT(*)                                                      AS dispatches,
  ROUND(SUM(payout_usd) /
        SUM(EXTRACT(EPOCH FROM (dispatch_end_ts - dispatch_start_ts))/3600), 2) AS earnings_per_hr,
  ROUND(SUM(payout_usd) / SUM(total_orders), 2)                 AS payout_per_order,
  ROUND(AVG(total_distance_km / total_orders), 2)               AS km_per_order
FROM fct_dispatches
WHERE DATE(dispatch_start_ts) BETWEEN '2025-05-01' AND '2025-05-31'
GROUP BY is_batched;
-- Quantifies the courier value of stacking; if stacked is < 1.3× single payout/hr, batching isn't worth it.

C. Restaurant · prep time & cancellation

Q3 — Prep time p50/p95 per restaurant per day

SELECT
  restaurant_id,
  DATE(placed_ts) AS d,
  COUNT(*) AS orders,
  PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (ready_ts - confirmed_ts))/60) AS prep_p50_min,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (ready_ts - confirmed_ts))/60) AS prep_p95_min,
  ROUND(100.0 * COUNT(*) FILTER (WHERE terminal_state = 'cancelled_restaurant') / COUNT(*), 1) AS cancel_pct
FROM fct_orders
WHERE confirmed_ts IS NOT NULL AND ready_ts IS NOT NULL
GROUP BY restaurant_id, DATE(placed_ts)
ORDER BY prep_p95_min DESC;
-- Restaurants with prep_p95 > 25min get warned; > 35min get throttled (lower placement priority).

Worked example — Stacked dispatch DSP-99

Courier X accepts dispatch at 18:18 with 2 stacked orders from RST_PIZZA. Pickup is shared (single restaurant); 2 different drop locations.
Order O-201 delivered first at 18:34 (16 min after pickup, 2.0 km).
Order O-202 delivered second at 18:42 (24 min after pickup, additional 2.2 km from drop1).
Courier earnings: base_pay $6.50 + tips $11.00 = $17.50 for 24 min of dispatch time → $43.75/hr. A single-order dispatch in the same area pays ~$28/hr.
Bridge: O-201 leg_payout_share = $8.50, O-202 leg_payout_share = $9.00. Sum = $17.50 ✓
Customer SLAs: O-201 on-time (16 min total); O-202 was 8 min late (waited for O-201 drop). Stacking trade-off: courier earnings up, second customer's SLA at risk.

Section 8 — Why this works

Three stakeholders, three grains, one model. Customer queries hit fct_orders; courier queries hit fct_dispatches; the bridge ties them only when stacking math matters.
Stack uplift is measurable. Compare is_batched=TRUE vs FALSE earnings/hr — if stacking doesn't beat single by ~30%, dispatch should stop offering it.
Mutable tips handled cleanly. Append correction rows; tip_locked_ts tracks the immutability boundary. Finance restates from tip_locked_ts.
Stage-attribution. The bottleneck CASE WHEN in Q1 turns "the order was 18 min late" into "the restaurant prep took 22 min" — actionable per stakeholder.

Senior framing. "The 3-sided design lets us answer three different stakeholders in one model: was the customer's order on time? (joins on fct_orders), did the courier earn enough per hour? (aggregates fct_dispatches), did the restaurant lose orders to long prep times? (joins on dim_restaurants). The bridge is what makes stacking a first-class concept rather than a hack."

Google · search ads · auction

Scenario 4 — Search ads auction → impression → click → conversion

Design a model for Google search ads that captures the entire chain from auction clearing → impression → click → conversion, supporting advertiser billing, quality-score recomputation, ranker training, incrementality measurement, and re-runnable attribution windows.

Why this is hard — the design tension

This is a causal chain across 4 facts with extreme asymmetry: 2 billion ad requests/day → 200 billion targeting lookups → ~100 candidate ads per request → 1 winner. Loser logs are 20× winner volume. Auctions clear in <100ms but conversions land days later. The model must keep the join key (auction_decision_id) deterministic across all stages, support re-attribution when models change, and protect privacy under ATT/Sandbox.

Section 1 — Dimensional Model Overview

Fact Tables (the auction chain)

fct_auctions — The "Decision" grain. One row per auction clearing. Carries auction_decision_id.
fct_impressions — The "Served" grain. One row per impression (auction winner served on the page).
fct_clicks — The "Click" grain. One row per click on an impression. Mutable: invalid-click invalidation comes later.
fct_conversions — The "Conversion" grain. One row per advertiser-reported conversion event.
fct_attributions — The "Attribution" grain. One row per (conversion × eligible-touchpoint × model × run). Append-only per attribution_run_id.
fct_auction_losers — The "Counterfactual" grain. Sampled losing candidates (1%) for ranker training.

Dimension Tables

dim_advertisers, dim_campaigns, dim_ad_groups, dim_creatives — all SCD2 (budget, targeting, creative drift).
dim_keywords — SCD1, high cardinality.
dim_users · privacy-tokenized (Customer Match, FLoC/Topics, ATT bucket).
dim_attribution_models — SCD2 — versioned (last_click, position_based, data_driven_v3, etc.).

Section 2 — Logical Data Model (key columns only)

fct_auctions

Column	Type	Description
`auction_decision_id`	PK STRING	The join key for the entire chain.
`query_text_hash`	STRING	Hash of search query — never raw.
`keyword_id, user_token, device_id`	FK	Targeting context.
`auction_ts`	TIMESTAMP_TZ	Within ~100 ms of request.
`winning_ad_id, clearing_price_local, fx_rate_to_usd, clearing_price_usd`	—	Second-price clearance.
`privacy_path`	STRING	{standard, ATT_optout, sandbox, FLoC}.

fct_impressions — joined to auctions on auction_decision_id; carries served_ts, creative_id, predicted_pCTR, quality_score, ad_rank.

fct_clicks — impression_id, click_ts, is_invalid, invalidation_ts (slowly-changing).

fct_conversions — conversion_id, advertiser_id, conversion_ts, conversion_value_usd, conversion_type (purchase, signup, install).

fct_attributions — conversion_id, click_id, impression_id, model_id, attribution_run_id, fractional_credit, window_type. Multiple rows per conversion; one per (model × run × eligible touchpoint).

Section 3 — How the model serves each stakeholder

Advertiser · billing & ROAS

Spend by campaign: SUM(clearing_price_usd) from fct_impressions joined back to fct_auctions.
ROAS = SUM(conversion_value_usd attributed) / SUM(clearing_price_usd) — uses the latest attribution_run_id.

Ranker team · model training & quality score

fct_auction_losers + fct_impressions = balanced training set for pCTR model.
Quality Score retrospective: join historical creative SCD2 row + impression's predicted_pCTR to debug regressions.

Attribution & measurement

Re-attribution: when the data-driven model upgrades from v2→v3, run a new attribution_run_id over the same conversions and impressions. Old run is preserved — no data lost.
Incrementality: compare conversion-rate of exposed (impression served) vs counterfactual (won auction but didn't serve due to PSA / blank slot) — fct_auction_losers partly enables this.

Section 4 — Why `auction_decision_id` + append-only attribution

One key, four facts. auction_decision_id is generated once at auction-time and travels with the impression, click, conversion. No fuzzy joins, no timestamp matching.
Re-runnable attribution. Attribution is computed, not collected. New models = new attribution_run_id; the latest_run view picks the most recent. Advertisers can compare last-click vs data-driven side by side.
Invalid clicks as a slowly-changing fact. Click validation is asynchronous (bot detection takes hours to days). Append is_invalid=TRUE with invalidation_ts rather than UPDATE — preserves historical billing audit.
Loser sampling at 1%. 20× winners by volume. Reservoir sample to 1% and re-weight in training; full retention is unaffordable.

Enhanced graphical data model — Search Ads Auction Chain

Section 7 — SQL analysis for business units

Sample data — Nike "running shoe" auction

-- Auction at 14:32:01 — 5 candidates compete; Nike wins on ad_rank 0.0751
INSERT INTO fct_auctions VALUES
  ('AUC_NIKE_001','q_runningshoe','kw_runningshoe','U_TOK_42','dev_iphone',
   '2025-05-01 14:32:01.123','AD_NIKE_RUN','1.92','1.0','1.92','standard');

-- Impression served
INSERT INTO fct_impressions VALUES
  ('IMP_001','AUC_NIKE_001','AD_NIKE_RUN',0.034,0.92,'2025-05-01 14:32:01.180', 0.0751);

-- User clicks 4 seconds later
INSERT INTO fct_clicks VALUES ('CLK_001','IMP_001','2025-05-01 14:32:05',FALSE,NULL);

-- Conversion 17 hours later (next-day purchase)
INSERT INTO fct_conversions VALUES
  ('CONV_001','ADV_NIKE','2025-05-02 07:42:00','purchase',129.99);

-- Two attribution rows: last-click model + DDA model on the same conversion
INSERT INTO fct_attributions VALUES
  ('ATTR_001','CONV_001','IMP_001','CLK_001','MODEL_LAST_CLICK','RUN_2025_05_02', 1.000, 'click_24h', 129.99),
  ('ATTR_002','CONV_001','IMP_001','CLK_001','MODEL_DDA_v3',    'RUN_2025_05_02', 0.620, 'click_24h',  80.59);

-- Sampled losers (1%)
INSERT INTO fct_auction_losers VALUES
  ('AUC_NIKE_001','AD_ASICS_GEL', 0.0572, 100),  -- weight = 1/sample_rate
  ('AUC_NIKE_001','AD_STRAVA',    0.0408, 100);

A. Advertiser · ROAS & ranker debug

Q1 — CTR + CPA + ROAS per campaign (latest attribution run)

WITH latest_run AS (
  SELECT MAX(attribution_run_id) AS run_id FROM fct_attributions
),
campaign_metrics AS (
  SELECT
    c.campaign_id,
    COUNT(DISTINCT i.impression_id)                                               AS impressions,
    COUNT(DISTINCT cl.click_id) FILTER (WHERE cl.is_invalid = FALSE)              AS clicks,
    SUM(a.clearing_price_usd)                                                     AS spend_usd,
    COUNT(DISTINCT atr.conversion_id)                                             AS conversions,
    SUM(atr.attributed_value_usd)                                                 AS attributed_revenue
  FROM fct_auctions a
  JOIN fct_impressions i  USING (auction_decision_id)
  JOIN dim_creatives  cr  ON i.creative_id = cr.creative_id
  JOIN dim_campaigns  c   ON cr.campaign_id = c.campaign_id
  LEFT JOIN fct_clicks       cl  USING (impression_id)
  LEFT JOIN fct_attributions atr ON atr.click_id = cl.click_id
                                AND atr.attribution_run_id = (SELECT run_id FROM latest_run)
                                AND atr.model_id = 'MODEL_LAST_CLICK'
  WHERE a.auction_ts >= CURRENT_DATE - INTERVAL '7 days'
  GROUP BY c.campaign_id
)
SELECT campaign_id, impressions, clicks, spend_usd, conversions, attributed_revenue,
  ROUND(100.0 * clicks / NULLIF(impressions, 0), 3)                AS ctr_pct,
  ROUND(spend_usd / NULLIF(clicks, 0), 2)                          AS cpc_usd,
  ROUND(spend_usd / NULLIF(conversions, 0), 2)                     AS cpa_usd,
  ROUND(attributed_revenue / NULLIF(spend_usd, 0), 2)              AS roas
FROM campaign_metrics ORDER BY roas DESC;

B. Attribution & measurement

Q2 — Last-click vs DDA: which model gives higher attributed revenue?

SELECT
  atr.model_id,
  COUNT(DISTINCT atr.conversion_id)              AS conversions,
  ROUND(SUM(atr.attributed_value_usd), 2)        AS total_attributed_value,
  ROUND(AVG(atr.fractional_credit), 3)           AS avg_credit
FROM fct_attributions atr
WHERE atr.attribution_run_id = (SELECT MAX(attribution_run_id) FROM fct_attributions)
GROUP BY atr.model_id;
-- Last-click gives full credit (credit=1.0). DDA distributes — useful for diagnosing over-attribution.

C. Ranker · quality score & pCTR calibration

Q3 — Predicted vs realized pCTR per creative (calibration check)

SELECT
  i.creative_id,
  COUNT(*)                                                                   AS impressions,
  ROUND(AVG(i.predicted_pCTR), 4)                                            AS predicted_pCTR,
  ROUND(COUNT(*) FILTER (WHERE c.click_id IS NOT NULL)::DECIMAL / COUNT(*), 4) AS realized_pCTR,
  ROUND(100.0 * (COUNT(*) FILTER (WHERE c.click_id IS NOT NULL)::DECIMAL / COUNT(*) - AVG(i.predicted_pCTR))
        / AVG(i.predicted_pCTR), 1)                                          AS calibration_error_pct
FROM fct_impressions i
LEFT JOIN fct_clicks c USING (impression_id)
WHERE i.served_ts >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY i.creative_id
HAVING COUNT(*) > 1000
ORDER BY ABS(calibration_error_pct) DESC;
-- Creatives with calibration_error > 20% need ranker retraining attention.

Worked example — Nike "running shoe" auction clearing & attribution

Auction: 5 candidates compete. ad_rank = bid × pCTR × quality:
- Nike: $2.40 × 0.034 × 0.92 = 0.0751 ← winner
- Asics: $2.10 × 0.029 × 0.94 = 0.0572
- Strava: $1.80 × 0.028 × 0.81 = 0.0408
Second-price clearing: Nike pays just enough to beat Asics's ad_rank — at Nike's quality 0.92, that's $1.92, not $2.40. Savings of $0.48 = trust signal.
Impression served at 14:32:01. User clicks at 14:32:05.
Conversion 17 hours later: $129.99 purchase.
Last-click attribution (MODEL_LAST_CLICK): credit 1.0 → $129.99 attributed to this click.
Data-driven attribution (MODEL_DDA_v3): credit 0.62 → $80.59 attributed to this click. The remaining $49.40 attributed to upstream touchpoints.
Nike's ROAS under last-click = $129.99 / $1.92 = 67.7×. Under DDA = $80.59 / $1.92 = 42.0×. Both are stored side-by-side via attribution_run_id.

Section 8 — Why this works

One join key, four facts. auction_decision_id threads through impression, click, conversion. ROAS query is a 4-way join on a single column — efficient and correct.
Re-attribution is cheap. New model = new attribution_run_id. Old runs preserved; advertisers compare side-by-side.
Invalid clicks handled losslessly. Append-only is_invalid + invalidation_ts preserves billing history while crediting advertisers.
Loser sampling tractable. 1% sampled with sample_weight column; ranker training works without 20× data volume.

Senior framing. "The auction_decision_id is the join key that turns four independent facts into a coherent advertiser narrative — and the attribution_run_id + dim_attribution_models SCD2 makes re-running attribution a query, not a re-ingest. Last-click vs DDA isn't an ETL decision — it's a column in the SELECT."

Meta / Facebook · cross-device attribution

Scenario 5 — Cross-device attribution & identity graph

Design a model for Facebook ads where a user sees an ad on phone, opens the advertiser's site on laptop later, and converts on tablet a week after. Support 1/7/28-day click + view-through windows, ATT/SKAdNetwork iOS constraints, and incrementality lift measurement.

Why this is hard — the design tension

One human, many devices. Apple's ATT cuts user-level signals on iOS. SKAdNetwork only returns aggregated postbacks. The model must support cross-device joining via a probabilistic+deterministic identity graph, multiple attribution windows in parallel (1d / 7d / 28d × click vs view), and two privacy paths (user-level for opted-in, aggregate-only for SKAN).

Section 1 — Dimensional Model Overview

Fact Tables

fct_ad_events — The "Touch" grain. One row per impression OR click (event-typed).
fct_conversions — The "Conversion" grain. One row per advertiser-reported conversion.
fct_skan_postbacks — The iOS-aggregate grain. One row per SKAdNetwork postback (no user-level data).
fct_attributions — The "Match" grain. One row per (conversion × eligible touchpoint × window × model × run). Append-only via attribution_run_id.

Dimension & Bridge Tables

dim_identities — One row per resolved person. Probabilistic + deterministic.
bridge_identity_devices — SCD2. Many devices per identity; link_confidence drives weighting.
dim_advertisers, dim_campaigns, dim_creatives — SCD2.
dim_user_holdouts — Holdout assignment for incrementality (SCD2 per advertiser × period).

Section 2 — Logical Data Model (key columns)

bridge_identity_devices (the heart of cross-device)

Column	Type	Description
`identity_id, device_id_token`	FK	Composite link.
`link_confidence`	DECIMAL(4,3)	1.0 = deterministic (login). <1.0 = probabilistic (IP, behavior).
`link_method`	STRING	{login, email_match, household_ip, behavioral, deterministic_id}.
`effective_from, effective_to`	TIMESTAMP_TZ	SCD2 — devices come and go.

fct_ad_events — event_id, event_type ∈ {impression, click}, identity_id (NULL on iOS), device_id_token, campaign_id, creative_id, placement, event_ts, privacy_path.

fct_attributions — conversion_id, ad_event_id, window_type ∈ {click_1d, click_7d, view_1d, view_28d, ...}, attribution_model, attribution_run_id, is_attributed BOOLEAN, fractional_credit DECIMAL. Partitioned by attribution_run_id.

Section 3 — How the model serves each stakeholder

Advertiser · ROAS & window comparison

Run multiple window_type values in parallel: click_1d, click_7d, view_28d. The advertiser picks; we report all.
Cross-device join: phone impression + laptop click + tablet conversion all join via identity_id from bridge_identity_devices.

Privacy / iOS · ATT & SKAN

privacy_path = 'att_optout' events have NULL identity_id; only aggregate roll-ups by campaign × placement × day are valid.
fct_skan_postbacks is a separate fact — never joins to user-level. Aggregated reporting only.

Measurement · incrementality lift

Hold out N% of users from receiving any ad (in dim_user_holdouts). Lift = (treatment conversion rate) − (holdout conversion rate).
The holdout group does not appear in fct_attributions at all — they're conversions without ad events.

Section 4 — Why "match at read, not at write"

The decisive design choice is to never compute attribution at ingest time. Every conversion is paired with every eligible touchpoint, and the attribution model is a function evaluated at read time. Three benefits:

Re-attribution is a query, not a re-ingest. Switch from last-click to data-driven without rebuilding the fact table.
Multiple windows in parallel. Same conversion appears in click_1d, click_7d, and view_28d rows simultaneously.
Audit-ready. When an advertiser disputes attribution, you produce the exact attribution_run_id that ran on date X with model Y. Defensible.

Enhanced graphical data model — Cross-device Attribution

Section 7 — SQL analysis for business units

Sample data — 1 user (3 devices) → 3 ad events → 1 conversion

INSERT INTO dim_identities VALUES ('IDT_001','PERSON_TOK_42','login_email');

INSERT INTO bridge_identity_devices VALUES
  ('IDT_001','DEV_PHONE',  1.000,'login',        '2024-01-01',NULL),
  ('IDT_001','DEV_LAPTOP', 0.920,'household_ip', '2024-03-15',NULL),
  ('IDT_001','DEV_TABLET', 0.880,'behavioral',   '2024-06-22',NULL);

-- 3 ad events across 3 devices, same identity:
INSERT INTO fct_ad_events VALUES
  ('AE_001','impression','IDT_001','DEV_PHONE','CMP_NIKE','feed','2025-04-25 09:15:00','opted_in'),
  ('AE_002','click',     'IDT_001','DEV_LAPTOP','CMP_NIKE','feed','2025-04-28 14:32:00','opted_in'),
  ('AE_003','impression','IDT_001','DEV_TABLET','CMP_NIKE','reel','2025-04-30 21:00:00','opted_in');

-- Conversion 1 week after first impression
INSERT INTO fct_conversions VALUES
  ('CONV_777','ADV_NIKE','2025-05-02 11:42:00','purchase',129.99,'IDT_001');

-- Attribution rows for window=7d_click + last_click vs MTA
INSERT INTO fct_attributions VALUES
  ('ATR_01','CONV_777','AE_002','MODEL_LAST_CLICK','RUN_2025_05_03','click_7d',TRUE,1.000,129.99),
  ('ATR_02','CONV_777','AE_001','MODEL_MTA',       'RUN_2025_05_03','view_28d', TRUE,0.300, 38.99),
  ('ATR_03','CONV_777','AE_002','MODEL_MTA',       'RUN_2025_05_03','click_7d', TRUE,0.500, 65.00),
  ('ATR_04','CONV_777','AE_003','MODEL_MTA',       'RUN_2025_05_03','view_28d', TRUE,0.200, 26.00);

A. Advertiser · cross-device ROAS by window

Q1 — Conversions attributed by window_type (1d, 7d, 28d) under last-click

SELECT
  atr.window_type,
  COUNT(DISTINCT atr.conversion_id)         AS conversions_attributed,
  ROUND(SUM(atr.attributed_value_usd), 2)   AS attributed_value
FROM fct_attributions atr
WHERE atr.attribution_run_id = (SELECT MAX(attribution_run_id) FROM fct_attributions)
  AND atr.model_id = 'MODEL_LAST_CLICK'
  AND atr.is_attributed = TRUE
GROUP BY atr.window_type ORDER BY attributed_value DESC;
-- Lets advertiser see how many conversions fall inside each window. Picks the right window for their funnel.

Q2 — Cross-device match rate (% conversions linked to ≥1 ad event via identity)

WITH conv AS (
  SELECT conversion_id, identity_id FROM fct_conversions
  WHERE conversion_ts >= CURRENT_DATE - INTERVAL '30 days'
),
matched AS (
  SELECT DISTINCT atr.conversion_id FROM fct_attributions atr
  JOIN conv USING (conversion_id)
  WHERE atr.is_attributed = TRUE
)
SELECT
  COUNT(DISTINCT c.conversion_id) AS total_conversions,
  COUNT(DISTINCT m.conversion_id) AS matched_conversions,
  ROUND(100.0 * COUNT(DISTINCT m.conversion_id) / COUNT(DISTINCT c.conversion_id), 1) AS match_rate_pct
FROM conv c LEFT JOIN matched m USING (conversion_id);

B. Measurement · incrementality lift

Q3 — Conversion rate: treatment vs holdout

WITH treatment AS (
  SELECT i.identity_id,
    EXISTS (SELECT 1 FROM fct_conversions c WHERE c.identity_id = i.identity_id
            AND c.advertiser_id = 'ADV_NIKE'
            AND c.conversion_ts BETWEEN '2025-04-01' AND '2025-05-01') AS converted
  FROM dim_identities i
  WHERE i.identity_id NOT IN (SELECT identity_id FROM dim_user_holdouts
                              WHERE advertiser_id = 'ADV_NIKE'
                                AND period = '2025-04')
),
holdout AS (
  SELECT i.identity_id,
    EXISTS (SELECT 1 FROM fct_conversions c WHERE c.identity_id = i.identity_id
            AND c.advertiser_id = 'ADV_NIKE'
            AND c.conversion_ts BETWEEN '2025-04-01' AND '2025-05-01') AS converted
  FROM dim_identities i
  JOIN dim_user_holdouts h ON i.identity_id = h.identity_id
  WHERE h.advertiser_id = 'ADV_NIKE' AND h.period = '2025-04'
)
SELECT 'treatment' AS arm, COUNT(*) AS n, ROUND(100.0 * AVG(converted::INT), 3) AS conv_rate_pct FROM treatment
UNION ALL
SELECT 'holdout',  COUNT(*),   ROUND(100.0 * AVG(converted::INT), 3) FROM holdout;
-- Lift = treatment_rate - holdout_rate. Statistical significance = Welch's t-test on the two proportions.

Worked example — User IDT_001 attribution math

User IDT_001 sees Nike impression on phone (Apr 25), clicks Nike ad on laptop (Apr 28), sees Nike Reels on tablet (Apr 30).
Buys Nike shoes for $129.99 on May 2.
Last-click (click_7d window): credit 1.0 → all $129.99 attributed to AE_002 (the laptop click).
MTA model (multi-touch): AE_001 view 30%, AE_002 click 50%, AE_003 view 20% → $38.99 + $65.00 + $26.00 = $129.99 ✓
Identity bridge enabled the cross-device join; without it the 3 events wouldn't connect. link_confidence on each device drives MTA weighting in advanced models.
Both attribution rows live in fct_attributions with attribution_run_id = RUN_2025_05_03. Switching models = SELECT, not re-ingest.

Section 8 — Why this works

Identity graph as a first-class fact. bridge_identity_devices SCD2 with confidence is how cross-device works without lying about certainty.
Match at read. Multiple windows + multiple models computed in parallel; advertiser picks at query time.
iOS / SKAN carve-out. Privacy path is explicit; aggregate-only queries fall back to fct_skan_postbacks.
Incrementality real. Holdout assignment is a separate dim, not derived — survives campaign re-orgs.

Senior framing. "The decisive design choice is to never compute attribution at write time — every conversion is paired with every eligible ad event, and the attribution model is a function evaluated at read time. That's how we re-attribute when an advertiser switches from last-click to data-driven without re-ingesting raw events. The identity graph as a versioned bridge is what makes cross-device honest about its uncertainty."

Netflix · CTV ads · inventory & pacing

Scenario 6 — CTV ad inventory, pacing & frequency capping

Design a model for Netflix's ad-supported tier — capturing every ad slot opportunity (filled or unfilled), every ad served, frequency caps per user, and contractual pacing guarantees per advertiser ("guaranteed 10M impressions by month-end with make-good if under-delivered").

Why this is hard — the design tension

CTV inventory has a contractual layer that programmatic doesn't: guaranteed buys with monetary penalties for under-delivery. Three pressures pull simultaneously: (1) maximize fill rate, (2) honor frequency caps so users don't see the same ad 8 times, (3) hit committed impressions per advertiser per month. Modeling unfilled opportunities (not just impressions) is the senior move — it's the only way to answer "why did we leave money on the table?"

Section 1 — Dimensional Model Overview

Fact Tables

fct_ad_opportunities — The "Slot" grain. One row per ad-slot opportunity in a viewing session. Filled or unfilled — both kept.
fct_ad_impressions — The "Served" grain. Subset of opportunities where an ad actually played.
fct_pacing_daily — The "Contract" grain. One row per (campaign × day) aggregating delivered vs committed.
fct_frequency_state — The "Cap" grain. One row per (profile × campaign × day) tracking impressions toward cap.
fct_makegood_obligations — The "Owed" grain. One row per (campaign × month) when committed not met. Drives next-month bonus inventory.

Dimension Tables

dim_advertisers, dim_campaigns, dim_creatives (SCD2)
dim_campaign_flights — Period a campaign runs (start_date / end_date / total_committed_impressions)
dim_titles (SCD2 for content metadata; ratings drive ad eligibility)
dim_profiles (SCD2; subscription tier = ad-supported / premium)

Section 2 — Logical Data Model (key columns)

fct_ad_opportunities — opportunity_id, profile_id, title_id, session_id, slot_type ∈ {pre_roll, mid_roll, post_roll}, slot_position_sec, slot_duration_sec, opportunity_ts, predicted_cpm, served_creative_id (NULL=unfilled), fill_reason {won_auction, guaranteed_buy, house_ad}, unfilled_reason {frequency_capped, no_eligible_demand, content_unsuitable}.

fct_pacing_daily — campaign_id, day, delivered, committed_today, cumulative_delivered, total_committed, pacing_index (1.0 = on track), days_remaining.

fct_makegood_obligations — campaign_id, original_period, makegood_period, owed_impressions, owed_value_usd, status {pending, scheduled, fulfilled}.

Section 3 — How the model serves each stakeholder

Ad server · live decisions — for each opportunity, query fct_frequency_state + fct_pacing_daily + active campaign list. Pick a creative or mark unfilled.

Account managers · advertiser pacing — track pacing_index per campaign daily. If <0.95 with <5 days left, increase delivery priority.

Finance · make-good bookkeeping — fct_makegood_obligations tracks dollars owed when committed impressions not met. Liability on the balance sheet until fulfilled.

Exec · "money on the table" analysis — count fct_ad_opportunities with served_creative_id IS NULL grouped by unfilled_reason. Tells you if the bottleneck is demand (sales team) or eligibility rules (ops team).

Section 4 — Why model unfilled opportunities

Without unfilled rows: can't compute fill rate. Can't tell if you under-delivered because of low demand vs frequency caps vs content unsuitability.
With unfilled rows: fill rate = COUNT(filled) / COUNT(total opportunities). Drill into unfilled_reason to find the right team to fix.
Cost: doubles the row count. Tradeoff is worth it; partition by date keeps queries fast.

Enhanced graphical data model — Netflix CTV Inventory

Section 7 — SQL analysis

Sample data — 1 campaign, 5 opportunities (3 filled / 2 unfilled), 1 day pacing

INSERT INTO dim_campaigns VALUES ('CMP_FORD','ADV_FORD','2025-05-01','2025-05-31',5000000,1.0);

INSERT INTO fct_ad_opportunities VALUES
  ('OPP_001','PROF_A','TITLE_X','SESS_1','pre_roll',  0,  15, 25.50, 'CRT_FORD_15S', 'guaranteed_buy', NULL,                  '2025-05-01 19:32:00'),
  ('OPP_002','PROF_A','TITLE_X','SESS_1','mid_roll',1200, 15, 25.50, NULL,            NULL,             'frequency_capped',     '2025-05-01 19:52:00'),
  ('OPP_003','PROF_B','TITLE_Y','SESS_2','pre_roll',  0,  15, 25.50, 'CRT_FORD_15S', 'guaranteed_buy', NULL,                  '2025-05-01 20:05:00'),
  ('OPP_004','PROF_C','TITLE_Z','SESS_3','pre_roll',  0,  30, 25.50, NULL,            NULL,             'no_eligible_demand',   '2025-05-01 20:22:00'),
  ('OPP_005','PROF_D','TITLE_X','SESS_4','mid_roll',1500, 15, 25.50, 'CRT_FORD_15S', 'guaranteed_buy', NULL,                  '2025-05-01 21:00:00');

-- Pacing for the day:
INSERT INTO fct_pacing_daily VALUES ('CMP_FORD','2025-05-01', 3, 161290, 3, 5000000, 0.000, 30);
-- Way under: should have delivered 161290 today, only got 3.

A. Operations · fill rate & unfilled diagnostics

Q1 — Fill rate & unfilled-reason breakdown (the "money on the table" report)

SELECT
  slot_type,
  COUNT(*)                                                              AS opportunities,
  COUNT(*) FILTER (WHERE served_creative_id IS NOT NULL)                AS filled,
  ROUND(100.0 * COUNT(*) FILTER (WHERE served_creative_id IS NOT NULL)
        / COUNT(*), 1)                                                  AS fill_rate_pct,
  COUNT(*) FILTER (WHERE unfilled_reason = 'frequency_capped')          AS cap_blocked,
  COUNT(*) FILTER (WHERE unfilled_reason = 'no_eligible_demand')        AS no_demand,
  COUNT(*) FILTER (WHERE unfilled_reason = 'content_unsuitable')        AS content_block
FROM fct_ad_opportunities
WHERE DATE(opportunity_ts) = '2025-05-01'
GROUP BY slot_type ORDER BY opportunities DESC;
-- "no_eligible_demand" rows = sales team to fix. "frequency_capped" = ops/cap-policy team.

B. Account managers · pacing alerts

Q2 — Campaigns at risk of under-delivery (pacing < 0.95 with <5 days remaining)

SELECT p.campaign_id, c.advertiser_id,
  p.cumulative_delivered, p.total_committed,
  ROUND(100.0 * p.cumulative_delivered / NULLIF(p.total_committed, 0), 1) AS pct_delivered,
  p.pacing_index, p.days_remaining,
  ROUND((p.total_committed - p.cumulative_delivered) / NULLIF(p.days_remaining, 0)) AS daily_needed
FROM fct_pacing_daily p JOIN dim_campaigns c USING (campaign_id)
WHERE p.day = CURRENT_DATE
  AND p.pacing_index < 0.95
  AND p.days_remaining < 5
ORDER BY p.pacing_index ASC;
-- Triggers ad-server priority bump + account-manager Slack alert.

C. Finance · make-good liability

Q3 — Pending make-good liability ($) by advertiser

SELECT
  c.advertiser_id,
  COUNT(DISTINCT mg.campaign_id)                AS campaigns_owed,
  SUM(mg.owed_impressions)                      AS total_impressions_owed,
  ROUND(SUM(mg.owed_value_usd), 2)              AS total_dollars_owed,
  COUNT(*) FILTER (WHERE mg.status = 'pending') AS pending_count
FROM fct_makegood_obligations mg
JOIN dim_campaigns c USING (campaign_id)
WHERE mg.status IN ('pending','scheduled')
GROUP BY c.advertiser_id
ORDER BY total_dollars_owed DESC;
-- Books a liability on Netflix's balance sheet until fulfilled.

Worked example — CMP_FORD on May 1

Total committed: 5,000,000 impressions over May. Daily target: 161,290.
Day 1 delivered: 3 impressions (out of 5 opportunities). 2 unfilled — 1 frequency_capped, 1 no_eligible_demand.
pacing_index = 3 / 161,290 = 0.0000186 — catastrophic miss.
If pattern continues 30 days: ~90 delivered total. Make-good owed = (5M − 90) × CPM-based dollar rate.
Queries trigger: account manager alert (Q2), finance liability booking (Q3), unfilled-reason audit (Q1) reveals "no_eligible_demand" needs sales-team push.

Section 8 — Why this works

Unfilled rows are first-class. Fill rate becomes a simple COUNT FILTER, not a derived metric.
Pacing as a daily snapshot. Not derived on the fly — pre-aggregated for fast dashboard reads.
Make-goods as a liability fact. Finance owns it; it's a real balance-sheet item, not a metric.
Frequency caps as state. Per (profile × campaign × day) — exactly what the ad server queries pre-decision.

Senior framing. "Modeling unfilled opportunities (not just impressions) is the senior move — it's the only way to answer 'why did we leave money on the table this quarter?' which is the exec-level question this data product exists for. Pacing as snapshot + make-good as liability turns the contractual layer into queryable finance, not a sales spreadsheet."

Amazon · e-commerce · returns & inventory

Scenario 7 — Orders, returns & multi-warehouse inventory

Design a model for Amazon orders that handles partial returns, replacements, marketplace seller orders (1P vs FBA vs 3P), and inventory drawn from multiple warehouses with FIFO/LIFO consumption rules.

Why this is hard — the design tension

Returns reshape revenue retroactively. Multi-warehouse fulfillment splits a single SKU across two warehouses. Seller-of-record (1P retail / 2P FBA / 3P marketplace) determines whether Amazon books GMV or commission. And inventory has to reconcile to ledger every night. The model must be append-only on the financial side and movement-based on the physical side — same product, two grains, one source of truth.

Section 1 — Dimensional Model Overview

Fact Tables

fct_order_lines — The "Sale" grain. One row per (order × product × source_warehouse). 3-item order = 3 rows.
fct_returns — The "Refund" grain. One row per return event, line-level. Append-only.
fct_inventory_movements — The "Stock-flow" grain. One row per stock change (in / out / transfer / damage / adjustment).
snap_inventory_daily — The "State" grain. Daily snapshot per (product × warehouse).
fct_replacements — The "Linked-order" grain. Maps replacement orders back to their original.

Dimension Tables

dim_products (SCD2 — title, category, dimensions drift)
dim_sellers (SCD2 — seller_type ∈ {retail_1p, fba_2p, marketplace_3p}, recognized_as ∈ {gmv, commission_only})
dim_warehouses, dim_customers + dim_addresses (SCD2), dim_promotions (SCD2)

Section 2 — Logical Data Model (key columns)

fct_order_lines — order_line_id, order_id, line_number, customer_id, product_id, seller_id, source_warehouse_id, ordered_ts, shipped_ts, delivered_ts, quantity, unit_price_usd, promo_discount, tax, shipping, gross_revenue_usd, is_prime, is_subscribe_save.

fct_returns — return_id, order_line_id, return_ts, refund_ts, reason_code, quantity (partial OK), refund_amount_usd, restocking_fee_usd, condition_received {new, used, damaged}, is_replacement BOOL.

fct_inventory_movements — movement_id, product_id, warehouse_id, movement_ts, movement_type {receipt, sale_out, return_in, transfer_out, transfer_in, damage, adjustment}, quantity (signed), unit_cost_usd, related_order_line_id (NULL for non-sale moves).

snap_inventory_daily — product_id, warehouse_id, snapshot_date, opening_qty, receipts, sales, returns_in, transfers_net, damage, ending_qty, weighted_avg_cost. Reconciliation invariant: opening + receipts + returns_in − sales − damage + transfers_net = ending.

Section 3 — How the model serves each stakeholder

Customer experience — order delivery SLA from fct_order_lines (ordered → delivered). Return rate by category from fct_returns tracks product quality.

Finance — net revenue = SUM(order_lines.gross_revenue_usd) − SUM(returns.refund_amount_usd). For 3P sellers, dim_sellers.recognized_as = 'commission_only' means only the commission is booked.

Operations — multi-warehouse fill rate, transfer efficiency, days of inventory on hand. Reconciliation runs nightly: snap_inventory_daily ending_qty must match SUM(fct_inventory_movements) since opening.

Section 4 — Why append-only returns + movement-based inventory

Returns reshape historical revenue. Never UPDATE fct_order_lines. Append fct_returns. Finance can restate any historical period correctly.
Inventory snapshots derive from movements. Snapshots are pre-aggregated for fast read; movements are the immutable source. Lose a snapshot, rebuild from movements.
Replacements are linked, not collapsed. Original order stays + replacement is a new row + return matches both. Triple-counted at gross level, nets to zero correctly.
3P seller revenue carve-out. dim_sellers.recognized_as drives whether SUM is GMV or commission — single column makes the rule explicit and queryable.

Enhanced graphical data model — Amazon Orders + Returns + Inventory

Section 7 — SQL analysis

Sample data — 1 order with 2 line items (one split across 2 warehouses), 1 partial return

INSERT INTO dim_sellers VALUES
  ('SLR_AMZN','retail_1p','gmv', 0.0),
  ('SLR_3P_ABC','marketplace_3p','commission_only', 0.15);

-- 3-item order; line 2 split across 2 warehouses (qty 4 from WH-A, qty 2 from WH-B)
INSERT INTO fct_order_lines VALUES
  ('OL_001','ORD_99', 1, 'CUST_X','PROD_BOOK', 'SLR_AMZN','WH-A','2025-05-01 10:00','2025-05-02 14:00','2025-05-03 16:00', 1, 19.99, 0.00, 1.65, 0.00, 1.0, 21.64, TRUE, FALSE),
  ('OL_002','ORD_99', 2, 'CUST_X','PROD_PEN',  'SLR_3P_ABC','WH-A','2025-05-01 10:00','2025-05-02 14:00','2025-05-03 16:00', 4,  3.50, 0.00, 1.16, 0.00, 1.0, 15.16, TRUE, FALSE),
  ('OL_003','ORD_99', 2, 'CUST_X','PROD_PEN',  'SLR_3P_ABC','WH-B','2025-05-01 10:00','2025-05-02 14:00','2025-05-03 16:00', 2,  3.50, 0.00, 0.58, 0.00, 1.0,  7.58, TRUE, FALSE);

-- Partial return: customer keeps the book, returns 3 of 6 pens
INSERT INTO fct_returns VALUES
  ('RET_001','OL_002','2025-05-08 09:00','2025-05-08 09:30','quality_issue', 3, 11.37, 0.00, 'used', FALSE);

A. Customer · order funnel + delivery SLA

Q1 — Prime vs non-Prime delivery time (last 30 days)

SELECT
  is_prime,
  COUNT(*)                                                                       AS lines,
  ROUND(AVG(EXTRACT(EPOCH FROM (delivered_ts - ordered_ts))/3600), 1)             AS avg_hours,
  ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (delivered_ts - ordered_ts))/3600), 1) AS p95_hours
FROM fct_order_lines
WHERE delivered_ts IS NOT NULL AND ordered_ts >= CURRENT_DATE - 30
GROUP BY is_prime;

B. Finance · net revenue with 1P/3P carve-out

Q2 — Net revenue with seller-type carve-out (3P books only commission)

WITH gross AS (
  SELECT
    s.seller_type,
    s.recognized_as,
    SUM(CASE WHEN s.recognized_as = 'gmv' THEN ol.gross_revenue_usd
             ELSE ol.gross_revenue_usd * s.commission_pct END) AS gross_recognized_usd
  FROM fct_order_lines ol JOIN dim_sellers s USING (seller_id)
  WHERE DATE(ol.ordered_ts) = '2025-05-01'
  GROUP BY s.seller_type, s.recognized_as
),
returns AS (
  SELECT
    s.seller_type,
    SUM(CASE WHEN s.recognized_as = 'gmv' THEN r.refund_amount_usd
             ELSE r.refund_amount_usd * s.commission_pct END) AS refund_recognized_usd
  FROM fct_returns r
  JOIN fct_order_lines ol ON r.order_line_id = ol.order_line_id
  JOIN dim_sellers s USING (seller_id)
  WHERE DATE(r.return_ts) = '2025-05-01'
  GROUP BY s.seller_type
)
SELECT g.seller_type,
  ROUND(g.gross_recognized_usd, 2)                                                  AS gross_revenue,
  ROUND(COALESCE(r.refund_recognized_usd, 0), 2)                                    AS refunds,
  ROUND(g.gross_recognized_usd - COALESCE(r.refund_recognized_usd, 0), 2)           AS net_revenue
FROM gross g LEFT JOIN returns r USING (seller_type);

C. Operations · inventory reconciliation

Q3 — Daily inventory invariant check (snap = opening + receipts + returns_in − sales − damage + transfers)

SELECT
  product_id, warehouse_id, snapshot_date,
  opening_qty, receipts, returns_in, sales, damage, transfers_net, ending_qty,
  (opening_qty + receipts + returns_in - sales - damage + transfers_net) AS computed_ending,
  ending_qty - (opening_qty + receipts + returns_in - sales - damage + transfers_net) AS drift,
  CASE WHEN ending_qty = opening_qty + receipts + returns_in - sales - damage + transfers_net
       THEN 'PASS' ELSE 'FAIL' END AS reconciliation
FROM snap_inventory_daily
WHERE snapshot_date = CURRENT_DATE - 1
  AND ending_qty != opening_qty + receipts + returns_in - sales - damage + transfers_net;
-- Empty result = clean. Non-empty = inventory adjustment needed; investigate.

Worked example — Order ORD_99 with split fulfillment + partial return

Order: 1 book ($21.64) + 6 pens. Pens split across 2 warehouses: 4 from WH-A ($15.16) + 2 from WH-B ($7.58) = 6 pens for $22.74. Total order gross = $44.38.
Seller-type effects on revenue: Book is 1P retail → Amazon books $21.64 GMV. Pens are 3P seller (15% commission) → Amazon books $22.74 × 0.15 = $3.41. Total Amazon recognized = $21.64 + $3.41 = $25.05.
Partial return: 3 of 6 pens returned a week later. Refund = $11.37 (proportional). Amazon's recognized refund = $11.37 × 0.15 = $1.71.
Net Amazon recognized: $25.05 − $1.71 = $23.34.
Inventory: WH-A movements record sale_out (+4 pens out, then return_in +1 pen back). WH-B records sale_out (-2 pens). Daily snapshot reconciles.

Section 8 — Why this works

Append-only returns. Net revenue = SUM(orders) − SUM(returns). No UPDATE; full historical audit.
Movements are immutable. Snapshots derive from movements; rebuild at any time.
Seller-type carve-out is a column. 1P / 2P / 3P revenue rules become a CASE in the query, not a separate ETL path.
Multi-warehouse split is a row. Same order_id + line_number, different warehouse → fulfilment is queryable without losing the customer's "1 line item" mental model.

Senior framing. "Append-only on returns is the contract that lets finance restate any historical period without rebuilding orders. The 1P/3P carve-out as dim_sellers.recognized_as turns a multi-billion-dollar revenue policy into a queryable column, not a brittle ETL branch — that's the difference between a model that survives quarter-end and one that doesn't."

Meta / Instagram · social feed · ranker

Scenario 8 — Feed engagement at scale (with ranker A/B)

Model engagement on the Instagram feed — impressions, dwell time, likes, saves, comments, shares, plus negative signals (hide, report) — supporting feed ranker training/A-B, creator analytics, and brand safety. Volume: 100B+ impressions/day.

Why this is hard — the design tension

Volume alone forces design choices: 100B impressions/day means exact COUNT DISTINCT is unaffordable; HLL is mandatory. Mutable engagements (like → unlike) demand append-only with is_undone flags. Heartbeat-driven dwell can't live at raw grain — must be aggregated at ETL. And the ranker A/B requirement means every impression must carry the model_id that ranked it, or you can't compare baseline lift.

Section 1 — Dimensional Model Overview

Fact Tables

fct_impressions — The "Served" grain. One row per (viewer × post × impression). Carries ranker_model_id + predicted_score.
fct_engagements — The "Reaction" grain. One row per (viewer × post × event_type × event_ts). Likes, saves, comments, shares, hides, reports — all here.
fct_dwell_per_impression — The "Time-on-post" grain. One row per impression with total dwell_ms (aggregated from heartbeats at ETL).
fct_ranker_experiments — The "Treatment" grain. One row per (viewer × experiment × variant). Drives A/B comparison.

Dimension Tables

dim_users (SCD2) · dim_posts (SCD2 — caption edits, audio swaps) · dim_creators (SCD2)
dim_post_audio (SCD1)
dim_ranker_models — SCD2; semver + model_card_url + deployed_pct.

Section 2 — Logical Data Model (key columns)

fct_impressions — impression_id, viewer_id, post_id, feed_position SMALLINT, impression_ts, device_id, ranker_model_id (FK), predicted_score DECIMAL(8,6), surface ∈ {feed, reels, stories, explore}.

fct_engagements — engagement_id, impression_id (NULLable: comments can arrive without prior impression event), viewer_id, post_id, event_type ∈ {like, save, share, comment, hide, report, follow_creator}, event_ts, is_undone BOOLEAN, undone_ts. Append-only — never delete an unlike.

fct_dwell_per_impression — impression_id, dwell_ms INT, was_full_view BOOLEAN, viewport_pct DECIMAL. Aggregated from raw 250ms heartbeats at ETL — saves 100× rows.

fct_ranker_experiments — viewer_id, experiment_id, variant_id, assigned_ts, exit_ts. Bridges viewers to A/B-test arms.

Section 3 — How the model serves each stakeholder

Ranker team · model lift — Compare engagement-rate per impression for ranker_model_id = v3.1 vs v3.2. Positive lift on like-rate but watch hide-rate; both should move favorably.

Creator analytics — Reach (impressions), engagement rate (engagements / impressions), dwell distribution per post.

Brand safety — Hide+Report rate per post per hour. Posts crossing thresholds get demoted; spike → human review queue.

Product growth — Negative-engagement rate is the canary for "this content drives session abandonment".

Section 4 — Why `ranker_model_id` on every impression + append-only engagements

Ranker A/B requires the model on the row. Without it, a 1% holdout for the new model can't be compared to the 99% baseline. ranker_model_id is a first-class dimension, not a footnote.
Likes are mutable, but the fact is immutable. A like at T1 + unlike at T2 = 2 rows: an insert and an "undone" insert with is_undone=TRUE, undone_ts=T2. Never DELETE; preserves time-series of toggle behavior.
Dwell aggregated at ETL. 250ms heartbeats × 100B impressions × 30 sec average = ~12 trillion raw rows/day. Aggregate before warehouse — only 100B per-impression rows survive.
Approximate aggregations. APPROX_COUNT_DISTINCT + HLL for everything user-level. Exact only when finance demands it (rare).

Enhanced graphical data model — Instagram Feed Engagement

Section 7 — SQL analysis

Sample data — 1 viewer sees 3 posts under v3.2 ranker; engages with two

INSERT INTO fct_impressions VALUES
  ('IMP_A','U_42','POST_001', 1,'2025-05-01 09:00:00','dev_iphone','model_v3.2',0.72,'feed'),
  ('IMP_B','U_42','POST_002', 2,'2025-05-01 09:00:08','dev_iphone','model_v3.2',0.55,'feed'),
  ('IMP_C','U_42','POST_003', 3,'2025-05-01 09:00:15','dev_iphone','model_v3.2',0.41,'feed');

INSERT INTO fct_dwell_per_impression VALUES
  ('IMP_A',  4200, 1.00, TRUE),   -- 4.2s, full view
  ('IMP_B',   800, 0.40, FALSE),  -- 0.8s, partial — scrolled past
  ('IMP_C',  6500, 1.00, TRUE);   -- 6.5s, full view

INSERT INTO fct_engagements VALUES
  ('ENG_1','IMP_A','U_42','POST_001','like',   '2025-05-01 09:00:04', FALSE, NULL),
  ('ENG_2','IMP_C','U_42','POST_003','save',   '2025-05-01 09:00:21', FALSE, NULL),
  ('ENG_3','IMP_C','U_42','POST_003','comment','2025-05-01 09:01:00', FALSE, NULL),
  ('ENG_4','IMP_A','U_42','POST_001','like',   '2025-05-01 09:05:00', TRUE,  '2025-05-01 09:05:00');  -- unlike

A. Creator analytics · post engagement

Q1 — Like / save / share rate per post (with active engagements only)

WITH active_engagements AS (
  SELECT post_id, event_type FROM fct_engagements WHERE is_undone = FALSE
)
SELECT
  i.post_id,
  COUNT(*)                                                        AS impressions,
  APPROX_COUNT_DISTINCT(i.viewer_id)                              AS unique_viewers,
  COUNT(*) FILTER (WHERE e.event_type = 'like')                   AS likes,
  COUNT(*) FILTER (WHERE e.event_type = 'save')                   AS saves,
  COUNT(*) FILTER (WHERE e.event_type = 'share')                  AS shares,
  ROUND(100.0 * COUNT(*) FILTER (WHERE e.event_type='like') / COUNT(*), 2) AS like_rate_pct,
  ROUND(AVG(d.dwell_ms), 0)                                       AS avg_dwell_ms
FROM fct_impressions i
LEFT JOIN active_engagements e ON e.post_id = i.post_id
LEFT JOIN fct_dwell_per_impression d ON d.impression_id = i.impression_id
GROUP BY i.post_id ORDER BY like_rate_pct DESC;

B. Brand safety · negative-signal monitoring

Q2 — Posts with hide+report rate above brand-safety threshold

SELECT
  i.post_id,
  COUNT(*)                                                                 AS impressions,
  COUNT(*) FILTER (WHERE e.event_type IN ('hide','report'))                AS negative_signals,
  ROUND(100.0 * COUNT(*) FILTER (WHERE e.event_type IN ('hide','report'))
        / COUNT(*), 3)                                                     AS negative_rate_pct
FROM fct_impressions i
LEFT JOIN fct_engagements e ON e.impression_id = i.impression_id AND e.is_undone = FALSE
WHERE DATE(i.impression_ts) = CURRENT_DATE - 1
GROUP BY i.post_id
HAVING COUNT(*) > 1000
   AND COUNT(*) FILTER (WHERE e.event_type IN ('hide','report')) * 1.0 / COUNT(*) > 0.005
ORDER BY negative_rate_pct DESC;
-- Posts above 0.5% hide+report rate go to human moderation queue.

C. Ranker · A/B model lift

Q3 — Engagement-rate lift v3.2 over v3.1

WITH ranker_metrics AS (
  SELECT
    i.ranker_model_id,
    COUNT(*)                                                    AS impressions,
    COUNT(*) FILTER (WHERE e.event_type='like'  AND e.is_undone=FALSE) AS likes,
    COUNT(*) FILTER (WHERE e.event_type='save'  AND e.is_undone=FALSE) AS saves,
    COUNT(*) FILTER (WHERE e.event_type IN ('hide','report'))    AS negatives,
    AVG(d.dwell_ms)                                             AS avg_dwell_ms
  FROM fct_impressions i
  LEFT JOIN fct_engagements        e ON e.impression_id = i.impression_id
  LEFT JOIN fct_dwell_per_impression d ON d.impression_id = i.impression_id
  WHERE i.impression_ts >= CURRENT_DATE - 7
    AND i.ranker_model_id IN ('model_v3.1','model_v3.2')
  GROUP BY i.ranker_model_id
)
SELECT ranker_model_id,
  ROUND(100.0 * likes     / NULLIF(impressions, 0), 3) AS like_rate_pct,
  ROUND(100.0 * saves     / NULLIF(impressions, 0), 3) AS save_rate_pct,
  ROUND(100.0 * negatives / NULLIF(impressions, 0), 3) AS negative_rate_pct,
  ROUND(avg_dwell_ms, 0)                                AS avg_dwell_ms
FROM ranker_metrics;
-- v3.2 wins if like_rate higher AND negative_rate not worse.

Worked example — viewer U_42's feed session

3 posts impressed under model_v3.2. Predicted scores: 0.72, 0.55, 0.41. Ranker order = score-desc, correct.
POST_001 (score 0.72): 4.2s dwell, like at +4s, full view → strong positive.
POST_002 (score 0.55): 0.8s dwell, no engagement, partial view (40%) → user scrolled past. Ranker over-predicted.
POST_003 (score 0.41): 6.5s dwell, save + comment → strong positive. Ranker under-predicted.
Calibration error on this session: 1/3 (POST_002) too high, 1/3 (POST_003) too low. Logged in fct_impressions.predicted_score for ranker training.
Mutable like: User unlikes POST_001 5 minutes later. fct_engagements stores both events (ENG_1 like, ENG_4 the unlike with is_undone=TRUE). Final state: not liked. Active-only queries filter is_undone=FALSE.

Section 8 — Why this works

Ranker A/B is queryable. ranker_model_id on every impression → comparing v3.1 vs v3.2 is a GROUP BY, not a separate ETL pipeline.
Mutable engagements without losing history. Append-only with is_undone preserves the full toggle behavior — useful for "regret rate" analysis.
Dwell aggregated at ETL. 100B per-impression rows >> 12T per-heartbeat rows. The aggregation is non-negotiable.
HLL by default. Distinct viewer counts at petabyte scale require APPROX_COUNT_DISTINCT — exact only when finance demands it.

Senior framing. "Storing ranker_model_id on the impression is what lets the feed team A/B test new models without losing the ability to compute baseline metrics — model-version is a first-class dimension, not a footnote. Append-only engagements with is_undone means we can answer 'what % of likes get unliked within 5 min?' which is the canary for low-quality ranker decisions."

Spotify · streaming media · per-stream royalties

Scenario 9 — Listening history & pool-model royalty payouts

Design a model for Spotify's listening history that supports recommendations, the "Year in Review" feature, and per-stream royalty payouts to multiple rights holders (artist + label + publisher + songwriter) — under the pool model where revenue is divided by total qualified streams in a country-period.

Why this is hard — the design tension

Two consumers want the same fact for opposite reasons. Product needs every play (skips included) for recommendations. Finance only pays on qualified streams (≥30s). Multi-rights splits mean one stream produces 4+ payout rows (artist + label + publisher + songwriter) and the splits change mid-quarter when deals renegotiate. Pool model math requires computing per-stream rate as revenue ÷ total streams in pool — a chicken-and-egg dependency that resolves only at period close.

Section 1 — Dimensional Model Overview

Fact Tables

fct_play_events — The "Listen" grain. One row per play event with qualified_play flag.
fct_royalty_periods — The "Pool" grain. One row per (country × period). Holds total pool revenue, total qualified streams, computed per-stream rate.
fct_royalty_attributions — The "Payout" grain. One row per (play × rights_holder × period). Derived nightly at period close.

Dimension & Bridge Tables

dim_users (SCD2 — plan tier drifts) · dim_subscriptions (Family/Duo group of users)
dim_tracks (SCD2 — metadata) · dim_artists · dim_labels (SCD2) · dim_publishers (SCD2) · dim_songwriters
bridge_track_rights — SCD2 with effective dates. The deal-terms ledger.

Section 2 — Logical Data Model (key columns)

fct_play_events — play_id, user_id, track_id, played_ts, duration_played_sec, qualified_play BOOL (≥30s), context_type ∈ {playlist, algorithmic, search, radio}, source_playlist_id, device_id, country_id.

bridge_track_rights (SCD2) — track_id, rights_holder_id, rights_type ∈ {artist, label, publisher, songwriter}, share_pct, effective_from, effective_to. Sum of share_pct per (track, rights_type) per period = 100%.

fct_royalty_periods — period_id, country_id, period_start, period_end, total_pool_revenue_usd, total_qualified_plays, per_stream_rate_usd (= revenue / qualified_plays).

fct_royalty_attributions — play_id, period_id, rights_holder_id, rights_type, share_pct, attributed_payout_usd.

Section 3 — How the model serves each stakeholder

Product · recommendations + Year in Review — every play counts (qualified or not). Skips are signal. context_type tells the recommender which surface drove the play.

Finance · royalty payouts — quarterly close: compute pool revenue per country, count qualified plays, derive rate, generate fct_royalty_attributions. Splits use as-of-play bridge rows, never current.

Rights holders · transparency — labels/publishers want their cut breakdown. Each fct_royalty_attributions row is a defensible per-play line item.

Section 4 — Why the bridge is SCD2 (not just a table)

Label deals renegotiate mid-quarter — a label might move 5% from publisher to artist effective Apr 15. Three failure modes if the bridge isn't SCD2:

Apr 1-14 plays paid under the wrong split. Restatement needed every quarter — auditor nightmare.
Lawsuit risk. Songwriters can demand "pay me as of the deal that was active when my song was streamed". SCD2 is the only defensible answer.
Forecast invalidation. Finance forecasts use the current bridge; without SCD2, mid-quarter changes silently invalidate projections.

The contract: every fct_royalty_attributions row uses bridge_track_rights WHERE play.played_ts BETWEEN bridge.effective_from AND bridge.effective_to.

Enhanced graphical data model — Spotify Listening + Royalty Pool

Section 7 — SQL analysis

Sample data — Country pool with 1M streams in Q2-2025

INSERT INTO fct_royalty_periods VALUES
  ('PER_2025Q2_US','US','2025-04-01','2025-06-30', 5000000.00, 1000000, 0.005000);
-- Per-stream rate = $5M / 1M plays = $0.005 / qualified play

INSERT INTO bridge_track_rights VALUES
  ('TRK_001','LBL_BIG',     'label',     0.50,'2024-01-01','2025-04-15'),
  ('TRK_001','ART_NOVA',    'artist',    0.30,'2024-01-01','2025-04-15'),
  ('TRK_001','PUB_HARMONY', 'publisher', 0.15,'2024-01-01','2025-04-15'),
  ('TRK_001','SW_LEE',      'songwriter',0.05,'2024-01-01','2025-04-15'),
  -- Mid-quarter renegotiation: label gives up 5%, artist gains 5%
  ('TRK_001','LBL_BIG',     'label',     0.45,'2025-04-15','9999-12-31'),
  ('TRK_001','ART_NOVA',    'artist',    0.35,'2025-04-15','9999-12-31'),
  ('TRK_001','PUB_HARMONY', 'publisher', 0.15,'2025-04-15','9999-12-31'),
  ('TRK_001','SW_LEE',      'songwriter',0.05,'2025-04-15','9999-12-31');

-- 2 plays of TRK_001 in US: one before, one after the renegotiation
INSERT INTO fct_play_events VALUES
  ('PLAY_A','U_42','TRK_001','2025-04-10 14:00:00', 180, TRUE,  'algorithmic', NULL,'iphone','US'),
  ('PLAY_B','U_42','TRK_001','2025-05-20 09:30:00', 240, TRUE,  'playlist','PL_ROCK','iphone','US');

A. Finance · per-stream rate & pool reconciliation

Q1 — Per-stream rate per country per period

SELECT
  rp.country_id, rp.period_start, rp.period_end,
  rp.total_qualified_plays,
  ROUND(rp.total_pool_revenue_usd, 2)   AS pool_usd,
  ROUND(rp.per_stream_rate_usd, 6)      AS rate_per_stream
FROM fct_royalty_periods rp
WHERE rp.period_start = '2025-04-01'
ORDER BY rp.country_id;

Q2 — Royalty attribution per qualified play (using as-of-played bridge splits)

SELECT
  p.play_id,
  p.played_ts,
  b.rights_type,
  b.rights_holder_id,
  b.share_pct,
  rp.per_stream_rate_usd,
  ROUND(rp.per_stream_rate_usd * b.share_pct, 6) AS attributed_payout_usd
FROM fct_play_events p
JOIN dim_countries  c  ON p.country_id = c.country_id
JOIN fct_royalty_periods rp
  ON rp.country_id = p.country_id
 AND p.played_ts BETWEEN rp.period_start AND rp.period_end
JOIN bridge_track_rights b
  ON b.track_id = p.track_id
 AND p.played_ts >= b.effective_from
 AND p.played_ts <  b.effective_to                 -- as-of-play splits
WHERE p.qualified_play = TRUE AND p.track_id = 'TRK_001'
ORDER BY p.played_ts, b.rights_type;

B. Product · listening behavior

Q3 — Skip rate by context_type (algorithmic vs playlist vs radio)

SELECT context_type,
  COUNT(*)                                                         AS plays,
  COUNT(*) FILTER (WHERE qualified_play = FALSE)                   AS skips,
  ROUND(100.0 * COUNT(*) FILTER (WHERE qualified_play = FALSE)
        / COUNT(*), 2)                                             AS skip_rate_pct
FROM fct_play_events
WHERE played_ts >= CURRENT_DATE - 30
GROUP BY context_type ORDER BY skip_rate_pct DESC;
-- High skip rate on algorithmic = recommender quality issue. On radio = expected (lean-back).

Worked example — TRK_001 royalty math (before vs after renegotiation)

Pool: $5,000,000 in US for Q2-2025. 1,000,000 qualified streams. Per-stream rate = $0.005.
PLAY_A (Apr 10): uses pre-renegotiation splits. Label 50% / Artist 30% / Publisher 15% / Songwriter 5%.
- Label: $0.005 × 0.50 = $0.0025
- Artist: $0.005 × 0.30 = $0.0015
- Publisher: $0.005 × 0.15 = $0.00075
- Songwriter: $0.005 × 0.05 = $0.00025
- Sum: $0.005 ✓
PLAY_B (May 20): uses post-renegotiation splits. Label 45% / Artist 35% / Publisher 15% / Songwriter 5%.
- Label: $0.005 × 0.45 = $0.00225
- Artist: $0.005 × 0.35 = $0.00175
- (others unchanged)
- Sum: $0.005 ✓
Audit query produces 8 rows total (2 plays × 4 rights holders). Each is defensible: shows the exact bridge row that was active when the play occurred.

Section 8 — Why this works

One play, two consumers. qualified_play as a flag (not a separate table) keeps Product and Finance on the same source of truth.
Pool model derived at period close. Per-stream rate isn't real-time — it's only known when the period closes. Modeled as a fact table, not a daily rolling number.
SCD2 bridge keeps deals defensible. Mid-quarter renegotiations don't break historical payouts.
Attribution table grows linearly. ~4 rows per qualified play. Quarterly bulk-write at period close, then read-only.

Senior framing. "The 30-second threshold is the integration point between product analytics (every play matters for recs) and finance (only qualified plays earn royalty) — modeling it as a flag on fct_play_events rather than two separate tables is what lets both teams stay on the same source of truth. SCD2 on bridge_track_rights turns deal renegotiations from an audit nightmare into a predicate."

Stripe · payments · double-entry ledger

Scenario 10 — Double-entry ledger for payments (charges, refunds, chargebacks, FX, reserves)

Design a model for Stripe's payment system supporting charges, refunds, chargebacks, payouts, multi-currency FX, reserves, and rigorous double-entry accounting where every transaction's entries sum to zero per currency. Append-only — never UPDATE — for regulatory audit compliance.

Why this is hard — the design tension

Payments is the domain with zero tolerance for retro edits. A regulator can ask "what was merchant X's balance on day Y as the books showed it then?" — the answer has to be defensible regardless of corrections made since. That forces append-only on every fact. Multi-currency adds an FX gain/loss leg. Chargebacks arrive 30-180 days after the charge. Reserves are an internal transfer, not external money. The model must enforce SUM=0 per balance transaction per currency as a continuous invariant.

Section 1 — Dimensional Model Overview

Fact Tables

fct_ledger_entries — The "Atom" grain. One row per debit OR credit on (account, currency). The source of truth. Append-only.
fct_balance_transactions — The "Business event" grain. One row per charge/refund/payout/chargeback. Groups its entries via balance_txn_id.
snap_account_balance_daily — The "State" grain. Derived — running SUM per (account × currency × day). Pre-aggregated for fast balance reads.

Dimension Tables

dim_merchants (SCD2 — risk profile, reserve_pct) · dim_customers · dim_payment_methods (SCD2)
dim_accounts — Chart of accounts: merchant_balance, fees, reserve, tax, fx_gain_loss, chargeback_reserve.
dim_currencies · dim_fx_rates (SCD2 daily — FX rate locked at posting time)

Section 2 — Logical Data Model (key columns)

fct_ledger_entries

Column	Type	Description
`entry_id`	PK	Surrogate.
`balance_txn_id`	FK	Groups entries within one business event.
`account_id`	FK	Which account: merchant_balance, fees, reserve, etc.
`merchant_id`	FK	The merchant.
`currency_code`	CHAR(3)	The currency of this entry.
`amount`	DECIMAL(20,4)	Signed. Convention: positive = credit, negative = debit. Sum per (balance_txn × currency) = 0.
`entry_type`	STRING	{charge, refund, fee, fx_gain, fx_loss, payout, chargeback, reserve_hold, reserve_release}.
`effective_ts, posted_ts`	TIMESTAMP_TZ	When the event happened vs when it hit the ledger.
`source_event_id`	UNIQUE	Idempotency key. Producer retries write same row.
`reversal_of_id`	FK NULL	For reversal entries — points to the original.

fct_balance_transactions — balance_txn_id, txn_type ∈ {charge, refund, payout, chargeback, fee, transfer}, txn_ts, source_object_id (charge_id / refund_id), merchant_id, status.

Section 3 — How the model serves each stakeholder

Merchant · balance & payouts — Available balance = SUM(fct_ledger_entries.amount) WHERE account = 'merchant_balance' AND merchant_id = X. From snap_account_balance_daily for speed.

Finance · reconciliation — Reconcile every balance_txn: SUM(amount) GROUP BY currency = 0. Any failure = producer bug, page on-call.

Risk · chargeback exposure — Aggregate entry_type = 'chargeback' per merchant per day. High rate = risk profile change → increase reserve_pct.

Regulators · audit trail — "Show merchant X balance as of date Y as the books showed it then" — query fct_ledger_entries WHERE posted_ts <= Y AND filter out reversals for entries posted after Y.

Section 4 — Why append-only + invariant + idempotency key

Append-only is regulatory. SOX / PCI / banking audits require an immutable history. Every reversal is a new pair of entries with opposite signs, never a retro edit.
SUM=0 invariant per (balance_txn, currency). Continuous data-quality check. If it fails, payments stop until fixed.
Idempotency at the row. UNIQUE(source_event_id) guarantees that producer retries don't create duplicate ledger entries — even under network partitions.
FX as separate currency entries. EUR charge → USD payout requires a 4-entry transaction: −EUR (merchant_balance debit) / +EUR (fx_holding credit) / −USD (fx_holding debit at locked rate) / +USD (merchant_balance credit). FX gain/loss recognized in reporting currency.

Enhanced graphical data model — Stripe Double-Entry Ledger

Section 7 — SQL analysis

Sample data — €100 charge converted to USD payout (multi-currency, 4 entries)

-- Business event 1: charge €100 (customer pays merchant)
INSERT INTO fct_balance_transactions VALUES
  ('BTX_001','charge', '2025-05-01 09:00:00','CH_777','MRC_FR','succeeded');

-- Ledger entries (SUM = 0 per currency):
INSERT INTO fct_ledger_entries (balance_txn_id, account_id, merchant_id, currency_code, amount, entry_type, effective_ts, source_event_id) VALUES
  ('BTX_001','merchant_balance','MRC_FR','EUR',  100.00, 'charge',     '2025-05-01 09:00','EVT_001'),
  ('BTX_001','fees',            'MRC_FR','EUR',   -2.90, 'fee',        '2025-05-01 09:00','EVT_002'),
  ('BTX_001','customer_balance','MRC_FR','EUR', -100.00, 'charge',     '2025-05-01 09:00','EVT_003'),
  ('BTX_001','fees_revenue',    'MRC_FR','EUR',    2.90, 'fee',        '2025-05-01 09:00','EVT_004');
-- SUM(amount WHERE currency='EUR') = 100 - 2.90 - 100 + 2.90 = 0 ✓

-- Business event 2: payout €97.10 → USD at locked rate 1.08
INSERT INTO fct_balance_transactions VALUES
  ('BTX_002','payout', '2025-05-03 10:00:00','PO_777','MRC_FR','paid');

INSERT INTO fct_ledger_entries (balance_txn_id, account_id, merchant_id, currency_code, amount, entry_type, effective_ts, source_event_id) VALUES
  ('BTX_002','merchant_balance','MRC_FR','EUR',  -97.10, 'payout',  '2025-05-03 10:00','EVT_005'),  -- debit EUR
  ('BTX_002','fx_holding',      'MRC_FR','EUR',   97.10, 'payout',  '2025-05-03 10:00','EVT_006'),  -- credit EUR
  ('BTX_002','fx_holding',      'MRC_FR','USD', -104.87, 'payout',  '2025-05-03 10:00','EVT_007'),  -- debit USD (97.10 × 1.08)
  ('BTX_002','merchant_bank',   'MRC_FR','USD',  104.87, 'payout',  '2025-05-03 10:00','EVT_008'); -- credit USD
-- SUM EUR = 0 ✓ ; SUM USD = 0 ✓

A. Finance · invariant DQ check

Q1 — Continuous SUM=0 invariant per (balance_txn × currency)

SELECT
  balance_txn_id,
  currency_code,
  ROUND(SUM(amount), 4) AS sum_amount,
  COUNT(*)              AS entry_count
FROM fct_ledger_entries
WHERE effective_ts >= CURRENT_DATE - 1
GROUP BY balance_txn_id, currency_code
HAVING ABS(SUM(amount)) > 0.0001;
-- Empty = healthy. Any row = producer bug. Halts the payments pipeline.

B. Merchant · balance & payouts

Q2 — Merchant balance per currency (as-of-now)

SELECT
  merchant_id, currency_code,
  ROUND(SUM(amount), 2) AS balance
FROM fct_ledger_entries
WHERE account_id = 'merchant_balance' AND merchant_id = 'MRC_FR'
GROUP BY merchant_id, currency_code;
-- Or read from snap_account_balance_daily for sub-second response.

Q3 — Audit trail: balance as it appeared on a specific historical date

SELECT
  merchant_id, currency_code,
  ROUND(SUM(amount), 2) AS balance_as_of_2025_05_15
FROM fct_ledger_entries
WHERE account_id = 'merchant_balance'
  AND merchant_id = 'MRC_FR'
  AND posted_ts <= '2025-05-15 23:59:59'
GROUP BY merchant_id, currency_code;
-- Defensible regardless of corrections made AFTER 2025-05-15. The append-only contract.

C. Risk · chargeback exposure

Q4 — Chargeback rate per merchant + dispute lag

WITH charges AS (
  SELECT merchant_id, balance_txn_id, effective_ts AS charge_ts, ABS(amount) AS gross_usd
  FROM fct_ledger_entries
  WHERE entry_type = 'charge' AND account_id = 'merchant_balance' AND amount > 0
),
chargebacks AS (
  SELECT merchant_id, reversal_of_id, effective_ts AS chargeback_ts, ABS(amount) AS chargeback_usd
  FROM fct_ledger_entries WHERE entry_type = 'chargeback'
)
SELECT
  c.merchant_id,
  COUNT(c.balance_txn_id)                                                AS total_charges,
  COUNT(cb.reversal_of_id)                                               AS total_chargebacks,
  ROUND(100.0 * COUNT(cb.reversal_of_id) / NULLIF(COUNT(c.balance_txn_id), 0), 2) AS chargeback_rate_pct,
  ROUND(AVG(EXTRACT(EPOCH FROM (cb.chargeback_ts - c.charge_ts))/86400), 1) AS avg_dispute_lag_days
FROM charges c
LEFT JOIN chargebacks cb ON cb.reversal_of_id = c.balance_txn_id
WHERE c.charge_ts >= CURRENT_DATE - 90
GROUP BY c.merchant_id
HAVING COUNT(c.balance_txn_id) > 100
ORDER BY chargeback_rate_pct DESC;
-- Merchants >1% chargeback rate trigger reserve_pct increase.

Worked example — €100 charge → USD payout

Charge BTX_001 (€100): 4 ledger entries in EUR.
- +€100 to merchant_balance · −€2.90 fee from merchant_balance · net €97.10 to merchant.
- −€100 from customer_balance · +€2.90 to fees_revenue.
- EUR sum = 100 − 2.90 − 100 + 2.90 = 0 ✓
Payout BTX_002 (3 days later, FX rate locked at 1 EUR = 1.08 USD).
- −€97.10 from merchant_balance · +€97.10 to fx_holding (EUR side closes)
- −$104.87 from fx_holding (USD side, 97.10 × 1.08) · +$104.87 to merchant_bank
- EUR sum = 0 ✓ · USD sum = 0 ✓
Two months later: chargeback on BTX_001. Append 4 new entries that reverse the charge (positive amounts where original was negative). reversal_of_id links each new entry to its original. Original rows untouched. Audit trail preserved. Net effect: merchant balance net change for the chargeback is recorded as a fresh balance_txn.

Section 8 — Why this works

Append-only is the contract. Historical balance reads ("as of date Y") are defensible regardless of corrections made later. Required by SOX, PCI, banking regulators.
Invariant catches bugs at write time. Continuous DQ on SUM=0 — any producer regression halts the pipeline before bad data reaches finance.
Idempotency at the row. UNIQUE(source_event_id) handles producer retries without duplicating money.
FX as currency entries. No "convert at read time" — every conversion is a recorded ledger event with its locked rate.

Senior framing. "Append-only is not a style choice in payments — it's a regulatory requirement. The model has to support 'show me the full audit trail for merchant X on date Y as it appeared then' regardless of corrections made since. That's why every reversal is a new pair of entries, never a retro edit. SUM=0 per (balance_txn × currency) is the invariant that keeps the books honest at write time."

Airbnb · 2-sided marketplace · calendar-as-fact

Scenario 11 — Bookings, calendar, cancellations & reviews

Design a model for Airbnb supporting search/availability lookup, booking lifecycle (request → accepted → stayed → reviewed), cancellation-policy refund logic, host payouts after stay completion, and dynamic pricing — with full point-in-time queryability ("what was the price of listing X on July 4 last year?").

Why this is hard — the design tension

The calendar is the hot read path: a search query for "Paris, July 4-7, 2 guests" hits potentially millions of (listing × night) rows in milliseconds. Calendar-as-fact with daily snapshots is the canonical answer. Booking modifications require append-only — same booking_id can't UPDATE in place. Cancellation refunds depend on (policy × days-from-check-in) and must be computed at cancel time, locked on the booking row. Multi-currency pays guest in EUR, host in USD; FX locked at booking.

Section 1 — Dimensional Model Overview

Fact Tables

fct_calendar_nights — The "Availability" grain. One row per (listing × night × snapshot_date). Daily snapshots enable point-in-time price lookup.
fct_bookings — The "Reservation" grain. One row per booking lifecycle. Append-only — modifications create new rows.
fct_reviews — The "Bilateral feedback" grain. One row per review (typed: guest_to_host OR host_to_guest).
fct_host_payouts — The "Payout" grain. One row per payout event after stay completion.

Dimension Tables

dim_listings (SCD2 — price, policies, capacity drift)
dim_hosts (SCD2 — superhost status, response rate)
dim_guests
dim_geography (neighborhood, lat/lon, market_id)

Section 2 — Logical Data Model (key columns)

fct_calendar_nights

Column	Type	Description
`listing_id, night_date, snapshot_date`	Composite PK	Time-travel via snapshot.
`is_available, is_blocked, is_booked`	BOOLEAN	Mutually exclusive states.
`booking_id`	FK NULL	Set when booked.
`base_price_local`, `smart_price_local`, `effective_price_local`	DECIMAL	Base = host-set; smart = algorithmic; effective = what was charged.
`currency_code`	CHAR(3)	Listing's currency.

fct_bookings — booking_id, listing_id, guest_id, request_ts, accepted_ts, cancelled_ts, cancellation_actor ∈ {guest, host, platform}, check_in_date, check_out_date, nights, guests, subtotal/cleaning/service/taxes/total_local, fx_rate_to_usd, total_usd, cancellation_policy ∈ {flexible, moderate, strict, super_strict}, refund_amount_usd, terminal_state.

Section 3 — How the model serves each stakeholder

Guest · search & book — Search joins fct_calendar_nights WHERE is_available=TRUE AND night_date BETWEEN x AND y. Hot path; pushed to Elasticsearch + indexed in warehouse.

Host · earnings & calendar mgmt — Daily payouts from fct_host_payouts. Pricing team analyzes fct_calendar_nights over time to recommend smart-price strategy.

Pricing team · "what was the price on date X?" — Time-travel via snapshot_date in the PK. Critical for dynamic-pricing experiments and dispute resolution.

Trust & safety · review-rate analysis — Bilateral reviews; if a host has high "guest cancelled" rate, T&S investigates.

Section 4 — Why calendar-as-fact + daily snapshot

Without snapshots: can only answer "what's the price now?" Time-travel queries require reconstructing from dim_listings SCD2 + booking events — slow, error-prone, often impossible after data drift.
With daily snapshots: "what was the price of listing X on July 4 last year (as Airbnb showed it on July 1)?" is one query.
Cost: 7M listings × 365 future nights × daily snapshot = ~9B rows/day. Partition by night_date; aggressively expire old snapshots (keep 90-day raw, then weekly).
Append-only on bookings. Modifications generate a new booking row with parent_booking_id; original sets terminal_state = modified. Audit trail preserved.

Enhanced graphical data model — Airbnb Bookings + Calendar

Section 7 — SQL analysis

Sample data — 1 listing, 5 nights of calendar, 1 booking, 1 guest cancellation

INSERT INTO fct_calendar_nights VALUES
  ('LST_PARIS_42','2025-07-04','2025-06-01', FALSE, FALSE, TRUE,  'BKG_777', 200.00, 240.00, 240.00,'EUR'),
  ('LST_PARIS_42','2025-07-05','2025-06-01', FALSE, FALSE, TRUE,  'BKG_777', 200.00, 240.00, 240.00,'EUR'),
  ('LST_PARIS_42','2025-07-06','2025-06-01', FALSE, FALSE, TRUE,  'BKG_777', 200.00, 240.00, 240.00,'EUR'),
  ('LST_PARIS_42','2025-07-07','2025-06-01', TRUE,  FALSE, FALSE,  NULL,    200.00, 220.00, 220.00,'EUR'),
  ('LST_PARIS_42','2025-07-08','2025-06-01', TRUE,  FALSE, FALSE,  NULL,    200.00, 220.00, 220.00,'EUR');

INSERT INTO fct_bookings VALUES
  ('BKG_777','LST_PARIS_42','GUEST_LISA','2025-06-01 10:00','2025-06-01 10:30',NULL,NULL,
   '2025-07-04','2025-07-07', 3, 2,
   720.00, 80.00, 56.00, 64.00, 920.00, 1.08, 993.60,
   'moderate', NULL, 'confirmed');

-- Guest cancels 5 days before check-in (under 'moderate' policy = 50% refund)
UPDATE fct_bookings SET cancelled_ts='2025-06-29 14:00', cancellation_actor='guest',
  refund_amount_usd=496.80, terminal_state='cancelled' WHERE booking_id='BKG_777';

A. Search · calendar utilization

Q1 — Available listings in Paris for July 4-7, 2 guests

SELECT l.listing_id, l.title, MIN(c.effective_price_local) AS price_low, MAX(c.effective_price_local) AS price_high
FROM dim_listings l
JOIN fct_calendar_nights c ON l.listing_id = c.listing_id
JOIN dim_geography g ON l.geography_id = g.geography_id
WHERE c.snapshot_date = CURRENT_DATE
  AND c.night_date BETWEEN '2025-07-04' AND '2025-07-06'
  AND c.is_available = TRUE
  AND l.capacity >= 2
  AND g.market_id = 'PARIS'
GROUP BY l.listing_id, l.title
HAVING COUNT(*) = 3;     -- all 3 nights available

B. Pricing team · time-travel queries

Q2 — Price of LST_PARIS_42 on July 4, 2025 — as Airbnb showed it on June 1

SELECT listing_id, night_date, base_price_local, smart_price_local, effective_price_local
FROM fct_calendar_nights
WHERE listing_id = 'LST_PARIS_42'
  AND night_date = '2025-07-04'
  AND snapshot_date = '2025-06-01';
-- The data product the dynamic-pricing team can't ship without.

C. Trust & safety · cancellation analysis

Q3 — Host cancellation rate (a strong T&S signal)

SELECT
  l.host_id,
  COUNT(*)                                                   AS total_bookings,
  COUNT(*) FILTER (WHERE b.cancellation_actor = 'host')      AS host_cancels,
  ROUND(100.0 * COUNT(*) FILTER (WHERE b.cancellation_actor = 'host')
        / COUNT(*), 2)                                        AS host_cancel_rate_pct
FROM fct_bookings b
JOIN dim_listings l USING (listing_id)
WHERE b.request_ts >= CURRENT_DATE - 90
GROUP BY l.host_id
HAVING COUNT(*) > 10
ORDER BY host_cancel_rate_pct DESC;
-- Hosts >3% rate get warning; >5% lose Superhost status.

Worked example — BKG_777 cancellation under "moderate" policy

Booking total: $993.60 for 3 nights ($720 subtotal + $80 cleaning + $56 service + $64 tax, FX 1.08).
Cancelled 5 days before check-in.
Moderate policy refund_curve: ≥5 days before = 100% refund of nightly rate (subtotal only); cleaning fee always refunded; service fee non-refundable.
Refund = $720 (subtotal at 100%) × 0.50 (we're inside the partial-refund window for "moderate") + $80 cleaning + $0 service = $496.80.
fct_bookings.refund_amount_usd = $496.80, locked at cancellation_ts. Future re-querying gives same result regardless of policy changes after.
Calendar nights for Jul 4-6 flip back to is_available=TRUE in the next snapshot. Listing reopens for searches.

Section 8 — Why this works

Calendar-as-fact + snapshots = time-travel pricing queries are first-class.
Append-only bookings with parent_booking_id chain = audit trail for modifications.
Refund locked at cancel time = future policy changes don't restate historical refunds.
Bilateral reviews = both guest and host accountability surfaced.

Senior framing. "The calendar-as-fact is the unlock — pricing teams can answer 'what was the price of this listing on Jul 4 last year?' from fct_calendar_nights without reconstructing it from booking events. That's the data product the dynamic-pricing team can't ship without. Snapshot_date in the PK is the cheap engineering choice that pays back every quarter."

SaaS · subscription + usage metering

Scenario 12 — Subscription billing + hourly usage metering (Snowflake / Datadog model)

Design a model for a SaaS with mixed pricing — flat-rate seats + usage-based metering (per-query, per-GB, per-API-call) — handling mid-cycle plan changes, prorations, overages, free trials, and multi-currency invoicing.

Why this is hard — the design tension

Two revenue streams with opposite shapes. Subscription is predictable, monthly, modeled at low row volume. Usage is volatile, hourly, billions of rows/day. Mid-cycle plan changes require pro-rated math. Trials look like paid plans but bill $0. Free-tier allowances reset each period. Multi-currency demands FX locked at usage time. The model must keep usage and subscription separate at the fact layer but join cleanly at invoice time.

Section 1 — Dimensional Model Overview

Fact Tables

fct_subscription_state — The "Daily snapshot" grain. One row per (subscription × day). Captures plan, status, MRR.
fct_usage_meter — The "Hourly usage" grain. One row per (subscription × meter × hour). High volume.
fct_billing_events — The "Charge" grain. One row per subscription_charge / usage_charge / proration / refund / credit / dunning.
fct_invoices — The "Statement" grain. One row per invoice — derived rollup of billing events.

Dimension Tables

dim_accounts · dim_plans (SCD2 — pricing tiers drift) · dim_meters (compute_credits, storage_gb_month, api_calls) · dim_promotions (SCD2)

Section 2 — Logical Data Model (key columns)

fct_usage_meter — account_id, subscription_id, meter_id, hour_bucket_ts, raw_quantity, billable_quantity (= max(0, raw − free_allowance)), unit_price_local (locked at hour), currency_code, amount_local, fx_rate_to_usd, amount_usd.

fct_billing_events — event_id, account_id, subscription_id, event_type, plan_id (SCD2 effective at event_ts), event_ts, period_start, period_end, amount_local, currency_code, amount_usd, status, source_event_id UNIQUE.

fct_subscription_state — subscription_id, day, plan_id, status ∈ {trialing, active, paused, canceled, past_due}, mrr_usd, currency, days_into_cycle, days_remaining.

Section 3 — How the model serves each stakeholder

Finance · MRR & ARR — MRR = SUM(fct_subscription_state.mrr_usd) WHERE day = today AND status = 'active'. ARR = MRR × 12.

Product · usage analytics — Hourly grain → daily/weekly rollups for adoption analysis. Same source as Finance — one truth.

Customer Success · expansion signal — Accounts approaching their plan's overage threshold = upsell trigger. Computed from fct_usage_meter trends + dim_plans.tier_limit.

Sales · NRR cohorts — Net revenue retention by quarterly signup cohort. Combines all four facts.

Section 4 — Why hourly meter + daily snapshot + event log

Hourly meter resolution is the integration point: fine enough for usage analytics, coarse enough that the table doesn't explode. Per-query/per-call rolls up to the hour at ingest.
Daily subscription state avoids modeling subscription as a slowly-changing dim — too much daily flux. Daily snapshot is bounded and queryable.
Append-only billing events with UNIQUE(source_event_id) = idempotent + audit-defensible. Mid-cycle plan change emits 2 events (proration credit on old, prorated charge on new); invoice is just SUM of events in the period.
FX locked per hour. Same precision as the meter; never restated.

Enhanced graphical data model — SaaS Subscription + Usage

Section 7 — SQL analysis

Sample data — Mid-cycle plan upgrade with 15 days remaining

-- Account on Pro plan ($100/mo), upgrades to Enterprise ($300/mo) on day 15 of a 30-day cycle
INSERT INTO fct_billing_events VALUES
  ('EVT_S1','ACCT_X','SUB_42','subscription_charge','PLAN_PRO',       '2025-05-01 00:00','2025-05-01','2025-05-31',100.00,'USD',100.00,'paid','SRC_001'),
  -- Day 15: upgrade. Proration credit on PRO (15 days unused = $50)
  ('EVT_P1','ACCT_X','SUB_42','proration',          'PLAN_PRO',       '2025-05-15 14:00','2025-05-15','2025-05-31',-50.00,'USD',-50.00,'applied','SRC_002'),
  -- Day 15: prorated charge for ENT (15 days at new rate = $150)
  ('EVT_P2','ACCT_X','SUB_42','proration',          'PLAN_ENTERPRISE','2025-05-15 14:00','2025-05-15','2025-05-31',150.00,'USD',150.00,'paid','SRC_003');

-- Hourly usage: account exceeds 100 free credits/hour 3 times during the day
INSERT INTO fct_usage_meter VALUES
  ('ACCT_X','SUB_42','METER_COMPUTE','2025-05-15 09:00:00', 120,  20, 0.10,'USD', 2.00,1.0, 2.00,'2025-05-15 09:05'),
  ('ACCT_X','SUB_42','METER_COMPUTE','2025-05-15 14:00:00', 180,  80, 0.10,'USD', 8.00,1.0, 8.00,'2025-05-15 14:05'),
  ('ACCT_X','SUB_42','METER_COMPUTE','2025-05-15 18:00:00', 250, 150, 0.10,'USD',15.00,1.0,15.00,'2025-05-15 18:05');

A. Finance · MRR & invoice rollup

Q1 — MRR roll-forward (active subscriptions, latest snapshot)

SELECT
  COUNT(*)                                        AS active_subs,
  ROUND(SUM(mrr_usd), 2)                          AS mrr_usd,
  ROUND(SUM(mrr_usd) * 12, 2)                     AS arr_usd
FROM fct_subscription_state
WHERE day = CURRENT_DATE
  AND status = 'active';

Q2 — Invoice for SUB_42 in May 2025 (sums all event types)

SELECT
  subscription_id,
  SUM(amount_usd) FILTER (WHERE event_type = 'subscription_charge') AS subscription_charges,
  SUM(amount_usd) FILTER (WHERE event_type = 'usage_charge')        AS usage_charges,
  SUM(amount_usd) FILTER (WHERE event_type = 'proration')           AS proration_net,
  SUM(amount_usd) FILTER (WHERE event_type = 'credit')              AS credits,
  SUM(amount_usd)                                                    AS total_due_usd
FROM fct_billing_events
WHERE subscription_id = 'SUB_42'
  AND period_start = '2025-05-01'
GROUP BY subscription_id;

B. Customer Success · upsell signal

Q3 — Accounts approaching their plan's overage threshold (upsell candidates)

WITH usage_30d AS (
  SELECT subscription_id, meter_id, SUM(billable_quantity) AS used
  FROM fct_usage_meter
  WHERE hour_bucket_ts >= CURRENT_DATE - 30
  GROUP BY subscription_id, meter_id
),
plan_caps AS (
  SELECT s.subscription_id, m.meter_id, p.included_quantity
  FROM fct_subscription_state s
  JOIN dim_plans p USING (plan_id)
  CROSS JOIN dim_meters m
  WHERE s.day = CURRENT_DATE AND s.status = 'active'
)
SELECT u.subscription_id, u.meter_id, u.used, c.included_quantity,
  ROUND(100.0 * u.used / NULLIF(c.included_quantity, 0), 1) AS pct_of_cap
FROM usage_30d u
JOIN plan_caps c USING (subscription_id, meter_id)
WHERE u.used > c.included_quantity * 0.85
ORDER BY pct_of_cap DESC;
-- >85% of cap = upsell trigger; >100% = overage charges hit next invoice.

C. Sales · NRR by cohort

Q4 — Net Revenue Retention by signup cohort

WITH cohort_baseline AS (
  SELECT subscription_id, DATE_TRUNC('quarter', MIN(day)) AS cohort_q,
         FIRST_VALUE(mrr_usd) OVER (PARTITION BY subscription_id ORDER BY day) AS starting_mrr
  FROM fct_subscription_state
),
current_state AS (
  SELECT subscription_id, mrr_usd AS current_mrr
  FROM fct_subscription_state WHERE day = CURRENT_DATE
)
SELECT cb.cohort_q,
  COUNT(DISTINCT cb.subscription_id)                                      AS cohort_size,
  ROUND(SUM(cb.starting_mrr), 0)                                          AS starting_mrr_total,
  ROUND(SUM(COALESCE(cs.current_mrr, 0)), 0)                              AS current_mrr_total,
  ROUND(100.0 * SUM(COALESCE(cs.current_mrr, 0)) / NULLIF(SUM(cb.starting_mrr), 0), 1) AS nrr_pct
FROM cohort_baseline cb
LEFT JOIN current_state cs USING (subscription_id)
GROUP BY cb.cohort_q
ORDER BY cb.cohort_q;
-- NRR > 100% = expansion outpaces churn (the SaaS holy grail).

Worked example — SUB_42 mid-cycle upgrade + usage overage

Day 1 (May 1): charged $100 for Pro plan (full month).
Day 15 (May 15): upgrade to Enterprise. Two proration events:
- Credit −$50 on Pro (15 unused days × $100/30)
- Charge +$150 for Enterprise (15 days × $300/30)
- Net delta: +$100
Day 15 (later): usage spikes. Plan includes 100 compute_credits/hr; account uses 250 in one hour → 150 billable at $0.10 = $15 overage.
End-of-month invoice for May: $100 (Pro full) − $50 (credit) + $150 (Ent prorated) + $25 (3-hr usage overage) = $225.
MRR effect on day 15 onward: jumps from $100 to $300/mo. Visible in fct_subscription_state.mrr_usd daily snapshot.

Section 8 — Why this works

Hourly usage + daily subscription = right resolution at each grain. Usage is high-volume; subscription is bounded.
Proration is two events, not one. Both credit and charge land in fct_billing_events; invoice rollup picks them up cleanly.
Idempotent at source_event_id. Producer retries don't double-charge.
FX locked per hour. Same precision as the meter; no FX restatement.

Senior framing. "The hourly grain on fct_usage_meter is the integration point — finance rolls it up to invoices, product teams use it for usage analytics, and customer success uses it for adoption signals. Three consumers, one source of truth. Mid-cycle plan changes emit two billing events instead of one updated row — that's what makes the invoice a SUM, not a recompute."

· · ·

§ 04 — Cross-Cutting Concerns

The patterns that show up regardless of domain.

Eleven patterns. §9 (LAD), §10 (factless facts), §11 (streaming-first) are the senior-level moves — drop them in interviews and the room shifts.

1. SCD Types — the deep dive

Type	Mechanism	When to use	Domain example
Type 1	Overwrite. No history.	Current state only matters; cheap fix-ups.	User's display name correction.
Type 2	New row per change with `effective_from` / `effective_to`.	Historical accuracy required — financial, regulatory, ML.	Driver rating tier (Uber), creative version (Meta), plan tier (SaaS).
Type 3	Add `previous_X` column. Rarely used.	Only the most recent change matters.	Job title (current + previous).
Type 4	Separate history table.	Current dim is small + hot; history rarely queried.	Geography reorganizations.
Type 6	Hybrid — `current_X` column + SCD2 history.	90% of queries want current, 10% want history.	Customer's current address vs every address ever.

2. Late-arriving data — the four cases

Case	Pattern
Late-arriving fact	Event timestamped yesterday, lands today. Watermark on stream side; daily backfill window (typically 24–48h) on batch.
Late-arriving dimension	Fact references a dim row not yet loaded. Use placeholder SK `-1`, mark with `needs_dim_lookup = TRUE`, backfill nightly.
Slowly-changing fact	Financial restatement (refund of last month's invoice). Append correction row; never `UPDATE`. Net = SUM.
Out-of-order events	Stream-time vs event-time. Watermark = max_seen_ts − lag. Sort within partition.

3. GDPR / privacy — the universal pattern

Tokenize at ingest. Replace user_id with user_id_token; keep mapping in a separate, encrypted vault.
On delete request: invalidate the token mapping. Events become un-rejoinable to the person.
Aggregates survive — DAU, revenue, etc. stay accurate.
Time-bounded: 30-day SLA; track in fact_gdpr_request.
Tax / legal carve-outs: finance facts retained 7 years even after request — anonymize (replace name) but keep transaction.

4. Multi-currency — the lock-at-event-time rule

Step	Rule
Store both	Always store `amount_local + currency_code` AND `amount_usd + fx_rate`. Never just one.
Lock the rate	Lock `fx_rate_to_usd` at event time, not query time. Same charge can't yield different USD amounts on different days.
FX dim is SCD2	`dim_fx_rate(currency_pair, rate, effective_from, effective_to)` — daily rate at minimum.
Reporting currency	Some companies report in a third currency (Stripe USD, Adyen EUR). Carry that as a third pair.

5. Time zones — the three rules

Storage in UTC, always. Use TIMESTAMP_TZ when supported.
Dim for the reporting TZ. "Daily" reports are in the reporting entity's TZ (driver's local TZ for Uber payouts; advertiser's TZ for Google ads pacing).
DST-safe partitioning. Partition by DATE(event_ts AT TIME ZONE 'UTC') always. DST shifts cause off-by-one if you partition by local TZ.

6. Hot partitions / skew — the four counter-moves

Counter-move	When
Sub-partition by hash	One title/cell/merchant dominates: `HASH(key) MOD 16` as a sub-partition.
Salted aggregation	Two-stage GROUP BY: SUM by (key \|\| salt), then SUM by key. Spark/Trino.
Separate hot table	Top-N keys go to a "hot" partition with looser retention; rest to "cold".
Pre-aggregate at edge	Sketch the hot keys upstream (HLL, Count-Min) so the warehouse never sees raw skew.

7. Idempotency — the universal contract

Every fact table needs one of:

Natural-key MERGE: MERGE INTO target USING source ON natural_key WHEN MATCHED AND source.ingest_ts > target.ingest_ts THEN UPDATE.
Append-only with dedup at read: ROW_NUMBER() OVER (PARTITION BY natural_key ORDER BY ingest_ts DESC) = 1.
Unique constraint on source_event_id: the producer's idempotency key, enforced.

8. Conformed dimensions — the discipline

One dim_date, one dim_geography, one dim_user — owned by the platform team, not duplicated per consumer. Otherwise:

Joins lie (different definitions of "user").
Cross-fact analysis (revenue × engagement) is impossible.
SCD2 versioning drifts.

Drop the term in interviews: "I'd want this to share dim_user with the existing customer-data platform — otherwise we're rebuilding the join every team."

9. Late-Arriving Dimensions (LAD) — the deep dive

Almost every cross-cutting concern collapses into one of four scenarios. LAD is the one that bites hardest in production because it silently drops rows on inner-join — Content Strategy stares at "0 views in the first 3 minutes" of a Netflix Original launch and the dim hadn't propagated yet.

When LAD happens

Trigger	Domain example	Time scale
Global launch — fact stream is global, dim sync is regional	Netflix Originals at midnight Pacific; Stranger Things S5	seconds–minutes
CDC connector lag — Debezium → Kafka → warehouse	Amazon catalog updates during Prime Day	minutes–hours
Manual dim load — someone must approve the new SKU	App Store new-app review	hours–days
Cross-region replication	Stripe payment dim from US to EU	seconds

The two strategies

Strategy	Where the fix lives	Pros	Cons
At-ingest (inferred member)	ETL inserts placeholder dim row when a fact arrives without a dim	Downstream queries are clean INNER JOINs; no every-query repetition	Fact ingest now coupled to dim; needs `is_inferred = TRUE` flag for later overwrite
At-query (defensive LEFT JOIN)	Every analyst query uses LEFT JOIN + COALESCE	Loose coupling; ingest stays simple	Repeated boilerplate; analysts forget the COALESCE and lose data silently

Lakehouses (Snowflake, BigQuery, Databricks) typically prefer at-ingest with an is_inferred flag. Real-time event pipelines (Kafka → Pinot) prefer at-query because the placeholder write would cause race conditions.

The defensive query template

WITH current_dim AS (
  SELECT dim_key, attr_a, attr_b,
    ROW_NUMBER() OVER (PARTITION BY dim_key ORDER BY effective_date DESC) AS rn
  FROM dim_table
)
SELECT f.fact_id, f.dim_key,
  COALESCE(d.attr_a, ' ' || f.dim_key || ' (Pending Sync)') AS attr_a,
  COALESCE(d.attr_b, 'Unknown')                                       AS attr_b
FROM fact_table f
LEFT JOIN current_dim d ON d.dim_key = f.dim_key AND d.rn = 1;

Three things working together:

LEFT JOIN keeps every fact row, even when the dim is missing.
ROW_NUMBER + rn = 1 picks the most-recent SCD2 version of the dim (re-titled, re-genre'd entries handled).
COALESCE with a searchable placeholder ("Pending Sync") instead of NULL so the BI tool's filter dropdown surfaces the gap.

The orphan-rate SLO

Production target: pct_orphan_facts < 0.5%. Above 1% pages the data-platform on-call. The query that drives the alert:

SELECT 100.0 * SUM(CASE WHEN d.dim_key IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_orphan
FROM fact_table f LEFT JOIN current_dim d ON d.dim_key = f.dim_key AND d.rn = 1;

The dual problem — late-arriving FACT

The mirror image: dim is current, fact arrives days late from offline-mode mobile. Now you must attribute against the dim version that was current at fact time, not now. SCD2 + temporal join:

JOIN dim_table d
  ON d.dim_key = f.dim_key
 AND f.event_ts BETWEEN d.effective_from AND COALESCE(d.effective_to, '9999-12-31')

The "as-of-event" attribution. Same family of patterns; opposite direction.

Senior signal in the room

Drop the phrase: "the fact is the truth; the dimension is just an attribute that can be late." That single line tells the interviewer you've shipped LAD-aware pipelines.

10. Factless fact tables — the modeling pattern that gets missed

Factless facts have no measure column — they record that an event happened or that a relationship existed. Two flavors:

Flavor	Records	Domain example
Event-tracking	Things that happened with no dollars/qty	Page views, login attempts, push notification deliveries, video impressions, search queries
Coverage	Eligibility / availability that ENABLED an event	Promotions ran in store-month, students enrolled in course, products featured on home page, ads scheduled to a slot

Why they exist — the canonical examples

Question	Need a factless fact
"Which products were featured on the home page but didn't sell?"	Featuring is the absence of a sales fact — needs a coverage table to spot the negative space.
"Which students were registered for a class but never attended?"	Registration is one event, attendance is another. A registration-with-no-attendance row only exists if you store both.
"Which ads were scheduled but didn't get an impression?"	Same shape — coverage of scheduled minus measure of fired.

The schema

CREATE TABLE fact_class_registration (
  student_sk     INTEGER NOT NULL,    -- FK to dim_student
  course_sk      INTEGER NOT NULL,    -- FK to dim_course
  semester_sk    INTEGER NOT NULL,    -- FK to dim_semester (date-like)
  registration_sk INTEGER NOT NULL,   -- degenerate dim (the registration event id)
  -- NO measure columns
  PRIMARY KEY (student_sk, course_sk, semester_sk)
);

Pure foreign keys. The existence of the row is the measurement.

The COUNT(*) pattern

Aggregations on a factless fact are always COUNT(*):

-- How many students registered per course?
SELECT c.course_name, COUNT(*) AS registrations
FROM fact_class_registration r JOIN dim_course c USING (course_sk)
GROUP BY c.course_name;

-- Coverage gap: registered but no attendance?
SELECT r.student_sk, r.course_sk
FROM fact_class_registration r
LEFT JOIN fact_class_attendance a USING (student_sk, course_sk, semester_sk)
WHERE a.student_sk IS NULL;

The trap — when to use vs when to add a measure

Hint	Decision
"Did this happen?"	Factless
"How much / how many of X?"	Add a measure
"Was X eligible / scheduled / present?"	Factless coverage
"What's the dollar / quantity / score?"	Add a measure

Senior signal in the room

Drop the phrase: "this is a factless fact — the existence of the row is the measurement." Junior engineers add a junk count = 1 column. Seniors don't.

11. Streaming-first modeling — Kappa, watermarks, exactly-once

The lambda/kappa decision shifts your modeling. Once you commit to streaming-first, your fact tables become append-only event logs that the warehouse incrementally materializes.

Lambda vs Kappa — when to pick which

Architecture	Description	Pick when
Lambda	Two paths — batch (correct, slow) + speed (fresh, approximate). Reconcile downstream.	Legacy migration, very high accuracy needs, regulator requires audit trail of corrections.
Kappa	Stream is the single source of truth. Replay from log to rebuild any state. No batch path.	Greenfield, modern stack, you can replay, latency > correctness for most consumers.

Most 2024+ designs lean Kappa with a "snapshot every N hours" view for analyst convenience.

Watermarks — the late-event contract

A watermark is the timestamp the engine guarantees no event with an earlier event-time will arrive. It's a promise: "I've waited 90 seconds past max_seen_event_time; anything older is now late."

Window state	Action
watermark < window_end	Window still open; keep accumulating
watermark ≥ window_end	Emit window result; close it
Event arrives after window closed	Late event; either drop, side-output, or update via "allowed lateness"

Event-time vs ingest-time vs processing-time

Time	What it measures	Pick when
Event-time	When the event actually happened (client clock or server-side at originating event)	Almost always — it's the "truth"
Ingest-time	When the event entered the streaming engine	You can't trust event-time (no client clock); audit trails
Processing-time	When the engine processed the event	Wallclock-driven side effects; never for analytics

Modeling rule: store all three on every row. Partition / window by event-time; alert / debug by processing-time.

Exactly-once — the contract chain

"Exactly-once" isn't a single feature — it's a chain that breaks if any link fails. The four required pieces:

Idempotent producer — Kafka transactional producer + producer_id + sequence number
Idempotent consumer — store offset + transactional output (often via 2PC or Kafka transactions)
Idempotent sink — natural-key MERGE; or upsert by source_event_id with unique constraint
Replay-safe state — Flink savepoints; Kafka Streams state stores backed by Kafka topics

Senior framing: "exactly-once is end-to-end; if your sink doesn't have a unique key, the upstream guarantees don't matter."

Stream-table duality

Every streaming concept maps to a SQL concept:

Streaming	SQL equivalent
Stream (append-only)	Insert-only fact table
Table (latest value per key)	SCD1 dim table
Stream → Table (KTable)	Materialized view with `ROW_NUMBER() OVER ... = 1`
Tumbling window	`GROUP BY DATE_TRUNC('hour', event_ts)`
Hopping window	Multiple overlapping `GROUP BY` with `generate_series` spine
Session window	Gaps-and-islands via LAG + cumulative sum
Watermark	"Don't query rows newer than NOW() - 90s"

Drop this in interviews: "every Flink job is just a SQL query that hasn't been run yet."

CDC patterns — the bridge

Most companies don't go pure-streaming overnight. They hybridize via change-data-capture:

Debezium reads the OLTP database write-ahead log
Emits a Kafka topic with one record per row-change (insert/update/delete)
Sink connector materializes the stream into the warehouse with eventual consistency

Modeling consequence: every dim becomes SCD2 by design (every change is an event), and the warehouse can replay history at any point.

The skew + checkpoint storm

Streaming jobs fail in two characteristic ways:

Hot key skew — one user/title/merchant dominates one task slot. Fix: salt the key (HASH(key) MOD 16) for the heavy aggregation, then re-aggregate.
Checkpoint storm — checkpoint duration grows past interval; backpressure cascades. Fix: incremental checkpoints (RocksDB), unaligned checkpoints, smaller state.

Senior signal in the room

Drop two phrases: "watermark + allowed lateness handles late-arriving events deterministically" and "exactly-once is end-to-end — sink idempotency is non-negotiable." The interviewer immediately stops asking streaming questions and moves to the hard parts.

· · ·

§ 05 — Traps & Articulation

The eight traps + the script that wins the round.

The eight modeling traps — and the senior counter-move

Trap	Counter-move
Writing columns before stating grain	Say grain first, every time.
Using natural keys as PKs	Surrogate as PK, natural as a column. Mandatory for SCD2.
`UPDATE` on facts	Append corrections; let MV pick latest.
One status column on a dim ("current_stage")	Factless fact for transitions + daily snapshot for current.
Hierarchies via recursive joins for hot queries	Denormalize parent IDs onto leaf, drift-check nightly.
GDPR via cascade `DELETE`s	Tokenize at ingest; invalidate token on delete.
"We'll add SCD later"	Adding SCD2 to a populated dim is a 6-month migration. Decide upfront.
Per-team conformed dim copies	One conformed dim, owned by platform. Otherwise joins lie.

The 90-second articulation script — works on every domain

"Before I start — let me confirm: [3 clarifying questions]. Okay, assuming [X], the grain of the primary fact is [one sentence]. I'd model this as a star — [fact] surrounded by [3–4 dims] — with [dim_X] as Type 2 because [reason]. Physically, partition by [date_sk], cluster by [hot key]. Two failure modes I'd plan for: [late data + watermark] and [GDPR + tokenization]. The unlock for the business is [decision the data product enables]."

The three sentences that signal seniority — in any round

"The grain of this fact is one row per X per Y per Z." — names the unit of analysis up front.
"The trade-off here is X vs Y; I'm picking X because [downstream consumer / cost / regulatory]." — names the choice and owns it.
"This breaks when [late data / GDPR / FX / skew / time-zone shift]." — volunteers a failure mode.

Drill those three sentence-shapes until they're reflex. The interviewer cannot give you a senior score if these never come out of your mouth.

Now go practise. Pair this design deep-dive with the Practice Q&A bank — 888 real interview questions with embedded SQL & Python runtimes. → Practice · Q&A bank

Twelve scenarios across the industries that ask · the framework holds · go well.

· · ·

▸ Interview Studio · Design · Data Modeling ← All Design topics Practice Q&A →

Data modeling — twelve scenarios across the industries that actually ask.

Contents

Five moves, every prompt — on every domain.

The vocabulary that signals seniority — across all domains

What each company will probably ask you to model.

The twelve most-asked modeling prompts — full walk-throughs.

Scenario 1 — Trip lifecycle (with multi-driver convoys)

Why this is hard — the design tension

Section 1 — Dimensional Model Overview

Section 2 — The Logical Data Model

Section 3 — How the model serves each stakeholder

Section 4 — Why the hierarchy: bridges + parent fact

Section 5 — Conceptual data flow (the convoy in plain English)

Section 6 — Sample data representation

Enhanced graphical data model — Trip Lifecycle & Analytical Metrics

Section 7 — SQL analysis for business units

A. Operations · live ETA & lifecycle metrics

B. Finance · revenue reconciliation & driver payout

C. Risk · fraud detection

D. Product · CSAT, sentiment & growth accounting

Section 8 — Why this works for the convoy scenario

Section 9 — Business model evolution: single-sided → two-sided → three-sided → driverless

The driverless future — what changes when the human driver disappears?

Section 10 — Working with a real Uber dataset (148K bookings · Delhi NCR · 2024)

Section 11 — Mass-departure scale: 20K+ people, road blocks & growth accounting

11.1 — Re-stated grain hierarchy under stress

11.2 — Growth accounting on both sides of the marketplace

11.3 — Mass departure with road blocks: SCD2 geography in action

11.4 — Financial & risk integrity under chaos

11.5 — Scalability rules under load (the engineering details)

Scenario 2 — Surge pricing & supply/demand state

Why this is hard — the design tension

Section 1 — Dimensional Model Overview

Section 2 — The Logical Data Model

Section 3 — How the model serves each stakeholder

Section 4 — Why versioning the surge model matters (and why input_features_json is mandatory)

Section 5 — Conceptual data flow (the State → Decision → Outcome chain)

Enhanced graphical data model — Surge Pricing & Effectiveness

Section 7 — SQL analysis for business units

A. Pricing engine · live decisions

B. Pricing team · effectiveness analysis

C. Capacity planning · forecast accuracy & chronic gaps

Worked example — surge math at 19:02 in h3_sf_market

Section 8 — Why this works

Scenario 3 — Order & courier dispatch (with stacked deliveries)

Why this is hard — the design tension

Section 1 — Dimensional Model Overview

Section 2 — Logical Data Model

Section 3 — How the model serves each stakeholder

Section 4 — Why the order/dispatch split + bridge

Enhanced graphical data model — DoorDash 3-sided marketplace

Section 7 — SQL analysis for business units

A. Customer · order latencies & SLA

B. Courier · earnings & stack uplift

C. Restaurant · prep time & cancellation

Worked example — Stacked dispatch DSP-99

Section 8 — Why this works

Scenario 4 — Search ads auction → impression → click → conversion

Why this is hard — the design tension

Section 1 — Dimensional Model Overview

Section 2 — Logical Data Model (key columns only)

Section 3 — How the model serves each stakeholder

Section 4 — Why auction_decision_id + append-only attribution

Enhanced graphical data model — Search Ads Auction Chain

Section 7 — SQL analysis for business units

A. Advertiser · ROAS & ranker debug

B. Attribution & measurement

C. Ranker · quality score & pCTR calibration

Worked example — Nike "running shoe" auction clearing & attribution

Section 8 — Why this works

Scenario 5 — Cross-device attribution & identity graph

Why this is hard — the design tension

Section 1 — Dimensional Model Overview

Section 2 — Logical Data Model (key columns)

Section 3 — How the model serves each stakeholder

Section 4 — Why "match at read, not at write"

Enhanced graphical data model — Cross-device Attribution

Section 7 — SQL analysis for business units

A. Advertiser · cross-device ROAS by window

B. Measurement · incrementality lift

Section 4 — Why versioning the surge model matters (and why `input_features_json` is mandatory)

Section 4 — Why `auction_decision_id` + append-only attribution

Section 4 — Why `ranker_model_id` on every impression + append-only engagements