Data modeling — twelve scenarios across the industries that actually ask.
Modeling rounds at Uber, Google, Meta, Netflix, Amazon, DoorDash, Airbnb, Stripe and Spotify all test the same five-move sequence — but the domain context changes everything you say. This notebook walks the framework once, then drills the twelve most-asked scenarios end to end. Use it to swap domains in 60 seconds when the prompt arrives.
Contents
- The universal framework — five moves, on every prompt
- Company → most-likely scenario matrix
- Twelve scenarios, full walk-through
- 1. Uber/Lyft — trip lifecycle
- 2. Uber — surge pricing & supply/demand
- 3. DoorDash/Uber Eats — order & dispatch
- 4. Google search ads — auction & quality score
- 5. Meta/Facebook — attribution & identity
- 6. Netflix CTV ads — inventory & pacing
- 7. Amazon — orders, returns & inventory
- 8. Meta/Instagram — feed engagement
- 9. Spotify — listening history & royalties
- 10. Stripe — payments & double-entry ledger
- 11. Airbnb — bookings, calendar & reviews
- 12. SaaS subscription billing & usage metering
- Cross-cutting concerns — SCDs, LAD, factless facts, streaming, GDPR, FX, skew
- Common traps + the 90-second articulation script
Five moves, every prompt — on every domain.
The framework doesn't care whether you're modeling Uber trips, Google ads, or Spotify streams. Domain knowledge changes what you say in each move; the moves themselves stay identical. Drill the moves until they're reflex, then layer domain context on top.
- Clarify the use cases. Who reads this? What 3–5 questions must it answer? What's the SLA — minutes, hours, days? Don't model anything until you have 3–5 named consumers.
- Pin the grain. One sentence per fact table: "One row per X per Y per Z." If you can't say it cleanly, the model isn't ready.
- Conceptual → logical → physical. Entities first. Then attributes and types. Then partitioning, clustering, SCD strategy.
- SCDs and time. Type 1 (overwrite), Type 2 (versioned rows), Type 6 (hybrid). Justify each dim.
- Stress test. Volunteer two failure modes — late data, GDPR, schema drift, hot partitions, time zones — without being asked.
The vocabulary that signals seniority — across all domains
| Term | One-sentence definition |
|---|---|
| Grain | The meaning of one row, stated as a sentence. |
| SCD Type 1 / 2 / 6 | Overwrite vs versioned-rows vs hybrid (current + history columns). |
| Conformed dimension | Same dim used across multiple fact tables (dim_date, dim_geography). |
| Bridge / factless fact | M:N resolver, or an event-occurrence table with no measure. |
| Late-arriving dim | Fact references dim row not yet loaded — placeholder SK -1, backfill. |
| Slowly-changing fact | Restating financials. Append corrections; never UPDATE in place. |
| Idempotent load | Re-running yesterday produces the same result. MERGE on natural key. |
| Surrogate key | Synthetic integer PK. Required for SCD2. |
| Star vs snowflake | Star — denormalized dims. Snowflake — normalized. Star is default for analytics. |
| Wide table | One denormalized fact pre-joined to dims. Read speed vs storage + drift. |
| Activity schema | Single-table model where every event is a row. Useful for product analytics. |
| Data vault | 3-table pattern (hub/link/satellite) for high-change EDWs. Mention; rarely use. |
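Two of these terms, idempotent load and surrogate-keyed MERGE, are easy to demo concretely. A minimal sketch using Python's built-in sqlite3 (table and column names are simplified stand-ins, not the schema used later in this notebook): replaying yesterday's batch upserts on the natural key instead of appending duplicates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fct_trips (trip_id TEXT PRIMARY KEY, fare REAL, ingest_ts TEXT)")

def load(rows):
    # MERGE on the natural key (trip_id): replaying a batch overwrites, never duplicates.
    con.executemany(
        """INSERT INTO fct_trips (trip_id, fare, ingest_ts) VALUES (?, ?, ?)
           ON CONFLICT(trip_id) DO UPDATE SET fare = excluded.fare,
                                              ingest_ts = excluded.ingest_ts""",
        rows,
    )

batch = [("T-101", 30.0, "2025-05-01"), ("T-102", 30.0, "2025-05-01")]
load(batch)
load(batch)  # re-run "yesterday" — same result as a single run
n, total = con.execute("SELECT COUNT(*), SUM(fare) FROM fct_trips").fetchone()
assert (n, total) == (2, 60.0)
```

The same shape works in warehouse SQL as `MERGE INTO ... WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT`.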
What each company will probably ask you to model.
Every company hires for the same five-move framework, but they probe with prompts shaped like their business. Use this matrix to anticipate.
| Company | Most-likely modeling prompts | The signal they want |
|---|---|---|
| Uber / Lyft | Trip lifecycle, surge pricing, driver-rider matching, ETA, payouts | Geo (H3 hexes), supply-demand thinking, multi-state lifecycle |
| DoorDash / Uber Eats | Order pipeline, courier dispatch, restaurant supply, batched delivery | 3-sided marketplace, time-sensitive SLAs |
| Google | Search ads auction, YouTube viewing, Maps trips, Workspace events | Auction internals, quality score, ranker features |
| Meta / Facebook | Feed engagement, Ads attribution, Reels watch, social graph | Cross-device identity, attribution windows, ATT/SKAdNetwork |
| Netflix | Viewing history, content catalog, A/B, talent contracts, CTV ads | Studio context, royalty calc, SCD2 fluency |
| Amazon | Orders, returns, inventory, recommendations, Prime entitlement | Returns reshape revenue; inventory snapshots; multi-warehouse |
| Spotify / Apple Music | Listening history, playlist generation, royalty payouts, podcasts | Per-stream royalty math, multi-rights-holder splits |
| Stripe / Square / PayPal | Transactions, ledger, fraud, payouts, FX | Double-entry rigor, multi-currency, restate-don't-update |
| Airbnb / Vrbo | Bookings, calendar, listings, reviews, host payouts | 2-sided marketplace, calendar-as-fact, cancellation policies |
| Salesforce / HubSpot | Opportunities, accounts, activities, quotas, attribution | SCD2 on opportunities (stage transitions), org hierarchy |
| LinkedIn | Profile views, connections, feed engagement, recruiter search | Graph traversal, time-decayed relevance |
| Snowflake / Databricks | Query telemetry, billing/credits, warehouse usage | Usage metering, multi-tenant, charge-back |
The twelve most-asked modeling prompts — full walk-throughs.
Scenario 1 — Trip lifecycle (with multi-driver convoys)
Why this is hard — the design tension
The problem isn't a single trip — it's the group intent. Five friends going to the airport expect to be charged a single price, but Uber dispatches two vehicles. Operations sees two trips; Finance must charge the group once but pay two drivers fairly; Risk must verify the convoy actually moved together. The robust solution is a hierarchical Star Schema that decouples the physical movement of the car from the financial obligation of the rider.
Section 1 — Dimensional Model Overview
This model uses a Journey → Trip → Event hierarchy to handle everything from a single ride to complex shared convoys.
Fact Tables (the hierarchy)
- `fct_journeys` — The "Parent" grain. One row per group request. Captures group intent (split type, total group cost, status).
- `fct_trips` — The "Vehicle" grain. One row per driver / vehicle per journey. Source of truth for Finance (driver payouts).
- `fct_trip_events` — The "Telemetry" grain. One row per state transition (REQUEST, MATCH, PICKUP, etc.) per trip. Powers Ops (Live ETA) and Risk (GPS replay).
- `fct_trip_feedback` — The "Sentiment" grain. One row per (trip × rider) feedback event. Captures rating, category, free-text comment, sentiment. Powers Product (CSAT, NPS).
- `brg_journey_riders` — The "Financial" bridge. Links riders to journeys and to a specific assigned trip/vehicle — handles cost-splitting (journey-level) and capacity routing (trip-level).
Dimension Tables
- `dim_users` — Riders and drivers, typed via `user_role`. SCD2 on plan tier and rating.
- `dim_vehicles` — Vehicle specs (model, type), insurance status, owner. SCD2 on insurance status (Risk needs as-of-trip status).
- `dim_geography` — H3 hexagonal indexing (level 8 ≈ 0.7 km²) for spatial analysis. SCD1 (geography is structurally static).
- `dim_payment_methods` — Card / wallet on file. SCD2 — Finance audit needs the as-of-trip card.
- `dim_fx_rate` — SCD2 daily rates; locked at `request_timestamp`.
Section 2 — The Logical Data Model
Parent Fact: fct_journeys (the group intent)
| Column | Type | Description |
|---|---|---|
journey_key | PK BIGINT | Surrogate. |
journey_id | UUID | Natural ID; idempotency anchor. |
requesting_user_id | FK | Who initiated the group request. |
split_type | ENUM | {EQUAL_SPLIT, DISTANCE_PRORATA, REQUESTER_PAYS, CUSTOM}. |
total_group_fare_usd | DECIMAL(10,2) | Single bill for the whole group; sum of brg_journey_riders.individual_charge reconciles to this. |
num_vehicles_dispatched | SMALLINT | 1 for normal, ≥2 for convoy. |
num_riders | SMALLINT | Total riders in the group. |
journey_status | ENUM | {requested, in_progress, completed, partial, cancelled}. |
request_timestamp | TIMESTAMP_TZ | Group request time; FX rate locked here. |
Vehicle Fact: fct_trips (one row per driver/vehicle per journey)
| Column | Type | Description |
|---|---|---|
trip_key | PK BIGINT | Surrogate key for the trip. |
trip_id | UUID | Natural ID from the source system; idempotency anchor. |
journey_id | FK | Parent journey. Multiple trips can share a journey (convoy). |
trip_fare_share_usd | DECIMAL(10,2) | This vehicle's share of the journey fare. Drives driver payout, NOT what riders pay. |
driver_id | FK | The matched driver. (driver,journey) is unique — same driver can't take two trips for one journey. |
vehicle_id | FK | The vehicle used. |
origin_h3_key | FK | H3 cell at request time. |
destination_h3_key | FK | H3 cell at dropoff (NULL on cancellation). |
request_timestamp | TIMESTAMP_TZ | The beginning of the lifecycle. |
terminal_state | ENUM | {completed, cancelled_rider, cancelled_driver, cancelled_system, expired, no_show}. |
surge_multiplier | DECIMAL(4,2) | Locked at request — immutable. |
total_fare_local | DECIMAL(10,2) | Final fare in source currency. |
fx_rate_to_usd | DECIMAL(18,8) | Locked at request_timestamp — guarantees revenue can never be restated by FX drift. |
total_fare_usd | DECIMAL(10,2) | Final amount charged to rider(s) — Finance source of truth. |
driver_payout_usd | DECIMAL(10,2) | Final amount owed to driver after take rate (Finance focus). |
is_flagged_risk | BOOLEAN | Fraud suspicion flag (Risk focus). Slowly-changing — append correction; never UPDATE. |
ingest_ts | TIMESTAMP_TZ | For idempotent MERGE. |
Event Fact: fct_trip_events
| Column | Type | Description |
|---|---|---|
event_id | PK | Unique event identifier (idempotency). |
trip_id | FK | Link to fct_trips. |
state | STRING | REQUESTED, MATCHED, ARRIVED, PICKUP, STARTED, DROPOFF, CANCELLED. |
event_timestamp | TIMESTAMP_TZ | Precise time of transition. |
location_h3 | H3_INDEX | Cell where the event occurred. |
actor | STRING | Who triggered the transition: rider, driver, system. |
Sentiment Fact: fct_trip_feedback (per-rider satisfaction)
| Column | Type | Description |
|---|---|---|
feedback_key | PK BIGINT | Surrogate. |
trip_key | FK | The trip rated. |
rider_id | FK | Who left the feedback. |
rating_score | TINYINT | 1–5 stars. |
feedback_category | STRING | {driver_behavior, vehicle_cleanliness, route, payment, app_issue, other}. |
rider_comment | TEXT | Free-text; fed to NLP for sentiment. |
sentiment_score | DECIMAL(4,3) | −1.0 to +1.0; computed nightly via NLP model. |
feedback_timestamp | TIMESTAMP_TZ | When submitted (often hours after dropoff). |
Financial Bridge: brg_journey_riders
| Column | Type | Description |
|---|---|---|
journey_id | FK | Parent journey — drives the group cost-split. |
rider_id | FK | The rider on this leg. |
trip_id | FK | The specific car this rider was assigned to. Critical: in a convoy, riders R_1 + R_2 may ride in T-101 while R_3 rides in T-102. |
individual_charge_usd | DECIMAL(10,2) | What this rider actually pays. Computed from fct_journeys.split_type. |
pickup_sequence | SMALLINT | Order picked up within their assigned trip (1, 2, 3…). |
leg_distance_km | DECIMAL(8,3) | This rider's portion of the route; basis for distance-prorata splits. |
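The three core tables above can be sketched as DDL. A minimal sqlite version (types simplified; a warehouse would use DECIMAL/ENUM plus partitioning, and sqlite only enforces the FK clauses with `PRAGMA foreign_keys=ON`) that encodes the two structural rules: one trip per (driver, journey), one bridge row per (journey, rider).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fct_journeys (
  journey_key INTEGER PRIMARY KEY,
  journey_id  TEXT UNIQUE NOT NULL,          -- natural ID, idempotency anchor
  split_type  TEXT CHECK (split_type IN
      ('EQUAL_SPLIT','DISTANCE_PRORATA','REQUESTER_PAYS','CUSTOM')),
  total_group_fare_usd    NUMERIC,
  num_vehicles_dispatched INTEGER,
  num_riders              INTEGER
);
CREATE TABLE fct_trips (
  trip_key   INTEGER PRIMARY KEY,
  trip_id    TEXT UNIQUE NOT NULL,
  journey_id TEXT NOT NULL REFERENCES fct_journeys(journey_id),
  driver_id  INTEGER NOT NULL,
  trip_fare_share_usd NUMERIC,
  UNIQUE (driver_id, journey_id)             -- one trip per driver per journey
);
CREATE TABLE brg_journey_riders (
  journey_id TEXT NOT NULL REFERENCES fct_journeys(journey_id),
  rider_id   INTEGER NOT NULL,
  trip_id    TEXT NOT NULL REFERENCES fct_trips(trip_id),
  individual_charge_usd NUMERIC,
  PRIMARY KEY (journey_id, rider_id)         -- a rider appears once per journey
);
""")
tables = {r[0] for r in con.execute("SELECT name FROM sqlite_master WHERE type='table'")}
assert tables == {"fct_journeys", "fct_trips", "brg_journey_riders"}
```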
Section 3 — How the model serves each stakeholder
Operations · live ETA & monitoring
By using fct_trip_events, Operations calculates lead-time metrics directly:
- Dispatch latency = `timestamp(MATCHED) − timestamp(REQUESTED)`
- Pickup ETA accuracy = `timestamp(PICKUP) − timestamp(MATCHED)` compared to the predicted ETA
- Spatial hot-spot detection via `location_h3`: aggregate REQUESTED events per cell per minute against MATCHED events to identify cells where demand exceeds supply in real time.
Finance · driver payouts & revenue
- `fct_trips` is the General Ledger source of truth — every payout in `fct_payouts` joins back here on `trip_key`.
- `surge_multiplier` is immutable once the request is made — ensures the rider's quoted price matches the driver's earning potential. Stored on the row, never derived.
- `terminal_state` dictates whether a cancellation fee triggers a partial driver payout (e.g. `cancelled_rider` after MATCHED triggers a fee; `cancelled_system` does not).
- FX is locked at `request_timestamp` — guarantees revenue restatement is impossible from rate drift.
Risk · fraud detection
- Location anomaly: compare `origin_h3` on `fct_trips` with the first PICKUP event's H3 in `fct_trip_events` — large divergence = potential ride-tampering.
- Trip-event GPS replay reveals "teleportation" — impossible speed between consecutive events implies spoofed location.
- Account farming: join `fct_trips` with a `dim_devices` table (linked via `fct_session`) — a single device cycling through multiple `rider_id`s in a short window flags suspicious activity.
- `is_flagged_risk` is a slowly-changing fact — appended via correction row, never updated, so Finance can replay any payout decision regardless of subsequent risk re-evaluation.
Section 4 — Why the hierarchy: bridges + parent fact
Without the journey/trip split + bridge, you face three bad choices:
- Option A: One row per trip, rider IDs in an array column. Finance can't aggregate revenue per rider without exploding the array. Disqualifying.
- Option B: One row per rider in `fct_trips`. A Pool with 3 riders becomes 3 rows — `fare` is triple-counted on every aggregation. Driver and dispatch counts also triple-count. Disqualifying.
- Option C: Single fact for everything (rider-grain + driver-grain merged). A 3-rider, 2-driver convoy = 6 rows. Now you can't tell drivers from riders without filtering by role, and convoy-cost reconciliation needs a self-join. Disqualifying.
The hierarchy with bridge is the only design that satisfies all three stakeholders:
- One row per group in `fct_journeys` — group-level KPIs (avg group size, split-type mix) are correct without filters.
- One row per driver/vehicle per journey in `fct_trips` — Operations and Finance per-driver counts are correct on simple aggregation. `(driver_id, journey_id)` uniqueness honored.
- Multiple rows per journey in `brg_journey_riders` — Finance computes per-rider charge with the journey's split logic; Risk can route a rider to a specific trip via the FK.
- Two enforceable invariants:
  - `SUM(brg_journey_riders.individual_charge_usd)` per `journey_id` = `fct_journeys.total_group_fare_usd`
  - `SUM(fct_trips.trip_fare_share_usd)` per `journey_id` = `fct_journeys.total_group_fare_usd`

Allocation rules:
- `EQUAL_SPLIT`: `individual_charge = total_group_fare / num_riders`
- `DISTANCE_PRORATA`: `individual_charge = total_group_fare × (leg_distance_km / SUM(leg_distance_km))`
- `REQUESTER_PAYS`: `individual_charge = total_group_fare` for the requester, `0` for others
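The allocation rules translate directly into code. A sketch of the three split types (the penny-reconciliation step, pushing any rounding remainder onto the requester, is an assumption about how production would keep the journey invariant exact; the notebook's schema doesn't specify it):

```python
from decimal import Decimal, ROUND_HALF_UP

def allocate(split_type, total_fare, riders):
    """riders: ordered list of (rider_id, leg_distance_km); riders[0] is the requester.
    Returns {rider_id: charge} summing exactly to total_fare."""
    total = Decimal(str(total_fare))
    cent = Decimal("0.01")
    if split_type == "EQUAL_SPLIT":
        per = (total / len(riders)).quantize(cent, ROUND_HALF_UP)
        charges = {r: per for r, _ in riders}
    elif split_type == "DISTANCE_PRORATA":
        total_km = sum(Decimal(str(km)) for _, km in riders)
        charges = {r: (total * Decimal(str(km)) / total_km).quantize(cent, ROUND_HALF_UP)
                   for r, km in riders}
    elif split_type == "REQUESTER_PAYS":
        charges = {r: Decimal("0.00") for r, _ in riders}
        charges[riders[0][0]] = total
    else:
        raise ValueError(split_type)
    # Penny reconciliation (assumed convention): remainder goes to the requester
    # so SUM(individual_charge) == total_group_fare, the journey invariant.
    charges[riders[0][0]] += total - sum(charges.values())
    return charges

j500 = allocate("EQUAL_SPLIT", "60.00", [("R_1", 12.0), ("R_2", 12.0), ("R_3", 12.0)])
assert sum(j500.values()) == Decimal("60.00")
assert j500 == {r: Decimal("20.00") for r in ("R_1", "R_2", "R_3")}
```

Decimal (not float) matters here: a $100 EQUAL_SPLIT over 3 riders rounds to $33.33 each, and the reconciliation step adds the leftover cent back so the invariant still holds.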
Section 5 — Conceptual data flow (the convoy in plain English)
For Journey J-500, 5 friends heading to the airport (3 billable riders), dispatched as a 2-car convoy:
- `fct_journeys` — the group's perspective. One row, $60 total, EQUAL_SPLIT, 2 vehicles dispatched.
- `fct_trips` — the drivers' perspective. Two rows: Driver Alpha ($30 share), Driver Bravo ($30 share).
- `brg_journey_riders` — the riders' perspective. Three rows: each rider charged $20, two assigned to T-101, one to T-102.
Section 6 — Sample data representation
Table: fct_journeys (the logical group)
| journey_id | split_type | total_group_fare_usd | num_vehicles | num_riders | status |
|---|---|---|---|---|---|
| J-500 | EQUAL_SPLIT | $60.00 | 2 | 3 | completed |
Table: fct_trips (the physical vehicles)
| trip_id | journey_id | driver_id | vehicle_id | trip_fare_share_usd |
|---|---|---|---|---|
| T-101 | J-500 | D_Alpha | V_99 | $30.00 |
| T-102 | J-500 | D_Bravo | V_88 | $30.00 |
Table: brg_journey_riders (the rider split)
| journey_id | rider_id | trip_id (assigned car) | individual_charge_usd |
|---|---|---|---|
| J-500 | R_1 | T-101 | $20.00 |
| J-500 | R_2 | T-101 | $20.00 |
| J-500 | R_3 | T-102 | $20.00 |
Two reconciliations: SUM(brg.individual_charge) = $20 + $20 + $20 = $60 ✓ | SUM(fct_trips.trip_fare_share) = $30 + $30 = $60 ✓ | both equal fct_journeys.total_group_fare.
Enhanced graphical data model — Trip Lifecycle & Analytical Metrics
Solid arrows = referential FK relationships. Dashed arrows = analytical lineage (which fact tables feed which metric module).
Section 7 — SQL analysis for business units
Sample data (Journey J-500: airport convoy with 3 riders split across 2 vehicles)
INSERT INTO dim_users VALUES
(1,'R_1','rider', 'Gold', NULL),
(2,'R_2','rider', 'Silver', NULL),
(3,'R_3','rider', 'Silver', NULL),
(4,'D_Alpha','driver','Platinum', NULL),
(5,'D_Bravo','driver','Gold', NULL);
-- 1 journey
INSERT INTO fct_journeys
(journey_key, journey_id, requesting_user_id, split_type,
total_group_fare_usd, num_vehicles_dispatched, num_riders, journey_status, request_timestamp)
VALUES
(500,'J-500',1,'EQUAL_SPLIT', 60.00, 2, 3, 'completed', '2025-05-01 08:30:00');
-- 2 trips under J-500 (the convoy)
INSERT INTO fct_trips
(trip_key, trip_id, journey_id, driver_id, vehicle_id, trip_fare_share_usd,
surge_multiplier, terminal_state)
VALUES
(101,'T-101','J-500', 4, 99, 30.00, 1.0, 'completed'),
(102,'T-102','J-500', 5, 88, 30.00, 1.0, 'completed');
-- Trip events for both vehicles
INSERT INTO fct_trip_events VALUES
('e01','T-101','REQUESTED','2025-05-01 08:30:00','h3_a','rider'),
('e02','T-101','MATCHED', '2025-05-01 08:31:00','h3_a','system'),
('e03','T-101','PICKUP', '2025-05-01 08:35:00','h3_a','driver'),
('e04','T-101','DROPOFF', '2025-05-01 09:00:00','h3_z','driver'),
('e05','T-102','REQUESTED','2025-05-01 08:30:00','h3_a','rider'),
('e06','T-102','MATCHED', '2025-05-01 08:31:30','h3_a','system'),
('e07','T-102','PICKUP', '2025-05-01 08:35:00','h3_a','driver'), -- same cell as T-101 (legit convoy)
('e08','T-102','DROPOFF', '2025-05-01 09:01:00','h3_z','driver');
-- Bridge: 3 riders, 2 assigned to T-101, 1 assigned to T-102
INSERT INTO brg_journey_riders VALUES
('J-500', 1, 'T-101', 20.00, 1, 12.0),
('J-500', 2, 'T-101', 20.00, 2, 12.0),
('J-500', 3, 'T-102', 20.00, 1, 12.0);
-- Feedback: post-dropoff ratings + sentiment from NLP
INSERT INTO fct_trip_feedback VALUES
(9001,101,1,5,'driver_behavior', 'Great driver, smooth ride', 0.78,'2025-05-01 10:30:00'),
(9002,101,2,4,'route', 'Took a slight detour', 0.20,'2025-05-01 10:42:00'),
(9003,102,3,2,'vehicle_cleanliness', 'Car smelled bad', -0.65,'2025-05-01 11:05:00');
A. Operations · live ETA & lifecycle metrics
Q1 — System lag: per-state transition latency using gaps-and-islands
SELECT
trip_id,
state,
event_timestamp,
LAG(event_timestamp) OVER (PARTITION BY trip_id ORDER BY event_timestamp) AS prev_event_time,
EXTRACT(EPOCH FROM (
event_timestamp -
LAG(event_timestamp) OVER (PARTITION BY trip_id ORDER BY event_timestamp)
)) AS transition_latency_sec
FROM fct_trip_events
WHERE trip_id = 'T-101'
ORDER BY event_timestamp;
Result for T-101:
| state | event_timestamp | prev_event_time | transition_latency_sec |
|---|---|---|---|
| REQUESTED | 08:30:00 | NULL | NULL |
| MATCHED | 08:31:00 | 08:30:00 | 60 |
| PICKUP | 08:35:00 | 08:31:00 | 240 |
| DROPOFF | 09:00:00 | 08:35:00 | 1500 |
Use case: high transition_latency_sec on REQUESTED→MATCHED indicates "hot partitions" or processing delays — Ops can alert when the cell-level p95 exceeds threshold.
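The LAG-based latency calculation is just a pairwise difference over the ordered event list. A sketch of the same computation in plain Python over T-101's sample events, useful for sanity-checking the SQL:

```python
from datetime import datetime

# T-101's events from the sample data, already ordered by timestamp.
events = [
    ("REQUESTED", "2025-05-01 08:30:00"),
    ("MATCHED",   "2025-05-01 08:31:00"),
    ("PICKUP",    "2025-05-01 08:35:00"),
    ("DROPOFF",   "2025-05-01 09:00:00"),
]
ts = [datetime.fromisoformat(t) for _, t in events]

# Equivalent of LAG(...) OVER (PARTITION BY trip_id ORDER BY event_timestamp):
# each state keyed to the seconds elapsed since the previous state.
latency = {events[i][0]: (ts[i] - ts[i - 1]).total_seconds() for i in range(1, len(ts))}
assert latency == {"MATCHED": 60.0, "PICKUP": 240.0, "DROPOFF": 1500.0}
```

The dict matches the result table above: 60 s to match, 240 s to pickup, 1500 s of ride time.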
Q2 — Dispatch latency (p50 / p95) per H3 cell per hour
WITH latency AS (
SELECT
e.trip_id,
DATE_TRUNC('hour',
MIN(CASE WHEN state='REQUESTED' THEN event_timestamp END)) AS hour_bucket,
MAX(CASE WHEN state='REQUESTED' THEN location_h3 END) AS h3_cell,
EXTRACT(EPOCH FROM (
MAX(CASE WHEN state='MATCHED' THEN event_timestamp END)
- MAX(CASE WHEN state='REQUESTED' THEN event_timestamp END)
)) AS latency_sec
FROM fct_trip_events e
GROUP BY trip_id
HAVING MAX(CASE WHEN state='MATCHED' THEN 1 END) = 1
)
SELECT hour_bucket, h3_cell, COUNT(*) AS matched_trips,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY latency_sec) AS p50_sec,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_sec) AS p95_sec
FROM latency GROUP BY 1, 2 ORDER BY p95_sec DESC;
Q3 — Cancellation breakdown by actor
SELECT terminal_state, COUNT(*) AS cnt,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct_of_total
FROM fct_trips
GROUP BY terminal_state ORDER BY cnt DESC;
Q4 — Convoy efficiency (vehicles per journey, riders per vehicle)
SELECT
j.journey_id,
j.num_riders,
j.num_vehicles_dispatched,
ROUND(j.num_riders::DECIMAL / j.num_vehicles_dispatched, 2) AS riders_per_vehicle,
COUNT(DISTINCT t.trip_id) AS actual_trips,
COUNT(DISTINCT b.rider_id) AS riders_seen
FROM fct_journeys j
JOIN fct_trips t USING (journey_id)
JOIN brg_journey_riders b USING (journey_id)
GROUP BY j.journey_id, j.num_riders, j.num_vehicles_dispatched;
Result for J-500:
| journey_id | num_riders | num_vehicles | riders_per_vehicle | actual_trips | riders_seen |
|---|---|---|---|---|---|
| J-500 | 3 | 2 | 1.50 | 2 | 3 |
B. Finance · revenue reconciliation & driver payout
Q5 — Driver payout reconciliation (per-driver share vs riders' charges)
SELECT
t.driver_id,
t.trip_id,
t.trip_fare_share_usd AS revenue_basis,
SUM(b.individual_charge_usd) AS total_collected_from_riders,
COUNT(b.rider_id) AS riders_in_this_car
FROM fct_trips t
JOIN brg_journey_riders b ON t.trip_id = b.trip_id
GROUP BY t.driver_id, t.trip_id, t.trip_fare_share_usd
ORDER BY t.trip_id;
Result for J-500:
| driver_id | trip_id | revenue_basis | total_collected_from_riders | riders_in_this_car |
|---|---|---|---|---|
| D_Alpha (4) | T-101 | $30.00 | $40.00 | 2 |
| D_Bravo (5) | T-102 | $30.00 | $20.00 | 1 |
Critical insight: Driver Alpha's car collected $40 from riders but Alpha is paid only on $30 revenue_basis (the journey was split equally between drivers). The $10 gap goes to the platform as the "convoy coordination" fee. Without this two-grain design, you couldn't represent it.
Q6 — Pool fare allocation per rider (single-vehicle Pool, distance-prorata)
-- Same query, useful when split_type = DISTANCE_PRORATA
SELECT
b.journey_id,
b.rider_id,
b.trip_id,
b.pickup_sequence,
b.leg_distance_km,
ROUND(100.0 * b.leg_distance_km
/ SUM(b.leg_distance_km) OVER (PARTITION BY b.journey_id), 1) AS distance_share_pct,
b.individual_charge_usd,
j.total_group_fare_usd
FROM brg_journey_riders b
JOIN fct_journeys j USING (journey_id)
WHERE b.journey_id = 'J-500'
ORDER BY b.trip_id, b.pickup_sequence;
Q7 — Twin invariant data-quality check (continuous DQ)
-- Invariant 1: SUM(brg.individual_charge) per journey = fct_journeys.total_group_fare
-- Invariant 2: SUM(fct_trips.trip_fare_share) per journey = fct_journeys.total_group_fare
-- Pre-aggregate each side before joining: a bare SUM(DISTINCT) over the
-- journey × trips fan-out silently collapses equal shares (two $30 trips
-- would sum to $30, not $60) and fails the invariant incorrectly.
WITH trip_totals AS (
  SELECT journey_id, ROUND(SUM(trip_fare_share_usd), 2) AS sum_trip_shares
  FROM fct_trips
  GROUP BY journey_id
),
rider_totals AS (
  SELECT journey_id, ROUND(SUM(individual_charge_usd), 2) AS sum_rider_charges
  FROM brg_journey_riders
  GROUP BY journey_id
)
SELECT
  j.journey_id,
  j.total_group_fare_usd AS reported_total,
  t.sum_trip_shares,
  r.sum_rider_charges,
  CASE
    WHEN j.total_group_fare_usd != t.sum_trip_shares THEN 'FAIL_TRIP_INVARIANT'
    WHEN j.total_group_fare_usd != r.sum_rider_charges THEN 'FAIL_RIDER_INVARIANT'
    ELSE 'PASS'
  END AS dq_status
FROM fct_journeys j
JOIN trip_totals t USING (journey_id)
JOIN rider_totals r USING (journey_id);
-- Any FAIL row = producer regression. Required for SOX / audit.
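The twin-invariant check can be smoke-tested in miniature. A sqlite sketch over the J-500 sample rows, pre-aggregating each side so the journey-to-trips join can't fan out the sums:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fct_journeys (journey_id TEXT PRIMARY KEY, total_group_fare_usd REAL);
CREATE TABLE fct_trips (trip_id TEXT PRIMARY KEY, journey_id TEXT, trip_fare_share_usd REAL);
CREATE TABLE brg_journey_riders (journey_id TEXT, rider_id INTEGER, trip_id TEXT,
                                 individual_charge_usd REAL);
INSERT INTO fct_journeys VALUES ('J-500', 60.00);
INSERT INTO fct_trips VALUES ('T-101','J-500',30.00), ('T-102','J-500',30.00);
INSERT INTO brg_journey_riders VALUES
  ('J-500',1,'T-101',20.00), ('J-500',2,'T-101',20.00), ('J-500',3,'T-102',20.00);
""")
rows = con.execute("""
WITH trip_totals AS (
  SELECT journey_id, ROUND(SUM(trip_fare_share_usd), 2) AS s
  FROM fct_trips GROUP BY journey_id
),
rider_totals AS (
  SELECT journey_id, ROUND(SUM(individual_charge_usd), 2) AS s
  FROM brg_journey_riders GROUP BY journey_id
)
SELECT j.journey_id,
       CASE WHEN j.total_group_fare_usd != t.s THEN 'FAIL_TRIP_INVARIANT'
            WHEN j.total_group_fare_usd != r.s THEN 'FAIL_RIDER_INVARIANT'
            ELSE 'PASS' END AS dq_status
FROM fct_journeys j
JOIN trip_totals t USING (journey_id)
JOIN rider_totals r USING (journey_id)
""").fetchall()
assert rows == [("J-500", "PASS")]
```

Delete one bridge row and the same query flips to FAIL_RIDER_INVARIANT, which is exactly the producer-regression signal the DQ job alerts on.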
C. Risk · fraud detection
Q8 — Phantom Convoy detection (drivers claim convoy but aren't actually together)
-- Risk pattern: two drivers booked into the same journey but their PICKUP H3 cells
-- are far apart at the same moment. Either a producer bug or a fraud ring.
SELECT
e1.trip_id AS car_1,
e2.trip_id AS car_2,
e1.location_h3 AS loc_1,
e2.location_h3 AS loc_2,
ABS(EXTRACT(EPOCH FROM (e1.event_timestamp - e2.event_timestamp))) AS pickup_time_gap_sec
FROM fct_trip_events e1
JOIN fct_trips t1 ON e1.trip_id = t1.trip_id
JOIN fct_trips t2 ON t1.journey_id = t2.journey_id AND t1.trip_id < t2.trip_id
JOIN fct_trip_events e2 ON e2.trip_id = t2.trip_id
WHERE e1.state = 'PICKUP' AND e2.state = 'PICKUP'
AND e1.location_h3 != e2.location_h3 -- not in same H3 cell
AND ABS(EXTRACT(EPOCH FROM (e1.event_timestamp - e2.event_timestamp))) < 600; -- within 10 min
-- Any returned row = an investigation item.
For the J-500 sample data, both cars pick up at h3_a within 0 seconds of each other → no rows returned → legit convoy.
Q9 — Multi-rider capacity check (over-capacity vehicle)
SELECT
t.trip_id,
v.model,
v.capacity AS vehicle_capacity,
COUNT(b.rider_id) AS passenger_count
FROM fct_trips t
JOIN brg_journey_riders b ON t.trip_id = b.trip_id
JOIN dim_vehicles v ON t.vehicle_id = v.vehicle_key
GROUP BY t.trip_id, v.model, v.capacity
HAVING COUNT(b.rider_id) > v.capacity;
-- More riders assigned than seats = either a producer bug or off-app riders (driver fraud).
Q10 — Location anomaly: request vs first-pickup divergence
WITH first_pickup AS (
SELECT trip_id, location_h3 AS pickup_h3
FROM fct_trip_events
WHERE state = 'PICKUP'  -- sample data logs pickups as PICKUP events
)
SELECT
t.trip_key, t.trip_id, t.origin_h3_key, fp.pickup_h3,
CASE WHEN t.origin_h3_key <> fp.pickup_h3 THEN 'INVESTIGATE' ELSE 'OK' END AS risk_flag
FROM fct_trips t
LEFT JOIN first_pickup fp USING (trip_id)
WHERE t.terminal_state = 'completed';
D. Product · CSAT, sentiment & growth accounting
Q11 — Trip rating distribution (1–5 stars) per driver, last 28 days
SELECT
t.driver_id,
COUNT(f.feedback_key) AS total_ratings,
ROUND(AVG(f.rating_score), 2) AS avg_rating,
COUNT(*) FILTER (WHERE f.rating_score <= 2) AS low_ratings,
ROUND(100.0 * COUNT(*) FILTER (WHERE f.rating_score <= 2)
/ NULLIF(COUNT(*), 0), 1) AS pct_low_ratings
FROM fct_trips t
JOIN fct_trip_feedback f ON f.trip_key = t.trip_key
WHERE f.feedback_timestamp >= CURRENT_DATE - INTERVAL '28 days'
GROUP BY t.driver_id
ORDER BY avg_rating ASC;
-- Drivers with pct_low_ratings > 5% are flagged for retraining or offboarding.
Q12 — Rolling 7d & 28d CSAT (% of ratings ≥ 4)
WITH daily_ratings AS (
SELECT DATE(feedback_timestamp) AS d,
COUNT(*) AS n_ratings,
COUNT(*) FILTER (WHERE rating_score >= 4) AS n_satisfied
FROM fct_trip_feedback
GROUP BY 1
)
SELECT d,
ROUND(100.0 * SUM(n_satisfied) OVER (ORDER BY d ROWS 6 PRECEDING)
/ NULLIF(SUM(n_ratings) OVER (ORDER BY d ROWS 6 PRECEDING), 0), 1) AS csat_7d,
ROUND(100.0 * SUM(n_satisfied) OVER (ORDER BY d ROWS 27 PRECEDING)
/ NULLIF(SUM(n_ratings) OVER (ORDER BY d ROWS 27 PRECEDING), 0), 1) AS csat_28d
FROM daily_ratings
ORDER BY d;
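The `ROWS 6 PRECEDING` frame is just a trailing window over ordered daily totals. A plain-Python sketch of the same rolling-CSAT logic (satisfied = ratings ≥ 4, per the query above):

```python
def rolling_csat(daily, window):
    """daily: list of (n_ratings, n_satisfied) per day, ordered by date.
    Returns the trailing-window CSAT %, mirroring SUM(...) OVER (ROWS k PRECEDING)."""
    out = []
    for i in range(len(daily)):
        w = daily[max(0, i - window + 1): i + 1]   # (window-1) PRECEDING .. CURRENT ROW
        n = sum(r for r, _ in w)
        s = sum(sat for _, sat in w)
        out.append(round(100.0 * s / n, 1) if n else None)
    return out

# Two days of ratings: day 1 → 9/10 satisfied, day 2 → 7/10 satisfied.
assert rolling_csat([(10, 9), (10, 7)], window=7) == [90.0, 80.0]
```

Note the sum-of-sums form: averaging the daily percentages instead (90 and 70 here) would weight low-volume days equally with high-volume ones.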
Q13 — Sentiment by feedback category (NLP signal × free-text)
SELECT
feedback_category,
COUNT(*) AS n_comments,
ROUND(AVG(sentiment_score), 3) AS avg_sentiment,
ROUND(AVG(rating_score), 2) AS avg_stars,
COUNT(*) FILTER (WHERE sentiment_score < -0.3) AS n_strongly_negative
FROM fct_trip_feedback
WHERE feedback_timestamp >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY feedback_category
ORDER BY avg_sentiment ASC;
-- Surfaces what's driving the dissatisfaction — vehicle, route, app, payment, driver.
Q14 — Growth accounting: L1 / L7 / L28 active rider counts
WITH user_activity AS (
SELECT b.rider_id,
MAX(j.request_timestamp) AS last_active_ts
FROM brg_journey_riders b
JOIN fct_journeys j USING (journey_id)
GROUP BY b.rider_id
)
SELECT
COUNT(*) FILTER (WHERE last_active_ts >= CURRENT_TIMESTAMP - INTERVAL '1 day') AS l1_active,
COUNT(*) FILTER (WHERE last_active_ts >= CURRENT_TIMESTAMP - INTERVAL '7 days') AS l7_active,
COUNT(*) FILTER (WHERE last_active_ts >= CURRENT_TIMESTAMP - INTERVAL '28 days') AS l28_active,
COUNT(*) AS total_riders
FROM user_activity;
-- Standard L1/L7/L28 active KPIs from Reforge growth-accounting framework.
Worked numerical example — Convoy J-500 (EQUAL_SPLIT, $60 group fare)
- 5 friends call a group ride. 3 are billable riders (R_1, R_2, R_3); the others are children covered under R_1's account. Group fare quoted: $60.00.
- System dispatches 2 vehicles because the group exceeds single-car capacity → `fct_journeys.num_vehicles_dispatched = 2`.
- Driver Alpha's car (T-101) takes 2 riders (R_1, R_2). Driver Bravo's car (T-102) takes 1 rider (R_3).
- Per-rider charge (EQUAL_SPLIT): $60 / 3 = $20.00 per rider. Stored in `brg_journey_riders.individual_charge_usd`.
- Per-driver share: $60 / 2 = $30.00 per driver. Stored in `fct_trips.trip_fare_share_usd`.
- Invariant 1 (rider side): SUM(`brg.individual_charge`) = $20 × 3 = $60 ✓
- Invariant 2 (driver side): SUM(`fct_trips.trip_fare_share`) = $30 × 2 = $60 ✓
- Note the asymmetry: Driver Alpha's car collected $40 in charges from his 2 riders, but Alpha's payout basis is $30 — the platform retains $10 as a convoy-coordination fee. Driver Bravo's car collected $20 from his 1 rider but is paid $30 — the platform covers the $10 gap. This is fair: the journey was equally split between drivers regardless of which car each rider sat in. Without the two-grain design (trip share AND rider charge), you couldn't represent it.
Section 8 — Why this works for the convoy scenario
- Unique driver accounts. By placing `driver_id` on `fct_trips` (not `fct_journeys`), the model honors the rule that a (driver, journey) combination is unique — the same driver can't take two trips for one journey, but two drivers can share one journey.
- Cost flexibility. The bridge supports multi-destination and DISTANCE_PRORATA logic. If Rider 1 exits at Mile 5 and Rider 2 at Mile 10, `individual_charge_usd` is computed pro-rata using `leg_distance_km` captured from `fct_trip_events` — same model, different `split_type`.
- Auditability. Finance traces every cent from a rider's wallet through `brg_journey_riders` directly to the specific driver's `fct_trips` payout. The two continuous invariants (Q7) catch producer bugs before they hit a quarterly SOX report.
- Scalability. "Pool for 4", "Bus", "Wedding party of 12" — no schema migration. Just more rows in `brg_journey_riders`; the journey grain is unchanged.
- Eliminates granularity mismatch. The 3-tier hierarchy (journey, trip, event) lets each stakeholder query at their natural grain without filters or self-joins.
Section 9 — Business model evolution: single-sided → two-sided → three-sided → driverless
The Uber data model isn't static — it's shaped by the platform's market structure, and that structure is shifting. As Uber harnesses platform leverage, AI, and driverless technologies, the rider may eventually hail a ride that involves no human driver at all. A senior architect should be able to describe how the data model morphs at each platform shape, and what the driverless future demands.
The driverless future — what changes when the human driver disappears?
As autonomous vehicles (AVs) become a real share of the fleet, the data model has to absorb three structural shifts:
- The driver dimension shrinks. `dim_users` becomes mostly riders. Drivers don't go away entirely — they become safety operators / remote-fleet teleoperators — but most rows in `fct_trips` won't have a human `driver_id`.
- The vehicle dimension explodes. `dim_vehicles` evolves into `dim_av_fleet` with software version, sensor calibration, ODD (Operational Design Domain) restrictions, last safety-test pass, etc. — all SCD2 because they drift.
- A whole new fact table appears for safety telemetry. Every disengagement, every remote intervention, every edge-case recognition gets logged. Regulators require it; insurance audits depend on it.
Q15 — AV safety metric: miles per disengagement (the core regulator KPI)
WITH av_miles AS (
SELECT
a.software_version,
SUM(t.trip_distance_km) * 0.621371 AS total_miles
FROM fct_trips t
JOIN dim_av_fleet a ON t.av_key = a.av_key
WHERE t.terminal_state = 'completed'
AND t.request_timestamp >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY a.software_version
),
disengagements AS (
SELECT
a.software_version,
COUNT(*) FILTER (WHERE s.event_type = 'DISENGAGEMENT') AS n_disengagements,
COUNT(*) FILTER (WHERE s.event_type = 'REMOTE_INTERVENTION') AS n_interventions
FROM fct_av_safety_event s
JOIN dim_av_fleet a ON s.av_key = a.av_key
WHERE s.event_timestamp >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY a.software_version
)
SELECT
m.software_version,
ROUND(m.total_miles, 0) AS total_miles,
d.n_disengagements,
d.n_interventions,
ROUND(m.total_miles / NULLIF(d.n_disengagements, 0)) AS miles_per_disengagement,
ROUND(m.total_miles / NULLIF(d.n_interventions, 0)) AS miles_per_intervention
FROM av_miles m JOIN disengagements d USING (software_version)
ORDER BY miles_per_disengagement DESC;
-- Higher MPD = safer software. Required for NHTSA 2027 quarterly safety filings.
Q16 — Software version A/B: did v3.2 reduce disengagement rate vs v3.1?
WITH miles AS (
    SELECT a.software_version,
           SUM(t.trip_distance_km * 0.621371) AS miles
    FROM fct_trips t
    JOIN dim_av_fleet a ON t.av_key = a.av_key
    WHERE a.software_version IN ('v3.1', 'v3.2')
    GROUP BY a.software_version
),
disengagements AS (
    -- Aggregated separately: joining events directly onto trips would
    -- duplicate each trip once per safety event and inflate the miles sum.
    SELECT a.software_version,
           COUNT(*) FILTER (WHERE s.event_type = 'DISENGAGEMENT') AS disengagements
    FROM fct_av_safety_event s
    JOIN dim_av_fleet a ON s.av_key = a.av_key
    WHERE a.software_version IN ('v3.1', 'v3.2')
    GROUP BY a.software_version
)
SELECT
    m.software_version,
    ROUND(d.disengagements::DECIMAL / NULLIF(m.miles, 0) * 1000000, 2) AS disengagements_per_mil_miles
FROM miles m
JOIN disengagements d USING (software_version);
-- Compare across versions; statistical-significance test runs in the model card.
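A subtle hazard in version-comparison queries like Q16: joining fct_trips directly to a multi-row safety-event table fans each trip out once per event, inflating any SUM over trip columns. A minimal sketch of the failure and the fix, using a hypothetical two-table mini-schema in SQLite (Python standing in for the warehouse so it runs end to end):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fct_trips (trip_id TEXT, distance_km REAL);
CREATE TABLE fct_av_safety_event (trip_id TEXT, event_type TEXT);
INSERT INTO fct_trips VALUES ('t1', 10.0), ('t2', 20.0);
-- t1 has TWO safety events -> it appears twice in a naive join
INSERT INTO fct_av_safety_event VALUES
  ('t1', 'DISENGAGEMENT'), ('t1', 'REMOTE_INTERVENTION');
""")

# Naive join: t1's 10 km is counted once per event (10*2 + 20 = 40).
naive_km = con.execute("""
  SELECT SUM(t.distance_km)
  FROM fct_trips t
  LEFT JOIN fct_av_safety_event s ON s.trip_id = t.trip_id
""").fetchone()[0]

# Correct: aggregate each fact at its own grain, then join the aggregates.
correct_km = con.execute("SELECT SUM(distance_km) FROM fct_trips").fetchone()[0]
```

This is why the miles CTE and the disengagements CTE must each be aggregated at their own grain before joining.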
"fct_trips doesn't care if the driver is a human or a piece of software — it cares about request → completion and the fare share. What changes is the vehicle dimension (now SCD2 on software version) and the safety fact (regulator-mandated). The Journey → Trip → Event hierarchy is platform-shape-agnostic — it works for single-sided fleets, two-sided marketplaces, three-sided delivery, and driverless robotaxis without re-architecting."
Section 10 — Working with a real Uber dataset (148K bookings · Delhi NCR · 2024)
To make the practical SQL drillable on real data, the model maps cleanly to a public Kaggle-derived dataset: github.com/hannahsutton1/uber.data — 148,770 booking rows from Delhi NCR for calendar year 2024. Repo includes raw + cleaned CSVs plus the cleaning Jupyter notebooks.
Source schema (single wide table, 21 columns)
Date, Time, booking_ID, Booking_Status, Customer_ID, Vehicle_Type,
Pickup_Location, Drop_Location, Avg_VTAT, Avg_CTAT,
Cancelled_Rides_by_Customer, Reason_for_cancelling_by_Customer,
Cancelled_Rides_by_Driver, Driver_Cancellation_Reason,
Incomplete_Rides, Incomplete_Rides_Reason,
Booking_Value, Ride_Distance, Driver_Ratings, Customer_Rating, Payment_Method
Glossary: VTAT = vehicle-to-pickup minutes (driver arrival time). CTAT = customer-to-completion minutes (in-trip duration). Booking_Value = fare in INR. Ride_Distance in km. Ratings on a 1–5 scale.
Source → Target mapping (ETL plan)
| Source column(s) | Target table.column | Notes |
|---|---|---|
| booking_ID | fct_trips.trip_id | Natural key; idempotency anchor. |
| Date + Time | fct_trips.request_timestamp | Concatenate to TIMESTAMP_TZ (Asia/Kolkata). |
| Customer_ID | fct_trips.rider_id + dim_users | Single rider per booking. |
| Vehicle_Type | dim_vehicles.type | Enum {Auto, Go Mini, Go Sedan, Premier Sedan, UberXL, Bike, eBike}. |
| Pickup_Location, Drop_Location | fct_trips.origin_h3, destination_h3 | Geocode locations → H3 cells (post-load enrichment). |
| Booking_Status | fct_trips.terminal_state | Map: {Completed, Cancelled by Customer, Cancelled by Driver, Incomplete, No Driver Found}. |
| Avg_VTAT | fct_trip_events (derived: MATCHED → PICKUP latency) | Reverse-engineer the events table from VTAT/CTAT. |
| Avg_CTAT | fct_trip_events (derived: PICKUP → DROPOFF latency) | |
| Booking_Value | fct_trips.total_fare_local (currency = INR) | FX to USD via dim_fx_rate. |
| Ride_Distance | fct_trips.trip_distance_km | |
| Driver_Ratings, Customer_Rating | fct_trip_feedback.rating_score (typed by rater) | Two rows per trip (driver→rider, rider→driver). |
| Reason_for_cancelling_by_Customer | fct_trip_feedback.feedback_category | For cancelled trips. |
| Driver_Cancellation_Reason | fct_trips.cancellation_reason | Free text → categorical via NLP at ETL. |
| Payment_Method | dim_payment_methods | Enum {Cash, UPI, Debit Card, Credit Card, Wallet}. |
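The Date + Time → request_timestamp row in the mapping can be sketched as a small transform. Assumptions: the source strings are 'YYYY-MM-DD' and 'HH:MM:SS', and the stdlib zoneinfo database supplies Asia/Kolkata:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def to_request_timestamp(date_str: str, time_str: str) -> datetime:
    """Concatenate the source Date and Time columns, then localize to
    Asia/Kolkata so downstream FX and windowing logic sees an unambiguous
    instant (hypothetical helper; source formats assumed)."""
    naive = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=ZoneInfo("Asia/Kolkata"))

# First sample booking from the dataset: CNR5884300
ts = to_request_timestamp("2024-03-23", "12:29:38")
```

In a real load this runs in the warehouse (e.g. a TIMESTAMP concat + AT TIME ZONE), but the semantics are the same.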
Real sample rows (5 representative bookings from the dataset)
-- Loaded into fct_trips after ETL transformation:
INSERT INTO fct_trips (trip_id, customer_id, vehicle_type, pickup_loc, drop_loc,
request_ts, terminal_state, vtat_min, ctat_min,
booking_value_inr, ride_distance_km, payment_method)
VALUES
('CNR5884300','CID1982111','eBike', 'Palam Vihar', 'Jhilmil',
'2024-03-23 12:29:38','no_driver_found', NULL, NULL, NULL, NULL, NULL),
('CNR8494506','CID9202816','Auto', 'Khandsa', 'Malviya Nagar',
'2024-08-23 08:56:10','completed', 13.4, 25.8, 627.00, 13.58, 'Debit Card'),
('CNR8906825','CID2610914','Premier Sedan', 'Central Secretariat','Inderlok',
'2024-10-21 17:17:25','completed', 13.1, 28.5, 416.00, 34.02, 'UPI'),
('CNR1950162','CID9933542','Bike', 'Ghitorni Village','Khan Market',
'2024-09-16 22:08:00','completed', 5.3, 19.6, 737.00, 48.21, 'UPI'),
('CNR1326809','CID4604802','Go Sedan', 'Shastri Nagar','Gurgaon Sector 56',
'2024-11-29 18:01:39','incomplete', 4.9, 14.0, 237.00, 5.73, 'UPI');
Q17 — Vehicle type mix & cancellation rate (real-data analysis)
SELECT
vehicle_type,
COUNT(*) AS bookings,
COUNT(*) FILTER (WHERE terminal_state = 'completed') AS completed,
COUNT(*) FILTER (WHERE terminal_state LIKE 'cancelled%') AS cancelled,
COUNT(*) FILTER (WHERE terminal_state = 'no_driver_found') AS no_driver,
ROUND(100.0 * COUNT(*) FILTER (WHERE terminal_state = 'completed')
/ COUNT(*), 1) AS completion_pct,
ROUND(AVG(booking_value_inr) FILTER (WHERE terminal_state = 'completed'), 0) AS avg_fare_inr,
ROUND(AVG(ride_distance_km) FILTER (WHERE terminal_state = 'completed'), 1) AS avg_distance_km
FROM fct_trips
WHERE DATE_TRUNC('year', request_ts) = '2024-01-01'
GROUP BY vehicle_type
ORDER BY bookings DESC;
Typical result on the 148K dataset:
| vehicle_type | bookings | completed | completion_pct | avg_fare_inr | avg_distance_km |
|---|---|---|---|---|---|
| Auto | ~37,400 | ~23,200 | 62% | ~410 | ~22.5 |
| Go Sedan | ~31,800 | ~19,800 | 62% | ~520 | ~24.1 |
| Premier Sedan | ~21,500 | ~13,300 | 62% | ~610 | ~24.0 |
| Bike | ~18,900 | ~11,700 | 62% | ~390 | ~22.7 |
| eBike | ~14,600 | ~9,000 | 62% | ~395 | ~22.8 |
| Go Mini | ~12,300 | ~7,600 | 62% | ~440 | ~23.0 |
| UberXL | ~12,200 | ~7,500 | 62% | ~720 | ~24.5 |
Insight: completion rate is ~62% across all vehicle types — the bottleneck isn't the vehicle, it's cancellation behavior (next query).
Q18 — Top 10 routes by booking volume (Pickup → Drop)
SELECT
pickup_loc,
drop_loc,
COUNT(*) AS trips,
ROUND(AVG(booking_value_inr) FILTER (WHERE terminal_state = 'completed'), 0) AS avg_fare,
ROUND(AVG(ride_distance_km) FILTER (WHERE terminal_state = 'completed'), 1) AS avg_dist,
ROUND(100.0 * COUNT(*) FILTER (WHERE terminal_state = 'completed')
/ COUNT(*), 1) AS completion_pct
FROM fct_trips
GROUP BY pickup_loc, drop_loc
ORDER BY trips DESC
LIMIT 10;
Q19 — Cancellation reason breakdown (driver vs customer)
WITH driver_cancels AS (
SELECT 'driver' AS cancelled_by, driver_cancellation_reason AS reason, COUNT(*) AS n
FROM fct_trips
WHERE terminal_state = 'cancelled_driver' AND driver_cancellation_reason IS NOT NULL
GROUP BY driver_cancellation_reason
),
customer_cancels AS (
SELECT 'customer' AS cancelled_by, reason_for_cancelling_by_customer AS reason, COUNT(*) AS n
FROM fct_trips
WHERE terminal_state = 'cancelled_rider' AND reason_for_cancelling_by_customer IS NOT NULL
GROUP BY reason_for_cancelling_by_customer
)
SELECT cancelled_by, reason, n,
ROUND(100.0 * n / SUM(n) OVER (PARTITION BY cancelled_by), 1) AS pct_within_actor
FROM (SELECT * FROM driver_cancels UNION ALL SELECT * FROM customer_cancels) u
ORDER BY cancelled_by, n DESC;
Use case: if "Vehicle Breakdown" tops the driver list, fleet maintenance owns the fix; if "Driver took too long" tops the customer list, dispatch tuning owns it.
Q20 — Hourly demand pattern + completion rate (peak-hour stress)
SELECT
EXTRACT(HOUR FROM request_ts) AS hour_of_day,
COUNT(*) AS bookings,
COUNT(*) FILTER (WHERE terminal_state = 'no_driver_found') AS no_driver_found,
ROUND(100.0 * COUNT(*) FILTER (WHERE terminal_state = 'no_driver_found')
/ COUNT(*), 2) AS no_driver_pct,
ROUND(100.0 * COUNT(*) FILTER (WHERE terminal_state = 'completed')
/ COUNT(*), 1) AS completion_pct,
ROUND(AVG(vtat_min) FILTER (WHERE terminal_state = 'completed'), 1) AS avg_vtat_min
FROM fct_trips
GROUP BY EXTRACT(HOUR FROM request_ts)
ORDER BY hour_of_day;
-- The hours where no_driver_pct spikes are the operational gaps where supply incentives matter most.
Q21 — Payment method mix by vehicle tier (premium vs economy)
SELECT
CASE WHEN vehicle_type IN ('Premier Sedan','UberXL') THEN 'premium'
WHEN vehicle_type IN ('Go Sedan','Go Mini') THEN 'mid'
ELSE 'economy'
END AS tier,
payment_method,
COUNT(*) AS trips,
ROUND(100.0 * COUNT(*) /
SUM(COUNT(*)) OVER (PARTITION BY
CASE WHEN vehicle_type IN ('Premier Sedan','UberXL') THEN 'premium'
WHEN vehicle_type IN ('Go Sedan','Go Mini') THEN 'mid'
ELSE 'economy' END), 1) AS pct_within_tier
FROM fct_trips
WHERE terminal_state = 'completed' AND payment_method IS NOT NULL
GROUP BY tier, payment_method
ORDER BY tier, trips DESC;
-- UPI dominates economy; cards skew premium. Drives partnership and incentive design.
What this dataset does not support (caveats — for honest interviews)
- No driver_id, no vehicle_id. Only Vehicle_Type as a categorical. Driver-side analytics (utilization, payouts) need a synthetic driver_id assignment or a different dataset.
- No surge multiplier, no fare breakdown. Only the final Booking_Value. Surge analysis (Scenario 2) needs synthetic data.
- No lat/lon, no H3 cells. Only Delhi NCR neighborhood names. Geocode at ETL using h3-py or Mapbox.
- No journey or multi-leg concept. One rider per booking — convoy / Pool / shared rides are not in scope. Use synthetic data for those examples.
- No GPS trace. fct_trip_events can be reverse-engineered from VTAT/CTAT into approximate REQUESTED, MATCHED, PICKUP, DROPOFF timestamps, but the geographic events (location_h3 per state) are unavailable.
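The VTAT/CTAT reverse-engineering in the last caveat can be sketched directly. Only three approximate timestamps are recoverable (MATCHED and all geography stay unknown); this hedged sketch applies VTAT from the request timestamp, which is an assumption, since VTAT strictly measures driver arrival:

```python
from datetime import datetime, timedelta

def derive_events(request_ts: datetime, vtat_min: float, ctat_min: float) -> dict:
    """Approximate trip events from the two latency columns:
    PICKUP ~= request + VTAT minutes, DROPOFF ~= PICKUP + CTAT minutes."""
    pickup = request_ts + timedelta(minutes=vtat_min)
    dropoff = pickup + timedelta(minutes=ctat_min)
    return {"REQUESTED": request_ts, "PICKUP": pickup, "DROPOFF": dropoff}

# Sample booking CNR8494506: requested 08:56:10, VTAT 13.4 min, CTAT 25.8 min
events = derive_events(datetime(2024, 8, 23, 8, 56, 10), 13.4, 25.8)
```

Each derived row then lands in fct_trip_events at the state-transition grain, flagged as derived rather than observed.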
Section 11 — Mass-departure scale: 20K+ people, road blocks & growth accounting
An interview prompt sometimes raises the stakes: "Architect the system to handle a 20,000-person stadium event — accidents, road blocks, supply/demand volatility, all at once." The existing Journey → Trip → Event hierarchy holds, but three additional concerns must be modeled explicitly: infrastructure as a fact (SCD2 geography), growth accounting on both sides of the marketplace, and point-in-time financial integrity under chaos.
11.1 — Re-stated grain hierarchy under stress
| Table | Grain | Purpose |
|---|---|---|
| fct_journeys | One row per group request | Captures intent (e.g. a group of 6 leaving the stadium together). |
| fct_trips | One row per (driver/vehicle × journey) | Vehicle-grain truth for driver payouts & insurance. |
| fct_trip_events | One row per state transition / telemetry sample | Tracks match & pickup latencies + GPS replays. |
| brg_journey_riders | One row per (rider × journey) | Financial bridge — many-to-many cost splits. |
| dim_geography (SCD2) | One row per H3 cell × state change | NEW emphasis: infrastructure as a versioned dim. Tracks "as-of-trip" road status (one-way, accident, blocked). |
11.2 — Growth accounting on both sides of the marketplace
To know whether a 20K-person event is being fueled by new users or resurrected ones (and which drivers are churning), every user is segmented daily.
| Segment | Definition | What it tells the business |
|---|---|---|
| New | First-time activity in this window. | Acquisition is working. |
| Retained | Active in previous period AND active now. | Core ecosystem is healthy. |
| Resurrected | Active now but inactive in the previous period. | Re-engagement campaign / event pulled them back. |
| Churned | Active in previous period, did not show up. | Supply gap risk — especially among drivers. |
Q22 — L1 / L7 / L28 active status per user (daily refresh)
SELECT
user_id,
user_role, -- 'rider' | 'driver'
MAX(CASE WHEN activity_date >= CURRENT_DATE - 1 THEN 1 ELSE 0 END) AS is_L1,
MAX(CASE WHEN activity_date >= CURRENT_DATE - 7 THEN 1 ELSE 0 END) AS is_L7,
MAX(CASE WHEN activity_date >= CURRENT_DATE - 28 THEN 1 ELSE 0 END) AS is_L28
FROM dw_user_daily_activity
GROUP BY user_id, user_role;
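The L1/L7/L28 logic in Q22 reduces to a window test per user; a pure-Python restatement (hypothetical helper, activity dates as stdlib date objects) makes the flag semantics easy to verify:

```python
from datetime import date, timedelta

def activity_flags(activity_dates, today):
    """Return the L1/L7/L28 active flags for one user: 1 if any activity
    falls within the trailing 1-, 7-, or 28-day window ending at `today`."""
    def active_within(days):
        return int(any(d >= today - timedelta(days=days) for d in activity_dates))
    return {"is_L1": active_within(1),
            "is_L7": active_within(7),
            "is_L28": active_within(28)}

# A rider last active 5 days ago: not L1, but L7 and L28.
flags = activity_flags([date(2025, 4, 26)], today=date(2025, 5, 1))
```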
Q23 — Growth segmentation: New / Retained / Resurrected / Churned (daily diff)
WITH today AS (
SELECT DISTINCT user_id, user_role FROM dw_user_daily_activity
WHERE activity_date = CURRENT_DATE - 1 -- yesterday's activity
),
prior AS (
SELECT DISTINCT user_id, user_role FROM dw_user_daily_activity
WHERE activity_date BETWEEN CURRENT_DATE - 8 AND CURRENT_DATE - 2 -- prior 7-day window
),
ever AS (
SELECT DISTINCT user_id, user_role FROM dw_user_daily_activity
WHERE activity_date < CURRENT_DATE - 8 -- before prior window
)
SELECT user_role,
COUNT(*) FILTER (WHERE t.user_id IS NOT NULL AND p.user_id IS NULL AND e.user_id IS NULL) AS new_users,
COUNT(*) FILTER (WHERE t.user_id IS NOT NULL AND p.user_id IS NOT NULL) AS retained,
COUNT(*) FILTER (WHERE t.user_id IS NOT NULL AND p.user_id IS NULL AND e.user_id IS NOT NULL) AS resurrected,
COUNT(*) FILTER (WHERE t.user_id IS NULL AND p.user_id IS NOT NULL) AS churned
FROM today t
FULL OUTER JOIN prior p USING (user_id, user_role)
LEFT JOIN ever e USING (user_id, user_role)
GROUP BY user_role;
-- Driver "churned" spike before a stadium event = supply crisis incoming. Marketing fires re-engagement push.
11.3 — Mass departure with road blocks: SCD2 geography in action
When an accident or police closure occurs while the stadium empties, dim_geography records the change as a new SCD2 row. The old route status remains queryable; the status in effect for any given trip is resolved "as-of" that trip's request.
| geo_key | h3_index | route_status | valid_from | valid_to |
|---|---|---|---|---|
| 7001 | Stadium_Hex | Normal | 2026-05-07 08:00 | 2026-05-07 22:00 |
| 7002 | Stadium_Hex | Accident / Blocked | 2026-05-07 22:00 | 2026-05-08 02:00 |
| 7003 | Stadium_Hex | Normal | 2026-05-08 02:00 | 9999-12-31 |
Q24 — Correlate pickup latency to road status (was the spike supply or infrastructure?)
SELECT
g.h3_index,
g.route_status,
COUNT(t.trip_id) AS total_requests,
ROUND(AVG(EXTRACT(EPOCH FROM (e_pic.event_timestamp - e_req.event_timestamp))), 1) AS avg_pickup_latency_sec,
ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (
ORDER BY EXTRACT(EPOCH FROM (e_pic.event_timestamp - e_req.event_timestamp))), 1) AS p95_pickup_latency_sec
FROM fct_trips t
JOIN fct_trip_events e_req ON t.trip_id = e_req.trip_id AND e_req.state = 'REQUESTED'
JOIN fct_trip_events e_pic ON t.trip_id = e_pic.trip_id AND e_pic.state = 'PICKUP'
JOIN dim_geography g
ON t.pickup_h3_cell = g.h3_index
AND e_req.event_timestamp >= g.valid_from
AND e_req.event_timestamp < g.valid_to -- as-of-request status
GROUP BY g.h3_index, g.route_status
ORDER BY p95_pickup_latency_sec DESC;
-- "Accident / Blocked" rows with high p95 = blame infrastructure, not driver supply.
-- "Normal" rows with high p95 = blame supply, surge it up.
11.4 — Financial & risk integrity under chaos
Even when a 20K event triggers cascading surges + accidents + cancellations, three SCD2 rules guarantee restate-ability:
- FX rate locked at request_timestamp. dim_fx_rate is SCD2; the rate that was current at the moment the journey was requested is the rate the rider is charged in USD, regardless of how the rate moves while the trip is in progress. Prevents arbitrage and revenue restatement.
- Insurance status as-of accident timestamp. dim_vehicles is SCD2 on insurance. If a crash happens at 22:14, Risk queries dim_vehicles with valid_from ≤ 22:14 < valid_to — the exact policy in effect to the millisecond.
- Surge multiplier locked in fct_trips at journey request. Even if surge spikes from 1.5× to 3.0× while the rider is being matched, the rider pays the multiplier that was active at request. Stored on the row, not derived.
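The first rule (FX locked at request_timestamp) is an as-of lookup against SCD2 rows. A minimal sketch with made-up INR/USD rates and half-open [valid_from, valid_to) intervals, matching the infinity-date convention used throughout:

```python
from datetime import datetime

# Hypothetical dim_fx_rate SCD2 rows: (valid_from, valid_to, INR per USD)
FX_ROWS = [
    (datetime(2026, 5, 7, 0, 0),  datetime(2026, 5, 7, 12, 0), 83.10),
    (datetime(2026, 5, 7, 12, 0), datetime(9999, 12, 31),      83.45),
]

def rate_as_of(ts: datetime) -> float:
    """Return the rate whose validity interval covers ts (half-open)."""
    for valid_from, valid_to, rate in FX_ROWS:
        if valid_from <= ts < valid_to:
            return rate
    raise LookupError("no FX row covers this timestamp")

# Journey requested at 11:59: the 83.10 rate is locked, even if the trip
# completes after the 12:00 rate change.
locked = rate_as_of(datetime(2026, 5, 7, 11, 59))
```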
Q25 — Insurance audit: was the vehicle insured at the moment of the accident?
SELECT
t.trip_id,
t.vehicle_id,
e.event_timestamp AS accident_ts,
v.insurance_policy_id,
v.insurance_status,
v.valid_from AS policy_active_from,
v.valid_to AS policy_active_to,
CASE WHEN v.insurance_status = 'active' THEN 'COVERED' ELSE 'GAP' END AS audit_verdict
FROM fct_trip_events e
JOIN fct_trips t ON e.trip_id = t.trip_id
JOIN dim_vehicles v ON t.vehicle_id = v.vehicle_id
AND e.event_timestamp >= v.valid_from
AND e.event_timestamp < v.valid_to
WHERE e.state = 'ACCIDENT_REPORTED';
-- Required for insurance claims + regulator audits. Single-row defensible answer.
11.5 — Scalability rules under load (the engineering details)
| Rule | Why |
|---|---|
| Infinity date 9999-12-31 instead of NULL for valid_to | Range joins like e_req.ts BETWEEN g.valid_from AND g.valid_to are far more performant than ... AND (g.valid_to IS NULL OR g.valid_to > e_req.ts). Optimizers handle BETWEEN cleanly; OR + IS NULL kills predicate pushdown. |
| H3 sharding by zoom level — L8 (~0.7 km²) for cities, L10 (~15K m²) for stadium event zones | Stadium hexes get 100× the requests/min of suburb hexes. Going to L10 inside venues distributes the load across many smaller hexes — the "hot partition" stops being the entire stadium and becomes per-gate. |
| Surge locking in fct_trips.surge_multiplier_at_request | Rider's quoted price = driver's earning potential = revenue booked. All three numbers must agree forever. Locked at journey request, never recomputed. |
| Per-partition watermarks on fct_trip_events | One stadium-area producer with bad clock skew shouldn't stall the entire event-time pipeline. A per-partition watermark contains the damage to its own H3 cell. |
| Pre-event materialization of dim_geography for the venue area | Police-coordinated road closures are known hours in advance. Pre-write the SCD2 transitions so they're queryable from minute zero of the event. |
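The per-partition watermark rule can be sketched as a tiny state machine: each H3 cell tracks its own high-water mark, so one skewed producer affects only its own cell, whereas a single global watermark would stall at the slowest cell. A toy model (max-observed event time per partition, advancing monotonically; assumed simplification of real watermarking):

```python
class PartitionWatermarks:
    """Track an event-time watermark per partition (here, per H3 cell)."""

    def __init__(self):
        self.low = {}  # partition -> highest event time observed so far

    def observe(self, partition: str, event_ts: float):
        # Monotonic: a late or clock-skewed event never regresses the mark.
        self.low[partition] = max(self.low.get(partition, float("-inf")), event_ts)

    def watermark(self, partition: str) -> float:
        return self.low.get(partition, float("-inf"))

wm = PartitionWatermarks()
wm.observe("stadium_hex", 100.0)   # stadium producer lags (clock skew)
wm.observe("suburb_hex", 500.0)    # suburb producer is current
wm.observe("stadium_hex", 90.0)    # late event: stadium mark holds at 100

# A single global watermark would drag everything back to the laggard.
global_wm = min(wm.low.values())
```

The suburb cell keeps progressing at 500 while only the stadium cell waits, which is exactly the containment the table row describes.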
"dim_geography for road state, dim_vehicles for insurance, dim_fx_rate for currency. Get those three SCD2s right, use infinity-dates instead of NULLs, and shard H3 to L10 inside the venue, and the rest of the model behaves identically — surge locked, payouts reconciled, growth segmented. The chaos is contained at the dimension layer, not the fact layer."
"Ops reads every state transition from fct_trip_events, Finance reconciles fares against two invariants (rider-side and driver-side), and Risk replays GPS sequences for phantom-convoy detection. Three stakeholders, three grains, one source of truth — and two invariants on the financial bridge keep producer bugs from ever reaching the audit."
Scenario 2 — Surge pricing & supply/demand state
Why this is hard — the design tension
Surge isn't one fact — it's a causal chain: state observed → model invoked → multiplier set → riders react → drivers reposition → state changes. Each link needs its own grain or you lose the ability to evaluate the engine. The model also has to replay history with the surge model that was active then, not today's — pricing teams A/B test new models monthly. And the geographic skew is brutal: midtown Manhattan on New Year's Eve has 100× the rows of suburban Queens.
Section 1 — Dimensional Model Overview
Star Schema with a State → Decision → Outcome causal chain.
Fact Tables
- fct_geo_state_minute — The "Observation" grain. One row per (H3 cell × minute). Captures open requests, available drivers, ratio, ETA, weather, event signals.
- fct_surge_decision — The "Decision" grain. One row per surge-multiplier change per cell. Captures which model version made the call and on what input features.
- fct_surge_outcome — The "Effectiveness" grain. One row per surge decision joined to its post-decision N-minute outcome (did supply rebalance? did the price stick?).
- brg_zone_cells — Hierarchy bridge. Many H3 cells roll up to a surge zone (the operational region used by the pricing team).
Dimension Tables
- dim_h3_cell — H3 L8 hexagons. SCD1 (geometry is static).
- dim_surge_zone — Operational regions (e.g. "SF Downtown", "JFK Airport"). SCD2 — boundaries get redrawn.
- dim_surge_model — Versioned ML/heuristic models. SCD2 — drives the replay capability.
- dim_weather — Weather state per cell per hour (rain, snow, temperature). SCD1 snapshot.
- dim_event — Concerts, sports games, conferences. SCD1 (each event is a row, not a state).
Section 2 — The Logical Data Model
Observation Fact: fct_geo_state_minute
| Column | Type | Description |
|---|---|---|
h3_key | FK | H3 cell observed. |
minute_bucket_ts | TIMESTAMP_TZ | Truncated to the minute. |
open_requests | INT | Riders waiting in this cell. |
available_drivers | INT | On-app, not currently on a trip. |
driver_to_request_ratio | DECIMAL(8,3) | Computed for convenience. |
predicted_eta_sec | INT | From the ETA service at this moment. |
weather_key | FK | Weather conditions in this cell. |
event_keys | ARRAY<FK> | Active events influencing this cell. |
Decision Fact: fct_surge_decision
| Column | Type | Description |
|---|---|---|
decision_key | PK | Surrogate. |
h3_key | FK | Cell the decision applies to. |
decision_ts | TIMESTAMP_TZ | When the new multiplier became active. |
previous_multiplier | DECIMAL(4,2) | Just before this decision. |
new_multiplier | DECIMAL(4,2) | The applied surge factor (1.0 = no surge). |
model_key | FK | Which dim_surge_model version made the call. |
input_features_json | JSON | Frozen features the model saw — required for replay. |
decision_reason_code | STRING | {LOW_SUPPLY, HIGH_DEMAND, EVENT_PROXIMITY, WEATHER_SHOCK, MANUAL_OVERRIDE}. |
Outcome Fact: fct_surge_outcome
| Column | Type | Description |
|---|---|---|
decision_key | FK | The surge decision. |
n_minutes_after | SMALLINT | 5, 10, 15-min windows post-decision. |
delta_drivers | INT | Net change in available drivers. |
delta_requests | INT | Net change in open requests. |
completed_trips | INT | Trips actually completed in this window. |
abandoned_requests | INT | Riders who saw the surge and didn't book. |
realized_revenue_usd | DECIMAL | Revenue earned from trips in this window. |
Section 3 — How the model serves each stakeholder
The Surge Engine · live
- Reads fct_geo_state_minute for the latest minute, joins dim_surge_model to get the active model, computes a multiplier, writes fct_surge_decision.
- Round-trip target: under 500 ms per cell. Reads come from a Redis snapshot of the latest minute; writes append to Kafka, then to the warehouse.
Pricing Team · effectiveness analysis
- Did the surge work? A surge "worked" if (delta_drivers > 0 within 10 min) AND (abandoned_requests stayed flat). Both signals come from fct_surge_outcome.
- Pricing elasticity: for each multiplier band, what % of riders abandoned? Plot abandonment vs multiplier — the inflection point is the pain threshold.
- Model A/B: two model versions live in dim_surge_model; route a % of cells to each. Compare realized_revenue_usd + abandoned_requests per model.
Capacity Planning · forecasting
- Train the demand forecast on historical fct_geo_state_minute. MAPE is measured against later observed minutes.
- Identify chronic supply gaps — cells where driver_to_request_ratio is below 0.5 for >30% of peak hours. These get long-term incentive programs.
Section 4 — Why versioning the surge model matters (and why input_features_json is mandatory)
Without a versioned dim_surge_model + frozen input features, you cannot replay history. Three failure modes show up the moment the pricing team A/B tests:
- Failure 1: model drift hides a regression. If today's model retroactively scores yesterday's inputs, a buggy current version "explains" yesterday's outcome cleanly — even when the model actually live yesterday produced a wildly different result.
- Failure 2: feature drift. A weather data source upgrades from "rain_mm" to "precip_intensity" mid-week. Without frozen input_features_json, you can't tell whether yesterday's surge was driven by rain or by something else.
- Failure 3: dispute resolution. A regulator asks "why was this rider charged 3.2× on this date?" — only frozen features plus the SCD2 model row gives you a defensible answer.
The contract: every fct_surge_decision row carries a deterministic, replay-safe snapshot of the inputs. Same inputs + same model_key → identical multiplier, every time, forever.
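The replay contract can be made concrete: treat the multiplier as a pure function of (model_key, frozen feature snapshot). The decision rule below is a toy heuristic, not Uber's model — the point is only that identical inputs reproduce identical outputs, which is what replay-safety means:

```python
import json

def replay_multiplier(model_key: str, input_features_json: str) -> float:
    """Deterministic replay: same frozen JSON + same model version ->
    same multiplier, every time (toy rule standing in for the real model)."""
    f = json.loads(input_features_json)
    if model_key == "model_v3.2":
        if f["ratio"] < 0.25 and f["eta_sec"] > 500:
            return 2.5
        return 1.0
    raise KeyError(f"unknown model version: {model_key}")

# The frozen snapshot stored on the fct_surge_decision row:
frozen = '{"ratio":0.23,"eta_sec":540,"event":"concert_chase_center","prev_minute_ratio":0.50}'
m1 = replay_multiplier("model_v3.2", frozen)
m2 = replay_multiplier("model_v3.2", frozen)  # replayed months later: identical
```

Any hidden dependency (a live feature fetch, a wall-clock read) inside the function breaks the contract, which is why the features must be frozen onto the decision row at write time.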
Section 5 — Conceptual data flow (the State → Decision → Outcome chain)
Enhanced graphical data model — Surge Pricing & Effectiveness
Section 7 — SQL analysis for business units
Sample data (a single H3 cell, 5 minutes of supply/demand state and 1 surge decision)
INSERT INTO fct_geo_state_minute VALUES
('h3_sf_market', '2025-05-01 19:00:00', 8, 12, 1.50, 240, 'clear', NULL),
('h3_sf_market', '2025-05-01 19:01:00', 14, 7, 0.50, 360, 'clear', ARRAY['concert_chase_center']),
('h3_sf_market', '2025-05-01 19:02:00', 22, 5, 0.23, 540, 'clear', ARRAY['concert_chase_center']),
('h3_sf_market', '2025-05-01 19:03:00', 19, 9, 0.47, 410, 'clear', ARRAY['concert_chase_center']),
('h3_sf_market', '2025-05-01 19:04:00', 16, 14, 0.88, 280, 'clear', ARRAY['concert_chase_center']);
-- Pricing engine fires a surge decision at 19:02 when ratio crashes:
INSERT INTO fct_surge_decision VALUES
(5001, 'h3_sf_market', '2025-05-01 19:02:00',
 1.0, 2.5, 'model_v3.2',
 '{"ratio":0.23,"eta_sec":540,"event":"concert_chase_center","prev_minute_ratio":0.50}',
 'LOW_SUPPLY');
-- Outcome computed nightly (10-min window post-decision)
INSERT INTO fct_surge_outcome VALUES
(5001, 10, +9, -3, 14, 5, 1240.00);
A. Pricing engine · live decisions
Q1 — Cells currently below ratio 0.5 (live surge candidates)
WITH latest_state AS (
SELECT h3_key, MAX(minute_bucket_ts) AS latest_ts
FROM fct_geo_state_minute
WHERE minute_bucket_ts >= NOW() - INTERVAL '2 minutes'
GROUP BY h3_key
)
SELECT s.h3_key,
s.driver_to_request_ratio,
s.open_requests, s.available_drivers,
s.predicted_eta_sec,
w.precip_intensity, e.event_name
FROM fct_geo_state_minute s
JOIN latest_state l USING (h3_key)
LEFT JOIN dim_weather w ON s.weather_key = w.weather_key
LEFT JOIN dim_event e ON e.event_key = ANY(s.event_keys)
WHERE s.minute_bucket_ts = l.latest_ts
AND s.driver_to_request_ratio < 0.5
ORDER BY s.driver_to_request_ratio ASC;
B. Pricing team · effectiveness analysis
Q2 — Surge effectiveness: % of decisions that pulled supply within 10 min
SELECT
d.model_key,
COUNT(*) AS n_decisions,
COUNT(*) FILTER (WHERE o.delta_drivers > 0) AS pulled_supply,
ROUND(100.0 * COUNT(*) FILTER (WHERE o.delta_drivers > 0)
/ COUNT(*), 1) AS effectiveness_pct,
ROUND(AVG(o.delta_drivers), 1) AS avg_delta_drivers,
ROUND(AVG(o.abandoned_requests), 1) AS avg_abandoned
FROM fct_surge_decision d
JOIN fct_surge_outcome o USING (decision_key)
WHERE o.n_minutes_after = 10
AND d.decision_ts >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY d.model_key
ORDER BY effectiveness_pct DESC;
Q3 — Pricing elasticity: abandonment by multiplier band
SELECT
CASE
WHEN d.new_multiplier < 1.5 THEN '1.0-1.5x'
WHEN d.new_multiplier < 2.0 THEN '1.5-2.0x'
WHEN d.new_multiplier < 3.0 THEN '2.0-3.0x'
ELSE '3.0x+'
END AS multiplier_band,
COUNT(*) AS n_decisions,
ROUND(AVG(o.abandoned_requests), 1) AS avg_abandoned,
ROUND(AVG(o.completed_trips), 1) AS avg_completed,
ROUND(100.0 * AVG(o.abandoned_requests)
/ NULLIF(AVG(o.abandoned_requests + o.completed_trips), 0), 1) AS abandonment_pct,
ROUND(AVG(o.realized_revenue_usd), 2) AS avg_revenue_usd
FROM fct_surge_decision d
JOIN fct_surge_outcome o USING (decision_key)
WHERE o.n_minutes_after = 10
GROUP BY multiplier_band
ORDER BY multiplier_band;
-- The inflection point in abandonment_pct is the rider pain threshold.
Q4 — Model A/B comparison: revenue per cell-minute under v3.1 vs v3.2
SELECT
d.model_key,
COUNT(*) AS n_decisions,
ROUND(SUM(o.realized_revenue_usd), 2) AS total_revenue,
ROUND(SUM(o.realized_revenue_usd) / COUNT(*), 2) AS rev_per_decision,
ROUND(AVG(o.completed_trips), 2) AS avg_completed,
ROUND(AVG(o.abandoned_requests), 2) AS avg_abandoned
FROM fct_surge_decision d
JOIN fct_surge_outcome o USING (decision_key)
WHERE d.decision_ts >= CURRENT_DATE - INTERVAL '14 days'
AND d.model_key IN ('model_v3.1', 'model_v3.2')
AND o.n_minutes_after = 10
GROUP BY d.model_key;
-- Statistical significance test runs in the model card; this is the headline.
C. Capacity planning · forecast accuracy & chronic gaps
Q5 — Demand forecast MAPE per cell (predicted vs actual)
SELECT
h3_key,
COUNT(*) AS n_minutes,
ROUND(AVG(ABS(predicted_demand - open_requests)
/ NULLIF(open_requests, 0)) * 100, 2) AS mape_pct,
ROUND(AVG(predicted_demand - open_requests), 2) AS avg_bias,
COUNT(*) FILTER (WHERE predicted_demand < open_requests) AS underforecast_min
FROM fct_geo_state_minute -- assumes predicted_demand is materialized onto the observation row by the forecast job
WHERE minute_bucket_ts >= CURRENT_DATE - INTERVAL '7 days'
AND open_requests > 0
GROUP BY h3_key
HAVING AVG(ABS(predicted_demand - open_requests) / NULLIF(open_requests, 0)) > 0.20
ORDER BY mape_pct DESC;
-- Cells with MAPE > 20% need feature engineering attention.
Q6 — Chronic supply gaps (cells under-supplied during peak hours regularly)
SELECT
h3_key,
COUNT(*) AS peak_minutes,
COUNT(*) FILTER (WHERE driver_to_request_ratio < 0.5) AS under_supplied_min,
ROUND(100.0 * COUNT(*) FILTER (WHERE driver_to_request_ratio < 0.5)
/ COUNT(*), 1) AS pct_under_supplied
FROM fct_geo_state_minute
WHERE minute_bucket_ts >= CURRENT_DATE - INTERVAL '30 days'
AND EXTRACT(HOUR FROM minute_bucket_ts) BETWEEN 17 AND 21
GROUP BY h3_key
HAVING COUNT(*) FILTER (WHERE driver_to_request_ratio < 0.5) * 1.0 / COUNT(*) > 0.30
ORDER BY pct_under_supplied DESC;
-- These cells get long-term driver incentive programs (per-trip bonus, guarantees).
Worked example — surge math at 19:02 in h3_sf_market
- At 19:01, ratio drops from 1.50 → 0.50 (concert lets out). 7 drivers available, 14 open requests. ETA jumps from 240s → 360s.
- At 19:02, ratio crashes to 0.23 (5 drivers, 22 requests). The model fires a surge decision: new_multiplier = 2.5.
- Frozen input_features_json: {ratio:0.23, eta_sec:540, event:"concert_chase_center", prev_minute_ratio:0.50}. The same inputs to model_v3.2 will always yield 2.5 — replay is deterministic.
- 10 min later (19:12): drivers from neighboring cells reposition (delta_drivers = +9), some riders abandon (5 of 19), but 14 trips complete. Realized revenue: $1,240.
- Effectiveness verdict: the surge worked — supply pulled in (delta_drivers > 0), revenue captured, but at the cost of 5 abandoned trips. The pricing team weighs $1,240 captured vs ~$200 estimated lost-to-abandonment.
Section 8 — Why this works
- Causal chain preserved. State → Decision → Outcome are three separate facts. Each can be queried in isolation, but they join cleanly via decision_key. Effectiveness analysis is one JOIN, not a self-join puzzle.
- Replay-safe by construction. Frozen input_features_json + SCD2 dim_surge_model means any historical surge can be recomputed exactly. Audits, disputes, and A/B baselines all work.
- Skew managed. H3 L8 cells partition the geography evenly; downtown skew is contained per cell. Sub-partition by HASH(h3_key) MOD 8 for the hottest few cells if needed.
- Cold-start covered. When a new city launches, brg_zone_cells can map untrained cells to their parent zone's averages until enough data accrues.
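One nuance on the HASH(h3_key) MOD 8 trick: hashing the cell key alone still lands a single hot cell in a single bucket. A common variant salts the key with a per-event attribute so one hot cell fans out across all sub-partitions. A sketch of that variant (hypothetical salt on an event id):

```python
import hashlib

def sub_partition(h3_key: str, event_id: int, n: int = 8) -> int:
    """Salted sub-partition: hash(cell key + event id) mod n, so writes for
    one hot cell spread across n buckets instead of piling into one."""
    digest = hashlib.sha256(f"{h3_key}:{event_id}".encode()).hexdigest()
    return int(digest, 16) % n

# 1,000 events in the single hottest downtown cell land in all 8 buckets.
buckets = {sub_partition("h3_sf_market", i) for i in range(1000)}
```

The trade-off: readers must now scan all n sub-partitions for that cell, which is why the salt is applied only to the hottest few cells.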
"fct_surge_decision + frozen input_features_json + SCD2 dim_surge_model is the contract that lets the pricing team A/B-test new models confidently and lets regulators get a defensible answer to 'why was this rider charged 3.2× on this date?' — every time, forever."
Scenario 3 — Order & courier dispatch (with stacked deliveries)
Why this is hard — the design tension
This is a 3-sided marketplace (customer + restaurant + courier) where one fact table can't satisfy all three. Stacked orders make it worse: one courier dispatch covers multiple orders, but each order has its own customer SLA and its own restaurant relationship. Mutable tips (editable up to 24h post-delivery) demand append-only economics. The model must read clean for each stakeholder without forcing the others to filter.
Section 1 — Dimensional Model Overview
Fact Tables (the order/dispatch split + bridge)
fct_orders— The "Customer" grain. One row per customer order. Drives customer SLA + restaurant operations.fct_dispatches— The "Courier" grain. One row per courier dispatch. May cover 1 (single) or N (stacked) orders. Drives courier earnings.brg_dispatch_orders— The "Stacking" bridge. Links a dispatch to its 1-N orders, with leg sequence and per-order distance.fct_order_state_events— The "Telemetry" grain. One row per order state transition (PLACED, CONFIRMED, READY, PICKED_UP, DELIVERED, CANCELLED).
Dimension Tables
- dim_customers · dim_addresses (SCD2 on address — for legal/refund history)
- dim_restaurants (SCD2 — hours, menu version, status)
- dim_couriers (SCD2 — vehicle type, rating tier, on/off platform)
- dim_h3_cell (SCD1)
Section 2 — Logical Data Model
Customer Fact: fct_orders
| Column | Type | Description |
|---|---|---|
order_key / order_id | PK / UUID | Surrogate + natural. |
customer_id, restaurant_id, delivery_h3 | FK | The 3 sides. |
placed_ts, confirmed_ts, ready_ts, picked_up_ts, delivered_ts | TIMESTAMP_TZ | 5 lifecycle timestamps; NULL until reached. |
subtotal_usd, tax_usd, delivery_fee_usd, service_fee_usd | DECIMAL | Customer-facing. |
tip_usd | DECIMAL | Mutable up to 24h post-delivery. |
tip_locked_ts | TIMESTAMP_TZ | When the tip became immutable. NULL = still mutable. |
promo_applied_usd | DECIMAL | Customer discount. |
terminal_state | ENUM | {delivered, cancelled_customer, cancelled_restaurant, cancelled_courier, abandoned}. |
Courier Fact: fct_dispatches
| Column | Type | Description |
|---|---|---|
dispatch_key | PK | Surrogate. |
courier_id | FK | The courier on this dispatch. |
dispatch_start_ts, dispatch_end_ts | TIMESTAMP_TZ | From "accept" to "all delivered". |
is_batched | BOOLEAN | TRUE if N orders stacked. |
total_orders | SMALLINT | Convenience: COUNT of orders in bridge. |
total_distance_km | DECIMAL | Sum of all legs. |
base_pay_usd, tips_distributed_usd, payout_usd | DECIMAL | Earnings. |
Bridge: brg_dispatch_orders
| Column | Type | Description |
|---|---|---|
dispatch_key, order_key | FK | Composite PK. |
leg_sequence | SMALLINT | 1, 2, 3 — which delivery on this dispatch. |
leg_distance_km | DECIMAL | Distance from previous leg's drop (or restaurant) to this drop. |
leg_payout_share_usd | DECIMAL | This leg's share of the dispatch payout. |
Section 3 — How the model serves each stakeholder
Customer · on-time SLA + experience
- On-time: delivered_ts ≤ promised_ts (the ETA quoted at checkout) from fct_orders.
- Late-stage detection: join fct_order_state_events to find which stage stalled (restaurant prep? courier route?).
Courier · earnings & efficiency
- Earnings per hour: SUM(payout_usd) / dispatch hours from fct_dispatches.
- Stack uplift: compare per-leg payout in single vs batched dispatches — quantifies the courier value of stacking.
Restaurant · operational health
- Prep-time variance: ready_ts − confirmed_ts per restaurant per hour. Spikes = kitchen overload.
- Cancellation source: terminal_state = 'cancelled_restaurant' rate; restaurants above a 3% rate get a warning.
Section 4 — Why the order/dispatch split + bridge
Three failure modes if you collapse:
- Single-fact (order grain only): can't represent "courier did 2 stacked deliveries with shared pickup" — courier earnings get double-counted.
- Single-fact (dispatch grain only): the customer SLA query needs to sum within bridge — ugly, error-prone.
- No bridge: 1:1 dispatch:order forces non-stacking — kills the entire economic case for batching.
The bridge with leg_sequence + leg_distance + leg_payout_share enables: fct_orders stays clean (1 row = 1 customer order), fct_dispatches stays clean (1 row = 1 courier shift-segment), and the join produces correct per-leg attribution for both sides.
Enhanced graphical data model — DoorDash 3-sided marketplace
Section 7 — SQL analysis for business units
Sample data (Dispatch DSP-99: 1 courier, 2 stacked orders from the same restaurant)
INSERT INTO fct_orders VALUES
('O-201','CUST_A','RST_PIZZA','h3_drop1','2025-05-01 18:00','2025-05-01 18:02','2025-05-01 18:18','2025-05-01 18:22','2025-05-01 18:34', 25.00, 2.10, 2.99, 1.00, 5.00, '2025-05-02 18:34', 0.00, 'delivered'),
('O-202','CUST_B','RST_PIZZA','h3_drop2','2025-05-01 18:01','2025-05-01 18:03','2025-05-01 18:18','2025-05-01 18:22','2025-05-01 18:42', 32.00, 2.69, 2.99, 1.00, 6.00, '2025-05-02 18:42', 0.00, 'delivered');
INSERT INTO fct_dispatches VALUES
('DSP-99','COURIER_X','2025-05-01 18:18','2025-05-01 18:42', TRUE, 2, 4.2, 6.50, 11.00, 17.50);
INSERT INTO brg_dispatch_orders VALUES
('DSP-99','O-201', 1, 2.0, 8.50),
('DSP-99','O-202', 2, 2.2, 9.00);
A. Customer · order latencies & SLA
Q1 — Stage-by-stage latency for the order, with whose-fault attribution
SELECT
order_id,
EXTRACT(EPOCH FROM (confirmed_ts - placed_ts)) / 60 AS confirm_min,
EXTRACT(EPOCH FROM (ready_ts - confirmed_ts)) / 60 AS prep_min,
EXTRACT(EPOCH FROM (picked_up_ts - ready_ts)) / 60 AS pickup_wait_min,
EXTRACT(EPOCH FROM (delivered_ts - picked_up_ts)) / 60 AS drive_min,
EXTRACT(EPOCH FROM (delivered_ts - placed_ts)) / 60 AS total_min,
CASE
WHEN (ready_ts - confirmed_ts) > INTERVAL '20 min' THEN 'restaurant_slow'
WHEN (picked_up_ts - ready_ts) > INTERVAL '10 min' THEN 'courier_late_pickup'
WHEN (delivered_ts - picked_up_ts) > INTERVAL '20 min' THEN 'courier_slow_drive'
ELSE 'on_track' END AS bottleneck
FROM fct_orders WHERE DATE(placed_ts) = '2025-05-01';
B. Courier · earnings & stack uplift
Q2 — Earnings per hour by stacked vs single dispatch
SELECT
is_batched,
COUNT(*) AS dispatches,
ROUND(SUM(payout_usd) /
SUM(EXTRACT(EPOCH FROM (dispatch_end_ts - dispatch_start_ts))/3600), 2) AS earnings_per_hr,
ROUND(SUM(payout_usd) / SUM(total_orders), 2) AS payout_per_order,
ROUND(AVG(total_distance_km / total_orders), 2) AS km_per_order
FROM fct_dispatches
WHERE DATE(dispatch_start_ts) BETWEEN '2025-05-01' AND '2025-05-31'
GROUP BY is_batched;
-- Quantifies the courier value of stacking; if stacked is < 1.3× single payout/hr, batching isn't worth it.
C. Restaurant · prep time & cancellation
Q3 — Prep time p50/p95 per restaurant per day
SELECT
restaurant_id,
DATE(placed_ts) AS d,
COUNT(*) AS orders,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (ready_ts - confirmed_ts))/60) AS prep_p50_min,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (ready_ts - confirmed_ts))/60) AS prep_p95_min,
ROUND(100.0 * COUNT(*) FILTER (WHERE terminal_state = 'cancelled_restaurant') / COUNT(*), 1) AS cancel_pct
FROM fct_orders
WHERE placed_ts >= CURRENT_DATE - INTERVAL '30 days'
  -- No NOT NULL filter: cancelled orders (NULL ready_ts) must stay in COUNT(*) so
  -- cancel_pct is real; PERCENTILE_CONT ignores rows where (ready_ts - confirmed_ts) is NULL.
GROUP BY restaurant_id, DATE(placed_ts)
ORDER BY prep_p95_min DESC;
-- Restaurants with prep_p95 > 25min get warned; > 35min get throttled (lower placement priority).
Worked example — Stacked dispatch DSP-99
- Courier X accepts dispatch at 18:18 with 2 stacked orders from RST_PIZZA. Pickup is shared (single restaurant); 2 different drop locations.
- Order O-201 delivered first at 18:34 (12 min after the 18:22 pickup, 2.0 km).
- Order O-202 delivered second at 18:42 (20 min after pickup, an additional 2.2 km from drop 1).
- Courier earnings: base_pay $6.50 + tips $11.00 = $17.50 for 24 min of dispatch time → $43.75/hr. A single-order dispatch in the same area pays ~$28/hr.
- Bridge: O-201 leg_payout_share = $8.50, O-202 leg_payout_share = $9.00. Sum = $17.50 ✓
- Customer SLAs: O-201 on-time; O-202 was 8 min late (it waited on the O-201 drop). Stacking trade-off: courier earnings up, second customer's SLA at risk.
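As a sanity check, the DSP-99 numbers replay in a few lines. This is an illustrative Python sketch, not production code; it only encodes the two invariants the bridge design promises (leg shares reconcile to the dispatch payout, and earnings/hr is computed over dispatch wall-clock time):

```python
from datetime import datetime

# Sample dispatch DSP-99 from above: one courier, two stacked orders.
dispatch = {
    "start": datetime(2025, 5, 1, 18, 18),
    "end": datetime(2025, 5, 1, 18, 42),
    "payout_usd": 17.50,
}
# brg_dispatch_orders rows: (order_key, leg_sequence, leg_payout_share_usd)
bridge = [("O-201", 1, 8.50), ("O-202", 2, 9.00)]

# Invariant 1: leg payout shares reconcile exactly to the dispatch payout.
assert round(sum(share for _, _, share in bridge), 2) == dispatch["payout_usd"]

# Invariant 2: earnings/hr uses dispatch wall-clock, so stacking is never double-counted.
hours = (dispatch["end"] - dispatch["start"]).total_seconds() / 3600
earnings_per_hr = dispatch["payout_usd"] / hours
print(round(earnings_per_hr, 2))  # 43.75, matching the worked example
```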
Section 8 — Why this works
- Three stakeholders, three grains, one model. Customer queries hit fct_orders; courier queries hit fct_dispatches; the bridge ties them together only when stacking math matters.
- Stack uplift is measurable. Compare is_batched = TRUE vs FALSE earnings/hr — if stacking doesn't beat single by ~30%, dispatch should stop offering it.
- Mutable tips handled cleanly. Append correction rows; tip_locked_ts tracks the immutability boundary. Finance restates from tip_locked_ts.
- Stage attribution. The bottleneck CASE WHEN in Q1 turns "the order was 18 min late" into "the restaurant prep took 22 min" — actionable per stakeholder.
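The append-only tip mechanics can be sketched concretely. The resolve rule below (latest correction at or before tip_locked_ts wins) is an assumption about how a "current tip" view would read the correction rows; names mirror the schema above:

```python
from datetime import datetime

# Append-only tip corrections: every edit is a new row, never an UPDATE.
# (order_id, tip_usd, corrected_ts)
tip_events = [
    ("O-201", 5.00, datetime(2025, 5, 1, 18, 34)),  # tip at delivery
    ("O-201", 8.00, datetime(2025, 5, 1, 22, 10)),  # customer raises tip same evening
    ("O-201", 2.00, datetime(2025, 5, 3, 9, 0)),    # arrives AFTER lock -> ignored
]
tip_locked_ts = datetime(2025, 5, 2, 18, 34)  # delivered_ts + 24h

def current_tip(events, order_id, locked_ts):
    """Latest correction at or before tip_locked_ts wins; later edits are ignored."""
    valid = [(ts, tip) for oid, tip, ts in events
             if oid == order_id and ts <= locked_ts]
    return max(valid)[1] if valid else None

print(current_tip(tip_events, "O-201", tip_locked_ts))  # 8.0
```

Finance restates from the same rows: because nothing is overwritten, the tip value as of any past close date is always reconstructable.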
"One model answers all three sides: was the order on time? (reads fct_orders), did the courier earn enough per hour? (aggregates fct_dispatches), did the restaurant lose orders to long prep times? (joins on dim_restaurants). The bridge is what makes stacking a first-class concept rather than a hack."
Scenario 4 — Search ads auction → impression → click → conversion
Why this is hard — the design tension
This is a causal chain across 4 facts with extreme asymmetry: 2 billion ad requests/day → 200 billion targeting lookups → ~100 candidate ads per request → 1 winner. Loser logs are 20× winner volume. Auctions clear in <100ms but conversions land days later. The model must keep the join key (auction_decision_id) deterministic across all stages, support re-attribution when models change, and protect privacy under ATT/Sandbox.
Section 1 — Dimensional Model Overview
Fact Tables (the auction chain)
- fct_auctions — The "Decision" grain. One row per auction clearing. Carries auction_decision_id.
- fct_impressions — The "Served" grain. One row per impression (auction winner served on the page).
- fct_clicks — The "Click" grain. One row per click on an impression. Mutable: invalid-click invalidation comes later.
- fct_conversions — The "Conversion" grain. One row per advertiser-reported conversion event.
- fct_attributions — The "Attribution" grain. One row per (conversion × eligible touchpoint × model × run). Append-only per attribution_run_id.
- fct_auction_losers — The "Counterfactual" grain. Sampled losing candidates (1%) for ranker training.
Dimension Tables
- dim_advertisers, dim_campaigns, dim_ad_groups, dim_creatives — all SCD2 (budget, targeting, creative drift).
- dim_keywords — SCD1, high cardinality.
- dim_users · privacy-tokenized (Customer Match, FLoC/Topics, ATT bucket).
- dim_attribution_models — SCD2, versioned (last_click, position_based, data_driven_v3, etc.).
Section 2 — Logical Data Model (key columns only)
fct_auctions
| Column | Type | Description |
|---|---|---|
auction_decision_id | PK STRING | The join key for the entire chain. |
query_text_hash | STRING | Hash of search query — never raw. |
keyword_id, user_token, device_id | FK | Targeting context. |
auction_ts | TIMESTAMP_TZ | Within ~100 ms of request. |
winning_ad_id, clearing_price_local, fx_rate_to_usd, clearing_price_usd | — | Second-price clearance. |
privacy_path | STRING | {standard, ATT_optout, sandbox, FLoC}. |
fct_impressions — joined to auctions on auction_decision_id; carries served_ts, creative_id, predicted_pCTR, quality_score, ad_rank.
fct_clicks — click_id, impression_id, click_ts, is_invalid, invalidation_ts (slowly-changing).
fct_conversions — conversion_id, advertiser_id, conversion_ts, conversion_type (purchase, signup, install), conversion_value_usd.
fct_attributions — attribution_key, conversion_id, impression_id, click_id, model_id, attribution_run_id, fractional_credit, window_type, attributed_value_usd. Multiple rows per conversion; one per (model × run × eligible touchpoint).
Section 3 — How the model serves each stakeholder
Advertiser · billing & ROAS
- Spend by campaign: SUM(clearing_price_usd) from fct_impressions joined back to fct_auctions.
- ROAS = SUM(attributed conversion_value_usd) / SUM(clearing_price_usd) — uses the latest attribution_run_id.
Ranker team · model training & quality score
- fct_auction_losers + fct_impressions = a balanced training set for the pCTR model.
- Quality Score retrospective: join the historical creative SCD2 row + the impression's predicted_pCTR to debug regressions.
Attribution & measurement
- Re-attribution: when the data-driven model upgrades from v2→v3, run a new attribution_run_id over the same conversions and impressions. The old run is preserved — no data lost.
- Incrementality: compare the conversion rate of exposed users (impression served) vs the counterfactual (won the auction but didn't serve due to a PSA / blank slot) — fct_auction_losers partly enables this.
Section 4 — Why auction_decision_id + append-only attribution
- One key, four facts. auction_decision_id is generated once at auction time and travels with the impression, click, and conversion. No fuzzy joins, no timestamp matching.
- Re-runnable attribution. Attribution is computed, not collected. New models = new attribution_run_id; the latest_run view picks the most recent. Advertisers can compare last-click vs data-driven side by side.
- Invalid clicks as a slowly-changing fact. Click validation is asynchronous (bot detection takes hours to days). Append is_invalid = TRUE with invalidation_ts rather than UPDATE — preserves the historical billing audit.
- Loser sampling at 1%. Losers are 20× winners by volume. Reservoir-sample to 1% and re-weight in training; full retention is unaffordable.
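The re-weighting claim in the last bullet is easy to demonstrate: with sample_weight = 1 / sample_rate, aggregates over the 1% sample estimate the full loser population. A minimal sketch with made-up volumes:

```python
import random

random.seed(42)
SAMPLE_RATE = 0.01  # keep 1% of losing candidates

# Pretend 100,000 losing auction candidates arrived today.
losers = range(100_000)
sampled = [x for x in losers if random.random() < SAMPLE_RATE]
weight = 1 / SAMPLE_RATE  # stored as sample_weight on each retained row

# Training/analytics re-weight: estimated loser count from the 1% sample.
estimated_total = len(sampled) * weight
print(len(sampled), estimated_total)  # ~1,000 rows stand in for ~100,000
```

The same weight multiplies any summed feature (bids, ranks) in training, so the pCTR model sees an approximately unbiased picture of the losing population at 1% of the storage cost.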
Enhanced graphical data model — Search Ads Auction Chain
Section 7 — SQL analysis for business units
Sample data — Nike "running shoe" auction
-- Auction at 14:32:01 — 5 candidates compete; Nike wins on ad_rank 0.0751
INSERT INTO fct_auctions VALUES
('AUC_NIKE_001','q_runningshoe','kw_runningshoe','U_TOK_42','dev_iphone',
'2025-05-01 14:32:01.123','AD_NIKE_RUN','1.92','1.0','1.92','standard');
-- Impression served
INSERT INTO fct_impressions VALUES
('IMP_001','AUC_NIKE_001','AD_NIKE_RUN',0.034,0.92,'2025-05-01 14:32:01.180', 0.0751);
-- User clicks 4 seconds later
INSERT INTO fct_clicks VALUES ('CLK_001','IMP_001','2025-05-01 14:32:05',FALSE,NULL);
-- Conversion 17 hours later (next-day purchase)
INSERT INTO fct_conversions VALUES
('CONV_001','ADV_NIKE','2025-05-02 07:42:00','purchase',129.99);
-- Two attribution rows: last-click model + DDA model on the same conversion
INSERT INTO fct_attributions VALUES
('ATTR_001','CONV_001','IMP_001','CLK_001','MODEL_LAST_CLICK','RUN_2025_05_02', 1.000, 'click_24h', 129.99),
('ATTR_002','CONV_001','IMP_001','CLK_001','MODEL_DDA_v3', 'RUN_2025_05_02', 0.620, 'click_24h', 80.59);
-- Sampled losers (1%)
INSERT INTO fct_auction_losers VALUES
('AUC_NIKE_001','AD_ASICS_GEL', 0.0572, 100), -- weight = 1/sample_rate
('AUC_NIKE_001','AD_STRAVA', 0.0408, 100);
A. Advertiser · ROAS & ranker debug
Q1 — CTR + CPA + ROAS per campaign (latest attribution run)
WITH latest_run AS (
SELECT MAX(attribution_run_id) AS run_id FROM fct_attributions
),
campaign_metrics AS (
SELECT
c.campaign_id,
COUNT(DISTINCT i.impression_id) AS impressions,
COUNT(DISTINCT cl.click_id) FILTER (WHERE cl.is_invalid = FALSE) AS clicks,
SUM(a.clearing_price_usd) AS spend_usd,
COUNT(DISTINCT atr.conversion_id) AS conversions,
SUM(atr.attributed_value_usd) AS attributed_revenue
FROM fct_auctions a
JOIN fct_impressions i USING (auction_decision_id)
JOIN dim_creatives cr ON i.creative_id = cr.creative_id
JOIN dim_campaigns c ON cr.campaign_id = c.campaign_id
LEFT JOIN fct_clicks cl USING (impression_id)
LEFT JOIN fct_attributions atr ON atr.click_id = cl.click_id
AND atr.attribution_run_id = (SELECT run_id FROM latest_run)
AND atr.model_id = 'MODEL_LAST_CLICK'
WHERE a.auction_ts >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY c.campaign_id
)
SELECT campaign_id, impressions, clicks, spend_usd, conversions, attributed_revenue,
ROUND(100.0 * clicks / NULLIF(impressions, 0), 3) AS ctr_pct,
ROUND(spend_usd / NULLIF(clicks, 0), 2) AS cpc_usd,
ROUND(spend_usd / NULLIF(conversions, 0), 2) AS cpa_usd,
ROUND(attributed_revenue / NULLIF(spend_usd, 0), 2) AS roas
FROM campaign_metrics ORDER BY roas DESC;
B. Attribution & measurement
Q2 — Last-click vs DDA: which model gives higher attributed revenue?
SELECT
atr.model_id,
COUNT(DISTINCT atr.conversion_id) AS conversions,
ROUND(SUM(atr.attributed_value_usd), 2) AS total_attributed_value,
ROUND(AVG(atr.fractional_credit), 3) AS avg_credit
FROM fct_attributions atr
WHERE atr.attribution_run_id = (SELECT MAX(attribution_run_id) FROM fct_attributions)
GROUP BY atr.model_id;
-- Last-click gives full credit (credit=1.0). DDA distributes — useful for diagnosing over-attribution.
C. Ranker · quality score & pCTR calibration
Q3 — Predicted vs realized pCTR per creative (calibration check)
SELECT * FROM (
  SELECT
    i.creative_id,
    COUNT(*) AS impressions,
    ROUND(AVG(i.predicted_pCTR), 4) AS predicted_pCTR,
    ROUND(COUNT(*) FILTER (WHERE c.click_id IS NOT NULL)::DECIMAL / COUNT(*), 4) AS realized_pCTR,
    ROUND(100.0 * (COUNT(*) FILTER (WHERE c.click_id IS NOT NULL)::DECIMAL / COUNT(*) - AVG(i.predicted_pCTR))
          / AVG(i.predicted_pCTR), 1) AS calibration_error_pct
  FROM fct_impressions i
  LEFT JOIN fct_clicks c USING (impression_id)
  WHERE i.served_ts >= CURRENT_DATE - INTERVAL '7 days'
  GROUP BY i.creative_id
  HAVING COUNT(*) > 1000
) creative_calibration
ORDER BY ABS(calibration_error_pct) DESC;  -- output aliases can't be used inside ABS() directly, hence the subquery
-- Creatives with calibration_error > 20% need ranker retraining attention.
Worked example — Nike "running shoe" auction clearing & attribution
- Auction: 5 candidates compete. ad_rank = bid × pCTR × quality:
  - Nike: $2.40 × 0.034 × 0.92 = 0.0751 ← winner
  - Asics: $2.10 × 0.029 × 0.94 = 0.0572
  - Strava: $1.80 × 0.028 × 0.81 = 0.0408
- Second-price clearing: Nike pays just enough to beat Asics's ad_rank — at Nike's quality 0.92, that's $1.92, not $2.40. Savings of $0.48 = trust signal.
- Impression served at 14:32:01. User clicks at 14:32:05.
- Conversion 17 hours later: $129.99 purchase.
- Last-click attribution (MODEL_LAST_CLICK): credit 1.0 → $129.99 attributed to this click.
- Data-driven attribution (MODEL_DDA_v3): credit 0.62 → $80.59 attributed to this click. The remaining $49.40 attributed to upstream touchpoints.
- Nike's ROAS under last-click = $129.99 / $1.92 = 67.7×. Under DDA = $80.59 / $1.92 = 42.0×. Both are stored side by side via attribution_run_id.
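The clearing math above reduces to a tiny ranking sketch. This uses the textbook GSP formula (runner-up ad_rank divided by the winner's own pCTR × quality), which lands slightly below the $1.92 in the example; production pricing adds bid increments, reserves, and other adjustments this sketch deliberately omits:

```python
# Candidates from the worked example: (name, bid_usd, pCTR, quality).
candidates = [
    ("nike",   2.40, 0.034, 0.92),
    ("asics",  2.10, 0.029, 0.94),
    ("strava", 1.80, 0.028, 0.81),
]

# ad_rank = bid x pCTR x quality; highest rank wins the slot.
ranked = sorted(candidates, key=lambda c: c[1] * c[2] * c[3], reverse=True)
winner, runner_up = ranked[0], ranked[1]

# Textbook GSP: the winner pays the minimum price that still beats the
# runner-up's ad_rank, given the winner's own pCTR and quality.
runner_up_rank = runner_up[1] * runner_up[2] * runner_up[3]
clearing_price = runner_up_rank / (winner[2] * winner[3])

print(winner[0], round(clearing_price, 2))  # nike 1.83 under this simplified formula
```

The structural point survives the formula details: the winner always pays less than its own bid, and the discount is larger the better its quality score, which is the "trust signal" the example calls out.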
Section 8 — Why this works
- One join key, four facts. auction_decision_id threads through impression, click, and conversion. The ROAS query is a 4-way join on a single column — efficient and correct.
- Re-attribution is cheap. New model = new attribution_run_id. Old runs preserved; advertisers compare side by side.
- Invalid clicks handled losslessly. Append-only is_invalid + invalidation_ts preserves billing history while crediting advertisers.
- Loser sampling tractable. 1% sampled with a sample_weight column; ranker training works without 20× the data volume.
"auction_decision_id is the join key that turns four independent facts into a coherent advertiser narrative — and the attribution_run_id + dim_attribution_models SCD2 makes re-running attribution a query, not a re-ingest. Last-click vs DDA isn't an ETL decision — it's a column in the SELECT."
Scenario 5 — Cross-device attribution & identity graph
Why this is hard — the design tension
One human, many devices. Apple's ATT cuts user-level signals on iOS. SKAdNetwork only returns aggregated postbacks. The model must support cross-device joining via a probabilistic+deterministic identity graph, multiple attribution windows in parallel (1d / 7d / 28d × click vs view), and two privacy paths (user-level for opted-in, aggregate-only for SKAN).
Section 1 — Dimensional Model Overview
Fact Tables
- fct_ad_events — The "Touch" grain. One row per impression OR click (event-typed).
- fct_conversions — The "Conversion" grain. One row per advertiser-reported conversion.
- fct_skan_postbacks — The iOS-aggregate grain. One row per SKAdNetwork postback (no user-level data).
- fct_attributions — The "Match" grain. One row per (conversion × eligible touchpoint × window × model × run). Append-only via attribution_run_id.
Dimension & Bridge Tables
- dim_identities — One row per resolved person. Probabilistic + deterministic.
- bridge_identity_devices — SCD2. Many devices per identity; link_confidence drives weighting.
- dim_advertisers, dim_campaigns, dim_creatives — SCD2.
- dim_user_holdouts — Holdout assignment for incrementality (SCD2 per advertiser × period).
Section 2 — Logical Data Model (key columns)
bridge_identity_devices (the heart of cross-device)
| Column | Type | Description |
|---|---|---|
identity_id, device_id_token | FK | Composite link. |
link_confidence | DECIMAL(4,3) | 1.0 = deterministic (login). <1.0 = probabilistic (IP, behavior). |
link_method | STRING | {login, email_match, household_ip, behavioral, deterministic_id}. |
effective_from, effective_to | TIMESTAMP_TZ | SCD2 — devices come and go. |
fct_ad_events — event_id, event_type ∈ {impression, click}, identity_id (NULL on iOS), device_id_token, campaign_id, creative_id, placement, event_ts, privacy_path.
fct_attributions — attribution_key, conversion_id, ad_event_id, model_id, attribution_run_id, window_type ∈ {click_1d, click_7d, view_1d, view_28d, ...}, is_attributed BOOLEAN, fractional_credit DECIMAL, attributed_value_usd DECIMAL. Partitioned by attribution_run_id.
Section 3 — How the model serves each stakeholder
Advertiser · ROAS & window comparison
- Run multiple window_type values in parallel: click_1d, click_7d, view_28d. The advertiser picks; we report all.
- Cross-device join: phone impression + laptop click + tablet conversion all join via identity_id from bridge_identity_devices.
Privacy / iOS · ATT & SKAN
- privacy_path = 'att_optout' events have a NULL identity_id; only aggregate roll-ups by campaign × placement × day are valid.
- fct_skan_postbacks is a separate fact — it never joins to user-level data. Aggregated reporting only.
Measurement · incrementality lift
- Hold out N% of users from receiving any ad (in dim_user_holdouts). Lift = (treatment conversion rate) − (holdout conversion rate).
- The holdout group does not appear in fct_attributions at all — they're conversions without ad events.
Section 4 — Why "match at read, not at write"
The decisive design choice is to never compute attribution at ingest time. Every conversion is paired with every eligible touchpoint, and the attribution model is a function evaluated at read time. Three benefits:
- Re-attribution is a query, not a re-ingest. Switch from last-click to data-driven without rebuilding the fact table.
- Multiple windows in parallel. The same conversion appears in click_1d, click_7d, and view_28d rows simultaneously.
- Audit-ready. When an advertiser disputes attribution, you produce the exact attribution_run_id that ran on date X with model Y. Defensible.
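The "every conversion paired with every eligible touchpoint" step can be generated mechanically. A sketch of the window-eligibility pass using the IDT_001 journey from this scenario; the lookback definitions here are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Assumed window definitions: (window_type, eligible event_type, lookback).
WINDOWS = [
    ("click_1d", "click",      timedelta(days=1)),
    ("click_7d", "click",      timedelta(days=7)),
    ("view_1d",  "impression", timedelta(days=1)),
    ("view_28d", "impression", timedelta(days=28)),
]

conversion_ts = datetime(2025, 5, 2, 11, 42)
touchpoints = [  # (ad_event_id, event_type, event_ts) — the IDT_001 journey
    ("AE_001", "impression", datetime(2025, 4, 25, 9, 15)),
    ("AE_002", "click",      datetime(2025, 4, 28, 14, 32)),
    ("AE_003", "impression", datetime(2025, 4, 30, 21, 0)),
]

# One candidate row per (touchpoint x window) it falls inside — the
# fct_attributions grain before any model assigns credit.
rows = [(ev_id, w_name)
        for ev_id, ev_type, ts in touchpoints
        for w_name, w_event, lookback in WINDOWS
        if ev_type == w_event and conversion_ts - lookback <= ts <= conversion_ts]
for r in rows:
    print(r)  # AE_001 -> view_28d, AE_002 -> click_7d, AE_003 -> view_28d
```

Each attribution model then runs over these candidate rows to assign fractional_credit, which is exactly why re-attribution is a query rather than a re-ingest.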
Enhanced graphical data model — Cross-device Attribution
Section 7 — SQL analysis for business units
Sample data — 1 user (3 devices) → 3 ad events → 1 conversion
INSERT INTO dim_identities VALUES ('IDT_001','PERSON_TOK_42','login_email');
INSERT INTO bridge_identity_devices VALUES
('IDT_001','DEV_PHONE', 1.000,'login', '2024-01-01',NULL),
('IDT_001','DEV_LAPTOP', 0.920,'household_ip', '2024-03-15',NULL),
('IDT_001','DEV_TABLET', 0.880,'behavioral', '2024-06-22',NULL);
-- 3 ad events across 3 devices, same identity:
INSERT INTO fct_ad_events VALUES
('AE_001','impression','IDT_001','DEV_PHONE','CMP_NIKE','feed','2025-04-25 09:15:00','opted_in'),
('AE_002','click', 'IDT_001','DEV_LAPTOP','CMP_NIKE','feed','2025-04-28 14:32:00','opted_in'),
('AE_003','impression','IDT_001','DEV_TABLET','CMP_NIKE','reel','2025-04-30 21:00:00','opted_in');
-- Conversion 1 week after first impression
INSERT INTO fct_conversions VALUES
('CONV_777','ADV_NIKE','2025-05-02 11:42:00','purchase',129.99,'IDT_001');
-- Attribution rows for window=7d_click + last_click vs MTA
INSERT INTO fct_attributions VALUES
('ATR_01','CONV_777','AE_002','MODEL_LAST_CLICK','RUN_2025_05_03','click_7d',TRUE,1.000,129.99),
('ATR_02','CONV_777','AE_001','MODEL_MTA', 'RUN_2025_05_03','view_28d', TRUE,0.300, 38.99),
('ATR_03','CONV_777','AE_002','MODEL_MTA', 'RUN_2025_05_03','click_7d', TRUE,0.500, 65.00),
('ATR_04','CONV_777','AE_003','MODEL_MTA', 'RUN_2025_05_03','view_28d', TRUE,0.200, 26.00);
A. Advertiser · cross-device ROAS by window
Q1 — Conversions attributed by window_type (1d, 7d, 28d) under last-click
SELECT
atr.window_type,
COUNT(DISTINCT atr.conversion_id) AS conversions_attributed,
ROUND(SUM(atr.attributed_value_usd), 2) AS attributed_value
FROM fct_attributions atr
WHERE atr.attribution_run_id = (SELECT MAX(attribution_run_id) FROM fct_attributions)
AND atr.model_id = 'MODEL_LAST_CLICK'
AND atr.is_attributed = TRUE
GROUP BY atr.window_type ORDER BY attributed_value DESC;
-- Lets advertiser see how many conversions fall inside each window. Picks the right window for their funnel.
Q2 — Cross-device match rate (% conversions linked to ≥1 ad event via identity)
WITH conv AS (
SELECT conversion_id, identity_id FROM fct_conversions
WHERE conversion_ts >= CURRENT_DATE - INTERVAL '30 days'
),
matched AS (
SELECT DISTINCT atr.conversion_id FROM fct_attributions atr
JOIN conv USING (conversion_id)
WHERE atr.is_attributed = TRUE
)
SELECT
COUNT(DISTINCT c.conversion_id) AS total_conversions,
COUNT(DISTINCT m.conversion_id) AS matched_conversions,
ROUND(100.0 * COUNT(DISTINCT m.conversion_id) / NULLIF(COUNT(DISTINCT c.conversion_id), 0), 1) AS match_rate_pct
FROM conv c LEFT JOIN matched m USING (conversion_id);
B. Measurement · incrementality lift
Q3 — Conversion rate: treatment vs holdout
WITH treatment AS (
SELECT i.identity_id,
EXISTS (SELECT 1 FROM fct_conversions c WHERE c.identity_id = i.identity_id
AND c.advertiser_id = 'ADV_NIKE'
AND c.conversion_ts BETWEEN '2025-04-01' AND '2025-05-01') AS converted
FROM dim_identities i
WHERE i.identity_id NOT IN (SELECT identity_id FROM dim_user_holdouts
WHERE advertiser_id = 'ADV_NIKE'
AND period = '2025-04')
),
holdout AS (
SELECT i.identity_id,
EXISTS (SELECT 1 FROM fct_conversions c WHERE c.identity_id = i.identity_id
AND c.advertiser_id = 'ADV_NIKE'
AND c.conversion_ts BETWEEN '2025-04-01' AND '2025-05-01') AS converted
FROM dim_identities i
JOIN dim_user_holdouts h ON i.identity_id = h.identity_id
WHERE h.advertiser_id = 'ADV_NIKE' AND h.period = '2025-04'
)
SELECT 'treatment' AS arm, COUNT(*) AS n, ROUND(100.0 * AVG(converted::INT), 3) AS conv_rate_pct FROM treatment
UNION ALL
SELECT 'holdout', COUNT(*), ROUND(100.0 * AVG(converted::INT), 3) FROM holdout;
-- Lift = treatment_rate - holdout_rate. Statistical significance = Welch's t-test on the two proportions.
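To make the lift comment concrete, here is a stdlib-only two-proportion z-test, a common alternative to the Welch's test mentioned in the SQL comment; the counts are illustrative, not taken from the sample data:

```python
from math import erf, sqrt

def two_proportion_z(conv_t, n_t, conv_h, n_h):
    """Z-test for lift = p_treatment - p_holdout; returns (lift, two-sided p-value)."""
    p_t, p_h = conv_t / n_t, conv_h / n_h
    pooled = (conv_t + conv_h) / (n_t + n_h)
    se = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_h))
    z = (p_t - p_h) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return p_t - p_h, p_value

# Illustrative counts: 1.9% treatment vs 1.5% holdout conversion rate.
lift, p = two_proportion_z(conv_t=1900, n_t=100_000, conv_h=150, n_h=10_000)
print(round(lift, 4), p < 0.05)  # positive lift, significant at alpha = 0.05
```

The two arms come straight from the Q3 output; the only modeling choice that matters is that holdout assignment lives in dim_user_holdouts, so the denominators are stable across campaign re-orgs.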
Worked example — User IDT_001 attribution math
- User IDT_001 sees Nike impression on phone (Apr 25), clicks Nike ad on laptop (Apr 28), sees Nike Reels on tablet (Apr 30).
- Buys Nike shoes for $129.99 on May 2.
- Last-click (click_7d window): credit 1.0 → all $129.99 attributed to AE_002 (the laptop click).
- MTA model (multi-touch): AE_001 view 30%, AE_002 click 50%, AE_003 view 20% → $38.99 + $65.00 + $26.00 = $129.99 ✓
- Identity bridge enabled the cross-device join; without it the 3 events wouldn't connect. link_confidence on each device drives MTA weighting in advanced models.
- Both attribution rows live in fct_attributions with attribution_run_id = RUN_2025_05_03. Switching models = SELECT, not re-ingest.
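One detail worth calling out in the credit math: naive per-row rounding of $129.99 × {0.3, 0.5, 0.2} yields $130.00, a phantom cent. A standard fix (an assumption here, not a claim about Meta's implementation) is largest-remainder allocation in integer cents, which guarantees the attributed values reconcile to the conversion value:

```python
def allocate_credit(value_cents: int, credits: list[float]) -> list[int]:
    """Split a conversion value across touchpoints so the cents sum exactly.
    Largest-remainder method: floor every share, then hand the leftover cents
    to the shares with the biggest truncated remainders."""
    raw = [value_cents * c for c in credits]
    shares = [int(r) for r in raw]            # floor each share
    leftover = value_cents - sum(shares)      # cents still unassigned
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - shares[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares

# The IDT_001 example: $129.99 split 30% / 50% / 20%.
shares = allocate_credit(12999, [0.30, 0.50, 0.20])
print([s / 100 for s in shares], sum(shares) == 12999)
```

The published split ($38.99 / $65.00 / $26.00) is a different but equally exact allocation; any deterministic tie-break is fine as long as the cents reconcile to $129.99.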
Section 8 — Why this works
- Identity graph as a first-class fact. bridge_identity_devices SCD2 with confidence is how cross-device works without lying about certainty.
- Match at read. Multiple windows + multiple models computed in parallel; the advertiser picks at query time.
- iOS / SKAN carve-out. The privacy path is explicit; aggregate-only queries fall back to fct_skan_postbacks.
- Incrementality is real. Holdout assignment is a separate dim, not derived — it survives campaign re-orgs.
Scenario 6 — CTV ad inventory, pacing & frequency capping
Why this is hard — the design tension
CTV inventory has a contractual layer that programmatic doesn't: guaranteed buys with monetary penalties for under-delivery. Three pressures pull simultaneously: (1) maximize fill rate, (2) honor frequency caps so users don't see the same ad 8 times, (3) hit committed impressions per advertiser per month. Modeling unfilled opportunities (not just impressions) is the senior move — it's the only way to answer "why did we leave money on the table?"
Section 1 — Dimensional Model Overview
Fact Tables
- fct_ad_opportunities — The "Slot" grain. One row per ad-slot opportunity in a viewing session. Filled or unfilled — both kept.
- fct_ad_impressions — The "Served" grain. Subset of opportunities where an ad actually played.
- fct_pacing_daily — The "Contract" grain. One row per (campaign × day) aggregating delivered vs committed.
- fct_frequency_state — The "Cap" grain. One row per (profile × campaign × day) tracking impressions toward the cap.
- fct_makegood_obligations — The "Owed" grain. One row per (campaign × month) when the commitment is not met. Drives next-month bonus inventory.
Dimension Tables
- dim_advertisers, dim_campaigns, dim_creatives (SCD2)
- dim_campaign_flights — the period a campaign runs (start_date / end_date / total_committed_impressions)
- dim_titles (SCD2 for content metadata; ratings drive ad eligibility)
- dim_profiles (SCD2; subscription tier = ad-supported / premium)
Section 2 — Logical Data Model (key columns)
fct_ad_opportunities — opportunity_id, profile_id, title_id, session_id, slot_type ∈ {pre_roll, mid_roll, post_roll}, slot_position_sec, slot_duration_sec, predicted_cpm, served_creative_id (NULL = unfilled), fill_reason {won_auction, guaranteed_buy, house_ad}, unfilled_reason {frequency_capped, no_eligible_demand, content_unsuitable}, opportunity_ts.
fct_pacing_daily — campaign_id, day, delivered, committed_today, cumulative_delivered, total_committed, pacing_index (1.0 = on track), days_remaining.
fct_makegood_obligations — campaign_id, original_period, makegood_period, owed_impressions, owed_value_usd, status {pending, scheduled, fulfilled}.
Section 3 — How the model serves each stakeholder
Ad server · live decisions — for each opportunity, query fct_frequency_state + fct_pacing_daily + active campaign list. Pick a creative or mark unfilled.
Account managers · advertiser pacing — track pacing_index per campaign daily. If <0.95 with <5 days left, increase delivery priority.
Finance · make-good bookkeeping — fct_makegood_obligations tracks dollars owed when committed impressions not met. Liability on the balance sheet until fulfilled.
Exec · "money on the table" analysis — count fct_ad_opportunities with served_creative_id IS NULL grouped by unfilled_reason. Tells you if the bottleneck is demand (sales team) or eligibility rules (ops team).
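The ad server's "live decisions" read path can be sketched as one function over the two state facts. Everything here is schematic: the cap of 3 and the lowest-pacing-first priority rule are illustrative assumptions, not Netflix's actual policy:

```python
# In-memory stand-ins for fct_frequency_state and fct_pacing_daily lookups.
frequency_state = {("PROF_A", "CMP_FORD"): 3}   # impressions served today
pacing = {"CMP_FORD": 0.62, "CMP_KIA": 1.10}    # pacing_index per campaign
DAILY_CAP = 3  # illustrative frequency cap per (profile x campaign x day)

def decide(profile_id: str, eligible_campaigns: list[str]):
    """Return (served_campaign, unfilled_reason) for one ad opportunity."""
    # Drop campaigns that would breach this viewer's frequency cap.
    uncapped = [c for c in eligible_campaigns
                if frequency_state.get((profile_id, c), 0) < DAILY_CAP]
    if not eligible_campaigns:
        return None, "no_eligible_demand"
    if not uncapped:
        return None, "frequency_capped"
    # Most-behind-pace campaign first: protects guaranteed commitments.
    return min(uncapped, key=lambda c: pacing.get(c, 1.0)), None

print(decide("PROF_A", ["CMP_FORD", "CMP_KIA"]))  # FORD is capped -> KIA serves
print(decide("PROF_A", ["CMP_FORD"]))             # -> (None, 'frequency_capped')
```

Whatever the function returns is written back as a fct_ad_opportunities row, filled or unfilled, which is what makes the "money on the table" report in Q1 possible.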
Section 4 — Why model unfilled opportunities
- Without unfilled rows: can't compute fill rate. Can't tell if you under-delivered because of low demand vs frequency caps vs content unsuitability.
- With unfilled rows: fill rate = COUNT(filled) / COUNT(total opportunities). Drill into unfilled_reason to find the right team to fix it.
- Cost: roughly doubles the row count. The tradeoff is worth it; partitioning by date keeps queries fast.
Enhanced graphical data model — Netflix CTV Inventory
Section 7 — SQL analysis
Sample data — 1 campaign, 5 opportunities (3 filled / 2 unfilled), 1 day pacing
INSERT INTO dim_campaigns VALUES ('CMP_FORD','ADV_FORD','2025-05-01','2025-05-31',5000000,1.0);
INSERT INTO fct_ad_opportunities VALUES
('OPP_001','PROF_A','TITLE_X','SESS_1','pre_roll', 0, 15, 25.50, 'CRT_FORD_15S', 'guaranteed_buy', NULL, '2025-05-01 19:32:00'),
('OPP_002','PROF_A','TITLE_X','SESS_1','mid_roll',1200, 15, 25.50, NULL, NULL, 'frequency_capped', '2025-05-01 19:52:00'),
('OPP_003','PROF_B','TITLE_Y','SESS_2','pre_roll', 0, 15, 25.50, 'CRT_FORD_15S', 'guaranteed_buy', NULL, '2025-05-01 20:05:00'),
('OPP_004','PROF_C','TITLE_Z','SESS_3','pre_roll', 0, 30, 25.50, NULL, NULL, 'no_eligible_demand', '2025-05-01 20:22:00'),
('OPP_005','PROF_D','TITLE_X','SESS_4','mid_roll',1500, 15, 25.50, 'CRT_FORD_15S', 'guaranteed_buy', NULL, '2025-05-01 21:00:00');
-- Pacing for the day:
INSERT INTO fct_pacing_daily VALUES ('CMP_FORD','2025-05-01', 3, 161290, 3, 5000000, 0.000, 30);
-- Way under: should have delivered 161290 today, only got 3.
A. Operations · fill rate & unfilled diagnostics
Q1 — Fill rate & unfilled-reason breakdown (the "money on the table" report)
SELECT
slot_type,
COUNT(*) AS opportunities,
COUNT(*) FILTER (WHERE served_creative_id IS NOT NULL) AS filled,
ROUND(100.0 * COUNT(*) FILTER (WHERE served_creative_id IS NOT NULL)
/ COUNT(*), 1) AS fill_rate_pct,
COUNT(*) FILTER (WHERE unfilled_reason = 'frequency_capped') AS cap_blocked,
COUNT(*) FILTER (WHERE unfilled_reason = 'no_eligible_demand') AS no_demand,
COUNT(*) FILTER (WHERE unfilled_reason = 'content_unsuitable') AS content_block
FROM fct_ad_opportunities
WHERE DATE(opportunity_ts) = '2025-05-01'
GROUP BY slot_type ORDER BY opportunities DESC;
-- "no_eligible_demand" rows = sales team to fix. "frequency_capped" = ops/cap-policy team.
B. Account managers · pacing alerts
Q2 — Campaigns at risk of under-delivery (pacing < 0.95 with <5 days remaining)
SELECT p.campaign_id, c.advertiser_id,
p.cumulative_delivered, p.total_committed,
ROUND(100.0 * p.cumulative_delivered / NULLIF(p.total_committed, 0), 1) AS pct_delivered,
p.pacing_index, p.days_remaining,
ROUND((p.total_committed - p.cumulative_delivered) / NULLIF(p.days_remaining, 0)) AS daily_needed
FROM fct_pacing_daily p JOIN dim_campaigns c USING (campaign_id)
WHERE p.day = CURRENT_DATE
AND p.pacing_index < 0.95
AND p.days_remaining < 5
ORDER BY p.pacing_index ASC;
-- Triggers ad-server priority bump + account-manager Slack alert.
C. Finance · make-good liability
Q3 — Pending make-good liability ($) by advertiser
SELECT
c.advertiser_id,
COUNT(DISTINCT mg.campaign_id) AS campaigns_owed,
SUM(mg.owed_impressions) AS total_impressions_owed,
ROUND(SUM(mg.owed_value_usd), 2) AS total_dollars_owed,
COUNT(*) FILTER (WHERE mg.status = 'pending') AS pending_count
FROM fct_makegood_obligations mg
JOIN dim_campaigns c USING (campaign_id)
WHERE mg.status IN ('pending','scheduled')
GROUP BY c.advertiser_id
ORDER BY total_dollars_owed DESC;
-- Books a liability on Netflix's balance sheet until fulfilled.
Worked example — CMP_FORD on May 1
- Total committed: 5,000,000 impressions over May. Daily target: 161,290.
- Day 1 delivered: 3 impressions (out of 5 opportunities). 2 unfilled — 1 frequency_capped, 1 no_eligible_demand.
- pacing_index = 3 / 161,290 ≈ 0.0000186 — a catastrophic miss.
- If the pattern continues for 30 days: ~90 delivered total. Make-good owed = (5M − 90) × CPM-based dollar rate.
- Queries trigger: account manager alert (Q2), finance liability booking (Q3), unfilled-reason audit (Q1) reveals "no_eligible_demand" needs sales-team push.
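The pacing arithmetic above is worth being able to do on a whiteboard. A minimal sketch of the CMP_FORD numbers (the 31-day May flight and the 30-day projection are assumptions from the worked example):

```python
# Pacing math for the CMP_FORD worked example — a sketch, not the ad server's code.
total_committed = 5_000_000                      # impressions committed for May
flight_days = 31                                 # May has 31 days
daily_target = total_committed // flight_days    # 161,290
delivered_day1 = 3

pacing_index = delivered_day1 / daily_target     # ~0.0000186 — catastrophic miss
projected_total = delivered_day1 * 30            # ~90 if the pattern holds a month
owed_impressions = total_committed - projected_total  # feeds the make-good liability

print(daily_target, round(pacing_index, 7), owed_impressions)
```

Multiplying `owed_impressions` by the contracted CPM-based rate gives the dollar figure that Q3 books as a liability.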
Section 8 — Why this works
- Unfilled rows are first-class. Fill rate becomes a simple COUNT FILTER, not a derived metric.
- Pacing as a daily snapshot. Not derived on the fly — pre-aggregated for fast dashboard reads.
- Make-goods as a liability fact. Finance owns it; it's a real balance-sheet item, not a metric.
- Frequency caps as state. Per (profile × campaign × day) — exactly what the ad server queries pre-decision.
Scenario 7 — Orders, returns & multi-warehouse inventory
Why this is hard — the design tension
Returns reshape revenue retroactively. Multi-warehouse fulfillment splits a single SKU across two warehouses. Seller-of-record (1P retail / 2P FBA / 3P marketplace) determines whether Amazon books GMV or commission. And inventory has to reconcile to ledger every night. The model must be append-only on the financial side and movement-based on the physical side — same product, two grains, one source of truth.
Section 1 — Dimensional Model Overview
Fact Tables
- fct_order_lines — The "Sale" grain. One row per (order × product × source_warehouse). 3-item order = 3 rows.
- fct_returns — The "Refund" grain. One row per return event, line-level. Append-only.
- fct_inventory_movements — The "Stock-flow" grain. One row per stock change (in / out / transfer / damage / adjustment).
- snap_inventory_daily — The "State" grain. Daily snapshot per (product × warehouse).
- fct_replacements — The "Linked-order" grain. Maps replacement orders back to their original.
Dimension Tables
- dim_products (SCD2 — title, category, dimensions drift)
- dim_sellers (SCD2 — seller_type ∈ {retail_1p, fba_2p, marketplace_3p}, recognized_as ∈ {gmv, commission_only})
- dim_warehouses, dim_customers + dim_addresses (SCD2), dim_promotions (SCD2)
Section 2 — Logical Data Model (key columns)
fct_order_lines — order_line_id, order_id, line_number, customer_id, product_id, seller_id, source_warehouse_id, ordered_ts, shipped_ts, delivered_ts, quantity, unit_price_usd, promo_discount, tax, shipping, gross_revenue_usd, is_prime, is_subscribe_save.
fct_returns — return_id, order_line_id, return_ts, refund_ts, reason_code, quantity (partial OK), refund_amount_usd, restocking_fee_usd, condition_received {new, used, damaged}, is_replacement BOOL.
fct_inventory_movements — movement_id, product_id, warehouse_id, movement_ts, movement_type {receipt, sale_out, return_in, transfer_out, transfer_in, damage, adjustment}, quantity (signed), unit_cost_usd, related_order_line_id (NULL for non-sale moves).
snap_inventory_daily — product_id, warehouse_id, snapshot_date, opening_qty, receipts, sales, returns_in, transfers_net, damage, ending_qty, weighted_avg_cost. Reconciliation invariant: opening + receipts + returns_in − sales − damage + transfers_net = ending.
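The reconciliation invariant is simple enough to express as a unit-testable function. A sketch over rows shaped like snap_inventory_daily (the sample numbers are hypothetical):

```python
def reconciles(row: dict) -> bool:
    """snap_inventory_daily invariant:
    opening + receipts + returns_in - sales - damage + transfers_net == ending."""
    computed = (row["opening_qty"] + row["receipts"] + row["returns_in"]
                - row["sales"] - row["damage"] + row["transfers_net"])
    return computed == row["ending_qty"]

# A clean day and a drifted day:
ok = {"opening_qty": 100, "receipts": 20, "returns_in": 3,
      "sales": 15, "damage": 1, "transfers_net": -5, "ending_qty": 102}
bad = dict(ok, ending_qty=104)   # 2 units of unexplained drift
print(reconciles(ok), reconciles(bad))   # True False
```

This is exactly what the nightly Q3 check runs in SQL; any False row pages operations.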
Section 3 — How the model serves each stakeholder
Customer experience — order delivery SLA from fct_order_lines (ordered → delivered). Return rate by category from fct_returns tracks product quality.
Finance — net revenue = SUM(order_lines.gross_revenue_usd) − SUM(returns.refund_amount_usd). For 3P sellers, dim_sellers.recognized_as = 'commission_only' means only the commission is booked.
Operations — multi-warehouse fill rate, transfer efficiency, days of inventory on hand. Reconciliation runs nightly: snap_inventory_daily ending_qty must match SUM(fct_inventory_movements) since opening.
Section 4 — Why append-only returns + movement-based inventory
- Returns reshape historical revenue. Never UPDATE fct_order_lines. Append to fct_returns. Finance can restate any historical period correctly.
- Inventory snapshots derive from movements. Snapshots are pre-aggregated for fast reads; movements are the immutable source. Lose a snapshot, rebuild it from movements.
- Replacements are linked, not collapsed. The original order stays + the replacement is a new row + the return matches both. Triple-counted at the gross level, nets to zero correctly.
- 3P seller revenue carve-out. dim_sellers.recognized_as drives whether SUM is GMV or commission — a single column makes the rule explicit and queryable.
Enhanced graphical data model — Amazon Orders + Returns + Inventory
Section 7 — SQL analysis
Sample data — 1 order with 2 line items (one split across 2 warehouses), 1 partial return
INSERT INTO dim_sellers VALUES
('SLR_AMZN','retail_1p','gmv', 0.0),
('SLR_3P_ABC','marketplace_3p','commission_only', 0.15);
-- 3-item order; line 2 split across 2 warehouses (qty 4 from WH-A, qty 2 from WH-B)
INSERT INTO fct_order_lines VALUES
('OL_001','ORD_99', 1, 'CUST_X','PROD_BOOK', 'SLR_AMZN','WH-A','2025-05-01 10:00','2025-05-02 14:00','2025-05-03 16:00', 1, 19.99, 0.00, 1.65, 0.00, 21.64, TRUE, FALSE),
('OL_002','ORD_99', 2, 'CUST_X','PROD_PEN', 'SLR_3P_ABC','WH-A','2025-05-01 10:00','2025-05-02 14:00','2025-05-03 16:00', 4, 3.50, 0.00, 1.16, 0.00, 15.16, TRUE, FALSE),
('OL_003','ORD_99', 2, 'CUST_X','PROD_PEN', 'SLR_3P_ABC','WH-B','2025-05-01 10:00','2025-05-02 14:00','2025-05-03 16:00', 2, 3.50, 0.00, 0.58, 0.00, 7.58, TRUE, FALSE);
-- Partial return: customer keeps the book, returns 3 of 6 pens
INSERT INTO fct_returns VALUES
('RET_001','OL_002','2025-05-08 09:00','2025-05-08 09:30','quality_issue', 3, 11.37, 0.00, 'used', FALSE);
A. Customer · order funnel + delivery SLA
Q1 — Prime vs non-Prime delivery time (last 30 days)
SELECT
is_prime,
COUNT(*) AS lines,
ROUND(AVG(EXTRACT(EPOCH FROM (delivered_ts - ordered_ts))/3600), 1) AS avg_hours,
ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (delivered_ts - ordered_ts))/3600), 1) AS p95_hours
FROM fct_order_lines
WHERE delivered_ts IS NOT NULL AND ordered_ts >= CURRENT_DATE - 30
GROUP BY is_prime;
B. Finance · net revenue with 1P/3P carve-out
Q2 — Net revenue with seller-type carve-out (3P books only commission)
WITH gross AS (
SELECT
s.seller_type,
s.recognized_as,
SUM(CASE WHEN s.recognized_as = 'gmv' THEN ol.gross_revenue_usd
ELSE ol.gross_revenue_usd * s.commission_pct END) AS gross_recognized_usd
FROM fct_order_lines ol JOIN dim_sellers s USING (seller_id)
WHERE DATE(ol.ordered_ts) = '2025-05-01'
GROUP BY s.seller_type, s.recognized_as
),
returns AS (
SELECT
s.seller_type,
SUM(CASE WHEN s.recognized_as = 'gmv' THEN r.refund_amount_usd
ELSE r.refund_amount_usd * s.commission_pct END) AS refund_recognized_usd
FROM fct_returns r
JOIN fct_order_lines ol ON r.order_line_id = ol.order_line_id
JOIN dim_sellers s USING (seller_id)
WHERE DATE(r.return_ts) = '2025-05-01'
GROUP BY s.seller_type
)
SELECT g.seller_type,
ROUND(g.gross_recognized_usd, 2) AS gross_revenue,
ROUND(COALESCE(r.refund_recognized_usd, 0), 2) AS refunds,
ROUND(g.gross_recognized_usd - COALESCE(r.refund_recognized_usd, 0), 2) AS net_revenue
FROM gross g LEFT JOIN returns r USING (seller_type);
C. Operations · inventory reconciliation
Q3 — Daily inventory invariant check (snap = opening + receipts + returns_in − sales − damage + transfers)
SELECT
product_id, warehouse_id, snapshot_date,
opening_qty, receipts, returns_in, sales, damage, transfers_net, ending_qty,
(opening_qty + receipts + returns_in - sales - damage + transfers_net) AS computed_ending,
ending_qty - (opening_qty + receipts + returns_in - sales - damage + transfers_net) AS drift,
CASE WHEN ending_qty = opening_qty + receipts + returns_in - sales - damage + transfers_net
THEN 'PASS' ELSE 'FAIL' END AS reconciliation
FROM snap_inventory_daily
WHERE snapshot_date = CURRENT_DATE - 1
AND ending_qty != opening_qty + receipts + returns_in - sales - damage + transfers_net;
-- Empty result = clean. Non-empty = inventory adjustment needed; investigate.
Worked example — Order ORD_99 with split fulfillment + partial return
- Order: 1 book ($21.64) + 6 pens. Pens split across 2 warehouses: 4 from WH-A ($15.16) + 2 from WH-B ($7.58) = 6 pens for $22.74. Total order gross = $44.38.
- Seller-type effects on revenue: Book is 1P retail → Amazon books $21.64 GMV. Pens are 3P seller (15% commission) → Amazon books $22.74 × 0.15 = $3.41. Total Amazon recognized = $21.64 + $3.41 = $25.05.
- Partial return: 3 of 6 pens returned a week later. Refund = $11.37 (proportional). Amazon's recognized refund = $11.37 × 0.15 = $1.71.
- Net Amazon recognized: $25.05 − $1.71 = $23.34.
- Inventory: WH-A movements record sale_out (−4 pens), then return_in (+3 pens when the return arrives). WH-B records sale_out (−2 pens). The daily snapshot reconciles.
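The proportional-refund and carve-out arithmetic from this worked example, as a quick sanity check (numbers taken from the ORD_99 sample):

```python
# Proportional refund + 1P/3P carve-out math for ORD_99 — a sketch of the rule,
# not production revenue-recognition code.
book_gross = 21.64            # 1P retail line — Amazon books full GMV
pen_gross = 15.16 + 7.58      # 3P line split across WH-A and WH-B (6 pens)
commission_pct = 0.15

amazon_recognized = book_gross + pen_gross * commission_pct   # GMV + commission

# Partial return: 3 of the 4 pens on OL_002 (line gross 15.16, tax included)
refund = round(15.16 * 3 / 4, 2)                   # proportional refund
amazon_refund = round(refund * commission_pct, 2)  # only the commission reverses

net = round(amazon_recognized - amazon_refund, 2)
print(round(amazon_recognized, 2), refund, amazon_refund, net)
```

The same CASE on `recognized_as` in the Q2 SQL produces these figures at scale.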
Section 8 — Why this works
- Append-only returns. Net revenue = SUM(orders) − SUM(returns). No UPDATE; full historical audit.
- Movements are immutable. Snapshots derive from movements; rebuild at any time.
- Seller-type carve-out is a column. 1P / 2P / 3P revenue rules become a CASE in the query, not a separate ETL path.
- Multi-warehouse split is a row. Same order_id + line_number, different warehouse → fulfilment is queryable without losing the customer's "1 line item" mental model.
"dim_sellers.recognized_as turns a multi-billion-dollar revenue policy into a queryable column, not a brittle ETL branch — that's the difference between a model that survives quarter-end and one that doesn't."
Scenario 8 — Feed engagement at scale (with ranker A/B)
Why this is hard — the design tension
Volume alone forces design choices: 100B impressions/day means exact COUNT DISTINCT is unaffordable; HLL is mandatory. Mutable engagements (like → unlike) demand append-only with is_undone flags. Heartbeat-driven dwell can't live at raw grain — must be aggregated at ETL. And the ranker A/B requirement means every impression must carry the model_id that ranked it, or you can't compare baseline lift.
Section 1 — Dimensional Model Overview
Fact Tables
- fct_impressions — The "Served" grain. One row per (viewer × post × impression). Carries ranker_model_id + predicted_score.
- fct_engagements — The "Reaction" grain. One row per (viewer × post × event_type × event_ts). Likes, saves, comments, shares, hides, reports — all here.
- fct_dwell_per_impression — The "Time-on-post" grain. One row per impression with total dwell_ms (aggregated from heartbeats at ETL).
- fct_ranker_experiments — The "Treatment" grain. One row per (viewer × experiment × variant). Drives A/B comparison.
Dimension Tables
- dim_users (SCD2) · dim_posts (SCD2 — caption edits, audio swaps) · dim_creators (SCD2)
- dim_post_audio (SCD1)
- dim_ranker_models — SCD2; semver + model_card_url + deployed_pct.
Section 2 — Logical Data Model (key columns)
fct_impressions — impression_id, viewer_id, post_id, feed_position SMALLINT, impression_ts, device_id, ranker_model_id (FK), predicted_score DECIMAL(8,6), surface ∈ {feed, reels, stories, explore}.
fct_engagements — engagement_id, impression_id (NULLable: comments can arrive without prior impression event), viewer_id, post_id, event_type ∈ {like, save, share, comment, hide, report, follow_creator}, event_ts, is_undone BOOLEAN, undone_ts. Append-only — never delete an unlike.
fct_dwell_per_impression — impression_id, dwell_ms INT, was_full_view BOOLEAN, viewport_pct DECIMAL. Aggregated from raw 250ms heartbeats at ETL — saves 100× rows.
fct_ranker_experiments — viewer_id, experiment_id, variant_id, assigned_ts, exit_ts. Bridges viewers to A/B-test arms.
Section 3 — How the model serves each stakeholder
Ranker team · model lift — Compare engagement-rate per impression for ranker_model_id = v3.1 vs v3.2. Positive lift on like-rate but watch hide-rate; both should move favorably.
Creator analytics — Reach (impressions), engagement rate (engagements / impressions), dwell distribution per post.
Brand safety — Hide+Report rate per post per hour. Posts crossing thresholds get demoted; spike → human review queue.
Product growth — Negative-engagement rate is the canary for "this content drives session abandonment".
Section 4 — Why ranker_model_id on every impression + append-only engagements
- Ranker A/B requires the model on the row. Without it, a 1% holdout for the new model can't be compared to the 99% baseline. ranker_model_id is a first-class dimension, not a footnote.
- Likes are mutable, but the fact is immutable. A like at T1 + an unlike at T2 = 2 rows: an insert and an "undone" insert with is_undone=TRUE, undone_ts=T2. Never DELETE; this preserves the time series of toggle behavior.
- Dwell aggregated at ETL. 250ms heartbeats × 100B impressions × 30 sec average ≈ 12 trillion raw rows/day. Aggregate before the warehouse — only 100B per-impression rows survive.
- Approximate aggregations. APPROX_COUNT_DISTINCT + HLL for everything user-level. Exact only when finance demands it (rare).
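The ETL-time dwell aggregation is a plain group-by over heartbeat ticks. A minimal sketch, assuming a hypothetical 250ms tick payload of (impression_id, tick_ts):

```python
from collections import defaultdict

HEARTBEAT_MS = 250  # client emits one tick per 250ms the post stays on screen

def aggregate_dwell(heartbeats):
    """Collapse raw heartbeat ticks into one dwell value per impression —
    the reduction the ETL performs before anything reaches the warehouse."""
    ticks = defaultdict(int)
    for impression_id, _tick_ts in heartbeats:
        ticks[impression_id] += 1
    return {imp: n * HEARTBEAT_MS for imp, n in ticks.items()}

# ~4.2s on IMP_A ≈ 17 ticks; ~0.8s on IMP_B ≈ 3 ticks
raw = [("IMP_A", t) for t in range(17)] + [("IMP_B", t) for t in range(3)]
print(aggregate_dwell(raw))   # {'IMP_A': 4250, 'IMP_B': 750}
```

The 100×–120× row reduction is what makes fct_dwell_per_impression affordable at 100B impressions/day.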
Enhanced graphical data model — Instagram Feed Engagement
Section 7 — SQL analysis
Sample data — 1 viewer sees 3 posts under v3.2 ranker; engages with two
INSERT INTO fct_impressions VALUES
('IMP_A','U_42','POST_001', 1,'2025-05-01 09:00:00','dev_iphone','model_v3.2',0.72,'feed'),
('IMP_B','U_42','POST_002', 2,'2025-05-01 09:00:08','dev_iphone','model_v3.2',0.55,'feed'),
('IMP_C','U_42','POST_003', 3,'2025-05-01 09:00:15','dev_iphone','model_v3.2',0.41,'feed');
INSERT INTO fct_dwell_per_impression VALUES
('IMP_A', 4200, TRUE, 1.00), -- 4.2s, full view
('IMP_B', 800, FALSE, 0.40), -- 0.8s, partial — scrolled past
('IMP_C', 6500, TRUE, 1.00); -- 6.5s, full view
INSERT INTO fct_engagements VALUES
('ENG_1','IMP_A','U_42','POST_001','like', '2025-05-01 09:00:04', FALSE, NULL),
('ENG_2','IMP_C','U_42','POST_003','save', '2025-05-01 09:00:21', FALSE, NULL),
('ENG_3','IMP_C','U_42','POST_003','comment','2025-05-01 09:01:00', FALSE, NULL),
('ENG_4','IMP_A','U_42','POST_001','like', '2025-05-01 09:05:00', TRUE, '2025-05-01 09:05:00'); -- unlike
A. Creator analytics · post engagement
Q1 — Like / save / share rate per post (with active engagements only)
WITH latest_state AS (
  SELECT post_id, viewer_id, event_type, is_undone,
         ROW_NUMBER() OVER (PARTITION BY viewer_id, post_id, event_type
                            ORDER BY event_ts DESC) AS rn
  FROM fct_engagements
),
active_counts AS (  -- latest toggle per (viewer × post × event_type) that is still active
  SELECT post_id,
         COUNT(*) FILTER (WHERE event_type = 'like')  AS likes,
         COUNT(*) FILTER (WHERE event_type = 'save')  AS saves,
         COUNT(*) FILTER (WHERE event_type = 'share') AS shares
  FROM latest_state
  WHERE rn = 1 AND is_undone = FALSE
  GROUP BY post_id
)
SELECT
  i.post_id,
  COUNT(*) AS impressions,
  APPROX_COUNT_DISTINCT(i.viewer_id) AS unique_viewers,
  COALESCE(e.likes, 0)  AS likes,
  COALESCE(e.saves, 0)  AS saves,
  COALESCE(e.shares, 0) AS shares,
  ROUND(100.0 * COALESCE(e.likes, 0) / COUNT(*), 2) AS like_rate_pct,
  ROUND(AVG(d.dwell_ms), 0) AS avg_dwell_ms
FROM fct_impressions i
LEFT JOIN active_counts e ON e.post_id = i.post_id
LEFT JOIN fct_dwell_per_impression d ON d.impression_id = i.impression_id
GROUP BY i.post_id, e.likes, e.saves, e.shares
ORDER BY like_rate_pct DESC;
-- Pre-aggregating engagements per post avoids the join fan-out that would inflate impression counts.
B. Brand safety · negative-signal monitoring
Q2 — Posts with hide+report rate above brand-safety threshold
SELECT
i.post_id,
COUNT(DISTINCT i.impression_id) AS impressions,
COUNT(*) FILTER (WHERE e.event_type IN ('hide','report')) AS negative_signals,
ROUND(100.0 * COUNT(*) FILTER (WHERE e.event_type IN ('hide','report'))
/ COUNT(DISTINCT i.impression_id), 3) AS negative_rate_pct
FROM fct_impressions i
LEFT JOIN fct_engagements e ON e.impression_id = i.impression_id AND e.is_undone = FALSE
WHERE DATE(i.impression_ts) = CURRENT_DATE - 1
GROUP BY i.post_id
HAVING COUNT(DISTINCT i.impression_id) > 1000
AND COUNT(*) FILTER (WHERE e.event_type IN ('hide','report')) * 1.0
/ COUNT(DISTINCT i.impression_id) > 0.005
ORDER BY negative_rate_pct DESC;
-- COUNT(DISTINCT impression_id) guards against fan-out when one impression carries several engagements.
-- Posts above 0.5% hide+report rate go to human moderation queue.
C. Ranker · A/B model lift
Q3 — Engagement-rate lift v3.2 over v3.1
WITH per_impression AS (
-- Collapse to one row per impression first so multi-engagement impressions don't fan out the counts
SELECT
i.impression_id,
i.ranker_model_id,
MAX(d.dwell_ms) AS dwell_ms,
BOOL_OR(e.event_type = 'like' AND e.is_undone = FALSE) AS liked,
BOOL_OR(e.event_type = 'save' AND e.is_undone = FALSE) AS saved,
BOOL_OR(e.event_type IN ('hide','report')) AS negative
FROM fct_impressions i
LEFT JOIN fct_engagements e ON e.impression_id = i.impression_id
LEFT JOIN fct_dwell_per_impression d ON d.impression_id = i.impression_id
WHERE i.impression_ts >= CURRENT_DATE - 7
AND i.ranker_model_id IN ('model_v3.1','model_v3.2')
GROUP BY i.impression_id, i.ranker_model_id
)
SELECT ranker_model_id,
COUNT(*) AS impressions,
ROUND(100.0 * COUNT(*) FILTER (WHERE liked) / COUNT(*), 3) AS like_rate_pct,
ROUND(100.0 * COUNT(*) FILTER (WHERE saved) / COUNT(*), 3) AS save_rate_pct,
ROUND(100.0 * COUNT(*) FILTER (WHERE negative) / COUNT(*), 3) AS negative_rate_pct,
ROUND(AVG(dwell_ms), 0) AS avg_dwell_ms
FROM per_impression
GROUP BY ranker_model_id;
-- v3.2 wins if like_rate higher AND negative_rate not worse.
Worked example — viewer U_42's feed session
- 3 posts impressed under model_v3.2. Predicted scores: 0.72, 0.55, 0.41. Ranker order = score-desc, correct.
- POST_001 (score 0.72): 4.2s dwell, like at +4s, full view → strong positive.
- POST_002 (score 0.55): 0.8s dwell, no engagement, partial view (40%) → user scrolled past. Ranker over-predicted.
- POST_003 (score 0.41): 6.5s dwell, save + comment → strong positive. Ranker under-predicted.
- Calibration error on this session: 1/3 (POST_002) too high, 1/3 (POST_003) too low. Logged in fct_impressions.predicted_score for ranker training.
- Mutable like: the user unlikes POST_001 5 minutes later. fct_engagements stores both events (ENG_1 the like, ENG_4 the unlike with is_undone=TRUE). Final state: not liked. Active-state queries take the latest row per (viewer × post × event_type) and keep it only if is_undone = FALSE.
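That "latest event wins" reduction is the crux of append-only toggles. A sketch in Python, with rows shaped like the fct_engagements columns from Section 2:

```python
def active_engagements(rows):
    """Latest row per (viewer, post, event_type) wins; keep it only if not undone.
    rows = (engagement_id, viewer_id, post_id, event_type, event_ts, is_undone)"""
    latest = {}
    for r in sorted(rows, key=lambda r: r[4]):   # ascending event_ts
        latest[(r[1], r[2], r[3])] = r           # later rows overwrite earlier ones
    return {key for key, r in latest.items() if not r[5]}

rows = [
    ("ENG_1", "U_42", "POST_001", "like", "09:00:04", False),
    ("ENG_2", "U_42", "POST_003", "save", "09:00:21", False),
    ("ENG_4", "U_42", "POST_001", "like", "09:05:00", True),   # the unlike
]
print(active_engagements(rows))   # POST_001 like dropped; POST_003 save survives
```

In SQL this is the ROW_NUMBER-over-event_ts pattern; the history is never deleted, only resolved at read time.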
Section 8 — Why this works
- Ranker A/B is queryable. ranker_model_id on every impression → comparing v3.1 vs v3.2 is a GROUP BY, not a separate ETL pipeline.
- Mutable engagements without losing history. Append-only with is_undone preserves the full toggle behavior — useful for "regret rate" analysis.
- Dwell aggregated at ETL. 100B per-impression rows << 12T per-heartbeat rows. The aggregation is non-negotiable.
- HLL by default. Distinct viewer counts at petabyte scale require APPROX_COUNT_DISTINCT — exact only when finance demands it.
"ranker_model_id on the impression is what lets the feed team A/B test new models without losing the ability to compute baseline metrics — model-version is a first-class dimension, not a footnote. Append-only engagements with is_undone means we can answer 'what % of likes get unliked within 5 min?' which is the canary for low-quality ranker decisions."
Scenario 9 — Listening history & pool-model royalty payouts
Why this is hard — the design tension
Two consumers want the same fact for opposite reasons. Product needs every play (skips included) for recommendations. Finance only pays on qualified streams (≥30s). Multi-rights splits mean one stream produces 4+ payout rows (artist + label + publisher + songwriter) and the splits change mid-quarter when deals renegotiate. Pool model math requires computing per-stream rate as revenue ÷ total streams in pool — a chicken-and-egg dependency that resolves only at period close.
Section 1 — Dimensional Model Overview
Fact Tables
- fct_play_events — The "Listen" grain. One row per play event with a qualified_play flag.
- fct_royalty_periods — The "Pool" grain. One row per (country × period). Holds total pool revenue, total qualified streams, computed per-stream rate.
- fct_royalty_attributions — The "Payout" grain. One row per (play × rights_holder × period). Derived at period close.
Dimension & Bridge Tables
- dim_users (SCD2 — plan tier drifts) · dim_subscriptions (Family/Duo group of users)
- dim_tracks (SCD2 — metadata) · dim_artists · dim_labels (SCD2) · dim_publishers (SCD2) · dim_songwriters
- bridge_track_rights — SCD2 with effective dates. The deal-terms ledger.
Section 2 — Logical Data Model (key columns)
fct_play_events — play_id, user_id, track_id, played_ts, duration_played_sec, qualified_play BOOL (≥30s), context_type ∈ {playlist, algorithmic, search, radio}, source_playlist_id, device_id, country_id.
bridge_track_rights (SCD2) — track_id, rights_holder_id, rights_type ∈ {artist, label, publisher, songwriter}, share_pct, effective_from, effective_to. Sum of share_pct per (track, rights_type) per period = 100%.
fct_royalty_periods — period_id, country_id, period_start, period_end, total_pool_revenue_usd, total_qualified_plays, per_stream_rate_usd (= revenue / qualified_plays).
fct_royalty_attributions — play_id, period_id, rights_holder_id, rights_type, share_pct, attributed_payout_usd.
Section 3 — How the model serves each stakeholder
Product · recommendations + Year in Review — every play counts (qualified or not). Skips are signal. context_type tells the recommender which surface drove the play.
Finance · royalty payouts — quarterly close: compute pool revenue per country, count qualified plays, derive rate, generate fct_royalty_attributions. Splits use as-of-play bridge rows, never current.
Rights holders · transparency — labels/publishers want their cut breakdown. Each fct_royalty_attributions row is a defensible per-play line item.
Section 4 — Why the bridge is SCD2 (not just a table)
Label deals renegotiate mid-quarter — a label might move 5% from publisher to artist effective Apr 15. Three failure modes if the bridge isn't SCD2:
- Apr 1-14 plays paid under the wrong split. Restatement needed every quarter — auditor nightmare.
- Lawsuit risk. Songwriters can demand "pay me as of the deal that was active when my song was streamed". SCD2 is the only defensible answer.
- Forecast invalidation. Finance forecasts use the current bridge; without SCD2, mid-quarter changes silently invalidate projections.
The contract: every fct_royalty_attributions row uses bridge_track_rights WHERE play.played_ts BETWEEN bridge.effective_from AND bridge.effective_to.
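The as-of-play lookup is a half-open interval check plus a shares-sum invariant. A sketch over the bridge rows from the sample data below (effective_from <= played < effective_to, mirroring the Q2 predicate):

```python
from datetime import date

# bridge_track_rights rows for TRK_001: (holder, rights_type, share, eff_from, eff_to)
BRIDGE = [
    ("LBL_BIG",     "label",      0.50, date(2024, 1, 1),  date(2025, 4, 15)),
    ("ART_NOVA",    "artist",     0.30, date(2024, 1, 1),  date(2025, 4, 15)),
    ("PUB_HARMONY", "publisher",  0.15, date(2024, 1, 1),  date(2025, 4, 15)),
    ("SW_LEE",      "songwriter", 0.05, date(2024, 1, 1),  date(2025, 4, 15)),
    ("LBL_BIG",     "label",      0.45, date(2025, 4, 15), date(9999, 12, 31)),
    ("ART_NOVA",    "artist",     0.35, date(2025, 4, 15), date(9999, 12, 31)),
    ("PUB_HARMONY", "publisher",  0.15, date(2025, 4, 15), date(9999, 12, 31)),
    ("SW_LEE",      "songwriter", 0.05, date(2025, 4, 15), date(9999, 12, 31)),
]

def splits_as_of(played):
    """Half-open interval: effective_from <= played < effective_to."""
    rows = [r for r in BRIDGE if r[3] <= played < r[4]]
    assert abs(sum(r[2] for r in rows) - 1.0) < 1e-9   # shares must total 100%
    return {r[0]: r[2] for r in rows}

print(splits_as_of(date(2025, 4, 10))["LBL_BIG"])   # 0.5  — pre-renegotiation
print(splits_as_of(date(2025, 5, 20))["LBL_BIG"])   # 0.45 — post-renegotiation
```

The half-open interval matters: a play exactly on the effective date matches exactly one version, never both.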
Enhanced graphical data model — Spotify Listening + Royalty Pool
Section 7 — SQL analysis
Sample data — Country pool with 1B qualified streams in Q2-2025
INSERT INTO fct_royalty_periods VALUES
('PER_2025Q2_US','US','2025-04-01','2025-06-30', 5000000.00, 1000000000, 0.005000);
-- Per-stream rate = $5M / 1B qualified plays = $0.005 / qualified play
INSERT INTO bridge_track_rights VALUES
('TRK_001','LBL_BIG', 'label', 0.50,'2024-01-01','2025-04-15'),
('TRK_001','ART_NOVA', 'artist', 0.30,'2024-01-01','2025-04-15'),
('TRK_001','PUB_HARMONY', 'publisher', 0.15,'2024-01-01','2025-04-15'),
('TRK_001','SW_LEE', 'songwriter',0.05,'2024-01-01','2025-04-15'),
-- Mid-quarter renegotiation: label gives up 5%, artist gains 5%
('TRK_001','LBL_BIG', 'label', 0.45,'2025-04-15','9999-12-31'),
('TRK_001','ART_NOVA', 'artist', 0.35,'2025-04-15','9999-12-31'),
('TRK_001','PUB_HARMONY', 'publisher', 0.15,'2025-04-15','9999-12-31'),
('TRK_001','SW_LEE', 'songwriter',0.05,'2025-04-15','9999-12-31');
-- 2 plays of TRK_001 in US: one before, one after the renegotiation
INSERT INTO fct_play_events VALUES
('PLAY_A','U_42','TRK_001','2025-04-10 14:00:00', 180, TRUE, 'algorithmic', NULL,'iphone','US'),
('PLAY_B','U_42','TRK_001','2025-05-20 09:30:00', 240, TRUE, 'playlist','PL_ROCK','iphone','US');
A. Finance · per-stream rate & pool reconciliation
Q1 — Per-stream rate per country per period
SELECT
rp.country_id, rp.period_start, rp.period_end,
rp.total_qualified_plays,
ROUND(rp.total_pool_revenue_usd, 2) AS pool_usd,
ROUND(rp.per_stream_rate_usd, 6) AS rate_per_stream
FROM fct_royalty_periods rp
WHERE rp.period_start = '2025-04-01'
ORDER BY rp.country_id;
Q2 — Royalty attribution per qualified play (using as-of-played bridge splits)
SELECT
p.play_id,
p.played_ts,
b.rights_type,
b.rights_holder_id,
b.share_pct,
rp.per_stream_rate_usd,
ROUND(rp.per_stream_rate_usd * b.share_pct, 6) AS attributed_payout_usd
FROM fct_play_events p
JOIN fct_royalty_periods rp
ON rp.country_id = p.country_id
AND p.played_ts BETWEEN rp.period_start AND rp.period_end
JOIN bridge_track_rights b
ON b.track_id = p.track_id
AND p.played_ts >= b.effective_from
AND p.played_ts < b.effective_to -- as-of-play splits
WHERE p.qualified_play = TRUE AND p.track_id = 'TRK_001'
ORDER BY p.played_ts, b.rights_type;
B. Product · listening behavior
Q3 — Skip rate by context_type (algorithmic vs playlist vs radio)
SELECT context_type,
COUNT(*) AS plays,
COUNT(*) FILTER (WHERE qualified_play = FALSE) AS skips,
ROUND(100.0 * COUNT(*) FILTER (WHERE qualified_play = FALSE)
/ COUNT(*), 2) AS skip_rate_pct
FROM fct_play_events
WHERE played_ts >= CURRENT_DATE - 30
GROUP BY context_type ORDER BY skip_rate_pct DESC;
-- High skip rate on algorithmic = recommender quality issue. On radio = expected (lean-back).
Worked example — TRK_001 royalty math (before vs after renegotiation)
- Pool: $5,000,000 in US for Q2-2025. 1,000,000,000 qualified streams. Per-stream rate = $0.005.
- PLAY_A (Apr 10): uses pre-renegotiation splits. Label 50% / Artist 30% / Publisher 15% / Songwriter 5%.
- Label: $0.005 × 0.50 = $0.0025
- Artist: $0.005 × 0.30 = $0.0015
- Publisher: $0.005 × 0.15 = $0.00075
- Songwriter: $0.005 × 0.05 = $0.00025
- Sum: $0.005 ✓
- PLAY_B (May 20): uses post-renegotiation splits. Label 45% / Artist 35% / Publisher 15% / Songwriter 5%.
- Label: $0.005 × 0.45 = $0.00225
- Artist: $0.005 × 0.35 = $0.00175
- (others unchanged)
- Sum: $0.005 ✓
- Audit query produces 8 rows total (2 plays × 4 rights holders). Each is defensible: shows the exact bridge row that was active when the play occurred.
Section 8 — Why this works
- One play, two consumers. qualified_play as a flag (not a separate table) keeps Product and Finance on the same source of truth.
- Pool model derived at period close. The per-stream rate isn't real-time — it's only known when the period closes. Modeled as a fact table, not a daily rolling number.
- SCD2 bridge keeps deals defensible. Mid-quarter renegotiations don't break historical payouts.
- Attribution table grows linearly. ~4 rows per qualified play. Quarterly bulk-write at period close, then read-only.
"One qualified_play flag on fct_play_events rather than two separate tables is what lets both teams stay on the same source of truth. SCD2 on bridge_track_rights turns deal renegotiations from an audit nightmare into a predicate."
Scenario 10 — Double-entry ledger for payments (charges, refunds, chargebacks, FX, reserves)
Why this is hard — the design tension
Payments is the domain with zero tolerance for retro edits. A regulator can ask "what was merchant X's balance on day Y as the books showed it then?" — the answer has to be defensible regardless of corrections made since. That forces append-only on every fact. Multi-currency adds an FX gain/loss leg. Chargebacks arrive 30-180 days after the charge. Reserves are an internal transfer, not external money. The model must enforce SUM=0 per balance transaction per currency as a continuous invariant.
Section 1 — Dimensional Model Overview
Fact Tables
- fct_ledger_entries — The "Atom" grain. One row per debit OR credit on (account, currency). The source of truth. Append-only.
- fct_balance_transactions — The "Business event" grain. One row per charge/refund/payout/chargeback. Groups its entries via balance_txn_id.
- snap_account_balance_daily — The "State" grain. Derived — a running SUM per (account × currency × day). Pre-aggregated for fast balance reads.
Dimension Tables
- dim_merchants (SCD2 — risk profile, reserve_pct) · dim_customers · dim_payment_methods (SCD2)
- dim_accounts — Chart of accounts: merchant_balance, fees, reserve, tax, fx_gain_loss, chargeback_reserve.
- dim_currencies · dim_fx_rates (SCD2 daily — FX rate locked at posting time)
Section 2 — Logical Data Model (key columns)
fct_ledger_entries
| Column | Type | Description |
|---|---|---|
| entry_id | PK | Surrogate. |
| balance_txn_id | FK | Groups entries within one business event. |
| account_id | FK | Which account: merchant_balance, fees, reserve, etc. |
| merchant_id | FK | The merchant. |
| currency_code | CHAR(3) | The currency of this entry. |
| amount | DECIMAL(20,4) | Signed. Convention: positive = credit, negative = debit. Sum per (balance_txn × currency) = 0. |
| entry_type | STRING | {charge, refund, fee, fx_gain, fx_loss, payout, chargeback, reserve_hold, reserve_release}. |
| effective_ts, posted_ts | TIMESTAMP_TZ | When the event happened vs when it hit the ledger. |
| source_event_id | UNIQUE | Idempotency key. Producer retries write the same row. |
| reversal_of_id | FK NULL | For reversal entries — the balance_txn_id of the original transaction being reversed. |
fct_balance_transactions — balance_txn_id, txn_type ∈ {charge, refund, payout, chargeback, fee, transfer}, txn_ts, source_object_id (charge_id / refund_id), merchant_id, status.
Section 3 — How the model serves each stakeholder
Merchant · balance & payouts — Available balance = SUM(fct_ledger_entries.amount) WHERE account = 'merchant_balance' AND merchant_id = X. From snap_account_balance_daily for speed.
Finance · reconciliation — Reconcile every balance_txn: SUM(amount) GROUP BY currency = 0. Any failure = producer bug, page on-call.
Risk · chargeback exposure — Aggregate entry_type = 'chargeback' per merchant per day. High rate = risk profile change → increase reserve_pct.
Regulators · audit trail — "Show merchant X's balance as of date Y as the books showed it then" — query fct_ledger_entries WHERE posted_ts <= Y. Because the ledger is append-only, corrections posted after Y are excluded automatically.
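The point-in-time property follows directly from append-only. A sketch with hypothetical entries — a charge, its fee, then a later payout that the as-of query must not see:

```python
from datetime import datetime

# (account_id, currency, amount, posted_ts) — hypothetical rows, not Stripe's schema.
ENTRIES = [
    ("merchant_balance", "EUR", 100.00, datetime(2025, 5, 1)),
    ("merchant_balance", "EUR", -2.90,  datetime(2025, 5, 1)),
    ("merchant_balance", "EUR", -97.10, datetime(2025, 5, 20)),  # posted after Y
]

def balance_as_of(entries, as_of):
    """The balance 'as the books showed it then' is just a SUM over entries
    with posted_ts <= as_of. No retro edits exist, so nothing else is needed."""
    return round(sum(amt for _, _, amt, ts in entries if ts <= as_of), 2)

print(balance_as_of(ENTRIES, datetime(2025, 5, 15)))  # 97.1 — before the later entry
print(balance_as_of(ENTRIES, datetime(2025, 5, 31)))  # 0.0  — after it posts
```

This is the Python shape of Q3 below; the append-only contract is what makes the filter sufficient.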
Section 4 — Why append-only + invariant + idempotency key
- Append-only is regulatory. SOX / PCI / banking audits require an immutable history. Every reversal is a new pair of entries with opposite signs, never a retro edit.
- SUM=0 invariant per (balance_txn, currency). Continuous data-quality check. If it fails, payments stop until fixed.
- Idempotency at the row. UNIQUE(source_event_id) guarantees that producer retries don't create duplicate ledger entries — even under network partitions.
- FX as separate currency entries. A EUR charge → USD payout requires a 4-entry transaction: −EUR (merchant_balance debit) / +EUR (fx_holding credit) / −USD (fx_holding debit at the locked rate) / +USD (merchant_bank credit). FX gain/loss is recognized in the reporting currency.
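The 4-entry FX transaction and its per-currency zero-sum invariant can be sketched as a small builder (account names mirror the sample data below; this is an illustration, not Stripe's implementation):

```python
from collections import defaultdict
from decimal import Decimal

def fx_payout_entries(balance_txn_id, merchant_id, amount_eur, fx_rate):
    """Build the 4-entry EUR→USD payout. Each currency's entries sum to zero."""
    eur = Decimal(amount_eur)
    usd = (eur * Decimal(fx_rate)).quantize(Decimal("0.01"))  # rate locked at posting
    return [
        (balance_txn_id, "merchant_balance", merchant_id, "EUR", -eur),
        (balance_txn_id, "fx_holding",       merchant_id, "EUR",  eur),
        (balance_txn_id, "fx_holding",       merchant_id, "USD", -usd),
        (balance_txn_id, "merchant_bank",    merchant_id, "USD",  usd),
    ]

entries = fx_payout_entries("BTX_002", "MRC_FR", "97.10", "1.08")
sums = defaultdict(Decimal)
for _, _, _, ccy, amt in entries:
    sums[ccy] += amt
assert all(v == 0 for v in sums.values())   # SUM = 0 per currency — the invariant
print(dict(sums))   # {'EUR': Decimal('0.00'), 'USD': Decimal('0.00')}
```

Using Decimal (never floats) for money is the same discipline the DECIMAL(20,4) column enforces in the warehouse.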
Enhanced graphical data model — Stripe Double-Entry Ledger
Section 7 — SQL analysis
Sample data — €100 charge converted to USD payout (multi-currency, 4 entries)
-- Business event 1: charge €100 (customer pays merchant)
INSERT INTO fct_balance_transactions VALUES
('BTX_001','charge', '2025-05-01 09:00:00','CH_777','MRC_FR','succeeded');
-- Ledger entries (SUM = 0 per currency):
INSERT INTO fct_ledger_entries (balance_txn_id, account_id, merchant_id, currency_code, amount, entry_type, effective_ts, source_event_id) VALUES
('BTX_001','merchant_balance','MRC_FR','EUR', 100.00, 'charge', '2025-05-01 09:00','EVT_001'),
('BTX_001','merchant_balance','MRC_FR','EUR', -2.90, 'fee', '2025-05-01 09:00','EVT_002'),
('BTX_001','customer_balance','MRC_FR','EUR', -100.00, 'charge', '2025-05-01 09:00','EVT_003'),
('BTX_001','fees_revenue', 'MRC_FR','EUR', 2.90, 'fee', '2025-05-01 09:00','EVT_004');
-- SUM(amount WHERE currency='EUR') = 100 - 2.90 - 100 + 2.90 = 0 ✓
-- Merchant's EUR balance after the charge: 100 − 2.90 = 97.10 — exactly what the payout below moves.
-- Business event 2: payout €97.10 → USD at locked rate 1.08
INSERT INTO fct_balance_transactions VALUES
('BTX_002','payout', '2025-05-03 10:00:00','PO_777','MRC_FR','paid');
INSERT INTO fct_ledger_entries (balance_txn_id, account_id, merchant_id, currency_code, amount, entry_type, effective_ts, source_event_id) VALUES
('BTX_002','merchant_balance','MRC_FR','EUR', -97.10, 'payout', '2025-05-03 10:00','EVT_005'), -- debit EUR
('BTX_002','fx_holding', 'MRC_FR','EUR', 97.10, 'payout', '2025-05-03 10:00','EVT_006'), -- credit EUR
('BTX_002','fx_holding', 'MRC_FR','USD', -104.87, 'payout', '2025-05-03 10:00','EVT_007'), -- debit USD (97.10 × 1.08)
('BTX_002','merchant_bank', 'MRC_FR','USD', 104.87, 'payout', '2025-05-03 10:00','EVT_008'); -- credit USD
-- SUM EUR = 0 ✓ ; SUM USD = 0 ✓
A. Finance · invariant DQ check
Q1 — Continuous SUM=0 invariant per (balance_txn × currency)
SELECT
balance_txn_id,
currency_code,
ROUND(SUM(amount), 4) AS sum_amount,
COUNT(*) AS entry_count
FROM fct_ledger_entries
WHERE effective_ts >= CURRENT_DATE - 1
GROUP BY balance_txn_id, currency_code
HAVING ABS(SUM(amount)) > 0.0001;
-- Empty = healthy. Any row = producer bug. Halts the payments pipeline.
B. Merchant · balance & payouts
Q2 — Merchant balance per currency (as-of-now)
SELECT
merchant_id, currency_code,
ROUND(SUM(amount), 2) AS balance
FROM fct_ledger_entries
WHERE account_id = 'merchant_balance' AND merchant_id = 'MRC_FR'
GROUP BY merchant_id, currency_code;
-- Or read from snap_account_balance_daily for sub-second response.
Q3 — Audit trail: balance as it appeared on a specific historical date
SELECT
merchant_id, currency_code,
ROUND(SUM(amount), 2) AS balance_as_of_2025_05_15
FROM fct_ledger_entries
WHERE account_id = 'merchant_balance'
AND merchant_id = 'MRC_FR'
AND posted_ts <= '2025-05-15 23:59:59'
GROUP BY merchant_id, currency_code;
-- Defensible regardless of corrections made AFTER 2025-05-15. The append-only contract.
C. Risk · chargeback exposure
Q4 — Chargeback rate per merchant + dispute lag
WITH charges AS (
SELECT merchant_id, balance_txn_id, effective_ts AS charge_ts, ABS(amount) AS gross_usd
FROM fct_ledger_entries
WHERE entry_type = 'charge' AND account_id = 'merchant_balance' AND amount > 0
),
chargebacks AS (
SELECT merchant_id, reversal_of_id, effective_ts AS chargeback_ts, ABS(amount) AS chargeback_usd
FROM fct_ledger_entries WHERE entry_type = 'chargeback'
)
SELECT
c.merchant_id,
COUNT(c.balance_txn_id) AS total_charges,
COUNT(cb.reversal_of_id) AS total_chargebacks,
ROUND(100.0 * COUNT(cb.reversal_of_id) / NULLIF(COUNT(c.balance_txn_id), 0), 2) AS chargeback_rate_pct,
ROUND(AVG(EXTRACT(EPOCH FROM (cb.chargeback_ts - c.charge_ts))/86400), 1) AS avg_dispute_lag_days
FROM charges c
LEFT JOIN chargebacks cb ON cb.reversal_of_id = c.balance_txn_id
WHERE c.charge_ts >= CURRENT_DATE - 90
GROUP BY c.merchant_id
HAVING COUNT(c.balance_txn_id) > 100
ORDER BY chargeback_rate_pct DESC;
-- Merchants >1% chargeback rate trigger reserve_pct increase.
Worked example — €100 charge → USD payout
- Charge BTX_001 (€100): 4 ledger entries in EUR.
- +€100 to merchant_balance · −€2.90 posted to the fees account · net €97.10 payable to the merchant.
- −€100 from customer_balance · +€2.90 to fees_revenue.
- EUR sum = 100 − 2.90 − 100 + 2.90 = 0 ✓
- Payout BTX_002 (3 days later, FX rate locked at 1 EUR = 1.08 USD).
- −€97.10 from merchant_balance · +€97.10 to fx_holding (EUR side closes)
- −$104.87 from fx_holding (USD side, 97.10 × 1.08) · +$104.87 to merchant_bank
- EUR sum = 0 ✓ · USD sum = 0 ✓
- Two months later: chargeback on BTX_001. Append 4 new entries that reverse the charge (positive amounts where original was negative).
reversal_of_id links each new entry to its original. Original rows untouched. Audit trail preserved. Net effect: the merchant-balance change from the chargeback is recorded as a fresh balance_txn.
Section 8 — Why this works
- Append-only is the contract. Historical balance reads ("as of date Y") are defensible regardless of corrections made later. Required by SOX, PCI, banking regulators.
- Invariant catches bugs at write time. Continuous DQ on SUM=0 — any producer regression halts the pipeline before bad data reaches finance.
- Idempotency at the row. UNIQUE(source_event_id) handles producer retries without duplicating money.
- FX as currency entries. No "convert at read time" — every conversion is a recorded ledger event with its locked rate.
Scenario 11 — Bookings, calendar, cancellations & reviews
Why this is hard — the design tension
The calendar is the hot read path: a search query for "Paris, July 4-7, 2 guests" hits potentially millions of (listing × night) rows in milliseconds. Calendar-as-fact with daily snapshots is the canonical answer. Booking modifications require append-only — the same booking_id can't UPDATE in place. Cancellation refunds depend on (policy × days-from-check-in) and must be computed at cancel time and locked on the booking row. Multi-currency: the guest pays in EUR while the host is paid in USD; FX is locked at booking time.
Section 1 — Dimensional Model Overview
Fact Tables
- fct_calendar_nights — The "Availability" grain. One row per (listing × night × snapshot_date). Daily snapshots enable point-in-time price lookup.
- fct_bookings — The "Reservation" grain. One row per booking lifecycle. Append-only — modifications create new rows.
- fct_reviews — The "Bilateral feedback" grain. One row per review (typed: guest_to_host OR host_to_guest).
- fct_host_payouts — The "Payout" grain. One row per payout event after stay completion.
Dimension Tables
- dim_listings (SCD2 — price, policies, capacity drift)
- dim_hosts (SCD2 — superhost status, response rate)
- dim_guests
- dim_geography (neighborhood, lat/lon, market_id)
Section 2 — Logical Data Model (key columns)
fct_calendar_nights
| Column | Type | Description |
|---|---|---|
| listing_id, night_date, snapshot_date | Composite PK | Time-travel via snapshot. |
| is_available, is_blocked, is_booked | BOOLEAN | Mutually exclusive states. |
| booking_id | FK (nullable) | Set when booked. |
| base_price_local, smart_price_local, effective_price_local | DECIMAL | Base = host-set; smart = algorithmic; effective = what was charged. |
| currency_code | CHAR(3) | Listing's currency. |
fct_bookings — booking_id, listing_id, guest_id, request_ts, accepted_ts, cancelled_ts, cancellation_actor ∈ {guest, host, platform}, check_in_date, check_out_date, nights, guests, subtotal/cleaning/service/taxes/total_local, fx_rate_to_usd, total_usd, cancellation_policy ∈ {flexible, moderate, strict, super_strict}, refund_amount_usd, terminal_state.
Section 3 — How the model serves each stakeholder
Guest · search & book — Search joins fct_calendar_nights WHERE is_available=TRUE AND night_date BETWEEN x AND y. Hot path; pushed to Elasticsearch + indexed in warehouse.
Host · earnings & calendar mgmt — Daily payouts from fct_host_payouts. Pricing team analyzes fct_calendar_nights over time to recommend smart-price strategy.
Pricing team · "what was the price on date X?" — Time-travel via snapshot_date in the PK. Critical for dynamic-pricing experiments and dispute resolution.
Trust & safety · review-rate analysis — Bilateral reviews; if a host has high "guest cancelled" rate, T&S investigates.
Section 4 — Why calendar-as-fact + daily snapshot
- Without snapshots: you can only answer "what's the price now?" Time-travel queries require reconstructing state from dim_listings SCD2 + booking events — slow, error-prone, often impossible after data drift.
- With daily snapshots: "what was the price of listing X on July 4 last year (as Airbnb showed it on July 1)?" is one query.
- Cost: 7M listings × 365 future nights = ~2.6B rows per daily snapshot. Partition by night_date; aggressively expire old snapshots (keep 90 days raw, then weekly).
- Append-only on bookings. Modifications generate a new booking row with parent_booking_id; the original sets terminal_state = 'modified'. Audit trail preserved.
Enhanced graphical data model — Airbnb Bookings + Calendar
Section 7 — SQL analysis
Sample data — 1 listing, 5 nights of calendar, 1 booking, 1 guest cancellation
INSERT INTO fct_calendar_nights VALUES
('LST_PARIS_42','2025-07-04','2025-06-01', FALSE, FALSE, TRUE, 'BKG_777', 200.00, 240.00, 240.00,'EUR'),
('LST_PARIS_42','2025-07-05','2025-06-01', FALSE, FALSE, TRUE, 'BKG_777', 200.00, 240.00, 240.00,'EUR'),
('LST_PARIS_42','2025-07-06','2025-06-01', FALSE, FALSE, TRUE, 'BKG_777', 200.00, 240.00, 240.00,'EUR'),
('LST_PARIS_42','2025-07-07','2025-06-01', TRUE, FALSE, FALSE, NULL, 200.00, 220.00, 220.00,'EUR'),
('LST_PARIS_42','2025-07-08','2025-06-01', TRUE, FALSE, FALSE, NULL, 200.00, 220.00, 220.00,'EUR');
INSERT INTO fct_bookings VALUES
('BKG_777','LST_PARIS_42','GUEST_LISA','2025-06-01 10:00','2025-06-01 10:30',NULL,NULL,
'2025-07-04','2025-07-07', 3, 2,
720.00, 80.00, 56.00, 64.00, 920.00, 1.08, 993.60,
'moderate', NULL, 'confirmed');
-- Guest cancels 5 days before check-in ('moderate' policy → 50% refund here).
-- Cancellation closes the lifecycle row in place (terminal-state fields only);
-- date/guest modifications would instead append a new row via parent_booking_id.
UPDATE fct_bookings SET cancelled_ts='2025-06-29 14:00', cancellation_actor='guest',
refund_amount_usd=496.80, terminal_state='cancelled' WHERE booking_id='BKG_777';
A. Search · calendar utilization
Q1 — Available listings in Paris for July 4-7, 2 guests
SELECT l.listing_id, l.title, MIN(c.effective_price_local) AS price_low, MAX(c.effective_price_local) AS price_high
FROM dim_listings l
JOIN fct_calendar_nights c ON l.listing_id = c.listing_id
JOIN dim_geography g ON l.geography_id = g.geography_id
WHERE c.snapshot_date = CURRENT_DATE
AND c.night_date BETWEEN '2025-07-04' AND '2025-07-06'
AND c.is_available = TRUE
AND l.capacity >= 2
AND g.market_id = 'PARIS'
GROUP BY l.listing_id, l.title
HAVING COUNT(*) = 3; -- all 3 nights available
B. Pricing team · time-travel queries
Q2 — Price of LST_PARIS_42 on July 4, 2025 — as Airbnb showed it on June 1
SELECT listing_id, night_date, base_price_local, smart_price_local, effective_price_local
FROM fct_calendar_nights
WHERE listing_id = 'LST_PARIS_42'
AND night_date = '2025-07-04'
AND snapshot_date = '2025-06-01';
-- The data product the dynamic-pricing team can't ship without.
C. Trust & safety · cancellation analysis
Q3 — Host cancellation rate (a strong T&S signal)
SELECT
l.host_id,
COUNT(*) AS total_bookings,
COUNT(*) FILTER (WHERE b.cancellation_actor = 'host') AS host_cancels,
ROUND(100.0 * COUNT(*) FILTER (WHERE b.cancellation_actor = 'host')
/ COUNT(*), 2) AS host_cancel_rate_pct
FROM fct_bookings b
JOIN dim_listings l USING (listing_id)
WHERE b.request_ts >= CURRENT_DATE - 90
GROUP BY l.host_id
HAVING COUNT(*) > 10
ORDER BY host_cancel_rate_pct DESC;
-- Hosts >3% rate get warning; >5% lose Superhost status.
Worked example — BKG_777 cancellation under "moderate" policy
- Booking total: $993.60 for 3 nights (€720 subtotal + €80 cleaning + €56 service + €64 tax = €920, × FX 1.08 = $993.60).
- Cancelled 5 days before check-in.
- Moderate policy refund_curve (simplified for this example): ≥7 days before check-in = full refund; inside 7 days = 50% of the total charge.
- 5 days out falls inside the partial-refund window → refund = $993.60 × 0.50 = $496.80.
- fct_bookings.refund_amount_usd = $496.80, locked at cancellation_ts. Future re-querying gives the same result regardless of later policy changes.
- Calendar nights for Jul 4-6 flip back to is_available = TRUE in the next snapshot. The listing reopens in search.
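The refund-at-cancel computation reduces to a lock-at-event function — a minimal Python sketch using a simplified, hypothetical refund curve (50% of the total inside the partial-refund window, matching the $496.80 in the sample data; real policy bands are richer):

```python
from decimal import Decimal

# Hypothetical, simplified refund bands: (min_days_before_checkin, refund_pct),
# sorted by min_days descending. Real curves also treat fees separately.
REFUND_CURVES = {
    "moderate": [(7, Decimal("1.00")), (0, Decimal("0.50"))],
}

def refund_usd(policy, days_before_checkin, total_usd):
    """Apply the first band the cancellation falls into. The result is
    written to the booking row once, at cancel time, and never recomputed."""
    for min_days, pct in REFUND_CURVES[policy]:
        if days_before_checkin >= min_days:
            return (Decimal(total_usd) * pct).quantize(Decimal("0.01"))
    return Decimal("0.00")

# BKG_777: cancelled 5 days out under 'moderate' → 50% of $993.60.
assert refund_usd("moderate", 5, "993.60") == Decimal("496.80")
```

The point of the sketch is the locking discipline, not the curve: whatever this function returns at cancel time is stored on fct_bookings and survives any later policy change.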
Section 8 — Why this works
- Calendar-as-fact + snapshots = time-travel pricing queries are first-class.
- Append-only bookings with a parent_booking_id chain = audit trail for modifications.
- Refund locked at cancel time = future policy changes don't restate historical refunds.
- Bilateral reviews = both guest and host accountability surfaced.
"Point-in-time pricing questions read straight from fct_calendar_nights without reconstructing it from booking events. That's the data product the dynamic-pricing team can't ship without. snapshot_date in the PK is the cheap engineering choice that pays back every quarter."
Scenario 12 — Subscription billing + hourly usage metering (Snowflake / Datadog model)
Why this is hard — the design tension
Two revenue streams with opposite shapes. Subscription is predictable, monthly, modeled at low row volume. Usage is volatile, hourly, billions of rows/day. Mid-cycle plan changes require pro-rated math. Trials look like paid plans but bill $0. Free-tier allowances reset each period. Multi-currency demands FX locked at usage time. The model must keep usage and subscription separate at the fact layer but join cleanly at invoice time.
Section 1 — Dimensional Model Overview
Fact Tables
- fct_subscription_state — The "Daily snapshot" grain. One row per (subscription × day). Captures plan, status, MRR.
- fct_usage_meter — The "Hourly usage" grain. One row per (subscription × meter × hour). High volume.
- fct_billing_events — The "Charge" grain. One row per subscription_charge / usage_charge / proration / refund / credit / dunning.
- fct_invoices — The "Statement" grain. One row per invoice — a derived rollup of billing events.
Dimension Tables
dim_accounts · dim_plans (SCD2 — pricing tiers drift) · dim_meters (compute_credits, storage_gb_month, api_calls) · dim_promotions (SCD2)
Section 2 — Logical Data Model (key columns)
fct_usage_meter — account_id, subscription_id, meter_id, hour_bucket_ts, raw_quantity, billable_quantity (= max(0, raw − free_allowance)), unit_price_local (locked at hour), currency_code, amount_local, fx_rate_to_usd, amount_usd.
fct_billing_events — event_id, account_id, subscription_id, event_type, plan_id (SCD2 effective at event_ts), event_ts, period_start, period_end, amount_local, currency_code, amount_usd, status, source_event_id UNIQUE.
fct_subscription_state — subscription_id, day, plan_id, status ∈ {trialing, active, paused, canceled, past_due}, mrr_usd, currency, days_into_cycle, days_remaining.
Section 3 — How the model serves each stakeholder
Finance · MRR & ARR — MRR = SUM(fct_subscription_state.mrr_usd) WHERE day = today AND status = 'active'. ARR = MRR × 12.
Product · usage analytics — Hourly grain → daily/weekly rollups for adoption analysis. Same source as Finance — one truth.
Customer Success · expansion signal — Accounts approaching their plan's overage threshold = upsell trigger. Computed from fct_usage_meter trends + dim_plans.tier_limit.
Sales · NRR cohorts — Net revenue retention by quarterly signup cohort. Combines all four facts.
Section 4 — Why hourly meter + daily snapshot + event log
- Hourly meter resolution is the integration point: fine enough for usage analytics, coarse enough that the table doesn't explode. Per-query/per-call rolls up to the hour at ingest.
- Daily subscription state avoids modeling subscription as a slowly-changing dim — too much daily flux. Daily snapshot is bounded and queryable.
- Append-only billing events with UNIQUE(source_event_id) = idempotent + audit-defensible. A mid-cycle plan change emits 2 events (proration credit on the old plan, prorated charge on the new); the invoice is just a SUM of events in the period.
- FX locked per hour. Same precision as the meter; never restated.
Enhanced graphical data model — SaaS Subscription + Usage
Section 7 — SQL analysis
Sample data — Mid-cycle plan upgrade with 15 days remaining
-- Account on Pro plan ($100/mo), upgrades to Enterprise ($300/mo) on day 15 of a 30-day cycle
INSERT INTO fct_billing_events VALUES
('EVT_S1','ACCT_X','SUB_42','subscription_charge','PLAN_PRO', '2025-05-01 00:00','2025-05-01','2025-05-31',100.00,'USD',100.00,'paid','SRC_001'),
-- Day 15: upgrade. Proration credit on PRO (15 days unused = $50)
('EVT_P1','ACCT_X','SUB_42','proration', 'PLAN_PRO', '2025-05-15 14:00','2025-05-15','2025-05-31',-50.00,'USD',-50.00,'applied','SRC_002'),
-- Day 15: prorated charge for ENT (15 days at new rate = $150)
('EVT_P2','ACCT_X','SUB_42','proration', 'PLAN_ENTERPRISE','2025-05-15 14:00','2025-05-15','2025-05-31',150.00,'USD',150.00,'paid','SRC_003');
-- Hourly usage: account exceeds 100 free credits/hour 3 times during the day
INSERT INTO fct_usage_meter VALUES
('ACCT_X','SUB_42','METER_COMPUTE','2025-05-15 09:00:00', 120, 20, 0.10,'USD', 2.00,1.0, 2.00,'2025-05-15 09:05'),
('ACCT_X','SUB_42','METER_COMPUTE','2025-05-15 14:00:00', 180, 80, 0.10,'USD', 8.00,1.0, 8.00,'2025-05-15 14:05'),
('ACCT_X','SUB_42','METER_COMPUTE','2025-05-15 18:00:00', 250, 150, 0.10,'USD',15.00,1.0,15.00,'2025-05-15 18:05');
A. Finance · MRR & invoice rollup
Q1 — MRR roll-forward (active subscriptions, latest snapshot)
SELECT
COUNT(*) AS active_subs,
ROUND(SUM(mrr_usd), 2) AS mrr_usd,
ROUND(SUM(mrr_usd) * 12, 2) AS arr_usd
FROM fct_subscription_state
WHERE day = CURRENT_DATE
AND status = 'active';
Q2 — Invoice for SUB_42 in May 2025 (sums all event types)
SELECT
subscription_id,
SUM(amount_usd) FILTER (WHERE event_type = 'subscription_charge') AS subscription_charges,
SUM(amount_usd) FILTER (WHERE event_type = 'usage_charge') AS usage_charges,
SUM(amount_usd) FILTER (WHERE event_type = 'proration') AS proration_net,
SUM(amount_usd) FILTER (WHERE event_type = 'credit') AS credits,
SUM(amount_usd) AS total_due_usd
FROM fct_billing_events
WHERE subscription_id = 'SUB_42'
AND period_start = '2025-05-01'
GROUP BY subscription_id;
B. Customer Success · upsell signal
Q3 — Accounts approaching their plan's overage threshold (upsell candidates)
WITH usage_30d AS (
SELECT subscription_id, meter_id, SUM(billable_quantity) AS used
FROM fct_usage_meter
WHERE hour_bucket_ts >= CURRENT_DATE - 30
GROUP BY subscription_id, meter_id
),
plan_caps AS (
SELECT s.subscription_id, m.meter_id, p.included_quantity
FROM fct_subscription_state s
JOIN dim_plans p USING (plan_id)
CROSS JOIN dim_meters m -- assumes included_quantity applies uniformly per meter; use a plan-meter bridge if limits differ by meter
WHERE s.day = CURRENT_DATE AND s.status = 'active'
)
SELECT u.subscription_id, u.meter_id, u.used, c.included_quantity,
ROUND(100.0 * u.used / NULLIF(c.included_quantity, 0), 1) AS pct_of_cap
FROM usage_30d u
JOIN plan_caps c USING (subscription_id, meter_id)
WHERE u.used > c.included_quantity * 0.85
ORDER BY pct_of_cap DESC;
-- >85% of cap = upsell trigger; >100% = overage charges hit next invoice.
C. Sales · NRR by cohort
Q4 — Net Revenue Retention by signup cohort
WITH cohort_baseline AS (
SELECT DISTINCT
subscription_id,
DATE_TRUNC('quarter', FIRST_VALUE(day) OVER w) AS cohort_q,
FIRST_VALUE(mrr_usd) OVER w AS starting_mrr
FROM fct_subscription_state
WINDOW w AS (PARTITION BY subscription_id ORDER BY day)
),
current_state AS (
SELECT subscription_id, mrr_usd AS current_mrr
FROM fct_subscription_state WHERE day = CURRENT_DATE
)
SELECT cb.cohort_q,
COUNT(DISTINCT cb.subscription_id) AS cohort_size,
ROUND(SUM(cb.starting_mrr), 0) AS starting_mrr_total,
ROUND(SUM(COALESCE(cs.current_mrr, 0)), 0) AS current_mrr_total,
ROUND(100.0 * SUM(COALESCE(cs.current_mrr, 0)) / NULLIF(SUM(cb.starting_mrr), 0), 1) AS nrr_pct
FROM cohort_baseline cb
LEFT JOIN current_state cs USING (subscription_id)
GROUP BY cb.cohort_q
ORDER BY cb.cohort_q;
-- NRR > 100% = expansion outpaces churn (the SaaS holy grail).
Worked example — SUB_42 mid-cycle upgrade + usage overage
- Day 1 (May 1): charged $100 for Pro plan (full month).
- Day 15 (May 15): upgrade to Enterprise. Two proration events:
- Credit −$50 on Pro (15 unused days × $100/30)
- Charge +$150 for Enterprise (15 days × $300/30)
- Net delta: +$100
- Day 15 (later): usage spikes. Plan includes 100 compute_credits/hr; account uses 250 in one hour → 150 billable at $0.10 = $15 overage.
- End-of-month invoice for May: $100 (Pro full) − $50 (credit) + $150 (Ent prorated) + $25 (3-hr usage overage) = $225.
- MRR effect from day 15 onward: jumps from $100 to $300/mo — visible in the fct_subscription_state.mrr_usd daily snapshot.
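The proration arithmetic in the worked example is small enough to sketch directly — a minimal Python illustration (the function name and the 30-day cycle are assumptions for the sketch, not a billing-engine API):

```python
from decimal import Decimal

def proration_events(old_monthly, new_monthly, days_remaining, cycle_days=30):
    """Mid-cycle plan change = TWO billing events: a credit for the unused
    portion of the old plan and a prorated charge on the new plan. The
    invoice then stays a SUM over events — never a recompute."""
    frac = Decimal(days_remaining) / Decimal(cycle_days)
    credit = -(Decimal(old_monthly) * frac).quantize(Decimal("0.01"))
    charge = (Decimal(new_monthly) * frac).quantize(Decimal("0.01"))
    return credit, charge

# SUB_42: Pro $100/mo → Enterprise $300/mo with 15 of 30 days remaining.
credit, charge = proration_events("100", "300", 15)
assert credit == Decimal("-50.00") and charge == Decimal("150.00")
assert credit + charge == Decimal("100.00")  # net mid-cycle delta: +$100
```

Emitting two rows instead of mutating one is what keeps fct_billing_events append-only and the invoice a clean rollup.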
Section 8 — Why this works
- Hourly usage + daily subscription = right resolution at each grain. Usage is high-volume; subscription is bounded.
- Proration is two events, not one. Both credit and charge land in
fct_billing_events; invoice rollup picks them up cleanly. - Idempotent at
source_event_id. Producer retries don't double-charge. - FX locked per hour. Same precision as the meter; no FX restatement.
fct_usage_meter is the integration point — finance rolls it up to invoices, product teams use it for usage analytics, and customer success uses it for adoption signals. Three consumers, one source of truth. Mid-cycle plan changes emit two billing events instead of one updated row — that's what makes the invoice a SUM, not a recompute."
The patterns that show up regardless of domain.
Eleven patterns. §9 (LAD), §10 (factless facts), §11 (streaming-first) are the senior-level moves — drop them in interviews and the room shifts.
1. SCD Types — the deep dive
| Type | Mechanism | When to use | Domain example |
|---|---|---|---|
| Type 1 | Overwrite. No history. | Current state only matters; cheap fix-ups. | User's display name correction. |
| Type 2 | New row per change with effective_from / effective_to. | Historical accuracy required — financial, regulatory, ML. | Driver rating tier (Uber), creative version (Meta), plan tier (SaaS). |
| Type 3 | Add previous_X column. Rarely used. | Only the most recent change matters. | Job title (current + previous). |
| Type 4 | Separate history table. | Current dim is small + hot; history rarely queried. | Geography reorganizations. |
| Type 6 | Hybrid — current_X column + SCD2 history. | 90% of queries want current, 10% want history. | Customer's current address vs every address ever. |
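A Type 2 change is mechanical enough to sketch: close the open version, append a new one. A minimal Python illustration (a list of dicts stands in for the dim table; `scd2_apply` is a hypothetical helper, not a warehouse API):

```python
from datetime import date

def scd2_apply(history, key, new_attrs, effective_from):
    """SCD Type 2: close the currently open version (set effective_to)
    and append a new row. Closed rows are never otherwise touched."""
    for row in history:
        if row["key"] == key and row["effective_to"] is None:
            row["effective_to"] = effective_from
    history.append({"key": key, **new_attrs,
                    "effective_from": effective_from, "effective_to": None})

# A plan-tier price change (the SaaS example from the table above).
plans = [{"key": "PLAN_PRO", "price": 100,
          "effective_from": date(2024, 1, 1), "effective_to": None}]
scd2_apply(plans, "PLAN_PRO", {"price": 120}, date(2025, 5, 1))
assert len(plans) == 2 and plans[0]["effective_to"] == date(2025, 5, 1)
assert plans[1]["price"] == 120 and plans[1]["effective_to"] is None
```

Type 6 is the same mechanism plus a current_X column maintained on every row — the 90%-of-queries-want-current optimization.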
2. Late-arriving data — the four cases
| Case | Pattern |
|---|---|
| Late-arriving fact | Event timestamped yesterday, lands today. Watermark on stream side; daily backfill window (typically 24–48h) on batch. |
| Late-arriving dimension | Fact references a dim row not yet loaded. Use placeholder SK -1, mark with needs_dim_lookup = TRUE, backfill nightly. |
| Slowly-changing fact | Financial restatement (refund of last month's invoice). Append correction row; never UPDATE. Net = SUM. |
| Out-of-order events | Stream-time vs event-time. Watermark = max_seen_ts − lag. Sort within partition. |
3. GDPR / privacy — the universal pattern
- Tokenize at ingest. Replace user_id with user_id_token; keep the mapping in a separate, encrypted vault.
- On delete request: invalidate the token mapping. Events become un-rejoinable to the person.
- Aggregates survive — DAU, revenue, etc. stay accurate.
- Time-bounded: 30-day SLA; track in fact_gdpr_request.
- Tax / legal carve-outs: finance facts are retained 7 years even after a request — anonymize (replace the name) but keep the transaction.
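The tokenize-then-invalidate flow fits in a few lines — a hedged Python sketch (`TokenVault` is a hypothetical in-memory stand-in for the separate, encrypted vault):

```python
import secrets

class TokenVault:
    """Holds the user_id ↔ token mapping apart from the event store.
    Deleting the mapping makes events un-rejoinable to the person,
    while aggregates over the events stay accurate."""
    def __init__(self):
        self._forward, self._reverse = {}, {}

    def tokenize(self, user_id):
        if user_id not in self._forward:
            tok = secrets.token_hex(8)
            self._forward[user_id], self._reverse[tok] = tok, user_id
        return self._forward[user_id]

    def erase(self, user_id):
        """GDPR delete request: invalidate the mapping, not the events."""
        tok = self._forward.pop(user_id, None)
        self._reverse.pop(tok, None)

    def resolve(self, token):
        return self._reverse.get(token)

vault = TokenVault()
events = [{"user": vault.tokenize("u123"), "amount": 10}]
vault.erase("u123")
assert vault.resolve(events[0]["user"]) is None   # un-rejoinable
assert sum(e["amount"] for e in events) == 10     # aggregates survive
```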
4. Multi-currency — the lock-at-event-time rule
| Step | Rule |
|---|---|
| Store both | Always store amount_local + currency_code AND amount_usd + fx_rate. Never just one. |
| Lock the rate | Lock fx_rate_to_usd at event time, not query time. Same charge can't yield different USD amounts on different days. |
| FX dim is SCD2 | dim_fx_rate(currency_pair, rate, effective_from, effective_to) — daily rate at minimum. |
| Reporting currency | Some companies report in a third currency (Stripe USD, Adyen EUR). Carry that as a third pair. |
5. Time zones — the three rules
- Storage in UTC, always. Use TIMESTAMP_TZ when supported.
- Dim for the reporting TZ. "Daily" reports are in the reporting entity's TZ (driver's local TZ for Uber payouts; advertiser's TZ for Google ads pacing).
- DST-safe partitioning. Partition by DATE(event_ts AT TIME ZONE 'UTC'), always. DST shifts cause off-by-one errors if you partition by local TZ.
6. Hot partitions / skew — the four counter-moves
| Counter-move | When |
|---|---|
| Sub-partition by hash | One title/cell/merchant dominates: HASH(key) MOD 16 as a sub-partition. |
| Salted aggregation | Two-stage GROUP BY: SUM by (key || salt), then SUM by key. Spark/Trino. |
| Separate hot table | Top-N keys go to a "hot" partition with looser retention; rest to "cold". |
| Pre-aggregate at edge | Sketch the hot keys upstream (HLL, Count-Min) so the warehouse never sees raw skew. |
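The salted two-stage GROUP BY is easiest to see in miniature — a Python sketch of the Spark/Trino pattern (16 salts assumed, as in the table above):

```python
import random
from collections import defaultdict

def salted_sum(rows, salts=16):
    """Two-stage aggregation: stage 1 groups by (key, salt) so a hot key
    spreads across `salts` partial buckets (in a real engine, across
    workers); stage 2 folds the partials back down to one row per key."""
    stage1 = defaultdict(int)
    for key, value in rows:
        stage1[(key, random.randrange(salts))] += value   # stage 1: salted
    stage2 = defaultdict(int)
    for (key, _salt), partial in stage1.items():
        stage2[key] += partial                            # stage 2: fold
    return dict(stage2)

# One hot key dominating 10k rows still sums correctly after the fold.
rows = [("hot_title", 1)] * 10_000 + [("cold_title", 5)]
assert salted_sum(rows) == {"hot_title": 10_000, "cold_title": 5}
```

The salt changes where partial sums are computed, never the final answer — which is why it only works for decomposable aggregates (SUM, COUNT, MIN, MAX), not medians.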
7. Idempotency — the universal contract
Every fact table needs one of:
- Natural-key MERGE: MERGE INTO target USING source ON natural_key WHEN MATCHED AND source.ingest_ts > target.ingest_ts THEN UPDATE.
- Append-only with dedup at read: ROW_NUMBER() OVER (PARTITION BY natural_key ORDER BY ingest_ts DESC) = 1.
- Unique constraint on source_event_id: the producer's idempotency key, enforced.
8. Conformed dimensions — the discipline
One dim_date, one dim_geography, one dim_user — owned by the platform team, not duplicated per consumer. Otherwise:
- Joins lie (different definitions of "user").
- Cross-fact analysis (revenue × engagement) is impossible.
- SCD2 versioning drifts.
Drop the term in interviews: "I'd want this to share dim_user with the existing customer-data platform — otherwise we're rebuilding the join every team."
9. Late-Arriving Dimensions (LAD) — the deep dive
Almost every cross-cutting concern collapses into one of four scenarios. LAD is the one that bites hardest in production because it silently drops rows on inner joins — Content Strategy stares at "0 views in the first 3 minutes" of a Netflix Original launch because the dim hadn't propagated yet.
When LAD happens
| Trigger | Domain example | Time scale |
|---|---|---|
| Global launch — fact stream is global, dim sync is regional | Netflix Originals at midnight Pacific; Stranger Things S5 | seconds–minutes |
| CDC connector lag — Debezium → Kafka → warehouse | Amazon catalog updates during Prime Day | minutes–hours |
| Manual dim load — someone must approve the new SKU | App Store new-app review | hours–days |
| Cross-region replication | Stripe payment dim from US to EU | seconds |
The two strategies
| Strategy | Where the fix lives | Pros | Cons |
|---|---|---|---|
| At-ingest (inferred member) | ETL inserts placeholder dim row when a fact arrives without a dim | Downstream queries are clean INNER JOINs; no every-query repetition | Fact ingest now coupled to dim; needs is_inferred = TRUE flag for later overwrite |
| At-query (defensive LEFT JOIN) | Every analyst query uses LEFT JOIN + COALESCE | Loose coupling; ingest stays simple | Repeated boilerplate; analysts forget the COALESCE and lose data silently |
Lakehouses (Snowflake, BigQuery, Databricks) typically prefer at-ingest with an is_inferred flag. Real-time event pipelines (Kafka → Pinot) prefer at-query because the placeholder write would cause race conditions.
The defensive query template
WITH current_dim AS (
SELECT dim_key, attr_a, attr_b,
ROW_NUMBER() OVER (PARTITION BY dim_key ORDER BY effective_date DESC) AS rn
FROM dim_table
)
SELECT f.fact_id, f.dim_key,
COALESCE(d.attr_a, f.dim_key || ' (Pending Sync)') AS attr_a,
COALESCE(d.attr_b, 'Unknown') AS attr_b
FROM fact_table f
LEFT JOIN current_dim d ON d.dim_key = f.dim_key AND d.rn = 1;
Three things working together:
- LEFT JOIN keeps every fact row, even when the dim is missing.
- ROW_NUMBER + rn = 1 picks the most-recent SCD2 version of the dim (re-titled, re-genre'd entries handled).
- COALESCE with a searchable placeholder ("Pending Sync") instead of NULL so the BI tool's filter dropdown surfaces the gap.
The orphan-rate SLO
Production target: pct_orphan_facts < 0.5%. Above 1% pages the data-platform on-call. The alert query (using current_dim from the defensive template above):
SELECT 100.0 * SUM(CASE WHEN d.dim_key IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_orphan
FROM fact_table f LEFT JOIN current_dim d ON d.dim_key = f.dim_key AND d.rn = 1;
The dual problem — late-arriving FACT
The mirror image: dim is current, fact arrives days late from offline-mode mobile. Now you must attribute against the dim version that was current at fact time, not now. SCD2 + temporal join:
JOIN dim_table d
ON d.dim_key = f.dim_key
AND f.event_ts BETWEEN d.effective_from AND COALESCE(d.effective_to, '9999-12-31')
The "as-of-event" attribution. Same family of patterns; opposite direction.
Senior signal in the room
Drop the phrase: "the fact is the truth; the dimension is just an attribute that can be late." That single line tells the interviewer you've shipped LAD-aware pipelines.
10. Factless fact tables — the modeling pattern that gets missed
Factless facts have no measure column — they record that an event happened or that a relationship existed. Two flavors:
| Flavor | Records | Domain example |
|---|---|---|
| Event-tracking | Things that happened with no dollars/qty | Page views, login attempts, push notification deliveries, video impressions, search queries |
| Coverage | Eligibility / availability that ENABLED an event | Promotions ran in store-month, students enrolled in course, products featured on home page, ads scheduled to a slot |
Why they exist — the canonical examples
| Question | Need a factless fact |
|---|---|
| "Which products were featured on the home page but didn't sell?" | Featuring is the absence of a sales fact — needs a coverage table to spot the negative space. |
| "Which students were registered for a class but never attended?" | Registration is one event, attendance is another. A registration-with-no-attendance row only exists if you store both. |
| "Which ads were scheduled but didn't get an impression?" | Same shape — coverage of scheduled minus measure of fired. |
The schema
CREATE TABLE fact_class_registration (
student_sk INTEGER NOT NULL, -- FK to dim_student
course_sk INTEGER NOT NULL, -- FK to dim_course
semester_sk INTEGER NOT NULL, -- FK to dim_semester (date-like)
registration_sk INTEGER NOT NULL, -- degenerate dim (the registration event id)
-- NO measure columns
PRIMARY KEY (student_sk, course_sk, semester_sk)
);
Pure foreign keys. The existence of the row is the measurement.
The COUNT(*) pattern
Aggregations on a factless fact are always COUNT(*):
-- How many students registered per course?
SELECT c.course_name, COUNT(*) AS registrations
FROM fact_class_registration r JOIN dim_course c USING (course_sk)
GROUP BY c.course_name;
-- Coverage gap: registered but no attendance?
SELECT r.student_sk, r.course_sk
FROM fact_class_registration r
LEFT JOIN fact_class_attendance a USING (student_sk, course_sk, semester_sk)
WHERE a.student_sk IS NULL;
The trap — when to use vs when to add a measure
| Hint | Decision |
|---|---|
| "Did this happen?" | Factless |
| "How much / how many of X?" | Add a measure |
| "Was X eligible / scheduled / present?" | Factless coverage |
| "What's the dollar / quantity / score?" | Add a measure |
Senior signal in the room
Drop the phrase: "this is a factless fact — the existence of the row is the measurement." Junior engineers add a junk count = 1 column. Seniors don't.
11. Streaming-first modeling — Kappa, watermarks, exactly-once
The lambda/kappa decision shifts your modeling. Once you commit to streaming-first, your fact tables become append-only event logs that the warehouse incrementally materializes.
Lambda vs Kappa — when to pick which
| Architecture | Description | Pick when |
|---|---|---|
| Lambda | Two paths — batch (correct, slow) + speed (fresh, approximate). Reconcile downstream. | Legacy migration, very high accuracy needs, regulator requires audit trail of corrections. |
| Kappa | Stream is the single source of truth. Replay from log to rebuild any state. No batch path. | Greenfield, modern stack, you can replay, latency > correctness for most consumers. |
Most 2024+ designs lean Kappa with a "snapshot every N hours" view for analyst convenience.
Watermarks — the late-event contract
A watermark is the timestamp the engine guarantees no event with an earlier event-time will arrive. It's a promise: "I've waited 90 seconds past max_seen_event_time; anything older is now late."
| Window state | Action |
|---|---|
| watermark < window_end | Window still open; keep accumulating |
| watermark ≥ window_end | Emit window result; close it |
| Event arrives after window closed | Late event; either drop, side-output, or update via "allowed lateness" |
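The window-state table above can be sketched as a toy accumulator — a hedged Python illustration of watermark-driven emission (the 90 s lag and 1-hour tumbling window are arbitrary; real engines like Flink manage this per key and in parallel):

```python
def advance_watermark(events, lag_s=90, window_s=3600):
    """Tumbling-window counter with watermark = max_seen_event_time − lag.
    A window emits once the watermark passes window_end; events for an
    already-closed window are side-outputted as late instead of mutating
    the emitted result."""
    open_windows, emitted, late = {}, {}, []
    max_seen = 0
    for ts in events:                        # event timestamps, in seconds
        max_seen = max(max_seen, ts)
        watermark = max_seen - lag_s
        win = ts - ts % window_s             # tumbling-window start
        if win in emitted:
            late.append(ts)                  # arrived after close → late
        else:
            open_windows[win] = open_windows.get(win, 0) + 1
        for w in [w for w in open_windows if watermark >= w + window_s]:
            emitted[w] = open_windows.pop(w)  # watermark ≥ window_end → emit
    return emitted, open_windows, late

# Hour-0 window closes once t=3700 pushes the watermark past 3600;
# the straggler at t=30 then lands in the late side-output.
emitted, still_open, late = advance_watermark([10, 20, 3700, 30])
assert emitted == {0: 2} and late == [30]
```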
Event-time vs ingest-time vs processing-time
| Time | What it measures | Pick when |
|---|---|---|
| Event-time | When the event actually happened (client clock or server-side at originating event) | Almost always — it's the "truth" |
| Ingest-time | When the event entered the streaming engine | You can't trust event-time (no client clock); audit trails |
| Processing-time | When the engine processed the event | Wallclock-driven side effects; never for analytics |
Modeling rule: store all three on every row. Partition / window by event-time; alert / debug by processing-time.
Exactly-once — the contract chain
"Exactly-once" isn't a single feature — it's a chain that breaks if any link fails. The four required pieces:
- Idempotent producer — Kafka transactional producer + producer_id + sequence number
- Idempotent consumer — store offset + transactional output (often via 2PC or Kafka transactions)
- Idempotent sink — natural-key MERGE; or upsert by source_event_id with unique constraint
- Replay-safe state — Flink savepoints; Kafka Streams state stores backed by Kafka topics
Senior framing: "exactly-once is end-to-end; if your sink doesn't have a unique key, the upstream guarantees don't matter."
Stream-table duality
Every streaming concept maps to a SQL concept:
| Streaming | SQL equivalent |
|---|---|
| Stream (append-only) | Insert-only fact table |
| Table (latest value per key) | SCD1 dim table |
| Stream → Table (KTable) | Materialized view with ROW_NUMBER() OVER ... = 1 |
| Tumbling window | GROUP BY DATE_TRUNC('hour', event_ts) |
| Hopping window | Multiple overlapping GROUP BY with generate_series spine |
| Session window | Gaps-and-islands via LAG + cumulative sum |
| Watermark | "Don't query rows newer than NOW() - 90s" |
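The session-window row (gaps-and-islands via LAG + cumulative sum) translates directly. A sketch assuming a 30-minute inactivity gap; the function name is hypothetical:

```python
def sessionize(event_times: list[float], gap: float = 1800.0) -> list[int]:
    """Assign a session id to each event: a new session ("island") starts
    whenever the gap to the previous event exceeds `gap` — the same logic
    as LAG + cumulative sum in SQL."""
    session_ids, session = [], 0
    prev = None
    for ts in sorted(event_times):
        if prev is not None and ts - prev > gap:
            session += 1          # gap exceeded -> new island
        session_ids.append(session)
        prev = ts
    return session_ids
```

Events at t = 0, 100, 5000 seconds produce sessions [0, 0, 1]: the 4900-second gap breaks the island, exactly as the SQL version would.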
Drop this in interviews: "every Flink job is just a SQL query that hasn't been run yet."
CDC patterns — the bridge
Most companies don't go pure-streaming overnight. They hybridize via change-data-capture:
- Debezium reads the OLTP database write-ahead log
- Emits a Kafka topic with one record per row-change (insert/update/delete)
- Sink connector materializes the stream into the warehouse with eventual consistency
Modeling consequence: every dim becomes SCD2 by design (every change is an event), and the warehouse can replay history at any point.
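That consequence — every change becomes a versioned row — can be sketched as a tiny SCD2 materializer over Debezium-style change records. The record shape (`op`, `key`, `after`, `ts`) loosely mirrors Debezium's envelope but is simplified here; treat the field names as assumptions:

```python
def apply_change(history: list[dict], change: dict) -> None:
    """Materialize one CDC record as an SCD2 version: close the current
    row for the key, then open a new one (unless it's a delete)."""
    key, ts = change["key"], change["ts"]
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = ts          # close the current version
    if change["op"] != "d":               # insert/update opens a new version
        history.append({"key": key, **change["after"],
                        "valid_from": ts, "valid_to": None})
```

Replaying the change log through this function rebuilds full history at any point — the warehouse-side payoff the bullet list above describes.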
The skew + checkpoint storm
Streaming jobs fail in two characteristic ways:
- Hot key skew — one user/title/merchant dominates one task slot. Fix: salt the key (`HASH(key) MOD 16`) for the heavy aggregation, then re-aggregate.
- Checkpoint storm — checkpoint duration grows past the checkpoint interval; backpressure cascades. Fix: incremental checkpoints (RocksDB), unaligned checkpoints, smaller state.
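The salting fix is a two-stage aggregation: count on salted keys so one hot key spreads across 16 slots, then strip the salt and re-aggregate. A minimal sketch using a round-robin salt (the function name and salt scheme are illustrative):

```python
from collections import Counter
from itertools import count

def two_stage_count(events: list[str], n_salts: int = 16) -> Counter:
    """Stage 1: partial counts on (key, salt) pairs, spreading any hot key
    across n_salts slots. Stage 2: drop the salt, re-aggregate per key."""
    rr = count()
    stage1 = Counter((key, next(rr) % n_salts) for key in events)
    stage2 = Counter()
    for (key, _salt), partial in stage1.items():
        stage2[key] += partial      # re-aggregate to the true key
    return stage2
```

The totals match a direct count, but no single slot ever sees more than ~1/16th of the hot key's traffic — which is what un-sticks the overloaded task.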
Senior signal in the room
Drop two phrases: "watermark + allowed lateness handles late-arriving events deterministically" and "exactly-once is end-to-end — sink idempotency is non-negotiable." The interviewer immediately stops asking streaming questions and moves to the hard parts.
The eight traps + the script that wins the round.
The eight modeling traps — and the senior counter-move
| Trap | Counter-move |
|---|---|
| Writing columns before stating grain | Say grain first, every time. |
| Using natural keys as PKs | Surrogate as PK, natural as a column. Mandatory for SCD2. |
| UPDATE on facts | Append corrections; let MV pick latest. |
| One status column on a dim ("current_stage") | Factless fact for transitions + daily snapshot for current. |
| Hierarchies via recursive joins for hot queries | Denormalize parent IDs onto leaf, drift-check nightly. |
| GDPR via cascade DELETEs | Tokenize at ingest; invalidate token on delete. |
| "We'll add SCD later" | Adding SCD2 to a populated dim is a 6-month migration. Decide upfront. |
| Per-team conformed dim copies | One conformed dim, owned by platform. Otherwise joins lie. |
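The "append corrections; let MV pick latest" counter-move in practice: the fact stays append-only, and reads emulate `ROW_NUMBER() OVER (PARTITION BY key ORDER BY corrected_at DESC) = 1`. A sketch with illustrative column names:

```python
def latest_per_key(fact_rows: list[dict]) -> dict:
    """Pick the most recent correction per order_id from an append-only
    fact — the materialized-view half of the counter-move."""
    latest = {}
    for row in fact_rows:
        k = row["order_id"]
        if k not in latest or row["corrected_at"] > latest[k]["corrected_at"]:
            latest[k] = row
    return latest
```

The original row is never touched; a later correction row simply wins at read time, so history stays auditable.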
The 90-second articulation script — works on every domain
The three sentences that signal seniority — in any round
- "The grain of this fact is one row per X per Y per Z." — names the unit of analysis up front.
- "The trade-off here is X vs Y; I'm picking X because [downstream consumer / cost / regulatory]." — names the choice and owns it.
- "This breaks when [late data / GDPR / FX / skew / time-zone shift]." — volunteers a failure mode.
Drill those three sentence-shapes until they're reflex. The interviewer cannot give you a senior score if these never come out of your mouth.
Twelve scenarios across the industries that ask · the framework holds · go well.