PaddySpeaks · Systems at the Whiteboard · No. 01

The Dispatch Problem

Design the backend for a ride-sharing service. Forty-five minutes, one whiteboard, three subsystems, and a single decision that separates a senior answer from a memorized one. A complete walk-through: data flow, schema, SQL, streaming Python, the aggregation layer, and the dashboard that proves it works.

§ 01 — THE QUESTION
One sentence, three systems

Every senior data and platform engineer eventually meets some version of this question. It arrives in one casual sentence and contains three entire distributed systems.

Interview Prompt

"Design the core backend for a ride-sharing service — focus on driver matching, real-time location tracking, and surge pricing. How would you scope this out?"

LEVEL · SENIOR / STAFF | DURATION · 45 MIN | FORMAT · WHITEBOARD

The question is popular with interviewers for a reason: it cannot be answered with one architecture. The three subsystems it names run at wildly different tempos and want contradictory things. Location tracking is a firehose of data that is worthless four seconds after it arrives. Matching is a low-volume flow where a single consistency mistake — dispatching one driver to two riders — is the cardinal sin of the entire product. Surge pricing is a slow aggregation that must nevertheless be deterministic at the moment of quoting, because the price a rider sees must be the price a rider pays.

A weak answer treats these as one system and reaches for one database. A strong answer notices the tempos first. So before any boxes and arrows, the working frame for the whole session:

THE FAST LOOP
Location tracking. Millions of phones reporting position every few seconds. Dominates write volume by two orders of magnitude. Needs freshness, not durability — a dropped ping self-heals when the next one lands.
THE MEDIUM LOOP
Matching & dispatch. Rider requests, candidate search, ranked offers, accept/decline/timeout, exactly one driver per ride. Low volume, latency-sensitive, and the only place in the system that demands strong consistency.
THE SLOW LOOP
Surge pricing. Continuous supply/demand aggregation per geographic cell, recomputed on a sliding window, smoothed to prevent oscillation, and frozen into a quote the instant a rider sees a price.
Availability over consistency everywhere — except dispatch. A slightly suboptimal driver beats no driver. Two riders in one back seat beats nothing at all, except the company.

Scoping out loud

Scope is the first scored dimension, and most candidates skip it. State what you are building, what you are deliberately ignoring, and the numbers that will shape every later decision. Out of scope here, said explicitly: payments processing, ratings and fraud, the routing/ETA engine itself (treated as a callable service), the maps stack, and pooled rides — with the caveat that the matching design should not preclude pooling.

Then the envelope math, volunteered rather than extracted. Uber-shaped numbers:

| Quantity | Estimate | Consequence |
| --- | --- | --- |
| Peak concurrent drivers | 5,000,000 | Sets the size of the hot index |
| Ping interval | 4 s | The freshness budget everything is measured against |
| Location writes | ≈ 1.25 M / s | The number that shapes the whole architecture |
| Trips per day | 25,000,000 | ≈ 300 matches/s average, ~3 K at peak; tiny by comparison |
| Hot index footprint | ~100 B × 5 M ≈ 500 MB | Fits in one box of RAM; sharded for throughput, not size |
| Surge recompute | ~1 M cells / 30 s ≈ 33 K updates/s | A modest streaming job |
| Raw location history | tens of TB / day | Sample and compress to cold storage; keep only latest hot |

Notice the asymmetry: location ingestion dwarfs everything else in the system, at roughly 400× peak matching and more than 30× the surge recompute. That single row of the table dictates the partition strategy, the storage medium, and the failure philosophy. The rest of this article follows the data.
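
The envelope arithmetic is worth being able to reproduce on the spot. A minimal sketch that recomputes the table's derived rows from its inputs; the constant names are illustrative, and the 100-byte entry size is the table's own rough assumption:

```python
# Back-of-envelope figures from the scoping table. All inputs are estimates.
PEAK_DRIVERS = 5_000_000
PING_INTERVAL_S = 4
TRIPS_PER_DAY = 25_000_000
PEAK_MATCHES_PER_S = 3_000        # the table's "~3 K at peak"
BYTES_PER_DRIVER = 100            # assumed size of one hot-index entry
SURGE_CELLS = 1_000_000
SURGE_PERIOD_S = 30

location_writes_per_s = PEAK_DRIVERS / PING_INTERVAL_S
avg_matches_per_s = TRIPS_PER_DAY / 86_400
hot_index_mb = PEAK_DRIVERS * BYTES_PER_DRIVER / 1e6
surge_updates_per_s = SURGE_CELLS / SURGE_PERIOD_S

print(f"location writes/s    : {location_writes_per_s:>12,.0f}")
print(f"avg matches/s        : {avg_matches_per_s:>12,.0f}")
print(f"hot index (MB)       : {hot_index_mb:>12,.0f}")
print(f"surge updates/s      : {surge_updates_per_s:>12,.0f}")
print(f"writes vs peak match : {location_writes_per_s / PEAK_MATCHES_PER_S:>12,.0f}x")
print(f"writes vs surge      : {location_writes_per_s / surge_updates_per_s:>12,.1f}x")
```

Changing a single input, for example an 8 s ping interval, halves the number that drives the whole design, which is why the interval is worth stating out loud.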


§ 02 — DATA FLOW
Following a ping through the building

One architecture, three tempos. The spine of the design is a partitioned change stream doing triple duty: it feeds the geospatial index, the surge aggregator, and the analytics lake — from a single ordered log.

[Figure: end-to-end architecture diagram. Components as labeled:]
FAST LOOP · ~1.25M WRITES/S
DRIVER APPS: ping every 4s · gRPC
GATEWAY TIER: batch 100ms · coalesce by driver_id · never block
LOCATION STREAM: key (city, hash(driver_id) % N) · over-partitioned · ordered per driver
H3 GEO INDEX · IN-MEM: cell → {drivers} · TTL 30s · rebuildable cache, owns nothing
SURGE AGGREGATOR: sliding window · per-cell ratio
COLD STORE (SAMPLED): analytics · disputes · Parquet
MEDIUM LOOP · ~3K MATCHES/S PEAK
RIDER APP: request + quote_id
DISPATCH SERVICE: candidates → rank → offer saga · 12s offer expiry
DRIVER STATE STORE: CAS available → offered
TRIP SERVICE: state machine · pinned by trip_id
PRICING / QUOTES: quote locked 3 min · surge read
NOTIFICATIONS: socket + push · ack’d offers
EDGE LABELS: k-ring candidate query · <10ms; multiplier
LEGEND: SOLID = sustained data flow · DASHED = request-time reads
Geography proposes, the CAS disposes.
FIG. 1 — End-to-end flow. The stream is the spine; the geo index is a cache; truth lives in two small keyed stores.

Three properties of this picture do most of the interview’s work. First, the firehose never touches a durable store on the hot path — pings flow connection → gateway → stream → RAM, and only a sampled tributary reaches disk. Second, authority is deliberately tiny: the driver-state store and the trip service are small, keyed, boring databases, while everything large is a rebuildable cache. Third, the candidate query is dashed for a reason — the geo index merely proposes drivers; an atomic compare-and-swap on the driver record decides. Double-dispatch is prevented by that one line, not by geography.

The Failure Philosophy, In One Rule

Data plane degrades by dropping; control plane degrades by delaying. A lost ping costs nothing — the next one heals it in four seconds, so under backpressure the gateway buffers briefly, coalesces by driver (last value wins), drops oldest, and never blocks. A lost state transition — an acceptance, a cancellation — is never dropped: separate channel, durable local queue, retry with backoff.


§ 03 — DATA MODEL
Two databases, one cache, and a log

The schema falls out of the authority question. Durable, transactional truth: trips, driver state, quotes. Ephemeral, rebuildable speed: the H3 index. Append-only history: the event log and the sampled ping archive.

The durable core (Postgres-class)

Trips are small records at trivial volume — 25 M rows a day is gigabytes — so a conventional relational store is the right home. The interesting choices are the version columns (optimistic concurrency for every state transition) and the append-only trip_events table, which gives the lifecycle an audit trail and feeds analytics without touching the hot row.

DDL · DURABLE CORE
-- Truth #1: who is this driver and what are they doing right now.
-- Hash-sharded by driver_id. NOT geo-sharded: authority must not move
-- when the driver does.
CREATE TABLE driver_state (
    driver_id        BIGINT PRIMARY KEY,
    status           TEXT NOT NULL
                     CHECK (status IN ('offline','available','offered','on_trip')),
    active_trip_id   BIGINT,
    offer_expires_at TIMESTAMPTZ,            -- server-side expiry, never phone-side
    version          BIGINT NOT NULL DEFAULT 0,
    updated_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Truth #2: the trip itself. Pinned to its shard by trip_id for life.
CREATE TABLE trips (
    trip_id      BIGINT PRIMARY KEY,         -- snowflake: time-ordered, shard-aware
    rider_id     BIGINT NOT NULL,
    driver_id    BIGINT,
    quote_id     UUID NOT NULL,
    state        TEXT NOT NULL DEFAULT 'requested'
                 CHECK (state IN ('requested','matched','driver_en_route',
                                  'arrived','in_trip','completed','cancelled')),
    pickup_cell  BIGINT NOT NULL,            -- H3 res 8
    pickup_lat   DOUBLE PRECISION NOT NULL,
    pickup_lng   DOUBLE PRECISION NOT NULL,
    drop_lat     DOUBLE PRECISION,
    drop_lng     DOUBLE PRECISION,
    surge_mult   NUMERIC(4,2) NOT NULL DEFAULT 1.00,
    quoted_fare  NUMERIC(8,2) NOT NULL,      -- the locked promise
    version      BIGINT NOT NULL DEFAULT 0,
    requested_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    completed_at TIMESTAMPTZ
);
CREATE INDEX idx_trips_rider  ON trips (rider_id, requested_at DESC);
CREATE INDEX idx_trips_driver ON trips (driver_id, requested_at DESC);

-- Append-only lifecycle log. Every transition lands here exactly once;
-- payments, notifications and analytics are downstream consumers.
CREATE TABLE trip_events (
    event_id    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    trip_id     BIGINT NOT NULL,
    from_state  TEXT NOT NULL,
    to_state    TEXT NOT NULL,
    actor       TEXT NOT NULL,               -- rider | driver | system
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    detail      JSONB
);

-- The quote: surge honesty, materialized. Issued price is immutable.
CREATE TABLE quotes (
    quote_id    UUID PRIMARY KEY,
    rider_id    BIGINT NOT NULL,
    origin_cell BIGINT NOT NULL,
    surge_mult  NUMERIC(4,2) NOT NULL,
    fare        NUMERIC(8,2) NOT NULL,
    issued_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    expires_at  TIMESTAMPTZ NOT NULL         -- issued_at + 3 minutes
);

The ephemeral layer (RAM, by design)

The geospatial index is two hash maps and a discipline. It is not a database and must not become one: at 1.25 M writes per second, B-trees, WALs and MVCC are taxes paid for durability the data doesn’t want. If the process dies, the index rebuilds itself from live pings in about thirty seconds — recovery is free because the world keeps broadcasting its state.

LOGICAL SCHEMA · HOT INDEX (REDIS-STYLE)
# cell → set of available drivers (H3 resolution 8, ~460m hexes)
SADD geo:{city}:cell:{h3_index} {driver_id}

# driver → packed position + status, with TTL as the stale-driver reaper
HSET geo:{city}:drv:{driver_id} lat 37.4419 lng -122.143 heading 213 status available seen_ms 1765432109482
EXPIRE geo:{city}:drv:{driver_id} 30      # no ping in 30s → not matchable

# a ping that crosses a cell boundary is two O(1) ops:
SREM geo:sfo:cell:8828308281fffff 91442
SADD geo:sfo:cell:8828308283fffff 91442

Sharding follows geography — city-level shards, hot cities dedicated, small cities packed — because riders in one city never match drivers in another. Boundary cells are replicated to both neighboring shards so a rider standing on a border still gets a single-query answer; results are unioned and de-duplicated by driver_id, latest seen_ms wins. And because the index owns nothing, brief double-presence during a crossing is harmless: geography proposes candidates; the CAS on driver_state decides truth.
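
The union-and-de-duplicate step for a border query can be sketched in a few lines. This is an illustration under stated assumptions: the function name `merge_shard_results` and the candidate dictionaries are hypothetical, with field names borrowed from the ping record.

```python
from typing import Iterable


def merge_shard_results(*shard_results: Iterable[dict]) -> list[dict]:
    """Union candidate lists from neighboring shards, one entry per driver.

    Each candidate is a dict carrying at least 'driver_id' and 'seen_ms'.
    If a driver shows up in more than one shard (a boundary cell, or a
    crossing in progress), the sighting with the latest seen_ms wins.
    """
    latest: dict[int, dict] = {}
    for result in shard_results:
        for cand in result:
            kept = latest.get(cand["driver_id"])
            if kept is None or cand["seen_ms"] > kept["seen_ms"]:
                latest[cand["driver_id"]] = cand
    return list(latest.values())


# Driver 91442 is mid-crossing and appears in both shards.
shard_a = [{"driver_id": 91442, "seen_ms": 1765432109482, "cell": "8828308281fffff"}]
shard_b = [
    {"driver_id": 91442, "seen_ms": 1765432113490, "cell": "8828308283fffff"},
    {"driver_id": 7, "seen_ms": 1765432110000, "cell": "8828308283fffff"},
]
merged = merge_shard_results(shard_a, shard_b)
by_id = {c["driver_id"]: c for c in merged}
print(sorted(by_id))                # both drivers, each exactly once
print(by_id[91442]["cell"])         # the fresher sighting's cell
```

Nothing here needs to be exact: a stale duplicate that slipped through would only produce a candidate that then loses the CAS on driver_state.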

The wire format

STREAM RECORD · LOCATION PING
{
  "driver_id": 91442,
  "city": "sfo",
  "lat": 37.44188,
  "lng": -122.14302,
  "heading": 213,
  "speed_mps": 11.4,
  "status": "available",
  "ts_ms": 1765432109482
}
// partition key = (city, hash(driver_id) % N), with N sized so the worst
// city ÷ N fits one partition with 10× headroom. Ordering matters only
// per driver, so the hash dilutes a concert-spike across N partitions
// while preserving the per-driver sequence. Hot keys: structurally impossible.
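
A small sketch of the partition-key rule, assuming SHA-256 as the stable hash (the article does not name one; any hash that is stable across processes would do) and a hypothetical N of 64 buckets per city:

```python
import hashlib
from collections import Counter


def partition_for(city: str, driver_id: int, n: int) -> tuple[str, int]:
    """Return the (city, bucket) partition key for a driver's pings.

    The hash must be stable across processes and restarts so that every
    ping from one driver lands on the same partition, preserving order.
    SHA-256 is an assumption chosen here for its uniform spread.
    """
    digest = hashlib.sha256(str(driver_id).encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % n
    return city, bucket


N = 64  # hypothetical bucket count for one city

# Same driver, same key, every time: per-driver ordering is preserved.
print(partition_for("sfo", 91442, N) == partition_for("sfo", 91442, N))

# 10,000 drivers in one city spread across all buckets, so a local
# demand spike is diluted instead of landing on a single partition.
counts = Counter(partition_for("sfo", d, N)[1] for d in range(10_000))
print(len(counts), min(counts.values()), max(counts.values()))
```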

§ 04 — MATCHING & LIFECYCLE
The only place consistency is sacred

Matching is a saga wrapped around one atomic instruction. Everything else — ranking, offers, timers, retries — exists to feed that instruction good candidates and clean up when it loses.

The flow: a rider request arrives carrying a quote_id. Dispatch covers the rider’s radius with H3 cells via k-ring expansion, unions the driver sets, filters (status=available, vehicle class, rating floor), and ranks the survivors. Ranking is ETA-to-pickup over the road network — never haversine; a driver across the river is 200 meters and 20 minutes away — blended with acceptance rate and a fairness term (time since last trip), so the marketplace doesn’t starve its patient drivers.

Then the offer saga: compare-and-swap the top candidate’s state from available → offered, push the offer, start a 12-second server-side expiry. Decline or timeout → CAS back, next candidate, cap at five attempts, then widen the radius or fail gracefully. Offers go out sequentially by default; when supply is tight, parallel offers with first-accept-wins — same CAS arbitrates — trading driver annoyance for match latency. Two riders’ queries may both see the same driver; only one CAS succeeds. The race exists with or without shard boundaries, and one mechanism handles both.
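
The sequential form of the saga fits in a short loop. The sketch below uses an in-memory dictionary as a stand-in for the driver-state store and a hypothetical `respond` callback in place of the push notification and the 12-second server-side timer. It also assumes that a lost CAS does not count toward the five-attempt cap, which the text leaves open.

```python
def try_claim(states: dict[int, str], driver_id: int) -> bool:
    """Stand-in for the CAS: available -> offered, for one caller only."""
    if states.get(driver_id) == "available":
        states[driver_id] = "offered"
        return True
    return False


def release(states: dict[int, str], driver_id: int) -> None:
    """CAS back after a decline or timeout: offered -> available."""
    if states.get(driver_id) == "offered":
        states[driver_id] = "available"


def run_offer_saga(states, ranked, respond, max_attempts=5):
    """Offer the ride to ranked candidates one at a time.

    respond(driver_id) returns 'accept', 'decline' or 'timeout'.
    Returns the matched driver_id, or None so the caller can widen
    the radius or fail gracefully.
    """
    attempts = 0
    for driver_id in ranked:
        if attempts >= max_attempts:
            break
        if not try_claim(states, driver_id):
            continue                  # another rider's saga holds this driver
        attempts += 1
        if respond(driver_id) == "accept":
            states[driver_id] = "on_trip"
            return driver_id
        release(states, driver_id)    # decline or timeout: back to the pool
    return None


states = {1: "offered", 2: "available", 3: "available"}
answers = {2: "decline", 3: "accept"}
winner = run_offer_saga(states, ranked=[1, 2, 3], respond=answers.get)
print(winner, states)
```

Driver 1 is skipped because another saga already holds the claim, driver 2 declines and returns to the pool, and driver 3 accepts.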

Offer expiry must live on the server, never the phone. Phones die mid-offer; the marketplace cannot.
DISPATCH RULE Nº 1

The trip state machine

REQUESTED → MATCHED → DRIVER_EN_ROUTE → ARRIVED → IN_TRIP → COMPLETED

Cancel edges exist from every pre-trip state, and CANCELLED records who and when — fee logic hangs off that pair. Failure transitions are explicit: MATCHED with no driver movement for N minutes auto-cancels and rematches; a driver app going dark mid-trip leaves the trip IN_TRIP with stale location and a support flag — the system never auto-completes on signal loss. Every transition is a CAS on the trip’s version, so a rider-cancel racing a driver-arrive resolves deterministically with exactly one winner; and every transition emits a trip_events row, which keeps the trip service small while payments, notifications and analytics consume downstream.

SQL · THE ATOMIC HEART OF DISPATCH
-- Claim a driver. Succeeds for exactly one concurrent caller.
UPDATE driver_state
   SET status           = 'offered',
       active_trip_id   = :trip_id,
       offer_expires_at = now() + INTERVAL '12 seconds',
       version          = version + 1
 WHERE driver_id = :driver_id
   AND status    = 'available';          -- the guard IS the lock
-- rowcount = 1 → offer is yours. rowcount = 0 → next candidate.

-- Versioned trip transition: rider-cancel vs driver-arrive, one winner.
UPDATE trips
   SET state   = 'cancelled',
       version = version + 1
 WHERE trip_id = :trip_id
   AND version = :expected_version
   AND state IN ('requested','matched','driver_en_route','arrived');
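
The rowcount contract can be exercised locally. A minimal sketch using SQLite from the Python standard library as a stand-in for the Postgres-class store; the table is trimmed to the columns the claim touches, the expiry column is omitted because SQLite lacks `now() + INTERVAL`, and the two callers run one after the other rather than truly concurrently:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE driver_state (
        driver_id      INTEGER PRIMARY KEY,
        status         TEXT NOT NULL,
        active_trip_id INTEGER,
        version        INTEGER NOT NULL DEFAULT 0
    )""")
conn.execute("INSERT INTO driver_state (driver_id, status) VALUES (91442, 'available')")

CLAIM_SQL = """
    UPDATE driver_state
       SET status = 'offered', active_trip_id = :trip_id, version = version + 1
     WHERE driver_id = :driver_id
       AND status = 'available'
"""


def claim(trip_id: int, driver_id: int) -> bool:
    """Return True only if this caller moved the driver to 'offered'."""
    cur = conn.execute(CLAIM_SQL, {"trip_id": trip_id, "driver_id": driver_id})
    conn.commit()
    return cur.rowcount == 1


first = claim(trip_id=1001, driver_id=91442)    # wins the driver
second = claim(trip_id=1002, driver_id=91442)   # guard fails, zero rows touched
row = conn.execute(
    "SELECT status, active_trip_id, version FROM driver_state WHERE driver_id = 91442"
).fetchone()
print(first, second, row)
```

The second caller learns it lost from the row count alone; no lock table or distributed mutex is involved.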

§ 05 — INGESTION & STREAMS
Python on the firehose

Three programs carry the fast loop: the gateway that tames the connections, the consumer that maintains the index, and the dispatcher’s candidate query. Each is small; the judgment is in what they refuse to do.

1 · The gateway — batch, coalesce, never block

The batch window is 100 ms with a dual trigger — flush on timer or size cap, whichever first. The reasoning is physical: end-to-end staleness is ping interval + batch + stream hop + apply. The ping interval is 4,000 ms, so 100 ms adds 2.5% — and a driver at 30 mph moves ~1.3 m in that window, below GPS’s own 5–10 m noise floor. The latency cost is literally beneath the sensor; the downstream win is ~100× fewer produce requests and 5–10× batch compression. When the trade is imperceptible latency against an order of magnitude of infrastructure, take it every time.
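
The physical argument is two lines of arithmetic. A quick check, using the lower end of the quoted GPS noise range as the comparison point:

```python
PING_INTERVAL_MS = 4_000
BATCH_WINDOW_MS = 100
SPEED_MPH = 30
MPH_TO_MPS = 0.44704          # exact: 1609.344 m per mile / 3600 s per hour
GPS_NOISE_FLOOR_M = 5         # lower end of the 5-10 m range quoted above

added_staleness_pct = 100 * BATCH_WINDOW_MS / PING_INTERVAL_MS
drift_m = SPEED_MPH * MPH_TO_MPS * BATCH_WINDOW_MS / 1000

print(f"batch window adds {added_staleness_pct}% to the ping interval")
print(f"a driver at {SPEED_MPH} mph moves {drift_m:.2f} m during the window")
print(f"below GPS noise floor: {drift_m < GPS_NOISE_FLOOR_M}")
```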

PYTHON · GATEWAY: PER-PARTITION COALESCING BUFFER
import asyncio
import time
from collections import OrderedDict

BATCH_WINDOW_S  = 0.100     # dual trigger: timer bounds latency…
BATCH_MAX_MSGS  = 1_000     # …size cap bounds memory at peak
BUFFER_HARD_CAP = 60_000    # ≈ active drivers on this gateway


class ProducerUnavailable(Exception):
    """Placeholder for the stream client's broker-unreachable error (name assumed)."""


class CoalescingSender:
    """One per stream partition. Location is last-value-wins by nature:
    a new ping from a driver OVERWRITES their queued ping, so under
    backpressure the buffer self-compacts to one entry per driver.
    Memory is bounded by design, not by luck."""

    def __init__(self, producer, partition):
        self.buf: OrderedDict[int, bytes] = OrderedDict()   # driver_id → ping
        self.producer, self.partition = producer, partition
        self.dropped = 0

    def offer(self, driver_id: int, ping: bytes) -> None:
        # Called from connection handlers. Synchronous, O(1), never awaits:
        # a slow partition must never backpressure 50k phone connections.
        if driver_id in self.buf:
            self.buf.move_to_end(driver_id)      # supersede, not append
        elif len(self.buf) >= BUFFER_HARD_CAP:
            self.buf.popitem(last=False)         # drop oldest: newest wins
            self.dropped += 1                    # alarm on rate, not count
        self.buf[driver_id] = ping

    async def run(self):
        while True:
            deadline = time.monotonic() + BATCH_WINDOW_S
            while time.monotonic() < deadline and len(self.buf) < BATCH_MAX_MSGS:
                await asyncio.sleep(0.005)
            if not self.buf:
                continue
            batch, self.buf = self.buf, OrderedDict()        # atomic swap
            try:
                await self.producer.send_batch(self.partition, list(batch.values()))
            except ProducerUnavailable:
                # circuit-break path: re-offer so coalescing keeps compacting;
                # downstream TTLs shrink the matching pool, which is the SAFE
                # failure: fewer matchable drivers, never phantom ones.
                for d, p in batch.items():
                    if d not in self.buf:        # a newer ping arrived during the
                        self.offer(d, p)         # send: never overwrite it with stale data

One carve-out, always stated: status transitions bypass this entirely. A driver going offline or accepting a trip is control plane — low volume, dispatch-correctness-critical — flushed immediately on a separate durable channel. The buffer above is for the data plane only.

2 · The index consumer — two O(1) ops per ping

PYTHON · STREAM CONSUMER → H3 INDEX
import json

import h3

RES, TTL_S = 8, 30      # ~460m hexes; 30s silence → not matchable


async def apply_batch(r, msgs: list[bytes]) -> None:
    """Idempotent by construction: applying a ping twice converges to
    the same state, so at-least-once delivery from the stream is fine.
    No transactions, no fsync. If this process dies, the index
    rebuilds from live pings in ~30 seconds."""
    pipe = r.pipeline(transaction=False)
    for m in msgs:
        p = json.loads(m)
        d, c = p["driver_id"], p["city"]
        cell = h3.latlng_to_cell(p["lat"], p["lng"], RES)
        key = f"geo:{c}:drv:{d}"

        prev = await r.hget(key, "cell")
        if isinstance(prev, bytes):              # redis-py returns bytes unless
            prev = prev.decode()                 # decode_responses=True is set
        if prev and prev != cell:                # crossed a hex boundary
            pipe.srem(f"geo:{c}:cell:{prev}", d)

        if p["status"] == "available":
            pipe.sadd(f"geo:{c}:cell:{cell}", d)
        else:                                    # on_trip/offline: out of pool,
            pipe.srem(f"geo:{c}:cell:{cell}", d) # pings still flow for the map

        pipe.hset(key, mapping={"lat": p["lat"], "lng": p["lng"],
                                "cell": cell, "status": p["status"],
                                "seen_ms": p["ts_ms"]})
        pipe.expire(key, TTL_S)                  # the stale-driver reaper
    await pipe.execute()

3 · The candidate query — rings until satisfied

PYTHON · K-RING CANDIDATE SEARCH
async def candidates(r, city, lat, lng, want=20, max_ring=4) -> list[int]:
    """Hexagons earn their keep here: all six neighbors are equidistant,
    so ring number ≈ real distance and expansion maps cleanly to a radius.
    (On a square grid, corner neighbors lie farther than edge neighbors,
    so ring search over-includes diagonals.)"""
    center = h3.latlng_to_cell(lat, lng, RES)
    found: set[int] = set()
    for k in range(max_ring + 1):
        cells = h3.grid_ring(center, k)          # ring k only, not the disk
        sets = await r.sunion(*[f"geo:{city}:cell:{c}" for c in cells])
        found |= {int(d) for d in sets}
        if len(found) >= want:                   # enough for the ranker
            break
    return list(found)   # → filter → ETA-rank → offer saga → the CAS decides

Why H3 and not S2 — the honest version

Both work; the choice deserves reasons, not a default. S2’s exact hierarchy and Hilbert-curve locality are decisive in a disk-backed sorted store — but this index is hash maps in RAM, where locality buys nothing. The query is ring-shaped, where hexagons’ uniform neighbor distance wins; and surge runs on the same grid, where hexes give near-uniform cell area and smooth neighbor gradients with no corner-adjacency ambiguity. Two systems, one cell vocabulary. If the index moved to disk, the answer flips to S2.


§ 06 — AGGREGATION
Surge: a sliding window with manners

Surge is a streaming aggregation with two jobs: price honestly, and move supply. The computation is a per-cell supply/demand ratio over a sliding window; the craft is in the smoothing — and in the promise.

Two signals per H3 cell. Demand is ride requests plus app-opens — the “eyeballs” signal sees intent before the request button is pressed, which is what makes surge leading rather than trailing. Supply is available drivers, read from the same stream that feeds the geo index. Both aggregate over a sliding window of two to five minutes, recomputed every thirty seconds, with event-time windows and watermarks — gateway batching and partition lag mean pings arrive slightly out of order, and processing-time windows would let infrastructure jitter masquerade as demand spikes.
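
With a five-minute window sliding every thirty seconds, each event is counted in ten overlapping windows. A small sketch of the assignment rule, assuming window starts are aligned to multiples of the slide (a common convention, not something the article specifies):

```python
def windows_for(ts_s: int, size_s: int = 300, slide_s: int = 30) -> list[tuple[int, int]]:
    """Return every [start, end) sliding window that contains event time ts_s.

    ts_s is the event's own timestamp in seconds (event time), not the
    time the record reached the aggregator.
    """
    last_start = ts_s - ts_s % slide_s            # latest window starting at or before ts_s
    first_start = last_start - size_s + slide_s   # earliest window still covering ts_s
    return [(s, s + size_s) for s in range(first_start, last_start + 1, slide_s)]


wins = windows_for(1_000)
print(len(wins))            # size_s / slide_s overlapping windows
print(wins[0], wins[-1])    # earliest and latest window covering the event
```

Because assignment depends only on the event's timestamp, a ping delayed by gateway batching lands in the same windows it would have landed in had it arrived instantly, provided it beats the watermark.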

A raw ratio oscillates, so the multiplier is smoothed twice. Spatially: each cell’s ratio blends with its neighbors — hexagons again, six equidistant neighbors make this a clean diffusion — which prevents a one-block price cliff that teaches riders to walk across the street. Temporally: hysteresis, fast up and slow down — surge responds to a demand spike in one window but decays over several, so the price doesn’t flicker and drivers chasing the heat map aren’t whipsawed.

PYTHON · WINDOWED SURGE JOB (FLINK-STYLE SEMANTICS)
# Sketch of the aggregation graph: keyed sliding windows + smoothing.
demand = (events
    .filter(lambda e: e.kind in ("ride_request", "app_open"))
    .assign_timestamps(watermark_delay_s=10)       # tolerate late arrivals
    .key_by(lambda e: e.h3_cell)
    .window(sliding(size_s=300, slide_s=30))
    .aggregate(weighted_count(ride_request=1.0, app_open=0.25)))

supply = (pings
    .filter(lambda p: p.status == "available")
    .assign_timestamps(watermark_delay_s=10)       # same event-time clock as demand
    .key_by(lambda p: p.h3_cell)
    .window(sliding(size_s=300, slide_s=30))
    .aggregate(distinct_count("driver_id")))       # a driver pings ~75× per
                                                   # window: count drivers, not pings

def multiplier(cell, ratio, neighbors, prev):
    blended = 0.6 * ratio + 0.4 * mean(neighbors)  # spatial diffusion
    target = step_fn(blended)                      # e.g. 1.0/1.3/1.6/2.0…
    if target >= prev:                             # hysteresis:
        return target                              # up fast,
    return max(target, prev - 0.1)                 # decay slow, no flicker

surge = (demand.join(supply, on="h3_cell")
    .map(multiplier)
    .sink(surge_table))    # read by pricing at quote time; pushed to
                           # idle drivers as a heat map (the second job)
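
The smoothing rule in the sketch above leaves `step_fn` undefined. A runnable version under assumed thresholds shows the fast-up, slow-down behavior; the threshold table is hypothetical, and the result is rounded to two decimals to keep floating-point noise out of the multiplier:

```python
from statistics import mean

# (minimum blended demand/supply ratio, multiplier). Assumed values.
STEPS = [(2.0, 2.0), (1.5, 1.6), (1.2, 1.3)]


def step_fn(ratio: float) -> float:
    """Map a blended ratio onto a discrete multiplier tier."""
    for threshold, mult in STEPS:
        if ratio >= threshold:
            return mult
    return 1.0


def multiplier(ratio: float, neighbors: list[float], prev: float) -> float:
    blended = 0.6 * ratio + 0.4 * mean(neighbors)   # spatial diffusion
    target = step_fn(blended)
    if target >= prev:                              # rise immediately
        return target
    return round(max(target, prev - 0.1), 2)        # fall by at most 0.1 per window


# One cell: a single-window demand spike, then back to normal.
ratios = [1.0, 2.5, 1.0, 1.0, 1.0]
calm_neighbors = [1.0] * 6
history, prev = [], 1.0
for r in ratios:
    prev = multiplier(r, calm_neighbors, prev)
    history.append(prev)
print(history)
```

The spike lifts the multiplier in one window, while the decay takes three more windows to walk back down to the 1.3 tier. The calm neighbors also pull the blended ratio from 2.5 to 1.9, so the isolated cell lands on 1.6 rather than 2.0.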
The multiplier may change every thirty seconds. The quote, once seen, may never change at all.
SURGE RULE Nº 1 — QUOTE-TIME LOCK

The lock is the product-honesty boundary between the slow loop and the medium loop: when a rider opens the app, pricing reads the current multiplier, mints a quote_id with the fare frozen inside, valid for three minutes. The request that follows carries the quote_id; the trip is charged exactly that figure. Surge updates the next quote, never an issued one. Eventual consistency everywhere in the pipeline, determinism at the instant of promise.
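
A minimal sketch of the lock itself. The `base_fare` input is hypothetical (fare calculation belongs to the routing service that was scoped out), and the function names are illustrative:

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from decimal import Decimal

QUOTE_TTL = timedelta(minutes=3)


@dataclass(frozen=True)            # frozen: an issued quote cannot be mutated
class Quote:
    quote_id: uuid.UUID
    surge_mult: Decimal
    fare: Decimal
    issued_at: datetime
    expires_at: datetime


def mint_quote(base_fare: Decimal, surge_mult: Decimal, now: datetime) -> Quote:
    """Freeze the current multiplier into a fare at the moment of display."""
    fare = (base_fare * surge_mult).quantize(Decimal("0.01"))
    return Quote(uuid.uuid4(), surge_mult, fare, now, now + QUOTE_TTL)


def fare_to_charge(quote: Quote, now: datetime) -> Decimal:
    """The trip pays the quoted fare; the live multiplier is never consulted."""
    if now > quote.expires_at:
        raise ValueError("quote expired; mint a new one at the current multiplier")
    return quote.fare


t0 = datetime(2025, 1, 3, 18, 40, tzinfo=timezone.utc)
quote = mint_quote(Decimal("12.50"), Decimal("1.60"), t0)

# Two minutes later surge may have moved to 2.0x; the rider still pays the quote.
charged = fare_to_charge(quote, t0 + timedelta(minutes=2))
print(quote.fare, charged)
```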


§ 07 — ANALYTICS SQL
Interrogating the marketplace

The append-only tables — trip_events and the sampled ping archive — are where the system explains itself. Three queries an interviewer loves, because each one carries a classic SQL pattern on its back.

The match funnel — conditional aggregation

SQL · WHERE DO REQUESTS DIE?
-- Hourly funnel: requested → matched → completed, with time-to-match.
SELECT
    date_trunc('hour', t.requested_at)                       AS hr,
    count(*)                                                 AS requested,
    count(*) FILTER (WHERE t.state NOT IN ('requested','cancelled')
                        OR t.driver_id IS NOT NULL)          AS matched,
    count(*) FILTER (WHERE t.state = 'completed')            AS completed,
    round(100.0 * count(*) FILTER (WHERE t.state = 'completed')
          / nullif(count(*), 0), 1)                          AS completion_pct,
    percentile_cont(0.50) WITHIN GROUP (ORDER BY m.ttm_s)    AS p50_match_s,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY m.ttm_s)    AS p95_match_s
FROM trips t
LEFT JOIN LATERAL (
    SELECT extract(epoch FROM e.occurred_at - t.requested_at) AS ttm_s
    FROM trip_events e
    WHERE e.trip_id = t.trip_id
      AND e.to_state = 'matched'
    ORDER BY e.occurred_at          -- first match; a rematch adds later rows
    LIMIT 1
) m ON TRUE
WHERE t.requested_at >= now() - INTERVAL '7 days'
GROUP BY 1
ORDER BY 1;

Driver sessions — gaps and islands

“How long do drivers actually stay online?” The ping archive has no session boundaries — only timestamps. Sessionization is the canonical gaps-and-islands move: flag every gap larger than the threshold, then a running sum of flags becomes the session ID.

SQL · SESSIONIZATION OVER THE PING ARCHIVE
WITH gaps AS (
    SELECT
        driver_id,
        ts,
        CASE WHEN ts - lag(ts) OVER w > INTERVAL '15 minutes'
             THEN 1 ELSE 0 END AS new_session
    FROM driver_pings_sampled
    WHERE ts::date = CURRENT_DATE - 1
    WINDOW w AS (PARTITION BY driver_id ORDER BY ts)
),
sessions AS (
    SELECT
        driver_id,
        ts,
        sum(new_session) OVER (PARTITION BY driver_id ORDER BY ts) AS session_id
    FROM gaps
)
SELECT
    driver_id,
    session_id,
    min(ts)           AS session_start,
    max(ts)           AS session_end,
    max(ts) - min(ts) AS online_duration
FROM sessions
GROUP BY driver_id, session_id;
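
The same gaps-and-islands logic in plain Python makes the running-sum trick easier to see. This is an illustration only; the function name and the tuple layout are assumptions, not part of the pipeline:

```python
from datetime import datetime, timedelta


def sessionize(pings, gap=timedelta(minutes=15)):
    """pings: iterable of (driver_id, ts). Returns (driver_id, session_id, start, end).

    A new session starts whenever the gap since the driver's previous
    ping exceeds the threshold; session_id is the running count of such
    gaps, exactly like sum(new_session) OVER (...) in the SQL.
    """
    spans: dict[tuple[int, int], list[datetime]] = {}
    prev_ts: dict[int, datetime] = {}
    session_no: dict[int, int] = {}
    for driver_id, ts in sorted(pings):           # per driver, in time order
        last = prev_ts.get(driver_id)
        if last is None:
            session_no[driver_id] = 0             # first ping opens session 0
        elif ts - last > gap:
            session_no[driver_id] += 1            # gap found: next island
        prev_ts[driver_id] = ts
        span = spans.setdefault((driver_id, session_no[driver_id]), [ts, ts])
        span[1] = ts                              # extend the island's end
    return [(d, s, start, end) for (d, s), (start, end) in spans.items()]


base = datetime(2025, 1, 2, 8, 0)
pings = [
    (1, base),
    (1, base + timedelta(minutes=5)),
    (1, base + timedelta(minutes=40)),            # 35-minute gap: new session
    (1, base + timedelta(minutes=45)),
    (2, base + timedelta(minutes=10)),
]
result = sessionize(pings)
for row in result:
    print(row)
```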

Did surge actually summon drivers?

SQL · SUPPLY RESPONSE, 15 MIN AFTER SURGE ONSET
-- For each cell-hour that crossed 1.5×, compare available-driver counts
-- before and after onset. LAG over the surge snapshot table finds onsets.
-- Windows are half-open so the onset snapshot is not counted on both sides.
WITH onsets AS (
    SELECT h3_cell, snapshot_ts AS onset_ts
    FROM (
        SELECT
            h3_cell,
            snapshot_ts,
            surge_mult,
            lag(surge_mult) OVER (PARTITION BY h3_cell ORDER BY snapshot_ts) AS prev_mult
        FROM surge_snapshots
    ) s
    WHERE surge_mult >= 1.5
      AND coalesce(prev_mult, 1.0) < 1.5
)
SELECT
    o.h3_cell,
    avg(b.driver_cnt)                                             AS supply_before,
    avg(a.driver_cnt)                                             AS supply_after_15m,
    round(avg(a.driver_cnt) / nullif(avg(b.driver_cnt), 0), 2)    AS supply_lift
FROM onsets o
JOIN supply_snapshots b
  ON b.h3_cell = o.h3_cell
 AND b.snapshot_ts >= o.onset_ts - INTERVAL '15 min'
 AND b.snapshot_ts <  o.onset_ts
JOIN supply_snapshots a
  ON a.h3_cell = o.h3_cell
 AND a.snapshot_ts >  o.onset_ts
 AND a.snapshot_ts <= o.onset_ts + INTERVAL '15 min'
GROUP BY o.h3_cell
ORDER BY supply_lift DESC;

§ 08 — THE DASHBOARD
Proving the system is alive

A senior design ends with observability, because every clever degradation mode above is invisible without it. The dashboard watches the three loops separately — each has a different definition of “healthy.”

FAST LOOP
ping ingest rate vs. expected (5M ÷ 4s), index freshness p99 (now − seen_ms across a sample), gateway buffer depth and coalesce/drop rate, TTL eviction rate — a spike means a region went dark, not that drivers went home.
MEDIUM LOOP
time-to-match p50/p95, first-offer acceptance %, offers per match (the ranking-quality proxy), unfulfilled request %, CAS conflict rate — rising conflicts mean candidate sets overlap too much: rings too wide or supply too thin.
SLOW LOOP
surge cell coverage %, multiplier oscillation score (sign-flips per cell-hour — the hysteresis health check), quote→request conversion by multiplier, watermark lag on the streaming job.
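
The oscillation score is described only as sign-flips per cell-hour. One plausible reading, sketched here as an assumption: count how often a cell's multiplier reverses direction, ignoring flat steps, and average over cells and hours.

```python
def sign_flips(series: list[float]) -> int:
    """Count direction reversals (up then down, or down then up) in a series."""
    flips, last_dir = 0, 0
    for a, b in zip(series, series[1:]):
        direction = (b > a) - (b < a)       # +1 rising, -1 falling, 0 flat
        if direction and last_dir and direction != last_dir:
            flips += 1
        if direction:
            last_dir = direction
    return flips


def oscillation_score(by_cell: dict[str, list[float]], hours: float) -> float:
    """Average number of reversals per cell per hour."""
    return sum(sign_flips(s) for s in by_cell.values()) / (len(by_cell) * hours)


smooth = [1.0, 1.6, 1.5, 1.4, 1.3, 1.3]     # one spike, then a slow decay
flicker = [1.0, 1.6, 1.0, 1.6, 1.0]         # what missing hysteresis looks like
print(sign_flips(smooth), sign_flips(flicker))
print(oscillation_score({"cell_a": smooth, "cell_b": flicker}, hours=1.0))
```

A healthy cell reverses once per surge episode; a cell that flips on every recompute is the signal that the slow-decay rule is not being applied.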
[Figure: dashboard mock-up. Tiles as labeled:]
Marketplace Ops — SFO · FRI 18:40 PT · ALL SYSTEMS · 30s REFRESH
Ping Ingest: 1.21M/s
Index Freshness p99: 4.6s
Time-to-Match p95: 14.2s
Unfulfilled Req: 1.8%
Surge Heat — Mission District (H3 res 8) · concert letout 18:25 · legend: 2.0× / 1.6× / 1.0×
First-Offer Accept: 71%
Gateway Drop Rate: 0.02%
CAS Conflict Rate: 6.4%
Watermark Lag: 7s
Oscillation Score: 0.3
TTL Evictions: 2.1k/s
FIG. 2 — The story a healthy incident tells: surge heat rising, time-to-match drifting amber, CAS conflicts climbing as candidate sets overlap — and drop rate flat, because the gateway is coalescing exactly as designed.

Read the amber tiles together and the dashboard narrates the concert letout that the partitioning in § 03 was sized for: demand spiked, surge painted the heat map, drivers are converging (TTL evictions normal, ingest steady), matching is briefly slower and more contended, and nothing was dropped that mattered. That is what a designed degradation looks like from the operator’s chair.


§ 09 — THE RUBRIC
What was actually being tested

Strip the trip details away and the question was testing five judgments, each of which generalizes far beyond ride-sharing:

TEMPO
Seeing three loops where the prompt says one system — and letting write volume, not feature lists, dictate the architecture.
AUTHORITY
Keeping truth tiny and keyed; making everything large a rebuildable cache. The geo index owns nothing; one CAS owns everything that matters.
PHYSICS
Anchoring trade-offs to the world: a 100 ms batch against a 4 s ping interval and a 10 m GPS noise floor isn’t a latency cost, it’s a rounding error.
FAILURE SHAPE
Choosing degradations on purpose: data plane drops, control plane delays, and the matching pool shrinks rather than serving phantom drivers.
HONESTY
Engineering the promise: eventual consistency throughout the pipeline, determinism at the instant of the quote. The system may be approximate; the bill may not.
Candidates come from geography. Truth comes from a compare-and-swap. Everything else is plumbing — excellent, carefully reasoned plumbing.
— CLOSING ARGUMENT