The Dispatch Problem — PaddySpeaks

§ 01 — THE QUESTION
One sentence, three systems

Every senior data and platform engineer eventually meets some version of this question. It arrives in one casual sentence and contains three entire distributed systems.

Interview Prompt

"Design the core backend for a ride-sharing service — focus on driver matching, real-time location tracking, and surge pricing. How would you scope this out?"

LEVEL · SENIOR / STAFF
DURATION · 45 MIN
FORMAT · WHITEBOARD

The question is popular with interviewers for a reason: it cannot be answered with one architecture. The three subsystems it names run at wildly different tempos and want contradictory things. Location tracking is a firehose of data that is worthless four seconds after it arrives. Matching is a low-volume flow where a single consistency mistake — dispatching one driver to two riders — is the cardinal sin of the entire product. Surge pricing is a slow aggregation that must nevertheless be deterministic at the moment of quoting, because the price a rider sees must be the price a rider pays.

A weak answer treats these as one system and reaches for one database. A strong answer notices the tempos first. So before any boxes and arrows, the working frame for the whole session:

THE FAST LOOP: Location tracking. Millions of phones reporting position every few seconds. Dominates write volume by two orders of magnitude. Needs freshness, not durability — a dropped ping self-heals when the next one lands.
THE MEDIUM LOOP: Matching & dispatch. Rider requests, candidate search, ranked offers, accept/decline/timeout, exactly one driver per ride. Low volume, latency-sensitive, and the only place in the system that demands strong consistency.
THE SLOW LOOP: Surge pricing. Continuous supply/demand aggregation per geographic cell, recomputed on a sliding window, smoothed to prevent oscillation, and frozen into a quote the instant a rider sees a price.

Availability over consistency everywhere — except dispatch. A slightly suboptimal driver beats no driver. Two riders in one back seat beats nothing at all, except the company.

Scoping out loud

Scope is the first scored dimension, and most candidates skip it. State what you are building, what you are deliberately ignoring, and the numbers that will shape every later decision. Out of scope here, said explicitly: payments processing, ratings and fraud, the routing/ETA engine itself (treated as a callable service), the maps stack, and pooled rides — with the caveat that the matching design should not preclude pooling.

Then the envelope math, volunteered rather than extracted. Uber-shaped numbers:

Quantity	Estimate	Consequence
Peak concurrent drivers	5,000,000	Sets the size of the hot index
Ping interval	4 s	The freshness budget everything is measured against
Location writes	≈ 1.25 M / s	The number that shapes the whole architecture
Trips per day	25,000,000	≈ 300 matches/s average, ~3 K at peak — tiny by comparison
Hot index footprint	~100 B × 5 M ≈ 500 MB	Fits in one box of RAM; sharded for throughput, not size
Surge recompute	~1 M cells / 30 s ≈ 35 K updates/s	A modest streaming job
Raw location history	tens of TB / day	Sample and compress to cold storage; keep only latest hot

Notice the asymmetry: matching trails location ingestion by more than two orders of magnitude, and even the surge recompute trails it by more than one. That single row of the table dictates the partition strategy, the storage medium, and the failure philosophy. The rest of this article follows the data.
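The envelope math is easy to reproduce. A minimal sketch, using only the estimates from the table; the 100-byte entry size is the same assumption the hot-index row makes, and the variable names are illustrative.

```python
# Back-of-envelope numbers from the table; every input is an estimate.
PEAK_DRIVERS = 5_000_000
PING_INTERVAL_S = 4
TRIPS_PER_DAY = 25_000_000
BYTES_PER_DRIVER = 100        # assumed size of one hot-index entry
SURGE_CELLS = 1_000_000
SURGE_PERIOD_S = 30
SECONDS_PER_DAY = 86_400

location_writes_per_s = PEAK_DRIVERS / PING_INTERVAL_S
avg_matches_per_s = TRIPS_PER_DAY / SECONDS_PER_DAY
hot_index_mb = PEAK_DRIVERS * BYTES_PER_DRIVER / 1e6
surge_updates_per_s = SURGE_CELLS / SURGE_PERIOD_S

print(f"location writes/s : {location_writes_per_s:,.0f}")
print(f"avg matches/s     : {avg_matches_per_s:,.0f}")
print(f"hot index (MB)    : {hot_index_mb:,.0f}")
print(f"surge updates/s   : {surge_updates_per_s:,.0f}")
print(f"ingest / matching : {location_writes_per_s / avg_matches_per_s:,.0f}x")
print(f"ingest / surge    : {location_writes_per_s / surge_updates_per_s:,.1f}x")
```

Under these inputs, ingestion runs a few thousand times the average match rate and a few tens of times the surge update rate, which is the asymmetry the design leans on.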

§ 02 — DATA FLOW
Following a ping through the building

One architecture, three tempos. The spine of the design is a partitioned change stream doing triple duty: it feeds the geospatial index, the surge aggregator, and the analytics lake — from a single ordered log.

FIG. 1 — End-to-end flow. The stream is the spine; the geo index is a cache; truth lives in two small keyed stores.

Three properties of this picture do most of the interview’s work. First, the firehose never touches a durable store on the hot path — pings flow connection → gateway → stream → RAM, and only a sampled tributary reaches disk. Second, authority is deliberately tiny: the driver-state store and the trip service are small, keyed, boring databases, while everything large is a rebuildable cache. Third, the candidate query is dashed for a reason — the geo index merely proposes drivers; an atomic compare-and-swap on the driver record decides. Double-dispatch is prevented by that one line, not by geography.

The Failure Philosophy, In One Rule

Data plane degrades by dropping; control plane degrades by delaying. A lost ping costs nothing — the next one heals it in four seconds, so under backpressure the gateway buffers briefly, coalesces by driver (last value wins), drops oldest, and never blocks. A lost state transition — an acceptance, a cancellation — is never dropped: separate channel, durable local queue, retry with backoff.

§ 03 — DATA MODEL
Two databases, one cache, and a log

The schema falls out of the authority question. Durable, transactional truth: trips, driver state, quotes. Ephemeral, rebuildable speed: the H3 index. Append-only history: the event log and the sampled ping archive.

The durable core (Postgres-class)

Trips are small records at trivial volume — 25 M rows a day is gigabytes — so a conventional relational store is the right home. The interesting choices are the version columns (optimistic concurrency for every state transition) and the append-only trip_events table, which gives the lifecycle an audit trail and feeds analytics without touching the hot row.

DDL · DURABLE CORE

-- Truth #1: who is this driver and what are they doing right now.
-- Hash-sharded by driver_id. NOT geo-sharded: authority must not move
-- when the driver does.
CREATE TABLE driver_state (
    driver_id        BIGINT       PRIMARY KEY,
    status           TEXT         NOT NULL
                     CHECK (status IN ('offline','available','offered','on_trip')),
    active_trip_id   BIGINT,
    offer_expires_at TIMESTAMPTZ,             -- server-side expiry, never phone-side
    version          BIGINT       NOT NULL DEFAULT 0,
    updated_at       TIMESTAMPTZ  NOT NULL DEFAULT now()
);

-- Truth #2: the trip itself. Pinned to its shard by trip_id for life.
CREATE TABLE trips (
    trip_id        BIGINT       PRIMARY KEY,   -- snowflake: time-ordered, shard-aware
    rider_id       BIGINT       NOT NULL,
    driver_id      BIGINT,
    quote_id       UUID         NOT NULL,
    state          TEXT         NOT NULL DEFAULT 'requested'
                   CHECK (state IN ('requested','matched','driver_en_route',
                          'arrived','in_trip','completed','cancelled')),
    pickup_cell    BIGINT       NOT NULL,     -- H3 res 8
    pickup_lat     DOUBLE PRECISION NOT NULL,
    pickup_lng     DOUBLE PRECISION NOT NULL,
    drop_lat       DOUBLE PRECISION,
    drop_lng       DOUBLE PRECISION,
    surge_mult     NUMERIC(4,2) NOT NULL DEFAULT 1.00,
    quoted_fare    NUMERIC(8,2) NOT NULL,     -- the locked promise
    version        BIGINT       NOT NULL DEFAULT 0,
    requested_at   TIMESTAMPTZ  NOT NULL DEFAULT now(),
    completed_at   TIMESTAMPTZ
);
CREATE INDEX idx_trips_rider  ON trips (rider_id, requested_at DESC);
CREATE INDEX idx_trips_driver ON trips (driver_id, requested_at DESC);

-- Append-only lifecycle log. Every transition lands here exactly once;
-- payments, notifications and analytics are downstream consumers.
CREATE TABLE trip_events (
    event_id     BIGINT       GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    trip_id      BIGINT       NOT NULL,
    from_state   TEXT         NOT NULL,
    to_state     TEXT         NOT NULL,
    actor        TEXT         NOT NULL,      -- rider | driver | system
    occurred_at  TIMESTAMPTZ  NOT NULL DEFAULT now(),
    detail       JSONB
);

-- The quote: surge honesty, materialized. Issued price is immutable.
CREATE TABLE quotes (
    quote_id     UUID         PRIMARY KEY,
    rider_id     BIGINT       NOT NULL,
    origin_cell  BIGINT       NOT NULL,
    surge_mult   NUMERIC(4,2) NOT NULL,
    fare         NUMERIC(8,2) NOT NULL,
    issued_at    TIMESTAMPTZ  NOT NULL DEFAULT now(),
    expires_at   TIMESTAMPTZ  NOT NULL       -- issued_at + 3 minutes
);

The ephemeral layer (RAM, by design)

The geospatial index is two hash maps and a discipline. It is not a database and must not become one: at 1.25 M writes per second, B-trees, WALs and MVCC are taxes paid for durability the data doesn’t want. If the process dies, the index rebuilds itself from live pings in about thirty seconds — recovery is free because the world keeps broadcasting its state.

LOGICAL SCHEMA · HOT INDEX (REDIS-STYLE)

# cell → set of available drivers (H3 resolution 8, ~460m hexes)
SADD   geo:{city}:cell:{h3_index}   {driver_id}

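# driver → packed position + status, with TTL as the stale-driver reaper
HSET   geo:{city}:drv:{driver_id}   lat 37.4419  lng -122.143
                                     heading 213  status available
                                     cell {h3_index}  seen_ms 1765432109482
EXPIRE geo:{city}:drv:{driver_id}   30           # no ping in 30s → not matchable
# The TTL expires only this hash. Membership in the cell set above is not
# reaped by it, so the candidate query must drop IDs whose hash is gone.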

# a ping that crosses a cell boundary is two O(1) ops:
SREM geo:sfo:cell:8828308281fffff 91442
SADD geo:sfo:cell:8828308283fffff 91442

Sharding follows geography — city-level shards, hot cities dedicated, small cities packed — because riders in one city never match drivers in another. Boundary cells are replicated to both neighboring shards so a rider standing on a border still gets a single-query answer; results are unioned and de-duplicated by driver_id, latest seen_ms wins. And because the index owns nothing, brief double-presence during a crossing is harmless: geography proposes candidates; the CAS on driver_state decides truth.
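The union-and-de-duplicate step for a border query can be sketched in a few lines. This is an illustration only: the row shape (a dict with `driver_id`, `seen_ms`, and `cell`) and the function name `merge_shard_results` are assumptions, not part of the design above.

```python
from typing import Iterable

def merge_shard_results(*shard_results: Iterable[dict]) -> list[dict]:
    """Union candidate rows from several shards into one row per driver_id,
    keeping whichever row carries the latest seen_ms."""
    latest: dict[int, dict] = {}
    for rows in shard_results:
        for row in rows:
            current = latest.get(row["driver_id"])
            if current is None or row["seen_ms"] > current["seen_ms"]:
                latest[row["driver_id"]] = row
    return list(latest.values())

# Driver 91442 is mid-crossing and briefly present on both shards.
shard_a = [
    {"driver_id": 91442, "seen_ms": 1000, "cell": "A"},
    {"driver_id": 7, "seen_ms": 900, "cell": "A"},
]
shard_b = [{"driver_id": 91442, "seen_ms": 1400, "cell": "B"}]

merged = merge_shard_results(shard_a, shard_b)
by_driver = {row["driver_id"]: row for row in merged}
print(sorted(by_driver))            # one entry per driver
print(by_driver[91442]["cell"])     # the fresher sighting wins
```

The merge only tidies the candidate list. If the stale copy slipped through anyway, the claim on driver_state would still reject a second dispatch.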

The wire format

STREAM RECORD · LOCATION PING

{
  "driver_id": 91442,
  "city":      "sfo",
  "lat":       37.44188,
  "lng":       -122.14302,
  "heading":   213,
  "speed_mps": 11.4,
  "status":    "available",
  "ts_ms":     1765432109482
}
// partition key = (city, hash(driver_id) % N) — N sized so the worst
// city ÷ N fits one partition with 10× headroom. Ordering matters only
// per driver, so the hash dilutes a concert-spike across N partitions
// while preserving the per-driver sequence. Hot keys: structurally impossible.
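A sketch of how a gateway might compute that partition key. The choice of SHA-256 as the stable hash and the name `partition_for` are assumptions made for illustration; any hash that every producer computes identically would serve.

```python
import hashlib

def partition_for(city: str, driver_id: int, n: int) -> tuple[str, int]:
    """Return (city, bucket). A stable digest is used so every gateway,
    in any process or language, maps a driver to the same bucket."""
    digest = hashlib.sha256(str(driver_id).encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % n
    return city, bucket

N = 64                                     # hypothetical partitions per city
key = partition_for("sfo", 91442, N)

# One driver always lands on one partition, so per-driver order survives...
same_again = partition_for("sfo", 91442, N)

# ...while a crowd of drivers in one city spreads over all N buckets.
buckets_used = {partition_for("sfo", d, N)[1] for d in range(10_000)}
print(key[0], same_again == key, len(buckets_used))
```

Because the bucket depends on the driver and not on the location, a stadium full of drivers in one hex still fans out across the city's partitions.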

§ 04 — MATCHING & LIFECYCLE
The only place consistency is sacred

Matching is a saga wrapped around one atomic instruction. Everything else — ranking, offers, timers, retries — exists to feed that instruction good candidates and clean up when it loses.

The flow: a rider request arrives carrying a quote_id. Dispatch covers the rider’s radius with H3 cells via k-ring expansion, unions the driver sets, filters (status=available, vehicle class, rating floor), and ranks the survivors. Ranking is ETA-to-pickup over the road network — never haversine; a driver across the river is 200 meters and 20 minutes away — blended with acceptance rate and a fairness term (time since last trip), so the marketplace doesn’t starve its patient drivers.
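One way to picture the blended ranking is a single score expressed in seconds of ETA. This is a sketch under stated assumptions: the weights, the idle cap, and the `Candidate` fields are invented for illustration, and the ETA is taken as already supplied by the routing service.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    driver_id: int
    eta_s: float            # road-network ETA to pickup, from the routing service
    acceptance_rate: float  # 0.0 to 1.0
    idle_s: float           # seconds since the driver's last trip

# Hypothetical weights: everything is converted into "seconds of ETA".
W_ACCEPT_S = 60.0     # a perfect acceptance rate is worth 60 s of ETA
W_IDLE = 0.05         # each idle second is worth 0.05 s of ETA
IDLE_CAP_S = 1_800    # fairness credit stops growing after 30 minutes

def score(c: Candidate) -> float:
    """Lower is better: ETA minus acceptance and fairness credits."""
    return (c.eta_s
            - W_ACCEPT_S * c.acceptance_rate
            - W_IDLE * min(c.idle_s, IDLE_CAP_S))

def rank(cands: list[Candidate]) -> list[Candidate]:
    return sorted(cands, key=score)

ranked = rank([
    Candidate(1, eta_s=180, acceptance_rate=0.90, idle_s=60),
    Candidate(2, eta_s=200, acceptance_rate=0.95, idle_s=1_800),
    Candidate(3, eta_s=150, acceptance_rate=0.30, idle_s=0),
])
print([c.driver_id for c in ranked])
```

With these made-up weights the patient driver 2 outranks the nearer but unreliable driver 3, which is the behavior the fairness term exists to produce.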

Then the offer saga: compare-and-swap the top candidate’s state from available → offered, push the offer, start a 12-second server-side expiry. Decline or timeout → CAS back, next candidate, cap at five attempts, then widen the radius or fail gracefully. Offers go out sequentially by default; when supply is tight, parallel offers with first-accept-wins — same CAS arbitrates — trading driver annoyance for match latency. Two riders’ queries may both see the same driver; only one CAS succeeds. The race exists with or without shard boundaries, and one mechanism handles both.
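The sequential offer saga can be sketched against an in-memory stand-in for the driver store. Everything here is illustrative: `DriverStore`, `run_offers`, and the `push_offer` callback are hypothetical names, and a real system would run the CAS as the guarded UPDATE on driver_state.

```python
import asyncio

OFFER_TIMEOUT_S = 12
MAX_ATTEMPTS = 5

class DriverStore:
    """In-memory stand-in for driver_state; cas() mirrors the guarded UPDATE."""
    def __init__(self, driver_ids):
        self.status = {d: "available" for d in driver_ids}

    def cas(self, driver_id, expected, new) -> bool:
        if self.status.get(driver_id) != expected:
            return False
        self.status[driver_id] = new
        return True

async def run_offers(store, ranked, push_offer, timeout_s=OFFER_TIMEOUT_S):
    """Offer to ranked drivers one at a time; return the winner or None."""
    attempts = 0
    for driver_id in ranked:
        if attempts >= MAX_ATTEMPTS:
            break
        if not store.cas(driver_id, "available", "offered"):
            continue                      # another rider's saga holds this driver
        attempts += 1
        try:
            accepted = await asyncio.wait_for(push_offer(driver_id), timeout_s)
        except asyncio.TimeoutError:      # server-side expiry
            accepted = False
        if accepted and store.cas(driver_id, "offered", "on_trip"):
            return driver_id
        store.cas(driver_id, "offered", "available")   # release, try the next
    return None                           # caller widens the radius or fails

async def demo():
    store = DriverStore([1, 2, 3])
    store.cas(1, "available", "offered")  # driver 1 is held by someone else

    async def push_offer(driver_id):      # driver 2 declines, driver 3 accepts
        return driver_id == 3

    return await run_offers(store, [1, 2, 3], push_offer), store.status

winner, final_status = asyncio.run(demo())
print(winner, final_status)
```

Parallel offers would change only the loop shape: several drivers move to offered at once, and the first acceptance whose final CAS succeeds wins while the rest are released.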

Offer expiry must live on the server, never the phone. Phones die mid-offer; the marketplace cannot.
DISPATCH RULE Nº 1

The trip state machine

REQUESTED→ MATCHED→ DRIVER_EN_ROUTE→ ARRIVED→ IN_TRIP→ COMPLETED

Cancel edges exist from every pre-trip state, and CANCELLED records who and when — fee logic hangs off that pair. Failure transitions are explicit: MATCHED with no driver movement for N minutes auto-cancels and rematches; a driver app going dark mid-trip leaves the trip IN_TRIP with stale location and a support flag — the system never auto-completes on signal loss. Every transition is a CAS on the trip’s version, so a rider-cancel racing a driver-arrive resolves deterministically with exactly one winner; and every transition emits a trip_events row, which keeps the trip service small while payments, notifications and analytics consume downstream.
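A compact sketch of the versioned state machine, showing a rider-cancel racing a driver-arrive. The `ALLOWED` table is read off the lifecycle above (the auto-cancel and rematch path is left out for brevity), and the `Trip` class and `transition` function are illustrative stand-ins for the trips table and its guarded UPDATE.

```python
from dataclasses import dataclass, field

ALLOWED = {
    "requested":       {"matched", "cancelled"},
    "matched":         {"driver_en_route", "cancelled"},
    "driver_en_route": {"arrived", "cancelled"},
    "arrived":         {"in_trip", "cancelled"},
    "in_trip":         {"completed"},       # no cancel edge once riding
    "completed":       set(),
    "cancelled":       set(),
}

@dataclass
class Trip:
    trip_id: int
    state: str = "requested"
    version: int = 0
    events: list = field(default_factory=list)   # stand-in for trip_events

def transition(trip: Trip, to_state: str, expected_version: int, actor: str) -> bool:
    """Apply a transition only if the caller saw the current version
    and the edge is legal. Returns False for the loser of a race."""
    if trip.version != expected_version:
        return False
    if to_state not in ALLOWED[trip.state]:
        return False
    trip.events.append((trip.state, to_state, actor))
    trip.state = to_state
    trip.version += 1
    return True

trip = Trip(trip_id=1)
transition(trip, "matched", 0, "system")
transition(trip, "driver_en_route", 1, "driver")

seen = trip.version                       # both racers read version 2
driver_won = transition(trip, "arrived", seen, "driver")
rider_won = transition(trip, "cancelled", seen, "rider")
print(driver_won, rider_won, trip.state, trip.version)
```

The loser does not fail silently in a real service: it re-reads the trip, sees the new state, and decides whether its action still makes sense.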

SQL · THE ATOMIC HEART OF DISPATCH

-- Claim a driver. Succeeds for exactly one concurrent caller.
UPDATE driver_state
SET    status           = 'offered',
       active_trip_id   = :trip_id,
       offer_expires_at = now() + INTERVAL '12 seconds',
       version          = version + 1
WHERE  driver_id = :driver_id
  AND  status    = 'available';          -- the guard IS the lock
-- rowcount = 1 → offer is yours. rowcount = 0 → next candidate.

-- Versioned trip transition: rider-cancel vs driver-arrive, one winner.
UPDATE trips
SET    state = 'cancelled', version = version + 1
WHERE  trip_id = :trip_id
  AND  version = :expected_version
  AND  state IN ('requested','matched','driver_en_route','arrived');
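The claim statement's row-count contract can be exercised end to end. The sketch below uses SQLite from the Python standard library purely as a stand-in for the Postgres-class store, with a trimmed-down driver_state table (no offer_expires_at); the `claim` helper is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE driver_state (
        driver_id      INTEGER PRIMARY KEY,
        status         TEXT    NOT NULL,
        active_trip_id INTEGER,
        version        INTEGER NOT NULL DEFAULT 0
    )""")
conn.execute("INSERT INTO driver_state (driver_id, status) VALUES (91442, 'available')")
conn.commit()

def claim(conn, driver_id: int, trip_id: int) -> bool:
    """Guarded UPDATE: the WHERE clause is the lock. True means the offer
    belongs to this caller; False means move on to the next candidate."""
    cur = conn.execute(
        "UPDATE driver_state "
        "SET status = 'offered', active_trip_id = ?, version = version + 1 "
        "WHERE driver_id = ? AND status = 'available'",
        (trip_id, driver_id),
    )
    conn.commit()
    return cur.rowcount == 1

first = claim(conn, 91442, trip_id=1001)    # first rider's dispatcher
second = claim(conn, 91442, trip_id=1002)   # second rider, same driver
row = conn.execute(
    "SELECT status, active_trip_id, version FROM driver_state WHERE driver_id = 91442"
).fetchone()
print(first, second, row)
```

The second caller changes nothing and learns it from the row count alone, with no lock table and no read-then-write window.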

§ 05 — INGESTION & STREAMS
Python on the firehose

Three programs carry the fast loop: the gateway that tames the connections, the consumer that maintains the index, and the dispatcher’s candidate query. Each is small; the judgment is in what they refuse to do.

1 · The gateway — batch, coalesce, never block

The batch window is 100 ms with a dual trigger — flush on timer or size cap, whichever first. The reasoning is physical: end-to-end staleness is ping interval + batch + stream hop + apply. The ping interval is 4,000 ms, so 100 ms adds 2.5% — and a driver at 30 mph moves ~1.3 m in that window, below GPS’s own 5–10 m noise floor. The latency cost is literally beneath the sensor; the downstream win is ~100× fewer produce requests and 5–10× batch compression. When the trade is imperceptible latency against an order of magnitude of infrastructure, take it every time.
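The physical argument reduces to two lines of arithmetic. A minimal check, using the figures from the paragraph above and the standard miles-per-hour conversion:

```python
PING_INTERVAL_MS = 4_000
BATCH_WINDOW_MS = 100
SPEED_MPH = 30
MPH_TO_MPS = 0.44704           # exact by definition of the mile
GPS_NOISE_FLOOR_M = 5          # low end of the 5 to 10 m range quoted above

added_staleness_pct = 100 * BATCH_WINDOW_MS / PING_INTERVAL_MS
drift_m = SPEED_MPH * MPH_TO_MPS * BATCH_WINDOW_MS / 1_000

print(f"staleness added by batching: {added_staleness_pct:.1f}%")
print(f"distance moved in the window: {drift_m:.2f} m")
print(f"below GPS noise floor: {drift_m < GPS_NOISE_FLOOR_M}")
```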

PYTHON · GATEWAY: PER-PARTITION COALESCING BUFFER

import asyncio, time
from collections import OrderedDict

BATCH_WINDOW_S  = 0.100        # dual trigger: timer bounds latency…
BATCH_MAX_MSGS  = 1_000        # …size cap bounds memory at peak
BUFFER_HARD_CAP = 60_000       # ≈ active drivers on this gateway

class ProducerUnavailable(Exception):
    """Assumed to be raised by producer.send_batch() when the stream is
    unreachable; substitute your client library's equivalent."""

class CoalescingSender:
    """One per stream partition. Location is last-value-wins by nature:
    a new ping from a driver OVERWRITES their queued ping, so under
    backpressure the buffer self-compacts to one entry per driver.
    Memory is bounded by design, not by luck."""

    def __init__(self, producer, partition):
        self.buf: OrderedDict[int, bytes] = OrderedDict()  # driver_id → ping
        self.producer, self.partition = producer, partition
        self.dropped = 0

    def offer(self, driver_id: int, ping: bytes) -> None:
        # Called from connection handlers. Synchronous, O(1), never awaits:
        # a slow partition must never backpressure 50k phone connections.
        if driver_id in self.buf:
            self.buf.move_to_end(driver_id)        # supersede, not append
        elif len(self.buf) >= BUFFER_HARD_CAP:
            self.buf.popitem(last=False)            # drop oldest: newest wins
            self.dropped += 1                       # alarm on rate, not count
        self.buf[driver_id] = ping

    async def run(self):
        while True:
            deadline = time.monotonic() + BATCH_WINDOW_S
            while time.monotonic() < deadline and len(self.buf) < BATCH_MAX_MSGS:
                await asyncio.sleep(0.005)
            if not self.buf:
                continue
            batch, self.buf = self.buf, OrderedDict()   # atomic swap (no await)
            try:
                await self.producer.send_batch(self.partition, list(batch.values()))
            except ProducerUnavailable:
                # circuit-break path: the unsent batch goes back as the OLDEST
                # entries, then pings that arrived during the send are replayed
                # on top, so a fresher ping is never overwritten by a stale one.
                # downstream TTLs shrink the matching pool — the SAFE failure:
                # fewer matchable drivers, never phantom ones.
                fresh, self.buf = self.buf, batch
                for d, p in fresh.items(): self.offer(d, p)
                await asyncio.sleep(BATCH_WINDOW_S)     # back off; don't spin

One carve-out, always stated: status transitions bypass this entirely. A driver going offline or accepting a trip is control plane — low volume, dispatch-correctness-critical — flushed immediately on a separate durable channel. The buffer above is for the data plane only.

2 · The index consumer — two O(1) ops per ping

PYTHON · STREAM CONSUMER → H3 INDEX

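import json
import h3                 # h3-py v4 names: latlng_to_cell, grid_ring

RES, TTL_S = 8, 30        # ~460m hexes; 30s silence → not matchable

async def apply_batch(r, msgs: list[bytes]) -> None:
    """r is assumed to be a redis.asyncio client.
    Idempotent by construction: applying a ping twice converges to
    the same state, so at-least-once delivery from the stream is fine.
    No transactions, no fsync — if this process dies, the index
    rebuilds from live pings in ~30 seconds."""
    pings = [json.loads(m) for m in msgs]

    # One pipelined round trip fetches every driver's previous cell; an
    # HGET awaited inside the loop would cost one round trip per ping.
    reads = r.pipeline(transaction=False)
    for p in pings:
        reads.hget(f"geo:{p['city']}:drv:{p['driver_id']}", "cell")
    prevs = await reads.execute()

    last_cell: dict[str, str] = {}     # drivers that appear twice in a batch
    pipe = r.pipeline(transaction=False)
    for p, prev in zip(pings, prevs):
        d, c = p["driver_id"], p["city"]
        cell = h3.latlng_to_cell(p["lat"], p["lng"], RES)
        key  = f"geo:{c}:drv:{d}"
        if isinstance(prev, bytes):                # bytes unless decode_responses=True;
            prev = prev.decode()                   # bytes never equal the str cell
        prev = last_cell.get(key, prev)
        last_cell[key] = cell

        if prev and prev != cell:                 # crossed a hex boundary
            pipe.srem(f"geo:{c}:cell:{prev}", d)
        if p["status"] == "available":
            pipe.sadd(f"geo:{c}:cell:{cell}", d)
        else:                                      # on_trip/offline: out of pool,
            pipe.srem(f"geo:{c}:cell:{cell}", d)   # pings still flow for the map

        pipe.hset(key, mapping={"lat": p["lat"], "lng": p["lng"],
                                "cell": cell, "status": p["status"],
                                "seen_ms": p["ts_ms"]})
        pipe.expire(key, TTL_S)                    # the stale-driver reaper
    await pipe.execute()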

3 · The candidate query — rings until satisfied

PYTHON · K-RING CANDIDATE SEARCH

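import h3                  # h3-py v4 API

RES = 8                    # must match the index consumer's resolution

async def candidates(r, city, lat, lng, want=20, max_ring=4) -> list[int]:
    """Hexagons earn their keep here: all six neighbors are equidistant,
    so ring number ≈ real distance and expansion maps cleanly to a
    radius. (On a square grid, corner neighbors lie farther than edge
    neighbors — ring search over-includes diagonals.)"""
    center = h3.latlng_to_cell(lat, lng, RES)
    found: set[int] = set()
    for k in range(max_ring + 1):
        cells = h3.grid_ring(center, k)            # ring k only, not the disk
        sets  = await r.sunion(*[f"geo:{city}:cell:{c}" for c in cells])
        ids   = [int(d) for d in sets]
        # Cell sets carry no TTL, so a silent driver lingers in them; keep
        # only IDs whose position hash (the 30 s TTL key) still exists.
        live = r.pipeline(transaction=False)
        for d in ids:
            live.exists(f"geo:{city}:drv:{d}")
        alive = await live.execute()
        found |= {d for d, ok in zip(ids, alive) if ok}
        if len(found) >= want:                     # enough for the ranker
            break
    return list(found)   # → filter → ETA-rank → offer saga → the CAS decides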

Why H3 and not S2 — the honest version

Both work; the choice deserves reasons, not a default. S2’s exact hierarchy and Hilbert-curve locality are decisive in a disk-backed sorted store — but this index is hash maps in RAM, where locality buys nothing. The query is ring-shaped, where hexagons’ uniform neighbor distance wins; and surge runs on the same grid, where hexes give near-uniform cell area and smooth neighbor gradients with no corner-adjacency ambiguity. Two systems, one cell vocabulary. If the index moved to disk, the answer flips to S2.

§ 06 — AGGREGATION
Surge: a sliding window with manners

Surge is a streaming aggregation with two jobs: price honestly, and move supply. The computation is a per-cell supply/demand ratio over a sliding window; the craft is in the smoothing — and in the promise.

Two signals per H3 cell. Demand is ride requests plus app-opens — the “eyeballs” signal sees intent before the request button is pressed, which is what makes surge leading rather than trailing. Supply is available drivers, read from the same stream that feeds the geo index. Both aggregate over a sliding window of two to five minutes, recomputed every thirty seconds, with event-time windows and watermarks — gateway batching and partition lag mean pings arrive slightly out of order, and processing-time windows would let infrastructure jitter masquerade as demand spikes.

A raw ratio oscillates, so the multiplier is smoothed twice. Spatially: each cell’s ratio blends with its neighbors — hexagons again, six equidistant neighbors make this a clean diffusion — which prevents a one-block price cliff that teaches riders to walk across the street. Temporally: hysteresis, fast up and slow down — surge responds to a demand spike in one window but decays over several, so the price doesn’t flicker and drivers chasing the heat map aren’t whipsawed.

PYTHON · WINDOWED SURGE JOB (FLINK-STYLE SEMANTICS)

# Sketch of the aggregation graph — keyed sliding windows + smoothing.

demand = (events
    .filter(lambda e: e.kind in ("ride_request", "app_open"))
    .assign_timestamps(watermark_delay_s=10)        # tolerate late arrivals
    .key_by(lambda e: e.h3_cell)
    .window(sliding(size_s=300, slide_s=30))
    .aggregate(weighted_count(request=1.0, app_open=0.25)))

supply = (pings
    .filter(lambda p: p.status == "available")
    .assign_timestamps(watermark_delay_s=10)        # same event-time rules as demand
    .key_by(lambda p: p.h3_cell)
    .window(sliding(size_s=300, slide_s=30))
    .aggregate(distinct_count("driver_id")))       # a driver pings ~75× per
                                                    # window: count drivers, not pings

def multiplier(cell, ratio, neighbors, prev):
    blended = 0.6 * ratio + 0.4 * mean(neighbors)   # spatial diffusion
    target  = step_fn(blended)                      # e.g. 1.0/1.3/1.6/2.0…
    if target >= prev:                              # hysteresis:
        return target                               #   up fast,
    return max(target, prev - 0.1)                  #   decay slow — no flicker

surge = (demand.join(supply, on="h3_cell")
    .map(multiplier)
    .sink(surge_table))      # read by pricing at quote time; pushed to
                             # idle drivers as a heat map — the second job
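The smoothing function in the sketch leaves `mean` and `step_fn` undefined, so here is a runnable version of just that piece. The ladder thresholds are invented for illustration, the `cell` argument is dropped because it is unused, and the rounding is an addition to keep repeated subtraction from accumulating float noise.

```python
from statistics import mean

STEPS = (1.0, 1.3, 1.6, 2.0)       # ladder from the sketch's example
THRESHOLDS = (1.2, 1.6, 2.2)       # hypothetical ratio cut-offs per rung

def step_fn(ratio: float) -> float:
    """Snap a blended demand/supply ratio to a rung of the ladder."""
    rung = sum(ratio >= t for t in THRESHOLDS)
    return STEPS[rung]

def multiplier(ratio: float, neighbors: list[float], prev: float,
               decay: float = 0.1) -> float:
    blended = 0.6 * ratio + 0.4 * mean(neighbors)   # spatial diffusion
    target = step_fn(blended)
    if target >= prev:                              # hysteresis: up fast
        return target
    return round(max(target, prev - decay), 2)      # down slow

history = [1.0]
history.append(multiplier(3.0, [2.0] * 6, history[-1]))      # demand spike
for _ in range(3):                                           # demand vanishes
    history.append(multiplier(0.5, [0.5] * 6, history[-1]))
print(history)
```

The spike takes one window to reach the top rung, while the decay gives back only 0.1 per window, which is the asymmetry the hysteresis is meant to enforce.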

The multiplier may change every thirty seconds. The quote, once seen, may never change at all.
SURGE RULE Nº 1 — QUOTE-TIME LOCK

The lock is the product-honesty boundary between the slow loop and the medium loop: when a rider opens the app, pricing reads the current multiplier, mints a quote_id with the fare frozen inside, valid for three minutes. The request that follows carries the quote_id; the trip is charged exactly that figure. Surge updates the next quote, never an issued one. Eventual consistency everywhere in the pipeline, determinism at the instant of promise.
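A minimal sketch of the quote lock, assuming an in-memory surge table and a base fare already computed elsewhere. The `Quote` class, `mint_quote`, and `fare_for_request` are illustrative names; money uses `Decimal` to mirror the NUMERIC columns in the quotes table.

```python
import uuid
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

QUOTE_TTL_S = 180                      # three minutes, as in the quotes table

@dataclass(frozen=True)                # frozen: an issued quote cannot be mutated
class Quote:
    quote_id: str
    origin_cell: str
    surge_mult: Decimal
    fare: Decimal
    expires_at: float

def mint_quote(origin_cell: str, base_fare: Decimal,
               surge_table: dict, now: float) -> Quote:
    """Read the current multiplier once and freeze the fare inside the quote."""
    mult = surge_table.get(origin_cell, Decimal("1.00"))
    fare = (base_fare * mult).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return Quote(str(uuid.uuid4()), origin_cell, mult, fare, now + QUOTE_TTL_S)

def fare_for_request(quote: Quote, now: float) -> Decimal:
    """The trip is charged the quoted figure; the surge table is never re-read."""
    if now > quote.expires_at:
        raise ValueError("quote expired; mint a new one")
    return quote.fare

surge_table = {"8828308281fffff": Decimal("1.60")}
quote = mint_quote("8828308281fffff", Decimal("20.00"), surge_table, now=0.0)

surge_table["8828308281fffff"] = Decimal("2.00")   # surge moves after the quote
charged = fare_for_request(quote, now=60.0)
print(quote.surge_mult, charged)
```

The request path never consults the surge table, so a later recompute can only affect quotes that have not been minted yet.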

§ 07 — ANALYTICS SQL
Interrogating the marketplace

The append-only tables — trip_events and the sampled ping archive — are where the system explains itself. Three queries an interviewer loves, because each one carries a classic SQL pattern on its back.

The match funnel — conditional aggregation

SQL · WHERE DO REQUESTS DIE?

-- Hourly funnel: requested → matched → completed, with time-to-match.
SELECT
    date_trunc('hour', t.requested_at)                        AS hr,
    count(*)                                                  AS requested,
    count(*) FILTER (WHERE t.state NOT IN ('requested','cancelled')
                       OR  t.driver_id IS NOT NULL)        AS matched,
    count(*) FILTER (WHERE t.state = 'completed')            AS completed,
    round(100.0 * count(*) FILTER (WHERE t.state = 'completed')
                 / nullif(count(*), 0), 1)                    AS completion_pct,
    percentile_cont(0.50) WITHIN GROUP (ORDER BY m.ttm_s)     AS p50_match_s,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY m.ttm_s)     AS p95_match_s
FROM trips t
LEFT JOIN LATERAL (
    SELECT extract(epoch FROM e.occurred_at - t.requested_at) AS ttm_s
    FROM   trip_events e
    WHERE  e.trip_id = t.trip_id AND e.to_state = 'matched'
    ORDER  BY e.occurred_at      -- first match; LIMIT alone picks arbitrarily after a rematch
    LIMIT  1
) m ON TRUE
WHERE t.requested_at >= now() - INTERVAL '7 days'
GROUP BY 1 ORDER BY 1;

Driver sessions — gaps and islands

“How long do drivers actually stay online?” The ping archive has no session boundaries — only timestamps. Sessionization is the canonical gaps-and-islands move: flag every gap larger than the threshold, then a running sum of flags becomes the session ID.

SQL · SESSIONIZATION OVER THE PING ARCHIVE

WITH gaps AS (
    SELECT driver_id, ts,
           CASE WHEN ts - lag(ts) OVER w > INTERVAL '15 minutes'
                THEN 1 ELSE 0 END                  AS new_session
    FROM   driver_pings_sampled
    WHERE  ts >= CURRENT_DATE - 1 AND ts < CURRENT_DATE   -- yesterday, as an index-friendly range
    WINDOW w AS (PARTITION BY driver_id ORDER BY ts)
),
sessions AS (
    SELECT driver_id, ts,
           sum(new_session) OVER
               (PARTITION BY driver_id ORDER BY ts)  AS session_id
    FROM   gaps
)
SELECT driver_id, session_id,
       min(ts)            AS session_start,
       max(ts)            AS session_end,
       max(ts) - min(ts)  AS online_duration
FROM   sessions
GROUP BY driver_id, session_id;
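The same gaps-and-islands idea reads naturally as a single pass in Python, which can help when checking the SQL against a handful of pings. A sketch for one driver's timestamps; the function name `sessionize` is illustrative.

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap=timedelta(minutes=15)):
    """Group one driver's ping times into sessions: a gap larger than
    `gap` starts a new session, exactly like the new_session flag."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1][1] = ts               # extend the current island
        else:
            sessions.append([ts, ts])          # gap found: new island
    return [(start, end) for start, end in sessions]

t0 = datetime(2025, 1, 1, 8, 0)
pings = [t0 + timedelta(minutes=m) for m in (0, 5, 10, 40, 45)]
sessions = sessionize(pings)
durations = [end - start for start, end in sessions]
print(len(sessions), durations)
```

The 30-minute silence between minute 10 and minute 40 splits the day into two sessions of ten and five minutes.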

Did surge actually summon drivers?

SQL · SUPPLY RESPONSE, 15 MIN AFTER SURGE ONSET

-- For each cell-hour that crossed 1.5×, compare available-driver counts
-- before and after onset. LAG over the surge snapshot table finds onsets.
WITH onsets AS (
    SELECT h3_cell, snapshot_ts AS onset_ts
    FROM (
        SELECT h3_cell, snapshot_ts, surge_mult,
               lag(surge_mult) OVER
                   (PARTITION BY h3_cell ORDER BY snapshot_ts) AS prev_mult
        FROM   surge_snapshots
    ) s
    WHERE  surge_mult >= 1.5 AND coalesce(prev_mult, 1.0) < 1.5
)
SELECT o.h3_cell,
       avg(b.driver_cnt) AS supply_before,
       avg(a.driver_cnt) AS supply_after_15m,
       round(avg(a.driver_cnt) / nullif(avg(b.driver_cnt), 0), 2)
                         AS supply_lift
FROM onsets o
JOIN supply_snapshots b
  ON b.h3_cell = o.h3_cell
 AND b.snapshot_ts >= o.onset_ts - INTERVAL '15 min'
 AND b.snapshot_ts <  o.onset_ts      -- half-open: the onset snapshot counts as "after" only
JOIN supply_snapshots a
  ON a.h3_cell = o.h3_cell
 AND a.snapshot_ts >= o.onset_ts
 AND a.snapshot_ts <  o.onset_ts + INTERVAL '15 min'
GROUP BY o.h3_cell
ORDER BY supply_lift DESC;

§ 08 — THE DASHBOARD
Proving the system is alive

A senior design ends with observability, because every clever degradation mode above is invisible without it. The dashboard watches the three loops separately — each has a different definition of “healthy.”

FAST LOOP: ping ingest rate vs. expected (5M ÷ 4s), index freshness p99 (now − seen_ms across a sample), gateway buffer depth and coalesce/drop rate, TTL eviction rate — a spike means a region went dark, not that drivers went home.
MEDIUM LOOP: time-to-match p50/p95, first-offer acceptance %, offers per match (the ranking-quality proxy), unfulfilled request %, CAS conflict rate — rising conflicts mean candidate sets overlap too much: rings too wide or supply too thin.
SLOW LOOP: surge cell coverage %, multiplier oscillation score (sign-flips per cell-hour — the hysteresis health check), quote→request conversion by multiplier, watermark lag on the streaming job.
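The oscillation score is named above but not defined in detail. One plausible reading, offered as an assumption, counts how often a cell's multiplier reverses direction within the hour; flat stretches are ignored.

```python
def sign_flips(series: list[float]) -> int:
    """Count direction reversals in a multiplier series.
    A rise followed (eventually) by a fall, or the reverse, is one flip."""
    flips, last_dir = 0, 0
    for a, b in zip(series, series[1:]):
        direction = (b > a) - (b < a)      # +1 up, -1 down, 0 flat
        if direction == 0:
            continue
        if last_dir and direction != last_dir:
            flips += 1
        last_dir = direction
    return flips

healthy = [1.0, 1.3, 1.6, 1.5, 1.4, 1.3]     # one spike, one smooth decay
flicker = [1.0, 1.3, 1.0, 1.3, 1.0]          # what hysteresis should prevent

print(sign_flips(healthy), sign_flips(flicker))
```

Averaged over cells, a score well under one flip per cell-hour would suggest the fast-up, slow-down rule is doing its job.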

Marketplace Ops — SFO FRI 18:40 PT · ALL SYSTEMS · 30s REFRESH

Metric	Value
Ping Ingest	1.21M/s
Index Freshness p99	4.6s
Time-to-Match p95	14.2s
Unfulfilled Req	1.8%
First-Offer Accept	71%
Gateway Drop Rate	0.02%
CAS Conflict Rate	6.4%
Watermark Lag	(value missing in source)
Oscillation Score	0.3
TTL Evictions	2.1k/s

Panel: Surge Heat — Mission District (H3 res 8) · concert letout 18:25

FIG. 2 — The story a healthy incident tells: surge heat rising, time-to-match drifting amber, CAS conflicts climbing as candidate sets overlap — and drop rate flat, because the gateway is coalescing exactly as designed.

Read the amber tiles together and the dashboard narrates the concert spike anticipated in §03: demand spiked, surge painted the heat map, drivers are converging (TTL evictions normal, ingest steady), matching is briefly slower and more contended — and nothing was dropped that mattered. That is what a designed degradation looks like from the operator’s chair.

§ 09 — THE RUBRIC
What was actually being tested

Strip the trip details away and the question was testing five judgments, each of which generalizes far beyond ride-sharing:

TEMPO: Seeing three loops where the prompt says one system — and letting write volume, not feature lists, dictate the architecture.
AUTHORITY: Keeping truth tiny and keyed; making everything large a rebuildable cache. The geo index owns nothing; one CAS owns everything that matters.
PHYSICS: Anchoring trade-offs to the world: a 100 ms batch against a 4 s ping interval and a 10 m GPS noise floor isn’t a latency cost, it’s a rounding error.
FAILURE SHAPE: Choosing degradations on purpose: data plane drops, control plane delays, and the matching pool shrinks rather than serving phantom drivers.
HONESTY: Engineering the promise: eventual consistency throughout the pipeline, determinism at the instant of the quote. The system may be approximate; the bill may not.

Candidates come from geography. Truth comes from a compare-and-swap. Everything else is plumbing — excellent, carefully reasoned plumbing.
— CLOSING ARGUMENT