Premise. Every second of every day, somewhere on Earth, a person taps a button and a stranger drives toward them. Multiply that out to roughly 30 million trips a day globally and you no longer have a transport company — you have a real-time, geo-spatial, latency-bound matching engine running over messy human inputs. This article tears the lid off that engine and frames each layer the way a real interview would: "design the dispatcher", "design surge", "design ETA", "explain why two riders going the same way got different drivers". We'll do it with SQL, Python, and visual data so you can see the system, not just memorize it.
Thirty million decisions a day,
each one in under two seconds.
Every tap on the Uber app triggers a chain reaction across H3 hexagons, batched matchers, streaming surge, layered ETA, contraction hierarchies, and a lakehouse the size of a small country. This article opens every box — with SQL, Python, visualizations, and the failure modes that kill naive designs at scale.
The seven decisions the algorithm makes, every trip
- Where am I? — index the rider's pin to a geographic cell.
- Who is close, online, and the right product? — candidate shortlist.
- Of those, who minimizes total system regret? — batched matching.
- What should I charge? — surge / dynamic pricing per hex.
- How long until pickup, and how long is the trip? — ETA models.
- Which roads should the driver take? — routing under live traffic.
- If multiple riders share, who gets picked up first? — pickup sequencing.
The rest of this article walks each one with the SQL, the Python, and the interview questions a serious panel will actually ask.
0. What Actually Breaks at 30 Million Trips a Day
Before any clever algorithm, you have to internalize what 30M trips/day does to a system. Most candidates skip this and lose the senior signal in the first 60 seconds of the interview. The number isn't decoration — it's a constraint generator. Every architectural decision in the rest of the article exists because something on this list was on fire.
The five things that break first
- Hot partition on a single hex. A stadium lets out → 12,000 ride requests in 3 minutes hit one H3 r5 cell. Whatever Kafka partition / Flink keyed stream / Pinot segment owns that hex melts. Throughput on that one shard goes from 5k/s baseline to 60k/s. Everything else in the city is fine; that shard's consumer lag goes vertical.
- Join cardinality blow-up. "Join driver_pings to ride_requests on hex" sounds innocent until you realize a 60-second window is ~75M pings × ~50k requests. A naïve hash join on the wrong key is a 4 TB shuffle. The job either OOMs or runs for 9 hours.
- Late events polluting closed windows. A driver's phone goes through a tunnel and dumps 90 s of buffered pings at once. Surge for that hex was already emitted as 1.0x. Now the late pings would have made it 1.6x. Do you retract? Do you ignore? Do you drop the rider's quote? Each answer has a different blast radius.
- State store fan-out. Every dispatch decision reads ~25 driver records and writes 1. That's ~250k reads/s and ~10k writes/s globally — and the reads must be < 1 ms p99. Postgres can't. MySQL can't. Even Cassandra struggles without careful partition design.
- The lake catches fire if you don't compact. A trip generates ~40 events. 30M × 40 = 1.2B events/day landing as small files on S3. Without Hudi/Iceberg compaction running every 15 minutes, your Spark queries spend 80% of their time listing files instead of reading them.
The three constraints that drive every design choice
Match latency
If dispatch takes longer than ~2 s the rider perceives lag and cancellation rate jumps ~3 percentage points. This forces in-memory state, batched matching, and pre-computed candidate sets.
Surge / ETA freshness
If pricing is > 1 minute stale, the marketplace oscillates: drivers chase a hot zone that already cooled. This forces streaming windows, not micro-batches.
Per-trip data cost
30M trips × even $0.01 of pipeline cost = $300k/day = $110M/year just to observe the system. This forces aggressive partition pruning, columnar storage, and tiering of hot vs cold data.
The three tradeoffs you must name out loud
The "obvious" architecture
- Postgres for driver state, single-writer per row
- Greedy nearest-driver dispatch as each request arrives
- Recompute surge on-demand from the trips table
Dies at: ~50k drivers per region. Locks contend, joins explode, p99 latency breaks 5 s.
What actually scales
- Aerospike-class store keyed by H3, eventual consistency with a 4 s freshness SLA
- Batched matching every 1–2 s on N×M cost matrix
- Streaming surge in 60 s tumbling windows, served from Pinot
Costs: a 1–2 s perceived delay and a stale-data window — both bounded and acceptable.
The rest of this article assumes you've internalized that frame: the algorithm exists to survive these constraints, not to be elegant. Now let's go layer by layer.
1. The Map Is the Database — H3 Hexagons
Before dispatch, surge, or ETA, you have to talk about how the world is indexed. Uber's open-source H3 library tiles the planet in hierarchical hexagons. A hex at resolution 9 is roughly the size of a city block; resolution 7 is roughly a neighborhood. Hexagons beat squares because every neighbor is equidistant, which makes "nearby" honest.
SQL is the language of the data plane: declarative, set-oriented, what an interviewer hands you on a whiteboard, and what actually runs in the streaming engines (Flink SQL), the OLAP layer (Pinot, BigQuery), and the lakehouse (Spark SQL, Trino). Use SQL when the question is "what is the answer to this query?". Python is the language of the algorithm: imperative, library-rich, where you actually express the optimization (scipy.optimize), the simulation (elasticities, surge response), the ML residual model (PyTorch / sklearn), and the orchestration (Airflow DAGs). Use Python when the question is "what is the procedure that produces the answer?". The interview test you're preparing for will alternate between them precisely because production systems do.
import h3
import pandas as pd
pings = pd.DataFrame({
"driver_id": [101, 102, 103, 104],
"lat": [37.7749, 37.7790, 37.7831, 37.7600],
"lng": [-122.4194, -122.4150, -122.4080, -122.4400],
"ts": pd.to_datetime(["2026-04-30 09:00:01"]*4),
})
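# NOTE: h3-py v3 API below ("geo_to_h3"); in h3-py >= 4.0 the same call is
# h3.latlng_to_cell(lat, lng, res).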
pings["h3_r9"] = pings.apply(lambda r: h3.geo_to_h3(r.lat, r.lng, 9), axis=1)
pings["h3_r7"] = pings.apply(lambda r: h3.geo_to_h3(r.lat, r.lng, 7), axis=1)
print(pings[["driver_id","h3_r9","h3_r7"]])
driver_id h3_r9 h3_r7
0 101 89283082837ffff 87283082bffffff
1 102 89283082877ffff 87283082bffffff
2 103 892830828a3ffff 87283082bffffff
3 104 8928308281bffff 87283082bffffff
Now "find me the 50 nearest drivers" stops being a haversine scan over millions of rows; it becomes "give me everything in this hex and its k=2 ring neighbors" — an O(k²) lookup against an in-memory index. That single design choice is what lets dispatch run in milliseconds instead of seconds.
Resolution 9
City-block scale. The unit of dispatch lookup and per-hex demand counting.
Resolution 7
Neighborhood scale. Used for surge aggregation and supply heatmaps shown to drivers.
Resolution 5
City-quarter scale. Stream partition key for Flink / Kafka — keeps ordering inside a hex.
H3 cell index
Compact, sortable, hierarchical — perfect partition / clustering key in Pinot, Cassandra, BigQuery.
Why hexagons instead of squares? Three reasons. (1) Equidistant neighbors — every adjacent hex centroid is the same distance away, so "nearest" doesn't have a corner-vs-edge bias like squares do. (2) Hierarchical aggregation — a parent hex contains a deterministic set of child hexes, so you can roll up demand from r9 → r7 → r5 with a join instead of a recompute. (3) Stable IDs — H3 indexes are 64-bit ints, perfect as a partition key in Cassandra, BigQuery, or Pinot.
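Reason (2) is worth two lines of code: rolling r9 demand up to r7 is a column derivation plus a GROUP BY, never a geometric recompute. A sketch with made-up counts (h3-py v3; v4 renames h3_to_parent to cell_to_parent):
import h3
import pandas as pd

pts = [(37.7749, -122.4194), (37.7751, -122.4190), (37.7831, -122.4080)]
df = pd.DataFrame({"h3_r9": [h3.geo_to_h3(lat, lng, 9) for lat, lng in pts],
                   "open_requests": [5, 3, 7]})
df["h3_r7"] = df["h3_r9"].map(lambda c: h3.h3_to_parent(c, 7))  # deterministic parent
print(df.groupby("h3_r7")["open_requests"].sum())   # r9 -> r7 rollup via a join key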
The algorithm doesn't pick the closest driver. It picks the driver whose assignment makes the next five minutes of the city less broken.
— What the dispatcher is actually optimizing
2. Dispatch — Choosing Which Driver Gets the Ride
The naïve answer is "the closest driver". The real answer is "the driver whose assignment minimizes total system regret in the next 5 minutes". Uber moved from greedy nearest-driver dispatch to batched matching around 2018: instead of assigning each rider the moment they tap, the system holds requests for a few seconds and solves a small bipartite matching problem.
That cost matrix is the soul of dispatch. Every term is a business decision encoded as a number:
Pickup time
The dominant term. Riders cancel at ~5% per extra minute of ETA over 6 min.
Driver detour
How far the driver deviates from their current trajectory or queued next-trip area.
Idle fairness
Reward drivers who have been waiting longer — utilization smoother, churn lower.
Driver quality
Subtracts a small bonus for high-rated, recently-praised drivers (rare ties only).
Earnings balance
Tilts toward drivers under their target hourly earnings — a soft fairness lever.
Cancellation prob.
An ML score: P(driver accepts and completes). Punishes flaky pairs.
The scoring function — explicit, with weights, with a worked example
This is the part most articles wave at and most candidates can't write down. The cost of assigning rider i to driver j, in seconds:
C(i,j) = eta_s + 0.3·detour_s − 0.2·idle_min − 15·rating_bonus − 10·earn_gap_$ + 120·P[cancel]
where rating_bonus = rating/5 and ψ = 120 s is the cancel-risk weight (justified in dollars below).
Now run two candidate drivers through it for the same rider:
| | Driver A — closer but flaky | Driver B — slightly further, solid |
|---|---|---|
| ETA | 180 s | 220 s |
| detour | 20 s | 60 s |
| idle_min | 0 | 12 |
| rating | 4.50 (rating_bonus = 4.5/5 = 0.90) | 4.95 (rating_bonus = 4.95/5 = 0.99) |
| earn_gap_$ | 0 | 8 |
| P[cancel] | 0.12 | 0.02 |
C(i,A) = 180 + 0.3·20 + 0 + (−15)·0.9 + (−10)·0 + 120·0.12
= 180 + 6 + 0 + (−13.5) + 0 + 14.4
= 186.9 s
C(i,B) = 220 + 0.3·60 + (−0.2)·12 + (−15)·0.99 + (−10)·8 + 120·0.02
= 220 + 18 + (−2.4) + (−14.85) + (−80) + 2.4
= 143.15 s
Hungarian picks B — even though A's raw ETA is 40 s lower —
because A's cancel risk and earnings balance dominate the difference.
That's the moment the article stops being "match the closest driver" and starts being an algorithm. The interviewer is listening for whether you know which terms dominate when: ETA dominates in low-density cities; cancel-risk dominates in markets with surge events; earnings balance dominates when driver supply is fragile. A senior candidate names the regime before they name the formula.
Why this matters in dollars: a 1-percentage-point drop in completed-trip rate at 30M trips/day ≈ 300k lost trips/day. At an average $14 fare that's ~$1.5B/year of marketplace GMV. The cancel-risk weight ψ exists because a single percentage point of cancellation is worth more than every other term in the cost matrix combined.
The objective function, named explicitly
The interview-grade phrasing of the dispatch problem is:
minimize Σᵢⱼ C(i,j)·x(i,j)
subject to Σⱼ x(i,j) ≤ 1 for each rider i (at most one driver per rider)
Σᵢ x(i,j) ≤ 1 for each driver j (at most one rider per driver)
x(i,j) = 0 wherever eta_s(i,j) exceeds the pickup-ETA cap (~8 min) — the hard constraint
x(i,j) ∈ {0, 1}, solved over the requests held in the current 2 s batch.
Most candidates can describe "match riders to drivers". The senior signal is naming this as a constrained optimization, knowing the constraints are hard (ETA cap) vs soft (penalize-don't-forbid), and being able to explain why the greedy approximation is provably worse on the dual.
What a junior would write
- For each request as it arrives, scan all drivers in city, pick closest by haversine.
- Lock the driver row, mark assigned, return.
- O(M) per request — at ~1 µs per haversine, 5M drivers ≈ 5 s on the hot path.
Why it dies: linear scan, write-amplified row locks, and locally-greedy choices that increase global ETA. Average pickup ETA is ~18% worse than batched. Cancellation rate climbs.
What ships at scale
- Hold requests for ~2 s in a per-city batch.
- Each request fetches a ~25-driver shortlist via H3 k-ring lookup (O(1) on the index).
- Build sparse N×M cost matrix; solve with Hungarian / min-cost flow in < 200 ms.
Why it wins: reads are O(1) per request, the global optimum beats the greedy by ~15–20% on average ETA, and the 2 s batch window is below the perceptual cancellation threshold.
The dispatch SQL — what's running on the hot path
This is the candidate-shortlist query. It decides who gets a ride in the next two seconds. Every clause maps to a business stake: the k=2 ring bounds rider wait time; the 20-second ping freshness filter prevents handing a rider to a driver whose phone died; the LIMIT 25 caps the size of the cost matrix so the solver finishes in < 200 ms. It runs against an in-memory state store (Pinot / Aerospike / a custom Go service backed by RocksDB), not Postgres — but the SQL form is what you'll be asked to write at a whiteboard.
-- Inputs: :rider_h3_r9, :rider_lat, :rider_lng, :product = 'UberX'
WITH neighbor_hexes AS (
SELECT h3 FROM h3_kring(:rider_h3_r9, 2) -- 19 hexes around rider
),
nearby_drivers AS (
SELECT d.driver_id,
d.lat, d.lng,
d.last_ping_ts,
d.product_eligible,
d.rating,
d.shift_hours_today,
d.target_earnings_gap_usd,
d.recent_cancel_rate
FROM driver_state d
JOIN neighbor_hexes nh ON d.h3_r9 = nh.h3
WHERE d.status = 'ONLINE_IDLE'
AND :product = ANY(d.product_eligible)
AND d.last_ping_ts > NOW() - INTERVAL '20 seconds'
)
SELECT driver_id,
haversine(lat, lng, :rider_lat, :rider_lng) AS crow_dist_m,
eta_seconds(driver_id, :rider_lat, :rider_lng) AS eta_s,
rating, recent_cancel_rate, target_earnings_gap_usd
FROM nearby_drivers
ORDER BY eta_s ASC
LIMIT 25;
That shortlist of ~25 drivers per rider feeds into the batch optimizer. With ~N riders and ~M drivers in a city in a 2-second window, you build an N×M cost matrix (sparse, because each rider only links to its 25 candidates) and solve min-cost bipartite matching.
- driver_state is partitioned by H3 r6/r7 with a co-located in-memory index. If you naively put it in Postgres on PK = driver_id, this query plans a full scan and dies at ~50k drivers/region.
- The k-ring lookup is O(1) — 19 hex IDs joined against a hash index, not a geo predicate. Rewriting this as ST_DWithin on a PostGIS table runs ~200× slower and serializes on shared GIST locks during high write pressure.
- LIMIT 25 is not cosmetic — it bounds the cost-matrix column count for the Hungarian solver, which is O(N³). At N×M = 500×500 the solver runs in ~80 ms; at 500×5000 it's ~80 s and dispatch misses its 2-second SLA.
The stadium-event meltdown — a story in numbers
Same query, same code. Tuesday at 2 PM it serves p99 in 6 ms. Wednesday at 10 PM, the moment the stadium gates open, the same query on the same hex goes vertical.
Why it breaks: all 12k requests have the same partition key (the stadium hex). The Pinot segment that owns that hex is on one broker; CPU saturates; consumer lag on the upstream Kafka partition climbs to 90 s; surge can't update because Flink's keyed stream for that hex is bottlenecked on the same shard. Skew kills the cluster while every other hex in the city is idle.
Three things that fix it (in order of cost):
- Pre-warm. The schedule is known — the system pre-fetches the hex's driver state into a hot read replica 10 minutes before doors open.
- Salt the partition. Temporarily shard the stadium hex across 8 partitions (h3 || (rand % 8)), spread the load, re-merge in the matcher.
- Geo-fence the dispatch. Switch this hex to a geofenced batch with a 5 s window and a wider k-ring; accept the +3 s perceived delay because the alternative is mass cancellation.
The interview-grade lesson: "the SQL ran fine in dev" is the most dangerous sentence in marketplace engineering. Skew is not a tail risk — it's a daily event. Every dispatch query must be designed assuming one hex will get 1000× the traffic of its neighbors at some unpredictable moment.
Deep-dive — the Hot Partition problem, taught from first principles
Most candidates have heard of skew but can't draw it. This is the topic that separates "I read a blog" from "I've been on call". Worth slowing down for.
What is a hot partition? In a distributed system, data is split across many shards so different servers can share the load. The router sends each event to a shard based on a partition key. A hot partition happens when one key suddenly carries a disproportionate fraction of the traffic — so one server melts while every other server in the cluster is idle.
In Uber's stack the partition key for ride requests is the H3 hex ID. That's a deliberate choice — it means all events for a geographic area land on the same Flink worker, the same Pinot segment, the same dispatch shard, which is what makes neighbor-lookups O(1). The cost of that choice is that 12,000 people in one stadium all hash to the same hex, and therefore the same shard.
-- Looks like a healthy point-lookup. Isn't.
SELECT *
FROM ride_requests
WHERE h3_index = '85283473fffffff' -- the stadium hex
AND request_ts > NOW() - INTERVAL '3 minutes';
If ride_requests is partitioned by h3_index, this query reads from exactly one Pinot segment / Cassandra partition / Kafka partition. When 12,000 rows match, that one shard does 100% of the work. The CPU on the host owning that segment goes to 100%, p99 read latency climbs from 6 ms to 850 ms, and every other shard sits at 4% utilization. The cluster has the capacity to handle the load — it just can't reach it.
import random
# 3 partitions, each can drain 10,000 events/min before lag accumulates
PARTITIONS = ["A", "B", "C"]
DRAIN_PER_MIN = 10_000
def route_naive(hex_id):
"""Plain hash on the hex — every stadium request goes to one shard."""
return PARTITIONS[hash(hex_id) % len(PARTITIONS)]
def route_salted(hex_id, salt_buckets=8):
"""Append a small random salt — fans the hot hex across many partitions."""
salt = random.randint(0, salt_buckets - 1)
return PARTITIONS[hash(f"{hex_id}#{salt}") % len(PARTITIONS)]
def simulate(events, route_fn):
in_q = {p: 0 for p in PARTITIONS} # incoming this minute
lag = {p: 0 for p in PARTITIONS} # backlog carried over
for ev in events:
in_q[route_fn(ev["h3"])] += 1
for p in PARTITIONS:
delivered = min(in_q[p], DRAIN_PER_MIN)
lag[p] = (in_q[p] - delivered) + lag[p] # whatever the shard couldn't drain
return in_q, lag
# Stadium scenario: 12,000 events from one hex, plus 600 ambient from random hexes
stadium_events = (
[{"h3": "85283473fffffff"} for _ in range(12_000)] +
[{"h3": f"hex_{random.randint(0,99)}"} for _ in range(600)]
)
naive_in, naive_lag = simulate(stadium_events, route_naive)
salted_in, salted_lag = simulate(stadium_events, route_salted)
print("NAIVE routing in:", naive_in, " lag:", naive_lag)
print("SALTED routing in:", salted_in, " lag:", salted_lag)
NAIVE routing in: {'A': 12200, 'B': 211, 'C': 189} lag: {'A': 2200, 'B': 0, 'C': 0}
SALTED routing in: {'A': 4225, 'B': 4180, 'C': 4195} lag: {'A': 0, 'B': 0, 'C': 0}
Inline definitions for the terms in this section
- keyBy(h3) guarantees all events for a given hex are processed by the same TaskManager (so windows and aggregates are correct). Cost: that one TaskManager is a single point of saturation.
- Salting: h3 || "#" || (rand % 8). Splits one hot partition across N "shadow" partitions; downstream re-aggregation sums them back together. Costs O(N) more work in the merge step but unlocks all the cluster's idle capacity.
- Salting only works if the downstream operator can re-aggregate the salted shards. SUM, COUNT, HLL — yes. Exactly-once last-write-wins on a single state row — no.
- It breaks per-key ordering. If your business logic depends on "the latest event for hex H wins", salting destroys that guarantee. Combine with sequence numbers.
- It needs an adaptive controller. Salting all keys all the time wastes capacity; salting on demand requires monitoring per-partition lag and triggering re-shuffle. The full production pattern is elastic salting: turn it on for the hex when its lag exceeds threshold, turn it off when lag clears — a sketch follows.
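Here's a minimal sketch of that elastic controller. The thresholds, bucket count, and lag feed are illustrative; production would drive maybe_toggle from per-partition consumer-lag metrics:
import random

LAG_ON, LAG_OFF, SALT_BUCKETS = 5_000, 500, 8
salted_keys = set()

def maybe_toggle(hex_id, consumer_lag):
    """Hysteresis: start salting past LAG_ON, stop once lag clears LAG_OFF."""
    if consumer_lag > LAG_ON:
        salted_keys.add(hex_id)
    elif consumer_lag < LAG_OFF:
        salted_keys.discard(hex_id)

def partition_key(hex_id):
    """Hot keys fan out across SALT_BUCKETS shadow partitions; cold keys don't."""
    if hex_id in salted_keys:
        return f"{hex_id}#{random.randint(0, SALT_BUCKETS - 1)}"
    return hex_id

maybe_toggle("85283473fffffff", consumer_lag=12_000)  # stadium hex starts lagging
print(partition_key("85283473fffffff"))   # e.g. '85283473fffffff#3' — salted
print(partition_key("85283447fffffff"))   # cold hex — unchanged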
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment
INF = 1e9
cost = np.array([
[ 180, 240, INF, 300, INF], # rider 0
[ 220, 150, 280, INF, INF], # rider 1
[ INF, 340, 200, 260, 310], # rider 2
[ INF, INF, 290, 220, 170], # rider 3
])
row_ind, col_ind = linear_sum_assignment(cost)
assignment = pd.DataFrame({
"rider": [f"R{i}" for i in row_ind],
"driver": [f"D{j}" for j in col_ind],
"cost_s": cost[row_ind, col_ind].astype(int),
})
print(assignment)
print("Total system ETA cost (s):", int(cost[row_ind, col_ind].sum()))
rider driver cost_s
0 R0 D0 180
1 R1 D1 150
2 R2 D2 200
3 R3 D4 170
Total system ETA cost (s): 700
R3 got D4 not because D4 happened to be closest to R3 in isolation — D4 was assigned because giving D2 to R2 and D4 to R3 yields the lowest total cost across the whole batch. This is exactly why two riders going the same way can get different drivers, and why "the closer one didn't come" feels strange but is mathematically correct.
Deep-dive — opening up linear_sum_assignment: the Hungarian algorithm in plain English
Calling scipy.optimize.linear_sum_assignment is fine; not knowing what it does is a senior-level red flag. The Hungarian algorithm is the dispatcher's beating heart. Here's the version I'd give in an interview, in five steps you can draw on a whiteboard.
The problem in one sentence. Given an N×N (or N×M, padded with infinities) cost matrix, pick exactly one cell per row and per column such that the total cost is minimized. Brute-force searches N! permutations — at N=20 that's 2.4×10¹⁸ assignments. Hungarian solves it in O(N³).
Worked on a 3×3:
Original cost matrix (rows = riders, cols = drivers):
D0 D1 D2
R0 [ 250 400 350 ] row-min = 250
R1 [ 400 600 350 ] row-min = 350
R2 [ 200 400 250 ] row-min = 200
Step 1 — row reduction:
R0 [ 0 150 100 ]
R1 [ 50 250 0 ]
R2 [ 0 200 50 ] col-min = (0, 150, 0)
Step 2 — column reduction:
R0 [ 0 0 100 ]
R1 [ 50 100 0 ]
R2 [ 0 50 50 ]
Step 3 — minimum lines covering all zeros. The zeros sit at
(R0,D0), (R0,D1), (R1,D2), (R2,D0).
Covering them takes 3 lines (e.g. column D0, row R0, row R1).
3 = n → an optimal all-zero assignment already exists.
Step 4 — read it off, one zero per row and per column:
R0→D1 (0), R1→D2 (0), R2→D0 (0).
(If fewer than n lines had sufficed, you would augment: subtract the
smallest uncovered value from every uncovered cell, add it to every
doubly-covered cell, then re-cover — repeating until n lines are needed.)
Mapping back to original costs: R0→D1 = 400, R1→D2 = 350, R2→D0 = 200.
Total = 950 seconds.
Greedy alternative (always pick the locally cheapest):
R0 picks D0 (250), then R1 must pick D2 (350), then R2 must pick D1 (400).
Total = 1000 seconds. Hungarian beats greedy by 50 s here.
import numpy as np
from scipy.optimize import linear_sum_assignment
cost = np.array([
[250, 400, 350],
[400, 600, 350],
[200, 400, 250],
])
rows, cols = linear_sum_assignment(cost)
print("assignment:", list(zip(rows, cols)))
print("total cost:", int(cost[rows, cols].sum()))
print("greedy :", 250 + 350 + 400)
assignment: [(0, 1), (1, 2), (2, 0)]
total cost: 950
greedy : 1000
What that 50-second difference means at scale: across millions of dispatch decisions per day, the cumulative ETA savings from solving Hungarian instead of greedy is roughly the difference between a 4.2-min and a 5.0-min average pickup. Every 1-minute reduction in average ETA is worth ~3% in completed-trip rate — which is the entire reason batched matching exists.
Why Hungarian, not Min-Cost Flow, in the Python toy?
For the dispatch problem with one rider matched to one driver, the two are equivalent — assignment is a special case of min-cost flow. Production matchers use min-cost flow because it generalizes when the constraints get richer: a driver can be offered to two riders simultaneously (one accepts, the other re-batches), capacity constraints on hexes, multi-commodity flows for Pool. The Hungarian implementation is just the cleanest pedagogical entry point, which is why scipy.optimize.linear_sum_assignment is what every interview question reaches for.
- It's dense. Production cost matrices are sparse — each rider has ~25 candidate drivers, not all M. Sparse min-cost flow is ~50× faster than dense Hungarian at the sizes that matter (N=500 riders × M=2000 drivers).
- It's balanced (N×N). Reality is unbalanced (N riders, M drivers, N≠M). Pad with infinities and the algorithm still works — but you must remember to filter out the infinity-cost assignments from the output (sketch after this list).
- It's static. Reality has new riders and new drivers arriving during the 2 s batch. Production solvers run an incremental variant that warm-starts from the previous batch's assignment.
- It treats infeasibility as "infinity". Better: use a two-phase approach — first satisfy hard constraints (ETA cap, product compatibility), then optimize soft costs only over the feasible subgraph.
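The unbalanced-matrix bullet, concretely — made-up costs, two riders against four drivers, with the INF filter on the way out:
import numpy as np
from scipy.optimize import linear_sum_assignment

INF = 1e9
cost = np.array([
    [240, INF, 310, 205],   # rider 0: driver 1 infeasible (ETA cap)
    [INF, INF, 190, 260],   # rider 1: drivers 0 and 1 infeasible
])
rows, cols = linear_sum_assignment(cost)
matches = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] < INF]
print(matches)   # [(0, 3), (1, 2)] — INF pairs never ship as real assignments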
Greedy gives each new rider the locally best driver and locks it in. Batched waits 1–2 s, builds an N×M matrix, and solves min-cost flow. Internal Uber papers and ridesharing literature (Özkan & Ward, 2020) show batched dispatch reduces average pickup ETA by ~15–20% and lifts completed-trips-per-driver-hour by ~3–5%. The trade-off is a 1–2 s perceived delay, well below the cancellation threshold.
Where does driver_state actually live? You won't say "Postgres". Driver state is a hot, geo-partitioned, write-heavy keyspace updated by every GPS ping (~4 s cadence × millions of drivers). It lives in a sharded in-memory store — Uber's Ringpop-based service stack, or industry equivalents like Aerospike, ScyllaDB, or Apache Pinot for the read side. Partition key = H3 hex at r6/r7 for shard balance. The SQL form is the logical model exposed to the dispatcher; the physical implementation is a custom Go/Java service with sub-millisecond reads.
Edge cases that bite at scale
Driver "teleports" 800 m
Tunnels, urban canyons, malformed Android pings. Solution: median-of-3 smoothing + a Kalman filter; reject any ping that implies > 250 km/h vs the previous one.
Rider double-taps "Request"
Network retry produces two identical events. Solution: idempotency key = (rider_id, pickup_h3, ts_minute); collapse before the matcher sees them.
Driver accepts then doesn't move
3% of offers stall. Solution: 30 s "intention check" — if the driver hasn't moved toward pickup, reassign and penalize cancel_risk in the cost matrix going forward.
Stadium lets out
One hex spikes from 5 to 60k req/min. Solution: salt the partition key for that hex temporarily (h3 || (random % 8)), spread to 8 consumers, re-merge in the matcher.
3. Surge Pricing — Markets, Not Multipliers
Riders see "1.8x". Internally there is no global multiplier — surge is a per-hex, per-product, per-minute price signal whose only job is to clear the market: bring supply up, dampen demand, until queue depth and ETA hit acceptable thresholds. The math is closer to airline yield management than to "Uber raised prices because it was raining".
The signal is bidirectional: surge raises prices to dampen demand and shows drivers a heatmap to pull supply toward the hex. The closed loop typically settles within 3–5 minutes during normal events.
"Just SUM open requests / online drivers"
- Read from the OLTP trips table on every page load.
- No windowing — counts are unbounded since midnight.
- Multiplier swings wildly minute-to-minute.
Dies at: 50–100k req/s read load, oscillating prices, drivers chasing ghosts.
Pre-aggregated, monotonic, clamped
- Flink keyed stream per H3 r5 maintains 60 s sliding windows.
- Multiplier is monotonic in each input → never decreases on bad signal.
- Clamped to [1.0, 3.0] so rare bugs can't 100×-charge a rider.
Why it wins: O(1) read at the dispatcher, smooth price response, blast-radius bounded by the clamp.
-- streaming view (Flink / Materialize / ksqlDB style)
WITH supply_60s AS (
SELECT h3_r8,
COUNT(DISTINCT driver_id) FILTER (WHERE status = 'ONLINE_IDLE') AS idle_drivers
FROM driver_pings
WHERE ping_ts >= NOW() - INTERVAL '60 seconds'
GROUP BY h3_r8
),
demand_60s AS (
SELECT h3_r8,
COUNT(*) FILTER (WHERE state = 'OPEN') AS open_requests,
AVG(quoted_eta_s) AS avg_eta_s,
AVG(CASE WHEN cancelled THEN 1 ELSE 0 END) AS cancel_rate
FROM ride_requests
WHERE request_ts >= NOW() - INTERVAL '60 seconds'
GROUP BY h3_r8
)
SELECT d.h3_r8,
d.open_requests,
s.idle_drivers,
ROUND(d.open_requests::numeric / NULLIF(s.idle_drivers, 0), 2) AS d_over_s,
d.avg_eta_s,
d.cancel_rate,
LEAST(3.0, GREATEST(1.0,
1.0
+ 0.45 * GREATEST(0, (d.open_requests::numeric / NULLIF(s.idle_drivers, 0)) - 1.0)
+ 0.020 * GREATEST(0, d.avg_eta_s - 360)
+ 0.80 * GREATEST(0, d.cancel_rate - 0.10)
)) AS surge_multiplier
FROM demand_60s d
LEFT JOIN supply_60s s USING (h3_r8);
Three things to notice. The formula is bounded ([1.0, 3.0] in most markets), it is monotonic in each input (so it can never decrease price when ETA worsens), and the coefficients are learned per city from historical elasticity. New York's response is not San Francisco's.
- The WHERE ping_ts >= NOW() - INTERVAL '60 seconds' looks innocent, but on a static table it forces a full sequential scan every minute against ~75M rows. As a Flink keyed stream over a sliding window, it becomes O(events arriving) — only the delta is processed.
- COUNT(DISTINCT driver_id) is a hash-set per partition. At hot-hex pressure that hash-set blows past the JVM heap on a single TaskManager. Replace with HyperLogLog (APPROX_COUNT_DISTINCT) — accepts ±2% error for O(1) memory.
- The NULLIF(s.idle_drivers, 0) is not just safety — it's the only reason the formula doesn't blow up on hexes with zero supply: the division yields NULL instead of an error or NaN. That NULL propagates: surge renders as "—", the rider app shows nothing, and conversion drops by ~7 pp on the affected hexes.
Why Python here, not SQL — the simulator's job
SQL gives you the steady-state answer at any tick. It can't tell you whether the surge controller is stable under a demand spike. For that you need to run the closed loop forward in time: shock the system, watch the price, watch the supply respond, watch whether it converges or oscillates. That's a job for Python — specifically, a tiny discrete-event simulator that mirrors what the production Flink job does but lets you run 1,000 different parameter sets in seconds. This is the actual difference between SQL and Python in the Uber stack: SQL queries the world that exists; Python builds the world that doesn't yet exist so you can see if it would survive.
A demand-spike simulator — the Python job that surge tuning actually uses
import numpy as np, pandas as pd
EPS_RIDER = -0.6 # +1% price → -0.6% demand
EPS_DRIVER = +1.2 # +1% price → +1.2% supply within 5 min
base_demand = 1000
base_supply = 600
mults = np.linspace(1.0, 2.4, 8)
rows = []
for m in mults:
pct = (m - 1.0) * 100
demand = base_demand * (1 + EPS_RIDER * pct/100)
supply = base_supply * (1 + EPS_DRIVER * pct/100)
fulfilled = min(demand, supply)
rows.append({"surge": f"{m:.1f}x", "demand": int(demand), "supply": int(supply),
"fulfilled": int(fulfilled),
"fill_rate": f"{100*fulfilled/demand:0.1f}%"})
print(pd.DataFrame(rows).to_string(index=False))
surge demand supply fulfilled fill_rate
1.0x 1000 600 600 60.0%
1.2x 880 744 744 84.5%
1.4x 760 888 760 100.0%
1.6x 640 1032 640 100.0%
1.8x 520 1176 520 100.0%
2.0x 400 1320 400 100.0%
2.2x 280 1464 280 100.0%
2.4x 160 1608 160 100.0%
The market clears around 1.4x in this toy. Going higher than that doesn't help fulfillment — it transfers wealth from riders to drivers and risks a customer-trust event. Real Uber surge is tuned to find the lowest multiplier that clears the queue, not the highest the market can bear.
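"Lowest multiplier that clears the queue" is a root-finding problem. A sketch on the same toy elasticities — bisection stands in for whatever solver production actually uses:
EPS_RIDER, EPS_DRIVER = -0.6, 1.2
base_demand, base_supply = 1000, 600

def fill_gap(m):
    """Supply minus demand at multiplier m; >= 0 once the market clears."""
    demand = base_demand * (1 + EPS_RIDER * (m - 1))
    supply = base_supply * (1 + EPS_DRIVER * (m - 1))
    return supply - demand

lo, hi = 1.0, 3.0
for _ in range(40):                  # bisect the clearing point
    mid = (lo + hi) / 2
    lo, hi = (lo, mid) if fill_gap(mid) >= 0 else (mid, hi)
print(f"clearing multiplier ≈ {hi:.2f}x")  # ≈ 1.30x; the 0.2x grid above rounds it up to 1.4x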
The closed loop — does it actually converge under a shock?
Steady-state is comforting. Production isn't steady-state. Here's the simulator I'd actually run before changing a surge coefficient: shock the system with a 5× demand spike at minute 5 and watch whether the controller dampens or oscillates.
import numpy as np
def simulate(spike_factor=5.0, kp=0.45, smooth=0.3, minutes=20, dt_s=10):
"""Discrete-event surge simulator for a single hex.
Returns time-series of (demand, supply, surge, unmet_requests).
kp : proportional gain (≡ the 0.45 coefficient in the SQL)
smooth : EMA smoothing applied to surge to prevent oscillation (0=none, 1=frozen)
"""
steps = int(minutes * 60 / dt_s)
base_d, base_s = 100, 80 # per-minute baseline
d_arr = np.full(steps, base_d, dtype=float)
d_arr[steps//4 : steps//4 + 12] *= spike_factor # 2-min spike at t=5min
surge = np.ones(steps); supply = np.full(steps, base_s, dtype=float); unmet = np.zeros(steps)
EPS_R, EPS_D = -0.6, 1.2 # elasticities
for t in range(1, steps):
d_t = d_arr[t] * (1 + EPS_R * (surge[t-1] - 1)) # rider response (instant)
s_t = supply[t-1] + (base_s * (1 + EPS_D * (surge[t-1]-1)) - supply[t-1]) * 0.20 # supply response (lagged ~5 dt)
ratio = d_t / max(s_t, 1)
raw = 1.0 + kp * max(0, ratio - 1.0)
raw = min(3.0, max(1.0, raw)) # clamp
surge[t] = smooth * surge[t-1] + (1 - smooth) * raw # EMA smoothing
supply[t] = s_t
unmet[t] = max(0, d_t - s_t)
return d_arr, supply, surge, unmet
# Run two configs: aggressive (no smoothing) vs production (smoothed)
agg = simulate(spike_factor=5.0, kp=0.45, smooth=0.0)
prod = simulate(spike_factor=5.0, kp=0.45, smooth=0.3)
import pandas as pd
def summarize(name, run):
d, s, surge, unmet = run
return {"config": name,
"max_surge": round(surge.max(), 2),
"settle_min": round(np.argmax(surge[len(surge)//2:] < surge.max()*0.6) * 10/60, 1),
"unmet_pct": round(100 * unmet.sum() / d.sum(), 1),
"oscillation_amp": round(surge[len(surge)//4:].max() - surge[len(surge)//4:].min(), 2)}
print(pd.DataFrame([summarize("aggressive", agg), summarize("production", prod)]).to_string(index=False))
config max_surge settle_min unmet_pct oscillation_amp
aggressive 2.95 3.5 18.2 1.42
production 2.10 5.0 12.7 0.18
The smoothed controller trades ~1.5 minutes of settle time for an ~8× smaller oscillation amplitude and a third less unmet demand — the EMA term is what stops drivers chasing a price that flickers.
Deep-dive — streaming windows, watermarks & the late-event problem
The surge SQL has a clause: WHERE ping_ts >= NOW() - INTERVAL '60 seconds'. That clause is doing a lot of work. Inside Flink it's not a filter — it's a window, and the window has to decide when to close. Get that wrong and surge either lies or arrives late. This is the streaming concept candidates fumble most often.
What is a window? A way to slice an unbounded stream into bounded pieces you can aggregate over. Three flavors:
Tumbling — non-overlapping, fixed size
"Every 60 s, emit the count of pings since the previous emission." Each event belongs to exactly one window. Good for billing, simple counts.
Sliding (hopping) — overlapping, fixed size
"Every 10 s, emit the count of pings in the last 60 s." Each event belongs to multiple windows. What surge actually uses — smoother updates.
Session — activity-based, dynamic
"Group events for a key until there's a 5-min gap." Each window's length is data-driven. Rider-trip sessionization uses this.
What is a watermark? A timestamp that says "events older than this are assumed to have all arrived". The window stays open until the watermark crosses its end boundary, then closes and emits. The trick: watermarks are heuristics, not facts. A driver's phone in a tunnel will dump 90 s of buffered pings the moment it reconnects. Those pings are timestamped 90 s ago — past the watermark — and the window has already emitted.
-- Flink SQL: a sliding 60s window advancing every 10s, watermarked
-- to allow ~3s of normal network jitter, with 120s of lateness for tunnels.
CREATE TABLE driver_pings (
driver_id BIGINT,
h3_r8 BIGINT,
speed_kmh DOUBLE,
status STRING, -- 'ONLINE_IDLE' etc. — referenced by the view below
ping_ts TIMESTAMP(3),
WATERMARK FOR ping_ts AS ping_ts - INTERVAL '3' SECOND -- jitter buffer
) WITH (...);
CREATE VIEW supply_60s AS
SELECT h3_r8,
window_start, window_end,
COUNT(DISTINCT driver_id) FILTER (WHERE status='ONLINE_IDLE') AS idle_drivers
FROM TABLE(HOP(
TABLE driver_pings,
DESCRIPTOR(ping_ts),
INTERVAL '10' SECOND, -- slide
INTERVAL '60' SECOND -- size
))
GROUP BY h3_r8, window_start, window_end;
-- The "late event" knob lives on the sink, not the window:
-- ALLOWED_LATENESS = 120s and the sink must be UPSERT-capable.
SET 'table.exec.emit.late-fire.delay' = '120 s';
import pandas as pd
# Stream of pings: the driver in the tunnel emits 6 pings at t=10s but they
# don't actually arrive until t=95s (after the t=60s window has closed).
events = pd.DataFrame([
# ping_ts (event time) , arrival_ts (processing time)
{"driver":"D1","ping_ts": 5, "arrival_ts": 6},
{"driver":"D2","ping_ts": 12, "arrival_ts": 13},
{"driver":"D3","ping_ts": 25, "arrival_ts": 26},
{"driver":"D4","ping_ts": 40, "arrival_ts": 41},
{"driver":"D5","ping_ts": 55, "arrival_ts": 56},
# tunnel driver — events stamped 10s but arrived at 95s
{"driver":"D9","ping_ts": 10, "arrival_ts": 95},
{"driver":"D9","ping_ts": 20, "arrival_ts": 95},
{"driver":"D9","ping_ts": 30, "arrival_ts": 95},
{"driver":"D9","ping_ts": 40, "arrival_ts": 95},
{"driver":"D9","ping_ts": 50, "arrival_ts": 95},
{"driver":"D9","ping_ts": 58, "arrival_ts": 95},
])
WINDOW_END = 60 # window covers [0, 60)
def strategy_drop(ev):
"""Naive: window emits at t=63 (watermark + 3s jitter); D9 arrives at 95 → dropped."""
in_window = ev[(ev.ping_ts < WINDOW_END) & (ev.arrival_ts <= 63)]
return in_window.driver.nunique(), 0
def strategy_lateness(ev, allowed=120):
"""Allowed-lateness: keep the window OPEN for 120s past close. D9 arrives at 95 → counted."""
in_window = ev[(ev.ping_ts < WINDOW_END) & (ev.arrival_ts <= 63 + allowed)]
return in_window.driver.nunique(), 0
def strategy_retraction(ev):
"""Production: emit at t=63, then RETRACT and re-emit when D9 arrives."""
initial = ev[(ev.ping_ts < WINDOW_END) & (ev.arrival_ts <= 63)].driver.nunique()
final = ev[(ev.ping_ts < WINDOW_END) & (ev.arrival_ts <= 200)].driver.nunique()
return final, (final - initial) # n_correct, n_retracted
print("DROP :", strategy_drop(events)) # (5, 0) — 6 missing
print("LATENESS :", strategy_lateness(events)) # (6, 0) — correct, but rider waited 2 min
print("RETRACTION :", strategy_retraction(events)) # (6, 1) — fast initial emit, corrected later
DROP : (5, 0)
LATENESS : (6, 0)
RETRACTION : (6, 1)
Each strategy fails in a different way:
- Drop — fastest emit, but the late driver is invisible. Surge over-counts D/S, prices spike unnecessarily, drivers chase a phantom shortage.
- Lateness — accurate but the window is blocked for 2 minutes waiting for stragglers. Surge updates every 10 s become surge updates every 130 s. Marketplace lag.
- Retraction — emit fast (the dispatcher gets a number now), then issue a corrected value when the late event lands. The downstream sink (Pinot upsert table) overwrites the old row. This is what production runs.
You can have any two of: fast, complete, simple. Drop = fast + simple, not complete. Lateness = complete + simple, not fast. Retraction = fast + complete, not simple. The interview move is to name the trade-off, then pick retraction, then say what the downstream sink must support to make retraction safe (idempotent UPSERT, primary key includes (h3_r8, window_end), no append-only consumers downstream).
Visualizing a city's surge as a hex heatmap
Same row = north-to-south band of a city; same column = east-to-west band. Each cell is one r7 hex, colored by current multiplier. This is what the driver heatmap renders.
| | W3 | W2 | W1 | Center | E1 | E2 | E3 |
|---|---|---|---|---|---|---|---|
| N3 | 1.0x | 1.0x | 1.2x | 1.5x | 1.2x | 1.0x | 1.0x |
| N2 | 1.0x | 1.2x | 1.5x | 1.9x | 1.6x | 1.1x | 1.0x |
| N1 | 1.1x | 1.4x | 1.9x | 2.4x | 2.0x | 1.5x | 1.2x |
| Ctr | 1.2x | 1.6x | 2.5x | 2.8x | 2.1x | 1.6x | 1.2x |
| S1 | 1.0x | 1.2x | 1.6x | 1.9x | 1.5x | 1.2x | 1.0x |
| S2 | 1.0x | 1.0x | 1.2x | 1.5x | 1.2x | 1.0x | 1.0x |
Surge is computed at H3 r8 granularity (~0.7 km²). "Across the street" can be the boundary between two hexes with different supply/demand microclimates — one hex contains a stadium exit, the other contains a residential block where 6 drivers happen to be idle. Also: surge is per-product (UberX vs Comfort vs XL) and per-rider-cohort (loyalty / promo), so two riders in the same hex can see different prices for legitimate reasons.
What does the production plumbing look like? Kafka topics for driver_pings and ride_requests, partitioned by H3 r5 (so all events for a parent hex go to the same Flink keyed stream). A Flink job maintains per-hex sliding 60 s windows and emits (h3_r8, surge, ts) tuples to a second Kafka topic. The dispatcher and the rider app consume that topic via a low-latency store (Redis / Pinot). End-to-end p99 under 3 s.
Edge cases that bite surge
Driver pings arrive 90 s late
Window already emitted 1.0x. Solution: allowedLateness=120s on the surge window + retraction stream → downstream Pinot upserts the corrected value. Idempotent sinks only.
Lightning strike → 30s blackout
All drivers go offline simultaneously. Naive formula spikes to 3.0x. Solution: rate-of-change cap on the multiplier (max +0.3x per 60 s) + freeze if >X% of drivers in the city dropped in the same minute (likely network event, not real demand).
Hex with 0 drivers all morning
D/S division by zero. Solution: NULL-safe formula and a floor on supply (assume 1 phantom driver) so prices don't snap to 3.0x in a hex that simply has no traffic at all.
Rider straddles two hexes
One ping says hex A (1.0x), next says hex B (1.8x). Solution: snap rider to the hex they were in for the longest fraction of the last 30 s; show the lower of the two adjacent multipliers as the displayed quote.
"Driver arriving in 4 minutes" is a probability distribution disguised as a number.
— Layer 1 of the ETA stack, before L2 and L3 fix it
4. ETA — The Most Lied-About Number on the App
"Driver arriving in 4 minutes" is a probability distribution disguised as a number. Uber's ETA stack has three layers, and each one solves a different failure mode of the previous one.
Map ETA
Pure routing-engine answer: shortest-path / Dijkstra / contraction hierarchies on the road graph, weighted by free-flow speeds.
Traffic-Adjusted
Edge weights replaced by live speed estimates from the last N minutes of GPS pings on the same road segment.
ML Residual
A Gradient-Boosted Tree (or DeepETA-style transformer) predicts the residual: how much L2 will be wrong given hour, weather, event, driver behavior.
L3 is where Uber publicly claims the biggest wins — DeepETA reduced absolute ETA error by ~12% over the prior GBT. The trick is that L1+L2 already get you 80% of the way; L3 just learns the systematic bias.
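What "learns the residual" means mechanically, on synthetic data — sklearn's GBT stands in for DeepETA, and the features, coefficients, and noise levels are invented for illustration:
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 5000
hour = rng.integers(0, 24, n)
raining = rng.integers(0, 2, n)
eta_l2 = rng.uniform(120, 900, n)                    # the traffic-adjusted quote
rush = ((hour >= 17) & (hour <= 19)).astype(float)
actual = eta_l2 * (1 + 0.12 * rush + 0.06 * raining) + rng.normal(0, 20, n)

model = GradientBoostingRegressor().fit(
    np.column_stack([hour, raining, eta_l2]),        # context features
    actual - eta_l2,                                 # target = residual, not the ETA itself
)
q = np.array([[18, 1, 288.0]])                       # 6pm, raining, L2 said 288 s
print(f"L3 ETA ≈ {288.0 + model.predict(q)[0]:.0f} s")  # ≈ 288 × 1.18 ≈ 340 s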
"Just route on the road graph"
- Static edge weights from speed-limit data.
- No real-time signal, no ML residual.
- Median absolute error ~3–4 minutes in any urban core.
Dies at: rush hour. The system tells riders 4 minutes; the actual ETA is 9. Cancellation rate doubles.
Layered with traffic + ML residual
- L2 swaps free-flow speeds for live segment speeds (3 / 15 / 60 min rolling).
- L3 transformer learns systematic bias by hour, weather, event, driver behavior.
- Median abs error compresses to ~30–45 s.
Why it wins: the rider trusts the number, the dispatcher trusts the cost matrix, and the cancellation curve flattens.
-- Each driver ping snaps to a road segment via map-matching (HMM/Viterbi).
-- This view exposes the rolling speed estimate per segment.
CREATE OR REPLACE VIEW segment_speed_live AS
SELECT
segment_id,
h3_r7, -- spatial partition
AVG(speed_kmh) FILTER (WHERE ping_ts > NOW() - INTERVAL '3 min') AS speed_3m,
AVG(speed_kmh) FILTER (WHERE ping_ts > NOW() - INTERVAL '15 min') AS speed_15m,
AVG(speed_kmh) FILTER (WHERE ping_ts > NOW() - INTERVAL '60 min') AS speed_60m,
COUNT(*) FILTER (WHERE ping_ts > NOW() - INTERVAL '3 min') AS samples_3m,
NOW() AS computed_at
FROM snapped_pings
WHERE ping_ts > NOW() - INTERVAL '60 minutes'
GROUP BY segment_id, h3_r7;
import numpy as np, pandas as pd
# road segments along a candidate route
segs = pd.DataFrame({
"segment_id": [101, 102, 103, 104, 105],
"length_m": [320, 410, 600, 220, 540],
"freeflow_kmh": [ 50, 50, 60, 40, 50],
"live_kmh": [ 28, 22, 35, 15, 30], # current traffic
})
# L1: free-flow ETA
segs["eta_L1_s"] = segs.length_m / (segs.freeflow_kmh * 1000/3600)
# L2: traffic-adjusted ETA
segs["eta_L2_s"] = segs.length_m / (segs.live_kmh * 1000/3600)
# L3: ML residual, e.g. predicted +18% during 6pm Friday rain
RESIDUAL = 0.18
segs["eta_L3_s"] = segs.eta_L2_s * (1 + RESIDUAL)
print(segs.round(1))
print(f"\\nL1 total: {segs.eta_L1_s.sum():.0f}s "
f"L2 total: {segs.eta_L2_s.sum():.0f}s "
f"L3 total: {segs.eta_L3_s.sum():.0f}s")
segment_id length_m freeflow_kmh live_kmh eta_L1_s eta_L2_s eta_L3_s
0 101 320 50 28 23.0 41.1 48.5
1 102 410 50 22 29.5 67.1 79.2
2 103 600 60 35 36.0 61.7 72.8
3 104 220 40 15 19.8 52.8 62.3
4 105 540 50 30 38.9 64.8 76.5
L1 total: 147s L2 total: 288s L3 total: 339s
L1 says 2:27, L2 says 4:48, L3 says 5:39. The honest ETA shown to the rider is L3 — and it'll be ~ ±45 s off, which is the published p50 error band for DeepETA-class models.
Edge cases that bite ETA
GPS sticks the driver to the wrong parallel road
Highway vs frontage road look similar from above. Solution: HMM/Viterbi map-matching with road-class priors and turn-restriction penalties, not nearest-segment snap.
Segment has 0 GPS samples in last 3 min
Don't fall back to free-flow blindly. Solution: cascade to 15-min, then 60-min, then historical-by-hour; mark the ETA confidence as degraded so dispatch can de-prioritize that route (sketch after this list).
Bridge closes for an hour
L2 will eventually learn from the speed drop, but slowly. Solution: external incident feed → hard-cut the segment from the routing graph within 60 s; retro-flag any in-flight trips on it.
Concert lets out, traffic explodes everywhere at once
L2 drifts but L3 hasn't seen this exact event before. Solution: ML training set includes event proximity features; otherwise ETA quotes lag reality by 5–10 minutes for the first 20 min of the spike.
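The fallback cascade from the zero-samples edge case, as a function — a sketch whose field names mirror the segment_speed_live view above:
def segment_speed(row, hist_by_hour):
    """Freshest estimate wins; confidence degrades down the cascade."""
    if row.get("samples_3m", 0) > 0:
        return row["speed_3m"], "fresh"
    if row.get("speed_15m") is not None:
        return row["speed_15m"], "stale_15m"
    if row.get("speed_60m") is not None:
        return row["speed_60m"], "stale_60m"
    return hist_by_hour[row["hour"]], "historical"   # last resort — flag degraded

row = {"samples_3m": 0, "speed_15m": None, "speed_60m": 31.0, "hour": 18}
print(segment_speed(row, hist_by_hour={18: 27.5}))   # (31.0, 'stale_60m')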
5. Routing — Which Roads Should the Driver Take?
Once an ETA exists, routing is the inverse problem: find the path that minimizes that ETA. Doing Dijkstra naïvely over a continent-sized road graph is hopeless; production systems use contraction hierarchies (CH) or customizable contraction hierarchies (CCH) that pre-process the graph into a layered structure where queries are sub-millisecond.
Deep-dive — Contraction Hierarchies, taught from first principles
This is the topic interviewers love because it filters out candidates who only know "Dijkstra". The math is a little intimidating; the intuition is not.
The problem. A continent-scale road graph has ~50 M nodes (intersections) and ~100 M edges (road segments). Dijkstra from any source explores nodes in order of distance — for a cross-city query that's tens of millions of nodes touched. On real hardware: ~600–900 ms per query. Uber needs ~10 ms per query at >100k QPS globally. Dijkstra cannot get there no matter how you tune it.
The idea. Pre-compute the graph into a hierarchy. Most road queries traverse a highway in the middle and only touch local streets at the ends — so collapse the graph by repeatedly removing "unimportant" nodes and adding shortcuts that preserve shortest-path distances. At query time, you only ever search upward in the hierarchy from source and target, then meet in the middle. Number of nodes touched drops from ~10 M to ~hundreds.
import heapq, time
# Toy graph: 8 nodes; "level" = importance rank (higher = kept in hierarchy).
graph = {
"A":[("B",4),("E",2)], "B":[("A",4),("C",3),("F",2)], "C":[("B",3),("D",2),("G",2)],
"D":[("C",2),("H",3)], "E":[("A",2),("F",1)], "F":[("E",1),("B",2),("G",2)],
"G":[("F",2),("C",2),("H",1)], "H":[("G",1),("D",3)],
}
level = {"A":3, "D":3, "B":2, "C":2, "E":1, "F":1, "G":1, "H":1}
def dijkstra_full(src, dst):
"""Touches every reachable node up to dst's distance."""
dist={n: float("inf") for n in graph}; dist[src]=0; pq=[(0,src)]; touched=set()
while pq:
d,u = heapq.heappop(pq); touched.add(u)
if u==dst: return dist[dst], len(touched)
for v,w in graph[u]:
if d+w < dist[v]: dist[v]=d+w; heapq.heappush(pq,(dist[v],v))
return dist[dst], len(touched)
def ch_upward(src, dst):
"""Only relax edges that go to a HIGHER-LEVEL node (the CH query rule)."""
dist={n: float("inf") for n in graph}; dist[src]=0; pq=[(0,src)]; touched=set()
while pq:
d,u = heapq.heappop(pq); touched.add(u)
if u==dst: return dist[dst], len(touched)
for v,w in graph[u]:
if level[v] >= level[u] and d+w < dist[v]:
dist[v]=d+w; heapq.heappush(pq,(dist[v],v))
return dist[dst], len(touched)
print("Dijkstra A→D :", dijkstra_full("A","D"))
print("CH upward A→D:", ch_upward("A","D"))
Dijkstra A→D : (9, 8) ← visited every node
CH upward A→D: (9, 4) ← visited only A→B→C→D path through high-level spine
On a toy graph the win is 2×. On a continent-scale graph the win is ~50,000×, because the search complexity is dominated by the height of the hierarchy (≈ log of nodes) instead of the diameter of the graph. That's why production routing engines hit p99 of 8–15 ms on cross-country queries.
Production reality: why Uber uses Customizable CH (CCH), not plain CH
Plain CH bakes edge weights into the hierarchy at pre-processing time. Update one weight (a road closes, traffic surges) and you re-build the whole hierarchy — hours of compute. CCH separates the hierarchy (graph topology, fixed) from the metric (edge weights, mutable). Pre-process once; re-customize the metric in seconds when traffic shifts. That's the only version that survives a system where every segment's weight changes every 60 seconds.
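The separation in miniature — a sketch, not the real CCH customization algorithm: topology and levels are frozen at pre-processing, while the metric is a plain dict you can swap wholesale when traffic shifts:
import heapq

edges = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}   # fixed topology
level = {"A": 1, "B": 2, "C": 3, "D": 4}                             # fixed hierarchy

def ch_query(src, dst, metric):
    """Upward search as before, but edge weights come from a swappable metric."""
    dist = {n: float("inf") for n in edges}; dist[src] = 0; pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst: return d
        for v in edges[u]:
            if level[v] >= level[u] and d + metric[(u, v)] < dist[v]:
                dist[v] = d + metric[(u, v)]; heapq.heappush(pq, (dist[v], v))
    return float("inf")

free_flow = {("A","B"): 4, ("B","A"): 4, ("B","C"): 3, ("C","B"): 3,
             ("C","D"): 2, ("D","C"): 2}
rush_hour = {k: w * 3 for k, w in free_flow.items()}   # re-customize in O(edges)

print(ch_query("A", "D", free_flow))   # 9
print(ch_query("A", "D", rush_hour))   # 27 — same hierarchy, new metric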
- You separate the pre-processing step (slow, run once or rarely) from the query step (sub-ms, run constantly). Most candidates conflate them.
- You name bidirectional search — start from source and target, meet in the middle. Halves the touched-nodes set.
- You name CCH (not CH) the moment "live traffic" enters the question. CH is the textbook answer; CCH is the production answer.
- You acknowledge that CH/CCH gives you shortest path, not fastest path under uncertainty. Uber's stack adds a top layer of multi-criteria search (alternative routes, route diversity for load-balancing) on top.
import heapq
graph = {
"A": [("B", 4), ("C", 2)],
"B": [("C", 5), ("D", 10)],
"C": [("D", 3), ("E", 8)],
"D": [("E", 4), ("F", 11)],
"E": [("F", 5)],
"F": [],
}
def dijkstra(g, src, dst):
dist = {n: float("inf") for n in g}
prev = {}
dist[src] = 0
pq = [(0, src)]
while pq:
d, u = heapq.heappop(pq)
if u == dst: break
if d > dist[u]: continue
for v, w in g[u]:
if d + w < dist[v]:
dist[v] = d + w
prev[v] = u
heapq.heappush(pq, (dist[v], v))
path, cur = [], dst
while cur in prev: path.append(cur); cur = prev[cur]
path.append(src)
return dist[dst], path[::-1]
print(dijkstra(graph, "A", "F"))
(14, ['A', 'C', 'D', 'E', 'F'])
In production, the edge weights are not static — they're the L2/L3 traffic-aware ETAs above. Routing recomputes when (a) traffic shifts >15% on the chosen route, (b) the driver deviates >100 m, or (c) a fresh incident report (police, accident, road closure) lands within 1 km of the path.
Why do two drivers on the same origin–destination pair get different routes? Two reasons. First, the routing engine intentionally spreads load across near-equivalent paths to avoid creating new congestion (this is why one driver gets 5th Ave and the next gets 6th). Second, the engine personalizes for driver preferences learned over their last ~500 trips — some drivers consistently take known shortcuts, others stick to highways; both yield similar ETAs. The system respects that.
How does the system detect a road closure? Three signals. (1) Anomaly detection on segment_speed_live — when 90th-percentile speed on a segment drops to 0 for 90 s while neighboring segments are normal, a "probable closure" event is emitted. (2) Driver reports in the partner app. (3) External feeds from city traffic APIs and partners like TomTom / HERE. The three feed an event stream that updates edge weights in the routing graph within ~60 s.
6. Who Arrives First? — Pickup Sequencing in Shared Trips
For solo trips the question is trivial: the driver goes to the rider, period. For UberX Share / Pool the algorithm has to decide an ordering: pick up A first or B first? Drop A first or B first? With four stops (2 pickups + 2 drop-offs) there are 4!/4 = 6 feasible orderings once each pickup must precede its own drop-off — and a real Pool car may have 3–4 active legs.
This is a tiny vehicle routing problem (VRP) solved per-vehicle, per-decision. The objective: minimize total rider time (pickup wait + in-car, summed over riders), so a rider already in the car doesn't get punished by a giant detour for a new pickup, bounded by a per-rider detour budget (typically ≤ 8 minutes of extra in-car time vs the solo route).
from itertools import permutations
# travel time in seconds between any two waypoints
T = {
    ("car", "Apick"): 180, ("car", "Bpick"): 240,
    ("Apick","Bpick"): 200, ("Bpick","Apick"): 220,
    ("Apick","Adrop"): 600, ("Apick","Bdrop"): 720,
    ("Bpick","Adrop"): 540, ("Bpick","Bdrop"): 480,
    ("Adrop","Bdrop"): 360, ("Bdrop","Adrop"): 360,
    ("Adrop","Bpick"): 420, ("Bdrop","Apick"): 400,  # assumed values — legs needed for fully-sequential orderings
}
stops = ["Apick","Bpick","Adrop","Bdrop"]
solo_A_time = T[("Apick","Adrop")]   # A's in-car time if riding alone
solo_B_time = T[("Bpick","Bdrop")]
DETOUR_BUDGET = 8 * 60  # max extra in-car seconds any rider may absorb
def feasible(order):
    return order.index("Apick") < order.index("Adrop") and \
           order.index("Bpick") < order.index("Bdrop")
best = None
for perm in permutations(stops):
    if not feasible(perm): continue
    seq = ["car"] + list(perm)
    leg_t = [T[(seq[i], seq[i+1])] for i in range(len(seq)-1)]
    cum = [sum(leg_t[:i+1]) for i in range(len(leg_t))]
    arrive = {stop: cum[i] for i, stop in enumerate(perm)}  # seconds from dispatch
    a_detour = (arrive["Adrop"] - arrive["Apick"]) - solo_A_time
    b_detour = (arrive["Bdrop"] - arrive["Bpick"]) - solo_B_time
    if a_detour > DETOUR_BUDGET or b_detour > DETOUR_BUDGET: continue
    # objective: total rider time = pickup wait + in-car, summed over riders
    total_rider_time = arrive["Adrop"] + arrive["Bdrop"]
    if best is None or total_rider_time < best[0]:
        best = (total_rider_time, perm, a_detour, b_detour)
print("best order:", best[1])
print(f"total rider time (wait + in-car): {best[0]}s")
print(f"A in-car detour: {best[2]}s B in-car detour: {best[3]}s")
best order: ('Apick', 'Bpick', 'Bdrop', 'Adrop')
total rider time (wait + in-car): 2080s
A in-car detour: 440s B in-car detour: 0s
The algorithm picked A up first, then B, then dropped B first (because B's destination is en route), then A. Note A absorbed a 440 s (~7-minute) in-car detour but stayed under the 8-minute budget — and importantly the algorithm refused any ordering that pushed a single rider over that budget. That's the fairness floor.
Dispatch is a one-shot bipartite matching. Sequencing is a constrained TSP per vehicle, re-solved every time a new rider gets added to the trip plan. The constraints are per-rider detour budget, vehicle capacity, pickup-before-dropoff precedence, and a soft objective on total rider-time. With ≤ 4 stops you brute-force; with more, you use insertion heuristics + 2-opt local search and re-solve in ~10 ms.
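And here is the insertion heuristic in miniature — a new rider's two stops slotted into an existing plan at the cheapest feasible pair of positions. Travel times are invented; production pulls them from the router and also enforces the detour budgets shown above:
from itertools import combinations

# toy travel times between stops (seconds); real ones come from the router
T2 = {("car","Apick"): 3, ("car","Cpick"): 2,
      ("Apick","Adrop"): 8, ("Apick","Cpick"): 2, ("Apick","Cdrop"): 5,
      ("Cpick","Apick"): 2, ("Cpick","Adrop"): 7, ("Cpick","Cdrop"): 6,
      ("Cdrop","Apick"): 5, ("Cdrop","Adrop"): 4, ("Cdrop","Cpick"): 6,
      ("Adrop","Cdrop"): 4, ("Adrop","Cpick"): 7}

def route_cost(seq):
    return sum(T2[(seq[i], seq[i + 1])] for i in range(len(seq) - 1))

def cheapest_insertion(seq, pickup, drop):
    """Try every (pickup, drop) position pair with pickup first; keep the cheapest."""
    best = None
    for i, j in combinations(range(1, len(seq) + 2), 2):
        cand = seq[:i] + [pickup] + seq[i:]
        cand = cand[:j] + [drop] + cand[j:]
        if best is None or route_cost(cand) < best[0]:
            best = (route_cost(cand), cand)
    return best

print(cheapest_insertion(["car", "Apick", "Adrop"], "Cpick", "Cdrop"))
# (13, ['car', 'Cpick', 'Apick', 'Cdrop', 'Adrop']) — interleaving beats appending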
7. Which Customer to Pick Based on Mapping — Geo-Aware Demand Routing
This is the question the user posed directly: "which customers to pick based on mapping?" It's actually two questions glued together — (a) of all open requests in the city, which one should the dispatcher offer this driver? and (b) when a driver finishes a trip in zone X, which queued request in adjacent zones should they receive?
The answer in both cases is the inverse of what we did earlier. Instead of building "for this rider, who are the candidate drivers?", we build "for this driver, who are the candidate riders?" — and we let the same min-cost matching arbitrate. But the cost matrix gets two extra terms:
- Destination heat — the demand intensity at the rider's drop-off hex. Sending a driver to a hot drop-off is a freebie: they end up in a high-demand area for the next trip.
- Driver shift state — if the driver is near their target earnings or near end-of-shift in a known direction, the system biases toward requests that head that way (this is the "destination filter" in the partner app).
-- Inputs: :driver_id, :driver_h3_r9, :driver_lat, :driver_lng,
-- :driver_target_dir (NULL or compass quadrant), :shift_min_left
WITH driver AS (
SELECT driver_id, h3_r9, lat, lng,
target_earnings_gap_usd, shift_hours_today
FROM driver_state WHERE driver_id = :driver_id
),
candidate_requests AS (
SELECT r.request_id,
r.h3_r9 AS pickup_h3,
r.dest_h3_r9 AS drop_h3,
r.product,
r.fare_estimate_usd,
r.created_ts,
haversine(d.lat, d.lng, r.pickup_lat, r.pickup_lng) AS pickup_m,
eta_seconds(:driver_id, r.pickup_lat, r.pickup_lng) AS eta_s,
sd.demand_intensity AS dest_demand
FROM ride_requests r
CROSS JOIN driver d
LEFT JOIN surge_by_hex sd ON sd.h3_r8 = r.dest_h3_r8
WHERE r.state = 'OPEN'
AND r.product = ANY((SELECT product_eligible FROM driver_state
WHERE driver_id = :driver_id))
AND haversine(d.lat, d.lng, r.pickup_lat, r.pickup_lng) < 4000 -- 4km gate
)
SELECT request_id, pickup_h3, drop_h3, fare_estimate_usd,
eta_s, dest_demand,
-- driver-centric score: lower is better
( eta_s
- 0.7 * dest_demand -- reward dropping in hot zones
- 0.4 * fare_estimate_usd -- reward $$$
+ 30 * EXTRACT(EPOCH FROM NOW() - created_ts) / 60.0 -- age penalty
) AS score
FROM candidate_requests
ORDER BY score ASC
LIMIT 25;
This score is what the batched matcher reads when scoring this driver's edges in the cost matrix. Two drivers in the same hex will see different orderings here because their target_earnings_gap_usd and recent trip pattern differ.
How far can those extra terms tilt an individual dispatch? It's bounded. The bonuses (rating, earnings_gap, dest_demand) are small relative to the dominant ETA term — typically < 15% of total score. They act as tiebreakers and slow drifts, not overrides. Periodic fairness audits measure earnings dispersion across drivers in the same shift bucket and re-tune coefficients if dispersion exceeds historical norms. The system optimizes for marketplace health, not for any one driver.
The match the dispatcher is most afraid of isn't the one that took too long. It's the one that gets cancelled 14 seconds in.
— Why cancel_risk is the loudest term in the cost matrix
8. Cancellations — The Other Half of the Marketplace
Every previous section optimized for "completed trip". But ~10–15% of matched trips never complete. A cancellation isn't just a missing row in fact_trip; it's a second-order failure that wastes the matcher cycle, locks supply away from the next request, charges the company for the driver's repositioning, and increases the cancel-risk score of the affected rider/driver pair for weeks. If you don't have a model of who cancels and when, every other section in this article is built on sand.
The cancellation funnel — six stages, three parties
A trip can be cancelled at any of six stages, by one of three parties. The combinations are not symmetric — a driver-cancel after pickup arrival is a different operational event from a rider-cancel before dispatch. The four buckets that dominate the funnel:
Rider pre-dispatch cancels
The biggest bucket. Most are "I waited too long to be matched". Drops sharply when ETA stays under 5 min.
Driver after-accept cancels
Often due to ETA misquote (driver realises pickup is further than the app showed) or a better-paying request appearing.
Rider after-arrival cancels
The most expensive. Driver already there, fuel spent, supply locked. Most marketplaces charge a fee at this stage.
System-initiated cancels
Safety signals, fraud, region failover. Rare but every one is a Sev-3 ticket.
The SQL — splitting the funnel by party AND stage
The single biggest mistake candidates make on this query is collapsing all cancellations into one number. The senior version tracks both axes:
-- fact_trip extension: cancelled, cancellation_party, cancellation_stage
WITH base AS (
SELECT trip_id,
cancelled,
cancellation_party, -- 'rider' | 'driver' | 'system' | NULL
cancellation_stage, -- 'pre_dispatch' .. 'near_dropoff'
EXTRACT(EPOCH FROM (cancelled_ts - request_ts)) AS time_to_cancel_s
FROM fact_trip
WHERE request_ts >= CURRENT_DATE - INTERVAL '7 days'
)
SELECT
cancellation_party,
cancellation_stage,
COUNT(*) AS n,
ROUND(100.0 * COUNT(*) /
SUM(COUNT(*)) OVER (), 2) AS pct_of_all_cancels,
PERCENTILE_DISC(0.5) WITHIN GROUP
(ORDER BY time_to_cancel_s) AS p50_seconds,
PERCENTILE_DISC(0.95) WITHIN GROUP
(ORDER BY time_to_cancel_s) AS p95_seconds
FROM base
WHERE cancelled = TRUE
GROUP BY cancellation_party, cancellation_stage
ORDER BY cancellation_party, cancellation_stage;
cancellation_party cancellation_stage n pct p50_s p95_s
driver en_route 1,820,331 18.4 42 210
driver arrived 260,118 2.6 95 380
driver dispute 42,540 0.4 180 720
rider pre_dispatch 3,210,440 32.5 38 155
rider en_route 1,640,202 16.6 58 245
rider arrived 580,015 5.9 115 380
rider in_trip 60,118 0.6 780 1820
system pre_dispatch 420,308 4.3 12 45
system en_route 180,114 1.8 25 95
system other 85,440 0.9 40 180
- Driver en-route cancels at p95 = 210 s means 5% of driver cancels happen 3+ minutes after accept — long enough that the rider has already started watching the dot move. This is the worst trust event in the marketplace; every percentage point matters.
- Rider in-trip cancels at p95 = 1820 s = ~30 min. These are the safety / payment-dispute events. Track this as a separate operational metric — it's a regulatory signal, not a marketplace metric.
- System pre-dispatch cancels at p50 = 12 s is the ETA-cap rejection: no driver inside the 8-min radius. If this number climbs, supply is collapsing. Page the marketplace ops team, not the data team.
The Python — a cancel-risk score that feeds back into dispatch
The cancel-risk score (ψ · cancel_risk in the dispatch cost matrix from Section 2) isn't pulled from thin air. It's a logistic regression over rider/driver/context features, retrained nightly. The toy version makes the inputs explicit:
import numpy as np, pandas as pd
# Hypothetical training data: each row is a past offer.
# Features the production model actually uses (reduced):
df = pd.DataFrame({
"eta_quoted_s": [120, 360, 540, 180, 720, 220, 410, 610],
"surge_mult": [1.0, 1.6, 2.4, 1.0, 2.8, 1.2, 1.4, 2.2],
"rider_cancel_30d":[0, 2, 5, 0, 8, 0, 1, 4 ],
"driver_rating": [4.95,4.7, 4.2, 4.95,4.0, 4.9, 4.6, 4.3],
"driver_idle_min":[1, 8, 20, 2, 30, 3, 10, 18 ],
"is_pool": [0, 1, 1, 0, 1, 0, 0, 1 ],
"cancelled": [0, 0, 1, 0, 1, 0, 0, 1 ], # label
})
from sklearn.linear_model import LogisticRegression
X = df.drop(columns=["cancelled"]).values
y = df["cancelled"].values
clf = LogisticRegression().fit(X, y)
# Score a fresh candidate match
new = np.array([[ 480, 1.8, 3, 4.55, 12, 1 ]])
p_cancel = clf.predict_proba(new)[0, 1]
print(f"P[cancel] for this match = {p_cancel:.3f}")
# Plug into the dispatch cost: psi=120 from Section 2's cost matrix
psi = 120
penalty = psi * p_cancel
print(f"Cost-matrix penalty added by cancel risk = {penalty:.1f} s-equivalent")
P[cancel] for this match = 0.412
Cost-matrix penalty added by cancel risk = 49.4 s-equivalent
That 49.4 s is what gets added to the cell C[i,j] in the Hungarian solver. It will routinely flip the assignment away from a closer-but-flaky driver to a slightly further driver with a lower cancel risk — exactly the Driver A vs Driver B comparison you saw back in Section 2. The ψ coefficient and the cancel-risk model together are a single closed-loop system: the model learns from past cancellations, the matcher uses the score to avoid future ones, the next batch of training data is shaped by which matches the matcher allowed in the first place.
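To see the flip concretely, here is a toy 2×2 cost matrix run through scipy's assignment solver. The numbers are illustrative, not production coefficients, and the real matcher solves min-cost flow rather than raw Hungarian:

import numpy as np
from scipy.optimize import linear_sum_assignment

eta = np.array([[200., 240.],   # rider 0: driver A is 40 s closer
                [230., 250.]])  # rider 1
p_cancel = np.array([[0.45, 0.05],   # the (rider 0, driver A) pair is flaky
                     [0.05, 0.05]])
psi = 120.0
C = eta + psi * p_cancel  # cancel risk expressed in seconds-equivalent

for label, M in [("ETA only", eta), ("ETA + cancel penalty", C)]:
    rows, cols = linear_sum_assignment(M)  # Hungarian-family solver
    pairs = list(zip(rows.tolist(), cols.tolist()))
    print(f"{label:22s} -> {pairs}  total cost {M[rows, cols].sum():.1f}")

# ETA only              -> [(0, 0), (1, 1)]  total cost 450.0
# ETA + cancel penalty  -> [(0, 1), (1, 0)]  total cost 482.0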
- Selection bias. The matcher only ever observes outcomes for matches it made. Drivers with high cancel-risk are starved of trips, so the model has no recent data on whether their behaviour improved. Counter: dedicate ~1% of dispatch volume to exploration — accept slightly higher cancel risk to keep the training set fresh.
- Driver penalty fairness. A single bad week shouldn't blacklist a driver for life. Production models use exponentially-decayed historical features (~14-day half-life on cancel rate) and re-weight by trip volume.
- Surge interaction. Cancel rate goes up during surge events for entirely environmental reasons (rain, traffic). If your model doesn't condition on surge, you'll over-penalize drivers who happened to work during a thunderstorm. Always include surge_mult as an input feature, not an output filter.
Edge cases that bite cancellations
Driver app crashes mid-trip
App reconnects 90 s later. Trip is in an ambiguous state — was the rider in the car? Solution: drivers' phones emit periodic heartbeats; if heartbeats stop > 60 s during an active trip, freeze the state and require manual reconciliation.
Both rider and driver tap "Cancel" within 1 s
Whose cancellation wins? Solution: idempotency key (trip_id) + last-write-wins by server timestamp + an explicit cancellation_party field; reconcile on the server, never on either client. A minimal sketch follows this list.
Driver waited 5 min, rider invisible
System auto-cancels with party = "rider", charges the no-show fee, and re-injects the driver at top of the queue with a small earnings credit. Subtle: this is a system cancel that looks like a rider cancel — log both.
Rider sees price, accepts, then cancels in 8 s
Behavioral signal: the rider accepted reluctantly. Track time_to_cancel_after_accept; bucket cancels under 15 s separately. They predict next-week churn.
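A minimal sketch of that reconciliation rule, with an in-memory dict standing in for the real store (field names hypothetical):

import time
from dataclasses import dataclass

@dataclass
class CancelEvent:
    trip_id: str      # the idempotency key
    party: str        # 'rider' | 'driver' | 'system'
    server_ts: float  # stamped on arrival; client clocks are never trusted

_cancel_log = {}  # one authoritative verdict per trip_id

def reconcile(ev: CancelEvent) -> CancelEvent:
    """Idempotent server-side resolution of near-simultaneous cancels:
    last write by *server* timestamp wins, and both clients then render
    the stored verdict instead of their local guess."""
    cur = _cancel_log.get(ev.trip_id)
    if cur is None or ev.server_ts >= cur.server_ts:
        _cancel_log[ev.trip_id] = ev
    return _cancel_log[ev.trip_id]

# Rider and driver cancel 300 ms apart; the later server arrival wins.
reconcile(CancelEvent("t-9001", "rider", time.time()))
final = reconcile(CancelEvent("t-9001", "driver", time.time() + 0.3))
print(final.party)  # -> driver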
9. How 30 Million Trips a Day Actually Move Through the Stack
So far we've stayed at the per-trip layer. Step back and look at the data plane that supports 30M trips/day. This is the diagram the panel really wants you to draw on the whiteboard.
Three things to internalize:
- Two state stores, two purposes. The hot path (dispatch) reads driver state from Aerospike-class stores in < 1 ms. The warm path (analytics, ML training) reads from Hudi/Iceberg tables on object storage, hours-to-days fresh. Trying to serve both from one DB is the most common architecture mistake.
- Kafka is the spine, not a side channel. Every event — ping, request, offer, accept, complete, cancel, fare — lands on Kafka first. Every downstream system (dispatch state, surge, ETA features, fraud, lakehouse) is a consumer. This is what gives Uber the property that you can replay any service against historical events.
- Partitioning by H3 cascades through the stack. Same partition key from edge → Kafka → Flink → state store → lake table. That's why a query for "all trips in San Francisco's Financial District last week" is fast at every layer.
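The cascading key is literally one derivation at the edge. A sketch using the h3 library's v4 Python API (the coordinates are arbitrary; the printed cell ids are whatever your inputs produce):

import h3  # pip install h3 (v4 API names)

lat, lng = 37.7937, -122.3965             # a pin in SF's Financial District

cell_r9 = h3.latlng_to_cell(lat, lng, 9)  # dispatch-resolution cell
cell_r8 = h3.cell_to_parent(cell_r9, 8)   # surge unit
cell_r5 = h3.cell_to_parent(cell_r9, 5)   # coarse cell -> Kafka partition key

# The same r5 value keys the Kafka partition, the Flink keyed stream, and
# the lake table's partition column: derived once, reused at every layer.
print(cell_r9, cell_r8, cell_r5)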
The "what runs where" matrix — the board you draw before any SQL
Every component in an Uber-class system maps to exactly one of three planes. If you can't place a feature on this matrix in < 5 seconds during an interview, you're not at L5 yet. Memorize this; it's the cheat sheet that lets you answer any "design X" question in the first 30 seconds.
| Component | Real-time (ms–s) | Streaming (s–min) | Batch (hour–day) | Tool / store | SQL or Python? |
|---|---|---|---|---|---|
| Dispatch decision | ● | · | · | Go service + Aerospike / Ringpop | Neither — custom matcher in Go |
| Driver state (geo-keyed) | ● | · | · | Aerospike / Scylla, H3-partitioned | Logical SQL; physical KV reads |
| Surge multiplier | · | ● | · | Flink keyed stream → Pinot | Flink SQL on a sliding window |
| ETA features | · | ● | · | Flink → Pinot feature store | SQL for aggs, Python for residual |
| Driver heatmap | · | ● | · | Pinot fed by Flink | SQL (Pinot online OLAP) |
| Routing graph weights | ● | ● | · | In-memory CCH + Flink updates | Python build / C++ serve |
| ETA model training | · | · | ● | Spark + PyTorch / Michelangelo | Python (PyTorch + feature joins) |
| Trip ledger / fares | · | · | ● | Hudi / Iceberg on S3 | SQL (Spark / Trino) |
| Reporting / dashboards | · | · | ● | Presto / Trino + Pinot | SQL |
| Fraud detection | · | ● | ● | Flink (online) + Spark (offline) | Both — features in SQL, models in Python |
| Surge coefficient tuning | · | · | ● | Notebook + offline simulator | Python (the simulator above) |
The schemas — what an interviewer will ask you to draw
-- DIMENSIONS
CREATE TABLE dim_rider (
rider_id BIGINT PRIMARY KEY,
home_h3_r7 BIGINT,
signup_ts TIMESTAMP,
cohort TEXT,
lifetime_trips INT,
rating_avg NUMERIC(3,2)
);
CREATE TABLE dim_driver (
driver_id BIGINT PRIMARY KEY,
primary_city_id INT,
vehicle_class TEXT, -- UberX, Comfort, XL...
onboarded_ts TIMESTAMP,
rating_avg NUMERIC(3,2),
cancel_rate_30d NUMERIC(4,3)
);
CREATE TABLE dim_geo_h3 (
h3_r9 BIGINT PRIMARY KEY,
h3_r7 BIGINT,
h3_r5 BIGINT,
city_id INT,
country_iso2 CHAR(2),
centroid_lat DOUBLE PRECISION,
centroid_lng DOUBLE PRECISION
);
-- FACTS
CREATE TABLE fact_trip (
trip_id UUID,
rider_id BIGINT,
driver_id BIGINT,
request_ts TIMESTAMP,
pickup_ts TIMESTAMP,
dropoff_ts TIMESTAMP,
pickup_h3_r9 BIGINT,
dropoff_h3_r9 BIGINT,
distance_m INT,
duration_s INT,
fare_usd NUMERIC(8,2),
surge_multiplier NUMERIC(3,2),
product TEXT,
is_pool BOOLEAN,
cancelled BOOLEAN,
cancellation_party TEXT, -- rider | driver | system | NULL
PRIMARY KEY (trip_id, request_ts) -- the PK must include the partition key
)
PARTITION BY RANGE (request_ts); -- daily partitions
CREATE TABLE fact_dispatch_offer (
offer_id UUID PRIMARY KEY,
trip_id UUID,
driver_id BIGINT,
rider_id BIGINT,
offer_ts TIMESTAMP,
accepted BOOLEAN,
eta_quoted_s INT,
eta_actual_s INT,
cost_score NUMERIC(8,3)
);
Three SQLs every interviewer will ask
For each of these, I'm showing a small slice of the input table, the SQL, and the actual result you'd get back. Whiteboarding the SQL is half the work; explaining what the rows mean is the other half.
SQL 1 — ETA accuracy by hour-of-day, last 7 days
Why anyone cares: systematic ETA bias erodes rider trust and inflates dispatch cost-matrix errors. A +90 s bias at 6 PM means dispatch is over-promising and the matcher is picking the wrong drivers.
Input: fact_dispatch_offer
offer_id trip_id driver_id offer_ts accepted eta_quoted_s eta_actual_s
o-1001 t-9001 d-501 2026-04-29 06:14:22 TRUE 180 215
o-1002 t-9002 d-502 2026-04-29 08:02:11 TRUE 240 312
o-1003 t-9003 d-503 2026-04-29 08:05:48 TRUE 210 268
o-1004 t-9004 d-504 2026-04-29 12:31:09 TRUE 300 305
o-1005 t-9005 d-505 2026-04-29 18:47:33 TRUE 240 395
o-1006 t-9006 d-506 2026-04-29 18:49:01 TRUE 210 350
o-1007 t-9007 d-507 2026-04-29 22:11:55 TRUE 300 330
... (~9.4M rows over 7 days)
SELECT EXTRACT(HOUR FROM offer_ts) AS hour_of_day,
COUNT(*) AS offers,
AVG(eta_actual_s - eta_quoted_s) AS bias_s,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY ABS(eta_actual_s - eta_quoted_s)) AS p50_abs_err,
PERCENTILE_DISC(0.95) WITHIN GROUP (ORDER BY ABS(eta_actual_s - eta_quoted_s)) AS p95_abs_err
FROM fact_dispatch_offer
WHERE offer_ts >= CURRENT_DATE - INTERVAL '7 days'
AND accepted = TRUE
GROUP BY 1
ORDER BY 1;
hour_of_day offers bias_s p50_abs_err p95_abs_err
0 180,433 +12 22 110
6 312,544 +28 34 135
7 501,210 +51 48 185
8 612,890 +72 61 220
12 401,778 +14 26 115
17 598,450 +66 58 205
18 642,118 +88 71 245
19 571,200 +63 54 198
22 330,455 +18 28 120
What does hour=18, bias=+88 s, p95=245 s tell you? At 6 PM the system under-quotes ETA by ~1.5 minutes on average, and 5% of trips are off by >4 minutes. That's a clear signal that the L3 ML residual is undertrained on the evening rush — likely because feature freshness for "active events in the area" lags by ~10 minutes. Action: shorten the event-feature pipeline and retrain DeepETA with hour-bucketed weighting.
- PERCENTILE_DISC requires a sort of the entire group. At 9.4M rows × 24 groups, that's a 9.4M-row shuffle every time. Senior fix: pre-aggregate hourly into a fact_eta_accuracy_hourly table with t-digest sketches; the dashboard query reads 24 rows.
- Filtering on EXTRACT(HOUR FROM offer_ts) instead of a materialized hour_local column means no partition pruning — the planner reads all daily partitions in the 7-day range. Add hour_local TINYINT at write time.
- The accepted = TRUE filter hides a sampling bias: rejected offers were rejected because their ETA was implausible. Excluding them flatters the bias number. Always report both populations side by side.
SQL 2 — Driver utilization (P3 — paid time / online time) by city
Why anyone cares: P3 is the operating health score of a marketplace. A city with low P3 has too many idle drivers (driver churn risk); a city with very high P3 has too few (rider cancellations climb). Operations teams pull this every morning.
Inputs: driver_session + fact_trip
-- driver_session
driver_id city_id online_start online_end
d-501 NYC 2026-04-29 06:00 2026-04-29 13:30
d-502 NYC 2026-04-29 16:00 2026-04-30 01:00
d-601 SFO 2026-04-29 07:30 2026-04-29 16:00
d-701 LON 2026-04-29 05:00 2026-04-29 14:00
... (~5M sessions / 7 days)
-- fact_trip (filtered)
trip_id driver_id pickup_h3_r9 pickup_ts dropoff_ts cancelled
t-9001 d-501 891fa726a3... 2026-04-29 06:14 2026-04-29 06:33 FALSE
t-9008 d-501 891fa726a3... 2026-04-29 06:48 2026-04-29 07:21 FALSE
t-9015 d-501 891fa726a3... 2026-04-29 07:35 2026-04-29 08:11 FALSE
t-9080 d-502 891fa726b1... 2026-04-29 16:22 2026-04-29 17:05 FALSE
... (~210M trips / 7 days)
WITH online AS (
SELECT driver_id, city_id,
SUM(EXTRACT(EPOCH FROM (online_end - online_start))) AS online_s
FROM driver_session
WHERE online_start >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1, 2
),
paid AS (
SELECT driver_id, dim_geo_h3.city_id,
SUM(EXTRACT(EPOCH FROM (dropoff_ts - pickup_ts))) AS paid_s
FROM fact_trip
JOIN dim_geo_h3 ON dim_geo_h3.h3_r9 = fact_trip.pickup_h3_r9
WHERE pickup_ts >= CURRENT_DATE - INTERVAL '7 days'
AND cancelled = FALSE
GROUP BY 1, 2
)
SELECT o.city_id,
SUM(o.online_s) AS online_s,
SUM(p.paid_s) AS paid_s,
SUM(p.paid_s)::numeric / NULLIF(SUM(o.online_s),0) AS p3_utilization
FROM online o
LEFT JOIN paid p USING (driver_id, city_id)
GROUP BY 1
ORDER BY p3_utilization DESC;
city_id online_s paid_s p3_utilization
NYC 45,210,400 28,532,140 0.631
SFO 18,440,210 11,108,005 0.602
LON 22,318,005 13,175,621 0.590
TOR 14,210,330 7,888,222 0.555
BOM 38,220,100 19,540,001 0.511
SAO 21,330,200 10,440,055 0.490
JNB 12,108,544 5,113,309 0.422
42% utilization in Johannesburg means drivers are online but not earning. Either supply is over-matched to demand (slow growth city — pull back driver acquisition spend) or there's a structural issue (driver heatmap is wrong, pickups are clustered in a sub-region with no riders). The senior signal is naming both hypotheses and proposing a check before recommending action.
- The fact_trip ⨝ dim_geo_h3 join looks small but at 210M trips it's a 210M-row probe. Senior fix: denormalize city_id directly onto fact_trip at write time and skip the dim join entirely on hot dashboards. Storage cost is trivial; query time drops 20×.
- LEFT JOIN paid p USING (driver_id, city_id) — if a driver was online in two cities in the week, this double-counts online_s. Edge case: cross-region drivers (London/Reading) inflate denominators. Use a deterministic primary-city assignment per session.
- The SUM at the outer level is what makes this a city P3, not a driver-weighted P3. They're different numbers. Always specify which weighting before showing this to a panel — they will ask, and "I summed and divided" is the wrong answer at L5.
SQL 3 — Marketplace conversion funnel
Why anyone cares: the funnel is the single best diagnostic in the company. Each step that drops more than ~3 pp from the trailing baseline points at a different system: got_eta→dispatched drops mean dispatch failures (no driver), dispatched→completed drops mean cancellations.
Input: fact_trip ⨝ fact_dispatch_offer
request_ts trip_id eta_quoted_s accepted cancelled
2026-04-29 08:01:14 t-9100 240 TRUE FALSE
2026-04-29 08:01:55 t-9101 NULL NULL NULL -- never got an ETA quote
2026-04-29 08:02:11 t-9102 180 TRUE FALSE
2026-04-29 08:02:48 t-9103 220 TRUE TRUE -- rider cancelled
2026-04-29 08:03:09 t-9104 260 FALSE NULL -- no driver accepted in window
2026-04-29 08:03:51 t-9105 300 TRUE FALSE
... (~30M / day × 14 days = ~420M rows)
WITH funnel AS (
SELECT
DATE(request_ts) AS d,
COUNT(*) AS sessions_with_request,
COUNT(*) FILTER (WHERE eta_quoted_s IS NOT NULL) AS got_eta,
COUNT(*) FILTER (WHERE accepted) AS dispatched,
COUNT(*) FILTER (WHERE accepted AND NOT cancelled) AS completed
FROM fact_trip t
LEFT JOIN fact_dispatch_offer o USING (trip_id)
WHERE request_ts >= CURRENT_DATE - INTERVAL '14 days'
GROUP BY 1
)
SELECT d,
sessions_with_request,
ROUND(100.0 * got_eta / sessions_with_request, 1) AS pct_got_eta,
ROUND(100.0 * dispatched / sessions_with_request, 1) AS pct_dispatched,
ROUND(100.0 * completed / sessions_with_request, 1) AS pct_completed
FROM funnel
ORDER BY d;
d sessions pct_got_eta pct_dispatched pct_completed
2026-04-16 29,800,212 98.7 92.4 88.1
2026-04-17 30,110,401 98.6 92.1 87.9
2026-04-18 31,440,809 98.4 91.8 87.5
2026-04-19 30,901,544 98.5 92.0 87.6
2026-04-20 29,055,210 98.6 91.9 87.4
2026-04-21 30,500,330 98.5 91.7 87.0
2026-04-22 30,840,001 98.4 88.1 82.7 *
2026-04-23 30,900,440 98.6 90.9 86.5
2026-04-24 31,055,221 98.7 91.5 87.1
2026-04-25 31,440,300 98.7 91.8 87.4
2026-04-26 30,800,108 98.5 91.9 87.5
2026-04-27 29,720,505 98.6 92.0 87.6
2026-04-28 30,560,810 98.6 92.2 87.8
2026-04-29 30,940,012 98.5 92.1 87.6
Dispatched fell ~4 pp and completed fell ~5 pp on the same day — both stages are bleeding. First check: was there a known incident in dispatch (deploy, region failover)? Second: was there an external event (weather, transit strike) that depressed driver supply? The data structure of the funnel makes the diagnosis fast — that's why it's the first SQL anyone runs in an incident.
- Skew: a single megacity (Mexico City, São Paulo) drives ~12% of global trips. Aggregating without a city dimension hides regional disasters — Apr 22 might have been a clean global day with one city in flames. Always GROUP BY day × region for any funnel that informs an action.
- Late events: trips that complete after the day boundary land in the next day's partition. Without an explicit WHERE event_date = request_date guard, "today's" funnel is permanently understated by ~2–3 pp until the lake catches up overnight.
- Cancellation party: "completed = accepted AND NOT cancelled" buckets rider-cancels and driver-cancels into one number. They have opposite root causes and opposite fixes. Split them: cancelled_by_rider vs cancelled_by_driver vs cancelled_by_system. The funnel becomes diagnostic only when split.
Deep-dive — HyperLogLog and the art of being approximately right
One pattern shows up in every senior-level marketplace interview: "give me COUNT(DISTINCT user_id) over a billion rows, in 50 ms". The naïve answer dies. The right answer is HyperLogLog, and most candidates can name it but can't actually use it.
Why exact distinct counts don't scale. To count distinct values exactly you must remember every value you've seen — a hash set. At 30 M trips/day × 50 events each = 1.5 B events, a per-key hash set holding 64-bit IDs is ~12 GB per key. Multiply by thousands of dashboard panels and the memory bill is impossible.
What HyperLogLog does. Replaces the hash set with a tiny array of "leading-zero counters" — typically 4–16 KB regardless of cardinality. The math leans on a beautiful observation: if you hash N items uniformly, the expected length of the longest run of leading zeros is roughly log₂(N). Track that maximum across many sub-streams, harmonic-mean it, and you get cardinality with ±1–2% error and constant memory.
import sys, random
from datasketch import HyperLogLog # pip install datasketch
# Generate a stream of 1,000,000 unique driver IDs, repeated 5x in random order
unique = [f"driver_{i}" for i in range(1_000_000)]
stream = unique * 5
random.shuffle(stream)
# --- Exact: a Python set ---
exact = set()
for d in stream:
exact.add(d)
exact_card = len(exact)
exact_bytes = sys.getsizeof(exact) + sum(sys.getsizeof(x) for x in exact)
# --- HyperLogLog: a 16 KB sketch ---
hll = HyperLogLog(p=14) # p=14 → 2^14 = 16384 registers ≈ 16 KB
for d in stream:
hll.update(d.encode())
hll_card = int(hll.count())
print(f"Exact : count={exact_card:>9,d} memory={exact_bytes/1e6:>8.1f} MB")
print(f"HLL p=14: count={hll_card :>9,d} memory={(2**14):>8d} bytes")
print(f"HLL relative error: {abs(hll_card-exact_card)/exact_card*100:.2f} %")
Exact : count= 1,000,000 memory= 87.4 MB
HLL p=14: count= 994,128 memory= 16384 bytes
HLL relative error: 0.59 %
-- Pinot: APPROX_COUNT_DISTINCT uses HLL under the hood (p=12 default)
SELECT h3_r8,
APPROX_COUNT_DISTINCT(driver_id) AS unique_drivers_60s
FROM driver_pings
WHERE ping_ts >= NOW() - INTERVAL '60 seconds'
GROUP BY h3_r8;
-- BigQuery: HLL_COUNT.MERGE / HLL_COUNT.INIT — HLL state is mergeable.
-- Pre-aggregate per-minute sketches at write time, merge at query time.
WITH per_min AS (
SELECT minute, h3_r8,
HLL_COUNT.INIT(driver_id) AS sketch
FROM driver_pings
GROUP BY minute, h3_r8
)
SELECT h3_r8,
HLL_COUNT.MERGE(sketch) AS unique_drivers_hour
FROM per_min
WHERE minute >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY h3_r8;
The killer property is in the second query: HLL sketches are mergeable. HLL(A ∪ B) = MERGE(HLL(A), HLL(B)). That means you can compute one sketch per minute, store it, and answer any longer-window distinct-count by merging the relevant minute sketches at query time — without ever touching the raw data again. It's the single trick that makes "unique riders per hex per hour over the last 30 days" answerable in milliseconds.
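The same property in a few lines, using the datasketch library from the example above: two overlapping per-minute sketches union into an hour-level count without double-counting the overlap.

from datasketch import HyperLogLog

# Minute 1 sees drivers 0..599,999; minute 2 sees 400,000..999,999.
a, b = HyperLogLog(p=14), HyperLogLog(p=14)
for i in range(600_000):
    a.update(f"driver_{i}".encode())
for i in range(400_000, 1_000_000):
    b.update(f"driver_{i}".encode())

hour = HyperLogLog(p=14)
hour.merge(a)
hour.merge(b)  # union, not sum: the 200k overlap is not double-counted
print(int(hour.count()))  # ≈ 1,000,000, computed from two 16 KB sketches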
- HyperLogLog — distinct counts (cardinality). Mergeable. ±1–2% error, ~16 KB.
- Count-Min Sketch — top-K / heavy hitters ("which hex got the most requests last minute?"). Mergeable. One-sided error. (Minimal sketch below.)
- t-digest / KLL — quantiles (p50, p95, p99). Mergeable. ~5 KB. The right answer when an interviewer says "approximate p99 latency from a stream".
- Bloom filter — set-membership. Mergeable by bitwise OR when sizes and hash functions match. One-sided false positives only; no deletions (use a Cuckoo filter if you need deletes).
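Count-Min is the one on this list candidates can least often write, so here is a from-scratch toy (real systems use a library; the width, depth, and hashing scheme here are arbitrary choices):

import random

class CountMinSketch:
    def __init__(self, width=2048, depth=4, seed=7):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, key)) % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions can only inflate a cell, so the minimum across rows
        # over-counts with bounded probability: one-sided error.
        return min(self.table[row][col] for row, col in self._cells(key))

# Two sketches with identical width/depth/salts merge by adding tables
# element-wise (within one process here, since Python's hash() is
# per-process; real implementations use explicit hash functions).
cms = CountMinSketch()
for hex_id, n in [("hex_stadium", 9_000), ("hex_a", 120), ("hex_b", 45)]:
    cms.add(hex_id, n)
print(cms.estimate("hex_stadium"))  # ≈ 9000: the heavy hitter stands out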
The interview test is whether you can say "this query needs a sketch, here's which one, and it's mergeable so I can pre-aggregate per minute and merge at query time". That single sentence converts an O(N) full-scan answer into an O(1) lookup.
Choosing the precision parameter p: p=14 → 2^14 = 16,384 registers ≈ 16 KB ≈ 0.81% standard error.
10. Pre-emptive Interview Bank — Data Architect vs Data Engineer
The two roles overlap but the panels weight things differently. Architect interviews drift toward "why did you pick this technology, and what does it cost?". Engineer interviews drift toward "write the SQL, debug the pipeline, fix the latency". Below are the questions I've actually heard or asked.
Data Architect track
"Size the event backbone for the global driver-ping firehose."
~5M drivers × 1 ping / 4 s × 1 KB ≈ 1.25M msgs/s ≈ 1.25 GB/s sustained. Replication factor 3 → ~3.75 GB/s into disks. With 7-day retention that's ~2.3 PB on the cluster. You'd want 200+ brokers, NVMe, 100 GbE NICs, partitioning by H3 r5 (≈ 50k partitions globally) so consumers can be horizontally scaled per city.
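The same arithmetic as a back-of-envelope script you can re-run with different assumptions (all inputs are the ones stated above):

drivers = 5_000_000
pings_per_s = drivers / 4      # one ping every 4 s
msg_bytes = 1_000
replication = 3
retention_s = 7 * 24 * 3600

ingress_gb_s = pings_per_s * msg_bytes / 1e9   # ~1.25 GB/s
disk_gb_s = ingress_gb_s * replication         # ~3.75 GB/s
cluster_pb = disk_gb_s * retention_s / 1e6     # ~2.27 PB
print(f"{pings_per_s:,.0f} msg/s  {ingress_gb_s:.2f} GB/s in  "
      f"{disk_gb_s:.2f} GB/s to disk  {cluster_pb:.2f} PB retained")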
"Product wants a live dashboard of surge by weather, hour, and city. Can we afford it?"
The cardinality is roughly weather (~6) × hour (24) × city (1000) = 144k cells × 5-min refresh = 41M evals/day. On Pinot or BigQuery with the right pre-aggregation it's $X00–$X,000/day. The architect's answer isn't "yes/no", it's "yes, here's the bill, here's a cheaper alternative (hourly refresh, or only top-50 cities), here's the trade-off". Always quote the bill.
"Trips can now include intermediate stops. Evolve the trip event schema without breaking consumers."
Schema-evolve forward-compatible: add intermediate_stops as a nullable, repeated field on the trip event. Bump minor version. Publish a data contract in the schema registry with deprecation calendar. Run dual-write for two weeks, validate in shadow. The architect owns the contract, not the schema.
"A region's data center goes dark for ten minutes. Walk through the blast radius and the recovery."
Edge events still land on Kafka (multi-region MirrorMaker). Driver state in the affected region goes stale; a fail-over replica in the adjacent region takes over with a brief consistency hiccup. Surge feature pipelines see a gap and freeze last-known values rather than emit zeros. Lakehouse jobs detect the gap via a watermark monitor and re-process the gap window from Kafka once recovered. RPO ≈ 0 events, RTO ≈ < 60 s.
Data Engineer track
"Find drivers whose offer acceptance rate dropped more than 20% week-over-week."
Two CTEs over fact_dispatch_offer with accepted::int averaged per driver per ISO week, self-join on week = week-1, filter (this.rate - last.rate) / last.rate < -0.20. Watch for low-volume divisors — gate on offers >= 30.
"For each rider's first 10 trips, compute a 5-trip rolling average fare."
ROW_NUMBER() OVER (PARTITION BY rider_id ORDER BY request_ts) to rank trips, filter rn <= 10, then AVG(fare) OVER (PARTITION BY rider_id ORDER BY rn ROWS BETWEEN 4 PRECEDING AND CURRENT ROW).
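For intuition, the same window logic in pandas on toy data (the SQL version is what the panel wants first):

import pandas as pd

trips = pd.DataFrame({
    "rider_id": [1] * 12,
    "request_ts": pd.date_range("2026-04-01", periods=12, freq="D"),
    "fare": [10, 12, 9, 14, 11, 13, 15, 8, 10, 12, 16, 9],
})
trips = trips.sort_values(["rider_id", "request_ts"])
trips["rn"] = trips.groupby("rider_id").cumcount() + 1
first10 = trips[trips["rn"] <= 10].copy()
# ROWS BETWEEN 4 PRECEDING AND CURRENT ROW == a width-5 trailing window
first10["rolling_avg"] = (first10.groupby("rider_id")["fare"]
                          .transform(lambda s: s.rolling(5, min_periods=1).mean()))
print(first10[["rider_id", "rn", "fare", "rolling_avg"]])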
"A nightly Spark join that normally takes 20 minutes has been running for 3 hours. Debug it."
Spark UI → Stages tab → tasks with skewed input bytes (one partition holding all the data). Likely cause: a hot H3 r5 hex (e.g., a stadium event). Fix: salt the join key (h3_r5 || (random % 16)) and re-aggregate, or switch to a broadcast join on the dim side. Verify with the Stage timeline that one task no longer dwarfs the others.
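The salting trick itself fits in a dozen lines; this toy shows the two-stage aggregate that keeps the hot hex from melting one partition (a bucket count of 16 is an arbitrary choice):

import random
from collections import Counter

pings = [("hex_stadium", 1)] * 9_000 + [("hex_other", 1)] * 1_000

# Stage 1: aggregate on the salted key, so the stadium's 9k rows spread
# across 16 partitions instead of landing on one.
salted = Counter()
for hex_id, n in pings:
    salted[(hex_id, random.randrange(16))] += n

# Stage 2: strip the salt and re-aggregate, at most 16 rows per hex.
final = Counter()
for (hex_id, _salt), n in salted.items():
    final[hex_id] += n
print(final)  # Counter({'hex_stadium': 9000, 'hex_other': 1000})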
"A driver's phone dumps 90 s of buffered pings after a tunnel. What happens to the surge window they belong to?"
Default: dropped if past the watermark. Better: configure allowedLateness on the surge window (e.g., 120 s) so late pings update an already-emitted surge value via a retraction stream. Downstream consumers must handle the retraction (Pinot upsert tables, idempotent writes). The engineer's answer is "I know about watermarks, allowedLateness, and idempotent sinks — and I can tell you when each matters".
"What data-quality guardrails do you put on the ping pipeline?"
Three guardrails. (1) Per-device sanity: ping speed < 250 km/h, lat/lng inside the geofence, timestamp monotonic. (2) Aggregate sanity: 5-min ping volume per city must be within ±3σ of trailing-30-day baseline. (3) Cross-feature: if the surge formula is about to emit >2.5x for a hex that hasn't surged in 6 months, hold and alert. The pipeline fails closed, not open.
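A minimal sketch of the three gates (thresholds are the ones from the answer above; the geofence is a toy bounding box standing in for a real polygon, and field names are hypothetical):

from dataclasses import dataclass

@dataclass
class BBox:
    """Toy geofence: a lat/lng bounding box instead of a real polygon."""
    min_lat: float
    max_lat: float
    min_lng: float
    max_lng: float
    def contains(self, lat, lng):
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lng <= lng <= self.max_lng)

def ping_ok(ping, fence, last_ts):
    """Guardrail 1: per-device sanity."""
    return (ping["speed_kmh"] < 250
            and fence.contains(ping["lat"], ping["lng"])
            and ping["ts"] > last_ts)  # monotonic timestamps

def volume_ok(count_5min, baseline_mean, baseline_std):
    """Guardrail 2: aggregate sanity, within 3 sigma of the trailing baseline."""
    return abs(count_5min - baseline_mean) <= 3 * baseline_std

def surge_ok(proposed_mult, months_since_last_surge):
    """Guardrail 3: cross-feature, hold a suspicious first-time spike."""
    return not (proposed_mult > 2.5 and months_since_last_surge >= 6)

# Fail closed: anything that misses a gate is quarantined, never passed.
sf = BBox(37.6, 37.9, -122.6, -122.2)
print(ping_ok({"speed_kmh": 42, "lat": 37.79, "lng": -122.40, "ts": 100.0}, sf, 99.0))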
11. The Visual Story — Data, in One Glance
If you only remember one diagram from this article, remember the three-plane matrix from Section 9: every Uber trip is a vertical slice through the real-time, streaming, and batch time scales running concurrently.
An interviewer will ask you to map a feature to one of those columns. "Where does ETA accuracy reporting live?" — batch. "Where does the driver heatmap live?" — near-real-time. "Where does the dispatch decision live?" — real-time. If you can place every product feature into the right column and explain the trade-off, you're already past the bar.
12. What Interviewers Are Actually Looking For — Five Sentences That Win the Round
Most candidates leave their interview wondering whether they said enough. The truth is brutal: panels make their decision on five or six specific signals, and the candidate who hits them in roughly this order walks out with the offer. Memorize the pattern, not the answers.
- Convert the product question into a constrained optimization. Don't describe what happens — name the objective, the decision variable, the constraints. The panel is testing whether you think in math or in metaphor. "This is a min-cost bipartite matching: minimize Σ C[i,j]·x[i,j], subject to ≤1 driver per rider, hard ETA cap, and a soft cancel-risk penalty."
- Reach for the right consistency model — and say "bounded staleness", not "Postgres". Real-time geo systems are eventually consistent on purpose. The panel wants to hear that you know which consistency is acceptable for which piece of state. "Driver state is bounded-stale (~4 s), keyed by H3 r6 in an Aerospike-class store. Trip ledger is strongly consistent in Hudi. Surge is monotonically updated via streaming retraction."
- Name the failure modes before the happy path. Senior engineers earn trust by talking about what breaks. Hot hexes, late events, skewed partitions, retraction handling — bring them up before you're asked. "The biggest risk here is partition skew on a hot hex. I'd salt the partition during known events and add an admission-control gate for the matcher when ETA p99 climbs past 8 s."
- Tie every query to a business cost. Every SQL has a stake. ETA accuracy = cancellation = $. Driver utilization = wages = retention. If you can't price the query's failure, you don't yet think like the L5 the panel needs. "A 1 pp drop in completion at 30M trips/day is ~300k lost rides — at $14 average fare, ~$1.5B/year of GMV. That's why I'd pre-aggregate this hourly even though the raw query 'works'."
- Show the layered architecture in one breath. Real-time → streaming → batch. SQL where it fits, Python where it fits. Draw the matrix; place the feature on it. This single move converts "candidate" into "candidate I want on my team". "Dispatch is real-time, in Go on Aerospike. Surge and ETA features are streaming, in Flink to Pinot. Model training and the trip ledger are batch, in Spark on Hudi. Each plane has its own consistency, latency, and cost profile."
And five things that quietly disqualify you
- "I'd put it in Postgres." Whatever it is, Postgres is the wrong answer for the hot path. (It might be right for the trip ledger; never for driver state, dispatch, surge, or pings.)
- Skipping the optimization formulation. Jumping to ML before naming the objective tells the panel you reach for tools before defining problems.
- Forgetting fairness. A cost matrix with only ETA and money optimizes for the wrong system. Driver-side fairness is not a soft skill; it's a marketplace stability term.
- "It's just a SQL query." No SQL at this scale is just a SQL query. Naming join order, skew risk, and partition pruning is what separates senior from staff.
- Treating Python and SQL as interchangeable. They are not. SQL describes the world that exists; Python builds the world that hasn't happened yet. Use them for what they're for.
Closing — What This Whole Stack Is Really Doing
Strip away the H3 hexagons, the Flink jobs, the Hungarian solver, the contraction hierarchies. What Uber is really doing is solving the same question 30 million times a day: "of all the people who want to move and all the people who can move them, who should be paired with whom, by which path, at what price, in what order, to make the next 5 minutes of the city better?". That's the problem. Everything in this article is just the data plumbing that lets the answer come back in under two seconds.
If you can walk an interviewer from the rider's tap to the driver's accept and back, naming the data structure, the SQL, and the failure mode at each layer — you don't need to memorize any one company's stack. You'll be able to design any of them.
Glossary — Every Term & Abbreviation in This Article
- H3
- Uber's open-source hexagonal hierarchical geo-indexing library. Tiles the planet in nested hexagons; every cell has a 64-bit integer ID.
- Resolution (r5–r9)
- The hex size.
r5 ≈ 250 km² (city quarter), r7 ≈ 5 km² (neighborhood), r8 ≈ 0.7 km² (surge unit), r9 ≈ 0.1 km² (city block, dispatch unit).
- The set of all hexes within k steps of a center hex.
k=2 → 19 hexes; the standard "nearby drivers" lookup.
- Great-circle distance formula between two lat/lng points. Used as a fast crow-flies estimate before the routing engine produces a real ETA.
- Map matching
- Snapping a noisy GPS ping to the road segment the driver is most likely actually on. Production-grade implementations use a Hidden Markov Model with Viterbi decoding.
- HMM / Viterbi
- Hidden Markov Model + the dynamic-programming algorithm that finds the most likely sequence of hidden states (here: road segments) given noisy observations (GPS pings).
- PostGIS / GIST / ST_DWithin
- The "PostgreSQL way" of doing geo: spatial extension, geometric index type, distance-predicate function. Fine for offline analytics, far too slow for hot-path dispatch.
- CCH / Contraction Hierarchies
- A graph pre-processing technique that lets shortest-path queries run in microseconds on continent-scale road graphs. Customizable CCH lets edge weights change without rebuilding the hierarchy.
- Dispatch
- The decision of which driver to offer a given ride request to.
- Greedy matching
- Naive approach: assign each new rider to the locally-closest driver, immediately. Fast, but provably worse on global ETA than batched matching.
- Batched matching
- Hold incoming requests for a short window (~2 s), then solve a single global matching problem over all riders × candidate drivers.
- Bipartite matching
- Graph-theory term for a matching between two disjoint vertex sets — here, riders and drivers.
- Hungarian algorithm
- The classic O(n³) algorithm for solving assignment problems (a kind of bipartite matching). Available in Python as scipy.optimize.linear_sum_assignment.
- Min-cost flow
- Generalization of bipartite matching to richer graphs with capacities; the formulation production matchers actually solve.
- Cost matrix
- The N×M table of "how bad would it be to assign rider i to driver j" used as input to the solver.
- Objective function
- The single number the solver minimizes — here, a weighted sum of ETA, detour, idle penalty, rating bonus, earnings balance, and cancel risk.
- Hard / soft constraint
- Hard = forbidden (e.g., ETA > 8 min ⇒ infeasible). Soft = penalized but allowed (e.g., high cancel risk adds to the cost).
- VRP
- Vehicle Routing Problem — the family of problems that includes pickup-sequencing for shared trips. Constrained TSP variant.
- TSP
- Traveling Salesman Problem — find the shortest tour visiting all stops. The base abstraction for sequencing a multi-stop trip.
- Surge multiplier
- A per-hex, per-product, per-minute price coefficient (1.0× to 3.0×) that clears the local market.
- D/S ratio
- Demand over supply in a hex; the dominant input to the surge formula.
- Elasticity
- Percent change in quantity for a 1% change in price. Riders are demand-elastic (~−0.6); drivers are supply-elastic (~+1.2) on a ~5-minute lag.
- Monotonic
- The surge formula must never decrease the price when an input gets worse (more demand, longer ETA, higher cancel rate). Prevents oscillation.
- Clamp
- Hard upper/lower bound on the multiplier ([1.0, 3.0]). The blast-radius safety rail when an input is anomalous.
- EMA smoothing
- Exponential Moving Average. surge_t = α·surge_{t-1} + (1−α)·raw. Damps oscillation but adds settling time.
- Market clearing
- The lowest multiplier at which fulfilled = min(demand, supply) is maximized. Surge's actual goal.
- P3 utilization
- Paid time / online time — the operating-health metric for a driver marketplace.
- GMV
- Gross Merchandise Value — total money flowing through the marketplace before take rate.
- ETA
- Estimated Time of Arrival. The most lied-about number on the app. Quoted to the rider as L3 below.
- L1 / L2 / L3
- The three layers of Uber's ETA stack. L1 = free-flow Dijkstra. L2 = traffic-adjusted from live segment speeds. L3 = ML residual on top of L2.
- DeepETA
- Uber's transformer-based ETA residual model. Reduced absolute ETA error by ~12% vs the prior gradient-boosted tree.
- Dijkstra
- The textbook shortest-path algorithm. Correct but too slow on continent-scale graphs without pre-processing.
- Routing engine
- The service that, given origin / destination / live edge weights, returns a path. Production = CCH or A*-with-landmarks, not raw Dijkstra.
- Segment / edge
- A piece of road between two intersections in the routing graph.
- Free-flow speed
- The speed limit (or learned uncongested speed) on a segment. The L1 weight.
- Live segment speed
- Rolling 3 / 15 / 60 min average of GPS-derived speeds on the segment. The L2 weight.
- Kafka
- The distributed log used as the spine of the event bus. Every ping, request, offer, accept, complete, fare lands here first.
- Flink
- Distributed stateful stream processor. Runs the surge windows, the live segment-speed aggregates, the ETA features.
- Pinot
- Real-time OLAP store. Sub-second SQL on streaming data — what the dispatcher and rider app actually read.
- Aerospike / ScyllaDB
- Sharded in-memory key-value stores. Where driver_state actually lives at sub-millisecond read latency.
- Ringpop
- Uber-built consistent-hashing membership library used for sharding stateful services like dispatch.
- Sliding / tumbling window
- Tumbling = non-overlapping, fixed-size buckets. Sliding = overlapping windows (e.g., last 60 s, advanced every 10 s). Surge uses sliding.
- Watermark
- A timestamp that says "events older than this are assumed to have all arrived". Tells Flink when to close a window.
- Allowed lateness
- How long after a window closes Flink will still accept and re-emit corrections for late events.
- Retraction stream
- An output mode where downstream consumers receive both old and new values for a key so they can update.
- Idempotent sink
- A downstream system (e.g., Pinot upsert table) where applying the same update twice is safe. Required if you allow lateness.
- Bounded staleness
- A consistency model that promises reads are no more than X seconds out of date. Driver state uses ~4 s bounded staleness.
- Eventual consistency
- Reads will converge to the latest write — eventually. Required to scale geo-distributed state to multi-region.
- Skew / hot partition
- One key (one hex, one driver, one region) carrying disproportionate load. The single biggest cause of dispatch outages.
- Salting
- Appending a random suffix to a hot key (h3 || (rand % 8)) to spread load across partitions, then re-aggregating.
- HyperLogLog (HLL)
- A probabilistic data structure for approximate distinct counts in O(1) memory. Use this when COUNT(DISTINCT ...) threatens to OOM. Exposed in SQL as APPROX_COUNT_DISTINCT.
- t-digest
- A probabilistic structure for approximate quantiles (p50, p95, p99) over streaming data.
- Broadcast join
- A small dimension table is replicated to every executor; large fact table is not shuffled. Hugely faster than a shuffle join when the small side fits in memory.
- Shuffle
- The expensive operation of repartitioning data across the cluster by a join/group-by key. The first thing you blame when a Spark job is slow.
- Partition pruning
- The query planner's ability to skip reading partitions that can't possibly satisfy the WHERE clause. Killed by functions like EXTRACT(HOUR FROM ts) applied to the partition column.
- OLTP
- Online Transaction Processing. Postgres/MySQL territory: row-oriented, ACID, optimized for many small writes.
- OLAP
- Online Analytical Processing. Pinot/BigQuery/Trino territory: column-oriented, optimized for large scans and aggregations.
- Lakehouse
- Object-store-native data architecture (S3 + open table format) that supports both batch SQL and ACID upserts. Replaces the lake / warehouse split.
- Hudi / Iceberg / Delta
- Open table formats for the lakehouse. Add ACID, schema evolution, time travel, and upserts on top of S3 / HDFS.
- Compaction
- Periodic merging of small files into fewer, larger ones. Without it, lakehouse queries spend most of their time listing files.
- Schema registry
- Central catalog of event schemas (Avro / Protobuf) with versioning and compatibility rules. Producers and consumers agree on the contract here.
- Data contract
- The published, versioned commitment about a data set's schema, semantics, freshness and ownership. Architect-level discipline.
- SCD (Type 1 / Type 2)
- Slowly Changing Dimensions. Type 1 = overwrite. Type 2 = append a new row with effective-from / -to dates so history is preserved.
- Star schema / fact / dim
- The classic warehouse model: a central fact table (events, measurements) joined to dimension tables (descriptive context).
- GBT / Gradient Boosted Tree
- An ensemble of shallow decision trees trained sequentially on each other's residuals. Strong baseline for tabular ETA / fraud / cancel-risk models.
- Transformer
- The neural-network architecture (attention-based) that powers DeepETA and most modern sequence models.
- Residual model
- An ML model that learns the error of a simpler model, not the target itself. L3 ETA is a residual on top of L2.
- Feature store
- A managed service that serves up-to-date model features at low latency for inference and at training-time consistency for offline jobs.
- Michelangelo
- Uber's internal end-to-end ML platform — feature store, training, deployment, monitoring.
- MAE
- Mean Absolute Error. The standard metric for ETA model quality. Lower is better.
- p50 / p95 / p99
- Latency percentiles: half / 95% / 99% of requests are at or below this time. p99 is what users notice; p50 is what dashboards show.
- SLA / SLO
- Service Level Agreement = external promise. Service Level Objective = internal target you measure against.
- RPO
- Recovery Point Objective. Maximum acceptable data loss in a disaster (in seconds / events).
- RTO
- Recovery Time Objective. Maximum acceptable downtime (in seconds / minutes).
- pp (percentage points)
- Absolute difference between two percentages. "Cancellation up 4 pp" = up by 4 percentage points, not 4%.
- DAU / WAU / MAU
- Daily / Weekly / Monthly Active Users — the standard engagement metrics.
- Cancellation party
- Whether the rider, the driver, or the system cancelled a trip. Mixing them in one number hides the diagnosis.
- Admission control
- Selectively rejecting low-priority work when a system is at capacity, to keep the high-priority work meeting SLA. The "fail closed" pattern.
- Geofence
- A polygon on the map used to apply special rules to requests inside it (e.g., airport queues, stadium events).
- CTE
- Common Table Expression — the WITH … AS (…) form. Makes complex queries readable but doesn't always improve plans.
- Window function
- An aggregate that runs over a window of rows without collapsing them — SUM() OVER (PARTITION BY … ORDER BY …).
- PERCENTILE_DISC / PERCENTILE_CONT
- SQL functions for exact percentiles. Both require sorting the group; expensive at scale — replace with t-digest in production.
- FILTER clause
- The COUNT(*) FILTER (WHERE …) form. Lets you compute multiple conditional aggregates in a single pass.
- Recursive CTE
- A WITH RECURSIVE query that references itself. The trick for graph traversal, gaps-and-islands, and bills-of-materials.
- Hash / sort-merge / nested-loop join
- The three physical join algorithms a planner can pick. Knowing why each one was chosen is a senior signal.
- API
- Application Programming Interface — the contract between two services.
- KV
- Key-Value store. Aerospike, Redis, RocksDB, DynamoDB — the family.
- DAG
- Directed Acyclic Graph — the topology of an Airflow / Spark / Flink job.
- GPS
- Global Positioning System. The ping source. Notoriously noisy in urban canyons and tunnels.
- ML / AI
- Machine Learning / Artificial Intelligence. In this article ML = the residual layer of ETA.
- HDFS / S3
- Distributed file systems for the lakehouse. HDFS = on-prem Hadoop; S3 = AWS object store. Iceberg / Hudi sit on either.
- NYC / SFO / LON / BOM / SAO / JNB
- City IATA codes used in the example outputs: New York, San Francisco, London, Mumbai, São Paulo, Johannesburg.
Disclaimer
This article is for educational and interview-preparation purposes. It is not affiliated with Uber Technologies, Inc., and does not reproduce any proprietary code, schema, or system. The numbers, formulas, and architectures shown are illustrative composites drawn from publicly available papers, conference talks (e.g., Uber's KDD / VLDB / Strata papers on H3, DeepETA, Michelangelo, Pinot), and the broader ridesharing literature. Real production systems are far more nuanced — and far more interesting.