▸ DESIGN · Hot Shards & Data Skew · surviving whales in production

Hot shards & data skew — every layer breaks differently.

The vocabulary page tells you what whales, dolphins, and minnows ARE. This page tells you what they do to your infrastructure at every layer of the stack — Kafka partitions, Flink keyed state, Spark shuffles, hot DB rows, hot Cassandra/DynamoDB partitions, BI dashboards that visually lie because one whale dominates the axis. Plus the named mitigation per layer the senior-IC interview expects.

§ 01 — The shape of the problem

One whale, every layer hurts — differently.

The Pareto curve isn't just a metrics-reporting nuisance. Every layer that partitions by user_id (and that's almost every layer) ends up with one partition that takes 100× the traffic of the others. The shape of the problem is the same everywhere — one shard is hot — but the symptoms and the mitigations are layer-specific.

[Diagram — one whale hits every downstream shard. Whale user_id = 42 at 10K events/sec (100× a dolphin) → Kafka: partition(42) gets 100× the writes · Flink: one operator key, RocksDB hotspot · Spark shuffle: one task takes 90% of the time · Database: hot row + index lock contention · DynamoDB / Cassandra: hot partition, throttled writes · BI dashboard: misleading mean, axis dominated. Shared symptom: one partition at 100× the load.]

Five layers, one root cause, five different mitigations. Naming the layer-specific mitigation in an interview is the senior-IC signal — generic "use a hash key" answers don't survive contact with a load test.
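The Pareto shape itself is easy to reproduce. A minimal sketch, using Python's stdlib `paretovariate` as a stand-in for real per-user traffic (the alpha value is illustrative, not a fit to any real dataset):

```python
# Simulate a heavy-tailed per-user event rate and show why
# "one shard takes 100x the traffic" falls straight out of the distribution.
import random
import statistics

random.seed(7)
# paretovariate(alpha) with a small alpha gives a heavy tail.
rates = [random.paretovariate(1.2) for _ in range(100_000)]

mean = statistics.mean(rates)
median = statistics.median(rates)
print(f"mean={mean:.1f}  median={median:.1f}  skew_ratio={mean / median:.1f}x")
print(f"max user is {max(rates) / median:.0f}x the median user")
```

With a tail this heavy, the mean sits several multiples above the median and the single largest user is orders of magnitude above it. Any layer that hashes these users onto shards inherits that shape.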

· · ·
§ 02 — Streaming

Streaming — Kafka partition skew & Flink keyed-state hotspots.

Symptom — one Kafka partition takes 100× the writes

Kafka partition skew

Topic partitioned by user_id. A whale generates 10K events/sec while the median user generates 100. The whale's hash lands on partition 7. Partition 7's consumer lags 30 minutes behind while the other partitions are real-time. The downstream consumer is now reading mostly-stale data — and your "real-time" feature is anything but.

Mitigations (named, in order of operational cost):

  • Composite key — partition by (user_id, hour_of_day) or (user_id, MOD(event_seq, 16)). Spreads one whale's stream across multiple partitions. Cost: consumers can no longer assume "all events for user X are on one partition" — ordering guarantees drop.
  • Sticky-aware producer with whale escape hatch — most events use default hashing; producers detect the top-K hot keys and reroute to dedicated whale partitions. Cost: extra producer-side state, periodic recalibration of the hot-key list.
  • Separate whale topic — top 1% of users get a dedicated topic with higher partition count + dedicated consumer group. Cost: two topics to operate, dual-path consumer logic.

The trap: "just add more partitions" doesn't fix it. Kafka assigns partitions by hash(key) % N — adding partitions moves keys around, but every event for the whale key still hashes to a single partition, and that partition is whale-heavy wherever it lands. You need to change the key, not the partition count.
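The trap is visible in a few lines. This is an illustrative stand-in (a SHA-256-based partitioner, not Kafka's actual murmur2 one), but the arithmetic is the same:

```python
# Any deterministic hash(key) % N puts ALL of one key's events on one
# partition, no matter how large N grows; a salted composite key spreads them.
from collections import Counter
from hashlib import sha256

def partition(key: str, n: int) -> int:
    return int.from_bytes(sha256(key.encode()).digest()[:8], "big") % n

whale_events = ["user-42"] * 10_000

# Adding partitions: still exactly one hot partition.
for n in (12, 48, 192):
    hits = Counter(partition(k, n) for k in whale_events)
    print(n, "partitions ->", len(hits), "partition(s) used")  # always 1

# Salted composite key: the same events now land on up to 16 partitions.
salted = Counter(partition(f"user-42:{i % 16}", 48) for i in range(10_000))
print("salted ->", len(salted), "partitions used")
```

The salt count (16 here) is the knob: it bounds how many partitions one whale can occupy, and it is also the fan-out a consumer must re-merge if it needs per-user ordering.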

Symptom — one Flink subtask's RocksDB explodes

Flink keyed-state hotspot

Flink's keyBy(user_id) sends all events for one user to one operator subtask. The whale's subtask accumulates 100× the state — its RocksDB instance grows to 50 GB while others are at 500 MB. Checkpoints take 10× longer. Eventually checkpoint storm: incremental checkpoints fall behind, full checkpoints exceed the timeout, the job goes into a restore loop.

Mitigations:

  • Pre-aggregation — before keyBy(user_id), do a keyBy(user_id, source_partition) with a session-window aggregation. Reduces 10K events/sec from the whale into one aggregate per source partition; the second keyBy now sees only N aggregates per whale instead of N events. Same shape as Spark's combiner pattern.
  • Two-phase reduce — explicit two-stage Flink: first stage aggregates by (user_id, salt), second stage aggregates by user_id with N inputs per whale. Same as Spark salting (next section) but applied at stream time.
  • Async I/O for sink lookups — if the whale's hotspot is on the SINK side (joining to a dim table), use Flink's async I/O to overlap the lookups; the whale operator stops being CPU-bound on lookups.
  • Custom state backend tuning — bump RocksDB block cache, enable bloom filters, use FsStateBackend for tiny operators and RocksDBStateBackend for the whale-heavy ones (per-operator config).
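The two-phase reduce above can be sketched in plain Python (the salt count and event shapes are made up; in Flink the two phases would be two keyBy stages):

```python
# Phase 1 aggregates by (user_id, salt) so no single worker owns the whale;
# phase 2 merges the at-most-SALTS partial aggregates per user.
from collections import defaultdict

SALTS = 8

def phase1(events):                        # events: (user_id, amount) pairs
    partials = defaultdict(float)          # keyed by (user_id, salt)
    for i, (user, amount) in enumerate(events):
        partials[(user, i % SALTS)] += amount
    return partials

def phase2(partials):
    totals = defaultdict(float)            # keyed by user_id
    for (user, _salt), subtotal in partials.items():
        totals[user] += subtotal
    return dict(totals)

events = [("whale", 1.0)] * 10_000 + [("minnow", 1.0)] * 10
partials = phase1(events)
print(len([k for k in partials if k[0] == "whale"]))  # whale split into 8 partials
print(phase2(partials))                               # totals are unchanged
```

The point of the shape: phase 2 sees at most SALTS inputs per whale instead of one event per event, so the hot key's state and CPU are spread across SALTS subtasks.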

The interview moment. If asked "how do you handle a hot Kafka partition?", the L4 answer is "more partitions". The L5 answer is "change the partition key (composite or salt), or split the hot tenant into a dedicated topic — adding partitions doesn't help because the hash still concentrates."

· · ·
§ 03 — Distributed compute

Spark — salting, broadcast joins, AQE.

Symptom — one task runs for 6 hours while the others finish in 10 seconds

Spark skewed join / aggregation

You join a fact table (orders) to a dim table (customers), partitioned by customer_id. The whale customer has 30M orders, the median customer has 30. The reducer task for the whale customer takes 6 hours; everything else finishes in 10 seconds. Stage shows 199/200 tasks complete, 1 task stuck. The whole job is bottlenecked on the whale.

Mitigations, in increasing order of code change:

Mitigation 1 — Spark 3 AQE skew join (free)

Adaptive Query Execution (AQE) detects skewed partitions at shuffle time and automatically splits them into sub-partitions. No code change — just turn it on:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")

AQE looks at the shuffle stats; any partition more than 5× the median AND larger than 256 MB gets split. For the build side it duplicates the matching dim rows. This handles the 80% case for free. Reach for it FIRST in any interview answer.

Mitigation 2 — Broadcast join (when the dim is small)

If the dim table fits in memory (typically < 100 MB), broadcast it to every executor. No shuffle at all; the whale's fact rows stay where they are and look up the dim locally:

SELECT /*+ BROADCAST(customers) */ ...
-- Or Spark conf:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100MB")

Eliminates the join-side skew entirely. Doesn't help if both sides are large.

Mitigation 3 — Salting (the manual classic)

When AQE can't and broadcast can't, you salt: explode the hot key into N sub-keys on both sides of the join, then aggregate. This is what customer_growth-0027 walks through in SQL form. The Spark pattern:

// Assumes import org.apache.spark.sql.functions._

// 1. Salt the fact side — add a uniform random salt 0..15 to every row
val SALT_BUCKETS = 16
val factSalted = fact.withColumn("salt", floor(rand() * SALT_BUCKETS).cast("int"))

// 2. Explode the dim side — one copy of each dim row per salt bucket
val dimExploded = dim
  .withColumn("salt", explode(array((0 until SALT_BUCKETS).map(lit): _*)))

// 3. Join on the composite key. The whale's 30M rows now go to 16 reducers
//    instead of 1. Each reducer handles ~2M rows.
val joined = factSalted.join(dimExploded, Seq("customer_id", "salt"))

The tradeoff: the dim side gets N× duplication (16× memory if N=16). Pick N based on (a) how skewed the hot key is, (b) how much memory the duplicated dim side costs.

Mitigation 4 — Isolate the whale (the heavy-hammer)

Compute the whale separately. Filter the fact table into whale_orders (just user 42) and rest_orders (everyone else). Run different join strategies on each — broadcast for the whale, hash-join for the rest. UNION at the end:

val (whales, rest) = facts.partition($"customer_id".isin(hotKeyList: _*))
val whaleResult = whales.join(broadcast(dim), "customer_id")
val restResult  = rest.join(dim, "customer_id")  // hash join, fine
val all = whaleResult.unionByName(restResult)

Operationally heavy — you maintain a hot-key list, refresh it nightly. Used at FAANG scale where AQE + salting still can't handle the largest accounts.

Strategy        | Code change                | When to pick                  | Cost
AQE skew join   | None — config only         | Always try first; Spark 3.0+  | Free
Broadcast       | Add hint or set threshold  | Dim table < 100 MB            | Memory on every executor
Salting         | Explicit two-phase join    | Both sides large + skewed     | N× duplication of dim side
Whale isolation | Filter + UNION             | Top-K keys known and stable   | Hot-key list maintenance
· · ·
§ 04 — Databases

Databases — hot rows, hot partitions, index hotspots.

Symptom — one row gets all the writes; everyone else waits

Postgres / MySQL hot-row contention

An account_balance table where the whale account gets 1000 UPDATEs/sec while others get 10/min. Every UPDATE takes a row lock. The whale's row is locked 99% of the time. Read queries on that row block. p99 latency for whale-touching queries goes to seconds.

Mitigations:

  • Counter sharding — store the balance as N rows (account_balance_shard(account_id, shard_id, delta)); writes pick a random shard, reads SUM across all shards. Eliminates contention but reads cost N× more.
  • Write batching — buffer N seconds of deltas in a queue, flush as one UPDATE. Trades write latency for write throughput.
  • Optimistic concurrency + retry — instead of SELECT ... FOR UPDATE, use compare-and-swap with a version column. Whales retry more often, but no other transaction blocks waiting.
  • Materialised aggregate — don't update the balance row directly. Append account_ledger rows; a periodic job rolls them up. The hot path becomes append-only (no contention).
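Counter sharding fits in a few lines. A minimal sketch on sqlite3 as a stand-in for Postgres/MySQL; the table and column names are illustrative, not from any real schema:

```python
# Writes pick a random shard row (so concurrent writers rarely collide on
# one row lock); reads pay the N-row cost of a SUM across shards.
import random
import sqlite3

SHARDS = 8
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE account_balance_shard (
    account_id INTEGER, shard_id INTEGER, delta REAL,
    PRIMARY KEY (account_id, shard_id))""")
db.executemany("INSERT INTO account_balance_shard VALUES (42, ?, 0)",
               [(i,) for i in range(SHARDS)])

def credit(account_id: int, amount: float) -> None:
    shard = random.randrange(SHARDS)          # each write touches ONE of N rows
    db.execute("UPDATE account_balance_shard SET delta = delta + ? "
               "WHERE account_id = ? AND shard_id = ?",
               (amount, account_id, shard))

def balance(account_id: int) -> float:
    (total,) = db.execute("SELECT SUM(delta) FROM account_balance_shard "
                          "WHERE account_id = ?", (account_id,)).fetchone()
    return total

for _ in range(1000):
    credit(42, 1.0)
print(balance(42))  # 1000.0
```

The read path is the cost: every balance lookup scans N rows. That's usually fine for a whale account whose write rate is the problem, not its read rate.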

Symptom — one partition throttled at its per-partition limit; every other partition sees zero throttles

DynamoDB / Cassandra hot partition

DynamoDB partitions by hash(partition_key). The whale's partition handles 50K requests/sec — well past the per-partition limit (3,000 read units / 1,000 write units per second) — and gets throttled. Adaptive capacity kicks in eventually, but the first 30 seconds of every spike see error rates climb. Cassandra has the same problem: the whale's wide row grows unboundedly and breaks reads on that key.

Mitigations:

  • Composite partition key — change PK = user_id to PK = (user_id, date) or (user_id, salt_bucket). Spreads the whale across multiple physical partitions. Cost: queries that want "all data for user X" now need to scatter across N partitions.
  • Cassandra: bucket the wide row — partition key includes a time bucket like (user_id, day_of_year). Bounds the row size so reads stay fast.
  • DynamoDB write sharding via random suffix — append _0 through _15 to the partition key on write; on read, query all 16 suffixes in parallel.
  • Read replicas / DAX cache — for read-heavy whales, the cache hit rate is near 100% (they re-read the same data constantly).
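The random-suffix pattern in miniature. A plain dict stands in for the DynamoDB table here; only the suffix logic is the point, and in production the 16 reads would be issued in parallel (e.g. via BatchGetItem or async calls):

```python
# Write path appends a random suffix so one logical user spreads across
# up to 16 physical partitions; read path scatters across all suffixes.
import random
from collections import defaultdict

N_SUFFIXES = 16
table = defaultdict(list)                      # partition_key -> items

def put(user_id: str, item: dict) -> None:
    pk = f"{user_id}_{random.randrange(N_SUFFIXES)}"   # e.g. "user42_7"
    table[pk].append(item)

def query_all(user_id: str) -> list:
    items = []
    for suffix in range(N_SUFFIXES):           # scatter-gather on read
        items.extend(table.get(f"{user_id}_{suffix}", []))
    return items

for i in range(5000):
    put("user42", {"seq": i})
print(len(query_all("user42")))                # 5000
print(sum(1 for pk in table if pk.startswith("user42_")), "partitions used")
```

Same tradeoff as every salting scheme: writes get N× cheaper contention, reads get N× fan-out. Works when the whale is write-hot; for read-hot whales prefer the cache row above.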

Symptom — sequential index slows EVERY query, even non-whale ones

B-tree right-edge contention

If the partition key is monotonically increasing (timestamp, auto-incrementing ID), every insert lands on the right edge of the B-tree. Every write contends on the same leaf page lock. This isn't whale-specific — it's about insert distribution, not user distribution. But it's the same shape of problem.

Mitigations:

  • UUID v7 / KSUID as PK — time-sortable but with random suffix, so inserts hit many leaf pages.
  • Partition by a salt + time — Postgres / MySQL: PARTITION BY HASH(salt) on top of a time-ordered partition.
  • For Cassandra/Scylla: never use a monotonic clustering key alone — always lead the partition key with a hash.
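What "time-sortable with a random suffix" means mechanically, as a minimal sketch (UUIDv7/KSUID-shaped, not byte-exact to either spec):

```python
# A 48-bit millisecond timestamp prefix keeps IDs roughly insert-ordered;
# the random suffix spreads same-millisecond inserts across many key ranges,
# unlike an auto-increment PK where every insert hits the same B-tree edge.
import os
import time

def sortable_id() -> bytes:
    ts = int(time.time() * 1000).to_bytes(6, "big")  # 48-bit ms timestamp
    return ts + os.urandom(10)                       # 80 random bits

a, b = sortable_id(), sortable_id()
time.sleep(0.002)
c = sortable_id()
assert c > a and c > b      # later IDs still sort after earlier ones
print(a.hex())
```

The prefix preserves the property people wanted from monotonic keys (range scans by time stay cheap); the suffix removes the property that hurt (all concurrent inserts landing on one leaf page).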
· · ·
§ 05 — Reports & dashboards

Dashboards — query performance AND visual lies.

Skew breaks dashboards twice: once on the query (aggregations over a skewed table take 100× longer than the predicted runtime), and once on the visual (one whale's bar dominates the chart, the average tile shows a number nobody actually experiences). The L5 interviewer wants you to name both.

A. Query performance — when the BI tile times out

A "top-N customers by revenue this month" tile runs GROUP BY customer_id over 100M order rows. With Pareto distribution, the whale's customer_id has 5M rows; the per-customer reducer for that ID takes 30 seconds while others finish in 100 ms. The dashboard tile times out at 20 seconds. Same skew, BI surface.

  • Anti-pattern: BI tile recomputes the GROUP BY on every page-load.
    Fix: materialise as fct_customer_daily_revenue nightly; BI reads the rollup, not the raw fact.
  • Anti-pattern: tile asks for COUNT(DISTINCT user_id) over 90 days.
    Fix: use HLL sketches (APPROX_COUNT_DISTINCT); see customer_growth-0028.
  • Anti-pattern: top-1000 leaderboard sorts the entire fact table.
    Fix: approximate top-K via Count-Min Sketch + heavy-hitter tracking.
  • Anti-pattern: filter-then-aggregate where the filter is non-selective on the hot key.
    Fix: pre-bucket and aggregate at ETL time; BI reads pre-computed buckets.

B. Visual lies — when the chart misleads

Even when the query is fast, the picture can be wrong. Skewed data breaks dashboards visually in five specific ways:

Trap 1 — The axis is dominated by one bar. Bar chart of "revenue per customer". The whale's bar is $5M; the next customer is $80K. Every other bar is a 1-pixel sliver. The chart visually says "one customer is the entire business" — even though deciles 2-10 collectively are 70% of revenue.
Fix: log-scale the Y-axis, or show top-N plus an "others (aggregated)" bar with the long tail collapsed into one. Or use the lift-chart shape — cumulative percent on the Y-axis, percentile rank on the X-axis. The whale's contribution shows up as a steep early curve, not a single dominating bar.

Trap 2 — The average tile reports a value nobody experiences. "Average revenue per customer = $202" on the executive dashboard. The median is $30. Nobody actually pays $202 — that number exists because the whale exists. Reporting only the mean is a mistake no senior IC should make.
Fix: always show mean AND median side by side. Add the skew_ratio (mean / median) as a small subscript. When skew_ratio > 2 the dashboard is telling two different stories at once.

Trap 3 — A pie chart that should never have been a pie chart. "Revenue share by customer." If one customer is 60%, the pie is unreadable for everyone else (3°, 2°, 1° slices crammed together). Pie charts assume roughly uniform shares — they fail on long-tail data.
Fix: use a ranked bar chart (cumulative percent) or a treemap, both of which are designed for non-uniform distributions.

Trap 4 — A time-series where one single-day spike hides everything else. Daily revenue chart; one day had a whale's $20M renewal. The Y-axis now runs 0 to $25M. Every other day is a 0.5-pixel line near the bottom. The whole month looks dead except for one day.
Fix: compute a robust Y-axis (clip at P99), or display the spike as an annotation marker on a normally-scaled axis. Some BI tools call this "outlier suppression".

Trap 5 — Cohort retention charts where one cohort dominates the sample. "Retention by signup month." If one month had a viral launch (5× the usual signups), that cohort's retention curve dominates the visual. The other months' curves overlap into a blur. Conclusions drawn from this chart are mostly about the viral cohort, not the platform.
Fix: show retention as a percentage (not absolute counts), and plot the curves so each can be inspected separately. Or aggregate the non-viral months together to give the visual a baseline.

C. The L5-friendly tile pattern

For any tile that aggregates over potentially-skewed data, show three numbers, not one:

┌─────────────────────────────────────────┐
│  REVENUE PER CUSTOMER (this quarter)    │
│                                         │
│    Mean      Median     Skew ratio      │
│    $202      $30        6.7×            │
│                                         │
│    P99 customer: $5,000,000             │
└─────────────────────────────────────────┘

The mean alone is the slide for the press release. The full triplet is the slide for the operating review. Senior data folks recognise the difference and build the dashboard for both audiences.
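The triplet is one statistics call away. A sketch in plain Python; the revenue numbers are invented to echo the tile above, not taken from real data:

```python
# Compute the tile's three numbers (mean, median, skew ratio) plus P99
# for a synthetic long-tail-plus-whale revenue distribution.
import statistics

revenue = [30.0] * 9_900 + [5_000.0] * 99 + [5_000_000.0]

mean = statistics.mean(revenue)
median = statistics.median(revenue)
skew_ratio = mean / median
p99 = statistics.quantiles(revenue, n=100)[98]   # 99th percentile cut point

print(f"mean=${mean:,.0f}  median=${median:,.0f}  "
      f"skew_ratio={skew_ratio:.1f}x  p99=${p99:,.0f}")
```

Here a single whale drags the mean an order of magnitude above the median. Any tile that shows only the first number is reporting the whale, not the customer base.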

· · ·
§ 06 — Cohort playbook

The whale / dolphin / minnow playbook — different cohorts, different handling.

Once you've identified cohorts (see the vocabulary page), the operational moves diverge by cohort. This is the L5 framing: skew handling isn't one mitigation — it's three different operational regimes running in parallel.

🐋 Whale · top 1-5%

Handle as VIP infrastructure tenants
  • Streaming: dedicated topic / partition group; opt-out of default partitioning.
  • Compute: whale-isolation join pattern (filter + UNION). Pre-compute their aggregates nightly so dashboards don't recompute them at query time.
  • Database: counter sharding, write batching, optimistic concurrency. Read-replica routing if they're read-heavy.
  • Cache: aggressive pre-warming. Their working set fits — and is worth — keeping hot.
  • Dashboard: always reported separately ("top-5 customers" with names; or "non-whale aggregate" if anonymised).
  • SLA: tighter — they generate the most support volume, can't afford to have a bad experience.

🐬 Dolphin · middle 65-75%

Default-path tenants — most of the work goes here
  • Streaming: default hash partitioning is fine.
  • Compute: hash joins, no special handling. AQE skew join will catch the occasional borderline-dolphin that creeps into the whale band.
  • Database: standard sharding; no per-user tuning needed.
  • Cache: LRU; their working set is too large to pin, but cache hit rates are still good because they re-read recent content.
  • Dashboard: the median + decile rollups represent them. Optimise the dashboard for THIS cohort — they're the platform.

🐟 Minnow · bottom 20-30%

Optimise for cold-path efficiency, not freshness
  • Streaming: default. But they generate so few events that you can consider not partitioning their events at all — pool minnow events into a low-priority topic and process in bigger batches.
  • Compute: can run on a cheaper schedule (hourly instead of real-time). They're not waiting on the data anyway.
  • Database: cold-tier storage; archive after N days of inactivity. Reduce the working set so dolphins fit in hot memory.
  • Cache: don't pre-warm; cache cost outweighs the value (they don't re-read enough).
  • Dashboard: aggregated as a single "low-engagement" cohort. Per-minnow detail is noise.
  • SLA: relaxed — a 10-second page load for a user who hasn't logged in in 30 days is acceptable; for a whale it isn't.
· · ·
§ 07 — Detection

Detection — how you SEE skew before it pages you.

The strongest signal in the interview isn't naming a mitigation — it's naming the metric you'd watch to detect skew before it breaks production. For each layer:

Layer        | Detection metric                              | Threshold that pages on-call
Kafka        | Per-partition consumer lag (ms or messages)   | P99 partition lag > 5× P50 lag
Kafka        | Bytes-in per partition (Prometheus / KMM)     | Max partition > 4× mean partition
Flink        | Subtask processing time + state size          | One subtask > 3× median operator
Flink        | Checkpoint duration trend                     | Drifting up week-over-week > 20%
Spark        | Task duration P99 / P50 per stage             | P99 > 10× P50 — Spark UI flags this
Spark        | Shuffle bytes per task                        | Max task shuffle bytes > 5× median
Postgres     | Lock wait time per table (pg_locks)           | Per-row lock wait > 100 ms at P99
DynamoDB     | Per-partition throttled requests (CloudWatch) | Any throttled request
Cassandra    | Wide-row size (nodetool tablestats)           | Max partition size > 100 MB
BI dashboard | Tile execution time + skew_ratio per query    | skew_ratio (mean / median) > 3

The senior detection move. Don't wait for the throttle / timeout / lock error. Build a skew dashboard for every layer — one row per metric, green/yellow/red. When yellow turns red, the on-call investigates BEFORE the user complains. The skew metric is the leading indicator; the SLA breach is the lagging one.
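The Kafka row of the table reduces to a one-function detector. A minimal sketch; the lag numbers are invented, and in production they would come from consumer-group metrics:

```python
# Page when the worst partition's lag exceeds 5x the P50 lag across
# partitions (using max as a stand-in for P99 on a small partition count).
import statistics

def lag_skew_alert(lags_ms: list[int], factor: float = 5.0) -> bool:
    p50 = statistics.median(lags_ms)
    return max(lags_ms) > factor * p50

healthy = [200, 180, 220, 190, 210, 205]
whale   = [200, 180, 220, 190, 210, 1_800_000]   # one partition 30 min behind

print(lag_skew_alert(healthy))  # False
print(lag_skew_alert(whale))    # True
```

The ratio-against-median form is the point: an absolute lag threshold pages during every deploy, while a skew ratio only fires when one partition diverges from its peers.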
· · ·
§ 08 — The interview script

The 90-second articulation script.

When asked "how do you handle hot shards" or "what happens when a whale joins your platform":

"Skew is the same root cause everywhere — one shard takes 100× the traffic — but the symptom and mitigation are layer-specific.

At the streaming layer, one Kafka partition lags. I change the partition key — composite (user_id + hour) or pre-aggregate before the keyBy. Adding partitions doesn't help because the hash still concentrates.

At the compute layer, one Spark task takes 6 hours. AQE skew join is the free first move; broadcast if the dim is small; salting (two-phase aggregation with a uniform random salt 0..15) if both sides are large; whale-isolation (filter + UNION) at FAANG scale.

At the database layer, one row gets all the writes. Counter sharding, write batching, or convert to an append-only ledger with periodic rollup. For Dynamo / Cassandra, change the partition key to include a time bucket or a salt.

At the BI layer, the dashboard misleads twice: queries time out on the whale's GROUP BY, AND the chart visually lies because one bar dominates the axis. Pre-aggregate at ETL time; report mean AND median AND skew_ratio side-by-side; never use a pie chart on Pareto data.

And the cohort-aware framing: whales get VIP-tier isolation, dolphins are the default path most of the work optimises for, minnows are cold-tier with relaxed SLAs and pooled processing. The detection metric is layer-specific too — Kafka partition lag, Spark task P99/P50, lock wait time, dashboard skew ratio. Catch it before it pages you."

Practice questions — paired SQL + Python by topic

Each topic in this design page maps to a runnable practice question (often two — one in SQL, one in Python). The "asked at" column lists the L5+ loops where this topic shows up most often, based on the patterns in the practice bank.

  • Salting / two-phase aggregation (§ 03)
    SQL: customer_growth-0027 — Salting a Skewed Aggregation (canonical SQL form of the Spark salt pattern)
    Python: python_175-0002 — Skew-Aware Key Partitioner (Okta-tagged); also mixed_167-0166
    Asked at: Snowflake · Databricks · Spark-heavy shops · Okta · any FAANG warehouse role
  • PySpark partition strategy (§ 03)
    Questions: mixed_167-0050 — Repartition · mixed_167-0051 — Broadcast Join · mixed_167-0049 — CSV and Partitions · mixed_167-0054 — Cache and Performance
    Asked at: Databricks · Netflix · Meta · Uber · Airbnb · all Spark-on-cluster shops
  • Window functions over partitioned data (§ 02, § 03)
    SQL: customer_growth-0026 — NTILE Decile Velocity Cohorts · customer_growth-0031 — Power-User Skew (PERCENTILE_CONT + LAD)
    Python: mixed_167-0059 — Window Functions without Partitions · mixed_167-0061 — Mountain Climber Logs · mixed_167-0063 — Tracking Customer Purchase History
    Asked at: All FAANG · Stripe · Snowflake · Databricks
  • Sketches / approximate distincts (§ 04, § 05)
    SQL: customer_growth-0028 — DAU/MAU with HyperLogLog (the non-additivity + sketch-merge probe set)
    Python: python_175-0008 — Log Aggregator (Microsoft) · mixed_167-0164 — Log Aggregator
    Asked at: Netflix · Meta · Cloudflare · Datadog · Splunk · any observability/SaaS analytics
  • Streaming + fraud detection (§ 02 + skew-aware tumbling/HOP)
    Questions: customer_growth-0010 — Velocity-Rule Fraud (Robinhood) · customer_growth-0012 — PyFlink Tumbling Window (Stripe) · customer_growth-0013 — PyFlink Address-Change (Coinbase) · customer_growth-0018 — PyFlink HOP (Stripe)
    Asked at: Stripe · Coinbase · Robinhood · PayPal · Visa · Mastercard · all fintech
  • Pareto / 80-20 cumulative (§ 03 in skew-and-distributions; cited in § 05 here)
    SQL: community_solved-0082 — 80/20 Pareto cumulative · customer_growth-0033 — Dynamic Pivot (JSON aggregation pattern that survives whale-dimension explosion)
    Python: customer_growth-0019 — Passive Customers Classifier (Pinterest)
    Asked at: Netflix · Meta · Pinterest · YouTube · TikTok · all consumer engagement shops
  • Cohort / archetype classification (§ 06 playbook)
    SQL: customer_growth-0032 — Whale / Dolphin / Minnow with skewness signals + LAD-tolerant view
    Python: customer_growth-0008 — CLV pandas (Netflix) · customer_growth-0009 — Unsteady Passive Customers (Amazon) · customer_growth-0011 — RFM tier transitions
    Asked at: Netflix · Amazon · Pinterest · Spotify · all consumer with paid tiers
  • Sliding-window analytics (§ 02 — DAU/WAU rolling)
    SQL: customer_growth-0028 — 7-day rolling unique with HLL
    Python: python_175-0017 — Sliding Window Median (Google) · mixed_167-0158 — Sliding Window Median · python_175-0104 — Time-Series Fixed Windows (Tesla)
    Asked at: Google · Tesla · Datadog · Cloudflare · Bloomberg
  • Aggregation rollups (group-by, pivot) (§ 05 BI tile pattern)
    SQL: co_sql_305-0192 — Salary ranking with edge cases · customer_growth-0033 — Dynamic pivot via JSON aggregation
    Python: mixed_167-0053 — Daily Category Sales · mixed_167-0136 — Top Products by Revenue · python_175-0099 — Aggregate & Export (Meta)
    Asked at: Meta · Amazon · Walmart · Target · all retail/e-commerce analytics

By company — what the L5+ loop tends to ask

If you're prepping for a specific company, this is the curated subset of the bank that exercises the patterns this design page covers. The same shape comes up in different domain dressing — fraud at Stripe, content at Netflix, dispatch at DoorDash — but the underlying skew/sharding pattern is identical.

  • Netflix
    SQL: customer_growth-0030 · 0031 · 0032 · 0033
    Python: customer_growth-0008 — CLV pandas
    Why: streaming-first platform; whales / minnows / LAD vocabulary is core; Kafka partition skew is daily reality. The whole skew page was written with Netflix in mind.
  • Stripe
    SQL: co_sql_305-0218 · mixed_167-0011
    Python: customer_growth-0012 — PyFlink Tumbling · customer_growth-0018 — PyFlink HOP
    Why: payments at scale; merchant-tenant skew (one merchant = whale-tier traffic); fraud-detection windows over hot keys. PyFlink is the production pattern here.
  • Meta
    SQL: customer_growth-0026 · 0028 · community_solved-0082
    Python: co_sql_305-0213 · python_175-0099
    Why: feed ranking + ad attribution; HLL for DAU; Pareto-shaped engagement; dashboard query optimisation under hot-key shuffle.
  • Amazon
    SQL: customer_growth-0032
    Python: customer_growth-0009 — Unsteady Passive · 0011
    Why: DynamoDB hot-partition territory (§ 04); seller-tier skew on FBA; cohort transitions in retail.
  • Google
    SQL: customer_growth-0028
    Python: python_175-0017 — Sliding Window Median · 0056 — Descriptive Stats Pandas
    Why: sliding-window medians are a Google-loop staple; BigQuery's APPROX_COUNT_DISTINCT is the production HLL.
  • Snowflake / Databricks
    SQL: customer_growth-0027 — Salting · 0033 — Dynamic pivot
    Python: mixed_167-0050 — Repartition · 0051 — Broadcast Join · 0054 — Cache and Performance
    Why: both vendors test AQE awareness, salt patterns, broadcast/sort-merge tradeoffs. Often a live coding round in PySpark.
  • Uber / DoorDash
    Questions: python_175-0109 — Partition CSV → monthly Parquet (DoorDash) · 0170 — Partition Labels
    Why: geographic hot-spotting (one city = whale partition); time-window aggregation under per-city skew. DoorDash's partition-by-month question is the production rebalancing pattern.
  • Robinhood / Coinbase / fintech
    Questions: customer_growth-0010 — Velocity-Rule Fraud · 0013 — Address-Change PyFlink
    Why: streaming fraud detection where one merchant or user account can dominate the event stream; sliding-window pandas + PyFlink Table API.
  • Cloudflare / Datadog / Splunk
    SQL: customer_growth-0028
    Python: python_175-0008 — Log Aggregator (Microsoft, same pattern)
    Why: observability platforms LIVE on hot-key/hot-customer skew; sketch-based aggregation is mandatory; tenant isolation is required.
  • Pinterest / TikTok / consumer
    SQL: community_solved-0082
    Python: customer_growth-0019 — Passive Customers Classifier (Pinterest)
    Why: long-tail engagement is the business model; whale/dolphin/minnow is daily ops vocab.
  • Tesla / IoT / time-series
    Questions: python_175-0104 — Time-Series Fixed Windows (Tesla)
    Why: sensor-stream aggregation under per-vehicle / per-region skew; tumbling-window patterns.
  • Okta / identity
    Questions: python_175-0002 — Skew-Aware Key Partitioner
    Why: one enterprise tenant = whale; identity-graph sharding under per-org concentration. Direct match for § 04 hot-partition material.
