▸ PERFORMANCE · petabyte-scale SQL · the optimizations FAANG actually uses

Performance — 28 SQL optimizations for the petabyte table.

Not LeetCode. These are the rewrites a senior data engineer reaches for when the table is 5 PB, the cluster bill is real, and "it returns the right answer" is the easy part. Each card is an anti-pattern you'll see in a PR, the optimized rewrite, and the byte-level reason it wins. Dialect notes for Spark SQL, Snowflake, BigQuery and Trino throughout.

The mental model — at petabyte scale you pay three bills. Every optimization on this page reduces at least one of them. If you can name which bill a rewrite cuts, you're already answering the interview question.
Bill №1 · I/O

Bytes scanned

What you read off storage. Cut it with partition pruning, column projection, data skipping and clustering. The cheapest byte is the one you never read.

SCAN: skip before you read 8 files on disk · only 2 read pruning + min/max skipping
Bill №2 · Network

Bytes shuffled

What crosses the wire between stages — joins, GROUP BY, DISTINCT, global sorts. Cut it with broadcast joins, pre-aggregation, bucketing and approximate sketches.

SHUFFLE: rows cross the wire map 1 map 2 map 3 red 1 red 2 red 3 broadcast / pre-agg deletes wires
Bill №3 · Write

Bytes written

What you re-materialize every run. Cut it by computing incrementally — touch only the new partition, MERGE the delta, and compact instead of rewriting.

WRITE: touch only the delta ds-4 ds-3 ds-2 ds-1 today rewrite 1 partition, not 1,825
ONE QUERY · THREE BILLS — WHERE THE MONEY ACTUALLY MOVES Object storage PB on disk ① BYTES SCANNED $ prune · project · skip Stage 1 scan + filter ② BYTES SHUFFLED $$ broadcast · pre-agg · bucket Stage 2 join + aggregate ③ BYTES WRITTEN $ incremental · MERGE Table result Every optimization on this page turns down one of these three meters — name the meter and you've named the win.
The three meters, animated — dashes flow where bytes (and dollars) move; every family below turns one of them down

A useful rule of thumb when you read each rewrite: scan less, shuffle less, write once. The skew cards are the exception — there the total work is fine, it's the distribution of work that kills you, and one straggler holds the whole stage hostage.

Dialect cheatsheet — the same idea, four engines

OptimizationSpark SQLSnowflakeBigQueryTrino / Presto
Force broadcast/*+ BROADCAST(t) */automatic (stats)automatic/*+ BROADCAST */ / join_distribution_type
Top-N per group filtersubquery + WHERE rn<=kQUALIFYQUALIFYQUALIFY (363+)
Approx distinctapprox_count_distinctAPPROX_COUNT_DISTINCTAPPROX_COUNT_DISTINCT / HLLapprox_distinct
Approx percentileapprox_percentileAPPROX_PERCENTILEAPPROX_QUANTILESapprox_percentile
Skew joinAQE skew join / saltautomatic + saltautomaticsalt / partitioning
Data skippingZ-ORDER (Delta)CLUSTER BYclusteringsorted files / Iceberg sort
UpsertMERGE INTO (Delta/Iceberg)MERGEMERGEMERGE (Iceberg/Delta)
§ Family 1 — scan less · the cheapest byte is the one you never read

Pruning, projection & pushdown.

At petabyte scale, storage I/O is usually the dominant line on the bill. These five cards are all variations on one idea: convince the planner to skip files before it opens them. Get this family right and the rest of the query is operating on a thousandth of the data.

№ 01

Never wrap the partition column in a function

partition pruningHive · Spark · Trino

Meta — pull one day from an events table partitioned by ds that lands ~100 TB/day. The query "works" but scans the whole table.

✗ Anti-pattern — full scan
SELECT user_id, event_type
FROM   events
WHERE  CAST(event_ts AS DATE) = DATE '2024-01-01';
-- ds is the partition key, but the planner
-- can't see through CAST(event_ts) → it reads
-- every partition, then filters row by row.
✓ Optimized — pruned to one partition
SELECT user_id, event_type
FROM   events
WHERE  ds = '2024-01-01'              -- partition key, bare
  AND  event_ts >= TIMESTAMP '2024-01-01 00:00:00'
  AND  event_ts <  TIMESTAMP '2024-01-02 00:00:00';

Why it wins. Partition pruning happens in the metastore, against partition values — before a single file is opened. The moment you wrap ds (or a column functionally tied to it) in CAST, DATE(), SUBSTR or string concatenation, the optimizer can no longer map the expression back to partition directories and falls back to a full scan. Keep the partition column bare on one side of the predicate.

Impact: ~100 TB scanned → ~100 GB. Same result, 1000× less I/O, and the job stops being a noisy-neighbour on the cluster.
№ 02

Contiguous range beats a derived bucket or a 365-value IN-list

partition pruningall engines

Amazon — "last 7 days of orders." The partition key is a yyyy-MM-dd string.

✗ Anti-pattern
-- derives the month from the key → can't prune
WHERE SUBSTR(ds,1,7) = '2024-01'
-- or a hand-built list the planner treats opaquely
WHERE ds IN ('2024-01-01', ... 365 values ...)
✓ Optimized
WHERE ds BETWEEN '2024-01-01' AND '2024-01-07'
-- lexical order == date order for yyyy-MM-dd,
-- so the metastore returns exactly 7 directories.

Why it wins. A BETWEEN on a sortable partition key lets the catalog do a range scan over partition metadata and hand back the precise file list. SUBSTR derives a value the planner can't invert; a giant IN-list is evaluated per partition and bloats the query plan. The yyyy-MM-dd convention exists precisely so lexical order equals chronological order.

Impact: the difference between touching 7 partitions and listing/opening thousands.
№ 03

Project columns — SELECT * is a tax on columnar storage

column pruningParquet · ORC

Netflix — a 400-column wide playback-events table in Parquet; the report needs three columns.

✗ Anti-pattern
SELECT *
FROM   playback_events
WHERE  ds = '2024-01-01';
-- reads all 400 column chunks off disk,
-- decompresses them, ships them — to use 3.
✓ Optimized
SELECT user_id, title_id, watch_seconds
FROM   playback_events
WHERE  ds = '2024-01-01';
-- columnar readers fetch only the 3 column
-- chunks; the other 397 are never touched.

Why it wins. Parquet/ORC store data column-by-column, so projection is a genuine I/O reduction, not just a narrower result. SELECT * also defeats column-level predicate pushdown and forces the engine to carry wide rows through every downstream shuffle. Name your columns — always.

Impact: 3 of 400 columns ≈ a 100× cut in bytes scanned, plus lighter shuffles downstream.
№ 04

Filter before the join, not in the outer WHERE

predicate pushdownall engines

Uber — join the trips fact (PB) to a small dim_city, but you only care about one city and one week.

✗ Anti-pattern — join then filter
SELECT t.*, c.city_name
FROM   trips t
JOIN   dim_city c ON c.city_id = t.city_id
WHERE  c.city_name = 'Delhi'
  AND  t.ds BETWEEN '2024-01-01' AND '2024-01-07';
-- with a weak optimizer the whole trips table
-- is shuffled into the join first.
✓ Optimized — shrink each side first
SELECT t.*, c.city_name
FROM  (SELECT * FROM trips
       WHERE ds BETWEEN '2024-01-01' AND '2024-01-07'
         AND city_id = 9) t          -- resolve the id once
JOIN  (SELECT city_id, city_name FROM dim_city
       WHERE city_name = 'Delhi') c
  ON  c.city_id = t.city_id;

Why it wins. The join input is the thing you shuffle, and shuffle volume is what you pay. Modern cost-based optimizers push selective predicates down automatically, but the moment a UDF, an OR across both tables, or a late-resolved value blocks them, you eat a full-table shuffle. Filtering the big side inside the scan guarantees the optimizer can't get it wrong.

Impact: the join sees rows × (date-selectivity × city-selectivity) instead of the whole fact — often a 1000×+ smaller shuffle.
№ 05

Dynamic partition pruning — let a tiny dim prune the giant fact

DPPSpark · BigQuery

Sparkfact_sales partitioned by ds, joined to dim_date to pull only holidays. The filter is on the dim, but the fact still full-scans.

✗ Anti-pattern
SELECT s.*
FROM   fact_sales s
JOIN   dim_date d ON d.ds = s.ds
WHERE  d.is_holiday = true;
-- the holiday filter lives on dim_date; without
-- DPP, Spark scans every ds partition of the fact.
✓ Optimized — push the key set
SET spark.sql.optimizer.dynamicPartitionPruning.enabled=true;
-- ...or make it explicit so any engine prunes:
SELECT s.*
FROM   fact_sales s
WHERE  s.ds IN (SELECT ds FROM dim_date WHERE is_holiday);

Why it wins. DPP runs the small dimension first, collects the surviving partition keys, and injects them as a runtime filter on the fact's partition column — so the fact scan is pruned to just the holiday partitions. When you can't rely on the optimizer, a semi-join (IN (SELECT …) / EXISTS) on the partition key reproduces it deterministically.

Impact: fact scan drops from 365 partitions to ~11 — the join key is the partition key, so pruning is exact.
✦ ✦ ✦
§ Family 2 — shuffle less · the network is the wall

Join strategy.

Once you've stopped scanning needless bytes, the next bill is the shuffle — the all-to-all exchange a sort-merge join, GROUP BY or DISTINCT triggers. A 5 PB shuffle can dominate wall-clock and cost. These five cards keep the big table off the wire.

№ 06

Broadcast the small side of a big-vs-small join

broadcast / map-side joinSpark · Trino

Google — a 5 PB fact_events joined to a 40 MB dim_country. The default plan shuffles the 5 PB side.

✗ Anti-pattern — sort-merge, both sides shuffled
SELECT e.*, c.country_name
FROM   fact_events e
JOIN   dim_country c ON c.country_id = e.country_id;
-- hash-partitions and sorts BOTH inputs by
-- country_id → 5 PB crosses the network.
✓ Optimized — broadcast hash join
SELECT /*+ BROADCAST(c) */ e.*, c.country_name
FROM   fact_events e
JOIN   dim_country c ON c.country_id = e.country_id;
-- 40 MB dim is copied to every executor; the
-- 5 PB fact is probed in place and never moves.

Why it wins. If one side fits in memory, ship it to where the big side already lives. A broadcast hash join turns an O(N log N) shuffle-and-sort into a streaming hash probe with zero exchange on the large input. Snowflake and BigQuery choose this automatically from table statistics; in Spark/Trino you nudge it with a hint (and watch spark.sql.autoBroadcastJoinThreshold).

Impact: eliminates a 5 PB network exchange — frequently the single biggest win in a join-heavy pipeline.
№ 07

Pre-aggregate before joining the big dimension

agg-before-joinall engines

Amazon — revenue per category from 50 B order_items joined to dim_product.

✗ Anti-pattern — join 50B rows, then group
SELECT p.category, SUM(oi.revenue) AS rev
FROM   order_items oi
JOIN   dim_product p ON p.product_id = oi.product_id
GROUP BY p.category;
-- shuffles all 50B item rows through the join.
✓ Optimized — collapse first
WITH per_product AS (
  SELECT product_id, SUM(revenue) AS rev
  FROM   order_items
  GROUP BY product_id)            -- 50B → ~few M rows
SELECT p.category, SUM(pp.rev) AS rev
FROM   per_product pp
JOIN   dim_product p ON p.product_id = pp.product_id
GROUP BY p.category;

Why it wins. SUM is associative, so you can collapse order_items to one row per product before touching the dimension. The join then runs on millions of rows instead of tens of billions, and (bonus) the per-product aggregate is broadcast-eligible. Push aggregation as deep as the algebra allows.

Impact: join input 50 B → ~few M rows; the expensive shuffle moves to the cheap pre-agg stage.
№ 08

Anti-join, not NOT IN — correctness and speed

anti-joinall engines

Stripe — charges that never got refunded. NOT IN over a big subquery is both slow and quietly wrong.

✗ Anti-pattern — NULL-unsafe + un-optimizable
SELECT *
FROM   charges
WHERE  charge_id NOT IN (
         SELECT charge_id FROM refunds);
-- if ANY refunds.charge_id is NULL, the whole
-- predicate is UNKNOWN → ZERO rows returned.
✓ Optimized — left anti-join
SELECT c.*
FROM   charges c
LEFT JOIN refunds r ON r.charge_id = c.charge_id
WHERE  r.charge_id IS NULL;
-- or: WHERE NOT EXISTS (SELECT 1 FROM refunds r
--                       WHERE r.charge_id = c.charge_id)

Why it wins. Two reasons. Correctness: NOT IN with a single NULL in the list returns nothing — a classic production bug. Speed: NOT EXISTS / LEFT JOIN … IS NULL compile to a hash anti-join (one shuffle, streamed), while NOT IN often degrades to a correlated or nested-loop check. NOT EXISTS is the safe default.

Impact: a NULL-safe single-shuffle anti-join — and it doesn't silently drop every row when a NULL sneaks in.
№ 09

Bucket / cluster co-located tables to delete the shuffle

bucketing · SMB joinSpark · Hive · Iceberg

LinkedIn — two large facts joined on member_id in a daily pipeline, so the same shuffle runs every single night.

✗ Anti-pattern — re-shuffle every run
SELECT a.*, b.*
FROM   member_events a
JOIN   member_profile b ON b.member_id = a.member_id;
-- both sides hash-partitioned by member_id from
-- scratch on every execution.
✓ Optimized — bucket once, join free forever
-- one-time DDL: same key, same bucket count
CREATE TABLE member_events (...)
  CLUSTERED BY (member_id) INTO 1024 BUCKETS;
CREATE TABLE member_profile (...)
  CLUSTERED BY (member_id) INTO 1024 BUCKETS;
-- now the join is sort-merge-bucket: no shuffle.

Why it wins. Bucketing physically lays each table's rows into N hash buckets on the join key. When both tables share the key and bucket count, matching buckets are already co-located, so the engine does a local sort-merge with no exchange. You pay the partitioning cost once at write time and bank it on every subsequent join.

Impact: a recurring full shuffle becomes shuffle-free — compounding savings on anything that runs on a schedule.
№ 10

Replace a self-join with a window aggregate

window vs self-joinall engines

Robinhood — each trade's running total of volume per account. The textbook self-join is quadratic.

✗ Anti-pattern — O(n²) self-join
SELECT t.trade_id, SUM(p.qty) AS running_qty
FROM   trades t
JOIN   trades p
  ON  p.account_id = t.account_id
 AND  p.ts <= t.ts
GROUP BY t.trade_id;
-- every trade re-joins all prior trades; a hot
-- account with 10M trades explodes the join.
✓ Optimized — one ordered pass
SELECT trade_id,
       SUM(qty) OVER (PARTITION BY account_id
                      ORDER BY ts
                      ROWS UNBOUNDED PRECEDING) AS running_qty
FROM   trades;

Why it wins. An inequality self-join (p.ts <= t.ts) materializes a triangular cross-product — O(n²) per group. A window function sorts each partition once and sweeps it in a single linear pass. Any "running / cumulative / compare-to-neighbour" requirement is a window function, never a self-join.

Impact: O(n²) → O(n log n); a 10 M-trade account goes from "never finishes" to seconds.
✦ ✦ ✦
§ Family 3 — beat skew · the whale problem

When one key owns half the rows.

Skew is the optimization juniors miss because the query is "correct" and the plan looks fine — yet one task runs for three hours while the other 999 finished in thirty seconds. The total work is reasonable; its distribution is not. (Deeper operational treatment on the Hot Shards & Data Skew page.)

№ 11

Salt the hot join key

saltingSpark · Trino

Meta — join engagements to users where 0.1% of users (celebrities) own 50% of rows. The reducer that owns a celebrity key is the straggler that holds the stage.

✗ Anti-pattern — one reducer melts
SELECT u.name, COUNT(*) AS engagements
FROM   engagements e
JOIN   users u ON u.user_id = e.user_id
GROUP BY u.name;
-- all of @celebrity's rows hash to ONE partition.
✓ Optimized — spread the hot key N ways
WITH e AS (        -- salt the skewed (fact) side
  SELECT *, CAST(FLOOR(RAND()*16) AS INT) AS salt
  FROM engagements),
u AS (             -- fan the small side out to match
  SELECT u.*, s.salt
  FROM users u
  CROSS JOIN UNNEST(SEQUENCE(0,15)) AS s(salt))
SELECT u.name, COUNT(*) AS engagements
FROM   e JOIN u
  ON  u.user_id = e.user_id AND u.salt = e.salt
GROUP BY u.name;

Why it wins. Appending a random salt to the hot key splits its rows across N partitions instead of one; replicating the small side N times keeps the join correct. Sixteen evenly-loaded tasks beat one task doing 50% of the work. You only need to salt the keys that are actually hot — salt everything and you just inflate the small side.

Impact: the 3-hour straggler becomes sixteen ~12-minute tasks; the stage finishes when the median task does, not the max.
№ 12

Let AQE rebalance skew automatically

adaptive query executionSpark 3+

Airbnb — same skewed join, but you'd rather not hand-salt every pipeline. Let the engine split hot partitions at runtime.

✗ Anti-pattern — static plan, blind to skew
-- AQE off: the plan is fixed at compile time;
-- a partition that turns out 200× the median
-- size still runs as a single task.
SET spark.sql.adaptive.enabled=false;
✓ Optimized — runtime skew split
SET spark.sql.adaptive.enabled=true;
SET spark.sql.adaptive.skewJoin.enabled=true;
-- Spark measures actual partition sizes mid-query
-- and splits any partition > skewedPartitionFactor ×
-- median into sub-partitions, each its own task.

Why it wins. AQE re-plans using real shuffle statistics instead of compile-time estimates: it splits oversized partitions, coalesces tiny ones, and can flip a sort-merge to a broadcast once it sees the true sizes. It's the zero-code first move for skew — reach for manual salting (№11) only when AQE isn't available or isn't enough.

Impact: stragglers handled by the engine, no query rewrite — turn it on by default on Spark 3+.
№ 13

Isolate the hot keys and union them back

skew isolationall engines

TikTok — a handful of mega-viral creators dominate. You know exactly who they are, so handle them on a separate path.

✗ Anti-pattern — one path for all keys
SELECT creator_id, SUM(views)
FROM   video_events
GROUP BY creator_id;
-- the top 20 creators each get a giant group
-- while millions of long-tail creators are tiny.
✓ Optimized — split hot vs cold
-- cold tail: normal aggregation
SELECT creator_id, SUM(views) AS views
FROM   video_events
WHERE  creator_id NOT IN (SELECT creator_id FROM hot_creators)
GROUP BY creator_id
UNION ALL
-- hot keys: pre-split with a salt, then re-sum
SELECT creator_id, SUM(views) AS views FROM (
  SELECT creator_id,
         SUM(views) AS views
  FROM   video_events
  WHERE  creator_id IN (SELECT creator_id FROM hot_creators)
  GROUP BY creator_id, CAST(FLOOR(RAND()*64) AS INT)
) t GROUP BY creator_id;

Why it wins. When the hot set is small and known, you don't have to penalise the whole query. The long tail aggregates normally; the few whales get a salted two-phase aggregation. Same idea as salting, but surgically applied so you don't replicate the dimension for keys that were never hot.

Impact: balanced stages with no broad replication cost; the 20 whales stop dictating the runtime of millions of minnows.
✦ ✦ ✦
§ Family 4 — aggregate smarter · approximate & multi-grain

Don't count the hard way.

The most expensive aggregates are the exact-distinct and exact-percentile ones, because they need to see — and shuffle — every value. At petabyte scale, "within 2%" is almost always good enough for a dashboard, and one pass can answer many grains at once.

№ 14

APPROX_COUNT_DISTINCT for DAU/MAU — HyperLogLog

HyperLogLogSpark · Snowflake · BQ · Trino

Meta — distinct daily active users over a 100 B-row event table. Exact COUNT(DISTINCT) must shuffle every user_id to dedupe it.

✗ Anti-pattern — exact distinct, huge shuffle
SELECT ds, COUNT(DISTINCT user_id) AS dau
FROM   events
GROUP BY ds;
-- every user_id is shuffled and held in memory
-- to be deduplicated, per day.
✓ Optimized — HLL sketch, ~2% error
SELECT ds, APPROX_COUNT_DISTINCT(user_id) AS dau
FROM   events
GROUP BY ds;
-- even better, persist a mergeable sketch per day:
-- HLL_SKETCH_AGG → MAU = merge of 30 daily sketches,
-- no re-scan of the raw events.

Why it wins. HyperLogLog estimates cardinality from a fixed-size sketch (KBs) instead of materialising every distinct value, so the shuffle and memory footprint are tiny and bounded. The killer feature is that sketches are mergeable: store one per day and you can compute any week/month/quarter by unioning sketches — never rescanning raw events. Standard error ~1–2%.

Impact: orders of magnitude cheaper, and rolling windows (MAU, L28) become a sketch merge instead of a re-aggregation.
№ 15

APPROX_PERCENTILE for p95 latency — t-digest

t-digest quantilesSpark · Snowflake · BQ · Trino

Datadog — p50/p95/p99 request latency over trillions of spans for an SLO dashboard.

✗ Anti-pattern — exact percentile = global sort
SELECT service,
       PERCENTILE_CONT(0.95)
         WITHIN GROUP (ORDER BY latency_ms) AS p95
FROM   spans
GROUP BY service;
-- exact quantile must sort the entire group.
✓ Optimized — bounded-memory sketch
SELECT service,
       APPROX_PERCENTILE(latency_ms, 0.95) AS p95
FROM   spans
GROUP BY service;
-- BigQuery: APPROX_QUANTILES(latency_ms, 100)[OFFSET(95)]

Why it wins. Exact percentiles need the whole group ordered; a t-digest/Q-digest sketch summarises the distribution in bounded memory with high accuracy in the tails (exactly where p95/p99 live). It's the default for any latency or SLO panel — nobody sorts a trillion spans to draw a line chart.

Impact: no global sort, constant memory per group; the standard pattern behind every latency dashboard.
№ 16

GROUPING SETS — every grain in one scan

grouping sets · rollupall engines

Amazon — revenue by country, by country×category, and the grand total. Three reports, ideally one scan.

✗ Anti-pattern — three scans, UNION-ed
SELECT country, NULL cat, SUM(rev) FROM f GROUP BY country
UNION ALL
SELECT country, category, SUM(rev) FROM f GROUP BY country,category
UNION ALL
SELECT NULL, NULL, SUM(rev) FROM f;
-- scans the PB fact three times.
✓ Optimized — one pass, all grains
SELECT country, category, SUM(rev) AS rev
FROM   f
GROUP BY GROUPING SETS (
  (country, category),
  (country),
  ()                      -- grand total
);

Why it wins. GROUPING SETS (and its shorthands ROLLUP/CUBE) compute multiple aggregation grains from a single read of the fact, sharing the scan and much of the shuffle. Each extra grain you'd have UNION-ed is another full PB scan you just deleted.

Impact: 3 PB-scans → 1. Use GROUPING() to tell the subtotal rows apart.
№ 17

Map-side pre-aggregation cuts the shuffle (the combiner)

partial aggregationSpark · Trino

Cloudflare — request counts grouped by a high-cardinality key over a firehose. Every raw row crossing the wire is wasted.

✗ Anti-pattern — distinct-then-count, all rows shuffled
SELECT zone_id, COUNT(DISTINCT ip) AS uniques
FROM   requests
GROUP BY zone_id;
-- (zone_id, ip) pairs all shuffle so DISTINCT
-- can dedupe globally.
✓ Optimized — pre-collapse, then count
SELECT zone_id, COUNT(*) AS uniques
FROM ( SELECT zone_id, ip
       FROM   requests
       GROUP BY zone_id, ip )   -- map-side combine
GROUP BY zone_id;
-- or just approx_count_distinct(ip) (see №14)

Why it wins. Algebraic aggregates (SUM, COUNT, MIN/MAX) get a free map-side combine — each task pre-aggregates locally and ships only partial results, so shuffle volume drops by the group-collapse factor. COUNT(DISTINCT) doesn't combine for free; rewriting it as group-then-count restores the partial-aggregation, and an HLL sketch removes the shuffle entirely.

Impact: shuffle shrinks from "every row" to "every distinct group key" — often 100×+.
№ 18

Materialize daily rollups — dashboards read the rollup, not raw events

pre-aggregationall engines

Every FAANG — a 90-day trend tile that re-scans the raw event firehose on every single dashboard load.

✗ Anti-pattern — dashboard scans raw events
-- runs on every page load, ×100s of viewers
SELECT ds, COUNT(*) AS events, SUM(rev) AS rev
FROM   raw_events
WHERE  ds >= DATE_SUB(CURRENT_DATE, 90)
GROUP BY ds;
✓ Optimized — build once, read forever
-- scheduled job, once a day:
INSERT OVERWRITE TABLE agg_daily PARTITION (ds)
SELECT ds, COUNT(*) events, SUM(rev) rev, ...
FROM   raw_events WHERE ds = CURRENT_DATE GROUP BY ds;
-- dashboard query → scans one tiny table:
SELECT ds, events, rev FROM agg_daily
WHERE ds >= DATE_SUB(CURRENT_DATE, 90);

Why it wins. The single highest-leverage move in analytics: pay the scan once in a scheduled job, then serve every dashboard from a table that's one row per day×dimension. A 90-day tile reads kilobytes. (This is the bridge into the Analytics, Dashboard & Optimization pillar, where the whole serving layer is built on this idea.)

Impact: PB scanned per load → KB. With 200 viewers a day, that's the difference between a five-figure monthly bill and a rounding error.
✦ ✦ ✦
§ Family 5 — windows beat self-joins · dedup & top-N

One ordered pass where juniors write a join.

Three of the most common interview prompts — top-N per group, latest-record-per-key, and sessionization — have a tempting self-join answer that's quadratic, and a window-function answer that's linear. Knowing the window form cold is a senior signal.

№ 19

QUALIFY ROW_NUMBER() for top-N per group

top-N per groupSnowflake · BQ · Trino

Netflix — the top-3 titles per user by watch time.

✗ Anti-pattern — correlated count subquery
SELECT *
FROM   watch w
WHERE (SELECT COUNT(*) FROM watch w2
       WHERE w2.user_id = w.user_id
         AND w2.secs > w.secs) < 3;
-- correlated subquery re-scans per row.
✓ Optimized — window + QUALIFY
SELECT *
FROM   watch
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id
                           ORDER BY secs DESC) <= 3;
-- Spark/PG: wrap it and filter:
-- SELECT * FROM (… ROW_NUMBER() AS rn …) WHERE rn <= 3

Why it wins. QUALIFY filters on a window result without a subquery; one partition-and-sort answers the whole top-N. The correlated-count form re-scans the group for every row — quadratic. Use RANK() if you want ties to share a place, ROW_NUMBER() for a hard cut.

Impact: O(n²) correlated scans → a single windowed pass.
№ 20

Latest-record-per-key dedup (CDC) with a tiebreaker

dedup · CDCall engines

Snowflake — collapse an append-only change log to the current row per primary key.

✗ Anti-pattern — correlated MAX, non-deterministic
SELECT *
FROM   changes c
WHERE  c.updated_at = (
  SELECT MAX(updated_at) FROM changes c2
  WHERE c2.pk = c.pk);
-- re-scans per pk, AND returns 2 rows when two
-- updates share the same updated_at.
✓ Optimized — deterministic window
SELECT *
FROM   changes
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY pk
  ORDER BY updated_at DESC, op_seq DESC) = 1;
-- op_seq breaks ties → exactly one row per pk.

Why it wins. One window pass replaces a correlated subquery, and the secondary sort key (op_seq, an offset, a monotonic id) guarantees exactly one winner per key even when timestamps collide — a real bug in high-throughput CDC. This is the canonical "current state from a change stream" pattern.

Impact: single scan + window, and deterministic — no duplicate keys leaking downstream.
№ 21

Sessionize with LAG + running sum, not a self-join

gaps & islandsall engines

Spotify — split each user's event stream into sessions with a 30-minute inactivity gap.

✗ Anti-pattern — self-join to find gaps
-- join each event to its predecessor by a
-- correlated subquery / inequality self-join,
-- then diff timestamps → O(n²) per user.
✓ Optimized — two linear window passes
SELECT *,
  SUM(is_new) OVER (PARTITION BY user_id
                    ORDER BY ts) AS session_id
FROM (
  SELECT *,
    CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id
                                 ORDER BY ts)
              > INTERVAL '30' MINUTE
         THEN 1 ELSE 0 END AS is_new
  FROM events
) s;

Why it wins. LAG gives each event its predecessor's timestamp in one pass; a gap over the threshold flags a session boundary; a running SUM of those flags numbers the sessions. Two ordered sweeps replace a quadratic self-join — the textbook gaps-and-islands solution.

Impact: linear sessionization that scales to a user with millions of events.
№ 22

Push the LIMIT — don't globally sort a PB to take 100 rows

top-N / limit pushdownall engines

Pinterest — "top 100 pins by saves this week" off a huge fact.

✗ Anti-pattern — sort the world, then cut
SELECT pin_id, saves
FROM   wide_join_of_everything   -- PB of rows
ORDER BY saves DESC
LIMIT 100;
-- a global sort over a PB to keep 100 rows.
✓ Optimized — aggregate, then top-N heap
SELECT pin_id, SUM(saves) AS saves
FROM   save_events
WHERE  ds BETWEEN '2024-01-01' AND '2024-01-07'
GROUP BY pin_id
ORDER BY saves DESC
LIMIT 100;          -- engine keeps a 100-entry heap
                    -- per partition, then merges.

Why it wins. ORDER BY … LIMIT k over an aggregated input lets the engine maintain a bounded top-k heap per partition and merge them — no full global sort. Sorting a raw PB-wide join to discard all but 100 rows is pure waste; collapse to the grain first, filter early, and only then take the top-N.

Impact: a bounded-memory partial top-N instead of a full distributed sort of petabytes.
✦ ✦ ✦
§ Family 6 — compute once · incremental, MERGE, compaction

Stop recomputing five years every night.

The third bill is bytes written. The most common waste in a scheduled pipeline is recomputing the entire history when only today changed. These cards make write cost scale with the delta, not the table.

№ 23

Incremental partition overwrite, not full recompute

incrementalall engines

Airbnb — a daily aggregate over a 5-year fact, rebuilt from scratch every night.

✗ Anti-pattern — recompute all of history
INSERT OVERWRITE TABLE agg
SELECT ds, dim, SUM(x)
FROM   fact            -- all 1,825 days, nightly
GROUP BY ds, dim;
✓ Optimized — only the new partition
INSERT OVERWRITE TABLE agg PARTITION (ds = '2024-01-08')
SELECT '2024-01-08' AS ds, dim, SUM(x)
FROM   fact
WHERE  ds = '2024-01-08'      -- read & write one day
GROUP BY dim;
-- dynamic-partition variant handles late data:
-- WHERE ds >= CURRENT_DATE - INTERVAL '3' DAY

Why it wins. Yesterday's partitions didn't change, so reading and rewriting them is pure waste. Overwriting only the affected partition makes both read and write O(one day) instead of O(history). For late-arriving data, reprocess a small trailing window (e.g., last 3 days) rather than everything.

Impact: a nightly job that grew linearly with history becomes flat — runtime and cost stop creeping up as the table ages.
№ 24

MERGE the delta for upserts / SCD, not a full rebuild

MERGE · upsertSnowflake · Delta · Iceberg · BQ

Stripe — apply a batch of CDC changes to a dimension on a lakehouse table.

✗ Anti-pattern — rebuild the whole table
CREATE OR REPLACE TABLE dim_customer AS
SELECT ... FROM dim_customer
  -- hand-merge old + new by UNION/JOIN, rewrite
  -- every row even though 0.1% changed.
✓ Optimized — touch only changed files
MERGE INTO dim_customer d
USING staged_changes s ON d.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'U'
  THEN UPDATE SET d.email = s.email, d.updated_at = s.ts
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN NOT MATCHED
  THEN INSERT (customer_id, email, updated_at)
       VALUES (s.customer_id, s.email, s.ts);

Why it wins. On Delta/Iceberg/Snowflake, MERGE rewrites only the data files that contain matched keys, leaving the rest untouched. Write amplification scales with the size of the change set, not the table — and it's one atomic statement instead of a fragile rebuild-and-swap.

Impact: a 0.1% daily change set rewrites ~0.1% of the files, not 100%.
№ 25

Compact small files — stop dying by a thousand opens

small-files problemDelta · Iceberg · Hive

DoorDash — a streaming ingest writes millions of 1 MB files a day; downstream batch reads spend all their time listing and opening files.

✗ Anti-pattern — 1M × 1MB files
-- micro-batch every few seconds, no compaction:
-- read planning lists millions of files, opens
-- each one, reads a tiny footer — overhead >>
-- actual data read.
✓ Optimized — compact to ~128–512 MB
-- Delta
OPTIMIZE events WHERE ds = '2024-01-08';
-- Iceberg
CALL system.rewrite_data_files('events');
-- plain Spark/Hive
INSERT OVERWRITE TABLE events PARTITION (ds='2024-01-08')
SELECT /*+ COALESCE(64) */ * FROM events WHERE ds='2024-01-08';

Why it wins. Every file carries fixed per-file overhead — a list entry, an open, a footer read, a task. A million tiny files means the engine spends its life on bookkeeping instead of data. Compacting into a few hundred right-sized files (≈128–512 MB) collapses planning and read time and restores healthy parallelism.

Impact: file count ↓ ~1000×; query planning alone often drops from minutes to seconds.
✦ ✦ ✦
§ Family 7 — storage & cost traps · the silent full scans

Clustering, JSON & the cast that kills pruning.

The last family is the set of quiet mistakes that disable everything above — a layout that prevents skipping, a JSON blob re-parsed a million times, and the single most common pruning-killer in real code: an implicit type cast on a key.

№ 26

Cluster / Z-order by your filter columns for data skipping

clustering · data skippingSnowflake · Delta · BQ

Snowflake — a huge table partitioned by date but almost always filtered by (event_date, country). Within a day, countries are scattered across every micro-partition.

✗ Anti-pattern — country scattered, no skipping
SELECT ...
FROM   events
WHERE  ds = '2024-01-08' AND country = 'IN';
-- country='IN' rows live in EVERY micro-partition
-- of that day → min/max can't skip any of them.
✓ Optimized — co-locate by the filter key
ALTER TABLE events CLUSTER BY (ds, country);
-- Delta:  OPTIMIZE events ZORDER BY (country);
-- BigQuery: CLUSTER BY country at table creation.
-- now each file holds a narrow country range →
-- min/max stats skip the non-'IN' files.

Why it wins. Engines keep per-file/micro-partition min/max stats and skip any file whose range can't match the predicate. That only helps if matching rows are physically clustered; clustering or Z-ordering by your common filter columns narrows each file's range so data skipping actually fires. Partitioning gives coarse pruning; clustering gives fine-grained skipping inside the partition.

Impact: a within-partition scan of everything becomes a read of the handful of files that hold country='IN'.
№ 27

Flatten JSON once into typed columns — don't re-parse per query

semi-structured ELTSnowflake · Spark · BQ

Shopify — a raw payload JSON string that ten different dashboards each parse on the fly over raw events.

✗ Anti-pattern — parse JSON in every query
SELECT
  GET_PATH(PARSE_JSON(payload), 'checkout.total') AS total,
  GET_PATH(PARSE_JSON(payload), 'checkout.currency') AS ccy
FROM   raw_events
WHERE  ds = '2024-01-08';
-- parse + path-extract on every row, every query,
-- and the blob defeats column pruning/skipping.
✓ Optimized — extract once in ELT
-- one curated table, built once per partition:
CREATE TABLE checkout_facts AS
SELECT order_id, ds,
       payload:checkout.total::NUMBER   AS total,
       payload:checkout.currency::STRING AS currency
FROM   raw_events;
-- dashboards now read typed columns:
SELECT total, currency FROM checkout_facts WHERE ds='2024-01-08';

Why it wins. JSON parsing is CPU-heavy and a string blob can't be column-pruned or min/max-skipped. Doing the extraction once in an ELT step turns hot paths into typed, columnar fields — parse cost is paid a single time, and every downstream query gets projection and data skipping back.

Impact: N dashboards × per-row parse → one parse at write time; reads become cheap columnar scans.
№ 28

Don't cast the join / partition key — implicit casts kill pruning

type alignmentall engines

Visa — joining a STRING user_id to a BIGINT user_id, and filtering a string partition key with an integer literal. Both quietly disable optimizations.

✗ Anti-pattern — implicit cast on the key
-- partition key ds is STRING; literal is INT →
-- engine casts the COLUMN, pruning is disabled:
WHERE ds = 20240108

-- join keys differ in type → re-hash both sides,
-- bucketing/co-location no longer applies:
ON CAST(a.user_id AS STRING) = b.user_id
✓ Optimized — compare like with like
-- compare the partition column to its OWN type:
WHERE ds = '20240108'

-- align key types at modeling time, then join bare:
ON a.user_id = b.user_id        -- both BIGINT

Why it wins. When you compare a column to a value of a different type, the engine usually casts the column — and a cast wrapped around a partition key disables partition pruning (same trap as №1), while a cast on a join key forces a re-hash that defeats bucketing and broadcast co-location. Fix the types in the model so every key comparison is bare. It's the highest "silent tax" per character of SQL.

Impact: restores partition pruning and shuffle-free joins that an invisible one-character type mismatch had switched off.
✦ ✦ ✦
§ The 60-second articulation

How to say it in the interview.

When an interviewer drops a slow query in front of you, narrate the diagnosis in the order the bytes flow — it signals you think in physical plans, not just SQL syntax:

"First I'd cut what we scan — is the partition key bare, are we projecting only needed columns, can a dim prune the fact? Then what we shuffle — broadcast the small side, pre-aggregate before the join, and check for a skewed key that needs salting or AQE. Then how often we write it — can this be incremental or a MERGE instead of a full recompute? And for the metrics themselves, an approximate distinct or percentile is almost always good enough at this scale. I'd confirm each hypothesis against the query plan and the bytes-scanned number before and after."

That paragraph maps one-to-one onto the seven families above. The named techniques — partition pruning, broadcast join, salting, HyperLogLog, QUALIFY, incremental MERGE, clustering — are the vocabulary that reads as senior. The discipline of always asking "which of the three bills does this cut?" is what reads as staff.

Keep going → Performance is the compute layer — making one petabyte query cheap. The next pillar, Analytics, Dashboard & Optimization, is the serving layer: turning that warehouse into sub-second dashboards with rollups, materialized views, a semantic layer, and OLAP engines. And Design covers the schemas these queries run on.

← Back to Practice · Q&A  ·  ↑ Top