Performance — 28 SQL optimizations for the petabyte table.
Not LeetCode. These are the rewrites a senior data engineer reaches for when the table is 5 PB, the cluster bill is real, and "it returns the right answer" is the easy part. Each card is an anti-pattern you'll see in a PR, the optimized rewrite, and the byte-level reason it wins. Dialect notes for Spark SQL, Snowflake, BigQuery and Trino throughout.
Bytes scanned
What you read off storage. Cut it with partition pruning, column projection, data skipping and clustering. The cheapest byte is the one you never read.
Bytes shuffled
What crosses the wire between stages — joins, GROUP BY, DISTINCT, global sorts. Cut it with broadcast joins, pre-aggregation, bucketing and approximate sketches.
Bytes written
What you re-materialize every run. Cut it by computing incrementally — touch only the new partition, MERGE the delta, and compact instead of rewriting.
A useful rule of thumb when you read each rewrite: scan less, shuffle less, write once. The skew cards are the exception — there the total work is fine, it's the distribution of work that kills you, and one straggler holds the whole stage hostage.
Dialect cheatsheet — the same idea, four engines
| Optimization | Spark SQL | Snowflake | BigQuery | Trino / Presto |
|---|---|---|---|---|
| Force broadcast | /*+ BROADCAST(t) */ | automatic (stats) | automatic | /*+ BROADCAST */ / join_distribution_type |
| Top-N per group filter | subquery + WHERE rn<=k | QUALIFY | QUALIFY | QUALIFY (363+) |
| Approx distinct | approx_count_distinct | APPROX_COUNT_DISTINCT | APPROX_COUNT_DISTINCT / HLL | approx_distinct |
| Approx percentile | approx_percentile | APPROX_PERCENTILE | APPROX_QUANTILES | approx_percentile |
| Skew join | AQE skew join / salt | automatic + salt | automatic | salt / partitioning |
| Data skipping | Z-ORDER (Delta) | CLUSTER BY | clustering | sorted files / Iceberg sort |
| Upsert | MERGE INTO (Delta/Iceberg) | MERGE | MERGE | MERGE (Iceberg/Delta) |
The seven families
- Scan less — pruning, projection, pushdown (№1–5)
- Shuffle less — join strategy (№6–10)
- Beat skew — the whale problem (№11–13)
- Aggregate smarter — approximate & multi-grain (№14–18)
- Windows beat self-joins — dedup & top-N (№19–22)
- Compute once — incremental, MERGE, compaction (№23–25)
- Storage & cost traps — clustering, JSON, casts (№26–28)
Read fewer bytes
Stop the full scan before it starts: prune partitions, project columns, push predicates, let a dim prune the fact.
Keep data off the wire
Broadcast the small side, pre-aggregate before the join, co-bucket, and anti-join instead of NOT IN.
Kill the straggler
One celebrity key owns half the rows. Salt it, split it, or let AQE rebalance — so no single task runs for hours.
Don't count the hard way
HyperLogLog distinct, t-digest percentiles, grouping sets in one pass, and rollups that turn PB into KB.
One pass, not O(n²)
QUALIFY top-N, latest-per-key dedup, and gaps-and-islands sessionization — windows where juniors reach for self-joins.
Touch only the delta
Incremental partitions, MERGE upserts, and compaction — stop recomputing five years of history every night.
The silent full scans
Clustering for data skipping, flatten JSON once, and the implicit cast that quietly disables every pruning optimization.
Pruning, projection & pushdown.
At petabyte scale, storage I/O is usually the dominant line on the bill. These five cards are all variations on one idea: convince the planner to skip files before it opens them. Get this family right and the rest of the query is operating on a thousandth of the data.
Never wrap the partition column in a function
Meta — pull one day from an events table partitioned by ds that lands ~100 TB/day. The query "works" but scans the whole table.
SELECT user_id, event_type
FROM events
WHERE CAST(event_ts AS DATE) = DATE '2024-01-01';
-- ds is the partition key, but the planner
-- can't see through CAST(event_ts) → it reads
-- every partition, then filters row by row.SELECT user_id, event_type
FROM events
WHERE ds = '2024-01-01' -- partition key, bare
AND event_ts >= TIMESTAMP '2024-01-01 00:00:00'
AND event_ts < TIMESTAMP '2024-01-02 00:00:00';Why it wins. Partition pruning happens in the metastore, against partition values — before a single file is opened. The moment you wrap ds (or a column functionally tied to it) in CAST, DATE(), SUBSTR or string concatenation, the optimizer can no longer map the expression back to partition directories and falls back to a full scan. Keep the partition column bare on one side of the predicate.
Contiguous range beats a derived bucket or a 365-value IN-list
Amazon — "last 7 days of orders." The partition key is a yyyy-MM-dd string.
-- derives the month from the key → can't prune
WHERE SUBSTR(ds,1,7) = '2024-01'
-- or a hand-built list the planner treats opaquely
WHERE ds IN ('2024-01-01', ... 365 values ...)WHERE ds BETWEEN '2024-01-01' AND '2024-01-07'
-- lexical order == date order for yyyy-MM-dd,
-- so the metastore returns exactly 7 directories.Why it wins. A BETWEEN on a sortable partition key lets the catalog do a range scan over partition metadata and hand back the precise file list. SUBSTR derives a value the planner can't invert; a giant IN-list is evaluated per partition and bloats the query plan. The yyyy-MM-dd convention exists precisely so lexical order equals chronological order.
Project columns — SELECT * is a tax on columnar storage
Netflix — a 400-column wide playback-events table in Parquet; the report needs three columns.
SELECT *
FROM playback_events
WHERE ds = '2024-01-01';
-- reads all 400 column chunks off disk,
-- decompresses them, ships them — to use 3.SELECT user_id, title_id, watch_seconds
FROM playback_events
WHERE ds = '2024-01-01';
-- columnar readers fetch only the 3 column
-- chunks; the other 397 are never touched.Why it wins. Parquet/ORC store data column-by-column, so projection is a genuine I/O reduction, not just a narrower result. SELECT * also defeats column-level predicate pushdown and forces the engine to carry wide rows through every downstream shuffle. Name your columns — always.
Filter before the join, not in the outer WHERE
Uber — join the trips fact (PB) to a small dim_city, but you only care about one city and one week.
SELECT t.*, c.city_name
FROM trips t
JOIN dim_city c ON c.city_id = t.city_id
WHERE c.city_name = 'Delhi'
AND t.ds BETWEEN '2024-01-01' AND '2024-01-07';
-- with a weak optimizer the whole trips table
-- is shuffled into the join first.SELECT t.*, c.city_name
FROM (SELECT * FROM trips
WHERE ds BETWEEN '2024-01-01' AND '2024-01-07'
AND city_id = 9) t -- resolve the id once
JOIN (SELECT city_id, city_name FROM dim_city
WHERE city_name = 'Delhi') c
ON c.city_id = t.city_id;Why it wins. The join input is the thing you shuffle, and shuffle volume is what you pay. Modern cost-based optimizers push selective predicates down automatically, but the moment a UDF, an OR across both tables, or a late-resolved value blocks them, you eat a full-table shuffle. Filtering the big side inside the scan guarantees the optimizer can't get it wrong.
Dynamic partition pruning — let a tiny dim prune the giant fact
Spark — fact_sales partitioned by ds, joined to dim_date to pull only holidays. The filter is on the dim, but the fact still full-scans.
SELECT s.*
FROM fact_sales s
JOIN dim_date d ON d.ds = s.ds
WHERE d.is_holiday = true;
-- the holiday filter lives on dim_date; without
-- DPP, Spark scans every ds partition of the fact.SET spark.sql.optimizer.dynamicPartitionPruning.enabled=true;
-- ...or make it explicit so any engine prunes:
SELECT s.*
FROM fact_sales s
WHERE s.ds IN (SELECT ds FROM dim_date WHERE is_holiday);Why it wins. DPP runs the small dimension first, collects the surviving partition keys, and injects them as a runtime filter on the fact's partition column — so the fact scan is pruned to just the holiday partitions. When you can't rely on the optimizer, a semi-join (IN (SELECT …) / EXISTS) on the partition key reproduces it deterministically.
Join strategy.
Once you've stopped scanning needless bytes, the next bill is the shuffle — the all-to-all exchange a sort-merge join, GROUP BY or DISTINCT triggers. A 5 PB shuffle can dominate wall-clock and cost. These five cards keep the big table off the wire.
Broadcast the small side of a big-vs-small join
Google — a 5 PB fact_events joined to a 40 MB dim_country. The default plan shuffles the 5 PB side.
SELECT e.*, c.country_name
FROM fact_events e
JOIN dim_country c ON c.country_id = e.country_id;
-- hash-partitions and sorts BOTH inputs by
-- country_id → 5 PB crosses the network.SELECT /*+ BROADCAST(c) */ e.*, c.country_name
FROM fact_events e
JOIN dim_country c ON c.country_id = e.country_id;
-- 40 MB dim is copied to every executor; the
-- 5 PB fact is probed in place and never moves.Why it wins. If one side fits in memory, ship it to where the big side already lives. A broadcast hash join turns an O(N log N) shuffle-and-sort into a streaming hash probe with zero exchange on the large input. Snowflake and BigQuery choose this automatically from table statistics; in Spark/Trino you nudge it with a hint (and watch spark.sql.autoBroadcastJoinThreshold).
Pre-aggregate before joining the big dimension
Amazon — revenue per category from 50 B order_items joined to dim_product.
SELECT p.category, SUM(oi.revenue) AS rev
FROM order_items oi
JOIN dim_product p ON p.product_id = oi.product_id
GROUP BY p.category;
-- shuffles all 50B item rows through the join.WITH per_product AS (
SELECT product_id, SUM(revenue) AS rev
FROM order_items
GROUP BY product_id) -- 50B → ~few M rows
SELECT p.category, SUM(pp.rev) AS rev
FROM per_product pp
JOIN dim_product p ON p.product_id = pp.product_id
GROUP BY p.category;Why it wins. SUM is associative, so you can collapse order_items to one row per product before touching the dimension. The join then runs on millions of rows instead of tens of billions, and (bonus) the per-product aggregate is broadcast-eligible. Push aggregation as deep as the algebra allows.
Anti-join, not NOT IN — correctness and speed
Stripe — charges that never got refunded. NOT IN over a big subquery is both slow and quietly wrong.
SELECT *
FROM charges
WHERE charge_id NOT IN (
SELECT charge_id FROM refunds);
-- if ANY refunds.charge_id is NULL, the whole
-- predicate is UNKNOWN → ZERO rows returned.SELECT c.*
FROM charges c
LEFT JOIN refunds r ON r.charge_id = c.charge_id
WHERE r.charge_id IS NULL;
-- or: WHERE NOT EXISTS (SELECT 1 FROM refunds r
-- WHERE r.charge_id = c.charge_id)Why it wins. Two reasons. Correctness: NOT IN with a single NULL in the list returns nothing — a classic production bug. Speed: NOT EXISTS / LEFT JOIN … IS NULL compile to a hash anti-join (one shuffle, streamed), while NOT IN often degrades to a correlated or nested-loop check. NOT EXISTS is the safe default.
Bucket / cluster co-located tables to delete the shuffle
LinkedIn — two large facts joined on member_id in a daily pipeline, so the same shuffle runs every single night.
SELECT a.*, b.*
FROM member_events a
JOIN member_profile b ON b.member_id = a.member_id;
-- both sides hash-partitioned by member_id from
-- scratch on every execution.-- one-time DDL: same key, same bucket count
CREATE TABLE member_events (...)
CLUSTERED BY (member_id) INTO 1024 BUCKETS;
CREATE TABLE member_profile (...)
CLUSTERED BY (member_id) INTO 1024 BUCKETS;
-- now the join is sort-merge-bucket: no shuffle.Why it wins. Bucketing physically lays each table's rows into N hash buckets on the join key. When both tables share the key and bucket count, matching buckets are already co-located, so the engine does a local sort-merge with no exchange. You pay the partitioning cost once at write time and bank it on every subsequent join.
Replace a self-join with a window aggregate
Robinhood — each trade's running total of volume per account. The textbook self-join is quadratic.
SELECT t.trade_id, SUM(p.qty) AS running_qty
FROM trades t
JOIN trades p
ON p.account_id = t.account_id
AND p.ts <= t.ts
GROUP BY t.trade_id;
-- every trade re-joins all prior trades; a hot
-- account with 10M trades explodes the join.SELECT trade_id,
SUM(qty) OVER (PARTITION BY account_id
ORDER BY ts
ROWS UNBOUNDED PRECEDING) AS running_qty
FROM trades;Why it wins. An inequality self-join (p.ts <= t.ts) materializes a triangular cross-product — O(n²) per group. A window function sorts each partition once and sweeps it in a single linear pass. Any "running / cumulative / compare-to-neighbour" requirement is a window function, never a self-join.
When one key owns half the rows.
Skew is the optimization juniors miss because the query is "correct" and the plan looks fine — yet one task runs for three hours while the other 999 finished in thirty seconds. The total work is reasonable; its distribution is not. (Deeper operational treatment on the Hot Shards & Data Skew page.)
Salt the hot join key
Meta — join engagements to users where 0.1% of users (celebrities) own 50% of rows. The reducer that owns a celebrity key is the straggler that holds the stage.
SELECT u.name, COUNT(*) AS engagements
FROM engagements e
JOIN users u ON u.user_id = e.user_id
GROUP BY u.name;
-- all of @celebrity's rows hash to ONE partition.WITH e AS ( -- salt the skewed (fact) side
SELECT *, CAST(FLOOR(RAND()*16) AS INT) AS salt
FROM engagements),
u AS ( -- fan the small side out to match
SELECT u.*, s.salt
FROM users u
CROSS JOIN UNNEST(SEQUENCE(0,15)) AS s(salt))
SELECT u.name, COUNT(*) AS engagements
FROM e JOIN u
ON u.user_id = e.user_id AND u.salt = e.salt
GROUP BY u.name;Why it wins. Appending a random salt to the hot key splits its rows across N partitions instead of one; replicating the small side N times keeps the join correct. Sixteen evenly-loaded tasks beat one task doing 50% of the work. You only need to salt the keys that are actually hot — salt everything and you just inflate the small side.
Let AQE rebalance skew automatically
Airbnb — same skewed join, but you'd rather not hand-salt every pipeline. Let the engine split hot partitions at runtime.
-- AQE off: the plan is fixed at compile time;
-- a partition that turns out 200× the median
-- size still runs as a single task.
SET spark.sql.adaptive.enabled=false;SET spark.sql.adaptive.enabled=true;
SET spark.sql.adaptive.skewJoin.enabled=true;
-- Spark measures actual partition sizes mid-query
-- and splits any partition > skewedPartitionFactor ×
-- median into sub-partitions, each its own task.Why it wins. AQE re-plans using real shuffle statistics instead of compile-time estimates: it splits oversized partitions, coalesces tiny ones, and can flip a sort-merge to a broadcast once it sees the true sizes. It's the zero-code first move for skew — reach for manual salting (№11) only when AQE isn't available or isn't enough.
Isolate the hot keys and union them back
TikTok — a handful of mega-viral creators dominate. You know exactly who they are, so handle them on a separate path.
SELECT creator_id, SUM(views)
FROM video_events
GROUP BY creator_id;
-- the top 20 creators each get a giant group
-- while millions of long-tail creators are tiny.-- cold tail: normal aggregation
SELECT creator_id, SUM(views) AS views
FROM video_events
WHERE creator_id NOT IN (SELECT creator_id FROM hot_creators)
GROUP BY creator_id
UNION ALL
-- hot keys: pre-split with a salt, then re-sum
SELECT creator_id, SUM(views) AS views FROM (
SELECT creator_id,
SUM(views) AS views
FROM video_events
WHERE creator_id IN (SELECT creator_id FROM hot_creators)
GROUP BY creator_id, CAST(FLOOR(RAND()*64) AS INT)
) t GROUP BY creator_id;Why it wins. When the hot set is small and known, you don't have to penalise the whole query. The long tail aggregates normally; the few whales get a salted two-phase aggregation. Same idea as salting, but surgically applied so you don't replicate the dimension for keys that were never hot.
Don't count the hard way.
The most expensive aggregates are the exact-distinct and exact-percentile ones, because they need to see — and shuffle — every value. At petabyte scale, "within 2%" is almost always good enough for a dashboard, and one pass can answer many grains at once.
APPROX_COUNT_DISTINCT for DAU/MAU — HyperLogLog
Meta — distinct daily active users over a 100 B-row event table. Exact COUNT(DISTINCT) must shuffle every user_id to dedupe it.
SELECT ds, COUNT(DISTINCT user_id) AS dau
FROM events
GROUP BY ds;
-- every user_id is shuffled and held in memory
-- to be deduplicated, per day.SELECT ds, APPROX_COUNT_DISTINCT(user_id) AS dau
FROM events
GROUP BY ds;
-- even better, persist a mergeable sketch per day:
-- HLL_SKETCH_AGG → MAU = merge of 30 daily sketches,
-- no re-scan of the raw events.Why it wins. HyperLogLog estimates cardinality from a fixed-size sketch (KBs) instead of materialising every distinct value, so the shuffle and memory footprint are tiny and bounded. The killer feature is that sketches are mergeable: store one per day and you can compute any week/month/quarter by unioning sketches — never rescanning raw events. Standard error ~1–2%.
APPROX_PERCENTILE for p95 latency — t-digest
Datadog — p50/p95/p99 request latency over trillions of spans for an SLO dashboard.
SELECT service,
PERCENTILE_CONT(0.95)
WITHIN GROUP (ORDER BY latency_ms) AS p95
FROM spans
GROUP BY service;
-- exact quantile must sort the entire group.SELECT service,
APPROX_PERCENTILE(latency_ms, 0.95) AS p95
FROM spans
GROUP BY service;
-- BigQuery: APPROX_QUANTILES(latency_ms, 100)[OFFSET(95)]Why it wins. Exact percentiles need the whole group ordered; a t-digest/Q-digest sketch summarises the distribution in bounded memory with high accuracy in the tails (exactly where p95/p99 live). It's the default for any latency or SLO panel — nobody sorts a trillion spans to draw a line chart.
GROUPING SETS — every grain in one scan
Amazon — revenue by country, by country×category, and the grand total. Three reports, ideally one scan.
SELECT country, NULL cat, SUM(rev) FROM f GROUP BY country
UNION ALL
SELECT country, category, SUM(rev) FROM f GROUP BY country,category
UNION ALL
SELECT NULL, NULL, SUM(rev) FROM f;
-- scans the PB fact three times.SELECT country, category, SUM(rev) AS rev
FROM f
GROUP BY GROUPING SETS (
(country, category),
(country),
() -- grand total
);Why it wins. GROUPING SETS (and its shorthands ROLLUP/CUBE) compute multiple aggregation grains from a single read of the fact, sharing the scan and much of the shuffle. Each extra grain you'd have UNION-ed is another full PB scan you just deleted.
GROUPING() to tell the subtotal rows apart.Map-side pre-aggregation cuts the shuffle (the combiner)
Cloudflare — request counts grouped by a high-cardinality key over a firehose. Every raw row crossing the wire is wasted.
SELECT zone_id, COUNT(DISTINCT ip) AS uniques
FROM requests
GROUP BY zone_id;
-- (zone_id, ip) pairs all shuffle so DISTINCT
-- can dedupe globally.SELECT zone_id, COUNT(*) AS uniques
FROM ( SELECT zone_id, ip
FROM requests
GROUP BY zone_id, ip ) -- map-side combine
GROUP BY zone_id;
-- or just approx_count_distinct(ip) (see №14)Why it wins. Algebraic aggregates (SUM, COUNT, MIN/MAX) get a free map-side combine — each task pre-aggregates locally and ships only partial results, so shuffle volume drops by the group-collapse factor. COUNT(DISTINCT) doesn't combine for free; rewriting it as group-then-count restores the partial-aggregation, and an HLL sketch removes the shuffle entirely.
Materialize daily rollups — dashboards read the rollup, not raw events
Every FAANG — a 90-day trend tile that re-scans the raw event firehose on every single dashboard load.
-- runs on every page load, ×100s of viewers
SELECT ds, COUNT(*) AS events, SUM(rev) AS rev
FROM raw_events
WHERE ds >= DATE_SUB(CURRENT_DATE, 90)
GROUP BY ds;-- scheduled job, once a day:
INSERT OVERWRITE TABLE agg_daily PARTITION (ds)
SELECT ds, COUNT(*) events, SUM(rev) rev, ...
FROM raw_events WHERE ds = CURRENT_DATE GROUP BY ds;
-- dashboard query → scans one tiny table:
SELECT ds, events, rev FROM agg_daily
WHERE ds >= DATE_SUB(CURRENT_DATE, 90);Why it wins. The single highest-leverage move in analytics: pay the scan once in a scheduled job, then serve every dashboard from a table that's one row per day×dimension. A 90-day tile reads kilobytes. (This is the bridge into the Analytics, Dashboard & Optimization pillar, where the whole serving layer is built on this idea.)
One ordered pass where juniors write a join.
Three of the most common interview prompts — top-N per group, latest-record-per-key, and sessionization — have a tempting self-join answer that's quadratic, and a window-function answer that's linear. Knowing the window form cold is a senior signal.
QUALIFY ROW_NUMBER() for top-N per group
Netflix — the top-3 titles per user by watch time.
SELECT *
FROM watch w
WHERE (SELECT COUNT(*) FROM watch w2
WHERE w2.user_id = w.user_id
AND w2.secs > w.secs) < 3;
-- correlated subquery re-scans per row.SELECT *
FROM watch
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id
ORDER BY secs DESC) <= 3;
-- Spark/PG: wrap it and filter:
-- SELECT * FROM (… ROW_NUMBER() AS rn …) WHERE rn <= 3
Why it wins. QUALIFY filters on a window result without a subquery; one partition-and-sort answers the whole top-N. The correlated-count form re-scans the group for every row — quadratic. Use RANK() if you want ties to share a place, ROW_NUMBER() for a hard cut.
Latest-record-per-key dedup (CDC) with a tiebreaker
Snowflake — collapse an append-only change log to the current row per primary key.
SELECT *
FROM changes c
WHERE c.updated_at = (
SELECT MAX(updated_at) FROM changes c2
WHERE c2.pk = c.pk);
-- re-scans per pk, AND returns 2 rows when two
-- updates share the same updated_at.SELECT *
FROM changes
QUALIFY ROW_NUMBER() OVER (
PARTITION BY pk
ORDER BY updated_at DESC, op_seq DESC) = 1;
-- op_seq breaks ties → exactly one row per pk.Why it wins. One window pass replaces a correlated subquery, and the secondary sort key (op_seq, an offset, a monotonic id) guarantees exactly one winner per key even when timestamps collide — a real bug in high-throughput CDC. This is the canonical "current state from a change stream" pattern.
Sessionize with LAG + running sum, not a self-join
Spotify — split each user's event stream into sessions with a 30-minute inactivity gap.
-- join each event to its predecessor by a
-- correlated subquery / inequality self-join,
-- then diff timestamps → O(n²) per user.SELECT *,
SUM(is_new) OVER (PARTITION BY user_id
ORDER BY ts) AS session_id
FROM (
SELECT *,
CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id
ORDER BY ts)
> INTERVAL '30' MINUTE
THEN 1 ELSE 0 END AS is_new
FROM events
) s;Why it wins. LAG gives each event its predecessor's timestamp in one pass; a gap over the threshold flags a session boundary; a running SUM of those flags numbers the sessions. Two ordered sweeps replace a quadratic self-join — the textbook gaps-and-islands solution.
Push the LIMIT — don't globally sort a PB to take 100 rows
Pinterest — "top 100 pins by saves this week" off a huge fact.
SELECT pin_id, saves
FROM wide_join_of_everything -- PB of rows
ORDER BY saves DESC
LIMIT 100;
-- a global sort over a PB to keep 100 rows.SELECT pin_id, SUM(saves) AS saves
FROM save_events
WHERE ds BETWEEN '2024-01-01' AND '2024-01-07'
GROUP BY pin_id
ORDER BY saves DESC
LIMIT 100; -- engine keeps a 100-entry heap
-- per partition, then merges.Why it wins. ORDER BY … LIMIT k over an aggregated input lets the engine maintain a bounded top-k heap per partition and merge them — no full global sort. Sorting a raw PB-wide join to discard all but 100 rows is pure waste; collapse to the grain first, filter early, and only then take the top-N.
Stop recomputing five years every night.
The third bill is bytes written. The most common waste in a scheduled pipeline is recomputing the entire history when only today changed. These cards make write cost scale with the delta, not the table.
Incremental partition overwrite, not full recompute
Airbnb — a daily aggregate over a 5-year fact, rebuilt from scratch every night.
INSERT OVERWRITE TABLE agg
SELECT ds, dim, SUM(x)
FROM fact -- all 1,825 days, nightly
GROUP BY ds, dim;INSERT OVERWRITE TABLE agg PARTITION (ds = '2024-01-08')
SELECT '2024-01-08' AS ds, dim, SUM(x)
FROM fact
WHERE ds = '2024-01-08' -- read & write one day
GROUP BY dim;
-- dynamic-partition variant handles late data:
-- WHERE ds >= CURRENT_DATE - INTERVAL '3' DAYWhy it wins. Yesterday's partitions didn't change, so reading and rewriting them is pure waste. Overwriting only the affected partition makes both read and write O(one day) instead of O(history). For late-arriving data, reprocess a small trailing window (e.g., last 3 days) rather than everything.
MERGE the delta for upserts / SCD, not a full rebuild
Stripe — apply a batch of CDC changes to a dimension on a lakehouse table.
CREATE OR REPLACE TABLE dim_customer AS
SELECT ... FROM dim_customer
-- hand-merge old + new by UNION/JOIN, rewrite
-- every row even though 0.1% changed.MERGE INTO dim_customer d
USING staged_changes s ON d.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'U'
THEN UPDATE SET d.email = s.email, d.updated_at = s.ts
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN NOT MATCHED
THEN INSERT (customer_id, email, updated_at)
VALUES (s.customer_id, s.email, s.ts);Why it wins. On Delta/Iceberg/Snowflake, MERGE rewrites only the data files that contain matched keys, leaving the rest untouched. Write amplification scales with the size of the change set, not the table — and it's one atomic statement instead of a fragile rebuild-and-swap.
Compact small files — stop dying by a thousand opens
DoorDash — a streaming ingest writes millions of 1 MB files a day; downstream batch reads spend all their time listing and opening files.
-- micro-batch every few seconds, no compaction:
-- read planning lists millions of files, opens
-- each one, reads a tiny footer — overhead >>
-- actual data read.-- Delta
OPTIMIZE events WHERE ds = '2024-01-08';
-- Iceberg
CALL system.rewrite_data_files('events');
-- plain Spark/Hive
INSERT OVERWRITE TABLE events PARTITION (ds='2024-01-08')
SELECT /*+ COALESCE(64) */ * FROM events WHERE ds='2024-01-08';Why it wins. Every file carries fixed per-file overhead — a list entry, an open, a footer read, a task. A million tiny files means the engine spends its life on bookkeeping instead of data. Compacting into a few hundred right-sized files (≈128–512 MB) collapses planning and read time and restores healthy parallelism.
Clustering, JSON & the cast that kills pruning.
The last family is the set of quiet mistakes that disable everything above — a layout that prevents skipping, a JSON blob re-parsed a million times, and the single most common pruning-killer in real code: an implicit type cast on a key.
Cluster / Z-order by your filter columns for data skipping
Snowflake — a huge table partitioned by date but almost always filtered by (event_date, country). Within a day, countries are scattered across every micro-partition.
SELECT ...
FROM events
WHERE ds = '2024-01-08' AND country = 'IN';
-- country='IN' rows live in EVERY micro-partition
-- of that day → min/max can't skip any of them.ALTER TABLE events CLUSTER BY (ds, country);
-- Delta: OPTIMIZE events ZORDER BY (country);
-- BigQuery: CLUSTER BY country at table creation.
-- now each file holds a narrow country range →
-- min/max stats skip the non-'IN' files.Why it wins. Engines keep per-file/micro-partition min/max stats and skip any file whose range can't match the predicate. That only helps if matching rows are physically clustered; clustering or Z-ordering by your common filter columns narrows each file's range so data skipping actually fires. Partitioning gives coarse pruning; clustering gives fine-grained skipping inside the partition.
country='IN'.Flatten JSON once into typed columns — don't re-parse per query
Shopify — a raw payload JSON string that ten different dashboards each parse on the fly over raw events.
SELECT
GET_PATH(PARSE_JSON(payload), 'checkout.total') AS total,
GET_PATH(PARSE_JSON(payload), 'checkout.currency') AS ccy
FROM raw_events
WHERE ds = '2024-01-08';
-- parse + path-extract on every row, every query,
-- and the blob defeats column pruning/skipping.-- one curated table, built once per partition:
CREATE TABLE checkout_facts AS
SELECT order_id, ds,
payload:checkout.total::NUMBER AS total,
payload:checkout.currency::STRING AS currency
FROM raw_events;
-- dashboards now read typed columns:
SELECT total, currency FROM checkout_facts WHERE ds='2024-01-08';Why it wins. JSON parsing is CPU-heavy and a string blob can't be column-pruned or min/max-skipped. Doing the extraction once in an ELT step turns hot paths into typed, columnar fields — parse cost is paid a single time, and every downstream query gets projection and data skipping back.
Don't cast the join / partition key — implicit casts kill pruning
Visa — joining a STRING user_id to a BIGINT user_id, and filtering a string partition key with an integer literal. Both quietly disable optimizations.
-- partition key ds is STRING; literal is INT →
-- engine casts the COLUMN, pruning is disabled:
WHERE ds = 20240108
-- join keys differ in type → re-hash both sides,
-- bucketing/co-location no longer applies:
ON CAST(a.user_id AS STRING) = b.user_id-- compare the partition column to its OWN type:
WHERE ds = '20240108'
-- align key types at modeling time, then join bare:
ON a.user_id = b.user_id -- both BIGINTWhy it wins. When you compare a column to a value of a different type, the engine usually casts the column — and a cast wrapped around a partition key disables partition pruning (same trap as №1), while a cast on a join key forces a re-hash that defeats bucketing and broadcast co-location. Fix the types in the model so every key comparison is bare. It's the highest "silent tax" per character of SQL.
How to say it in the interview.
When an interviewer drops a slow query in front of you, narrate the diagnosis in the order the bytes flow — it signals you think in physical plans, not just SQL syntax:
That paragraph maps one-to-one onto the seven families above. The named techniques — partition pruning, broadcast join, salting, HyperLogLog, QUALIFY, incremental MERGE, clustering — are the vocabulary that reads as senior. The discipline of always asking "which of the three bills does this cut?" is what reads as staff.