Skew & distributions — speak whale, decile, Pareto, P99.
L5+ data-engineering interviews test whether you can name the shape of user behaviour, not just query it. This page codifies the vocabulary: user archetypes (whale / dolphin / minnow), quantile terminology (decile / quartile / percentile), distribution shapes (Pareto / long-tail / normal / skewed), and the one-sentence answers interviewers listen for when they probe with "why not just use AVG and standard deviation?".
User archetypes — whales, dolphins, minnows.
Any consumer platform with a long-tail engagement distribution has three named cohorts. The names come from gaming/casino analytics in the 2000s; they're now standard at Netflix, Spotify, YouTube, Meta, Stripe, and every B2C SaaS. The interview signal isn't knowing the words — it's knowing what to do when you've classified a user as each.
Top 1-5% · The Power Users
🐋 Whale
Definition. The top 1-5% of users by consumption — watch time, GMV, API calls, played minutes. They drive a disproportionate share of total platform usage. On Netflix, a whale watches 60+ hours / month while the typical user watches 5-10.
How you find them in SQL.
PERCENT_RANK() OVER (ORDER BY total_hours DESC) <= 0.05
-- or: NTILE(20) ... WHERE bucket = 1 (top 5%)
-- or: NTILE(100) ... WHERE bucket = 1 (top 1%)
Business context. Whales drive retention metrics — they're the most likely to renew, refer friends, and tolerate price increases. They also stress-test infrastructure: if a whale has a bad streaming experience, it's a P0 because (a) they're paying customers, (b) they generate the most support volume, (c) their bandwidth pattern is what your edge cache must serve. Senior framing: "A whale's churn isn't just one cancelled subscription — it's a leading indicator of platform health degradation."
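In plain Python, the DESC percent-rank filter above can be sketched like this — a simplified emulation (the `whales` name and 0.05 cutoff are illustrative) that uses the SQL formula (rank − 1) / (rows − 1) but ranks by row position, so ties are broken arbitrarily rather than shared as in real PERCENT_RANK:

```python
def whales(user_hours, cutoff=0.05):
    """Approximate PERCENT_RANK() OVER (ORDER BY total_hours DESC) <= cutoff.

    user_hours: dict of user_id -> total hours.
    Returns the users whose descending percent-rank is within the cutoff.
    """
    ranked = sorted(user_hours.items(), key=lambda kv: -kv[1])
    n = len(ranked)
    return [user
            for i, (user, _) in enumerate(ranked)
            if (i / (n - 1) if n > 1 else 0.0) <= cutoff]
```

With 100 users this keeps the 5 highest-consuming ones; the point of PERCENT_RANK over a fixed LIMIT is that the filter survives user-base growth.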
Middle 65-75% · The Backbone
🐬 Dolphin (or Mid-User)
Definition. The average user — watches a few hours per week, logs in regularly, uses the core feature set but not the long-tail features. Roughly P25 to P85 on the consumption distribution.
How you find them.
PERCENT_RANK() OVER (ORDER BY total_hours DESC) BETWEEN 0.15 AND 0.75
-- or: NTILE(10) ... WHERE bucket BETWEEN 3 AND 8 (deciles 3-8)
Business context. Dolphins are the platform's stability metric. If dolphin DAU/MAU is steady, the platform is healthy. If dolphins shift downward (toward minnow territory), you're in trouble months before the financial metrics show it. Senior framing: "We don't optimise for the whales — they're already maxed out. We don't optimise for the minnows — they're already gone. We optimise for the dolphin's next session."
Bottom 20-30% · The Churn Risk
🐟 Minnow
Definition. Low-engagement users. One session a week or less, often just trial users, gift-subscription holders, or partner-bundle accounts. Bottom decile or two.
How you find them.
PERCENT_RANK() OVER (ORDER BY total_hours DESC) >= 0.80
-- or: NTILE(10) ... WHERE bucket IN (9, 10) (bottom 20%)
Business context. Minnows are the growth opportunity, not deadweight. They're paying for the subscription but not getting value — moving them from minnow → dolphin is how the platform expands without acquiring new users. Senior framing: "Minnows aren't a deletion target. They're the move-them-up target. Marketing's job is decile 10 → 8 → 5, not churn-and-replace."
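Putting the three archetypes together, a rank-based classifier might look like the sketch below. The cutoffs mirror the ones quoted above (top 5% whale, bottom 20% minnow) and should be tuned per platform; ties are broken arbitrarily:

```python
def classify(user_hours):
    """Label each user whale / dolphin / minnow from their DESC percent-rank
    (0.0 = biggest whale, 1.0 = lowest-engagement user)."""
    ranked = sorted(user_hours.items(), key=lambda kv: -kv[1])
    n = len(ranked)
    labels = {}
    for i, (user, _) in enumerate(ranked):
        pr = i / (n - 1) if n > 1 else 0.0
        if pr <= 0.05:
            labels[user] = "whale"      # top 5%
        elif pr < 0.80:
            labels[user] = "dolphin"    # the middle backbone
        else:
            labels[user] = "minnow"     # bottom 20%
    return labels
```

On ten users with distinct hours, this yields one whale, seven dolphins, and two minnows — the shape of the cohorts described above.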
Quantiles — decile, quartile, quintile, percentile.
"Dividing the population into equal-sized groups based on a metric" is a single concept with five names depending on how many buckets you want. Interviewers expect you to use the right one without thinking.
| Term | Buckets | Each bucket size | SQL | Typical use |
|---|---|---|---|---|
| Quartile | 4 | 25% | NTILE(4) | Coarse cohorting; salary band reports |
| Quintile | 5 | 20% | NTILE(5) | Marketing segmentation (top quintile = "premium") |
| Decile | 10 | 10% | NTILE(10) | Power-user analysis; lift charts in propensity modelling |
| Percentile | 100 | 1% | NTILE(100) or PERCENT_RANK | Latency SLAs, top 1% whale identification |
| P-notation (P50, P99) | — | — | PERCENTILE_CONT(0.99) | "P99 latency" — the value at the 99th percentile |
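To internalise what NTILE actually does when the row count doesn't divide evenly, here's a small emulation (a sketch assuming distinct values; per the SQL standard, bucket sizes differ by at most one and the larger buckets come first):

```python
def ntile_desc(values, n):
    """Emulate NTILE(n) OVER (ORDER BY value DESC).

    Returns (value, bucket) pairs; when rows don't divide evenly,
    the first `extra` buckets each get one extra row.
    """
    ordered = sorted(values, reverse=True)
    base, extra = divmod(len(ordered), n)
    out, start = [], 0
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        out.extend((v, bucket) for v in ordered[start:start + size])
        start += size
    return out
```

For example, ntile_desc(range(1, 11), 4) assigns buckets of size 3, 3, 2, 2 — ten rows into four quartile buckets.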
"Top decile" vs "decile 1" — the numbering ambiguity that trips juniors
Top decile = decile 1 in DESC order = top 10% of users. But some teams number deciles ascending (decile 1 = bottom 10%, decile 10 = top 10%). Always confirm the convention before writing SQL — and prefer naming the threshold ("top 10%") over the bucket number ("decile 1") when the audience is non-technical.
-- Convention A: DESC — decile 1 = whales
SELECT user_id, NTILE(10) OVER (ORDER BY hours DESC) AS decile FROM user_metrics;
-- decile 1 = top 10% (whales), decile 10 = bottom 10% (minnows)
-- Convention B: ASC — decile 10 = whales
SELECT user_id, NTILE(10) OVER (ORDER BY hours ASC) AS decile FROM user_metrics;
-- decile 1 = bottom 10% (minnows), decile 10 = top 10% (whales)
PERCENTILE_CONT vs PERCENTILE_DISC
| Function | What it returns | Pick when |
|---|---|---|
| PERCENTILE_CONT(0.5) | Interpolated value (linear between two adjacent values) | Continuous metrics — revenue, hours, latency |
| PERCENTILE_DISC(0.5) | Actual value from the data — the row at or just past the rank | Discrete metrics, or when the result must be a real observed value |
| NTILE(N) | Bucket assignment 1..N (roughly equal-sized buckets) | Cohort assignment, lift charts, decile rollups |
| PERCENT_RANK() | Continuous rank in [0, 1] | "Where does this user fall on the curve?"; growth-resilient filters like <= 0.05 |
| CUME_DIST() | Cumulative distribution in (0, 1] | "What fraction of users watch ≤ this many hours?" |
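The CONT/DISC difference is easiest to see on a tiny dataset. A minimal Python emulation of the two semantics — linear interpolation versus the first observed value at or past the requested rank:

```python
import math

def percentile_cont(values, p):
    """PERCENTILE_CONT semantics: interpolate at position p * (n - 1)."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def percentile_disc(values, p):
    """PERCENTILE_DISC semantics: smallest observed value whose
    cumulative distribution reaches p."""
    s = sorted(values)
    rank = max(math.ceil(p * len(s)), 1)   # 1-based rank
    return s[rank - 1]
```

On [10, 20, 30, 40], the median via CONT is 25.0 — an interpolated value that was never observed — while DISC returns 20, a real row.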
Distribution shapes — name the curve, not just the number.
Pareto distribution — the 80/20 rule
The name comes from Vilfredo Pareto, who observed in 1896 that 80% of Italy's land was owned by 20% of the population. The same shape shows up in every consumer platform: 80% of watch hours come from 20% of users; 80% of revenue from 20% of customers; 80% of GitHub commits from 20% of contributors.
Netflix context: 20% of your users (the whales) likely account for 80% of total watch hours. SQL strategy: use cumulative sums (running totals) to show where the "bulge" is — order users DESC by hours, compute the running total, and the point where it crosses 80% of the global total is your "vital few" cutoff. See the 80/20 Pareto question for the full pattern.
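The cumulative-sum strategy can be sketched in a few lines of Python (hypothetical data; `vital_few_cutoff` is an illustrative name):

```python
def vital_few_cutoff(hours_per_user, share=0.80):
    """Order per-user hours DESC, accumulate, and return how many top users
    it takes to reach `share` of total hours, plus their exact share."""
    ordered = sorted(hours_per_user, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, h in enumerate(ordered, start=1):
        running += h
        if running >= share * total:
            return count, running / total
    return len(ordered), 1.0
```

On a textbook Pareto shape — [80, 5, 5, 5, 5] — the top 1 of 5 users (20% of the base) already carries 80% of the hours.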
Long tail (power law) — Pareto's stretched cousin
A distribution where a large mass of "minnows" on the left trails off into a very long, thin tail of "whales" on the right. Visually: a huge spike at the low end (lots of people watching 0-1 hours) and a line stretching far to the right — one viewer at 400 hours/month while the bulk of users sit at 0-2 hours.
Two-word interview test: if the interviewer says "long tail", you should immediately reach for NTILE(10) + decile rollups rather than AVG. The average is meaningless on a long tail because the extreme values drag it toward the tail.
Normal distribution (bell curve) — the one tech data rarely has
This is what most people assume data looks like, but in tech it's rarely true for engagement. Definition: most users are in the middle, with very few at the extreme high or low ends. Symmetric around the mean, ~68% within 1σ, ~95% within 2σ, ~99.7% within 3σ. Useful for measurement error, manufacturing tolerances, and IQ scores. Almost never useful for user engagement.
Interview tip — the standard-deviation trap. If an interviewer asks why you aren't using simple STDDEV for outlier detection, the answer is: "because engagement data is skewed, and standard deviation assumes a normal distribution." The whales pull the mean way above the median, σ is dominated by outliers, and STDDEV-based outlier detection silently flags half your dolphins as anomalies while missing the actual whales.
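A quick numeric illustration of the trap, on a hypothetical long-tail sample: sigma is so inflated by the extremes that a mean + 2σ rule calls a 100-hour whale "normal" while the typical user watches 1-2 hours.

```python
import statistics

# Hypothetical long-tail sample: 95 light users, 5 whales.
hours = [1.0] * 50 + [2.0] * 30 + [5.0] * 15 + [100.0, 150.0, 200.0, 300.0, 400.0]

mean = statistics.mean(hours)        # 13.35 — describes a user who doesn't exist
median = statistics.median(hours)    # 1.5  — the typical user
sigma = statistics.pstdev(hours)     # ~55  — dominated by the five whales

threshold = mean + 2 * sigma         # ~124 hours
flagged = [h for h in hours if h > threshold]
# The 100-hour whale is "within 2 sigma" — the naive rule misses it entirely.
```

The design lesson: percentile-based thresholds (P95, P99) stay meaningful on this shape; sigma-based ones do not, because sigma itself is a function of the outliers you're trying to detect.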
Skewness — the asymmetry that breaks naive analytics
Positive (right) skew means the "tail" is on the right side. This is the Netflix example — most users are on the low end, but the whales pull the average to the right. Interview insight: when data is positively skewed, the mean (average) will be much higher than the median. Always mention the median to show you aren't being fooled by the whales.
| Type | Tail on | Mean vs median | Where you see it |
|---|---|---|---|
| Positive (right) skew | Right side (high values) | Mean > Median | Watch time, revenue, salary, page-view counts — the default for engagement |
| Negative (left) skew | Left side (low values) | Mean < Median | Test scores in an easy test, age at retirement, time-to-bug-fix on critical issues |
| Zero skew (symmetric) | Both sides equal | Mean ≈ Median | Pure measurement noise, true random walks |
Mean vs median — the L5 tell.
The single most-tested signal in senior-IC data interviews: which one do you use, and why?
| Distribution | AVG(hours) | MEDIAN(hours) | What it means |
|---|---|---|---|
| Uniform (1, 2, 3, 4, 5) | 3.0 | 3.0 | They agree — neither misleads |
| Skewed (1, 2, 3, 4, 1000) | 202.0 | 3.0 | One whale pulls the mean up 67×; the median ignores it |
| Netflix-shape (typical: 5h, whales: 60h+) | ~12h | ~6h | Mean reports a user who doesn't exist |
The L5 sentence. When asked "what's the average watch time per user", the senior answer is: "Average watch time per user is X hours. The median is Y hours. The gap tells you the platform is heavily skewed toward whales — for any operational decision, use the median; for capacity planning, look at P99."
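The table's numbers are easy to verify — and worth re-deriving live in an interview:

```python
import statistics

uniform = [1, 2, 3, 4, 5]
skewed = [1, 2, 3, 4, 1000]

assert statistics.mean(uniform) == statistics.median(uniform) == 3  # they agree
assert statistics.mean(skewed) == 202.0  # one whale drags the mean up ~67x
assert statistics.median(skewed) == 3    # the median doesn't move
assert statistics.mean(skewed) / statistics.median(skewed) > 67
```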
Skew ratio — quantifying how skewed
A simple, defensible metric for "how skewed":
SELECT
AVG(hours) AS mean_hours,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY hours) AS median_hours,
AVG(hours) / NULLIF(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY hours), 0)
AS skew_ratio_mean_over_median,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY hours) /
NULLIF(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY hours), 0)
AS p99_over_p50
FROM user_metrics;
Healthy uniform: skew_ratio ≈ 1.0. Healthy long-tail (Pareto): skew_ratio between 2 and 4. Extreme power-law (one viral product, gaming whales): skew_ratio > 10. The p99_over_p50 ratio is the more interviewer-friendly variant — "how many times more does the top 1% consume than the median?"
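The same two ratios can be computed in Python with statistics.quantiles, whose "inclusive" method uses the linear interpolation that matches PERCENTILE_CONT (a sketch; the function name is illustrative):

```python
import statistics

def skew_ratios(hours):
    """Return (mean / P50, P99 / P50) for a list of per-user hours."""
    cuts = statistics.quantiles(hours, n=100, method="inclusive")
    p50, p99 = cuts[49], cuts[98]   # the 99 cut points are P1..P99
    return statistics.mean(hours) / p50, p99 / p50
```

A perfectly uniform distribution scores ~1.0 on the mean/median ratio; anything above ~4 deserves an explicit "this is power-law territory" callout.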
LAD gap — when whales get miscategorised as minnows.
Late-Arriving Dimensions (LAD): the user just watched a brand-new show, but the content metadata hasn't propagated to the global catalog yet. Their event row is correct, but joins to dim_content return NULL, and the user's hours look artificially low for a brief window. If your decile-classification job runs during that window, a whale gets stamped as a minnow. Worse: they appear in churn-risk campaigns and you spam your best customers with retention offers.
The defensive pattern — Lag-Tolerant View
Emit an is_metadata_complete flag on every row. Power-user / VIP / classification reports filter to = TRUE; ops and SRE reports run unfiltered (they want to see the LAD-pending rows).
-- The LAD-tolerant decile classification
WITH user_metrics AS (
SELECT user_id, SUM(watch_seconds) / 3600.0 AS hours
FROM viewing_activity
WHERE is_metadata_complete = TRUE -- ← the LAD guard
GROUP BY user_id
),
ranked AS (
SELECT user_id, hours,
NTILE(10) OVER (ORDER BY hours DESC) AS decile,
CASE
WHEN NTILE(20) OVER (ORDER BY hours DESC) = 1 THEN 'whale'
WHEN NTILE(10) OVER (ORDER BY hours DESC) <= 2 THEN 'heavy_dolphin'
WHEN NTILE(10) OVER (ORDER BY hours DESC) <= 8 THEN 'dolphin'
ELSE 'minnow'
END AS archetype
FROM user_metrics
)
SELECT * FROM ranked ORDER BY hours DESC;
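A tiny Python sketch of the two views on hypothetical rows — the guarded aggregate that feeds classification versus the unfiltered one that ops uses to see the LAD backlog:

```python
# Hypothetical rows: (user_id, watch_seconds, is_metadata_complete).
rows = [
    ("u_whale", 2 * 3600, True),
    ("u_whale", 58 * 3600, False),   # brand-new show, metadata not propagated yet
    ("u_minnow", 1 * 3600, True),
]

def hours_by_user(rows, lad_guard):
    """SUM(watch_seconds)/3600 per user. With lad_guard=True (the
    classification view), LAD-pending rows are excluded; with False
    (the ops view), everything is counted."""
    totals = {}
    for user, secs, complete in rows:
        if lad_guard and not complete:
            continue
        totals[user] = totals.get(user, 0.0) + secs / 3600.0
    return totals

classification_view = hours_by_user(rows, lad_guard=True)   # u_whale: 2.0h
ops_view = hours_by_user(rows, lad_guard=False)             # u_whale: 60.0h
```

The 58-hour gap between the two views for u_whale is exactly the LAD-pending backlog the SRE dashboard wants to see — and exactly why a classification job running mid-window should wait for, or flag, incomplete rows rather than stamp the user a minnow.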
When NOT to use the LAD guard
- Real-time SLA dashboards — you want to see incomplete rows so you can spot the propagation backlog.
- SRE incident drills — the LAD-pending count is itself the metric.
- Financial close / regulatory reporting — don't filter silently; everything must be complete by close, so hold the close rather than hide incomplete rows.
The rule of thumb: any report that classifies a user's behaviour (decile, archetype, churn risk, cohort) needs the LAD guard. Any report that observes the system itself wants no guard.
Interview summary table + the one-sentence script.
| Term | What it measures | Why use it at Netflix? |
|---|---|---|
| Median (P50) | The "typical" user | Ignores whales; shows what a normal person actually watches |
| P99 | Extreme behaviour | Sizes the edge cache for the biggest whales; latency SLA target |
| Decile | Group performance | Compares whales vs minnows side-by-side; lift-chart input |
| Skew ratio (mean / median) | How asymmetric the curve is | One number on the board deck that quantifies "how Pareto-shaped is the user base?" |
| Whale / Dolphin / Minnow | Named cohort | Compressed vocabulary that maps directly to growth / retention / churn strategies |
| LAD gap | Metadata lag | Ensures whales aren't miscategorised as minnows because data is late |
The Final L5 Tip — the one sentence that proves senior-level thinking
"Because this distribution is heavily skewed toward whales, I'll use NTILEs to ensure our metrics represent the entire user base, not just the power users."
When you see a skewed distribution in an interview, don't jump straight into the SQL. Say that sentence first. It proves you understand the failure mode of naive aggregations — and that you've shipped consumer-product analytics in production. One sentence; immediate L5 signal.
The 90-second expanded form
"This engagement data is heavily right-skewed — a classic Pareto / long-tail shape — so I'll avoid AVG and STDDEV, which assume a symmetry this data doesn't have. Instead I'll report the median for the typical user, P99 for the whales, and NTILE(10) deciles so the rollup represents the whole curve, not just the power users. The standard-deviation trap matters here: sigma is dominated by the extremes, so a mean-plus-two-sigma rule misfires on this shape. And because the classification depends on joins to content metadata, I'll guard it against late-arriving dimensions with an is_metadata_complete filter — in production, that's the difference between a whale being a whale and a whale being spammed with churn offers. This is the version of the job I'd actually ship."
That paragraph signals: (1) you know the distribution is skewed, (2) you know which functions are appropriate, (3) you know the standard-deviation trap, (4) you know the LAD trap, (5) you've shipped this in production. Senior-IC bar cleared in 90 seconds.
Run the SQL — practice questions
- customer_growth-0031 — Power-User Skew Analysis · NTILE decile rollup + PERCENTILE_CONT P50/P99 + LAD-tolerant view
- community_solved-0082 — 80/20 Pareto cumulative contribution (Netflix-flavoured probes)
- customer_growth-0032 — Whale / Dolphin / Minnow archetype classification with skewness signals
- customer_growth-0026 — NTILE decile velocity cohorts
- customer_growth-0028 — DAU/MAU with HyperLogLog (non-additivity + HLL probes)
- customer_growth-0033 — Dynamic Pivot (the trap that has no clean answer) · JSON aggregation + three-approach probe