▸ DESIGN · Skew & Distributions · the senior DE vocabulary

Skew & distributions — speak whale, decile, Pareto, P99.

L5+ data-engineering interviews test whether you can name the shape of user behaviour, not just query it. This page codifies the vocabulary: user archetypes (whale / dolphin / minnow), quantile terminology (decile / quartile / percentile), distribution shapes (Pareto / long-tail / normal / skewed), and the one-sentence answers interviewers listen for when they probe with "why not just use AVG and standard deviation?".

§ 01 — User archetypes

User archetypes — whales, dolphins, minnows.

Any consumer platform with a long-tail engagement distribution has three named cohorts. The names come from gaming/casino analytics in the 2000s; they're now standard at Netflix, Spotify, YouTube, Meta, Stripe, and every B2C SaaS. The interview signal isn't knowing the words — it's knowing what to do when you've classified a user as each.

Top 1-5% · The Power Users

🐋 Whale

Definition. The top 1-5% of users by consumption — watch time, GMV, API calls, played minutes. They drive a disproportionate share of total platform usage. On Netflix, a whale watches 60+ hours / month while the typical user watches 5-10.

How you find them in SQL.

PERCENT_RANK() OVER (ORDER BY total_hours DESC) <= 0.05
-- NB: window functions can't appear in WHERE — compute the rank in a subquery, then filter
-- or: NTILE(20) ... WHERE bucket = 1   (top 5%)
-- or: NTILE(100) ... WHERE bucket = 1  (top 1%)

Business context. Whales drive retention metrics — they're the most likely to renew, refer friends, and tolerate price increases. They also stress-test infrastructure: if a whale has a bad streaming experience, it's a P0 because (a) they're paying customers, (b) they generate the most support volume, (c) their bandwidth pattern is what your edge cache must serve. Senior framing: "A whale's churn isn't just one cancelled subscription — it's a leading indicator of platform health degradation."
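
The PERCENT_RANK fragment above can be run end-to-end. A minimal sketch using Python's sqlite3 and a toy user_metrics table (the table and column names are the ones assumed on this page; the data is invented). Note the rank is computed in a subquery, since window functions can't sit in WHERE:

```python
import sqlite3

# Toy data — "user_metrics" / "total_hours" are the names assumed on this page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_metrics (user_id INTEGER, total_hours REAL)")
conn.executemany(
    "INSERT INTO user_metrics VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 21)],   # user 20 watches the most
)

# Window functions can't appear in WHERE, so rank in a subquery, then filter.
whales = conn.execute("""
    SELECT user_id, total_hours
    FROM (
        SELECT user_id, total_hours,
               PERCENT_RANK() OVER (ORDER BY total_hours DESC) AS pr
        FROM user_metrics
    ) AS ranked
    WHERE pr <= 0.05
""").fetchall()
print(whales)   # [(20, 20.0)] — only the heaviest of 20 users clears the 5% cut
```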

Middle 65-75% · The Backbone

🐬 Dolphin (or Mid-User)

Definition. The average user — watches a few hours per week, logs in regularly, uses the core feature set but not the long-tail features. Roughly P25 to P85 on the consumption distribution.

How you find them.

PERCENT_RANK() OVER (ORDER BY total_hours DESC) BETWEEN 0.15 AND 0.75
-- or: NTILE(10) ... WHERE bucket BETWEEN 3 AND 8  (deciles 3-8)

Business context. Dolphins are the platform's stability metric. If dolphin DAU/MAU is steady, the platform is healthy. If dolphins shift downward (toward minnow territory), you're in trouble months before the financial metrics show it. Senior framing: "We don't optimise for the whales — they're already maxed out. We don't optimise for the minnows — they're already gone. We optimise for the dolphin's next session."

Bottom 20-30% · The Churn Risk

🐟 Minnow

Definition. Low-engagement users. One session a week or less, often just trial users, gift-subscription holders, or partner-bundle accounts. Bottom decile or two.

How you find them.

PERCENT_RANK() OVER (ORDER BY total_hours DESC) >= 0.80
-- or: NTILE(10) ... WHERE bucket IN (9, 10)  (bottom 20%)

Business context. Minnows are the growth opportunity, not deadweight. They're paying for the subscription but not getting value — moving them from minnow → dolphin is how the platform expands without acquiring new users. Senior framing: "Minnows aren't a deletion target. They're the move-them-up target. Marketing's job is decile 10 → 8 → 5, not churn-and-replace."

Why these three names matter. In interviews, saying "the top 5% of users" is correct but anonymous — it could be any platform. Saying "the whale cohort" instantly signals you've shipped consumer-product analytics. Same for "minnow" and "long-tail". The vocabulary is the compressed shorthand that lets you and the interviewer move past basic definitions and into the interesting question: what do you do with each cohort?
· · ·
§ 02 — Quantile vocabulary

Quantiles — decile, quartile, quintile, percentile.

"Dividing the population into equal-sized groups based on a metric" is a single concept with five names depending on how many buckets you want. Interviewers expect you to use the right one without thinking.

Term       | Buckets | Bucket size | SQL                        | Typical use
Quartile   | 4       | 25%         | NTILE(4)                   | Coarse cohorting; salary band reports
Quintile   | 5       | 20%         | NTILE(5)                   | Marketing segmentation (top quintile = "premium")
Decile     | 10      | 10%         | NTILE(10)                  | Power-user analysis; lift charts in propensity modelling
Percentile | 100     | 1%          | NTILE(100) or PERCENT_RANK | Latency SLAs; top 1% whale identification
P-notation | —       | —           | PERCENTILE_CONT(0.99)      | "P99 latency" — the value at the 99th percentile

Top decile vs decile 1 — the ambiguity that trips juniors

Top decile = decile 1 in DESC order = top 10% of users. But some teams number deciles ascending (decile 1 = bottom 10%, decile 10 = top 10%). Always confirm the convention before writing SQL — and prefer naming the threshold ("top 10%") over the bucket number ("decile 1") when the audience is non-technical.

-- Convention A: DESC — decile 1 = whales
SELECT user_id, NTILE(10) OVER (ORDER BY hours DESC) AS decile FROM user_metrics;
-- decile 1 = top 10% (whales), decile 10 = bottom 10% (minnows)

-- Convention B: ASC — decile 10 = whales
SELECT user_id, NTILE(10) OVER (ORDER BY hours ASC) AS decile FROM user_metrics;
-- decile 1 = bottom 10% (minnows), decile 10 = top 10% (whales)
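
A quick way to convince yourself the two conventions are exact mirrors — a toy check in Python's sqlite3 (10 invented users with hours 1–10, so every decile holds exactly one user):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_metrics (user_id INTEGER, hours REAL)")
# 10 users with hours 1..10, so each NTILE(10) bucket holds exactly one user.
conn.executemany("INSERT INTO user_metrics VALUES (?, ?)",
                 [(i, float(i)) for i in range(1, 11)])

rows = conn.execute("""
    SELECT user_id,
           NTILE(10) OVER (ORDER BY hours DESC) AS decile_desc,  -- Convention A
           NTILE(10) OVER (ORDER BY hours ASC)  AS decile_asc    -- Convention B
    FROM user_metrics
""").fetchall()

# The two conventions are exact mirrors: bucket numbers always sum to 11.
mirrored = all(d_desc + d_asc == 11 for _, d_desc, d_asc in rows)
whale_ids = {u for u, d_desc, _ in rows if d_desc == 1}   # {10} under Convention A
```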

PERCENTILE_CONT vs PERCENTILE_DISC

Function             | What it returns                                               | Pick when
PERCENTILE_CONT(0.5) | Interpolated value (linear between two adjacent values)       | Continuous metrics — revenue, hours, latency
PERCENTILE_DISC(0.5) | Actual value from the data — the row at or just past the rank | Discrete metrics, or when the result must be a real observed value
NTILE(N)             | Bucket assignment 1..N (roughly equal-sized buckets)          | Cohort assignment, lift charts, decile rollups
PERCENT_RANK()       | Continuous rank in [0, 1]                                     | "Where does this user fall on the curve?"; growth-resilient filters like <= 0.05
CUME_DIST()          | Cumulative distribution in (0, 1]                             | "What fraction of users watch ≤ this many hours?"
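
The CONT/DISC distinction is easy to demo in plain Python. A sketch of the two definitions (linear interpolation vs the first value whose cumulative distribution reaches p) — not any engine's exact implementation:

```python
import math

def percentile_cont(values, p):
    # Linear interpolation between the two adjacent values (PERCENTILE_CONT).
    s = sorted(values)
    rn = p * (len(s) - 1)                  # fractional row number
    lo = math.floor(rn)
    frac = rn - lo
    if frac == 0:
        return float(s[lo])
    return s[lo] + frac * (s[lo + 1] - s[lo])

def percentile_disc(values, p):
    # First actual value whose cumulative distribution >= p (PERCENTILE_DISC).
    s = sorted(values)
    for i, v in enumerate(s, start=1):
        if i / len(s) >= p:
            return v

hours = [1, 2, 3, 4]
cont = percentile_cont(hours, 0.5)   # 2.5 — interpolated, not in the data
disc = percentile_disc(hours, 0.5)   # 2   — a real observed value
```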
· · ·
§ 03 — Distribution shapes

Distribution shapes — name the curve, not just the number.

Pareto distribution — the 80/20 rule

Vilfredo Pareto, 1896, observing that 80% of Italian land was owned by 20% of the population. The same shape shows up in every consumer platform: 80% of watch hours come from 20% of users; 80% of revenue from 20% of customers; 80% of GitHub commits from 20% of contributors.

[Figure — Pareto: 80% of effect from 20% of cause. Users ordered low → high by hours/user; the top 20% drive 80% of the total, the long tail is the remaining 80% of users.]

Netflix context: the top 20% of users likely account for 80% of total watch hours. SQL strategy: use cumulative sums (running totals) to find the "bulge" — order users DESC by hours, compute the running total, and the point where it crosses 80% of the global total is your "vital few" cutoff. See the 80/20 Pareto question for the full pattern.
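
The cumulative-sum strategy is a one-loop computation. A sketch with invented numbers (7 users, hours already sorted DESC):

```python
# Hours per user, already sorted DESC — invented numbers for illustration.
hours = [50, 30, 10, 5, 3, 1, 1]
total = sum(hours)                       # 100

running, cutoff = 0, None
for i, h in enumerate(hours, start=1):
    running += h
    if running >= 0.80 * total:          # first user where the running total crosses 80%
        cutoff = i
        break

share_of_users = cutoff / len(hours)     # 2 of 7 users (~29%) drive 80% of hours
```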

Long tail (power law) — Pareto's stretched cousin

A distribution where a large mass of low-engagement "minnows" trails off into a very long, thin tail of "whales". Visually: a huge spike on the left (lots of people watching 0–1 hours) and a line that stretches far to the right — the tail can reach extremes, with one viewer at 400 hours/month while the bulk of users sit at 0–2 hours.

Two-word interview test: if the interviewer says "long tail", you should immediately reach for NTILE(10) + decile rollups rather than AVG. The average is meaningless on a long tail because the extreme values drag it toward the tail.

Normal distribution (bell curve) — the one tech data rarely has

This is what most people assume data looks like, but in tech it's rarely true for engagement. Definition: most users are in the middle, with very few at the extreme high or low ends. Symmetric around the mean, ~68% within 1σ, ~95% within 2σ, ~99.7% within 3σ. Useful for measurement error, manufacturing tolerances, and IQ scores. Almost never useful for user engagement.
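
The 68/95 figures are easy to sanity-check with a seeded simulation — illustrative only:

```python
import random

random.seed(42)                          # deterministic run for the illustration
sample = [random.gauss(0, 1) for _ in range(100_000)]

within_1sd = sum(abs(x) <= 1 for x in sample) / len(sample)   # ≈ 0.68
within_2sd = sum(abs(x) <= 2 for x in sample) / len(sample)   # ≈ 0.95
```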

Interview tip — the standard-deviation trap. If an interviewer asks why you aren't using simple STDDEV for outlier detection, the answer is: "because engagement data is skewed, and standard deviation assumes a normal distribution." The whales pull the mean way above the median, σ is dominated by outliers, and STDDEV-based outlier detection silently flags half your dolphins as anomalies while missing the actual whales.
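
A toy illustration of why σ is worthless here — 95 casual users plus 5 whales (numbers invented for the example):

```python
import statistics

# 95 casual users plus 5 whales — a caricature of skewed engagement data.
hours = [1.0] * 95 + [500.0] * 5

mean   = statistics.mean(hours)        # 25.95 — a user who doesn't exist
median = statistics.median(hours)      # 1.0   — the typical user
sigma  = statistics.pstdev(hours)      # ~109  — dominated entirely by the whales

# Remove the 5 whales and sigma collapses to zero:
sigma_without_whales = statistics.pstdev(hours[:95])   # 0.0
```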

Skewness — the asymmetry that breaks naive analytics

Positive (right) skew means the "tail" is on the right side. This is the Netflix example — most users are on the low end, but the whales pull the average to the right. Interview insight: when data is positively skewed, the mean (average) will be much higher than the median. Always mention the median to show you aren't being fooled by the whales.

Type                  | Tail on                  | Mean vs median | Where you see it
Positive (right) skew | Right side (high values) | Mean > Median  | Watch time, revenue, salary, page-view counts — the default for engagement
Negative (left) skew  | Left side (low values)   | Mean < Median  | Test scores on an easy test, age at retirement, time-to-fix on critical bugs
Zero skew (symmetric) | Both sides equal         | Mean ≈ Median  | Pure measurement noise, true random walks
[Figure — Skewness, three shapes, three SQL strategies: positive skew (tail right, Mean > Median — Netflix watch hours); normal (symmetric, Mean ≈ Median — measurement error); negative skew (tail left, Mean < Median — easy-test scores).]
· · ·
§ 04 — Mean vs median

Mean vs median — the L5 tell.

The single most-tested signal in senior-IC data interviews: which one do you use, and why?

Distribution                              | AVG(hours) | MEDIAN(hours) | What it means
Uniform (1, 2, 3, 4, 5)                   | 3.0        | 3.0           | They agree — neither misleads
Skewed (1, 2, 3, 4, 1000)                 | 202.0      | 3.0           | One whale pulls the mean up 67×; the median ignores it
Netflix-shape (typical 5 h, whales 60 h+) | ~12 h      | ~6 h          | The mean reports a user who doesn't exist
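
The skewed row is trivially verifiable:

```python
import statistics

skewed = [1, 2, 3, 4, 1000]
mean   = statistics.mean(skewed)      # 202.0
median = statistics.median(skewed)    # 3
ratio  = mean / median                # ≈ 67× — one whale drags the mean
```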

The L5 sentence. When asked "what's the average watch time per user?", the senior answer is: "Average watch time per user is X hours. The median is Y hours. The gap tells you the platform is heavily skewed toward whales — for any operational decision, use the median; for capacity planning, look at P99."

Always-show-both rule. Never report a single measure of central tendency for engagement data. Show mean + median side-by-side. The interviewer's follow-up question will be "why are they different?" — and that's the question you want, because the answer is "because the distribution is skewed, here's how I'd handle it".

Skew ratio — quantifying how skewed

A simple, defensible metric for "how skewed":

SELECT
  AVG(hours)                                           AS mean_hours,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY hours)   AS median_hours,
  AVG(hours) / NULLIF(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY hours), 0)
                                                       AS skew_ratio_mean_over_median,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY hours) /
    NULLIF(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY hours), 0)
                                                       AS p99_over_p50
FROM user_metrics;

Healthy uniform: skew_ratio ≈ 1.0. Healthy long-tail (Pareto): skew_ratio between 2 and 4. Extreme power-law (one viral product, gaming whales): skew_ratio > 10. The p99_over_p50 ratio is the more interviewer-friendly variant — "how many times more does the top 1% consume than the median?"
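
The same two ratios in plain Python, using a simple nearest-rank percentile on invented data (90 casual users, 9 regulars, 1 whale):

```python
import math

def nearest_rank(values, p):
    # Nearest-rank percentile: the value at position ceil(p * n) in sorted order.
    s = sorted(values)
    return s[math.ceil(p * len(s)) - 1]

# 90 casual users, 9 regulars, 1 whale — invented numbers.
hours = [1.0] * 90 + [10.0] * 9 + [100.0]

mean = sum(hours) / len(hours)           # 2.8
p50  = nearest_rank(hours, 0.50)         # 1.0
p99  = nearest_rank(hours, 0.99)         # 10.0

skew_ratio   = mean / p50                # 2.8  — squarely in "healthy long-tail"
p99_over_p50 = p99 / p50                 # 10.0 — the top 1% consume 10x the median user
```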

· · ·
§ 05 — LAD gap

LAD gap — when whales get miscategorised as minnows.

Late-Arriving Dimensions (LAD): the user just watched a brand-new show, but the content metadata hasn't propagated to the global catalog yet. Their event row is correct, but joins to dim_content return NULL, and the user's hours look artificially low for a brief window. If your decile-classification job runs during that window, a whale gets stamped as a minnow. Worse: they appear in churn-risk campaigns and you spam your best customers with retention offers.

The defensive pattern — Lag-Tolerant View

Emit an is_metadata_complete flag on every row. Power-user / VIP / classification reports filter to is_metadata_complete = TRUE; ops and SRE reports run unfiltered (they want to see the LAD-pending rows).

-- The LAD-tolerant decile classification
WITH user_metrics AS (
    SELECT user_id, SUM(watch_seconds) / 3600.0 AS hours
    FROM viewing_activity
    WHERE is_metadata_complete = TRUE   -- ← the LAD guard
    GROUP BY user_id
),
ranked AS (
    SELECT user_id, hours,
        NTILE(10) OVER (ORDER BY hours DESC) AS decile,
        CASE
            WHEN NTILE(20) OVER (ORDER BY hours DESC) = 1  THEN 'whale'
            WHEN NTILE(10) OVER (ORDER BY hours DESC) <= 2 THEN 'heavy_dolphin'
            WHEN NTILE(10) OVER (ORDER BY hours DESC) <= 8 THEN 'dolphin'
            ELSE                                                'minnow'
        END AS archetype
    FROM user_metrics
)
SELECT * FROM ranked ORDER BY hours DESC;
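
The CASE ladder can be sanity-checked in-memory with SQLite (toy data: 20 users with hours 1–20; the LAD guard is dropped here because the toy table has no metadata flag):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_metrics (user_id INTEGER, hours REAL)")
# 20 users with hours 1..20 — user 20 is the heaviest.
conn.executemany("INSERT INTO user_metrics VALUES (?, ?)",
                 [(i, float(i)) for i in range(1, 21)])

rows = conn.execute("""
    SELECT user_id,
        CASE
            WHEN NTILE(20) OVER (ORDER BY hours DESC) = 1  THEN 'whale'
            WHEN NTILE(10) OVER (ORDER BY hours DESC) <= 2 THEN 'heavy_dolphin'
            WHEN NTILE(10) OVER (ORDER BY hours DESC) <= 8 THEN 'dolphin'
            ELSE                                                'minnow'
        END AS archetype
    FROM user_metrics
""").fetchall()

archetype_of = dict(rows)
counts = {}
for _, archetype in rows:
    counts[archetype] = counts.get(archetype, 0) + 1
# Expect 1 whale (top 5%), 3 heavy dolphins, 12 dolphins, 4 minnows (bottom 20%).
```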

When NOT to use the LAD guard

  • Real-time SLA dashboards — you want to see incomplete rows so you can spot the propagation backlog.
  • SRE incident drills — the LAD-pending count is itself the metric.
  • Financial close / regulatory reporting — never; everything must be complete by close.

The rule of thumb: any report that classifies a user's behaviour (decile, archetype, churn risk, cohort) needs the LAD guard. Any report that observes the system itself wants no guard.

· · ·
§ 06 — The interview script

Interview summary table + the one-sentence script.

Term                       | What it measures              | Why use it at Netflix?
Median (P50)               | The "typical" user            | Ignores whales; shows what a normal person actually watches
P99                        | Extreme behaviour             | Sizes the edge cache for the biggest whales; latency SLA target
Decile                     | Group performance             | Compares whales vs minnows side-by-side; lift-chart input
Skew ratio (mean / median) | How asymmetric the curve is   | One number on the board deck that quantifies "how Pareto-shaped is the user base?"
Whale / Dolphin / Minnow   | Named cohorts                 | Compressed vocabulary that maps directly to growth / retention / churn strategies
LAD gap                    | Metadata lag                  | Ensures whales aren't miscategorised as minnows because data is late

The Final L5 Tip — the one sentence that proves senior-level thinking

▸ The one sentence to memorise

"Because this distribution is heavily skewed toward whales, I'll use NTILEs to ensure our metrics represent the entire user base, not just the power users."

When you see a skewed distribution in an interview, don't jump straight into the SQL. Say that sentence first. It proves you understand the failure mode of naive aggregations — and that you've shipped consumer-product analytics in production. One sentence; immediate L5 signal.

The 90-second expanded form

"Because user-engagement data is positively skewed toward whales, I'll classify users with NTILE(10) rather than thresholding on AVG ± σ — the standard-deviation approach assumes a normal distribution, which engagement data never is. I'll report both mean and median on the dashboard with the skew ratio (mean / median) as a single tell. And the classification job will filter to is_metadata_complete = TRUE so a whale watching a freshly-released show doesn't get stamped as a minnow by a LAD-pending join."

That one paragraph signals: (1) you know the distribution is skewed, (2) you know which functions are appropriate, (3) you know the standard-deviation trap, (4) you know the LAD trap, (5) you've shipped this in production. Senior-IC bar cleared in 90 seconds.

Run the SQL — practice questions

· · ·