▸ DESIGN · Skew & Distributions · the senior DE vocabulary

Skew & distributions — speak whale, decile, Pareto, P99.

L5+ data-engineering interviews test whether you can name the shape of user behaviour, not just query it. This page codifies the vocabulary: user archetypes (whale / dolphin / minnow), quantile terminology (decile / quartile / percentile), distribution shapes (Pareto / long-tail / normal / skewed), and the one-sentence answers interviewers listen for when they probe with "why not just use AVG and standard deviation?".

§ 01 — User archetypes

User archetypes — whales, dolphins, minnows.

Any consumer platform with a long-tail engagement distribution has three named cohorts. The names come from gaming/casino analytics in the 2000s; they're now standard at Netflix, Spotify, YouTube, Meta, Stripe, and every B2C SaaS. The interview signal isn't knowing the words — it's knowing what to do when you've classified a user as each.

Top 1-5% · The Power Users

🐋 Whale

Definition. The top 1-5% of users by consumption — watch time, GMV, API calls, played minutes. They drive a disproportionate share of total platform usage. On Netflix, a whale watches 60+ hours / month while the typical user watches 5-10.

How you find them in SQL.

PERCENT_RANK() OVER (ORDER BY total_hours DESC) <= 0.05
-- NB: window functions can't appear in WHERE — compute the rank in a subquery, then filter
-- or: NTILE(20) ... WHERE bucket = 1   (top 5%)
-- or: NTILE(100) ... WHERE bucket = 1  (top 1%)

Business context. Whales drive retention metrics — they're the most likely to renew, refer friends, and tolerate price increases. They also stress-test infrastructure: if a whale has a bad streaming experience, it's a P0 because (a) they're paying customers, (b) they generate the most support volume, (c) their bandwidth pattern is what your edge cache must serve. Senior framing: "A whale's churn isn't just one cancelled subscription — it's a leading indicator of platform health degradation."
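
The PERCENT_RANK fragment above can be run end-to-end. A minimal sketch using Python's sqlite3 and a toy user_metrics table (the table and column names are the ones assumed on this page; the data is invented). Note the rank is computed in a subquery, since window functions can't sit in WHERE:

```python
import sqlite3

# Toy data — "user_metrics" / "total_hours" are the names assumed on this page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_metrics (user_id INTEGER, total_hours REAL)")
conn.executemany(
    "INSERT INTO user_metrics VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 21)],   # user 20 watches the most
)

# Window functions can't appear in WHERE, so rank in a subquery, then filter.
whales = conn.execute("""
    SELECT user_id, total_hours
    FROM (
        SELECT user_id, total_hours,
               PERCENT_RANK() OVER (ORDER BY total_hours DESC) AS pr
        FROM user_metrics
    ) AS ranked
    WHERE pr <= 0.05
""").fetchall()
print(whales)   # [(20, 20.0)] — only the heaviest of 20 users clears the 5% cut
```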

Middle 65-75% · The Backbone

🐬 Dolphin (or Mid-User)

Definition. The average user — watches a few hours per week, logs in regularly, uses the core feature set but not the long-tail features. Roughly P25 to P85 on the consumption distribution.

How you find them.

PERCENT_RANK() OVER (ORDER BY total_hours DESC) BETWEEN 0.15 AND 0.75
-- or: NTILE(10) ... WHERE bucket BETWEEN 3 AND 8  (deciles 3-8)

Business context. Dolphins are the platform's stability metric. If dolphin DAU/MAU is steady, the platform is healthy. If dolphins shift downward (toward minnow territory), you're in trouble months before the financial metrics show it. Senior framing: "We don't optimise for the whales — they're already maxed out. We don't optimise for the minnows — they're already gone. We optimise for the dolphin's next session."

Bottom 20-30% · The Churn Risk

🐟 Minnow

Definition. Low-engagement users. One session a week or less, often just trial users, gift-subscription holders, or partner-bundle accounts. Bottom decile or two.

How you find them.

PERCENT_RANK() OVER (ORDER BY total_hours DESC) >= 0.80
-- or: NTILE(10) ... WHERE bucket IN (9, 10)  (bottom 20%)

Business context. Minnows are the growth opportunity, not deadweight. They're paying for the subscription but not getting value — moving them from minnow → dolphin is how the platform expands without acquiring new users. Senior framing: "Minnows aren't a deletion target. They're the move-them-up target. Marketing's job is decile 10 → 8 → 5, not churn-and-replace."

Why these three names matter. In interviews, saying "the top 5% of users" is correct but anonymous — it could be any platform. Saying "the whale cohort" instantly signals you've shipped consumer-product analytics. Same for "minnow" and "long-tail". The vocabulary is the compressed shorthand that lets you and the interviewer move past basic definitions and into the interesting question: what do you do with each cohort?
· · ·
§ 02 — Quantile vocabulary

Quantiles — decile, quartile, quintile, percentile.

"Dividing the population into equal-sized groups based on a metric" is a single concept with five names depending on how many buckets you want. Interviewers expect you to use the right one without thinking.

Term       | Buckets | Bucket size | SQL                        | Typical use
Quartile   | 4       | 25%         | NTILE(4)                   | Coarse cohorting; salary band reports
Quintile   | 5       | 20%         | NTILE(5)                   | Marketing segmentation (top quintile = "premium")
Decile     | 10      | 10%         | NTILE(10)                  | Power-user analysis; lift charts in propensity modelling
Percentile | 100     | 1%          | NTILE(100) or PERCENT_RANK | Latency SLAs; top 1% whale identification
P-notation | —       | —           | PERCENTILE_CONT(0.99)      | "P99 latency" — the value at the 99th percentile

Top decile vs decile 1 — the ambiguity that trips juniors

Top decile = decile 1 in DESC order = top 10% of users. But some teams number deciles ascending (decile 1 = bottom 10%, decile 10 = top 10%). Always confirm the convention before writing SQL — and prefer naming the threshold ("top 10%") over the bucket number ("decile 1") when the audience is non-technical.

-- Convention A: DESC — decile 1 = whales
SELECT user_id, NTILE(10) OVER (ORDER BY hours DESC) AS decile FROM user_metrics;
-- decile 1 = top 10% (whales), decile 10 = bottom 10% (minnows)

-- Convention B: ASC — decile 10 = whales
SELECT user_id, NTILE(10) OVER (ORDER BY hours ASC) AS decile FROM user_metrics;
-- decile 1 = bottom 10% (minnows), decile 10 = top 10% (whales)
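
A quick way to convince yourself the two conventions are exact mirrors — a toy check in Python's sqlite3 (10 invented users with hours 1–10, so every decile holds exactly one user):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_metrics (user_id INTEGER, hours REAL)")
# 10 users with hours 1..10, so each NTILE(10) bucket holds exactly one user.
conn.executemany("INSERT INTO user_metrics VALUES (?, ?)",
                 [(i, float(i)) for i in range(1, 11)])

rows = conn.execute("""
    SELECT user_id,
           NTILE(10) OVER (ORDER BY hours DESC) AS decile_desc,  -- Convention A
           NTILE(10) OVER (ORDER BY hours ASC)  AS decile_asc    -- Convention B
    FROM user_metrics
""").fetchall()

# The two conventions are exact mirrors: bucket numbers always sum to 11.
mirrored = all(d_desc + d_asc == 11 for _, d_desc, d_asc in rows)
whale_ids = {u for u, d_desc, _ in rows if d_desc == 1}   # {10} under Convention A
```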

PERCENTILE_CONT vs PERCENTILE_DISC

Function             | What it returns                                               | Pick when
PERCENTILE_CONT(0.5) | Interpolated value (linear between two adjacent values)       | Continuous metrics — revenue, hours, latency
PERCENTILE_DISC(0.5) | Actual value from the data — the row at or just past the rank | Discrete metrics, or when the result must be a real observed value
NTILE(N)             | Bucket assignment 1..N (roughly equal-sized buckets)          | Cohort assignment, lift charts, decile rollups
PERCENT_RANK()       | Continuous rank in [0, 1]                                     | "Where does this user fall on the curve?"; growth-resilient filters like <= 0.05
CUME_DIST()          | Cumulative distribution in (0, 1]                             | "What fraction of users watch ≤ this many hours?"
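
The CONT/DISC distinction is easy to demo in plain Python. A sketch of the two definitions (linear interpolation vs the first value whose cumulative distribution reaches p) — not any engine's exact implementation:

```python
import math

def percentile_cont(values, p):
    # Linear interpolation between the two adjacent values (PERCENTILE_CONT).
    s = sorted(values)
    rn = p * (len(s) - 1)                  # fractional row number
    lo = math.floor(rn)
    frac = rn - lo
    if frac == 0:
        return float(s[lo])
    return s[lo] + frac * (s[lo + 1] - s[lo])

def percentile_disc(values, p):
    # First actual value whose cumulative distribution >= p (PERCENTILE_DISC).
    s = sorted(values)
    for i, v in enumerate(s, start=1):
        if i / len(s) >= p:
            return v

hours = [1, 2, 3, 4]
cont = percentile_cont(hours, 0.5)   # 2.5 — interpolated, not in the data
disc = percentile_disc(hours, 0.5)   # 2   — a real observed value
```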
· · ·
§ 03 — Distribution shapes

Distribution shapes — name the curve, not just the number.

Pareto distribution — the 80/20 rule

Vilfredo Pareto, 1896, observing that 80% of Italian land was owned by 20% of the population. The same shape shows up in every consumer platform: 80% of watch hours come from 20% of users; 80% of revenue from 20% of customers; 80% of GitHub commits from 20% of contributors.

[Figure — Pareto: 80% of effect from 20% of cause. Users ordered low → high by hours/user; the top 20% drive 80% of the total, the long tail is the remaining 80% of users.]

Netflix context: the top 20% of users likely account for 80% of total watch hours. SQL strategy: use cumulative sums (running totals) to find the "bulge" — order users DESC by hours, compute the running total, and the point where it crosses 80% of the global total is your "vital few" cutoff. See the 80/20 Pareto question for the full pattern.
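
The cumulative-sum strategy is a one-loop computation. A sketch with invented numbers (7 users, hours already sorted DESC):

```python
# Hours per user, already sorted DESC — invented numbers for illustration.
hours = [50, 30, 10, 5, 3, 1, 1]
total = sum(hours)                       # 100

running, cutoff = 0, None
for i, h in enumerate(hours, start=1):
    running += h
    if running >= 0.80 * total:          # first user where the running total crosses 80%
        cutoff = i
        break

share_of_users = cutoff / len(hours)     # 2 of 7 users (~29%) drive 80% of hours
```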

Long tail (power law) — Pareto's stretched cousin

A distribution where a large mass of low-engagement "minnows" trails off into a very long, thin tail of "whales". Visually: a huge spike on the left (lots of people watching 0–1 hours) and a line that stretches far to the right — the tail can reach extremes, with one viewer at 400 hours/month while the bulk of users sit at 0–2 hours.

Two-word interview test: if the interviewer says "long tail", you should immediately reach for NTILE(10) + decile rollups rather than AVG. The average is meaningless on a long tail because the extreme values drag it toward the tail.

Normal distribution (bell curve) — the one tech data rarely has

This is what most people assume data looks like, but in tech it's rarely true for engagement. Definition: most users are in the middle, with very few at the extreme high or low ends. Symmetric around the mean, ~68% within 1σ, ~95% within 2σ, ~99.7% within 3σ. Useful for measurement error, manufacturing tolerances, and IQ scores. Almost never useful for user engagement.
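
The 68/95 figures are easy to sanity-check with a seeded simulation — illustrative only:

```python
import random

random.seed(42)                          # deterministic run for the illustration
sample = [random.gauss(0, 1) for _ in range(100_000)]

within_1sd = sum(abs(x) <= 1 for x in sample) / len(sample)   # ≈ 0.68
within_2sd = sum(abs(x) <= 2 for x in sample) / len(sample)   # ≈ 0.95
```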

Interview tip — the standard-deviation trap. If an interviewer asks why you aren't using simple STDDEV for outlier detection, the answer is: "because engagement data is skewed, and standard deviation assumes a normal distribution." The whales pull the mean way above the median, σ is dominated by outliers, and STDDEV-based outlier detection silently flags half your dolphins as anomalies while missing the actual whales.
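
A toy illustration of why σ is worthless here — 95 casual users plus 5 whales (numbers invented for the example):

```python
import statistics

# 95 casual users plus 5 whales — a caricature of skewed engagement data.
hours = [1.0] * 95 + [500.0] * 5

mean   = statistics.mean(hours)        # 25.95 — a user who doesn't exist
median = statistics.median(hours)      # 1.0   — the typical user
sigma  = statistics.pstdev(hours)      # ~109  — dominated entirely by the whales

# Remove the 5 whales and sigma collapses to zero:
sigma_without_whales = statistics.pstdev(hours[:95])   # 0.0
```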

Skewness — the asymmetry that breaks naive analytics

Positive (right) skew means the "tail" is on the right side. This is the Netflix example — most users are on the low end, but the whales pull the average to the right. Interview insight: when data is positively skewed, the mean (average) will be much higher than the median. Always mention the median to show you aren't being fooled by the whales.

Type                  | Tail on                  | Mean vs median | Where you see it
Positive (right) skew | Right side (high values) | Mean > Median  | Watch time, revenue, salary, page-view counts — the default for engagement
Negative (left) skew  | Left side (low values)   | Mean < Median  | Test scores on an easy test, age at retirement, time-to-fix on critical bugs
Zero skew (symmetric) | Both sides equal         | Mean ≈ Median  | Pure measurement noise, true random walks
[Figure — Skewness, three shapes, three SQL strategies: positive skew (tail right, Mean > Median — Netflix watch hours); normal (symmetric, Mean ≈ Median — measurement error); negative skew (tail left, Mean < Median — easy-test scores).]
· · ·
§ 04 — Mean vs median

Mean vs median — the L5 tell.

The single most-tested signal in senior-IC data interviews: which one do you use, and why?

Distribution                              | AVG(hours) | MEDIAN(hours) | What it means
Uniform (1, 2, 3, 4, 5)                   | 3.0        | 3.0           | They agree — neither misleads
Skewed (1, 2, 3, 4, 1000)                 | 202.0      | 3.0           | One whale pulls the mean up 67×; the median ignores it
Netflix-shape (typical 5 h, whales 60 h+) | ~12 h      | ~6 h          | The mean reports a user who doesn't exist
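
The skewed row is trivially verifiable:

```python
import statistics

skewed = [1, 2, 3, 4, 1000]
mean   = statistics.mean(skewed)      # 202.0
median = statistics.median(skewed)    # 3
ratio  = mean / median                # ≈ 67× — one whale drags the mean
```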

The L5 sentence. When asked "what's the average watch time per user?", the senior answer is: "Average watch time per user is X hours. The median is Y hours. The gap tells you the platform is heavily skewed toward whales — for any operational decision, use the median; for capacity planning, look at P99."

Always-show-both rule. Never report a single measure of central tendency for engagement data. Show mean + median side-by-side. The interviewer's follow-up question will be "why are they different?" — and that's the question you want, because the answer is "because the distribution is skewed, here's how I'd handle it".

Skew ratio — quantifying how skewed

A simple, defensible metric for "how skewed":

SELECT
  AVG(hours)                                           AS mean_hours,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY hours)   AS median_hours,
  AVG(hours) / NULLIF(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY hours), 0)
                                                       AS skew_ratio_mean_over_median,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY hours) /
    NULLIF(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY hours), 0)
                                                       AS p99_over_p50
FROM user_metrics;

Healthy uniform: skew_ratio ≈ 1.0. Healthy long-tail (Pareto): skew_ratio between 2 and 4. Extreme power-law (one viral product, gaming whales): skew_ratio > 10. The p99_over_p50 ratio is the more interviewer-friendly variant — "how many times more does the top 1% consume than the median?"
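
The same two ratios in plain Python, using a simple nearest-rank percentile on invented data (90 casual users, 9 regulars, 1 whale):

```python
import math

def nearest_rank(values, p):
    # Nearest-rank percentile: the value at position ceil(p * n) in sorted order.
    s = sorted(values)
    return s[math.ceil(p * len(s)) - 1]

# 90 casual users, 9 regulars, 1 whale — invented numbers.
hours = [1.0] * 90 + [10.0] * 9 + [100.0]

mean = sum(hours) / len(hours)           # 2.8
p50  = nearest_rank(hours, 0.50)         # 1.0
p99  = nearest_rank(hours, 0.99)         # 10.0

skew_ratio   = mean / p50                # 2.8  — squarely in "healthy long-tail"
p99_over_p50 = p99 / p50                 # 10.0 — the top 1% consume 10x the median user
```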

· · ·
§ 05 — LAD gap

LAD gap — when whales get miscategorised as minnows.

Late-Arriving Dimensions (LAD): the user just watched a brand-new show, but the content metadata hasn't propagated to the global catalog yet. Their event row is correct, but joins to dim_content return NULL, and the user's hours look artificially low for a brief window. If your decile-classification job runs during that window, a whale gets stamped as a minnow. Worse: they appear in churn-risk campaigns and you spam your best customers with retention offers.

The defensive pattern — Lag-Tolerant View

Emit an is_metadata_complete flag on every row. Power-user / VIP / classification reports filter to is_metadata_complete = TRUE; ops and SRE reports run unfiltered (they want to see the LAD-pending rows).

-- The LAD-tolerant decile classification
WITH user_metrics AS (
    SELECT user_id, SUM(watch_seconds) / 3600.0 AS hours
    FROM viewing_activity
    WHERE is_metadata_complete = TRUE   -- ← the LAD guard
    GROUP BY user_id
),
ranked AS (
    SELECT user_id, hours,
        NTILE(10) OVER (ORDER BY hours DESC) AS decile,
        CASE
            WHEN NTILE(20) OVER (ORDER BY hours DESC) = 1  THEN 'whale'
            WHEN NTILE(10) OVER (ORDER BY hours DESC) <= 2 THEN 'heavy_dolphin'
            WHEN NTILE(10) OVER (ORDER BY hours DESC) <= 8 THEN 'dolphin'
            ELSE                                                'minnow'
        END AS archetype
    FROM user_metrics
)
SELECT * FROM ranked ORDER BY hours DESC;
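
The CASE ladder can be sanity-checked in-memory with SQLite (toy data: 20 users with hours 1–20; the LAD guard is dropped here because the toy table has no metadata flag):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_metrics (user_id INTEGER, hours REAL)")
# 20 users with hours 1..20 — user 20 is the heaviest.
conn.executemany("INSERT INTO user_metrics VALUES (?, ?)",
                 [(i, float(i)) for i in range(1, 21)])

rows = conn.execute("""
    SELECT user_id,
        CASE
            WHEN NTILE(20) OVER (ORDER BY hours DESC) = 1  THEN 'whale'
            WHEN NTILE(10) OVER (ORDER BY hours DESC) <= 2 THEN 'heavy_dolphin'
            WHEN NTILE(10) OVER (ORDER BY hours DESC) <= 8 THEN 'dolphin'
            ELSE                                                'minnow'
        END AS archetype
    FROM user_metrics
""").fetchall()

archetype_of = dict(rows)
counts = {}
for _, archetype in rows:
    counts[archetype] = counts.get(archetype, 0) + 1
# Expect 1 whale (top 5%), 3 heavy dolphins, 12 dolphins, 4 minnows (bottom 20%).
```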

When NOT to use the LAD guard

  • Real-time SLA dashboards — you want to see incomplete rows so you can spot the propagation backlog.
  • SRE incident drills — the LAD-pending count is itself the metric.
  • Financial close / regulatory reporting — never; everything must be complete by close.

The rule of thumb: any report that classifies a user's behaviour (decile, archetype, churn risk, cohort) needs the LAD guard. Any report that observes the system itself wants no guard.

· · ·
§ 06 — The interview script

Interview summary table + the one-sentence script.

Term                       | What it measures              | Why use it at Netflix?
Median (P50)               | The "typical" user            | Ignores whales; shows what a normal person actually watches
P99                        | Extreme behaviour             | Sizes the edge cache for the biggest whales; latency SLA target
Decile                     | Group performance             | Compares whales vs minnows side-by-side; lift-chart input
Skew ratio (mean / median) | How asymmetric the curve is   | One number on the board deck that quantifies "how Pareto-shaped is the user base?"
Whale / Dolphin / Minnow   | Named cohorts                 | Compressed vocabulary that maps directly to growth / retention / churn strategies
LAD gap                    | Metadata lag                  | Ensures whales aren't miscategorised as minnows because data is late

The Final L5 Tip — the one sentence that proves senior-level thinking

▸ The one sentence to memorise

"Because this distribution is heavily skewed toward whales, I'll use NTILEs to ensure our metrics represent the entire user base, not just the power users."

When you see a skewed distribution in an interview, don't jump straight into the SQL. Say that sentence first. It proves you understand the failure mode of naive aggregations — and that you've shipped consumer-product analytics in production. One sentence; immediate L5 signal.

The 90-second expanded form

"Because user-engagement data is positively skewed toward whales, I'll classify users with NTILE(10) rather than thresholding on AVG ± σ — the standard-deviation approach assumes a normal distribution, which engagement data never is. I'll report both mean and median on the dashboard with the skew ratio (mean / median) as a single tell. And the classification job will filter to is_metadata_complete = TRUE so a whale watching a freshly-released show doesn't get stamped as a minnow by a LAD-pending join."

That one paragraph signals: (1) you know the distribution is skewed, (2) you know which functions are appropriate, (3) you know the standard-deviation trap, (4) you know the LAD trap, (5) you've shipped this in production. Senior-IC bar cleared in 90 seconds.

Run the SQL — practice questions

· · ·