▸ ANALYTICS · the serving layer · petabyte warehouse → sub-second dashboard

Analytics, Dashboard & Optimization.

The Performance pillar makes one petabyte query cheap. This pillar is the other half of the job: turning that warehouse into a sub-second, hundred-viewer dashboard that doesn't reprice itself every refresh. Nineteen patterns across modelling for the dashboard, accelerating the serve, restraining the dashboard itself, and the analytics-correctness traps that make a chart quietly lie.

The mental model — a dashboard is a latency budget you spend three ways. Every panel that loads in 300 ms instead of 30 s did one of three things: it pre-computed the answer before you asked, it served the answer from something faster than a warehouse scan, or it simply asked for less. The fourth concern is correctness: a fast dashboard that lies is worse than a slow one.
Lever №1 · Model

Pre-compute it

Rollups, one-big-table marts, aggregate awareness and a semantic layer — so the dashboard reads a small table instead of scanning raw events at view time.

MODEL: build once, read forever raw · PB rollup job 1× per day · $ once agg · KB dashboards read the rollup, not raw
Lever №2 · Serve

Accelerate it

Materialized views, result cache, extracts and purpose-built OLAP engines (Druid / Pinot / ClickHouse) for interactive latency at high concurrency.

SERVE: absorb the herd viewer viewer ×100 MV + cache hits ≈ free warehouse 1st viewer only identical queries never recompute
Lever №3 · Restrain

Ask for less

Bounded default windows, filter-aligned partitioning, lazy tiles and caching — the dashboard shouldn't fire thirty "all-time" queries on first paint.

RESTRAIN: ask for less fire lazy lazy 28-day default · lazy tiles · cache TTL
THE SERVING PATH — PAY ONCE UPSTREAM, SERVE ALMOST FREE DOWNSTREAM Raw events PB · firehose rollup · 1× per load $ once Agg / OBT mart 1 row / day × dim refresh · incremental MV · cache · OLAP hits ≈ free ≈ free Dashboard ×100 viewers the anti-pattern: every viewer scans raw events at view time — PB × 100 loads (Lever A, №01) Every lever on this page moves work left of the dashboard — pre-compute, then serve from the fastest tier.
The serving path, animated — the top route pays once per load; the red route pays per viewer, forever

…plus a fourth, non-negotiable concern — don't let the chart lie. Averages over skewed data, non-additive metrics rolled up wrong, and a half-loaded latest partition are the ways a fast dashboard ships a confident, wrong number.

Serving-layer cheatsheet — same need, different stacks

NeedSnowflakeBigQueryDatabricks / SparkReal-time OLAP
Pre-aggregatematerialized view / tablematerialized viewgold table / DLTDruid / Pinot rollup
Instant repeat queryresult cache (24h)cached resultsDelta / disk cachebroker cache
Sub-second @ concurrencymulti-cluster warehouseBI EngineSQL warehouse + PhotonDruid / Pinot / ClickHouse
BI extractBI EngineTableau Hyper · Power BI import
Approx distinctAPPROX_COUNT_DISTINCTAPPROX_COUNT_DISTINCTapprox_count_distinctTheta / HLL sketch
Semantic layerUnity metric viewsdbt / Looker / Cube
§ Lever A — model for the dashboard · pre-compute everything

The answer should exist before the question.

The single biggest determinant of dashboard speed isn't the BI tool or the warehouse size — it's whether the panel reads a pre-aggregated table or scans raw events. Every pattern here moves work out of view-time and into a scheduled job that runs once.

№ 01

Dashboards read rollup tables, never raw events

rollup / aggregate tableall warehouses

Spotify — a "streams per day, last 90 days" tile wired straight to the raw play-events firehose, re-scanned on every load by every viewer.

✗ Anti-pattern — view-time scan of raw
-- the panel's query, run on every page load:
SELECT ds, COUNT(*) AS streams
FROM   play_events            -- billions of rows/day
WHERE  ds >= DATEADD(day,-90,CURRENT_DATE)
GROUP BY ds;
✓ Optimized — read the daily rollup
-- scheduled once/day: agg_streams_daily (1 row/day/dim)
INSERT INTO agg_streams_daily
SELECT ds, country, COUNT(*) streams, ...
FROM play_events WHERE ds = CURRENT_DATE GROUP BY ds, country;
-- the panel now scans ~90 rows:
SELECT ds, SUM(streams) FROM agg_streams_daily
WHERE ds >= DATEADD(day,-90,CURRENT_DATE) GROUP BY ds;

Why it wins. A dashboard is read hundreds of times between data refreshes, so paying the scan once in a scheduled job and serving everyone from a one-row-per-day×dimension table is the highest-leverage move in all of analytics. Keep a small ladder of grains (hourly → daily → monthly) and point each panel at the coarsest one that answers it.

Impact: billions of rows scanned per load → ~90 rows; warehouse spend on that tile drops by 3–4 orders of magnitude.
№ 02

Pre-join a one-big-table mart so the dashboard never joins

OBT · denormalizationall warehouses

Airbnb — a bookings dashboard whose every filter triggers a 6-table star-schema join at query time, multiplied across panels.

✗ Anti-pattern — join the star at view time
SELECT d.market, l.room_type, SUM(f.gbv)
FROM   fact_bookings f
JOIN dim_listing  l ON l.listing_id = f.listing_id
JOIN dim_market   d ON d.market_id  = l.market_id
JOIN dim_date     dt ON dt.ds = f.ds
JOIN dim_guest    g ON g.guest_id = f.guest_id
... GROUP BY 1,2;   -- every panel re-runs the joins
✓ Optimized — read one wide mart
-- built once in ELT: bookings_obt has the dims'
-- attributes denormalized onto each fact row.
SELECT market, room_type, SUM(gbv)
FROM   bookings_obt
WHERE  ds BETWEEN :start AND :end
GROUP BY market, room_type;

Why it wins. Star schemas are the right storage model, but joins are the most expensive thing a dashboard does repeatedly. Flattening the hot dimensions onto the fact in an ELT step (a "one big table") trades a little storage and refresh cost for join-free reads — and columnar compression makes the duplicated dimension values nearly free. The semantic layer (№4) can still present it as a clean star.

Impact: N panels × a 6-way join per refresh → a single wide scan; the most common cause of a "slow dashboard, idle warehouse… then a spike."
№ 03

Aggregate awareness — route each query to the smallest table that answers it

aggregate navigationLooker · Cube · dbt

Uber — the same metric is asked at city-month, country-week and global-day granularity; one table can't be optimal for all three.

✗ Anti-pattern — one grain for everything
-- every query hits the finest table (trip-level)
-- even when it only needs country-month totals,
-- OR analysts hard-code which rollup to use and
-- the wiring rots as rollups change.
✓ Optimized — let the layer pick the rollup
# Cube / Looker style: declare rollups; the
# query planner rewrites to the coarsest match.
pre_aggregations:
  by_country_month: {measures: [gbv], dimensions: [country],
                     granularity: month}
  by_city_day:      {measures: [gbv], dimensions: [city],
                     granularity: day}
# a country-month question auto-routes to by_country_month

Why it wins. Aggregate awareness (Looker's aggregate_awareness, Cube pre-aggregations, Mondrian agg tables) lets you define a hierarchy of rollups and have the BI layer automatically rewrite each query to the smallest pre-aggregation that can answer it — transparently falling back to raw for unusual cuts. Analysts write one logical query; the engine routes it.

Impact: coarse questions hit kilobyte tables, fine questions still work — without analysts hand-picking the source.
№ 04

A semantic layer — define each metric once

metrics / semantic layerdbt · LookML · Cube

Meta — five teams each hand-write "active user," with subtly different filters, and three dashboards disagree by 4% in the same all-hands.

✗ Anti-pattern — metric re-defined per dashboard
-- dashboard A
COUNT(DISTINCT CASE WHEN events > 0 THEN user_id END)
-- dashboard B (forgot the bot filter)
COUNT(DISTINCT user_id)
-- dashboard C (different session window)
COUNT(DISTINCT CASE WHEN session_min >= 1 THEN user_id END)
✓ Optimized — one governed definition
# semantic_model.yml — defined once, reused everywhere
metrics:
  - name: weekly_active_users
    label: WAU
    calculation: count_distinct(user_id)
    filter: "is_bot = false AND events > 0"
# every dashboard references metric('weekly_active_users')

Why it wins. A semantic/metrics layer makes the metric definition a single governed object that every tool consumes, so the numbers reconcile by construction. It also centralises the join paths and rollup routing — which means the performance wins of №1–3 are applied once and inherited by every dashboard instead of re-litigated in each.

Impact: dashboards stop disagreeing; one change to the definition propagates everywhere; "whose number is right?" stops being a meeting.
№ 05

Store sketches, not counts — so distinct metrics stay roll-up-able

additivity · HLL sketchSnowflake · BQ · Druid

Reddit — a rollup stores daily_distinct_users as an integer, then a PM asks for the monthly distinct and the dashboard "helpfully" sums 30 days of it.

✗ Anti-pattern — sum of daily distincts (wrong)
-- agg_daily.dau is an INT count of distinct users
SELECT SUM(dau) AS "MAU"      -- ❌ double-counts
FROM   agg_daily              -- anyone active on
WHERE  month = '2024-01';     -- 5 days counts 5×
✓ Optimized — store a mergeable HLL sketch
-- agg_daily.dau_hll holds a HyperLogLog sketch
SELECT HLL_ESTIMATE(HLL_COMBINE(dau_hll)) AS mau
FROM   agg_daily
WHERE  month = '2024-01';
-- sketches MERGE across any window → correct MAU,
-- L7, L28, quarter — all without rescanning raw.

Why it wins. Sums, counts and min/max are additive — you can roll them up freely. Distinct counts are not, and pre-aggregating them as integers bakes in a double-counting bug. Storing a HyperLogLog sketch instead keeps the metric mergeable: any time window is a union of sketches, so the rollup stays both correct and cheap. (Same engine as Performance №14, used here to keep the serving layer honest.)

Impact: distinct-user metrics become rollup-safe at every grain — no raw rescans, no silent over-count.
✦ ✦ ✦
§ Lever B — accelerate the serve · faster than a warehouse scan

When pre-aggregation isn't enough, change the engine.

Some dashboards are interactive (slice, drill, filter) or high-concurrency (an exec floor at 9am Monday) and even a rollup scan on a warehouse is too slow. These patterns put a faster serving tier in front of the data.

№ 06

Materialized views + result cache for repeated queries

materialized view · cacheSnowflake · BQ · Redshift

Salesforce — a heavy aggregate behind a popular tile, recomputed from scratch for every viewer although the underlying data changes hourly.

✗ Anti-pattern — recompute per view
-- complex GROUP BY over a large base table,
-- executed fresh on every dashboard open even
-- though inputs only change once an hour.
✓ Optimized — materialize + let cache serve
CREATE MATERIALIZED VIEW mv_kpi AS
SELECT region, ds, SUM(amount) amt, COUNT(*) n
FROM base GROUP BY region, ds;
-- MV auto-maintains incrementally; identical
-- repeat queries also return from the result
-- cache instantly until the data changes.

Why it wins. A materialized view persists the aggregate and (on Snowflake/BigQuery/Redshift) maintains it incrementally as the base changes, so viewers read a small, current result. The warehouse result cache stacks on top: byte-identical repeat queries return with zero compute until the inputs change. Together they absorb the "everyone opens the same dashboard" load.

Impact: the first viewer pays; the next hundred are served from MV + cache for ~free until the next data load.
№ 07

Put a real-time OLAP engine in front for sub-second @ high concurrency

Druid · Pinot · ClickHouseserving store

LinkedIn / Uber — a member-facing "who viewed your profile / trips this week" analytics surface: thousands of concurrent users, <200 ms expected, fresh to the minute. A batch warehouse can't do this.

✗ Anti-pattern — interactive app on a batch warehouse
-- every user request fires a warehouse query;
-- queue depth explodes at concurrency, p99 is
-- seconds, and freshness lags the batch job.
✓ Optimized — pre-aggregated columnar OLAP tier
-- ingest the stream into Druid/Pinot/ClickHouse:
--  • rollup at ingestion (pre-aggregated segments)
--  • columnar + inverted/bitmap indexes
--  • scatter-gather across data nodes
-- the app queries the OLAP store, not the warehouse:
SELECT dim, SUM(metric) FROM events_realtime
WHERE ts > now() - INTERVAL '7' DAY GROUP BY dim;

Why it wins. Druid, Pinot and ClickHouse are built for exactly this: roll-up at ingestion, columnar segments with bitmap/inverted indexes, and scatter-gather execution tuned for many small concurrent aggregations with sub-second p99. They ingest from Kafka for minute-fresh data. It's the standard pattern when a dashboard is really a product surface, not an internal report.

Impact: seconds-and-queueing → sub-200 ms at thousands of QPS, minute-fresh — latency a warehouse fundamentally can't hit at that concurrency.
№ 08

Extract vs live connection — cache the data next to the BI tool

extract · import modeTableau · Power BI

Walmart — a Tableau workbook on a live warehouse connection where every filter click round-trips a fresh query, and 300 analysts do it all day.

✗ Anti-pattern — live query on every interaction
-- LIVE connection: each filter/drill = a new
-- warehouse query. Interactive latency is at the
-- mercy of warehouse load; cost scales with clicks.
✓ Optimized — columnar extract as a cache
-- Tableau Hyper extract / Power BI Import:
--  • a compressed columnar snapshot lives with the
--    BI engine; interactions hit RAM, not the WH
--  • scheduled refresh keeps it current
--  • filter to the needed rows/cols at extract time
-- use LIVE/DirectQuery only when true real-time
-- freshness is the requirement.

Why it wins. A Tableau Hyper extract or Power BI import is a purpose-built columnar cache sitting next to the BI engine — interactions resolve in memory instead of round-tripping to the warehouse, which is both faster and dramatically cheaper. Reserve live/DirectQuery for genuinely real-time needs, and even then back it with aggregate awareness. Extract only the fields and grain the workbook uses (never SELECT *).

Impact: interaction latency decouples from warehouse load; the per-click warehouse bill goes to roughly zero.
№ 09

Refresh extracts and MVs incrementally, not full-rebuild

incremental refreshPower BI · dbt · MVs

Datadog — a 2-year extract behind a usage dashboard, fully rebuilt every hour because "refresh" was left on the default.

✗ Anti-pattern — full refresh of all history
-- hourly job re-reads and re-loads 730 days of
-- data to pick up the last hour of changes.
✓ Optimized — only the changed window
-- Power BI incremental refresh policy:
--   archive > 2 years, refresh last 3 days
-- dbt incremental model:
{{ config(materialized='incremental') }}
SELECT ... FROM events
{% if is_incremental() %}
  WHERE ds > (SELECT MAX(ds) FROM {{ this }})
{% endif %}

Why it wins. The same "compute once" discipline from the Performance pillar (№23–24) applied to the serving layer: partition the extract/MV by date and refresh only the recent, mutable window — archiving the stable history. Refresh cost scales with new data, not total history, so the hourly job stays flat as the dataset grows.

Impact: an hourly refresh that read 730 days now reads ~3; refresh time and cost stop growing with the table.
№ 10

Approximate counters for the headline tiles

approximate analyticsall warehouses

TikTok — a real-time "unique viewers" big-number tile recomputing an exact COUNT(DISTINCT) over the live firehose every few seconds.

✗ Anti-pattern — exact distinct for a glanceable KPI
SELECT COUNT(DISTINCT viewer_id) AS unique_viewers
FROM   live_views;        -- exact, expensive, and
                          -- nobody reads the last 3 digits
                          -- of a 14,237,1•• counter.
✓ Optimized — approximate for the headline
SELECT APPROX_COUNT_DISTINCT(viewer_id) AS unique_viewers
FROM   live_views;        -- ~1–2% error, a fraction
                          -- of the cost; exact reserved
                          -- for billing/export drill-downs.

Why it wins. A headline counter is read at a glance — 14.2M vs 14,237,104 changes no decision, so paying for exactness is pure waste. Approximate distinct/percentile (HLL, t-digest) give the number in a fraction of the cost and memory. Keep exact computation for the places that legally require it — billing, finance, compliance exports.

Impact: the most-viewed tiles get cheap and fast; exactness is spent only where it actually matters.
✦ ✦ ✦
§ Lever C — restrain the dashboard · ask for less

The cheapest query is the one the dashboard never fires.

A lot of dashboard cost is self-inflicted by the dashboard's own defaults — unbounded windows, filters that don't prune, and a wall of panels that all fire on first paint. These patterns are about the dashboard asking the warehouse for less in the first place.

№ 11

Bounded default window + lazy-loaded tiles

default window · lazy loadBI layer

GitHub — a 30-panel dashboard that defaults to "all time" and fires all 30 queries the instant it opens, including tiles below the fold nobody scrolls to.

✗ Anti-pattern — all-time, all-at-once
-- default range: since the beginning of time
WHERE ds >= '2015-01-01'
-- and 30 panels issue their queries on page load,
-- 25 of them never scrolled into view.
✓ Optimized — bounded default + on-demand
-- default to the window people actually look at:
WHERE ds >= DATEADD(day,-28,CURRENT_DATE)
-- render above-the-fold tiles first; defer the
-- rest until scrolled/expanded; let users opt in
-- to longer ranges explicitly.

Why it wins. Most dashboard views only need a recent window, and most panels are never looked at in a given session. A sensible bounded default plus lazy tile loading turns a 30-query thundering herd on every open into a handful of small, recent-window queries — which also prune partitions cleanly. Longer ranges become an explicit, infrequent choice.

Impact: queries per open drop from 30 to ~5, each scanning a recent slice instead of all history.
№ 12

Align partition & cluster keys to the dashboard's filters

filter-aligned layoutall warehouses

Stripe — every dashboard filters by merchant_id and date, but the serving table is partitioned only by date, so the merchant filter scans every file in the range.

✗ Anti-pattern — layout ignores the filters
-- table partitioned by ds only; merchant scattered.
SELECT ... FROM payments_mart
WHERE ds >= :start AND merchant_id = :m;
-- merchant_id = :m touches every file in the date
-- range — no skipping on the most-used filter.
✓ Optimized — cluster on the hot filter
-- Snowflake: CLUSTER BY (ds, merchant_id)
-- BigQuery:  PARTITION BY ds CLUSTER BY merchant_id
-- Delta:     OPTIMIZE ... ZORDER BY (merchant_id)
-- now merchant_id prunes via min/max stats and the
-- dashboard's most common filter skips most files.

Why it wins. The serving table should be physically organised around how the dashboard actually filters. Partition by the coarse time dimension, cluster/Z-order by the high-selectivity filter columns (merchant, account, country), and every interactive filter prunes instead of scans. This is Performance №26 applied with the dashboard's WHERE clause as the design input.

Impact: the most-used filter goes from "scan the whole date range" to "skip to the matching files."
№ 13

Survive concurrency with caching + elastic warehouses

concurrency · autoscaleSnowflake · BQ

Atlassian — Monday 9am, 500 people open the same exec dashboard in five minutes; the warehouse queues and everyone watches spinners.

✗ Anti-pattern — 500 identical live queries
-- single-cluster warehouse; 500 concurrent runs of
-- the same uncached aggregate queue behind each
-- other → p99 measured in minutes.
✓ Optimized — serve from cache, scale out the rest
-- 1) MV + result cache so identical queries don't
--    recompute (see №6) — most of the 500 are cache hits
-- 2) multi-cluster / autoscaling warehouse for the
--    concurrent misses:
ALTER WAREHOUSE bi SET MIN_CLUSTER_COUNT=1
  MAX_CLUSTER_COUNT=10 SCALING_POLICY='STANDARD';
-- 3) BigQuery: a BI Engine reservation for the dash.

Why it wins. High-concurrency spikes are a different problem from slow queries — the fix is to (a) make most requests cache hits so they never touch compute, and (b) let the remainder fan out across auto-added clusters instead of queueing. The combination handles the Monday-morning herd without permanently over-provisioning a giant warehouse that sits idle the rest of the week.

Impact: the herd is mostly cache hits; the rest scale out and back in — interactive latency holds, idle cost doesn't.
№ 14

Don't render a 50,000-row table widget

viz cardinalityBI layer

Shopify — a "detail" tab is a raw table widget returning every transaction, so the warehouse ships 50k rows and the browser chokes rendering them.

✗ Anti-pattern — dump everything into a grid
SELECT * FROM transactions
WHERE ds >= :start;     -- 50k+ rows to a table viz
-- huge result transfer, slow render, and no human
-- reads a 50,000-row on-screen table anyway.
✓ Optimized — aggregate, top-N, paginate, export
-- show the aggregate the chart actually needs:
SELECT category, SUM(amount) FROM transactions
WHERE ds >= :start GROUP BY category ORDER BY 2 DESC LIMIT 50;
-- row-level detail → a paginated drill-down or a
-- "download CSV" that runs an async export job.

Why it wins. A visualization should return what a human can perceive — a few dozen bars, a ranked top-N, a trend line. Massive table widgets pay twice: a large result transfer from the warehouse and an expensive client-side render. Aggregate or top-N for the on-screen view, and route true row-level needs to pagination or an async export.

Impact: result size drops from tens of thousands of rows to dozens; the panel renders instantly.
№ 15

Cache with a TTL tied to the data-freshness SLA

caching · warm-upBI layer

Pinterest — a dashboard fed by an hourly pipeline, but with caching off, so it recomputes continuously to show numbers that only change once an hour.

✗ Anti-pattern — recompute faster than data changes
-- cache disabled / 0s TTL: the dashboard recomputes
-- on every interaction even though the source only
-- lands new data once per hour.
✓ Optimized — TTL = freshness, scheduled warm-up
-- cache TTL set to the pipeline cadence (e.g. 1h);
-- a scheduled "warm-up" runs the heavy queries right
-- AFTER each load so the first human always hits a
-- warm cache. Invalidate on load completion, not by
-- a guessed timer.

Why it wins. There's no value in recomputing a number more often than its inputs change. Setting the cache TTL to the data-freshness SLA, and warming the cache immediately after each pipeline load, means viewers almost always hit a fresh cache and the warehouse runs the heavy query once per load instead of once per click. Best of all is event-driven invalidation keyed to load completion.

Impact: the expensive query runs ~once per data load, not once per interaction; every viewer still sees current numbers.
✦ ✦ ✦
§ Lever D — don't let the chart lie · analytics correctness

A fast dashboard that's wrong is the worst kind.

Optimization isn't only about speed — a pre-aggregated, cached, sub-second panel that shows a confidently wrong number does real damage. These four patterns are the correctness traps that the serving layer specifically introduces.

№ 16

Mean lies on skewed data — show median / percentiles

distributionanalytics

DoorDash — an "average order value" KPI that a handful of catering whales drag 30% above what any typical customer ever spends.

✗ Anti-pattern — a single mean on a long tail
SELECT AVG(order_value) AS "Typical order"
FROM   orders;        -- one $9,000 catering order
                      -- per 1,000 lifts the "average"
                      -- away from reality.
✓ Optimized — median + percentile spread
SELECT APPROX_PERCENTILE(order_value,0.50) AS p50,
       APPROX_PERCENTILE(order_value,0.90) AS p90,
       AVG(order_value)                    AS mean
FROM   orders;        -- show p50 as "typical", and
                      -- the p50-vs-mean gap reveals skew.

Why it wins. On the heavy-tailed distributions that dominate real business data, the mean is pulled toward the whales and misrepresents the typical case. Lead with the median, show a percentile spread, and consider a log scale for whale-heavy charts. The gap between mean and median is itself the skew signal. (Vocabulary on the Skew & Distributions page.)

Impact: the KPI reflects the typical customer instead of being silently governed by the top 0.1%.
№ 17

Time zones & late data — define "today" and don't plot a half-loaded partition

as-of correctnessanalytics

Netflix — a global daily-active chart whose latest bar craters every morning, sparking a false-alarm Slack thread, because today's partition is only partly loaded and "day" is in UTC for a US-centric audience.

✗ Anti-pattern — naive latest day, ambiguous TZ
SELECT ds, COUNT(*) FROM events GROUP BY ds;
-- today's bar is partial (data still arriving) and
-- "ds" is UTC, so the curve dips every morning and
-- is shifted vs the users' local day.
✓ Optimized — complete days, explicit TZ
SELECT DATE(ts AT TIME ZONE 'America/Los_Angeles') AS day,
       COUNT(*)
FROM   events
WHERE  ts < DATE_TRUNC('day', CURRENT_TIMESTAMP)  -- exclude
GROUP BY 1;                                       -- partial today
-- or mark the in-progress day as provisional in the viz.

Why it wins. Two classic dashboard lies: plotting an incomplete current partition as if it were a finished day, and aggregating by an implicit UTC "day" that doesn't match how the business thinks about time. Excluding (or visibly flagging) the in-progress day kills the daily false-drop, and converting to a declared business time zone makes day-over-day comparisons honest.

Impact: no more morning false-alarms; day boundaries match the business, and trends are comparable.
№ 18

Never pre-aggregate a ratio — store numerator and denominator

non-additive metricsanalytics

Robinhood — a rollup stores conversion_rate per day, and a weekly tile averages the seven daily rates, producing a number that's mathematically wrong.

✗ Anti-pattern — average of daily rates
-- agg_daily.conv_rate = conversions/visits per day
SELECT AVG(conv_rate) AS weekly_rate   -- ❌ a day with
FROM   agg_daily                       -- 2 visits counts
WHERE  week = :w;                      -- as much as one
                                       -- with 2,000,000.
✓ Optimized — keep the additive parts
-- store conversions and visits (both additive):
SELECT SUM(conversions) * 1.0 / SUM(visits) AS weekly_rate
FROM   agg_daily
WHERE  week = :w;       -- ratio computed at read time
                        -- from rolled-up components.

Why it wins. Ratios, rates and averages are not additive — you can't sum or average them across grains without weighting. The rule is to pre-aggregate only additive components (numerator and denominator, sums and counts) and compute the ratio at read time. Same family as storing sketches instead of distinct counts (№5): keep the building blocks, derive the metric.

Impact: rates and averages roll up correctly to any grain; the weekly number stops being a meaningless average-of-averages.
№ 19

Label the approximate & sampled panels

disclosure · trustanalytics

Coinbase — finance pulls a "total settled volume" number off a dashboard tile that's quietly powered by APPROX_COUNT_DISTINCT and a 1% sample, and reconciliation later disagrees by 1.5%.

✗ Anti-pattern — approximate dressed as exact
-- tile shows "Settled volume: 14,237,104" but it's
-- really a sampled/approximate estimate. Someone
-- treats it as the book of record.
✓ Optimized — disclose & route exactness
-- label it: "≈ 14.2M (approx, ±2%)"
-- approximate/sampled → exploration & glance KPIs
-- exact, un-sampled query → the path used for
--   billing, finance, compliance and any export.
-- one click from the tile to the exact drill-down.

Why it wins. Approximation (№10) and sampling are the right call for speed — but only if consumers know which numbers are estimates. Visibly labelling approximate/sampled panels and routing finance-grade questions to an exact, un-sampled path preserves both speed and trust. The failure mode isn't the approximation; it's an estimate masquerading as the source of truth.

Impact: fast approximate tiles for everyday use, an audited exact path where it counts, and nobody reconciles against the wrong one.
✦ ✦ ✦
§ The 60-second articulation

How to say it in the interview.

When you're asked to make a slow, expensive dashboard fast, walk the four levers in order — it shows you treat the dashboard as a system, not a pile of SQL:

"First I'd move work out of view-time — does the panel scan raw events, or a rollup? Pre-aggregate, build a one-big-table mart, and put a semantic layer over it so metrics are defined once and aggregate-aware. Then I'd accelerate the serve — materialized views and result cache for the repeated queries, and if it's an interactive product surface, a Druid/Pinot/ClickHouse tier rather than a batch warehouse. Then I'd restrain the dashboard — a bounded default window, partitioning aligned to the filters, lazy tiles, and caching keyed to the freshness SLA. And throughout I'd protect correctness — median not mean on skew, additive components not pre-baked ratios, sketches not distinct counts, and a clear line between approximate and exact."

That's the whole pillar in one breath. Notice how it dovetails with Performance: the rollups and sketches you build here are made cheap by the scan/shuffle/skew techniques there, and both rest on the schemas from Design. Pre-compute, accelerate, restrain, and don't lie — in that order.

The three engineering pillars together → Design is the schema the data lives in. Performance is making one petabyte query over it cheap. Analytics is turning the whole thing into a fast, honest dashboard. Practice the query mechanics in Practice · Q&A and check yourself in Skill Check.

← Performance pillar  ·  Practice · Q&A  ·  ↑ Top