Analytics, Dashboard & Optimization.
The Performance pillar makes one petabyte query cheap. This pillar is the other half of the job: turning that warehouse into a sub-second, hundred-viewer dashboard that doesn't reprice itself every refresh. Nineteen patterns across modelling for the dashboard, accelerating the serve, restraining the dashboard itself, and the analytics-correctness traps that make a chart quietly lie.
Pre-compute it
Rollups, one-big-table marts, aggregate awareness and a semantic layer — so the dashboard reads a small table instead of scanning raw events at view time.
Accelerate it
Materialized views, result cache, extracts and purpose-built OLAP engines (Druid / Pinot / ClickHouse) for interactive latency at high concurrency.
Ask for less
Bounded default windows, filter-aligned partitioning, lazy tiles and caching — the dashboard shouldn't fire thirty "all-time" queries on first paint.
…plus a fourth, non-negotiable concern — don't let the chart lie. Averages over skewed data, non-additive metrics rolled up wrong, and a half-loaded latest partition are the ways a fast dashboard ships a confident, wrong number.
Serving-layer cheatsheet — same need, different stacks
| Need | Snowflake | BigQuery | Databricks / Spark | Real-time OLAP |
|---|---|---|---|---|
| Pre-aggregate | materialized view / table | materialized view | gold table / DLT | Druid / Pinot rollup |
| Instant repeat query | result cache (24h) | cached results | Delta / disk cache | broker cache |
| Sub-second @ concurrency | multi-cluster warehouse | BI Engine | SQL warehouse + Photon | Druid / Pinot / ClickHouse |
| BI extract | — | BI Engine | — | Tableau Hyper · Power BI import |
| Approx distinct | APPROX_COUNT_DISTINCT | APPROX_COUNT_DISTINCT | approx_count_distinct | Theta / HLL sketch |
| Semantic layer | — | — | Unity metric views | dbt / Looker / Cube |
The four levers
Pre-compute everything
Rollup tables, one-big-table marts, aggregate awareness, a semantic layer, and additive metrics that roll up correctly.
Accelerate the serve
Materialized views + result cache, real-time OLAP engines, BI extracts, incremental refresh, and approximate counters.
Ask for less
Bounded windows, filter-aligned partitioning, concurrency caching, lazy tiles, and killing the 50k-row table widget.
Don't lie
Mean vs median on skew, time zones & late data, non-additive ratios, and labelling the approximate panels.
The answer should exist before the question.
The single biggest determinant of dashboard speed isn't the BI tool or the warehouse size — it's whether the panel reads a pre-aggregated table or scans raw events. Every pattern here moves work out of view-time and into a scheduled job that runs once.
Dashboards read rollup tables, never raw events
Spotify — a "streams per day, last 90 days" tile wired straight to the raw play-events firehose, re-scanned on every load by every viewer.
-- the panel's query, run on every page load:
SELECT ds, COUNT(*) AS streams
FROM play_events -- billions of rows/day
WHERE ds >= DATEADD(day,-90,CURRENT_DATE)
GROUP BY ds;-- scheduled once/day: agg_streams_daily (1 row/day/dim)
INSERT INTO agg_streams_daily
SELECT ds, country, COUNT(*) streams, ...
FROM play_events WHERE ds = CURRENT_DATE GROUP BY ds, country;
-- the panel now scans ~90 rows:
SELECT ds, SUM(streams) FROM agg_streams_daily
WHERE ds >= DATEADD(day,-90,CURRENT_DATE) GROUP BY ds;Why it wins. A dashboard is read hundreds of times between data refreshes, so paying the scan once in a scheduled job and serving everyone from a one-row-per-day×dimension table is the highest-leverage move in all of analytics. Keep a small ladder of grains (hourly → daily → monthly) and point each panel at the coarsest one that answers it.
Pre-join a one-big-table mart so the dashboard never joins
Airbnb — a bookings dashboard whose every filter triggers a 6-table star-schema join at query time, multiplied across panels.
SELECT d.market, l.room_type, SUM(f.gbv)
FROM fact_bookings f
JOIN dim_listing l ON l.listing_id = f.listing_id
JOIN dim_market d ON d.market_id = l.market_id
JOIN dim_date dt ON dt.ds = f.ds
JOIN dim_guest g ON g.guest_id = f.guest_id
... GROUP BY 1,2; -- every panel re-runs the joins-- built once in ELT: bookings_obt has the dims'
-- attributes denormalized onto each fact row.
SELECT market, room_type, SUM(gbv)
FROM bookings_obt
WHERE ds BETWEEN :start AND :end
GROUP BY market, room_type;Why it wins. Star schemas are the right storage model, but joins are the most expensive thing a dashboard does repeatedly. Flattening the hot dimensions onto the fact in an ELT step (a "one big table") trades a little storage and refresh cost for join-free reads — and columnar compression makes the duplicated dimension values nearly free. The semantic layer (№4) can still present it as a clean star.
Aggregate awareness — route each query to the smallest table that answers it
Uber — the same metric is asked at city-month, country-week and global-day granularity; one table can't be optimal for all three.
-- every query hits the finest table (trip-level)
-- even when it only needs country-month totals,
-- OR analysts hard-code which rollup to use and
-- the wiring rots as rollups change.# Cube / Looker style: declare rollups; the
# query planner rewrites to the coarsest match.
pre_aggregations:
by_country_month: {measures: [gbv], dimensions: [country],
granularity: month}
by_city_day: {measures: [gbv], dimensions: [city],
granularity: day}
# a country-month question auto-routes to by_country_monthWhy it wins. Aggregate awareness (Looker's aggregate_awareness, Cube pre-aggregations, Mondrian agg tables) lets you define a hierarchy of rollups and have the BI layer automatically rewrite each query to the smallest pre-aggregation that can answer it — transparently falling back to raw for unusual cuts. Analysts write one logical query; the engine routes it.
A semantic layer — define each metric once
Meta — five teams each hand-write "active user," with subtly different filters, and three dashboards disagree by 4% in the same all-hands.
-- dashboard A
COUNT(DISTINCT CASE WHEN events > 0 THEN user_id END)
-- dashboard B (forgot the bot filter)
COUNT(DISTINCT user_id)
-- dashboard C (different session window)
COUNT(DISTINCT CASE WHEN session_min >= 1 THEN user_id END)# semantic_model.yml — defined once, reused everywhere
metrics:
- name: weekly_active_users
label: WAU
calculation: count_distinct(user_id)
filter: "is_bot = false AND events > 0"
# every dashboard references metric('weekly_active_users')Why it wins. A semantic/metrics layer makes the metric definition a single governed object that every tool consumes, so the numbers reconcile by construction. It also centralises the join paths and rollup routing — which means the performance wins of №1–3 are applied once and inherited by every dashboard instead of re-litigated in each.
Store sketches, not counts — so distinct metrics stay roll-up-able
Reddit — a rollup stores daily_distinct_users as an integer, then a PM asks for the monthly distinct and the dashboard "helpfully" sums 30 days of it.
-- agg_daily.dau is an INT count of distinct users
SELECT SUM(dau) AS "MAU" -- ❌ double-counts
FROM agg_daily -- anyone active on
WHERE month = '2024-01'; -- 5 days counts 5×-- agg_daily.dau_hll holds a HyperLogLog sketch
SELECT HLL_ESTIMATE(HLL_COMBINE(dau_hll)) AS mau
FROM agg_daily
WHERE month = '2024-01';
-- sketches MERGE across any window → correct MAU,
-- L7, L28, quarter — all without rescanning raw.Why it wins. Sums, counts and min/max are additive — you can roll them up freely. Distinct counts are not, and pre-aggregating them as integers bakes in a double-counting bug. Storing a HyperLogLog sketch instead keeps the metric mergeable: any time window is a union of sketches, so the rollup stays both correct and cheap. (Same engine as Performance №14, used here to keep the serving layer honest.)
When pre-aggregation isn't enough, change the engine.
Some dashboards are interactive (slice, drill, filter) or high-concurrency (an exec floor at 9am Monday) and even a rollup scan on a warehouse is too slow. These patterns put a faster serving tier in front of the data.
Materialized views + result cache for repeated queries
Salesforce — a heavy aggregate behind a popular tile, recomputed from scratch for every viewer although the underlying data changes hourly.
-- complex GROUP BY over a large base table,
-- executed fresh on every dashboard open even
-- though inputs only change once an hour.CREATE MATERIALIZED VIEW mv_kpi AS
SELECT region, ds, SUM(amount) amt, COUNT(*) n
FROM base GROUP BY region, ds;
-- MV auto-maintains incrementally; identical
-- repeat queries also return from the result
-- cache instantly until the data changes.Why it wins. A materialized view persists the aggregate and (on Snowflake/BigQuery/Redshift) maintains it incrementally as the base changes, so viewers read a small, current result. The warehouse result cache stacks on top: byte-identical repeat queries return with zero compute until the inputs change. Together they absorb the "everyone opens the same dashboard" load.
Put a real-time OLAP engine in front for sub-second @ high concurrency
LinkedIn / Uber — a member-facing "who viewed your profile / trips this week" analytics surface: thousands of concurrent users, <200 ms expected, fresh to the minute. A batch warehouse can't do this.
-- every user request fires a warehouse query;
-- queue depth explodes at concurrency, p99 is
-- seconds, and freshness lags the batch job.-- ingest the stream into Druid/Pinot/ClickHouse:
-- • rollup at ingestion (pre-aggregated segments)
-- • columnar + inverted/bitmap indexes
-- • scatter-gather across data nodes
-- the app queries the OLAP store, not the warehouse:
SELECT dim, SUM(metric) FROM events_realtime
WHERE ts > now() - INTERVAL '7' DAY GROUP BY dim;Why it wins. Druid, Pinot and ClickHouse are built for exactly this: roll-up at ingestion, columnar segments with bitmap/inverted indexes, and scatter-gather execution tuned for many small concurrent aggregations with sub-second p99. They ingest from Kafka for minute-fresh data. It's the standard pattern when a dashboard is really a product surface, not an internal report.
Extract vs live connection — cache the data next to the BI tool
Walmart — a Tableau workbook on a live warehouse connection where every filter click round-trips a fresh query, and 300 analysts do it all day.
-- LIVE connection: each filter/drill = a new
-- warehouse query. Interactive latency is at the
-- mercy of warehouse load; cost scales with clicks.-- Tableau Hyper extract / Power BI Import:
-- • a compressed columnar snapshot lives with the
-- BI engine; interactions hit RAM, not the WH
-- • scheduled refresh keeps it current
-- • filter to the needed rows/cols at extract time
-- use LIVE/DirectQuery only when true real-time
-- freshness is the requirement.Why it wins. A Tableau Hyper extract or Power BI import is a purpose-built columnar cache sitting next to the BI engine — interactions resolve in memory instead of round-tripping to the warehouse, which is both faster and dramatically cheaper. Reserve live/DirectQuery for genuinely real-time needs, and even then back it with aggregate awareness. Extract only the fields and grain the workbook uses (never SELECT *).
Refresh extracts and MVs incrementally, not full-rebuild
Datadog — a 2-year extract behind a usage dashboard, fully rebuilt every hour because "refresh" was left on the default.
-- hourly job re-reads and re-loads 730 days of
-- data to pick up the last hour of changes.-- Power BI incremental refresh policy:
-- archive > 2 years, refresh last 3 days
-- dbt incremental model:
{{ config(materialized='incremental') }}
SELECT ... FROM events
{% if is_incremental() %}
WHERE ds > (SELECT MAX(ds) FROM {{ this }})
{% endif %}Why it wins. The same "compute once" discipline from the Performance pillar (№23–24) applied to the serving layer: partition the extract/MV by date and refresh only the recent, mutable window — archiving the stable history. Refresh cost scales with new data, not total history, so the hourly job stays flat as the dataset grows.
Approximate counters for the headline tiles
TikTok — a real-time "unique viewers" big-number tile recomputing an exact COUNT(DISTINCT) over the live firehose every few seconds.
SELECT COUNT(DISTINCT viewer_id) AS unique_viewers
FROM live_views; -- exact, expensive, and
-- nobody reads the last 3 digits
-- of a 14,237,1•• counter.SELECT APPROX_COUNT_DISTINCT(viewer_id) AS unique_viewers
FROM live_views; -- ~1–2% error, a fraction
-- of the cost; exact reserved
-- for billing/export drill-downs.Why it wins. A headline counter is read at a glance — 14.2M vs 14,237,104 changes no decision, so paying for exactness is pure waste. Approximate distinct/percentile (HLL, t-digest) give the number in a fraction of the cost and memory. Keep exact computation for the places that legally require it — billing, finance, compliance exports.
The cheapest query is the one the dashboard never fires.
A lot of dashboard cost is self-inflicted by the dashboard's own defaults — unbounded windows, filters that don't prune, and a wall of panels that all fire on first paint. These patterns are about the dashboard asking the warehouse for less in the first place.
Bounded default window + lazy-loaded tiles
GitHub — a 30-panel dashboard that defaults to "all time" and fires all 30 queries the instant it opens, including tiles below the fold nobody scrolls to.
-- default range: since the beginning of time
WHERE ds >= '2015-01-01'
-- and 30 panels issue their queries on page load,
-- 25 of them never scrolled into view.-- default to the window people actually look at:
WHERE ds >= DATEADD(day,-28,CURRENT_DATE)
-- render above-the-fold tiles first; defer the
-- rest until scrolled/expanded; let users opt in
-- to longer ranges explicitly.Why it wins. Most dashboard views only need a recent window, and most panels are never looked at in a given session. A sensible bounded default plus lazy tile loading turns a 30-query thundering herd on every open into a handful of small, recent-window queries — which also prune partitions cleanly. Longer ranges become an explicit, infrequent choice.
Align partition & cluster keys to the dashboard's filters
Stripe — every dashboard filters by merchant_id and date, but the serving table is partitioned only by date, so the merchant filter scans every file in the range.
-- table partitioned by ds only; merchant scattered.
SELECT ... FROM payments_mart
WHERE ds >= :start AND merchant_id = :m;
-- merchant_id = :m touches every file in the date
-- range — no skipping on the most-used filter.-- Snowflake: CLUSTER BY (ds, merchant_id)
-- BigQuery: PARTITION BY ds CLUSTER BY merchant_id
-- Delta: OPTIMIZE ... ZORDER BY (merchant_id)
-- now merchant_id prunes via min/max stats and the
-- dashboard's most common filter skips most files.Why it wins. The serving table should be physically organised around how the dashboard actually filters. Partition by the coarse time dimension, cluster/Z-order by the high-selectivity filter columns (merchant, account, country), and every interactive filter prunes instead of scans. This is Performance №26 applied with the dashboard's WHERE clause as the design input.
Survive concurrency with caching + elastic warehouses
Atlassian — Monday 9am, 500 people open the same exec dashboard in five minutes; the warehouse queues and everyone watches spinners.
-- single-cluster warehouse; 500 concurrent runs of
-- the same uncached aggregate queue behind each
-- other → p99 measured in minutes.-- 1) MV + result cache so identical queries don't
-- recompute (see №6) — most of the 500 are cache hits
-- 2) multi-cluster / autoscaling warehouse for the
-- concurrent misses:
ALTER WAREHOUSE bi SET MIN_CLUSTER_COUNT=1
MAX_CLUSTER_COUNT=10 SCALING_POLICY='STANDARD';
-- 3) BigQuery: a BI Engine reservation for the dash.Why it wins. High-concurrency spikes are a different problem from slow queries — the fix is to (a) make most requests cache hits so they never touch compute, and (b) let the remainder fan out across auto-added clusters instead of queueing. The combination handles the Monday-morning herd without permanently over-provisioning a giant warehouse that sits idle the rest of the week.
Don't render a 50,000-row table widget
Shopify — a "detail" tab is a raw table widget returning every transaction, so the warehouse ships 50k rows and the browser chokes rendering them.
SELECT * FROM transactions
WHERE ds >= :start; -- 50k+ rows to a table viz
-- huge result transfer, slow render, and no human
-- reads a 50,000-row on-screen table anyway.-- show the aggregate the chart actually needs:
SELECT category, SUM(amount) FROM transactions
WHERE ds >= :start GROUP BY category ORDER BY 2 DESC LIMIT 50;
-- row-level detail → a paginated drill-down or a
-- "download CSV" that runs an async export job.Why it wins. A visualization should return what a human can perceive — a few dozen bars, a ranked top-N, a trend line. Massive table widgets pay twice: a large result transfer from the warehouse and an expensive client-side render. Aggregate or top-N for the on-screen view, and route true row-level needs to pagination or an async export.
Cache with a TTL tied to the data-freshness SLA
Pinterest — a dashboard fed by an hourly pipeline, but with caching off, so it recomputes continuously to show numbers that only change once an hour.
-- cache disabled / 0s TTL: the dashboard recomputes
-- on every interaction even though the source only
-- lands new data once per hour.-- cache TTL set to the pipeline cadence (e.g. 1h);
-- a scheduled "warm-up" runs the heavy queries right
-- AFTER each load so the first human always hits a
-- warm cache. Invalidate on load completion, not by
-- a guessed timer.Why it wins. There's no value in recomputing a number more often than its inputs change. Setting the cache TTL to the data-freshness SLA, and warming the cache immediately after each pipeline load, means viewers almost always hit a fresh cache and the warehouse runs the heavy query once per load instead of once per click. Best of all is event-driven invalidation keyed to load completion.
A fast dashboard that's wrong is the worst kind.
Optimization isn't only about speed — a pre-aggregated, cached, sub-second panel that shows a confidently wrong number does real damage. These four patterns are the correctness traps that the serving layer specifically introduces.
Mean lies on skewed data — show median / percentiles
DoorDash — an "average order value" KPI that a handful of catering whales drag 30% above what any typical customer ever spends.
SELECT AVG(order_value) AS "Typical order"
FROM orders; -- one $9,000 catering order
-- per 1,000 lifts the "average"
-- away from reality.SELECT APPROX_PERCENTILE(order_value,0.50) AS p50,
APPROX_PERCENTILE(order_value,0.90) AS p90,
AVG(order_value) AS mean
FROM orders; -- show p50 as "typical", and
-- the p50-vs-mean gap reveals skew.Why it wins. On the heavy-tailed distributions that dominate real business data, the mean is pulled toward the whales and misrepresents the typical case. Lead with the median, show a percentile spread, and consider a log scale for whale-heavy charts. The gap between mean and median is itself the skew signal. (Vocabulary on the Skew & Distributions page.)
Time zones & late data — define "today" and don't plot a half-loaded partition
Netflix — a global daily-active chart whose latest bar craters every morning, sparking a false-alarm Slack thread, because today's partition is only partly loaded and "day" is in UTC for a US-centric audience.
SELECT ds, COUNT(*) FROM events GROUP BY ds;
-- today's bar is partial (data still arriving) and
-- "ds" is UTC, so the curve dips every morning and
-- is shifted vs the users' local day.SELECT DATE(ts AT TIME ZONE 'America/Los_Angeles') AS day,
COUNT(*)
FROM events
WHERE ts < DATE_TRUNC('day', CURRENT_TIMESTAMP) -- exclude
GROUP BY 1; -- partial today
-- or mark the in-progress day as provisional in the viz.Why it wins. Two classic dashboard lies: plotting an incomplete current partition as if it were a finished day, and aggregating by an implicit UTC "day" that doesn't match how the business thinks about time. Excluding (or visibly flagging) the in-progress day kills the daily false-drop, and converting to a declared business time zone makes day-over-day comparisons honest.
Never pre-aggregate a ratio — store numerator and denominator
Robinhood — a rollup stores conversion_rate per day, and a weekly tile averages the seven daily rates, producing a number that's mathematically wrong.
-- agg_daily.conv_rate = conversions/visits per day
SELECT AVG(conv_rate) AS weekly_rate -- ❌ a day with
FROM agg_daily -- 2 visits counts
WHERE week = :w; -- as much as one
-- with 2,000,000.-- store conversions and visits (both additive):
SELECT SUM(conversions) * 1.0 / SUM(visits) AS weekly_rate
FROM agg_daily
WHERE week = :w; -- ratio computed at read time
-- from rolled-up components.Why it wins. Ratios, rates and averages are not additive — you can't sum or average them across grains without weighting. The rule is to pre-aggregate only additive components (numerator and denominator, sums and counts) and compute the ratio at read time. Same family as storing sketches instead of distinct counts (№5): keep the building blocks, derive the metric.
Label the approximate & sampled panels
Coinbase — finance pulls a "total settled volume" number off a dashboard tile that's quietly powered by APPROX_COUNT_DISTINCT and a 1% sample, and reconciliation later disagrees by 1.5%.
-- tile shows "Settled volume: 14,237,104" but it's
-- really a sampled/approximate estimate. Someone
-- treats it as the book of record.-- label it: "≈ 14.2M (approx, ±2%)"
-- approximate/sampled → exploration & glance KPIs
-- exact, un-sampled query → the path used for
-- billing, finance, compliance and any export.
-- one click from the tile to the exact drill-down.Why it wins. Approximation (№10) and sampling are the right call for speed — but only if consumers know which numbers are estimates. Visibly labelling approximate/sampled panels and routing finance-grade questions to an exact, un-sampled path preserves both speed and trust. The failure mode isn't the approximation; it's an estimate masquerading as the source of truth.
How to say it in the interview.
When you're asked to make a slow, expensive dashboard fast, walk the four levers in order — it shows you treat the dashboard as a system, not a pile of SQL:
That's the whole pillar in one breath. Notice how it dovetails with Performance: the rollups and sketches you build here are made cheap by the scan/shuffle/skew techniques there, and both rest on the schemas from Design. Pre-compute, accelerate, restrain, and don't lie — in that order.