Est. 2026Philosophy · Technology · WisdomLinkedIn ↗

PaddySpeaks

Where ancient wisdom meets the architecture of tomorrow

← All Articles
technology · data engineering

Data Modeling for the Modern Warehouse

A practitioner's field guide — slowly changing dimensions, the three fact-table grains, factless facts, One Big Table, Data Vault, change data capture, the medallion architecture, and what housekeeping each cloud warehouse actually needs.

Every data platform is, underneath the dashboards and the dbt graphs, a set of modeling decisions. How do you store an attribute that changes? What is the grain of this table? Should this be one wide table or a star? When the warehouse slows down, what do you actually run to fix it? These questions are old — Ralph Kimball was answering most of them in the 1990s — but the cloud warehouse reshuffled the trade-offs, and a lot of working knowledge is now a confusing mix of timeless principle and obsolete advice.

This is a field guide to the parts that still matter. It is organized in four parts: the grammar of dimensional modeling, how to choose a modeling paradigm, how to load the warehouse, and how to keep it fast. It is a companion to the interview question bank, which has a runnable Data Modeling set drilling every concept below.

The one idea to keep: a data model is a contract about grain (what one row means), history (what happens when a value changes), and cost (what each query and each load has to scan). Almost every term below is just a named answer to one of those three questions.

✦ ✦ ✦
Part I

The Grammar of Dimensional Modeling

Facts, dimensions, grain, and the patterns that have a name because you will need them.

1. Facts, Dimensions, and the Star

Dimensional modeling splits the world into two kinds of table. Fact tables hold the measurements of a business process — an order line, a payment, a sensor reading — and are mostly foreign keys plus numbers. Dimension tables hold the descriptive context you filter and group by — the customer, the product, the date, the store. A fact surrounded by its dimensions is a star schema, and it is still the default shape for analytics because it matches how people ask questions: a measure, sliced by some descriptors.

A snowflake schema normalizes those dimensions into sub-dimensions — product points to category points to department, each its own table. It removes redundancy and keeps a shared hierarchy in one place, at the cost of more joins per query. On a columnar warehouse the redundancy of a flat dimension is cheap and the join savings are real, so default to the star; reach for snowflaking only when a dimension is enormous or a sub-hierarchy is genuinely shared across many dimensions.

When several fact tables share the same dimensions — sales and returns both joined to the same date, product and store — you have a fact constellation (or galaxy schema). The rule there: never join two fact tables directly. Their grains differ and the join fans out. Instead drill across — aggregate each fact to the common grain separately, then full-outer-join the results on the shared keys.

2. The Three Fact-Table Grains

The first decision for any fact table is its grain: what exactly does one row represent? Get it wrong and every query downstream is fighting the table. There are three classic grains.

GrainOne row perAdditivityUse case
Transactionevent / transactionFully additiveOrder lines, payments, clicks — the atomic source of truth
Periodic snapshotentity, per periodSemi-additiveDaily account balance, inventory on hand, subscriber count
Accumulating snapshotprocess instanceMixed; row is updatedAn order moving through ordered → shipped → delivered

The transaction grain is the most flexible and the most atomic — one row each time something happens, every measure fully additive. The periodic snapshot captures state rather than events: it has a row for every entity every period, even when nothing happened, because a balance still exists on a quiet day. Its measures are semi-additive — a balance sums across customers but not across time. The accumulating snapshot has one row per process instance that is updated in place as the instance hits milestones; it is built for pipeline and cycle-time analysis, where a single row read answers "how long did this order take to ship?"

A mature warehouse uses all three. The transaction fact is the source of truth; snapshots are derived from it for the trend and lag questions that are expensive to recompute from raw events every time.

3. Factless Fact Tables

A factless fact table has no numeric measures at all — it is pure foreign keys. The existence of the row is the fact, and COUNT(*) is the measure. They come in two flavors, and the second is the clever one.

The first flavor is event tracking: a row each time a tracked thing happens. On Netflix, a row in fact_title_impression every time a title is shown in a profile's UI, and a row in fact_view_start every time a playback begins. Neither table has a "measure" — but divide one count by the other and you have click-through rate. The model is nothing but keys, and that is the point.

The second flavor is coverage, and it exists to answer the negative question — what didn't happen. Consider Uber: a rider opens the app and requests a ride, and sometimes no driver is near. An event-factless fact_ride_request records the demand. A coverage-factless fact_driver_availability records which drivers were online in which geo and time bucket — what supply could have served a request. Anti-join demand against coverage and you have measured unmet demand and supply gaps, without a single additive column anywhere in the model.

If you ever find yourself inventing a fake count_flag = 1 column so a table "has a measure," stop — you have a factless fact table, and COUNT(*) is the honest measure.

4. Slowly Changing Dimensions (Type 0 through 6)

A dimension attribute can change in the source — a customer moves, a product is recategorized, an account is upgraded. Slowly Changing Dimension (SCD) types are the named strategies for that change. The single most common mistake is treating SCD as a property of a table. It is not — it is a decision made per attribute, and one physical dimension routinely mixes several types.

TypeStrategyUse when
0Retain the original value; never change itThe first-seen value is itself a fact — date of birth, account open date, original acquisition channel
1Overwrite in place; keep no historyOnly the current value matters — a corrected spelling, a current phone number
2Add a new versioned row; expire the old onePast facts must be explained by the value as it was then
3Add a column for the prior valueYou need exactly one before/after — a single territory realignment
4Current dimension + history table / mini-dimensionAn attribute changes too fast to version inline
6Hybrid 1 + 2 + 3 — versioned rows plus a current-value columnReports need both the historical and the current view at once

Type 2 is the workhorse and deserves a closer look, because it is where most real loads live. Each dimension row gets a surrogate key, a natural key, an effective-from / effective-to window and an is_current flag. The load hashes the tracked attributes into a row_hash; for each incoming record it compares that hash to the current version. A new natural key inserts a first version; a changed hash expires the old row and inserts a new one; an unchanged hash is a no-op. Always expire before you insert, and always load from a deduplicated staging table.

The reason a surrogate key is non-negotiable for Type 2 is structural: one natural key now maps to many rows — one per version — so only a meaningless generated key can uniquely identify a row. A fact joined to that surrogate is pinned to a specific version of the customer, which is the whole point of keeping history.

So when an interviewer asks "what SCD type is this dimension?" the strong answer reframes the question. A streaming service's customer dimension might hold date of birth (Type 0 — immutable), email (Type 1 — only the current address matters, and old ones are PII you must purge), subscription tier (Type 2 — revenue must be attributed to the tier in effect at the time), and a nightly-recomputed taste segment (Type 4 mini-dimension — versioning it inline every night would explode the table).

5. Dimension Patterns Worth Naming

A handful of dimension patterns recur often enough to have names. Knowing the names is half of using them.

Conformed dimension

One dimension defined once and shared identically across many fact tables. It is the only way to compare measures across business processes — and the backbone of an enterprise model. The bus matrix plans them.

Junk dimension

Several low-cardinality flags (is_gift, ship_priority, channel) collapsed into one small dimension of valid combinations, so the fact table carries one key instead of a dozen loose columns.

Degenerate dimension

A dimension key that lives in the fact with no dimension table behind it — an order number, an invoice number. You group by it; there is nothing to look up.

Role-playing dimension

One physical dimension referenced several times by one fact in different roles — a single date dimension used as order date, ship date and delivery date, exposed through per-role views.

Bridge table

Resolves a many-to-many — an account with several owners, a visit with several diagnoses. Carry an allocation weight on the bridge so measures don't double-count when summed across it.

Mini-dimension

Fast-changing attributes pulled out of a large dimension into their own small one, so the main dimension stops churning. The Type 4 answer to volatile attributes.

6. Surrogate Keys vs Natural Keys

A natural (business) key is the identifier the source system uses — a customer ID, a SKU. A surrogate key is a meaningless integer the warehouse mints itself. Warehouses join on surrogates for several reasons at once: they make Type 2 history possible (one natural key, many versioned rows, each needing its own key); they decouple the warehouse from sources that renumber, merge or recycle IDs; they unify dimensions assembled from multiple sources; they keep joins to narrow integers; and they let you reserve special members like a -1 "Unknown" so facts never carry a NULL foreign key. Keep the natural key as an attribute on the dimension — you still need it for lookups, lineage, and the Type 2 change-detection join.

A modern variant is the hash key — a deterministic hash of the business key used as the key itself. It can be computed in parallel on any node with no central sequence, which is why it is the standard key in Data Vault. Its sibling, the hash diff — a hash of all descriptive attributes — is the cheap change-detection engine behind both SCD Type 2 and Data Vault satellites: if the hash diff moved, the row changed.

✦ ✦ ✦
Part II

Choosing a Modeling Paradigm

Kimball, Inmon, Data Vault, One Big Table — four answers, and when each is right.

7. Kimball vs Inmon

The two classic warehouse philosophies differ in where integration happens. Inmon is top-down: build a normalized, third-normal-form enterprise data warehouse first as the single integrated source of truth, then derive dimensional data marts from it. Integration is strong and up front; time-to-first-value is slow. Kimball is bottom-up: build dimensional marts per business process first, and integrate them through conformed dimensions and the bus matrix. Value arrives fast; integration depends on the discipline of conforming dimensions.

The modern lakehouse quietly blends both, and the medallion architecture (Part III) is the blend made concrete: a cleaned, integrated, lightly-normalized core — the Inmon idea — feeding dimensional stars and wide tables in a serving layer — the Kimball idea. The argument is mostly over.

8. Data Vault

Data Vault is a modeling style for the raw integration layer, built for warehouses that ingest many volatile sources and need full auditability. It splits every model into three rigid block types:

  • Hubs — the list of unique business keys for a core concept (customer, product). Just keys, no context.
  • Links — the relationships and transactions connecting hubs, always modeled many-to-many.
  • Satellites — the descriptive, time-variant attributes hanging off a hub or link, with full history by hash diff.

The payoff is that business keys are stable while relationships and attributes are not — so isolating each means a new source or a new attribute is an additive change (a new satellite or link), never a restructure of existing tables. Loads are insert-only and fully parallel. The cost is verbosity: querying a Vault directly means joining through hubs, links and satellites, so you still build Kimball stars on top for consumption. Choose Data Vault as a raw/integration layer when you are merging many changing sources under audit requirements; a small, stable, single-source warehouse should skip it and go straight to a star.

9. One Big Table — and When Not to Use It

One Big Table (OBT) is the flat alternative to the star: a single wide, fully denormalized table — the fact plus every dimension attribute flattened in, one row per event, often hundreds of columns. It became popular for concrete reasons. Columnar storage means the warehouse scans only the columns a query touches, so three hundred unused columns cost nothing. Joins are the expensive operation on a distributed engine, and OBT removes them from the read path entirely. Analysts and BI tools query one table with no join keys to learn. And dbt makes assembling the wide table cheap.

But OBT is a serving-layer optimization, not a modeling foundation, and the trade-offs are real.

One Big Table — strengths

  • Zero joins on read — fast, and simple for analysts
  • Self-contained; trivial to share or export
  • Cheap to scan on columnar storage
  • An obvious fit for the gold / serving layer

One Big Table — what it gives up

  • Massive redundancy — every dimension attribute repeated per row
  • A dimension change means rewriting the whole table
  • No clean SCD2 — history would multiply every fact row
  • No conformance — each OBT re-derives its own dimensions and they drift

So keep a dimensional model — and serve from OBTs built on top of it — whenever dimensions have history, when many facts share dimensions and conformance matters, when the same dimension is needed at different grains, or when duplicating wide dimension text across billions of rows is real money. Model in a star; serve with OBTs where read speed and analyst simplicity justify the duplication.

10. Normalization, and When to Denormalize

Normalization removes update, insert and delete anomalies by making every non-key column depend on "the key, the whole key, and nothing but the key." 1NF demands atomic values — no repeating groups. 2NF removes partial dependencies on a composite key. 3NF removes transitive dependencies, where a non-key column depends on another non-key column. BCNF tightens 3NF so every determinant is a candidate key.

If normalization is "correct," why do warehouses denormalize on purpose? Because normalization and denormalization optimize for opposite workloads. A normalized (3NF/BCNF) schema is right for OLTP — write-heavy transactional systems where each fact is stored once, so updates are cheap and structurally cannot create anomalies. A denormalized schema — star dimensions, OBTs — is right for OLAP, where reads dominate and writes are controlled batch loads. The update anomalies normalization protects against simply cannot occur when the load pipeline is the single writer. Denormalization is not a sin; it is the correct answer to a read-optimized, batch-written problem.

✦ ✦ ✦
Part III

Loading the Warehouse

Change capture, late data, staging, layers, and pre-computed results.

11. Change Data Capture and Incremental Loading

Keeping a warehouse in sync with a source means capturing what changed. There are four approaches, in rough order of fidelity: log-based CDC reads the database transaction log (WAL, binlog, redo) and catches every insert, update and delete in order, near real time, with near-zero load on the source — the gold standard. Trigger-based CDC writes changes to a shadow table via triggers; it catches deletes but taxes every source write. Timestamp/query-based CDC polls WHERE updated_at > last_run — simple, but it misses deletes and any update that fails to bump the timestamp. Snapshot diff extracts the whole table and diffs it — trivially correct, but only viable for small tables.

Whatever the capture method, the delta is applied with an idempotent MERGE keyed on the business key — matched-and-deleted, matched-and-changed, not-matched-and-inserted. Idempotency matters because pipelines re-run constantly: retries, catch-up, replays. A MERGE converges to the same state no matter how many times the same delta is applied; a blind INSERT duplicates on every retry. The choice between a full refresh and an incremental load is really a choice of failure mode: full refresh is simple and self-healing but fails loud when it runs long; incremental scales with change but fails quiet, drifting silently from the source. Make incremental loads safe with an overlapping watermark window and periodic full-refresh reconciliation.

12. Late-Arriving Data and Idempotency

Real pipelines do not receive data in order. A fact can arrive referencing a dimension member the dimension load has not seen — a late-arriving dimension. Do not drop the fact and do not give it a NULL key; create an inferred member: insert a placeholder dimension row now, with a minted surrogate key, the natural key, all attributes "Unknown" and an is_inferred flag. The fact loads cleanly, and when the real record arrives the placeholder is updated in place. An inferred member beats a generic "Unknown" because it preserves the distinct identity, so the fact reconnects to the real member later.

The harder case is a late-arriving fact against a Type 2 dimension: you must attach the dimension version that was in effect on the event date, not today's version — resolve the surrogate key by the event timestamp against the effective-from / effective-to window. And any aggregate already published for that date is now stale and must be re-derived, which is why fact tables should be partitioned by event date, not load date.

All of this rests on idempotency: running a job, or one partition of it, any number of times must produce the same result as running it once. Achieve it with partition overwrite instead of append, MERGE instead of blind insert, and deterministic transforms — no now(), no random(), the logical date passed in as a parameter. A backfill is then just "re-run these partitions," and it is safe to run alongside the daily job.

13. Temp Tables and the Staging Pattern

Inside a loading job you constantly need intermediate results, and there is a small decision tree. A CTE or subquery is logical only — nothing is stored, and in many engines a CTE referenced several times is evaluated several times; it is for the readability of a single statement. A temporary table is physically materialized and session-scoped — computed once, reusable across many statements, and indexable, so downstream joins plan well. A materialized view is persisted across sessions for results reused by many users over time. The heuristic: reused once, one statement — CTE; reused many times within one job — temp table; reused across jobs and worth maintaining — materialized view.

Temp and staging tables are also why robust pipelines do not transform straight into the target. The pattern is land → stage → transform → merge: land the raw extract into staging, validate and deduplicate there, transform into the target shape, and only then merge into the target as the last fast step. Staging gives you atomicity (the target stays consistent until the final merge), idempotency (truncate and re-run freely), and a debuggable audit trail of the batch.

14. The Medallion Architecture

The medallion architecture organizes a lakehouse into three layers, each a durable, queryable checkpoint. Bronze is raw ingestion — source data landed as-is, append-only, plus ingestion metadata; its purpose is to be an immutable, replayable record of what arrived. Silver is cleansed and conformed — deduplicated, typed, validated, business keys resolved, integrated across sources; the single source of truth. Gold is consumption-ready — dimensional stars, OBTs, aggregates and feature tables shaped for specific use cases.

The layers are quality contracts, not just folders, and the common mistakes all come from blurring them: business logic in bronze (filtering on ingest, which destroys replayability), reporting straight off bronze (every dashboard re-inventing cleansing), or cleansing logic copied into many gold models where it drifts. The rule is simple — bronze keeps everything and changes nothing, silver does all the cleansing exactly once, gold only reshapes for consumption.

15. Materialized Views

A plain view is a stored query — zero storage, always fresh, full cost paid on every read. A materialized view is a stored result, refreshed on a schedule or incrementally — fast reads, at the cost of storage, refresh compute and some staleness. A hand-built summary table trades that automation for total control over load logic and late-data handling.

The detail that decides whether a materialized view scales is incremental refresh — updating only the rows affected by base-table changes, versus recomputing the whole result. It varies sharply by engine: Snowflake refreshes materialized views automatically and incrementally but restricts them to a single table (use Dynamic Tables for joins and aggregates); BigQuery refreshes incrementally and will auto-rewrite base queries onto the view; Redshift supports incremental refresh for many query shapes; PostgreSQL has no incremental refresh at all — REFRESH MATERIALIZED VIEW recomputes everything. Always check whether your engine, and your query shape, qualify for incremental refresh — if not, a hand-built summary table with explicit delta logic will scale better.

✦ ✦ ✦
Part IV

Keeping It Fast: Storage Internals & Maintenance

"System cleansing" — what each warehouse needs, and why three of the five need no vacuum at all.

Every warehouse degrades in some way under deletes, unsorted inserts and small files. What "maintenance" means depends entirely on the storage engine — and a striking fact of the modern stack is that three of the five systems below have no VACUUM at all, because their storage is immutable and the housekeeping is automatic.

SystemStorage modelWhat you actually maintain
RedshiftMutable columnar, MPP slicesVACUUM + ANALYZE; sort & distribution keys
SnowflakeImmutable micro-partitionsNothing routine — optional clustering keys; retention
BigQueryManaged columnar, auto-optimizedPartition & cluster keys (design, not commands)
Azure SynapseMutable columnstore, MPPStatistics; columnstore index rebuilds; distribution
Delta LakeFiles + transaction logOPTIMIZE, Z-order, VACUUM

Redshift — VACUUM and ANALYZE

Redshift is the system the word "vacuuming" comes from. It is a columnar MPP database with mutable storage, so deleted rows are only marked deleted and new inserts land out of sort order. VACUUM reclaims the space from deleted rows and re-sorts rows back into sort-key order; ANALYZE refreshes the table statistics the query planner depends on. Modern Redshift auto-vacuums in the background, but a large backfill or bulk delete still warrants a manual VACUUM. The two design knobs that decide whether the table is fast in the first place are the sort key (the on-disk order, which lets the engine skip blocks for range-filtered columns) and the distribution style (KEY to co-locate rows that join, ALL to replicate small dimensions) — a wrong distribution key is the classic cause of skew and slow joins.

Snowflake — Micro-partitions, No VACUUM

Snowflake has no VACUUM because its storage is immutable. Tables are stored as micro-partitions — small, columnar, compressed files that are never updated in place; every write creates new ones. There is nothing to reclaim and nothing to re-sort. The engine prunes queries using per-partition min/max metadata, so good natural clustering means good pruning. For very large tables whose order has decayed, you can define a clustering key and Snowflake re-clusters in the background — a billed service most tables never need. The flip side of immutable storage is Time Travel (query or restore data as of up to 90 days ago) and Fail-safe (a further recovery window) — both retain old micro-partitions, so the real lever is the data-retention setting, not a cleanup command.

BigQuery — Partition, Cluster, No VACUUM

BigQuery is fully managed and continuously optimizes its storage files in the background — again, no VACUUM, no compaction command. Because you pay per byte scanned, performance and cost are governed by table design. Partitioning by a date or integer column lets a filter prune whole partitions; clustering sorts data within each partition by up to four columns, and BigQuery re-clusters automatically and for free. Round it out with partition expiration to auto-delete old data and require_partition_filter to block accidental full-table scans. The maintenance here is entirely up-front design.

Azure Synapse — Statistics and Columnstore

A Synapse dedicated SQL pool is an MPP warehouse spreading data across 60 distributions, and it needs deliberate maintenance. The single biggest plan-quality lever is statistics — Synapse does not always keep them current, so CREATE / UPDATE STATISTICS after large loads is essential. Clustered columnstore tables accumulate small "delta store" rowgroups on trickle inserts, so index rebuilds are needed to consolidate them to full size. And like Redshift, distribution choice — HASH, ROUND_ROBIN, REPLICATE — decides whether large joins shuffle data across the network.

Delta Lake — OPTIMIZE, Z-order, VACUUM

A lakehouse table — Delta Lake, Iceberg, Hudi — is many data files on object storage plus a transaction log, and its characteristic ailment is the small-file problem: streaming and frequent micro-batch writes produce thousands of tiny files, and file-open and planning overhead come to dominate scans. The fix is OPTIMIZE — compaction that rewrites many small files into fewer large ones — optionally with Z-ordering to co-locate related data so filters skip more files. Delta Lake also has its own VACUUM, which deletes data files no longer referenced by the log and older than a retention window; note that running it limits how far back Time Travel can reach. A lakehouse table, in other words, is not fire-and-forget — compaction and vacuum are its version of the housekeeping Redshift made famous.

✦ ✦ ✦

The Short Version

Data modeling looks like a vocabulary test — SCD2, factless fact, accumulating snapshot, micro-partition — but every term is an answer to one of three questions. What does one row mean? is grain: the three fact grains, the factless fact, the bridge table. What happens when a value changes? is history: the SCD types, CDC, late-arriving data, idempotency. What does each query and load have to scan? is cost: star versus snowflake versus OBT, normalization, partitioning, and the maintenance each engine needs.

Get those three right and the rest is detail. Get them wrong and no amount of compute will save the platform. To drill any of this hands-on, the interview question bank has a runnable Data Modeling set — every concept above, as an interview question with a worked answer.

Share