Design the data model behind an enterprise CRM — nested account trees a hundred subsidiaries deep, revenue that must roll up to the parent without double-counting, and every tenant's data sharing the same physical tables. One prompt, two hard problems braided together, and a single decision that keeps a runaway query from reading another company's pipeline. A complete working-through of data flow, schema, the isolation stack, streaming Python, recursive-rollup SQL, and the dashboard that proves the walls hold.
Every data engineer who builds B2B software eventually meets this question. It sounds like a schema for accounts and deals. It is two problems most consumer platforms never face — a recursive hierarchy and absolute tenant isolation — sharing one set of tables.
"Design the data model for a multi-tenant CRM with account hierarchies. JP Morgan has a hundred subsidiaries, each running its own pipeline, but the renewal happens at the parent. Revenue must roll up the tree correctly, and no tenant can ever see another's data. How do you model it?"
The two halves of this question pull in opposite directions, and a senior answer names that tension first. The hierarchy wants flexibility: enterprises re-org constantly, a subsidiary gets sold, a division is folded in, and the tree must bend without a migration. A revenue rollup that does not aggregate up the tree is simply wrong — JP Morgan's number is the sum of its hundred children — but one that aggregates carelessly double-counts a subsidiary that appears under two parents. The multi-tenancy wants rigidity: every Salesforce customer's rows live in the same physical tables, separated only by an org_id, and one query that forgets the filter reads someone else's book of business. That is not a bug you log and fix next sprint — it is an existential, contract-ending breach. Flexibility and rigidity, in the same schema, is the whole assignment.
A weak answer stores current_stage as a column on the opportunity and trusts the application to add the tenant filter. A strong answer notices that current state is a lossy projection of an event stream, and that tenant isolation cannot live in application code where a single forgotten WHERE clause defeats it. So before any boxes and arrows, the working frame for the whole session:
Scope is the first scored dimension, and most candidates skip it. State what you build, what you ignore, and the numbers that shape every later choice. Out of scope here, said explicitly: the CRM application layer and UI, the email/activity sync engine, territory assignment rules beyond the dimension, and the forecasting model itself (treated as a consumer of these facts). In scope: how the account tree is stored so re-orgs are cheap and rollups are fast, how tenant isolation is enforced so it cannot be bypassed, how pipeline state is modeled so velocity is derivable, and how record-sharing cascades up the role hierarchy.
Then the envelope math, volunteered rather than extracted. Salesforce-shaped numbers — a multi-tenant CRM serving every customer from shared infrastructure:
| Quantity | Estimate | Consequence |
|---|---|---|
| Tenants (orgs) | 150,000 | Every table is partitioned by org_id, period |
| Accounts (all tenants) | ~2 B | The tree dimension; recursive, SCD2 on re-org |
| Max hierarchy depth | ~10–15 | Recursive CTE needs a cycle circuit-breaker |
| Cross-tenant reads allowed | 0 | The invariant that shapes the whole architecture |
| Stage changes/day | ~500 M | Append-only event fact; the pipeline spine |
| ARR snapshot rows/day | ~2 B accounts × 1 | Daily, partition-pruned by org; 7-yr retention |
| Largest single hierarchy | ~100 subsidiaries | Rollup must be correct and not double-count |
Notice the row that is a number and not a quantity: zero cross-tenant reads, ever. That single requirement is more architecturally decisive than any volume figure, because it dictates that org_id is not just a column — it is the partition key, the leading index column, and the subject of a security policy the database enforces on every statement. Everything else in this design is dimensional modeling; that row is the reason the modeling has a security stack bolted underneath it. The rest of this article follows the wall and the tree.
One platform, every tenant. The spine of the design is an append-only stage-change stream, partition-keyed by org_id at the source, feeding the pipeline analytics on one side and the hierarchical ARR rollup on the other — with a row-level security policy standing guard on every read.
Three properties of this picture do most of the interview's work. First, there is no current_stage box anywhere — current state is a view over the append-only stream, the latest event per opportunity, so the same table that powers "where is this deal" also powers "how long did it sit in negotiation." Second, the tree is split across two representations on purpose: writes hit a cheap parent_account_id on a single row, while reads hit a precomputed materialized_path, and a nightly job reconciles the second from the first — flexibility on write, speed on read. Third, every edge in the diagram is implicitly filtered by org_id, and the app tier binds the tenant to the session so the row-level security policy can enforce it; the wall is not drawn as a box because it is everywhere.
A stale rollup is recoverable; a leaked row is not. If the nightly path rebuild lags, a subtree query falls back to the recursive CTE over parent_account_id — slower, but correct, and self-healing at the next rebuild. But there is no fallback for a cross-tenant read: it is prevented, never detected-and-fixed, because by the time you detect it the data has already left the building. So the hierarchy degrades gracefully toward correctness, while isolation is made structurally impossible to violate — three independent layers, any one of which alone would stop the breach.
The schema falls out of the two tensions. The account dimension is recursive and hybrid — parent_id plus path. Every table carries a leading org_id. And pipeline truth is an append-only fact of stage changes, never a mutable column.
There are three honest ways to model a parent-child tree, and a senior answer names all three before choosing. Pure parent_id with a recursive CTE: cheap writes, clean schema, but every read recurses and it crawls past depth ten. Materialized path ('/JPM/Chase/Investment/'): a LIKE 'path%' makes subtree queries instant, but a re-org rewrites every descendant's path. Closure table (a row per ancestor-descendant pair): O(1) lookup in either direction, but a re-org cascades a heavy rewrite. Salesforce takes the hybrid: parent_account_id for clean single-row writes plus a materialized_path for fast reads, with a nightly job rebuilding paths from the parent links.
Tenant isolation is not one mechanism; it is three independent ones, because any single layer is one mistake from a breach. Layer one is the partition key: org_id leads every primary key, so a tenant's rows are physically grouped and a per-tenant query cap can throttle a runaway scan without touching the cluster. Layer two is the index: every composite index begins with org_id, so the planner cannot accidentally choose a cross-tenant scan. Layer three is the row-level security policy — the database itself appends the tenant filter to every query, so an engineer with raw SQL access still cannot read another org without going through the API that binds the session variable.
The cardinal modeling sin here is a current_stage column on the opportunity. It is lossy: the moment a deal moves from Negotiation to Closed-Won, the fact that it spent nineteen days in Negotiation is gone, and with it every velocity and conversion metric. Instead, every stage transition is an immutable row. Current stage is the latest event; time-in-stage is the gap to the next event; conversion is a count by stage. One append-only table is the single source of truth for everything pipeline.
And the bridge — record sharing, the many-to-many that powers "share this deal with my manager." Opportunity Y is owned by rep A but visible to manager B and VP C; that visibility is modeled as bridge_account_user_access(org_id, account_id, user_id, permission_level, granted_via), and the rows are generated by a daily job that cascades up the role hierarchy. The granted_via column records whether access came from ownership, a role rule, or a sharing rule — so a permission can be revoked precisely when its source disappears.
The whole correctness of this system lives in one rule about access: no statement reaches a row without proving its tenant. Not by convention, not by a code-review checklist — by a database policy that rewrites every query to carry the filter, so the only way to read a row is to already be inside the right org.
Why this, and not the rollup, is the invariant. A wrong rollup is embarrassing and fixable; a cross-tenant read is the end of the contract and possibly the company. So isolation cannot live where mistakes live — in application code, where a single hand-written report query that forgets WHERE org_id = ? exposes every customer's pipeline. It must live in the engine. The application's only job is to bind the tenant to the session at the start of each request; from there the row-level security policy does the enforcing, on reads and writes alike, for every table in the schema.
Read the chain left to right. The application never writes the tenant filter into its analytics queries — it sets a session variable once, and the policy silently appends AND org_id = current_setting('app.current_org') to every statement the session issues. A developer who writes SELECT sum(arr_usd) FROM fct_account_arr_snapshot with no filter gets only their own tenant's rows, because the database refused to run the query the way it was written. FORCE ROW LEVEL SECURITY means even the table owner is bound; there is no privileged path around the wall short of disabling the policy, which is an auditable, alarmed event.
The defense-in-depth matters precisely because each layer fails differently. RLS could be misconfigured on a new table; the partition key still groups the data so a per-tenant cap contains a runaway scan; the leading index column still steers the planner away from a cross-tenant seq-scan. Three independent mechanisms, and a breach requires all three to fail at once — which is the only acceptable posture for a bug you cannot take back.
Three programs carry the data. The session binder that makes isolation automatic, the nightly path rebuilder that keeps the hybrid tree honest, and the sharing-cascade job that turns the role hierarchy into bridge rows. Each is small; the judgment is in what they refuse to do.
Every database connection a request touches must be bound to exactly one tenant before it issues a single query, and unbound before it returns to the pool. This is the one place tenant safety is the application's responsibility, so it is made impossible to forget: a context manager binds on entry and clears on exit, and any query that runs outside it hits an unbound session and fails closed rather than open.
One carve-out, always stated: cross-tenant platform jobs run under a separate, audited service role, never by loosening the policy on the shared path. The nightly billing aggregator that must read all orgs uses a role whose policy is "all orgs" by design, with its access logged — so the privileged path exists, but it is a deliberate, observable door, not a hole anyone can stumble through.
The hierarchical rollup is where the tree pays off. Two ways to compute it — the recursive CTE for correctness, the materialized path for speed — and the discipline that keeps a subsidiary appearing under two parents from being counted twice.
The recursive CTE is the always-correct path, and the one to show first. Start from the roots, walk down the parent_account_id links accumulating each node's root, then join to the ARR snapshot and sum per root. The circuit-breaker on depth is not optional: a malformed tree with a cycle — a re-org that accidentally made a grandchild its own ancestor — would recurse forever without it, so the depth cap turns a data-quality bug into a bounded, loud failure instead of a hung query.
The materialized-path version is the fast path, and the reason the hybrid model exists. Once the nightly job has stamped '/JPM/Chase/Investment/' onto every node, the entire JP Morgan subtree is a single range scan: WHERE materialized_path LIKE '/JPM/%', no recursion at all. This is what makes a dashboard that rolls up a hundred-subsidiary hierarchy feel instant instead of grinding through ten levels of CTE on every page load. The two paths agree by construction — the path is built from the same parent links the CTE walks — so the fast path is a cache, never a second truth.
The stage-change fact is where the business explains itself. Three queries an interviewer loves, because each one carries a classic pattern on its back — current-state via window, time-in-stage via LEAD, and forecast accuracy as a cohort comparison.
The query that justifies refusing a current_stage column. Current state is simply the most recent stage-change row per opportunity — a ROW_NUMBER partitioned by opp, ordered by time descending, filtered to the first. The same table that stores history answers "where is every deal right now," so there is exactly one source of truth and no risk of a denormalized column drifting from the events.
The metric the whole event model was built to serve. The duration a deal spent in each stage is the gap between its entry and the next stage's entry — a textbook LEAD over the ordered transitions. From this one derivation come average time-in-stage, conversion rates between stages, and the stale-opportunity early warning that flags deals sitting too long.
The query that keeps the forecast honest. Join each closed opportunity's actual outcome to the forecast that was on it, per rep per quarter, and compute bias (predicted minus actual) and MAPE. A rep whose forecast consistently lands above actual is sandbagging; one consistently below is happy-ears. Both are visible the moment you compare the frozen forecast on the stage-change events to the realized close amount.
A senior design ends with observability, because both halves of this problem fail silently. The CRM dashboard watches three things: revenue is rolling up correctly, the pipeline is moving, and — above all — the tenant walls are intact.
Read the tiles together and the dashboard narrates the two-front nature of the problem. On the pipeline side, amber is creeping — time-in-stage and stale opps rising, forecast running 9% hot — the ordinary drift of a sales org that needs a nudge. On the isolation side, the only numbers that matter are at zero or bounded: cycle trips zero, unbound refusals zero, RLS denials present but contained, each one a query the database refused to leak. That asymmetry is the whole design philosophy in one screen — the pipeline is allowed to wobble and self-correct; the wall is not allowed to fail at all, and the telemetry proves it did not.
Strip the CRM details away and the question was testing five judgments, each of which generalizes far beyond Salesforce: