Your company just bought a twenty-billion-dollar SaaS. Two CRMs, two warehouses, two ID systems, two fiscal calendars — and an executive expecting unified KPIs on day ninety. The one decision that separates a senior answer: every merge must be reversible. A complete working-through — data flow, the golden-record schema, the matching playbook, the strangler dual-write, the reconciliation SQL, and the cutover dashboard.
Every acquisition press release says "stronger together." Every acquirer's data team hears "reconcile two of everything before the next board meeting." The senior signal is refusing to ship a single integration plan, and naming the four orthogonal problems first.
"Your company just acquired a $20B SaaS business. Walk me through the data integration — how do you get to unified, trustworthy numbers without freezing either company's product?"
The mistake that fails the question is treating this as one migration. It is four, and they have different owners, different timelines, and different failure modes. Entity resolution asks whether their Acme Corp is our Acme Corp — a graph problem dressed as a join. Schema unification has to happen while both applications keep shipping, because you cannot freeze a database that is still taking customer writes. Warehouse consolidation is a vendor decision — Snowflake versus Databricks versus BigQuery — that can be deferred behind a bridge for six months. And multi-org metrics is the quiet killer: their DAU is not your DAU, their fiscal calendar is not your fiscal calendar, and the board deck on day ninety will be subtly, defensibly wrong unless the definitions are versioned. A junior answer reaches for a big-bang migration and a single "combined" view. A senior answer sequences the four, makes every merge reversible, and gates the cutover on a number.
So before any boxes and arrows, the working frame for the session — the four problems, in the order they bite:
Scope is the first scored dimension, and most candidates skip it. State the sequencing and the deferrals explicitly. In scope: master-data entity resolution, the strangler dual-write and its reconciliation gate, and the versioned metric layer — the three things that block the day-ninety board deck. Deferred on purpose: the warehouse vendor decision (bridged, not migrated, for six months), org-chart and HR-system merges, and the inevitable rationalization of overlapping product SKUs. The thread that ties scope together is reversibility — every structural decision in the first ninety days has to be undoable, because the integration will be re-litigated by Finance, Legal, and the regulators before it is finished.
Then the envelope math, offered rather than extracted. Deal-shaped numbers:
| Quantity | Estimate | Consequence |
|---|---|---|
| Deal size | $20 B | Board-level scrutiny on every reported KPI |
| Source systems to unify | 2 CRMs + 2 warehouses | Two of everything: IDs, calendars, currencies |
| Customer-base overlap | ~40% (pro tier) | 1+1≠2 — ARR double-counts without dedup |
| Reconciliation drift gate | < 0.1% | The number that gates cutover |
| Cutover green window | 72 consecutive hours | One amber hour resets the clock |
| Convergence horizon | 6–18 months | Both apps ship the whole time |
| Day-90 reporting deadline | fixed | Share-and-bridge buys the warehouse decision time |
Notice which number gates everything. The drift threshold — source row counts versus unified row counts, held under one-tenth of one percent for seventy-two straight hours — is the single quantity that decides whether an entity is allowed to cut over. It dictates the reconciliation cadence, the dual-write design, and the on-call posture for the whole convergence. The rest of this article follows the crosswalk.
One core, two living sources. Both CRMs keep writing; CDC streams every change through a translation layer into a unified read model; a daily reconciliation gates the cutover. The crosswalk sits in the middle, mapping every source ID to a golden ID — reversibly.
Three properties of this picture do most of the interview's work. First, the sources are never frozen — both CRMs take writes for the entire convergence, and the unified model is fed by CDC, not by a one-time dump that goes stale the moment it lands. Second, the crosswalk is the hinge of the whole design and it is append-only: a merge writes a row, an un-merge writes another row that supersedes it, and nothing is ever destroyed — which is the only way an analyst's "those are actually parent and subsidiary" becomes a correction instead of a catastrophe. Third, the dashed line from reconciliation is a gate, not a pipe: drift over one-tenth of a percent does not corrupt data, it withholds permission to cut over, and the clock resets.
Every merge is a hypothesis, and every hypothesis must be retractable. Strict rules auto-merge because they are nearly certain; fuzzy rules stage candidates for a human because they are guesses. But even a strict merge can be wrong, so no merge ever rewrites or deletes — it appends a crosswalk row pointing source IDs at a golden ID, and the un-merge appends a superseding row pointing them somewhere new, emits a CDC event, and lets the facts re-attribute. The failure mode is a reversible correction, never an irreversible loss.
The schema falls out of reversibility and auditability. The golden record is SCD2 and lineage-bearing. The crosswalk is append-only with a supersession pointer. The metric layer is versioned so a number can be defended to the board.
The master entity is a UUID-keyed golden_customer that no source owns. Its canonical fields are SCD2 — a customer's industry classification or DUNS number can be corrected over time — and it carries a sources[] array, the lineage that makes overlap analysis a column lookup rather than a join. The presence of both source systems in that array is exactly the "customer in both products" signal Sales and Finance fight over on day ninety.
This is the most important table in the system. It maps every (source_system, source_customer_id) to a golden_customer_id, records which rule fired and at what confidence, and — the part everyone forgets — carries a superseded_by pointer. The crosswalk is never updated in place and never deleted from; a re-match appends a new row and retires the old one by pointing at it. That single column is the difference between an un-merge being a SQL statement and being a two-week incident.
The strangler's destination is unified.customers, keyed by account_id ≡ golden_customer_id, carrying source lineage (source_system, source_row_id) so any unified row can be traced back to the record that produced it. Reads go through the crosswalk; writes arrive by dual-write from CDC. Watching over all of it is recon.daily_drift — the reconciliation fact that compares source row counts to unified row counts per system and per day, and pages the on-call the moment drift crosses one-tenth of a percent.
One more table completes the governance layer, and it is covered in §06: dim_metric_definition, the versioned home for "what does combined DAU actually mean." It is the reason a board number can be audited rather than argued.
Everything in this design protects one sentence: any merge can be undone without losing referential integrity in the pipelines that already consumed it. The invariant lives in the append-only crosswalk and its supersession pointer.
The lifecycle of a golden record is not a straight line; it is a graph that can be walked backward. A source row is ingested, matched, and either auto-merged into an existing golden record or made into a distinct one. Months later an analyst discovers a mistake — two companies merged that should not have been — and the record must un-merge. The invariant says this is a forward operation: retire, re-point, log, emit. Nothing is destroyed; the history of who was merged into whom, by which rule, at what confidence, survives the correction intact.
The un-merge is the atomic heart of the system, and it is the move that separates a designer who has shipped one of these from one who has only read about it. It runs as a single transaction: open a new golden record, append crosswalk rows pointing the affected source IDs at it, retire the old golden record by closing its SCD2 window, and stamp the supersession pointer so the audit trail is intact. The CDC event it emits lets downstream fact tables re-attribute on their own schedule. Never a DELETE.
Three programs carry the convergence: the translation layer that remaps source schemas without freezing them, the matching playbook that runs strict-before-fuzzy, and the dual-writer that lands records idempotently. Each is small; the judgment is in what they refuse to do.
CDC gives you the raw source change; the translation layer turns it into the unified contract. It renames fields, rewrites source IDs into the unified namespace, and stamps lineage. Its one discipline is that a record violating the contract — an unexpected field, a missing key — goes to a dead-letter queue, not into the unified model and not silently dropped. A contract violation is a signal that a source app shipped a schema change; swallowing it is how drift creeps in unseen.
One distinction, always stated: the translation layer never matches. It only normalizes shape. Deciding whether this normalized record is the same customer as another is the playbook's job, kept separate so that a schema change and a matching change can be reasoned about — and rolled back — independently.
You never decide two rows are the same customer with one rule. You run an ordered list from strict to permissive, and the first rule that fires wins — the same logic as airport security: passport first, then bag scan, then pat-down. Strict rules auto-merge into a golden record; fuzzy rules stage a candidate for a human via match_candidates. The ordering is the whole point: a confident identifier match must never be overruled by a fuzzy name similarity.
Two slow layers do the governance work. Reconciliation turns "are we drifting?" into a number that gates the cutover. The versioned metric definition turns "what is combined DAU?" from an argument into an auditable row. Both exist so a human can defend a decision.
Reconciliation runs hourly during cutover and compares source row counts to what the unified model actually ingested, per source system. The output is drift_pct and lag_seconds; both have hard pager thresholds. Drift over one-tenth of a percent or lag over five minutes pages the on-call, and the cutover gate is the simple, unforgiving rule on top: seventy-two consecutive green hours before an entity is allowed to flip, and one amber hour resets the clock to zero.
The second governance layer is the versioned metric. The board deck has "combined DAU" on slide three, and it is wrong — because the acquirer counts "any session in 24h" while the target counts "any session in 7 days, deduped to one row per user-day." Rather than bury that disagreement in SQL, bake the definition into a versioned dimension. Every dashboard tile cites a (metric_id, definition_version); combined DAU has three competing versions; Finance picks the one that matches the board narrative; the data layer auto-generates the SQL from the row.
The golden core and its crosswalk are where the deal explains itself. Three queries an interviewer loves, because each carries a classic pattern: overlap as a set intersection, deduped roll-up through a crosswalk, and an as-of resolution over the append-only mappings.
"How many customers are in both products?" is the first question Sales asks (for the cross-sell list) and the first Finance asks (to deduct overlapping ARR before announcing combined revenue). Because lineage lives in golden_customer.sources[], it is a containment test, not a join: the golden records whose sources array holds both systems are precisely the overlap.
Summing both companies' revenue naively gives 1 + 1 = 2 for a customer that exists in both — overstating combined ARR by the overlap. The fix is to route every order through the current crosswalk row to its golden id, then sum by golden id, so a shared customer contributes once. The gap between the gross sum and this deduped figure is exactly the number Finance must disclose.
After an un-merge, "what was this source customer's golden id on a given date?" has two answers — before and after the correction. The append-only crosswalk makes the historical answer recoverable: walk the supersession chain and pick the row that was current as of the date in question. This is the audit query that defends a restated number to Finance.
Big-bang migration (copy data, rewrite dbt models) fits a strategic incompatibility caught early. Federated query (Snowflake External Tables, Unity Catalog, Trino) fits stacks that complement and a vendor decision you want to defer. Share-and-bridge (cross-account Snowflake Share, Delta Sharing, Authorized Views) is what every real deal uses for the first six months — lowest engineering cost, unblocks day-90 reporting. Tag every query with cost_owner ∈ {pre-merger-target, pre-merger-acquirer, post-merger-combined} so the CFO's "how much of last month's bill came from the acquired team" has an answer.
A senior design ends with observability, because the whole convergence is a bet that you can watch. The dashboard answers one operational question per loop: is the drift under the gate, is the analyst queue draining, and does every board number cite a version?
Read the amber tiles together and the dashboard narrates the discipline from §06: the target CRM's drift crept over the gate during a historical backfill, the on-call was paged, and the seventy-two-hour clock reset to nineteen — no data was lost, the cutover was simply not yet permitted. The DLQ at zero says both source schemas are still honoring the contract; the auto-merge rate at 82% says the strict rules are carrying the load while the analyst queue handles the fuzzy tail; every board tile cites a metric version. That is what a designed, reversible, gated integration looks like from the operator's chair on day sixty-one.
Strip the deal away and the question was testing five judgments, each of which generalizes far beyond M&A: