M&A integration — four data problems pretending to be one deal.
Every acquisition press release says "stronger together." Every acquirer's data team knows that means two incompatible CRMs, two warehouses, two ID systems, two fiscal calendars, and an executive expecting unified KPIs on day 90. This is the round where the interviewer asks, "Your company just bought a $20B SaaS — how do you merge the customer table?" Four named patterns, four real deals as case studies.
Contents
- The four data problems hidden in every deal
- Entity resolution — golden records across two CRMs
- Schema unification — the strangler pattern, dual-write, cutover
- Warehouse consolidation — Snowflake / Databricks / BigQuery merger
- Multi-org metrics — when their DAU isn't your DAU
- Case studies — Adobe-Figma, Microsoft-Activision, Cisco-Splunk, Salesforce-Slack
- The 90-second articulation script
| Term | What it actually means |
|---|---|
| Golden record | The one canonical row for an entity (one Acme Corp), built by combining the rows that came from both companies' systems. |
| Crosswalk / xref | A lookup table that says "this customer_id in their CRM and this customer_id in our CRM are the same golden record." Every downstream join goes through it. |
| Matching playbook | The ordered list of rules for deciding two rows refer to the same entity. Strict rules run first (exact domain + DUNS match → auto-merge). Fuzzy rules run later (similar names + same country → human reviews). Same idea as airport security: passport check first, then bag scan, then pat-down. |
| Strangler pattern | Both apps keep running. You build the new unified system alongside the old ones, slowly route traffic to it, and only sunset the old systems once nothing reads from them. Named after the strangler fig vine that grows around a tree before replacing it. |
| Dual-write | For a transition period, every change writes to BOTH the old database AND the new unified one. A reconciliation job compares them daily. |
| CDC (Change Data Capture) | A pipe that emits one event per row INSERT / UPDATE / DELETE on the source database. Debezium and Fivetran are the common implementations. |
| Un-merge | Reversing a merge that turned out to be wrong (e.g. parent and subsidiary were collapsed when they shouldn't have been). The schema must support this from day one. |
| Share-and-bridge | Instead of consolidating warehouses, you give each side read-only access to selected datasets in the other (Snowflake Share, Delta Sharing, BigQuery Authorized Views). The day-90 reporting unblock. |
The four data problems hidden in every deal.
"Integrate the data" is a 3-year program disguised as a one-line slide. The interview-grade decomposition: every M&A has four orthogonal data problems, each with its own pattern, its own team, and its own failure mode. Conflate them and you ship a broken migration; separate them and the work parallelizes.
The four buckets
| Problem | Question | Owns | Failure mode |
|---|---|---|---|
| Entity resolution | Is this their Acme Corp the same as our Acme Corp? | MDM / Data team | Double-counting customers, wrong account merge, ghost duplicates |
| Schema unification | How do we keep both apps writing while we converge models? | Backend + Data Eng | Six-month dual-write debt, drift between sources, no clean cutover |
| Warehouse consolidation | One warehouse, two warehouses, or federated? | Data Platform | $2M/yr in duplicated compute, broken lineage, lost dashboards |
| Multi-org metrics | How do we report combined DAU/MRR/ARR when definitions differ? | Analytics / Finance | Board deck has two competing "combined revenue" numbers |
Why "just migrate everything to one system" doesn't work
The naive 3-month plan: pick the acquirer's stack, ETL the target's data over, sunset the target's systems. It always slips. Three reasons:
- The acquired product keeps shipping. You can't freeze their database for 90 days while you migrate; their customers expect the product to keep working. Schema needs to support both apps during the convergence window.
- Customer IDs aren't the same shape. Their B2B account model has a hierarchy; yours is flat. Their "user" is your "contact". Merging tables ≠ merging meaning.
- Compliance is sticky. Acquired company's customers signed THEIR privacy notice. Moving their data into your systems may require re-consent in EU/CA. Engineering can't paper over this.
Entity resolution — golden records across two CRMs.
The day-one question after any acquisition: "how many overlapping customers do we have?". Marketing thinks 30%. Sales thinks 60%. Nobody can prove either number until you build an identity graph that resolves their Acme Corp and your Acme Corp into one entity — across two CRMs with different field names, formats, and hierarchies.
The matching playbook — strict rules first, fuzzy rules second
You never decide "these two rows are the same customer" using one rule. You run an ordered list of rules from strict to permissive. The first rule that fires wins. Same idea as airport security: passport check first (definite ID), then bag scan, then pat-down. The strict rules auto-merge with high confidence; the fuzzy rules send candidates to a human analyst.
| Rule # | Type | What it checks | Confidence | What happens |
|---|---|---|---|---|
| 1 | Strict | Same email domain (acme.com) AND same DUNS number | 0.99+ | Auto-merge into one golden record |
| 2 | Strict | Same domain AND normalized company name | 0.95+ | Auto-merge, sample for human QA |
| 3 | Fuzzy | Similar name (typo-tolerant) AND same country AND same industry | 0.80+ | Stage in match_candidates; human reviews |
| 4 | Fuzzy | 3+ employee emails shared on both sides (e.g. @acme.com) | 0.70+ | Stage in match_candidates; escalate |
| 5 | No match | Nothing above fires | — | Treat as a distinct golden record |
Some industry references call this the "match-key cascade" or "match-tier cascade" — same thing, just academic-sounding names for an ordered rule list.
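The cascade is short enough to sketch in Python. Everything here is illustrative — the record fields, the thresholds, and the normalization rules are invented, and `difflib.SequenceMatcher` stands in for a real typo-tolerant matcher:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical minimal record shape — field names are illustrative,
# not taken from any specific CRM.
@dataclass
class CrmRow:
    name: str
    domain: Optional[str]
    duns: Optional[str]
    country: Optional[str]

def normalize_name(name: str) -> str:
    # Crude normalization: lowercase, strip common legal suffixes.
    n = name.lower().strip()
    for suffix in (" inc.", " inc", " corp.", " corp", " ltd.", " ltd", " corporation"):
        if n.endswith(suffix):
            n = n[: -len(suffix)]
    return n.strip()

# Each rule returns (confidence, action) if it fires, else None.
# Ordered strict → fuzzy; the FIRST rule that fires wins.
def rule_1(a: CrmRow, b: CrmRow):
    if a.domain and a.domain == b.domain and a.duns and a.duns == b.duns:
        return (0.99, "auto-merge")

def rule_2(a: CrmRow, b: CrmRow):
    if a.domain and a.domain == b.domain and normalize_name(a.name) == normalize_name(b.name):
        return (0.95, "auto-merge-qa-sample")

def rule_3(a: CrmRow, b: CrmRow):
    sim = SequenceMatcher(None, normalize_name(a.name), normalize_name(b.name)).ratio()
    if sim >= 0.85 and a.country and a.country == b.country:
        return (0.80, "stage-for-review")

RULES = [(1, rule_1), (2, rule_2), (3, rule_3)]

def match(a: CrmRow, b: CrmRow) -> dict:
    for rule_num, rule in RULES:
        result = rule(a, b)
        if result:
            confidence, action = result
            return {"rule": rule_num, "confidence": confidence, "action": action}
    return {"rule": None, "confidence": 0.0, "action": "distinct"}
```

The point of the structure: adding a new rule is appending to `RULES`, and every crosswalk row can record exactly which rule number fired.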
The golden record schema
The golden_customer table is the unified entity. Source systems keep their own customer tables; the golden record references both via a crosswalk:
-- The golden record (unified entity, mastered)
CREATE TABLE golden_customer (
golden_customer_id UUID PRIMARY KEY,
canonical_name TEXT NOT NULL,
canonical_domain TEXT,
duns_number TEXT,
industry TEXT,
country TEXT,
first_seen_at TIMESTAMP NOT NULL,
-- Lineage: which source(s) contributed to this golden record?
sources TEXT[] -- e.g. ['acquirer_crm', 'target_crm']
);
-- The crosswalk: source-system IDs ↔ golden ID
-- This is what every downstream join uses to "see one customer" across the data.
CREATE TABLE customer_xref (
source_system TEXT NOT NULL, -- 'acquirer_crm' | 'target_crm' | …
source_customer_id TEXT NOT NULL,
golden_customer_id UUID NOT NULL REFERENCES golden_customer,
match_rule INT NOT NULL, -- 1-5, which playbook rule fired
match_confidence NUMERIC NOT NULL, -- 0.0 - 1.0
matched_at TIMESTAMP NOT NULL,
matched_by TEXT NOT NULL, -- 'auto-rule-tier-1' | 'mdm-analyst-alice'
superseded_by UUID, -- for un-merge / re-merge audit
PRIMARY KEY (source_system, source_customer_id)
);
The un-merge — the part everyone forgets
Two months after launch, an MDM analyst realizes "Acme Corp Holding" and "Acme Corporation" were merged but they're actually parent/subsidiary, not duplicates. You need to un-merge without losing referential integrity for the downstream pipelines that already consumed the wrong golden ID.
The superseded_by column + append-only crosswalk + golden record SCD2 is the pattern. Never DELETE a golden record; retire it, point the affected source IDs at a new golden ID, log who/when, and emit a CDC event downstream so fact tables can re-attribute.
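A minimal sqlite3 sketch of the un-merge under one reading of the crosswalk schema above (columns are a simplified subset; the `is_retired` flag is an added illustrative stand-in for SCD2 retirement). The wrongly merged source row gets pointed at a fresh golden ID, and `superseded_by` keeps the old golden ID as the audit trail:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE golden_customer (
    golden_customer_id TEXT PRIMARY KEY,
    canonical_name     TEXT NOT NULL,
    is_retired         INTEGER NOT NULL DEFAULT 0   -- retire, never DELETE
);
CREATE TABLE customer_xref (
    source_system      TEXT NOT NULL,
    source_customer_id TEXT NOT NULL,
    golden_customer_id TEXT NOT NULL REFERENCES golden_customer,
    superseded_by      TEXT,   -- previous golden id this mapping replaced
    matched_by         TEXT NOT NULL,
    PRIMARY KEY (source_system, source_customer_id)
);
""")

g1 = str(uuid.uuid4())
conn.execute("INSERT INTO golden_customer VALUES (?, 'Acme Corp', 0)", (g1,))
# Both source rows were (wrongly) merged into one golden record.
conn.execute("INSERT INTO customer_xref VALUES ('acquirer_crm', 'A-1', ?, NULL, 'auto-rule-tier-1')", (g1,))
conn.execute("INSERT INTO customer_xref VALUES ('target_crm', 'T-9', ?, NULL, 'auto-rule-tier-1')", (g1,))

def unmerge(conn, source_system, source_customer_id, new_name, analyst):
    """Point one source row at a fresh golden record; keep the old id as audit."""
    g_new = str(uuid.uuid4())
    old = conn.execute(
        "SELECT golden_customer_id FROM customer_xref "
        "WHERE source_system=? AND source_customer_id=?",
        (source_system, source_customer_id)).fetchone()[0]
    conn.execute("INSERT INTO golden_customer VALUES (?, ?, 0)", (g_new, new_name))
    conn.execute(
        "UPDATE customer_xref SET golden_customer_id=?, superseded_by=?, matched_by=? "
        "WHERE source_system=? AND source_customer_id=?",
        (g_new, old, analyst, source_system, source_customer_id))
    return g_new

g2 = unmerge(conn, "target_crm", "T-9", "Acme Corp Holding", "mdm-analyst-alice")
```

In a real pipeline the `unmerge` call would also emit the CDC event described above so fact tables can re-attribute.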
Identity for B2B vs B2C
B2B (Salesforce-Slack, Cisco-Splunk): match on account (the company), not user. Acquired company's "Splunk Cloud users at Acme" should resolve to acquirer's "Cisco Meraki account: Acme Inc.". User-level identity is downstream of account identity.
B2C (Microsoft-Activision): match on user (often hashed email + console / platform ID). The merge target is the gamer, not their household. Identity graph here is much messier — same player on Xbox Live, Battle.net, Steam, and PlayStation Network is one person with four federated IDs.
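Resolving those federated IDs into "one person" is a connected-components problem: each piece of link evidence (shared verified email, shared payment method) is an edge, and union-find collapses the components. A toy sketch with invented platform IDs:

```python
from collections import defaultdict

class UnionFind:
    """Minimal union-find with path halving; keys are arbitrary hashables."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
# Hypothetical link evidence: pairs of platform IDs believed to be one person.
linked_pairs = [
    ("xbox:alice88", "battlenet:alice#1234"),   # shared verified email
    ("battlenet:alice#1234", "steam:al1ce"),    # shared payment method
    ("xbox:bob42", "psn:bob_the_builder"),
]
for a, b in linked_pairs:
    uf.union(a, b)

# Group every platform ID under its component root: one group per person.
players = defaultdict(set)
for platform_id in list(uf.parent):
    players[uf.find(platform_id)].add(platform_id)
```

Two people fall out: one with three federated IDs, one with two. The production version replaces the hand-written pairs with scored match evidence, but the graph shape is the same.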
Schema unification — the strangler pattern, dual-write, cutover.
Both apps keep shipping. You can't freeze either database. The merger schema needs to converge over months without a flag day. The pattern is the strangler fig (Martin Fowler, 2004), adapted for data:
- Stand up the unified schema alongside the two source schemas. Don't touch source apps yet.
- Dual-write every change: source-app writes go to source DB and to a CDC stream that lands in the unified schema with a translation layer.
- Backfill: historical rows replayed into the unified schema. Run reconciliation queries comparing source and unified.
- Cutover, one entity at a time: switch the source app to read from the unified schema (still dual-write). Re-validate.
- Stop dual-write: source app reads and writes only to the unified schema. Sunset the old table.
The dual-write architecture
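A toy in-memory sketch of the write path — every table and field name here is illustrative, and in production the "new path" is a CDC consumer applying translated events, not an inline function call:

```python
# Old path: the acquired app's own table. New path: the unified schema.
source_db = {}   # target_crm.tenants: tenant_uuid -> row
unified_db = {}  # unified.customer:   account_id  -> row

def translate(tenant_row: dict) -> dict:
    # Translation layer: target_crm field names -> unified schema.
    return {
        "account_id": f"tgt-{tenant_row['tenant_uuid']}",
        "name": tenant_row["display_name"],
        "source_system": "target_crm",
    }

def dual_write(tenant_row: dict) -> None:
    source_db[tenant_row["tenant_uuid"]] = tenant_row  # old path (app keeps working)
    unified = translate(tenant_row)                    # new path (via translation)
    unified_db[unified["account_id"]] = unified

dual_write({"tenant_uuid": "t-123", "display_name": "Acme Corp"})

def reconcile() -> dict:
    # The trust-building check: row counts on both paths must converge.
    return {"source_rows": len(source_db),
            "unified_rows": len(unified_db),
            "drift": len(source_db) - len(unified_db)}
```

The real reconciliation is the SQL below; this just shows where the two writes and the comparison sit relative to each other.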
The reconciliation query — drift detection
Dual-write only earns trust if you can prove convergence. Run this every hour during the cutover window:
WITH source_counts AS (
SELECT 'acquirer_crm' AS src, COUNT(*) AS n, MAX(updated_at) AS last_change
FROM acquirer_crm.customers
UNION ALL
SELECT 'target_crm', COUNT(*), MAX(updated_at)
FROM target_crm.tenants
),
unified_counts AS (
SELECT source_system AS src, COUNT(*) AS n,
MAX(ingested_at) AS last_ingest,
COUNT(*) FILTER (WHERE ingest_status = 'rejected') AS rejected
FROM unified.customer_ingest_log
GROUP BY source_system
)
SELECT s.src,
s.n AS source_rows, u.n AS unified_rows,
s.n - u.n AS drift,
100.0 * ABS(s.n - u.n) / NULLIF(s.n, 0) AS drift_pct,
u.rejected,
EXTRACT(EPOCH FROM (s.last_change - u.last_ingest)) AS lag_seconds
FROM source_counts s LEFT JOIN unified_counts u USING (src)
ORDER BY src;
Drift > 0.1% pages the on-call. Lag > 5 minutes pages the on-call. The team that owns this query owns the cutover; nobody flips the read switch until both have been green for 72 consecutive hours.
Warehouse consolidation — Snowflake / Databricks / BigQuery merger.
The CFO sees a $2.4M/yr Snowflake bill plus the acquired team's $1.1M/yr Databricks bill. "Pick one." This is the easiest question to ask and the hardest to execute. Three paths, none of them free:
The three strategies
| Strategy | How | Cost | When to pick |
|---|---|---|---|
| Big-bang migration | Copy all data + rewrite all SQL/dbt models from B → A | 6-18 mo of platform-team time; $0.5-2M in consultant fees | Stacks are materially different (Snowflake vs DBX) and leadership wants one platform within 2 yrs post-merger |
| Federated query | Keep both warehouses. Query across via Snowflake External Tables / DBX Unity Catalog federation / Trino | Lower upfront; ongoing query latency tax; 2× ops | Stacks complement (lake + warehouse), or buyer hasn't decided yet |
| Share-and-bridge | Cross-account Snowflake Share / DBX Delta Sharing / BigQuery Authorized Views. Each side reads selected datasets from the other. | Lowest engineering cost; read-only by design; no governance unification | Day-1 reporting unblock while you decide on the bigger migration |
The "Day 90 reporting" reality
The board wants a combined revenue dashboard 90 days in. Big-bang isn't done in 90 days. So every real M&A uses share-and-bridge for the first 6 months, then commits to one of the other two strategies once governance + lineage + access have been sorted.
-- Snowflake cross-account share (acquirer → target reads acquirer data)
CREATE SHARE acq_to_target;
GRANT USAGE ON DATABASE prod_dw TO SHARE acq_to_target;
GRANT USAGE ON SCHEMA prod_dw.combined_metrics TO SHARE acq_to_target;
GRANT SELECT ON ALL TABLES IN SCHEMA prod_dw.combined_metrics TO SHARE acq_to_target;
ALTER SHARE acq_to_target ADD ACCOUNTS = target_account_locator;
-- On the target side, mount as read-only DB
CREATE DATABASE from_acquirer FROM SHARE acquirer.acq_to_target;
SELECT * FROM from_acquirer.combined_metrics.fct_revenue_daily WHERE biz_date >= '2025-01-01';
Lineage rebuild is the silent cost
Both companies had their own column-level lineage (Atlan, Monte Carlo, Datafold, dbt docs). The merger means none of it is correct any more. The unified warehouse needs cross-source lineage from day one:
- Source-system → unified-schema table-level lineage (which raw tables produced which unified table)
- Translation layer → column-level lineage with the transform documented (target.tenant_uuid → unified.account_id via customer_xref)
- Downstream dbt model lineage (which BI dashboard reads from which unified table)
Without this, the first ad-hoc CFO question — "show me ARR but exclude double-counted overlapping customers" — takes 2 weeks because nobody knows which models touch which tables.
Cost attribution is the other day-one requirement: tag every warehouse, pipeline, and query with cost_owner = 'pre-merger-target', 'pre-merger-acquirer', or 'post-merger-combined'. Otherwise the executive question "how much of last month's Snowflake bill came from the acquired team?" has no answer.
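Once the tags exist, the bill split is a groupby over tagged query history. A toy sketch with invented warehouse names and credit numbers:

```python
from collections import defaultdict

# Hypothetical query-history rows, each tagged with its cost_owner.
query_history = [
    {"warehouse": "REPORTING_WH",   "credits": 120.0, "cost_owner": "pre-merger-acquirer"},
    {"warehouse": "TARGET_ETL_WH",  "credits":  80.0, "cost_owner": "pre-merger-target"},
    {"warehouse": "COMBINED_BI_WH", "credits":  40.0, "cost_owner": "post-merger-combined"},
    {"warehouse": "TARGET_ETL_WH",  "credits":  15.0, "cost_owner": "pre-merger-target"},
]

# Sum credits per owner: this is the row the CFO actually asks for.
bill_split = defaultdict(float)
for q in query_history:
    bill_split[q["cost_owner"]] += q["credits"]
```

The hard part isn't the sum — it's enforcing that no workload runs untagged, which is a governance policy, not a query.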
Multi-org metrics — when their DAU isn't your DAU.
The board deck has a "combined DAU" number on slide 3. It's wrong, but nobody will notice for two quarters. Reasons it's wrong:
| Definition mismatch | Example | Resolution pattern |
|---|---|---|
| Different active-user definition | Acquirer counts "any session in last 24h"; target counts "any session in last 7d, deduped to one row per user-day" | Materialize fct_user_activity with both windows; let analytics pick at query time |
| Different fiscal calendar | Acquirer is 4-4-5 retail calendar; target is calendar-month | SCD-2 dim_fiscal_calendar with both axes; never hard-code one in the fact table |
| Different currency / FX policy | Acquirer locks FX at txn time; target reconverts at month-end | Store amount in both source-currency and locked-USD; flag the FX policy per row |
| Different consolidation rules | Acquirer eliminates intercompany revenue at parent level; target eliminates at subsidiary level | Tag each row with elimination_scope; SUM at the right grain in the rollup |
| Overlapping customer ARR | Acme has $200K with acquirer and $80K with target; naively reporting $280K of "combined ARR" hides the overlap and overstates net-new if either contract consolidates or churns post-deal | Sum ARR by golden_customer_id; flag overlap explicitly so finance can compare gross vs net |
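The overlap-aware sum from the last row can be sketched directly — the crosswalk IDs and dollar amounts here are invented:

```python
from collections import defaultdict

# Crosswalk: (source_system, source_customer_id) -> golden_customer_id.
xref = {
    ("acquirer_crm", "A-1"): "g-acme",
    ("target_crm",   "T-9"): "g-acme",    # same company on both sides
    ("target_crm",   "T-2"): "g-globex",
}
# Contracts keyed by each CRM's own customer id: (source, customer_id, ARR).
contracts = [
    ("acquirer_crm", "A-1", 200_000),
    ("target_crm",   "T-9",  80_000),
    ("target_crm",   "T-2",  50_000),
]

arr_by_golden = defaultdict(int)
sides_by_golden = defaultdict(set)
for source, cust_id, arr in contracts:
    g = xref[(source, cust_id)]            # every join goes through the crosswalk
    arr_by_golden[g] += arr
    sides_by_golden[g].add(source)

combined_arr = sum(arr_by_golden.values())
# ARR sitting on golden records that appear on BOTH sides — the number
# finance needs called out, not silently summed.
overlap_arr = sum(v for g, v in arr_by_golden.items() if len(sides_by_golden[g]) > 1)
```

Combined ARR is $330K gross, of which $280K sits on an overlapping customer — exactly the split the board deck should show instead of one unlabeled number.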
The dim_metric_definition table
Don't bake definitions into SQL. Bake them into a versioned dim that the BI layer references:
CREATE TABLE dim_metric_definition (
metric_id TEXT NOT NULL, -- 'dau'
definition_version INT NOT NULL, -- 1, 2, 3
org_scope TEXT NOT NULL, -- 'acquirer' | 'target' | 'combined'
window_unit TEXT NOT NULL, -- 'hour' | 'day' | 'week'
window_size INT NOT NULL, -- 24, 7, etc.
dedup_grain TEXT NOT NULL, -- 'user' | 'user-day' | 'session'
fiscal_calendar TEXT NOT NULL, -- 'calendar' | '445'
currency_policy TEXT NOT NULL, -- 'lock-at-txn' | 'lock-at-month-end'
effective_from DATE NOT NULL,
effective_to DATE, -- NULL = current
PRIMARY KEY (metric_id, definition_version)
);
Every dashboard tile cites a (metric_id, definition_version) tuple in its config. The "combined DAU on the board deck" can be definitively traced to definition_version=3 with org_scope='combined', window_size=1, dedup_grain='user-day', and the SQL is auto-generated from that row.
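A sketch of that auto-generation. The `fct_user_activity` table, its columns, and the definition-to-SQL mapping are all illustrative assumptions, not a real framework:

```python
definition = {
    "metric_id": "dau", "definition_version": 3, "org_scope": "combined",
    "window_unit": "day", "window_size": 1, "dedup_grain": "user-day",
}

def generate_sql(d: dict) -> str:
    """Render one dim_metric_definition row into a counting query."""
    grain_cols = {
        "user": "user_id",
        "user-day": "user_id, DATE(activity_at)",
        "session": "session_id",
    }[d["dedup_grain"]]
    return (
        f"-- {d['metric_id']} v{d['definition_version']} ({d['org_scope']})\n"
        f"SELECT COUNT(*) AS {d['metric_id']} FROM (\n"
        f"  SELECT DISTINCT {grain_cols}\n"
        f"  FROM fct_user_activity\n"
        f"  WHERE org_scope = '{d['org_scope']}'\n"
        f"    AND activity_at >= NOW() - INTERVAL '{d['window_size']} {d['window_unit']}'\n"
        f") deduped"
    )
```

The generated SQL carries the metric ID and version in its header comment, so any number on a dashboard can be traced back to the exact definition row that produced it.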
Four deals — the data side of the press release.
$20B · ANNOUNCED 2022 · ABANDONED 2023
Adobe → Figma
Abandoned in December 2023 under pressure from the UK CMA and the EU Commission over competition concerns. But the data-platform plan is the textbook M&A case study: two companies that both built world-class design tools with overlapping user bases.
The four problems that would have hit:
- Entity resolution: Adobe's Creative Cloud account ↔ Figma's email-based account. ~40% of Figma's pro users had Creative Cloud subscriptions on the same email. Golden record would have collapsed two SKUs to one customer, immediately changing Net Dollar Retention math on both sides.
- Schema unification: Adobe's enterprise model is hierarchical (org → team → seat). Figma's is flat (team → invite). Strangler pattern would have run for 18-24 months.
- Warehouse consolidation: Adobe on Snowflake. Figma on BigQuery. The likely first move: share-and-bridge — cross-account read-only access to each side's curated datasets — with the strategic migration decided later.
- Multi-org metrics: Adobe's "active subscriber" includes anyone with a billing relationship in the last 12 months. Figma's "active user" is 30-day. Combined ARR forecasts moved by ~$300M depending on which definition won.
The lesson: the deal didn't close, but the antitrust filings published the data architecture problem in public — it's now the best-documented "what the integration would have looked like" case in the industry.
$69B · CLOSED OCT 2023
Microsoft → Activision Blizzard
The largest gaming acquisition in history. Microsoft's Xbox + Game Pass platform absorbed Activision's franchises (Call of Duty, World of Warcraft, Candy Crush) along with their player-identity systems on Battle.net.
The four problems:
- Entity resolution (B2C, federated): One player can be identified on Xbox Live, Battle.net, PlayStation Network, and Steam. The identity graph already existed inside Activision; the merge work was teaching Microsoft's Xbox-centric graph that Battle.net account X = Xbox Live account Y when they share a verified email + phone + linked-payment-method.
- Schema unification (sensitive): Game-telemetry events fire 100K/sec per popular title. Activision's Kafka schemas baked in "session_id = mac-of-device + start-time"; Xbox uses GUID. Translation layer was non-trivial because joining gameplay-session telemetry across both required schema bridging that produced new derived facts (cross-platform retention).
- Warehouse consolidation: Activision was Databricks-first. Microsoft owns Azure Synapse + Fabric. The "let's standardize on Fabric" move was politically obvious and technically painful; Databricks-on-Azure became the bridge.
- Multi-org metrics (regulatory): The combined MAU number had to satisfy regulators assessing competitive position. Different jurisdictions (UK CMA, EU, FTC) wanted different cuts of the data with different consolidation rules. The team built dim_metric_definition with 8 versions — one per regulator's preferred definition — and audit-trailed which version backed which submission.
$28B · CLOSED MAR 2024
Cisco → Splunk
The observability-stack consolidation. Cisco wanted Splunk's SIEM + log-analytics layer to bolt onto AppDynamics + ThousandEyes. The interesting wrinkle: both companies sell data platforms, so the M&A data work was happening on systems that customers also use for the same purpose.
The four problems:
- Entity resolution (enterprise B2B): Cisco's customer base is heavy network-infra. Splunk's is heavy SecOps + IT. Overlap was estimated at 65% of Fortune 500 accounts. Golden record matched on DUNS + domain; the harder problem was rolling up the licensing entitlements across both systems so a customer who already had Splunk Enterprise didn't get re-sold a Cisco-bundled Splunk SKU.
- Schema unification: Splunk's data is unstructured event logs (no schema; "schema on read"). Cisco's product telemetry is schematized Protobuf. The unification target wasn't "one schema" — it was a federated query layer (Splunk Indexers + Cisco's data lake) with consistent tags.
- Warehouse consolidation: Splunk has its own indexer architecture (Splunk Indexes + SmartStore on S3). Cisco runs analytics on BigQuery. Neither was migrating. The play was Iceberg as the lingua franca — both ends export to Iceberg tables in a shared S3 bucket; analysts query whichever engine fits.
- Multi-org metrics: Combined "platform telemetry ingested per day" became Cisco's hero number for the deal narrative. Definition fight: does Cisco's Meraki API-poll telemetry count as "ingested data" the same way Splunk's syslog ingestion does? They picked a unified billing-equivalent unit ("ingested GB normalized for retention class") and published the conversion table.
$27.7B · CLOSED JUL 2021
Salesforce → Slack
The CRM giant buys the workplace-collab leader. By 2024 Slack is branded "Slack, a Salesforce company," with Slack workspaces tied to Salesforce orgs.
The four problems:
- Entity resolution (B2B hierarchy): Salesforce's org model: Account → Contact → User Profile. Slack's: Workspace → Channel → User. The mapping isn't 1:1 — a Salesforce Account often spans multiple Slack Workspaces (different BUs run their own). The golden record bridges via company.domain + email-tail matching, with a workspace_to_account_xref staging table.
- Schema unification: Slack ships dozens of new product features per quarter. Salesforce wasn't going to slow that down. The data side ran the strangler pattern only for shared entities (users, accounts, billing); product-feature schemas stay native in Slack's stack.
- Warehouse consolidation: Salesforce uses Snowflake + their own Tableau CRM (CRM Analytics). Slack uses BigQuery. The strategic answer was Snowflake; Slack's BigQuery → Snowflake migration took ~18 months and was led with the dual-warehouse share-and-bridge interim.
- Multi-org metrics: "DAU" for Slack ≠ "Active Sales Rep" for Salesforce. The combined story needed a Slack-MAU restated alongside Salesforce-MAU on a shared
dim_metric_definitiongrid. The interesting innovation: introducing across_product_activeflag — users active in BOTH Slack and Salesforce in the last 30 days, the most valuable cross-sell segment.
The 90-second articulation script.
When the interviewer says "your company just acquired a $20B SaaS — walk me through the data integration", this is the structure that wins:
- Name the four problems first (15 sec). "Every M&A has four orthogonal data problems: entity resolution, schema unification, warehouse consolidation, multi-org metrics. They have different owners and different failure modes. Let me ask which one matters most for day 90 before I pick where to start."
- Pick entity resolution as the foundation (15 sec). "If I had to start somewhere, it's entity resolution — because every other layer reads from the golden record. Strict rules first (same domain + DUNS → auto-merge), fuzzy rules second (similar name + same country → human reviews), unmatched rows stay distinct. Append-only crosswalk with superseded_by for un-merge."
- Address schema unification (20 sec). "Strangler pattern: both apps keep writing to their source DBs, CDC streams to a unified schema via a translation layer, daily drift reconciliation between source counts and unified counts. Cutover one entity at a time. Plan for 6-18 months."
- Address warehouse consolidation (20 sec). "Three options: big-bang migration if stacks are strategically incompatible, federated query if they complement, share-and-bridge as the day-90 unblocker before you commit. Tag-based cost attribution so the CFO can see the bill split."
- Address multi-org metrics (15 sec). "Don't bake definitions into SQL. Materialize dim_metric_definition with org_scope, window, dedup_grain, fiscal calendar, currency policy as versioned rows. Every dashboard cites a (metric_id, definition_version)."
- Close on the trap (5 sec). "The mistake teams make is solving these in parallel before naming them — that creates cross-team blocking and shipped-but-wrong dashboards. Naming the four lets you sequence."