▸ DESIGN · M&A Integration · Data Platform Survival Guide

M&A integration — four data problems pretending to be one deal.

Every acquisition press release says "stronger together." Every acquirer's data team knows that means two incompatible CRMs, two warehouses, two ID systems, two fiscal calendars, and an executive expecting unified KPIs on day 90. This is the round where the interviewer asks "your company just bought a $20B SaaS — how do you merge the customer table?". Four named patterns, four real deals as case studies.

▸ Key terms — plain English first
Term · What it actually means
Golden record · The one canonical row for an entity (one Acme Corp), built by combining the rows that came from both companies' systems.
Crosswalk / xref · A lookup table that says "this customer_id in their CRM and this customer_id in our CRM are the same golden record." Every downstream join goes through it.
Matching playbook · The ordered list of rules for deciding that two rows refer to the same entity. Strict rules run first (exact domain + DUNS match → auto-merge). Fuzzy rules run later (similar names + same country → human reviews). Same idea as airport security: passport check first, then bag scan, then pat-down.
Strangler pattern · Both apps keep running. You build the new unified system alongside the old ones, slowly route traffic to it, and only sunset the old systems once nothing reads from them. Named after the strangler fig vine that grows around a tree before replacing it.
Dual-write · For a transition period, every change writes to BOTH the old database AND the new unified one. A reconciliation job compares them daily.
CDC (Change Data Capture) · A pipe that emits one event per row INSERT / UPDATE / DELETE on the source database. Debezium and Fivetran are the common implementations.
Un-merge · Reversing a merge that turned out to be wrong (e.g. parent and subsidiary were collapsed when they shouldn't have been). The schema must support this from day one.
Share-and-bridge · Instead of consolidating warehouses, you give each side read-only access to selected datasets in the other (Snowflake Share, Delta Sharing, BigQuery Authorized Views). The day-90 reporting unblock.
§ 01 — The framework

The four data problems hidden in every deal.

"Integrate the data" is a 3-year program disguised as a one-line slide. The interview-grade decomposition: every M&A has four orthogonal data problems, each with its own pattern, its own team, and its own failure mode. Conflate them and you ship a broken migration; separate them and the work parallelizes.

The four buckets

Problem · Question · Owner · Failure mode
Entity resolution · Is this their Acme Corp the same as our Acme Corp? · MDM / Data team · Double-counting customers, wrong account merges, ghost duplicates
Schema unification · How do we keep both apps writing while we converge models? · Backend + Data Eng · Six-month dual-write debt, drift between sources, no clean cutover
Warehouse consolidation · One warehouse, two warehouses, or federated? · Data Platform · $2M/yr in duplicated compute, broken lineage, lost dashboards
Multi-org metrics · How do we report combined DAU/MRR/ARR when definitions differ? · Analytics / Finance · Board deck has two competing "combined revenue" numbers
The pattern. Every deal has all four. The senior-IC signal in an interview is naming all four, then asking which one the company is asking you to solve first — because solving them in parallel is the rookie move that creates cross-team blocking dependencies.

Why "just migrate everything to one system" doesn't work

The naive 3-month plan: pick the acquirer's stack, ETL the target's data over, sunset the target's systems. It always slips. Three reasons:

  1. The acquired product keeps shipping. You can't freeze their database for 90 days while you migrate; their customers expect the product to keep working. The schema needs to support both apps during the convergence window.
  2. Customer IDs aren't the same shape. Their B2B account model has a hierarchy; yours is flat. Their "user" is your "contact". Merging tables ≠ merging meaning.
  3. Compliance is sticky. Acquired company's customers signed THEIR privacy notice. Moving their data into your systems may require re-consent in EU/CA. Engineering can't paper over this.
· · ·
§ 02 — Entity resolution

Entity resolution — golden records across two CRMs.

The day-one question after any acquisition: "how many overlapping customers do we have?". Marketing thinks 30%. Sales thinks 60%. Nobody can prove either number until you build an identity graph that resolves their Acme Corp and your Acme Corp into one entity — across two CRMs with different field names, formats, and hierarchies.

The matching playbook — strict rules first, fuzzy rules second

You never decide "these two rows are the same customer" using one rule. You run an ordered list of rules from strict to permissive. The first rule that fires wins. Same idea as airport security: passport check first (definite ID), then bag scan, then pat-down. The strict rules auto-merge with high confidence; the fuzzy rules send candidates to a human analyst.

Rule # · Type · What it checks · Confidence · What happens
1 · Strict · Same email domain (acme.com) AND same DUNS number · 0.99+ · Auto-merge into one golden record
2 · Strict · Same domain AND normalized company name · 0.95+ · Auto-merge, sample for human QA
3 · Fuzzy · Similar name (typo-tolerant) AND same country AND same industry · 0.80+ · Stage in match_candidates; human reviews
4 · Fuzzy · 3+ employee emails shared on both sides (e.g. @acme.com) · 0.70+ · Stage in match_candidates; escalate
5 · No match · Nothing above fires · n/a · Treat as a distinct golden record

Some industry references call this the "match-key cascade" or "match-tier cascade" — same thing, just academic-sounding names for an ordered rule list.
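
What rule 1 looks like as SQL: a minimal sketch, assuming staged, normalized copies of both CRMs (stg_acquirer_customers and stg_target_tenants are hypothetical names) and that the acquirer's own rows were already seeded into the crosswalk defined below:

-- Rule 1 (strict): same email domain AND same DUNS → auto-merge.
INSERT INTO customer_xref
    (source_system, source_customer_id, golden_customer_id,
     match_rule, match_confidence, matched_at, matched_by)
SELECT 'target_crm',
       t.tenant_id,
       x.golden_customer_id,               -- reuse the acquirer row's golden ID
       1,                                   -- which playbook rule fired
       0.99,
       now(),
       'auto-rule-tier-1'
FROM stg_target_tenants t
JOIN stg_acquirer_customers a
  ON  lower(t.email_domain) = lower(a.email_domain)
  AND t.duns_number = a.duns_number
JOIN customer_xref x
  ON  x.source_system = 'acquirer_crm'
  AND x.source_customer_id = a.customer_id
ON CONFLICT (source_system, source_customer_id) DO NOTHING;  -- idempotent re-runs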

The golden record schema

The golden_customer table is the unified entity. Source systems keep their own customer tables; the golden record references both via a crosswalk:

-- The golden record (unified entity, mastered)
CREATE TABLE golden_customer (
    golden_customer_id  UUID PRIMARY KEY,
    canonical_name      TEXT NOT NULL,
    canonical_domain    TEXT,
    duns_number         TEXT,
    industry            TEXT,
    country             TEXT,
    first_seen_at       TIMESTAMP NOT NULL,
    -- Lineage: which source(s) contributed to this golden record?
    sources             TEXT[]  -- e.g. ['acquirer_crm', 'target_crm']
);

-- The crosswalk: source-system IDs ↔ golden ID
-- This is what every downstream join uses to "see one customer" across the data.
CREATE TABLE customer_xref (
    source_system       TEXT NOT NULL,    -- 'acquirer_crm' | 'target_crm' | …
    source_customer_id  TEXT NOT NULL,
    golden_customer_id  UUID NOT NULL REFERENCES golden_customer,
    match_rule          INT  NOT NULL,    -- 1-5, which playbook rule fired
    match_confidence    NUMERIC NOT NULL, -- 0.0 - 1.0
    matched_at          TIMESTAMP NOT NULL,
    matched_by          TEXT NOT NULL,    -- 'auto-rule-tier-1' | 'mdm-analyst-alice'
    superseded_by       UUID,             -- for un-merge / re-merge audit
    PRIMARY KEY (source_system, source_customer_id)
);
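
"Every downstream join goes through it" in practice: a usage sketch with hypothetical order tables on each side:

-- Combined revenue per golden customer, routed through the crosswalk.
SELECT g.golden_customer_id,
       g.canonical_name,
       SUM(o.amount_usd) AS combined_revenue_usd
FROM (
    SELECT 'acquirer_crm' AS source_system,
           customer_id::text AS source_customer_id, amount_usd
    FROM acquirer_crm.orders
    UNION ALL
    SELECT 'target_crm', tenant_id::text, amount_usd
    FROM target_crm.invoices
) o
JOIN customer_xref   x USING (source_system, source_customer_id)
JOIN golden_customer g USING (golden_customer_id)
WHERE x.superseded_by IS NULL            -- live mappings only
GROUP BY 1, 2;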

The un-merge — the part everyone forgets

Two months after launch, an MDM analyst realizes "Acme Corp Holding" and "Acme Corporation" were merged but they're actually parent/subsidiary, not duplicates. You need to un-merge without losing referential integrity for the downstream pipelines that already consumed the wrong golden ID.

The superseded_by column + append-only crosswalk + golden record SCD2 is the pattern. Never DELETE a golden record; retire it, point the affected source IDs at a new golden ID, log who/when, and emit a CDC event downstream so fact tables can re-attribute.
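
A minimal un-merge sketch under the schema above, with placeholder IDs throughout; this follows one reading of superseded_by (an append-only history table is the stricter variant):

BEGIN;

-- 1. Mint a corrected golden record for the subsidiary.
INSERT INTO golden_customer
    (golden_customer_id, canonical_name, canonical_domain, first_seen_at, sources)
VALUES ('0b9c0a4e-0000-4000-8000-000000000001',   -- placeholder UUID
        'Acme Corporation', 'acmecorporation.com', now(), ARRAY['target_crm']);

-- 2. Re-point the mis-merged source ID; keep the retired golden ID for audit.
--    (In Postgres, the right-hand side of SET reads the pre-update row.)
UPDATE customer_xref
   SET superseded_by      = golden_customer_id,   -- the golden ID being retired
       golden_customer_id = '0b9c0a4e-0000-4000-8000-000000000001',
       matched_at         = now(),
       matched_by         = 'mdm-analyst-alice'
 WHERE source_system      = 'target_crm'
   AND source_customer_id = 'T-4711';             -- hypothetical source ID

COMMIT;
-- 3. Emit the CDC event so downstream fact tables re-attribute (pipeline's job).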

Anti-pattern: hard-merging at ingest. The temptation is to merge in the ELT layer — fold target_crm customers directly into the acquirer's customer table during ingest. This destroys the lineage (you can no longer answer "did this customer come from the acquisition?"), makes un-merge impossible, and means every downstream re-process loses the merge decisions. The xref pattern preserves lineage, keeps merges reversible, and decouples ingest from mastering.

Identity for B2B vs B2C

B2B (Salesforce-Slack, Cisco-Splunk): match on account (the company), not user. Acquired company's "Splunk Cloud users at Acme" should resolve to acquirer's "Cisco Meraki account: Acme Inc.". User-level identity is downstream of account identity.

B2C (Microsoft-Activision): match on user (often hashed email + console / platform ID). The merge target is the gamer, not their household. Identity graph here is much messier — same player on Xbox Live, Battle.net, Steam, and PlayStation Network is one person with four federated IDs.

· · ·
§ 03 — Schema unification

Schema unification — the strangler pattern, dual-write, cutover.

Both apps keep shipping. You can't freeze either database. The merger schema needs to converge over months without a flag day. The pattern is the strangler fig (Martin Fowler, 2004), adapted for data:

  1. Stand up the unified schema alongside the two source schemas. Don't touch source apps yet.
  2. Dual-write every change: source-app writes go to source DB and to a CDC stream that lands in the unified schema with a translation layer.
  3. Backfill: historical rows replayed into the unified schema. Run reconciliation queries comparing source and unified.
  4. Cutover, one entity at a time: switch the source app to read from the unified schema (still dual-write). Re-validate.
  5. Stop dual-write: source app reads and writes only to the unified schema. Sunset the old table.

The dual-write architecture

DUAL-WRITE · STRANGLER PATTERN FOR M&A

  • Acquirer app → Postgres: customers, accounts, orders
  • Target app → MySQL: tenants, workspaces, projects
  • CDC stream (Debezium / Fivetran): each INSERT/UPDATE/DELETE emits a row to Kafka; the schema registry validates the contract before publish; replay-from-offset enables backfill
  • Translation layer:
      ▸ field-name mapping (tenant → account, workspace → org)
      ▸ ID rewriting via the xref (target.tenant_id → golden_customer_id)
      ▸ default-value filling (target rows missing region → 'unknown')
      ▸ reject + DLQ on contract violation; alert on rejection rate
      ▸ every row stamped with source_system + source_row_id for lineage
  • Unified schema: customers, accounts, orders + customer_xref + golden_customer (both apps read this once cutover completes)
  • Reconciliation: daily row-count + checksum, source vs unified; drift alert if > 0.1%
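
The translation layer is mostly joins and renames. A sketch of one entity's transform, assuming Debezium-style CDC rows landed in a raw table (cdc.target_tenants_raw and the unified.customers column names are hypothetical):

-- tenant → account: field-name mapping, ID rewrite via xref, default filling,
-- lineage stamps. Rows that miss the xref join go to the DLQ, not this insert.
INSERT INTO unified.customers
    (customer_id, account_name, region, source_system, source_row_id, ingested_at)
SELECT x.golden_customer_id,               -- ID rewriting via the crosswalk
       r.tenant_name,                      -- tenant → account
       COALESCE(r.region, 'unknown'),      -- default-value filling
       'target_crm',                       -- lineage stamp
       r.tenant_id,                        -- lineage stamp
       now()
FROM cdc.target_tenants_raw r
JOIN customer_xref x
  ON  x.source_system      = 'target_crm'
  AND x.source_customer_id = r.tenant_id
WHERE r.op IN ('c', 'u');   -- Debezium create/update ops; deletes take a tombstone path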

The reconciliation query — drift detection

Dual-write only earns trust if you can prove convergence. Run this every hour during the cutover window:

WITH source_counts AS (
  SELECT 'acquirer_crm' AS src, COUNT(*) AS n, MAX(updated_at) AS last_change
  FROM acquirer_crm.customers
  UNION ALL
  SELECT 'target_crm', COUNT(*), MAX(updated_at)
  FROM target_crm.tenants
),
unified_counts AS (
  SELECT source_system AS src, COUNT(*) AS n,
         MAX(ingested_at) AS last_ingest,
         COUNT(*) FILTER (WHERE ingest_status = 'rejected') AS rejected
  FROM unified.customer_ingest_log
  GROUP BY source_system
)
SELECT s.src,
       s.n AS source_rows, u.n AS unified_rows,
       s.n - u.n AS drift,
       100.0 * ABS(s.n - u.n) / NULLIF(s.n, 0) AS drift_pct,
       u.rejected,
       EXTRACT(EPOCH FROM (s.last_change - u.last_ingest)) AS lag_seconds
FROM source_counts s LEFT JOIN unified_counts u USING (src)
ORDER BY src;

Drift > 0.1% pages the on-call. Lag > 5 minutes pages the on-call. The team that owns this query owns the cutover; nobody flips the read switch until both have been green for 72 consecutive hours.

· · ·
§ 04 — Warehouse consolidation

Warehouse consolidation — Snowflake / Databricks / BigQuery merger.

The CFO sees a $2.4M/yr Snowflake bill plus the acquired team's $1.1M/yr Databricks bill. "Pick one." This is the easiest question to ask and the hardest to execute. Three paths, none of them free:

The three strategies

Strategy · How · Cost · When to pick
Big-bang migration · Copy all data + rewrite all SQL/dbt models from B → A · 6-18 mo of platform team; $0.5-2M consultant fees · Stacks materially different (Snowflake vs DBX) and you're < 2 years post-merger
Federated query · Keep both warehouses; query across via Snowflake External Tables / DBX Unity Catalog federation / Trino · Lower upfront; ongoing query-latency tax; 2× ops · Stacks complement each other (lake + warehouse), or the buyer hasn't decided yet
Share-and-bridge · Cross-account Snowflake Share / DBX Delta Sharing / BigQuery Authorized Views; each side reads selected datasets from the other · Lowest engineering cost; read-only by design; no governance unification · Day-1 reporting unblock while you decide on the bigger migration

The "Day 90 reporting" reality

The board wants a combined revenue dashboard 90 days in. Big-bang isn't done in 90 days. So every real M&A uses share-and-bridge for the first 6 months, then commits to one of the other two strategies once governance + lineage + access have been sorted.

-- Snowflake cross-account share (acquirer → target reads acquirer data)
CREATE SHARE acq_to_target;
GRANT USAGE ON DATABASE prod_dw TO SHARE acq_to_target;
GRANT USAGE ON SCHEMA prod_dw.combined_metrics TO SHARE acq_to_target;
GRANT SELECT ON ALL TABLES IN SCHEMA prod_dw.combined_metrics TO SHARE acq_to_target;
ALTER SHARE acq_to_target ADD ACCOUNTS = TARGET_ACCOUNT_LOCATOR;  -- account identifier, unquoted

-- On the target side, mount as read-only DB
CREATE DATABASE from_acquirer FROM SHARE acquirer.acq_to_target;
SELECT * FROM from_acquirer.combined_metrics.fct_revenue_daily WHERE biz_date >= '2025-01-01';

Lineage rebuild is the silent cost

Both companies had their own column-level lineage (Atlan, Monte Carlo, Datafold, dbt docs). The merger means none of it is correct any more. The unified warehouse needs cross-source lineage from day one:

  • Source-system → unified-schema table-level lineage (which raw tables produced which unified table)
  • Translation layer → column-level lineage with the transform documented (target.tenant_uuid → unified.account_id via customer_xref)
  • Downstream dbt model lineage (which BI dashboard reads from which unified table)

Without this, the first ad-hoc CFO question — "show me ARR but exclude double-counted overlapping customers" — takes 2 weeks because nobody knows which models touch which tables.

Cost-split governance. If you stay on two warehouses for the federated/share-and-bridge phase, you need tag-based cost attribution on day one. Snowflake query tags / DBX cluster tags labelled cost_owner = 'pre-merger-target' vs 'pre-merger-acquirer' vs 'post-merger-combined'. Otherwise the executive question "how much of last month's Snowflake bill came from the acquired team" has no answer.
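
A sketch of the attribution query on the Snowflake side. It is approximate by design: warehouse credits aren't metered per query, so share of execution time per tag is the usual proxy:

-- Tag every workload at session start:
ALTER SESSION SET QUERY_TAG = 'cost_owner=pre-merger-target';

-- Then split last month's usage by tag (ACCOUNT_USAGE lags by up to ~45 min):
SELECT COALESCE(NULLIF(query_tag, ''), 'untagged')                   AS cost_owner,
       SUM(execution_time) / 1000 / 3600                             AS execution_hours,
       ROUND(100 * RATIO_TO_REPORT(SUM(execution_time)) OVER (), 1)  AS pct_of_usage
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(month, -1, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 2 DESC;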
· · ·
§ 05 — Multi-org metrics

Multi-org metrics — when their DAU isn't your DAU.

The board deck has a "combined DAU" number on slide 3. It's wrong, but nobody will notice for two quarters. Reasons it's wrong:

Definition mismatch · Example · Resolution pattern
Different active-user definition · Acquirer counts "any session in last 24h"; target counts "any session in last 7d, deduped to one row per user-day" · Materialize fct_user_activity with both windows; let analytics pick at query time
Different fiscal calendar · Acquirer is on a 4-4-5 retail calendar; target is calendar-month · SCD-2 dim_fiscal_calendar with both axes; never hard-code one in the fact table
Different currency / FX policy · Acquirer locks FX at txn time; target reconverts at month-end · Store the amount in both source currency and locked USD; flag the FX policy per row
Different consolidation rules · Acquirer eliminates intercompany revenue at parent level; target at subsidiary level · Tag each row with elimination_scope; SUM at the right grain in the rollup
Overlapping customer ARR · Acme has $200K with the acquirer and $80K with the target; combined ARR isn't simply $280K (naive summing double-counts, e.g. if the customer already churned from the acquirer post-deal) · Sum ARR by golden_customer_id; flag overlap explicitly so finance can compare gross vs net
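
The first row's "both windows" resolution, sketched below; fct_user_activity is a hypothetical table with one row per golden user per active day:

-- Compute both definitions side by side; the dashboard picks a column,
-- it never re-derives the window.
SELECT d.day,
       (SELECT COUNT(DISTINCT golden_user_id)
          FROM fct_user_activity a
         WHERE a.activity_date = d.day)                       AS active_24h,
       (SELECT COUNT(DISTINCT golden_user_id)
          FROM fct_user_activity a
         WHERE a.activity_date BETWEEN d.day - 6 AND d.day)   AS active_7d
FROM (SELECT generate_series('2025-01-01'::date, '2025-03-31'::date,
                             interval '1 day')::date AS day) d
ORDER BY d.day;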

The dim_metric_definition table

Don't bake definitions into SQL. Bake them into a versioned dim that the BI layer references:

CREATE TABLE dim_metric_definition (
    metric_id           TEXT NOT NULL,            -- 'dau'
    definition_version  INT  NOT NULL,            -- 1, 2, 3
    org_scope           TEXT NOT NULL,            -- 'acquirer' | 'target' | 'combined'
    window_unit         TEXT NOT NULL,            -- 'hour' | 'day' | 'week'
    window_size         INT  NOT NULL,            -- 24, 7, etc.
    dedup_grain         TEXT NOT NULL,            -- 'user' | 'user-day' | 'session'
    fiscal_calendar     TEXT NOT NULL,            -- 'calendar' | '445'
    currency_policy     TEXT NOT NULL,            -- 'lock-at-txn' | 'lock-at-month-end'
    effective_from      DATE NOT NULL,
    effective_to        DATE,                     -- NULL = current
    PRIMARY KEY (metric_id, definition_version)
);

Every dashboard tile cites a (metric_id, definition_version) tuple in its config. The "combined DAU on the board deck" can be definitively traced to definition_version=3 with org_scope='combined', window_size=1, dedup_grain='user-day', and the SQL is auto-generated from that row.
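
What that auto-generation might expand to for definition_version=3, as a hypothetical rendering (fct_user_activity as in the sketch above):

-- Rendered from: metric_id='dau', definition_version=3, org_scope='combined',
-- window_unit='day', window_size=1, dedup_grain='user-day'.
SELECT activity_date                    AS metric_date,
       'dau'                            AS metric_id,
       3                                AS definition_version,
       COUNT(DISTINCT golden_user_id)   AS metric_value   -- dedup grain: user-day
FROM fct_user_activity                                    -- no source filter: org_scope 'combined'
GROUP BY activity_date;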

· · ·
§ 06 — Real deals

Four deals — the data side of the press release.

$20B · ANNOUNCED 2022 · ABANDONED 2023

Adobe → Figma

Blocked by UK CMA + EU Commission on anti-competitive grounds. But the data-platform plan is the textbook M&A case study: two companies that both built world-class design tools with overlapping user bases.

The four problems that would have hit:

  • Entity resolution: Adobe's Creative Cloud account ↔ Figma's email-based account. ~40% of Figma's pro users had Creative Cloud subscriptions on the same email. Golden record would have collapsed two SKUs to one customer, immediately changing Net Dollar Retention math on both sides.
  • Schema unification: Adobe's enterprise model is hierarchical (org → team → seat). Figma's is flat (team → invite). Strangler pattern would have run for 18-24 months.
  • Warehouse consolidation: Adobe on Snowflake. Figma on BigQuery. Federated query via dbt source freshness checks crossing accounts — the "share-and-bridge" play.
  • Multi-org metrics: Adobe's "active subscriber" includes anyone with a billing relationship in the last 12 months. Figma's "active user" is 30-day. Combined ARR forecasts moved by ~$300M depending on which definition won.

The lesson: the deal didn't close, but the antitrust filings published the data architecture problem in public — it's now the best-documented "what the integration would have looked like" case in the industry.

$69B · CLOSED OCT 2023

Microsoft → Activision Blizzard

The largest gaming acquisition in history. Microsoft's Xbox + Game Pass platform absorbed Activision's franchises (Call of Duty, World of Warcraft, Candy Crush) along with their player-identity systems on Battle.net.

The four problems:

  • Entity resolution (B2C, federated): One player can be identified on Xbox Live, Battle.net, PlayStation Network, and Steam. The identity graph already existed inside Activision; the merge work was teaching Microsoft's Xbox-centric graph that Battle.net account X = Xbox Live account Y when they share a verified email + phone + linked-payment-method.
  • Schema unification (sensitive): Game-telemetry events fire 100K/sec per popular title. Activision's Kafka schemas baked in "session_id = mac-of-device + start-time"; Xbox uses GUID. Translation layer was non-trivial because joining gameplay-session telemetry across both required schema bridging that produced new derived facts (cross-platform retention).
  • Warehouse consolidation: Activision was Databricks-first. Microsoft owns Azure Synapse + Fabric. The "let's standardize on Fabric" move was politically obvious and technically painful; Databricks-on-Azure became the bridge.
  • Multi-org metrics (regulatory): Combined MAU number had to satisfy regulators on competitive-position. Different jurisdictions (UK CMA, EU, FTC) wanted different cuts of the data with different consolidation rules. The team built dim_metric_definition with 8 versions — one per regulator's preferred definition — and audit-trailed which version backed which submission.

$28B · CLOSED MAR 2024

Cisco → Splunk

The observability-stack consolidation. Cisco wanted Splunk's SIEM + log-analytics layer to bolt onto AppDynamics + ThousandEyes. The interesting wrinkle: both companies sell data platforms, so the M&A data work was happening on systems that customers also use for the same purpose.

The four problems:

  • Entity resolution (enterprise B2B): Cisco's customer base is heavy network-infra. Splunk's is heavy SecOps + IT. Overlap was estimated at 65% of Fortune 500 accounts. Golden record matched on DUNS + domain; the harder problem was rolling up the licensing entitlements across both systems so a customer who already had Splunk Enterprise didn't get re-sold a Cisco-bundled Splunk SKU.
  • Schema unification: Splunk's data is unstructured event logs (no schema; "schema on read"). Cisco's product telemetry is schematized Protobuf. The unification target wasn't "one schema" — it was a federated query layer (Splunk Indexers + Cisco's data lake) with consistent tags.
  • Warehouse consolidation: Splunk has its own indexer architecture (Splunk Indexes + SmartStore on S3). Cisco runs analytics on BigQuery. Neither was migrating. The play was Iceberg as the lingua franca — both ends export to Iceberg tables in a shared S3 bucket; analysts query whichever engine fits.
  • Multi-org metrics: Combined "platform telemetry ingested per day" became Cisco's hero number for the deal narrative. Definition fight: does Cisco's Meraki API-poll telemetry count as "ingested data" the same way Splunk's syslog ingestion does? They picked a unified billing-equivalent unit ("ingested GB normalized for retention class") and published the conversion table.

$27.7B · CLOSED JUL 2021

Salesforce → Slack

The CRM giant buys the workplace-collab leader. By 2024 Slack is being deeply rebranded as "Slack, a Salesforce company" with Slack workspaces tied to Salesforce orgs.

The four problems:

  • Entity resolution (B2B hierarchy): Salesforce's org model: Account → Contact → User Profile. Slack's: Workspace → Channel → User. The mapping isn't 1:1 — a Salesforce Account often spans multiple Slack Workspaces (different BUs run their own). The golden record bridges via company.domain + email-tail matching, with a workspace_to_account_xref staging table.
  • Schema unification: Slack ships dozens of new product features per quarter. Salesforce wasn't going to slow that down. The data side ran the strangler pattern only for shared entities (users, accounts, billing); product-feature schemas stay native in Slack's stack.
  • Warehouse consolidation: Salesforce uses Snowflake + their own Tableau CRM (CRM Analytics). Slack uses BigQuery. The strategic answer was Snowflake; Slack's BigQuery → Snowflake migration took ~18 months, with the dual-warehouse share-and-bridge as the interim.
  • Multi-org metrics: "DAU" for Slack ≠ "Active Sales Rep" for Salesforce. The combined story needed a Slack-MAU restated alongside Salesforce-MAU on a shared dim_metric_definition grid. The interesting innovation: introducing a cross_product_active flag — users active in BOTH Slack and Salesforce in the last 30 days, the most valuable cross-sell segment.
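
A sketch of how that flag could be derived, with hypothetical table names and identity resolved through a user-level crosswalk analogous to §02's account-level one:

-- One row per golden user: active in Slack, in Salesforce, or both (last 30 days).
SELECT x.golden_user_id,
       BOOL_OR(a.product = 'slack')                   AS slack_active_30d,
       BOOL_OR(a.product = 'salesforce')              AS sfdc_active_30d,
       BOOL_OR(a.product = 'slack')
         AND BOOL_OR(a.product = 'salesforce')        AS cross_product_active
FROM fct_product_activity a
JOIN user_xref x USING (source_system, source_user_id)
WHERE a.activity_date >= CURRENT_DATE - 30
GROUP BY x.golden_user_id;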
· · ·
§ 07 — The interview script

The 90-second articulation script.

When the interviewer says "your company just acquired a $20B SaaS — walk me through the data integration", this is the structure that wins:

  1. Name the four problems first (15 sec). "Every M&A has four orthogonal data problems: entity resolution, schema unification, warehouse consolidation, multi-org metrics. They have different owners and different failure modes. Let me ask which one matters most for day 90 before I pick where to start."
  2. Pick entity resolution as the foundation (15 sec). "If I had to start somewhere, it's entity resolution — because every other layer reads from the golden record. Strict rules first (same domain + DUNS → auto-merge), fuzzy rules second (similar name + same country → human reviews), unmatched rows stay distinct. Append-only crosswalk with superseded_by for un-merge."
  3. Address schema unification (20 sec). "Strangler pattern: both apps keep writing to their source DBs, CDC streams to a unified schema via a translation layer, daily drift reconciliation between source counts and unified counts. Cutover one entity at a time. Plan for 6-18 months."
  4. Address warehouse consolidation (20 sec). "Three options: big-bang migration if stacks are strategically incompatible, federated query if they complement, share-and-bridge as the day-90 unblocker before you commit. Tag-based cost attribution so the CFO can see the bill split."
  5. Address multi-org metrics (15 sec). "Don't bake definitions into SQL. Materialize dim_metric_definition with org_scope, window, dedup_grain, fiscal calendar, currency policy as versioned rows. Every dashboard cites a (metric_id, definition_version)."
  6. Close on the trap (5 sec). "The mistake teams make is solving these in parallel before naming them — that creates cross-team blocking and shipped-but-wrong dashboards. Naming the four lets you sequence."
Senior framing. "M&A integration isn't four projects — it's four orthogonal data problems, and the senior-IC mistake is shipping a single 'integration plan' that conflates them. Entity resolution gives you the golden record; the strangler pattern lets both apps keep shipping; warehouse consolidation can wait behind cross-account shares; and metric definitions become versioned dim rows the board deck can audit. Adobe-Figma, Microsoft-Activision, Cisco-Splunk, and Salesforce-Slack all hit the same four; the only difference is which one was hardest in their context."
· · ·