A Data Catalogue That Builds Itself

No Collibra. No Alation. No 6-month implementation. Just architecture.

⚠️ Disclaimer: Horizon Bank Holdings is a fictional company created for this proof of concept. No real financial institution's data was used. All data is synthetically generated. This demonstrates what's architecturally possible.

The Governance Gap Nobody Talks About

Last week I shared the Financial Lakehouse example — 103K+ records, 15 tables, 10 executive dashboards, all built by AI agents. The response was incredible.

But several people asked the same question:

"How do you govern this? How does a new analyst know what risk_tier means? How does compliance know which columns contain PII?"

Fair question. A data platform without governance is a liability. So I built the governance layer.

What I Built

A self-building data catalogue that auto-profiles every table in the MDM Lakehouse. Point it at the data. It does the rest.

The engine reads all 15 tables and generates:

225 column profiles — data type, null rate, cardinality, min/max, standard deviation, distribution patterns, and sample values
Automatic PII classification — every column tagged as PII, SPII, Confidential, or Public based on column names and content patterns
Data lineage — full dependency graph from source systems through Bronze, MDM, and Gold layers
17 business glossary terms — with definitions, domain owners, data stewards, and cross-references to every table where each term appears
Quality scores — 98.6% average across 6 dimensions, scored per column and per table

All of it is written out as structured JSON metadata that powers a 5-tab React catalogue UI.
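To make the column profile concrete, here is a minimal sketch of what profiling a single column could look like. It is an illustration under assumptions, not the engine's actual code: the function name profile_column, the field names in the output, and the rule that an empty string counts as null are all hypothetical.

```python
import json
import statistics


def profile_column(name, values):
    """Profile one column from its raw string values (empty string = null)."""
    non_null = [v for v in values if v not in ("", None)]
    profile = {
        "column": name,
        "null_rate": round(1 - len(non_null) / len(values), 4) if values else 0.0,
        "cardinality": len(set(non_null)),
        "sample_values": sorted(set(non_null))[:3],
    }
    # Numeric statistics only apply when every non-null value parses as a number.
    try:
        nums = [float(v) for v in non_null]
    except ValueError:
        nums = []
    if nums:
        profile["min"] = min(nums)
        profile["max"] = max(nums)
        profile["std_dev"] = round(statistics.pstdev(nums), 4)
    return profile


# Hypothetical column with one missing value.
example = profile_column("balance", ["100.0", "250.5", "", "100.0"])
print(json.dumps(example, indent=2))
```

A text column such as segment goes through the same function and simply comes back without min, max, or std_dev.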

The Catalogue Browser

Every table at a glance. Layer badges. Row counts. Quality scores with color-coded dots. PII column counts. Owner. Refresh cadence. Search and filter by layer, name, or tag.

This is what analysts see on Day 1. No Confluence pages. No Slack messages asking "who owns dim_customer?"

Column-Level Profiling

Click into any table and see every column profiled:

Data type inference — the engine detects integers, decimals, dates, booleans, emails, phone numbers, identifiers
PII badges — red for PII (names, emails), amber for SPII (FICO scores, income), purple for Confidential (account IDs, balances), green for Public
Null rates — 0% across the board in our golden records (as it should be for MDM output)
Cardinality — segment has 5 distinct values, customer_id has 2,000 (unique)
Distribution previews — status: active 80%, inactive 10%, closed 8%, suspended 2%
Glossary links — inline definitions for business terms like "FICO Score" and "Risk Tier"

For PII columns, sample values are automatically masked. The catalogue never exposes sensitive data — even to the people documenting it.
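Type inference and sample masking can be sketched in a few lines. This is a simplified illustration, not the engine's real logic: the check order, the regular expressions, the ISO date format, and the "keep the first character" masking rule are assumptions, and treating SPII the same as PII for masking is also an assumption.

```python
import re
from datetime import datetime

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_int(v):
    return re.fullmatch(r"[+-]?\d+", v) is not None


def is_decimal(v):
    return is_int(v) or re.fullmatch(r"[+-]?\d+\.\d+", v) is not None


def is_date(v):
    try:
        datetime.strptime(v, "%Y-%m-%d")  # assumes ISO dates
        return True
    except ValueError:
        return False


# Checks run in order; the first type that fits every value wins.
CHECKS = [
    ("boolean", lambda v: v.lower() in {"true", "false"}),
    ("integer", is_int),
    ("decimal", is_decimal),
    ("date", is_date),
    ("email", lambda v: EMAIL_RE.match(v) is not None),
]


def infer_type(values):
    """Return the first type that every non-empty value satisfies."""
    vals = [v.strip() for v in values if v.strip()]
    if not vals:
        return "unknown"
    for label, check in CHECKS:
        if all(check(v) for v in vals):
            return label
    return "string"


def mask(value, classification):
    """Hide all but the first character for sensitive classifications."""
    if classification in {"PII", "SPII"} and value:
        return value[0] + "*" * (len(value) - 1)
    return value
```

Calling infer_type(["1.5", "2"]) returns "decimal" because the integer check fails on the first value, while mask only touches values whose column was classified as sensitive.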

Data Lineage: Where Everything Comes From

This is the question every auditor asks and every team dreads: "Where does this number come from?"

The lineage map traces every table's full dependency chain, from source systems through the Bronze, MDM, and Gold layers.

Each connection carries metadata: the join key, the transformation applied, the refresh frequency, and the SLA. When a regulator asks "how did you calculate this customer's risk tier?", the answer is one click away.
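One way to model those connections is an edge record that carries its own metadata, plus a backwards walk from any table to its sources. The sketch below is hypothetical: apart from dim_customer, the table names, join keys, transformations, refresh values, and SLAs are invented for illustration, and the real lineage JSON may be shaped differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    source: str
    target: str
    join_key: str
    transformation: str
    refresh: str
    sla: str


# Hypothetical three-hop chain: source system -> Bronze -> MDM -> Gold.
EDGES = [
    LineageEdge("source.core_banking", "bronze.customers_raw",
                "customer_id", "raw extract", "daily", "06:00"),
    LineageEdge("bronze.customers_raw", "mdm.dim_customer",
                "customer_id", "deduplicate and merge", "daily", "07:00"),
    LineageEdge("mdm.dim_customer", "gold.customer_risk",
                "customer_id", "risk tier assignment", "daily", "08:00"),
]


def upstream(table, edges):
    """Walk edges backwards; return every edge feeding `table`, nearest first."""
    found, frontier, seen = [], [table], {table}
    while frontier:
        current = frontier.pop(0)
        for edge in edges:
            if edge.target == current:
                found.append(edge)
                if edge.source not in seen:
                    seen.add(edge.source)
                    frontier.append(edge.source)
    return found


for edge in upstream("gold.customer_risk", EDGES):
    print(f"{edge.source} -> {edge.target} [{edge.transformation}, key={edge.join_key}]")
```

Because each edge holds the join key and transformation, the "where does this number come from?" answer is just the list this function returns.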

Business Glossary: Shared Language

Ask five people at a bank what "customer segment" means and you'll get five answers. The glossary fixes that.

17 terms, each with:

A precise definition (e.g., "Probability of Default: Statistical likelihood (0-1) of default within 12 months. Basel II/III regulatory metric.")
The domain it belongs to (Credit Risk, Marketing, MDM, Payments, Fraud, Finance, Digital)
The data steward responsible
Every table where the term appears

When someone asks "what does days_past_due mean?", the answer isn't in someone's head. It's in the catalogue. Searchable. Authoritative. Version-controlled.
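A glossary entry like this maps naturally onto a small record type with a column-to-term index on top. The sketch below is an assumption about shape only: the definition text is the one quoted above, but the field names, the steward value, the column name probability_of_default, the table name, and the choice of Credit Risk as its domain are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryTerm:
    name: str
    definition: str
    domain: str
    steward: str
    columns: list[str] = field(default_factory=list)
    tables: list[str] = field(default_factory=list)


GLOSSARY = [
    GlossaryTerm(
        name="Probability of Default",
        definition=(
            "Statistical likelihood (0-1) of default within 12 months. "
            "Basel II/III regulatory metric."
        ),
        domain="Credit Risk",               # assumed domain assignment
        steward="credit-risk-steward",      # hypothetical steward
        columns=["probability_of_default"],  # hypothetical column name
        tables=["gold.customer_risk"],      # hypothetical table name
    ),
]


def build_index(glossary):
    """Map each column name to the glossary term that defines it."""
    return {col: term for term in glossary for col in term.columns}


index = build_index(GLOSSARY)
term = index.get("probability_of_default")
print(term.name, "|", term.domain, "|", term.tables)
```

Keeping the glossary as plain data like this is what makes it easy to search and to keep under version control.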

Quality Observatory

Quality isn't a one-time audit. It's continuous.

The Quality Observatory scores every table across 6 dimensions.

A heatmap shows all 15 tables at once — green for 98%+, teal for 96%+, amber for 95%+, red below. You spot problems before they reach dashboards.
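The heatmap colouring reduces to a threshold lookup. The thresholds below are the ones stated above; the function name and the sample table scores are hypothetical.

```python
def heatmap_color(score):
    """Map a quality score (0-100) to a heatmap colour."""
    if score >= 98.0:
        return "green"
    if score >= 96.0:
        return "teal"
    if score >= 95.0:
        return "amber"
    return "red"


# Hypothetical per-table scores, for illustration only.
sample_scores = {"dim_customer": 98.6, "table_b": 96.4, "table_c": 94.2}
colors = {table: heatmap_color(score) for table, score in sample_scores.items()}
print(colors)
```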

34 automated DQ tests. All passing. The quality gate that prevents bad data from becoming bad decisions.

PII: Found Before You Knew It Was There

The engine automatically classified all 225 columns as PII, SPII, Confidential, or Public.

This isn't a manual spreadsheet exercise. The engine uses column name patterns and content analysis to classify at ingestion time. For financial services, where GLBA, CCPA, and OCC regulations carry real teeth, knowing where your PII lives isn't optional.
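A rule-based classifier along these lines can be sketched as name patterns first and a content check second. The patterns below are assumptions built from the examples mentioned earlier (names and emails as PII, FICO scores and income as SPII, account IDs and balances as Confidential); the engine's real rules are not shown in this article and are likely more thorough.

```python
import re

# Ordered rules: the first pattern that matches the column name wins.
NAME_RULES = [
    ("SPII", re.compile(r"fico|income", re.IGNORECASE)),
    ("PII", re.compile(r"name|email|phone", re.IGNORECASE)),
    ("Confidential", re.compile(r"account_id|balance", re.IGNORECASE)),
]
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def classify(column, samples=()):
    """Classify a column by its name, falling back to a content check."""
    for label, pattern in NAME_RULES:
        if pattern.search(column):
            return label
    # Content fallback: a column whose samples all look like emails is PII
    # even when the column name gives nothing away.
    if samples and all(EMAIL_RE.match(s) for s in samples):
        return "PII"
    return "Public"


for col in ["email", "fico_score", "account_id", "segment"]:
    print(col, "->", classify(col))
```

The content fallback is what catches the column nobody named honestly, for example a free-text contact field that turns out to hold email addresses.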

How It Works: The Engine

The catalogue engine is a Python script. Point it at a directory of CSVs. It produces structured JSON.
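The overall flow can be sketched with the standard library alone. This is a stripped-down illustration, not the repository's script: the function name build_catalogue, the JSON keys, and the demo file contents are hypothetical, and the real engine computes far more per column.

```python
import csv
import json
import tempfile
from pathlib import Path


def build_catalogue(csv_dir):
    """Profile every CSV in csv_dir and return a JSON-serialisable dict."""
    catalogue = {"tables": []}
    for path in sorted(Path(csv_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            rows = list(reader)
            fieldnames = reader.fieldnames or []
        columns = []
        for name in fieldnames:
            values = [row[name] for row in rows]
            nulls = sum(1 for v in values if v in ("", None))
            columns.append({
                "name": name,
                "null_rate": round(nulls / len(values), 4) if values else 0.0,
                "cardinality": len({v for v in values if v not in ("", None)}),
            })
        catalogue["tables"].append(
            {"table": path.stem, "row_count": len(rows), "columns": columns}
        )
    return catalogue


# Demo on a throwaway directory holding one tiny synthetic table.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dim_customer.csv").write_text(
        "customer_id,segment\n1,retail\n2,\n", encoding="utf-8"
    )
    result = build_catalogue(tmp)

print(json.dumps(result, indent=2))
```

Because the result is a plain dict, writing it to disk is a single json.dump call, and anything downstream can read the same file.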

The output powers the React UI, but it's also machine-readable. Plug it into Airflow for automated profiling after every pipeline run. Feed it to Great Expectations for continuous validation. Export it to your existing governance tools.

The catalogue doesn't replace enterprise platforms like Collibra, Alation, or DataHub — it complements them. Those platforms excel at enterprise-wide governance, access control, collaboration workflows, and policy management. This engine handles the automated profiling and metadata generation that feeds into those systems. Every organization's governance needs are unique, and the best solutions often combine approaches.

The Full Stack

This Data Catalogue sits alongside the MDM Lakehouse as the governance layer.

Together they demonstrate a complete "Idea to Display" pipeline: from raw source system extracts to governed, documented, quality-scored, dashboard-ready analytics.

The Comparison

What's Next

This is open. The repository includes:

Python catalogue engine with auto-profiling, PII classification, lineage, and glossary
19 JSON metadata files (master catalogue, quality report, glossary, lineage map, 15 table profiles)
5-tab React catalogue UI
15/15 validation tests passing
Animated GIFs of all 5 views
6-slide presentation deck

If you're building a data platform and governance is an afterthought, it doesn't have to be. The catalogue can build itself — from the same data your pipelines already produce.

The data tells its own story. The catalogue makes sure everyone can read it.

#DataCatalogue #DataGovernance #DataLineage #MDM #DataQuality #FinancialServices #AI #DataEngineering #Metadata #PII #BusinessGlossary #Simultaneous #IdeaToDisplay #DataArchitecture #EnterpriseTech