The Information System Collapse - Part 3: The Architecture That Survives

From Collapse to Construction: Building Event-Driven Systems That Actually Work

Where We've Been

Part 1 exposed the delay problem: Your traditional ERP→ETL→Warehouse→BI stack takes 3 days to answer simple questions. By the time you get the answer, the opportunity is gone.

Part 2 revealed the source: ERP's two ancestral curses (Customization Chaos + Thousand-Table Labyrinth) force data teams to excavate instead of analyze.

Now comes the hard question: What do we build instead?

The Honest Starting Point

Let's be clear about what we're NOT doing:

❌ We're not "eliminating data warehouses" (too evangelical)
❌ We're not claiming "AI solves everything" (too magical)
❌ We're not proposing a rip-and-replace (too risky)

What we ARE doing:

✅ Building a parallel system that handles 80% of business questions faster
✅ Using proven technologies (Kafka, PostgreSQL, LLMs) in a new way
✅ Starting with one use case, proving value, then expanding
✅ Keeping your existing infrastructure running while we transition

This is engineering, not revolution.

The Core Insight: Events, Not States

The fundamental problem with ERP→Warehouse architecture: It stores STATES (final values) instead of EVENTS (what actually happened).

When your ERP records a sale, it stores:

Order ID: 12345
Customer ID: CUST_789
Amount: $500
Date: 2025-01-15

What it DOESN'T store:

Customer browsed 7 products before buying
Added item to cart, removed it, added it back
Hesitated at checkout for 3 minutes
Applied a discount code
Checkout page loaded slowly (4 seconds)

All that context is LOST.

And that context is exactly what you need to answer "why did this happen?"

The New Architecture: Event-Driven Intelligence

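Here's what we're building, as four layers:

Layer 1: Event Capture
Layer 2: Semantic Event Fabric
Layer 3: Domain Reasoning Nodes
Layer 4: Conversational Query Interface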

Let's break down each layer with actual technical details, not buzzwords.

Layer 1: Event Capture (The Foundation)

What Is An Event?

An event is an immutable record of something that happened, with full context.

Example Event Schema:
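A minimal sketch of what such an event could look like, assuming Python and a handful of illustrative field names (`event_type`, `payload`, `caused_by`, and the `sku` value are assumptions, not a fixed standard). The order values reuse the ERP example above. The vector embedding is left out here on the assumption that the storage layer attaches it later.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Event:
    event_type: str                   # e.g. "order_completed"
    payload: dict[str, Any]           # business context carried with the event
    caused_by: Optional[str] = None   # event_id of the triggering event, if any
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize the event for a bus or a store."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical events for the order from the ERP example.
cart = Event("cart_item_added", {"customer_id": "CUST_789", "sku": "SKU_1"})
order = Event(
    "order_completed",
    {
        "order_id": "12345",
        "customer_id": "CUST_789",
        "amount_usd": 500,
        "discount_code_applied": True,
        "checkout_load_seconds": 4,
    },
    caused_by=cart.event_id,
)
```

Note that `frozen=True` only blocks reassigning fields; the payload dict itself is still mutable in memory, so real immutability has to be enforced where the event is stored.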

Key Properties of Events

Immutable: Once written, never changed. To correct an error, you write a new event (like accounting).
Self-Contained: Every event has all the context it needs. No need to join 12 tables to understand it.
Causally Linked: Every event knows what caused it and what it might trigger.
Semantically Indexed: Each event has a vector embedding that captures its meaning, enabling semantic search.

Layer 2: Semantic Event Fabric (The Smart Storage)

This is where we store and index events in a way that makes them queryable by meaning, not just by field names.

Technical Components

Component 1: Event Store (PostgreSQL with TimescaleDB)

Why PostgreSQL?

Battle-tested reliability
Native JSON support for flexible event payloads
TimescaleDB extension for time-series optimization
pgvector extension for semantic search
ACID guarantees (unlike some NoSQL solutions)

Schema:
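One possible shape for the event table, sketched as PostgreSQL DDL held in Python strings. The table name, column names, index names, and the 1536-dimension embedding are all assumptions for illustration. `apply_schema` is a hypothetical helper that expects any DB-API connection (for example one from psycopg) to a database where the TimescaleDB and pgvector extensions are installed.

```python
# PostgreSQL DDL kept as plain strings so it can be reviewed and versioned.
SCHEMA_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS timescaledb",
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS events (
        event_id     UUID        NOT NULL,
        event_type   TEXT        NOT NULL,
        occurred_at  TIMESTAMPTZ NOT NULL,
        caused_by    UUID,                 -- causal link to an earlier event
        payload      JSONB       NOT NULL, -- flexible, self-contained context
        embedding    vector(1536),         -- dimension is an assumption
        PRIMARY KEY (event_id, occurred_at)
    )
    """,
    # Hypertables partition by time; the time column must be in the primary key.
    "SELECT create_hypertable('events', 'occurred_at', if_not_exists => TRUE)",
    "CREATE INDEX IF NOT EXISTS events_type_time_idx "
    "ON events (event_type, occurred_at DESC)",
    "CREATE INDEX IF NOT EXISTS events_payload_idx ON events USING GIN (payload)",
]


def apply_schema(conn) -> int:
    """Run each statement on a DB-API connection and return how many ran."""
    cur = conn.cursor()
    for statement in SCHEMA_STATEMENTS:
        cur.execute(statement)
    conn.commit()
    return len(SCHEMA_STATEMENTS)
```

A vector index on `embedding` is deliberately omitted; whether and how to index embeddings depends on data volume and is worth testing separately.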

Component 2: Event Bus (Apache Kafka or Redpanda)

Why Kafka/Redpanda?

Handles millions of events per second
Guaranteed ordering within partitions
Replay capability (reprocess events if needed)
Multiple consumers can read the same stream

Topic Structure:
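A sketch of one way to lay out topics, assuming one topic per business domain and the customer ID as the message key. The topic names and partition counts are hypothetical. The point it illustrates is the ordering guarantee: events with the same key always land in the same partition, so one customer's events stay in order.

```python
import hashlib

# Hypothetical topic layout: topic name -> partition count.
TOPICS = {
    "events.customer": 12,
    "events.orders": 12,
    "events.inventory": 6,
}


def partition_for(key: str, topic: str) -> int:
    """Map a key to a partition with a stable hash.

    Real Kafka clients ship their own partitioner; MD5 is used here only to
    illustrate the same-key-same-partition idea.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % TOPICS[topic]


first = partition_for("CUST_789", "events.customer")
second = partition_for("CUST_789", "events.customer")
# first == second: every event keyed by CUST_789 goes to one partition.
```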

Component 3: Stream Processors (Kafka Streams or Flink)

Real-time enrichment and aggregation:
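To make the aggregation half concrete, here is a plain-Python sketch of a tumbling-window count. It is a stand-in for what a Kafka Streams or Flink job would do, not code for either framework, and the event shape (`ts` in epoch seconds plus `event_type`) is an assumption.

```python
from collections import Counter
from typing import Iterable


def tumbling_counts(events: Iterable[dict], window_seconds: int = 60) -> dict:
    """Count events per (window_start, event_type).

    Windows are fixed-size and non-overlapping; each event falls into the
    window whose start is its timestamp rounded down to the window size.
    """
    counts: Counter = Counter()
    for event in events:
        window_start = event["ts"] - (event["ts"] % window_seconds)
        counts[(window_start, event["event_type"])] += 1
    return dict(counts)


sample = [
    {"ts": 0, "event_type": "page_view"},
    {"ts": 30, "event_type": "page_view"},
    {"ts": 59, "event_type": "cart_item_added"},
    {"ts": 61, "event_type": "page_view"},
]
result = tumbling_counts(sample)
```

A production processor would also have to handle late and out-of-order events, which this sketch ignores.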

Layer 3: Domain Reasoning Nodes (The Intelligence)

These are specialized query engines for specific business domains. They are not generic AI but focused, deterministic processors with LLM-enhanced reasoning.

Anatomy of a Domain Reasoning Node

Example: Customer Behavior Node
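A minimal sketch of such a node, assuming events arrive as dicts with `event_type`, `device`, and `session_id` keys (all hypothetical names). In practice the calculation would be a SQL query against the event store; it is done in memory here so the logic is easy to follow.

```python
from collections import defaultdict


class CustomerBehaviorNode:
    """Answers a fixed set of customer-behavior questions deterministically."""

    def __init__(self, events: list[dict]):
        self.events = events

    def cart_abandonment_rate_by_device(self) -> dict[str, float]:
        """Share of sessions with a cart add but no completed order, per device."""
        carts = defaultdict(set)
        orders = defaultdict(set)
        for e in self.events:
            if e["event_type"] == "cart_item_added":
                carts[e["device"]].add(e["session_id"])
            elif e["event_type"] == "order_completed":
                orders[e["device"]].add(e["session_id"])
        return {
            device: len(sessions - orders[device]) / len(sessions)
            for device, sessions in carts.items()
        }


node = CustomerBehaviorNode([
    {"event_type": "cart_item_added", "device": "mobile", "session_id": "s1"},
    {"event_type": "cart_item_added", "device": "mobile", "session_id": "s2"},
    {"event_type": "order_completed", "device": "mobile", "session_id": "s1"},
    {"event_type": "cart_item_added", "device": "desktop", "session_id": "s3"},
    {"event_type": "order_completed", "device": "desktop", "session_id": "s3"},
])
rates = node.cart_abandonment_rate_by_device()
```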

Why This Architecture Works

90% of queries are deterministic (fast SQL):

"Show me conversion funnel for last week"
"What's the cart abandonment rate by device?"
"How many orders today?"

10% of queries need synthesis (SQL + LLM):

"Why did mobile conversion drop yesterday?"
"Which customer segments are at risk?"
"What happened before high-value users churned?"

The LLM is only used for:

Understanding intent (parsing natural language)
Finding semantic patterns (when causality is needed)
Synthesizing explanations (turning data into narrative)

The LLM never does math. All calculations are SQL.

Layer 4: Conversational Query Interface

Users don't write SQL. They ask questions.

How It Works

Input: Natural language query
Output: Answer with evidence and confidence

Example Interaction:

Technical Implementation
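The flow can be sketched as a small pipeline: compute the facts first, then decide whether the question needs an LLM to explain them. Everything here is an assumption for illustration: the `Answer` shape, the keyword-based `needs_synthesis` check (a stand-in for LLM intent parsing), and the `run_sql` and `synthesize` callables, which are stubs rather than real database or API calls. Confidence scoring is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    evidence: dict   # the computed facts the text is based on
    used_llm: bool


def needs_synthesis(question: str) -> bool:
    """Crude intent check: 'why'-style questions go to the hybrid path."""
    q = question.lower()
    return q.startswith("why") or "what happened" in q


def answer_question(
    question: str,
    run_sql: Callable[[str], dict],
    synthesize: Callable[[str, dict], str],
) -> Answer:
    facts = run_sql(question)  # every number comes from the query layer
    if needs_synthesis(question):
        return Answer(synthesize(question, facts), facts, used_llm=True)
    return Answer(str(facts), facts, used_llm=False)


# Stubs standing in for the event store and the LLM client.
def fake_sql(question: str) -> dict:
    return {"orders_today": 42}


def fake_llm(question: str, facts: dict) -> str:
    return f"Explanation based on: {sorted(facts)}"


simple = answer_question("How many orders today?", fake_sql, fake_llm)
diagnostic = answer_question("Why did mobile conversion drop?", fake_sql, fake_llm)
```

The design choice worth noticing is that the synthesis step only ever receives precomputed facts, which keeps arithmetic out of the LLM.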

Real-World Example: E-commerce Flow

Let's trace a complete customer journey through the system.

Events Generated
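As an illustration, a journey like the one described earlier (browse, add to cart, remove, add back, apply a discount, buy) might produce a chain of events along these lines. The event IDs and type names are made up, and `causal_chain` is a hypothetical helper that walks the `caused_by` links backwards.

```python
# Hypothetical journey: each event points at the event that led to it.
JOURNEY = [
    {"id": "e1", "type": "product_viewed", "caused_by": None},
    {"id": "e2", "type": "cart_item_added", "caused_by": "e1"},
    {"id": "e3", "type": "cart_item_removed", "caused_by": "e2"},
    {"id": "e4", "type": "cart_item_added", "caused_by": "e3"},
    {"id": "e5", "type": "discount_code_applied", "caused_by": "e4"},
    {"id": "e6", "type": "order_completed", "caused_by": "e5"},
]


def causal_chain(events: list[dict], event_id: str) -> list[str]:
    """Walk caused_by links backwards and return event types, oldest first."""
    by_id = {e["id"]: e for e in events}
    chain = []
    seen = set()  # guards against accidental cycles in the links
    current = by_id.get(event_id)
    while current is not None and current["id"] not in seen:
        seen.add(current["id"])
        chain.append(current["type"])
        current = by_id.get(current["caused_by"])
    return list(reversed(chain))


path_to_order = causal_chain(JOURNEY, "e6")
```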

Query Examples

Query 1: Simple Aggregation (Deterministic)
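A sketch of the "How many orders today?" kind of query. It uses Python's built-in sqlite3 as a stand-in so it runs anywhere; the table layout and sample rows are made up, and on PostgreSQL the date filter would more naturally be a timestamp range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_type TEXT, occurred_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("order_completed", "2025-01-15T09:00:00", 500.0),
        ("order_completed", "2025-01-15T14:30:00", 120.0),
        ("order_completed", "2025-01-14T18:00:00", 80.0),
        ("page_view", "2025-01-15T09:01:00", None),
    ],
)


def orders_on(day: str) -> tuple[int, float]:
    """Return (order count, revenue) for one calendar day (YYYY-MM-DD)."""
    row = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM events "
        "WHERE event_type = 'order_completed' AND substr(occurred_at, 1, 10) = ?",
        (day,),
    ).fetchone()
    return row[0], row[1]


count, revenue = orders_on("2025-01-15")
```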

Query 2: Funnel Analysis (Deterministic)
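A funnel can be sketched as a count of distinct sessions that reached each step. As before, sqlite3 stands in for the event store and the step names and rows are illustrative. This simplified version does not check that the steps happened in order within a session.

```python
import sqlite3

FUNNEL_STEPS = [
    "product_viewed",
    "cart_item_added",
    "checkout_started",
    "order_completed",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (session_id TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("s1", "product_viewed"), ("s1", "cart_item_added"),
        ("s1", "checkout_started"), ("s1", "order_completed"),
        ("s2", "product_viewed"), ("s2", "cart_item_added"),
        ("s3", "product_viewed"),
        ("s3", "product_viewed"),  # repeat views count once per session
    ],
)


def funnel() -> list[tuple[str, int]]:
    """Distinct sessions that reached each step, in funnel order."""
    result = []
    for step in FUNNEL_STEPS:
        (n,) = conn.execute(
            "SELECT COUNT(DISTINCT session_id) FROM events WHERE event_type = ?",
            (step,),
        ).fetchone()
        result.append((step, n))
    return result


steps = funnel()
```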

Query 3: Diagnostic (Hybrid - SQL + Vector Search + LLM)
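A sketch of the hybrid path: rank events by embedding similarity, then hand the LLM a prompt containing metrics that were already computed in SQL. The two-dimensional embeddings, event summaries, and metric values are toy data, and no LLM is actually called. With pgvector, the ranking step would typically be an `ORDER BY embedding <=> query_vector` clause rather than Python.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_similar(query_vec: list[float], events: list[dict], k: int = 2) -> list[dict]:
    """Events ranked by cosine similarity to the query vector."""
    ranked = sorted(
        events, key=lambda e: cosine(query_vec, e["embedding"]), reverse=True
    )
    return ranked[:k]


def build_prompt(question: str, metrics: dict, evidence: list[dict]) -> str:
    """Assemble the text for the LLM; numbers are precomputed, not asked for."""
    lines = [f"Question: {question}", "Metrics (computed in SQL):"]
    lines += [f"- {name}: {value}" for name, value in metrics.items()]
    lines.append("Related events:")
    lines += [f"- {e['summary']}" for e in evidence]
    return "\n".join(lines)


# Toy data: real embeddings have hundreds of dimensions.
events = [
    {"summary": "checkout page loaded slowly", "embedding": [1.0, 0.0]},
    {"summary": "discount code applied", "embedding": [0.0, 1.0]},
    {"summary": "checkout timeout on mobile", "embedding": [0.9, 0.1]},
]
evidence = top_similar([1.0, 0.0], events)
prompt = build_prompt(
    "Why did mobile conversion drop yesterday?",
    {"mobile_conversion_yesterday": 0.021, "mobile_conversion_prior_week": 0.034},
    evidence,
)
```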

Infrastructure Requirements (Real Numbers)

For a mid-sized e-commerce company (1M visitors/month, 50K orders/month):

Event Volume

~50M events/month
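~20 events/second averaged around the clock (50M events over roughly 2.6M seconds in a 30-day month), or ~60 events/second if traffic is concentrated in about 8 busy hours a day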
~300 events/second peak

Infrastructure

Event Store (PostgreSQL + TimescaleDB):

3-node cluster (primary + 2 replicas)
16 vCPU, 64GB RAM per node
2TB SSD storage (with compression)
Cost: ~$2,000/month (AWS RDS equivalent)

Event Bus (Kafka or Redpanda):

3-node cluster
8 vCPU, 32GB RAM per node
1TB SSD per node
Cost: ~$1,500/month

Domain Reasoning Nodes:

4 nodes (customer, revenue, inventory, product)
8 vCPU, 32GB RAM per node
Cost: ~$1,200/month

LLM API (Claude Sonnet):

~10K queries/month
~80% deterministic (no LLM needed), a more conservative split than the 90/10 rule of thumb above
~2K queries need LLM synthesis
Cost: ~$500/month

Vector Embeddings (OpenAI):

Batch process 50M events/month
Cost: ~$400/month

Total Infrastructure: ~$5,600/month

Compare to Traditional Stack

Traditional (Snowflake + dbt + Fivetran + Looker):

Snowflake: $3,000/month
Fivetran: $1,500/month
dbt Cloud: $500/month
Looker: $1,000/month
Data team time: 60% on maintenance
Total: $6,000/month + massive time waste

Event-driven stack: $5,600/month + an estimated 10% of data team time on maintenance
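The totals are easy to check by adding up the line items quoted above. The dictionary keys are just labels for this sketch.

```python
# Monthly costs in USD, taken from the line items in this section.
event_driven = {
    "event_store": 2000,
    "event_bus": 1500,
    "domain_nodes": 1200,
    "llm_api": 500,
    "embeddings": 400,
}
traditional = {
    "snowflake": 3000,
    "fivetran": 1500,
    "dbt_cloud": 500,
    "looker": 1000,
}

event_driven_total = sum(event_driven.values())
traditional_total = sum(traditional.values())
monthly_difference = traditional_total - event_driven_total

print(event_driven_total, traditional_total, monthly_difference)
# prints: 5600 6000 400
```

On these figures the tooling bill differs by only about $400/month, so the comparison rests mainly on how much team time goes to maintenance.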

Migration Strategy: 90-Day Proof of Concept

Week 1-2: Event Capture POC

Goal: Prove we can capture events from existing systems

Tasks:

Set up Kafka cluster
Write event producers for 1 source system (e.g., e-commerce platform)
Capture 1 event type (e.g., order_completed)
Store in PostgreSQL

Success metric: 100K events captured and stored

Week 3-4: Build First Domain Node

Goal: Answer 1 business question faster than current system

Tasks:

Build Revenue Node
Implement deterministic queries (simple aggregations)
Compare results with existing warehouse

Success metric: "Daily revenue by product" query: 0.5s (vs 3 minutes in warehouse)

Week 5-6: Add Vector Search + Semantic Layer

Goal: Enable semantic queries

Tasks:

Generate embeddings for all events
Implement vector similarity search
Test semantic queries

Success metric: "Find orders similar to this one" works accurately

Week 7-8: Conversational Interface

Goal: Natural language queries work

Tasks:

Integrate Claude API
Build intent parser
Implement answer synthesis
Test with 20 real business questions

Success metric: 90%+ accuracy on test questions

Week 9-10: Hybrid Query Execution

Goal: Handle complex "why" questions

Tasks:

Implement causal chain analysis
Build cross-domain query capability
Test diagnostic queries

Success metric: "Why did conversion drop?" answered in <10 seconds with causality

Week 11-12: Pilot with Real Users

Goal: Prove business value

Tasks:

Give access to 5 business users
Track: queries asked, time saved, satisfaction
Compare accuracy vs traditional dashboards

Success metrics:

50+ queries asked
80%+ user satisfaction
100x faster than traditional method
Demonstrate ROI

What Makes This Different

Not Data Warehouse 2.0

This isn't "a faster warehouse." It's a different paradigm:

Old: Store aggregated states, visualize, human interprets
New: Store events, synthesize answers, explain causality

Not "AI Does Everything"

90% is deterministic SQL (fast, accurate, cheap)
10% uses LLM (for understanding intent and synthesis)

The intelligence is in the architecture, not just the AI.

Not Rip-and-Replace

Your existing warehouse keeps running. New system handles new use cases. Migrate gradually as you prove value.

Not Vendor Lock-In

Built mostly on open technologies:

PostgreSQL (open source)
Kafka (open source)
DuckDB (open source)
Claude API (proprietary, from Anthropic, but swappable for another LLM provider)

You own the infrastructure.

The Path Forward

Months 1-3: POC (prove it works)
Months 4-6: Pilot (10-20 users, 20% of queries)
Months 7-12: Scale (50% of queries)
Year 2: Primary system (80% of queries)
Year 3: Warehouse becomes archive only

You don't rip out the old system. You build the new one alongside it. You prove value before betting the company.
