▸ AI ENGINEERING · the DE-to-AIE bridge

AI engineering — from data pipelines to LLM-native systems.

The interview track every data engineer is being asked to add in 2026. RAG architecture, vector databases, embedding pipelines, fine-tuning vs prompting, eval pipelines, agent orchestration, cost & latency at scale. The shape rhymes with DE — but the failure modes, the eval rigor, and the cost model are different. This page is the bridge.

The DE-to-AIE shape map — what transfers and what doesn't
RAG architecture — retrieval, embeddings, vector DBs
Eval pipelines — the hardest unsolved problem
Fine-tuning vs prompting vs RAG — when each wins
Agent orchestration — tool calling, planning, memory
MCP (Model Context Protocol) — the agent integration standard
AI system design — the L5+ whiteboard
Cost & latency — the production-grade trade-offs
The 90-second articulation script

§ 00 — Practise

Practise the AI Engineering track.

150+ curated questions across 20 modules — LLM fundamentals, tokens & context windows, prompt & context design, model selection & routing, cost optimization, caching, RAG, embeddings & vector DBs, tool-calling, agents, memory, evaluation, hallucination & grounding, guardrails & security, responsible AI, fine-tuning vs RAG vs prompting, architecture, reliability, product metrics and AI-assisted coding. Filter by topic, level and role; bookmark; and track your interview-readiness. Progress saves on this device — no sign-up.

§ 01 — The shape map

DE-to-AIE — what transfers and what doesn't.

If you've built data pipelines, you have ~70% of the AI engineering skill set. The transferable pieces:

DE skill	AIE analogue	What's the same
ETL pipelines	Embedding pipelines	Schedule, retry, partial-failure recovery, idempotency, lineage
Schema migrations	Embedding model upgrades	Versioning, backfills, dual-write windows
Streaming aggregations	Online RAG indexing	Watermarks, eventual consistency, hot-key skew (popular docs)
Hot-shard handling	Vector DB hot partitions	Salting, isolation, per-tenant rate limits (see Hot Shards)
Cost monitoring	Token / inference cost	Per-tenant attribution, budget alerts, rate limiting

What's new in AIE and where DE intuition fails:

Non-deterministic outputs — same input, different output. Evals take over the role that unit tests play for deterministic code.
The data is the model — embedding drift, semantic search quality, retrieval precision/recall are first-class metrics.
Cost-per-call is the new latency-per-call — every architectural choice gets weighed against $/token.
The vendor matrix changes monthly — OpenAI, Anthropic, Mistral, Meta (Llama), Cohere and Google (Gemini) keep moving the cost/quality frontier, so any price or benchmark you memorise goes stale within a quarter.

· · ·

§ 02 — RAG architecture

RAG — retrieval-augmented generation done right.

The default architecture for "give the LLM context from our docs." Three pieces that look simple and aren't:

A. The indexing pipeline

raw_docs → chunker → embedder → vector_db (+ metadata store)
                    ↓
                 BM25 / sparse index (for hybrid retrieval)

The chunking question dominates retrieval quality. Small chunks (around 512 tokens) tend to lose context; large ones (around 4096 tokens) bury the needle. A common production starting point: recursive character splitting with 800-1200 token chunks and a 100-token overlap, plus parent-document retrieval (index the chunks for matching, return the full parent doc for context).
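As a rough illustration of overlap and parent-document retrieval, here is a minimal sketch. It uses a plain sliding window over whitespace-separated tokens, which is a simplification and not the recursive character splitter named above; a real pipeline would count tokens with the embedding model's tokenizer. The names `Chunk`, `chunk_tokens` and `parents_for_hits` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    parent_id: str  # id of the full document this chunk came from
    index: int      # position of the chunk within its parent
    text: str


def chunk_tokens(parent_id: str, text: str, size: int = 1000, overlap: int = 100) -> list[Chunk]:
    """Sliding window over whitespace tokens (assumption: stand-in for a real tokenizer)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    tokens = text.split()
    step = size - overlap
    chunks: list[Chunk] = []
    for i, start in enumerate(range(0, len(tokens), step)):
        chunks.append(Chunk(parent_id, i, " ".join(tokens[start:start + size])))
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def parents_for_hits(hits: list[Chunk], docs: dict[str, str]) -> list[str]:
    """Parent-document retrieval: match on chunks, return each parent doc once, in hit order."""
    seen: list[str] = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.append(hit.parent_id)
    return [docs[parent_id] for parent_id in seen]
```

Keeping `parent_id` on every chunk is what makes the "match small, return large" trick possible: the vector index stores chunks, while the metadata store resolves them back to whole documents.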

B. Retrieval at query time

Pattern	What it solves	Cost
Dense vector only	Semantic match (synonyms, paraphrases)	1 embedding call + vector search
Hybrid (dense + BM25)	Catches exact tokens (codes, names, jargon) that vectors blur	+ sparse index call; fuse or rerank the two result lists
Reranker on top-K	Cross-encoder reorders retrieved chunks by query relevance	+ reranker call (often 10-100ms)
HyDE / query expansion	Generate hypothetical answer first, embed THAT for retrieval	+ extra LLM call
Multi-query / fusion	3 query reformulations, RRF-fuse the results	1 LLM call for the reformulations + 3× embedding + retrieval
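Reciprocal rank fusion (RRF), mentioned in the table above, is short enough to sketch. The version below assumes each retriever returns an ordered list of document ids; `rrf_fuse` is a hypothetical name, and k = 60 is the commonly used smoothing constant, not something this page prescribes.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each doc as the sum of 1 / (k + rank) over every ranking it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # e.g. from vector search
sparse_hits = ["b", "d", "a"]  # e.g. from BM25
fused = rrf_fuse([dense_hits, sparse_hits])
```

RRF only uses ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.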

C. Vector database choice

Choice	Best when	Watch out
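pgvector	You already use Postgres; < 10M vectors	`ivfflat` lists are fixed at build time, so recall can degrade as data changes until you reindex; `hnsw` copes better with inserts but builds slower and uses more memory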
Pinecone / Weaviate / Qdrant	Managed; 10M-1B vectors; multi-tenant	Cost scales with vector count + dim
FAISS / Chroma local	Dev, single-node, < 1M vectors	No managed service; persistence is your job (a FAISS index lives in memory unless you save and reload it)
Elastic / OpenSearch	You already run it; want hybrid out of the box	kNN quality varies vs purpose-built engines

The L5 RAG trap. "Just dump docs into a vector DB" produces a demo that fails on production queries. The real questions: how do you handle updates (re-embed on doc change), deletes (soft-delete + reindex), multi-tenant isolation (namespace per tenant, never share index), and quality drift (eval pipeline on a held-out set every release). Most candidates skip these.
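One way to make the update and delete questions concrete is a content-hash sync plan per tenant. This is a sketch under assumptions: `indexed_hashes` stands in for whatever metadata store records what was last embedded, keyed by (tenant, doc_id), and `plan_sync` is a hypothetical helper, not a vector DB API.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(
    tenant: str,
    source_docs: dict[str, str],
    indexed_hashes: dict[tuple[str, str], str],
) -> tuple[list[str], list[str]]:
    """Return (doc ids to re-embed, doc ids to soft-delete) for one tenant's namespace."""
    to_embed = [
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get((tenant, doc_id)) != content_hash(text)  # new or changed
    ]
    to_delete = [
        doc_id
        for (indexed_tenant, doc_id) in indexed_hashes
        if indexed_tenant == tenant and doc_id not in source_docs  # gone from the source
    ]
    return sorted(to_embed), sorted(to_delete)
```

Because the plan is keyed by tenant, one tenant's sync can never touch another tenant's entries, and unchanged documents are skipped, which keeps re-embedding cost proportional to what actually changed.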

· · ·

§ 03 — Evals

Eval pipelines — the hardest unsolved problem.

Exact-match unit tests break down because outputs are non-deterministic. The eval pyramid:

Layer	What it catches	Cost
Reference-based metrics	Exact match, F1, BLEU, ROUGE	Cheap — only works for tasks with known answers
Embedding similarity	Semantic similarity to a golden answer	Cheap; threshold tuning is hard
LLM-as-judge	"GPT-4 grades GPT-3.5's output" — pairwise or absolute	Expensive; judge bias toward verbose/familiar outputs
Human eval	The ground truth; everything else is calibrated against it	Slowest, costliest; required for launch decisions
Online metrics (production)	User thumbs-up/down, follow-up rate, task completion	Free at scale but lags — by the time you see it, you've shipped a bad model

The golden test set

A held-out set of 100-1000 representative queries with expected behavior (not exact outputs). For each, log the model's response, run the LLM-as-judge against a rubric, and store per-query scores. Track the aggregate across versions: undetected quality regressions are one of the most common reasons model upgrades fail in production.
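Tracking scores across versions can be as simple as the comparison below. It is a sketch that assumes judge scores are already stored as floats in [0, 1] keyed by query id; `compare_runs` and the 0.05 tolerance are hypothetical choices, not a standard.

```python
from statistics import mean


def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict:
    """Compare per-query judge scores of two versions on the queries both runs cover."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no overlapping query ids between runs")
    return {
        "baseline_mean": mean(baseline[q] for q in shared),
        "candidate_mean": mean(candidate[q] for q in shared),
        # Queries whose score dropped by more than the tolerance.
        "regressed": [q for q in shared if candidate[q] < baseline[q] - tolerance],
    }


report = compare_runs(
    {"q1": 0.9, "q2": 0.8, "q3": 0.7},
    {"q1": 0.9, "q2": 0.6, "q3": 0.75},
)
```

Reporting the regressed query ids alongside the means matters: an unchanged average can hide a handful of queries that got much worse.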

RAG-specific evals

Metric	What it measures
Retrieval recall@K	Did the right chunk appear in the top-K results?
Context precision	Of the K chunks retrieved, how many were actually relevant?
Faithfulness	Does the answer cite information that's actually in the context? (hallucination check)
Answer relevance	Did the answer actually address the question? A response can be factually correct and still miss what was asked.
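The two retrieval metrics in the table reduce to set arithmetic once you have labelled relevant chunks per query. The sketch below uses the plain definitions from the table; eval libraries may use rank-weighted variants, so treat the function names and formulas here as assumptions for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that show up in the top-k results."""
    if not relevant:
        raise ValueError("need at least one labelled relevant chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def context_precision(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


retrieved = ["c1", "c7", "c3", "c9"]
relevant = {"c3", "c4"}
```

With a single labelled chunk per query, `recall_at_k` collapses to the hit-or-miss question in the table: did the right chunk appear in the top-K?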

Ragas, Arize Phoenix, LangSmith and Braintrust are the tool names interviewers expect you to know.

· · ·

§ 04 — Adaptation strategies

Fine-tuning vs prompting vs RAG — when each wins.

Method	Wins when	Loses when
Prompt engineering	Task fits in context; rapid iteration; no training infra	Need consistent format; specialized vocabulary; high volume cost
Few-shot prompting	Task pattern recognizable from 2-10 examples	Context window blows up; per-call cost climbs
RAG	Knowledge is dynamic; need citation; small org	Reasoning over many docs at once; latency budget tight
Fine-tuning (full or LoRA)	Format consistency; specialized domain; high QPS where prompt cost matters	Knowledge cutoff baked in; needs training data; eval pipeline
Continued pretraining	Genuinely novel domain (legal, medical, codebase)	$$$ and skill; usually overkill

The 2026 default. Start with prompting + RAG. Only fine-tune when you've measured a specific failure mode that prompting can't fix (format drift, latency at high QPS, cost at scale). LoRA on a 7B-13B open model is the modern fine-tuning sweet spot — small training cost, fast inference, deployable in-house.

· · ·

§ 05 — Agents

Agent orchestration — tool calling, planning, memory.

"Agents" = LLMs with tools (functions they can call). The orchestration patterns:

Pattern	Description	Use when
Single-shot tool call	One LLM call, one tool invocation, done	Simple lookups (weather, calculator)
ReAct loop	Think → tool → observe → think → ... until done	Multi-step research, debugging tasks
Plan-and-execute	Generate full plan upfront, then execute steps	Long-horizon tasks where mid-loop drift hurts
Multi-agent	Specialist agents (researcher, coder, critic) cooperating	Tasks decomposable by role; expensive
Tree of thoughts	Explore N branches, prune, retry	Reasoning tasks with verifiable subgoals

Memory architecture

Conversation memory — sliding window or summary; tokens cost money
Vector memory — embed turns, retrieve relevant past context (long sessions)
Structured memory — extract facts (key-value), store in DB; refresh on use
Episodic memory — full session traces for replay/debugging
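The sliding-window flavour of conversation memory can be sketched in a few lines. This assumes a token budget and a pluggable counter; the default counter splits on whitespace, which is only a stand-in for the model's real tokenizer, and `sliding_window` is a hypothetical name.

```python
from typing import Callable


def sliding_window(
    turns: list[str],
    budget: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> list[str]:
    """Keep the most recent turns whose combined token count fits within the budget."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break  # everything older than this turn is dropped
        kept.append(turn)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = ["a b c", "d e", "f g h i", "j"]
window = sliding_window(history, budget=6)
```

A summary-based memory would replace the dropped prefix with a compressed summary instead of discarding it, trading an extra LLM call for longer effective recall.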

The agent failure mode. Loops. ReAct agents can spin: tool returns ambiguous result → LLM retries → tool returns same result → loop. The fixes: a max-iteration cap, deterministic exit conditions (e.g. "if same tool called with same args twice, exit"), and a "scratchpad" the LLM has to reference before acting. Production agents should fail gracefully (return "I couldn't complete this") rather than spin forever.
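The loop guards described above fit into a small driver. In this sketch `decide` stands in for the LLM call (it receives the scratchpad and returns either a tool action or a final answer), and the action dictionary shape is an assumption made for illustration, not any framework's format.

```python
import json
from typing import Callable


def run_agent(decide: Callable[[list], dict], tools: dict[str, Callable], max_steps: int = 8) -> dict:
    """ReAct-style loop with an iteration cap and a repeated-call exit condition."""
    seen_calls: set[tuple[str, str]] = set()
    scratchpad: list[tuple[dict, object]] = []  # (action, observation) pairs the model sees
    for _ in range(max_steps):
        action = decide(scratchpad)
        if action["type"] == "final":
            return {"status": "ok", "answer": action["answer"], "steps": len(scratchpad)}
        # Deterministic exit: the same tool with the same args must not run twice.
        key = (action["tool"], json.dumps(action["args"], sort_keys=True))
        if key in seen_calls:
            return {"status": "gave_up", "reason": "repeated tool call", "steps": len(scratchpad)}
        seen_calls.add(key)
        observation = tools[action["tool"]](**action["args"])
        scratchpad.append((action, observation))
    return {"status": "gave_up", "reason": "max steps reached", "steps": len(scratchpad)}
```

Both exit paths return a structured "gave_up" result, so the caller can show a graceful failure message instead of hanging or burning tokens.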

· · ·

§ 06 — MCP

MCP — Model Context Protocol, the agent integration standard.

Anthropic's Model Context Protocol (open-sourced Nov 2024) is the JSON-RPC standard for connecting LLM agents to tools, data sources, and prompts. The "USB-C for AI" framing — one connector spec, every host (Claude Desktop, Cursor, IDE plugins, custom agents) speaks it, every server (file system, GitHub, databases, custom APIs) exposes capabilities through it.

The three primitives

Primitive	What it does	Who controls
Tools	Functions the LLM can call (with JSON schema for args)	LLM decides when to call
Resources	Read-only data the host can attach to context (files, DB rows, web pages)	User/host decides what to attach
Prompts	Templated prompt snippets the server provides	User triggers via slash command or UI

Why this matters for AIE interviews

Standardisation — before MCP, every agent shipped its own tool-calling protocol; integrations were N×M. Now it's N+M.
Discoverability — clients query tools/list and resources/list at startup; agents adapt without retraining.
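Security boundary — MCP servers run in their own process and talk JSON-RPC over stdio or HTTP. That gives a clean place to enforce auth and per-tool permissions, but the server still has to implement them; process isolation alone is not a security guarantee.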
Ecosystem velocity — by Dec 2025, hundreds of community MCP servers exist (GitHub, Slack, Google Drive, Postgres, Salesforce). The whiteboard answer often is "I'd expose this as an MCP server" rather than "build a custom integration."

The MCP architecture in 30 seconds

┌──────────────┐    JSON-RPC over stdio    ┌──────────────────┐
│   MCP Host   │──────────  or HTTP  ─────▶│  MCP Server      │
│ (Claude,     │                           │  • tools/list    │
│  Cursor,     │◀──────────────────────────│  • tools/call    │
│  IDE, ...)   │                           │  • resources/*   │
└──────────────┘                           │  • prompts/*     │
                                           └──────────────────┘
                                                      │
                                                      ▼
                                          (DB, API, filesystem, ...)

Host = the app the user interacts with. Server = the integration provider. Client = the host's MCP library connecting to a specific server. One host typically connects to multiple servers (one per integration).
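To make the wire format concrete, here is a toy in-process handler for the `tools/list` and `tools/call` methods. It is a sketch, not the official SDK: it skips the initialization handshake and the transport entirely, and the `query_accounts` tool and `ACCOUNTS` data are hypothetical.

```python
import json

# Hypothetical tool registry; each tool advertises a JSON Schema for its arguments.
TOOLS = {
    "query_accounts": {
        "description": "Hypothetical tool: look up CRM accounts by name prefix.",
        "inputSchema": {
            "type": "object",
            "properties": {"prefix": {"type": "string"}},
            "required": ["prefix"],
        },
    }
}
ACCOUNTS = ["Acme", "Acorn", "Globex"]  # stand-in for a real backend


def handle(request: dict) -> dict:
    """Answer one JSON-RPC 2.0 request with a result or an error object."""
    req_id, method = request["id"], request["method"]
    if method == "tools/list":
        tools = [{"name": name, **spec} for name, spec in TOOLS.items()]
        return {"jsonrpc": "2.0", "id": req_id, "result": {"tools": tools}}
    if method == "tools/call":
        params = request["params"]
        if params["name"] not in TOOLS:
            return {"jsonrpc": "2.0", "id": req_id,
                    "error": {"code": -32602, "message": "Unknown tool"}}
        prefix = params["arguments"]["prefix"]
        matches = [a for a in ACCOUNTS if a.startswith(prefix)]
        return {"jsonrpc": "2.0", "id": req_id,
                "result": {"content": [{"type": "text", "text": json.dumps(matches)}],
                           "isError": False}}
    return {"jsonrpc": "2.0", "id": req_id,
            "error": {"code": -32601, "message": "Method not found"}}


listing = handle({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
call = handle({"jsonrpc": "2.0", "id": 2, "method": "tools/call",
               "params": {"name": "query_accounts", "arguments": {"prefix": "Ac"}}})
```

The host never hard-codes the tool: it learns the name and argument schema from `tools/list`, which is the discoverability point made earlier.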

When you'd build vs use an MCP server

Scenario	Build your own MCP server when	Use existing when
Internal company data (CRM, data warehouse)	✓ — custom semantics, auth, permissions	—
Standard SaaS (GitHub, Slack, GDrive)	If the official one is missing a feature	✓ — official servers maintained by Anthropic / vendors
Database / SQL access	For your specific dialect or auth layer	✓ for vanilla Postgres / SQLite
Long-running background tasks	Yes — model as MCP tools with progress reporting	—

The interview pivot. If asked "how would you give your agent access to our company's Salesforce data", the L4 answer is "I'd write a tool that wraps the Salesforce API." The L5 answer is "I'd expose Salesforce as an MCP server with tools for query/update, resources for record attachment, and a permission boundary on the server side — so any MCP-compatible host (our internal agent, Claude Desktop for ad-hoc analyst use, Cursor for the CRM engineers) can connect with the same auth model."

· · ·

§ 07 — AI system design

AI system design — the L5+ whiteboard.

The interview prompts you'll get: "Design a Q&A chatbot over our docs", "Design a code review assistant", "Design a customer support agent". The shape is consistent — clarify, sketch, dive on bottleneck. The senior signal isn't the architecture; it's what you weigh.

The 8-box AI system reference architecture

┌─────────────┐   ┌──────────────┐   ┌──────────────┐
│   Client    │──▶│ API gateway  │──▶│ Auth / quota │
│  (web/app)  │   │  (rate limit)│   │ (per-tenant) │
└─────────────┘   └──────┬───────┘   └──────┬───────┘
                         │                  │
                         ▼                  ▼
                   ┌──────────────────────────────┐
                   │   Orchestrator (agent / app) │
                   │   • prompt builder           │
                   │   • routing                  │
                   │   • caching layer            │
                   └──────────────┬───────────────┘
                                  │
            ┌─────────────────────┼───────────────────────┐
            ▼                     ▼                       ▼
     ┌────────────┐       ┌────────────┐         ┌──────────────┐
     │ Retrieval  │       │ Tools /    │         │  LLM call    │
     │ (vector DB │       │ MCP servers│         │ (model API)  │
     │  + BM25)   │       │            │         │              │
     └─────┬──────┘       └─────┬──────┘         └──────┬───────┘
           ▼                    ▼                       ▼
      ┌──────────────────────────────────────────────────────┐
      │  Observability: traces, eval signals, cost per call  │
      └──────────────────────────────────────────────────────┘

The 5 questions to clarify before drawing anything

Read or write? Q&A → read; code-edit agent → write. Write demands transactions, audit, rollback.
How fresh must the data be? Real-time (CDC into vector store) vs nightly (batch embed).
What's the latency budget? Chatbot: 2-5s OK. Inline autocomplete: < 100ms TTFT.
What's the eval rubric? Faithfulness? Conversion? CSAT? Drives the metrics layer.
Multi-tenant? Drives namespace isolation, cost attribution, rate limit.

Pattern cookbook — common AI design prompts

Prompt	Headline answer	The trap
"Design a docs chatbot"	RAG with hybrid retrieval + reranker + LLM	Eval pipeline; doc updates; multi-tenancy
"Design a code review assistant"	MCP server exposing git + linter + test tools; agent reasoning over diffs	Cost per PR; false-positive fatigue
"Design customer support agent"	RAG over docs + KB + ticket history; agent with escalation tool	Hallucination on policy; PII handling
"Design an LLM-powered search"	BM25 first-pass → embedding rerank → query rewriting → LLM summarisation	Latency budget; query intent classification
"Design a meeting-notes summariser"	Whisper for ASR → chunked LLM summary → action-item extraction → calendar tool	Speaker diarisation; PII redaction
"Design AI image moderation"	Vision model + policy classifier + human review queue + appeal flow	Adversarial inputs; false-positive cost

· · ·

§ 08 — Cost & latency

Cost & latency — the production trade-offs.

Lever	Impact on cost	Impact on latency	Impact on quality
Model tier (GPT-4o → GPT-4o-mini)	10-30× cheaper	2-4× faster	Mostly fine for non-reasoning tasks
Prompt caching	50-90% off cached portion	Lower TTFT on cache hits	Neutral
Streaming	Neutral	Large TTFT win; total generation time unchanged	Neutral (UX win)
Batch API	~50% off	Results within up to 24h (offline only)	Neutral
Shorter context (RAG over long-context stuffing)	Lower per-call	Faster prefill	Depends on retrieval quality
Self-host open model	Higher fixed, lower per-call	Removes the external API hop; you own GPU capacity and queueing	Often 10-20% below frontier models, depending on task

The cost-attribution problem

Every LLM call costs $X. Multi-tenant SaaS needs to attribute spend per customer. Pattern: tag every API call with customer_id + feature_id, log to a metrics store, roll up nightly. The bill from OpenAI/Anthropic is a single line item; your internal attribution is what makes pricing decisions defensible.
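The tag-and-roll-up pattern can be sketched as follows. The price table is hypothetical (real per-token prices vary by vendor and change often), and the call-record fields are assumed names; only the `customer_id` + `feature_id` tagging comes from the text above.

```python
from collections import defaultdict

# Hypothetical prices in dollars per million tokens; replace with your vendor's rates.
PRICE_PER_MTOK = {
    "small": {"input": 0.15, "output": 0.60},
    "large": {"input": 2.50, "output": 10.00},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICE_PER_MTOK[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def rollup(calls: list[dict]) -> dict[tuple[str, str], float]:
    """Sum spend per (customer_id, feature_id) from tagged call logs."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for call in calls:
        key = (call["customer_id"], call["feature_id"])
        totals[key] += call_cost(call["model"], call["input_tokens"], call["output_tokens"])
    return dict(totals)


calls = [
    {"customer_id": "acme", "feature_id": "chat", "model": "small",
     "input_tokens": 1_000_000, "output_tokens": 500_000},
    {"customer_id": "acme", "feature_id": "chat", "model": "large",
     "input_tokens": 100_000, "output_tokens": 0},
    {"customer_id": "globex", "feature_id": "search", "model": "small",
     "input_tokens": 2_000_000, "output_tokens": 0},
]
spend = rollup(calls)
```

Input and output tokens are priced separately because output is typically the more expensive side, so a feature that generates long answers can dominate spend even with short prompts.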

Caching strategy

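Exact-match cache — same prompt → same response. Hit rate is often < 5%, since free-form prompts rarely repeat verbatim.
Semantic cache — embed the prompt; if the nearest neighbor's cosine similarity is > 0.95, return the cached response. Hit rates of 20-40% are often quoted, at the risk of serving a wrong answer to a near-duplicate question, so tune the threshold on real traffic.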
Prompt caching (API-side) — Anthropic/OpenAI cache the prompt prefix; subsequent calls 50-90% off and faster. Critical for long system prompts.
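A semantic cache is a nearest-neighbour lookup with a similarity threshold. The sketch below does a linear scan over in-memory entries and takes embeddings as plain lists of floats; a production version would use a vector index and a real embedding model. `SemanticCache` is a hypothetical class, and 0.95 is the threshold quoted above.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat zero vectors as dissimilar to everything


class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def put(self, embedding: list[float], response: str) -> None:
        self.entries.append((embedding, response))

    def get(self, embedding: list[float]) -> Optional[str]:
        """Return the cached response of the nearest entry if it clears the threshold."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None


cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer")
near = cache.get([0.99, 0.05])  # almost the same direction as the cached prompt
far = cache.get([0.0, 1.0])     # orthogonal, so it should miss
```

The threshold is the whole risk budget: lower it and hit rate climbs, but so does the chance of returning an answer to a question the user did not ask.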

· · ·

§ 09 — The interview script

The 90-second articulation script.

"AI engineering is data engineering with a non-deterministic compute layer. The pipelines look the same — extract, chunk, embed, index, serve — but the failure modes are different: hallucinations, embedding drift, retrieval misses, agent loops. The shape of the architecture is RAG (default for dynamic knowledge) vs fine-tuning (default for format / domain consistency) vs prompting (default for one-shot tasks); I'd start with prompting + RAG and only fine-tune when I have a measurable failure mode prompting can't fix. The eval pipeline is the hardest part — LLM-as-judge on a held-out set, RAG metrics (faithfulness, context precision, retrieval recall@K), and online metrics (thumbs, follow-up, task completion) feeding back into the training loop. Cost is the production constraint — model tiering, prompt caching, semantic cache, and per-tenant attribution let you ship at scale without the bill blowing up."

▸ Continue your prep

Design Deep-Dive

ML Engineering Prep

Feature stores, training/serving skew, model versioning, A/B infrastructure — the ML systems design round.

Design Deep-Dive

Hot Shards & Data Skew

Kafka skew, vector DB hot partitions, Spark salting — the skew patterns that bite AI systems hardest.

Design Deep-Dive

Streaming Architecture

Online RAG indexing, change-data feeds into vector stores, stateful streaming — real-time AI data pipelines.

Strategic Frame

Cloud & AI Age

AI as data engineer, governance at machine speed, unit economics — the senior frame for the AI engineering round.
