AI engineering — from data pipelines to LLM-native systems.
The interview track every data engineer is being asked to add in 2026. RAG architecture, vector databases, embedding pipelines, fine-tuning vs prompting, eval pipelines, agent orchestration, cost & latency at scale. The shape rhymes with DE — but the failure modes, the eval rigor, and the cost model are different. This page is the bridge.
Contents
- The DE-to-AIE shape map — what transfers and what doesn't
- RAG architecture — retrieval, embeddings, vector DBs
- Eval pipelines — the hardest unsolved problem
- Fine-tuning vs prompting vs RAG — when each wins
- Agent orchestration — tool calling, planning, memory
- MCP (Model Context Protocol) — the agent integration standard
- AI system design — the L5+ whiteboard
- Cost & latency — the production-grade trade-offs
- The 90-second articulation script
DE-to-AIE — what transfers and what doesn't.
If you've built data pipelines, you have ~70% of the AI engineering skill set. The transferable pieces:
| DE skill | AIE analogue | What's the same |
|---|---|---|
| ETL pipelines | Embedding pipelines | Schedule, retry, partial-failure recovery, idempotency, lineage |
| Schema migrations | Embedding model upgrades | Versioning, backfills, dual-write windows |
| Streaming aggregations | Online RAG indexing | Watermarks, eventual consistency, hot-key skew (popular docs) |
| Hot-shard handling | Vector DB hot partitions | Salting, isolation, per-tenant rate limits (see Hot Shards) |
| Cost monitoring | Token / inference cost | Per-tenant attribution, budget alerts, rate limiting |
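The idempotency row carries over almost unchanged. A minimal sketch, assuming chunks are keyed by content hash plus embedding-model version so that a retried run skips work it already did. `fake_embed`, `EMBEDDING_MODEL_VERSION`, and the in-memory `store` are hypothetical stand-ins for a real embedding call and vector store.

```python
import hashlib

EMBEDDING_MODEL_VERSION = "embed-v1"  # hypothetical version tag


def chunk_key(text: str, model_version: str = EMBEDDING_MODEL_VERSION) -> str:
    """Deterministic key: same text + same model version gives the same key."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_version}:{digest}"


def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding call (assumption, not a real model)."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def upsert_embeddings(store: dict, chunks: list[str]) -> int:
    """Embed only chunks whose key is missing; return the number of new embeds."""
    new = 0
    for text in chunks:
        key = chunk_key(text)
        if key in store:
            continue  # already embedded by an earlier (possibly partial) run
        store[key] = fake_embed(text)
        new += 1
    return new


store: dict[str, list[float]] = {}
first_run = upsert_embeddings(store, ["alpha", "beta"])
retry_run = upsert_embeddings(store, ["alpha", "beta", "gamma"])
print(first_run, retry_run, len(store))
```

Because the model version is part of the key, an embedding model upgrade naturally produces a new keyspace, which is what makes a backfill with a dual-write window possible.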
What's new in AIE and where DE intuition fails:
- Non-deterministic outputs — same input, different output. Evals replace unit tests.
- The data is the model — embedding drift, semantic search quality, retrieval precision/recall are first-class metrics.
- Cost-per-call is the new latency-per-call — every architectural choice gets weighed against $/token.
- The vendor matrix changes monthly — OpenAI, Anthropic, Mistral, Llama, Cohere, Gemini all have moved cost/quality frontiers in the last 90 days.
RAG — retrieval-augmented generation done right.
The default architecture for "give the LLM context from our docs." Three pieces that look simple and aren't:
A. The indexing pipeline
raw_docs → chunker → embedder → vector_db (+ metadata store)
↓
BM25 / sparse index (for hybrid retrieval)
The chunking question dominates retrieval quality. 512-token chunks lose context; 4096-token chunks bury the needle. A common production default: recursive character splitting into 800-1200 token chunks with a 100-token overlap, plus parent-document retrieval (small chunks are indexed for matching; the full parent document is returned for context).
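A rough sketch of overlap chunking with parent pointers. It uses a fixed sliding window over pre-split tokens rather than true recursive character splitting, and it assumes whitespace-separated words approximate tokens. `Chunk`, `chunk_tokens`, and the `parents` dict are illustrative names, not a library API.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    parent_id: str      # which document this chunk came from
    start: int          # index of the first token within the parent
    tokens: list[str]


def chunk_tokens(parent_id: str, tokens: list[str],
                 size: int = 1000, overlap: int = 100) -> list[Chunk]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(Chunk(parent_id, start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Parent-document retrieval: match on the chunk, return the whole parent.
parents = {"doc-1": " ".join(str(i) for i in range(2500))}
chunks = chunk_tokens("doc-1", parents["doc-1"].split())
matched = chunks[1]                      # pretend the vector search hit this chunk
context = parents[matched.parent_id]     # full document goes to the LLM
print(len(chunks), [len(c.tokens) for c in chunks])
```

In practice the parent would be capped to a section or page rather than the entire document, otherwise the context window fills up with one hit.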
B. Retrieval at query time
| Pattern | What it solves | Cost |
|---|---|---|
| Dense vector only | Semantic match (synonyms, paraphrases) | 1 embedding call + vector search |
| Hybrid (dense + BM25) | Catches exact tokens (codes, names, jargon) that vectors blur | + sparse index call; rerank both |
| Reranker on top-K | Cross-encoder reorders retrieved chunks by query relevance | + reranker call (often 10-100ms) |
| HyDE / query expansion | Generate hypothetical answer first, embed THAT for retrieval | + extra LLM call |
| Multi-query / fusion | 3 query reformulations, RRF-fuse the results | 3× embedding + retrieval |
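Reciprocal rank fusion (RRF), used for both the hybrid and multi-query rows, fits in a few lines. This sketch assumes each retriever returns an ordered list of document ids and uses the conventional constant k = 60; `rrf_fuse` is an illustrative name.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["a", "b", "c"]    # e.g. vector search results
sparse = ["b", "d", "a"]   # e.g. BM25 results
fused = rrf_fuse([dense, sparse])
print(fused)
```

RRF only looks at ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.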
C. Vector database choice
| Choice | Best when | Watch out |
|---|---|---|
| pgvector | You already use Postgres; < 10M vectors | IVFFlat needs a rebuild as data drifts from its trained centroids; tune ivfflat vs hnsw |
| Pinecone / Weaviate / Qdrant | Managed; 10M-1B vectors; multi-tenant | Cost scales with vector count + dim |
| FAISS / Chroma local | Dev, single-node, < 1M vectors | No managed persistence; save the index to disk yourself or rebuild on restart |
| Elastic / OpenSearch | You already run it; want hybrid out of the box | kNN quality varies vs purpose-built engines |
Eval pipelines — the hardest unsolved problem.
Unit tests that assert exact outputs don't work here because model outputs are non-deterministic. The eval pyramid:
| Layer | What it catches | Cost |
|---|---|---|
| Reference-based metrics | Exact match, F1, BLEU, ROUGE | Cheap — only works for tasks with known answers |
| Embedding similarity | Semantic similarity to a golden answer | Cheap; threshold tuning is hard |
| LLM-as-judge | "GPT-4 grades GPT-3.5's output" — pairwise or absolute | Expensive; judge bias toward verbose/familiar outputs |
| Human eval | The ground truth; everything else is calibrated against it | Slowest, costliest; required for launch decisions |
| Online metrics (production) | User thumbs-up/down, follow-up rate, task completion | Free at scale but lags — by the time you see it, you've shipped a bad model |
The golden test set
A held-out set of 100-1000 representative queries with expected behavior (not exact outputs). For each, log the model's response, run the LLM-as-judge against a rubric, store per-query scores. Track the aggregate across versions: quality regression is the single biggest reason model upgrades fail in production.
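Tracking the golden set across versions can be sketched as a per-query comparison. This assumes the judge scores are already stored as a query-id to score mapping in the range 0-1; `compare_versions` and the 0.05 tolerance are illustrative choices, not a standard.

```python
from statistics import mean


def compare_versions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.05) -> dict:
    """Compare per-query judge scores and list queries that got worse."""
    shared = sorted(baseline.keys() & candidate.keys())
    regressions = [q for q in shared if candidate[q] < baseline[q] - tolerance]
    return {
        "baseline_mean": mean(baseline[q] for q in shared),
        "candidate_mean": mean(candidate[q] for q in shared),
        "regressions": regressions,
    }


baseline = {"q1": 0.90, "q2": 0.80, "q3": 0.70}
candidate = {"q1": 0.92, "q2": 0.60, "q3": 0.70}
report = compare_versions(baseline, candidate)
print(report)
```

Looking at the per-query list matters because an aggregate can stay flat while a handful of important queries regress badly.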
RAG-specific evals
| Metric | What it measures |
|---|---|
| Retrieval recall@K | Did the right chunk appear in the top-K results? |
| Context precision | Of the K chunks retrieved, how many were actually relevant? |
| Faithfulness | Is every claim in the answer supported by the retrieved context? (hallucination check) |
| Answer relevance | Does the answer actually address the question? A factually correct answer can still miss it. |
RAGAS, Phoenix, LangSmith, Braintrust are the toolchain names interviewers expect you to know.
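The two retrieval metrics are simple set arithmetic once you have labelled relevant chunks. The sketch below uses the plain, unweighted definitions; some tools compute rank-weighted variants, so treat these functions as illustrative rather than as any tool's exact formula.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = retrieved[:k]
    return len(relevant.intersection(top_k)) / len(relevant)


def context_precision(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / len(top_k)


retrieved = ["c3", "c7", "c1", "c9"]   # ranked retriever output
relevant = {"c1", "c2"}                # human-labelled relevant chunks
recall = recall_at_k(retrieved, relevant, k=3)
precision = context_precision(retrieved, relevant, k=3)
print(recall, precision)
```

When each query has exactly one labelled chunk, recall@K reduces to the hit-or-miss question in the table above.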
Fine-tuning vs prompting vs RAG — when each wins.
| Method | Wins when | Loses when |
|---|---|---|
| Prompt engineering | Task fits in context; rapid iteration; no training infra | Need consistent format; specialized vocabulary; high volume cost |
| Few-shot prompting | Task pattern recognizable from 2-10 examples | Context window blows up; per-call cost climbs |
| RAG | Knowledge is dynamic; need citation; small org | Reasoning over many docs at once; latency budget tight |
| Fine-tuning (full or LoRA) | Format consistency; specialized domain; high QPS where prompt cost matters | Knowledge cutoff baked in; needs training data; eval pipeline |
| Continued pretraining | Genuinely novel domain (legal, medical, codebase) | $$$ and skill; usually overkill |
Agent orchestration — tool calling, planning, memory.
"Agents" = LLMs with tools (functions they can call). The orchestration patterns:
| Pattern | Description | Use when |
|---|---|---|
| Single-shot tool call | One LLM call, one tool invocation, done | Simple lookups (weather, calculator) |
| ReAct loop | Think → tool → observe → think → ... until done | Multi-step research, debugging tasks |
| Plan-and-execute | Generate full plan upfront, then execute steps | Long-horizon tasks where mid-loop drift hurts |
| Multi-agent | Specialist agents (researcher, coder, critic) cooperating | Tasks decomposable by role; expensive |
| Tree of thoughts | Explore N branches, prune, retry | Reasoning tasks with verifiable subgoals |
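The ReAct row is easiest to see as a loop with a step budget. In this sketch `scripted_model` is a hard-coded stand-in for an LLM call, and the `action:` / `observation:` / `final:` line format is an assumption made for the example, not a standard wire format.

```python
from typing import Callable

# Tool registry: name -> function taking a string argument.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}


def scripted_model(history: list[str]) -> str:
    """Stand-in for an LLM: request a tool first, then answer from its result."""
    observations = [h for h in history if h.startswith("observation:")]
    if not observations:
        return "action: add 2+3"
    return "final: " + observations[-1].removeprefix("observation: ")


def react_loop(question: str, model=scripted_model, max_steps: int = 5) -> str:
    """Think -> tool -> observe, repeated until a final answer or the budget."""
    history = [f"question: {question}"]
    for _ in range(max_steps):
        step = model(history)
        history.append(step)
        if step.startswith("final:"):
            return step.removeprefix("final:").strip()
        _, tool_name, tool_arg = step.split(" ", 2)
        history.append(f"observation: {TOOLS[tool_name](tool_arg)}")
    raise RuntimeError("step budget exhausted")


answer = react_loop("what is 2+3?")
print(answer)
```

The `max_steps` guard is the part interviewers tend to probe: without it, a confused model loops forever and the cost is unbounded.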
Memory architecture
- Conversation memory — sliding window or summary; tokens cost money
- Vector memory — embed turns, retrieve relevant past context (long sessions)
- Structured memory — extract facts (key-value), store in DB; refresh on use
- Episodic memory — full session traces for replay/debugging
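A sliding-window conversation memory can be sketched as "keep the newest turns that fit the budget". Token counting here is approximated by whitespace word count, which is an assumption; a real system would use the model's tokenizer.

```python
def trim_to_budget(turns: list[str], budget: int,
                   count_tokens=lambda text: len(text.split())) -> list[str]:
    """Keep the most recent turns whose combined token count fits the budget."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):          # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break                         # older turns no longer fit
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # restore chronological order


turns = [
    "hello there",
    "how can I help you today",
    "summarise my last order please",
]
window = trim_to_budget(turns, budget=11)
print(window)
```

The summary variant replaces the dropped turns with a short LLM-written recap instead of discarding them.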
MCP — Model Context Protocol, the agent integration standard.
Anthropic's Model Context Protocol (MCP, open-sourced Nov 2024) is a JSON-RPC 2.0 standard for connecting LLM agents to tools, data sources, and prompts. The common framing is "USB-C for AI": one connector spec that every host (Claude Desktop, Cursor, IDE plugins, custom agents) speaks and that every server (file system, GitHub, databases, custom APIs) uses to expose its capabilities.
The three primitives
| Primitive | What it does | Who controls |
|---|---|---|
| Tools | Functions the LLM can call (with JSON schema for args) | LLM decides when to call |
| Resources | Read-only data the host can attach to context (files, DB rows, web pages) | User/host decides what to attach |
| Prompts | Templated prompt snippets the server provides | User triggers via slash command or UI |
Why this matters for AIE interviews
- Standardisation — before MCP, every agent shipped its own tool-calling protocol; integrations were N×M. Now it's N+M.
- Discoverability — clients query tools/list and resources/list at startup; agents adapt without retraining.
- Security boundary — MCP servers run in their own process, talk JSON-RPC over stdio or HTTP. Clean isolation; per-tool permissioning.
- Ecosystem velocity — by Dec 2025, hundreds of community MCP servers exist (GitHub, Slack, Google Drive, Postgres, Salesforce). The whiteboard answer often is "I'd expose this as an MCP server" rather than "build a custom integration."
The MCP architecture in 30 seconds
┌──────────────┐    JSON-RPC over stdio    ┌──────────────────┐
│   MCP Host   │────────── or HTTP ───────▶│    MCP Server    │
│  (Claude,    │                           │  • tools/list    │
│   Cursor,    │◀──────────────────────────│  • tools/call    │
│   IDE, ...)  │                           │  • resources/*   │
└──────────────┘                           │  • prompts/*     │
                                           └──────────────────┘
                                                    │
                                                    ▼
                                        (DB, API, filesystem, ...)
Host = the app the user interacts with. Server = the integration provider. Client = the host's MCP library connecting to a specific server. One host typically connects to multiple servers (one per integration).
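To make the wire format concrete, here is a toy dispatcher for the two tool methods. It is a sketch of the JSON-RPC message shapes only: it skips the initialize handshake and transport, does not use any MCP SDK, and `get_row_count` with its canned reply is a hypothetical tool.

```python
import json

# Hypothetical tool registry; "handler" is internal and never sent to clients.
TOOLS = {
    "get_row_count": {
        "description": "Return the row count for a table (hypothetical tool).",
        "inputSchema": {
            "type": "object",
            "properties": {"table": {"type": "string"}},
            "required": ["table"],
        },
        "handler": lambda args: f"{args['table']}: 42 rows",
    }
}


def handle(request: dict) -> dict:
    """Answer a single JSON-RPC request for tools/list or tools/call."""
    method = request["method"]
    if method == "tools/list":
        result = {"tools": [
            {"name": name, "description": tool["description"],
             "inputSchema": tool["inputSchema"]}
            for name, tool in TOOLS.items()
        ]}
    elif method == "tools/call":
        params = request["params"]
        text = TOOLS[params["name"]]["handler"](params["arguments"])
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    else:
        # -32601 is the JSON-RPC 2.0 "Method not found" error code.
        return {"jsonrpc": "2.0", "id": request["id"],
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}


listing = handle({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
call = handle({"jsonrpc": "2.0", "id": 2, "method": "tools/call",
               "params": {"name": "get_row_count",
                          "arguments": {"table": "orders"}}})
print(json.dumps(call, indent=2))
```

The host passes the `inputSchema` from the listing to the model, which is how the model learns what arguments a tool accepts without any retraining.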
When you'd build vs use an MCP server
| Scenario | Build your own MCP server when | Use existing when |
|---|---|---|
| Internal company data (CRM, data warehouse) | ✓ — custom semantics, auth, permissions | — |
| Standard SaaS (GitHub, Slack, GDrive) | If the official one is missing a feature | ✓ — official servers maintained by Anthropic / vendors |
| Database / SQL access | For your specific dialect or auth layer | ✓ for vanilla Postgres / SQLite |
| Long-running background tasks | Yes — model as MCP tools with progress reporting | — |
AI system design — the L5+ whiteboard.
The interview prompts you'll get: "Design a Q&A chatbot over our docs", "Design a code review assistant", "Design a customer support agent". The shape is consistent — clarify, sketch, dive on bottleneck. The senior signal isn't the architecture; it's what you weigh.
The 7-box AI system reference architecture
┌─────────────┐   ┌──────────────┐   ┌──────────────┐
│   Client    │──▶│ API gateway  │──▶│ Auth / quota │
│  (web/app)  │   │ (rate limit) │   │ (per-tenant) │
└─────────────┘   └──────┬───────┘   └──────┬───────┘
                         │                  │
                         ▼                  ▼
              ┌──────────────────────────────┐
              │  Orchestrator (agent / app)  │
              │   • prompt builder           │
              │   • routing                  │
              │   • caching layer            │
              └──────────────┬───────────────┘
                             │
       ┌─────────────────────┼───────────────────────┐
       ▼                     ▼                       ▼
 ┌────────────┐        ┌────────────┐         ┌──────────────┐
 │ Retrieval  │        │  Tools /   │         │   LLM call   │
 │ (vector DB │        │ MCP servers│         │ (model API)  │
 │  + BM25)   │        │            │         │              │
 └─────┬──────┘        └─────┬──────┘         └──────┬───────┘
       ▼                     ▼                       ▼
   ┌──────────────────────────────────────────────────────┐
   │ Observability: traces, eval signals, cost per call   │
   └──────────────────────────────────────────────────────┘
The 5 questions to clarify before drawing anything
- Read or write? Q&A → read; code-edit agent → write. Write demands transactions, audit, rollback.
- How fresh must the data be? Real-time (CDC into vector store) vs nightly (batch embed).
- What's the latency budget? Chatbot: 2-5s OK. Inline autocomplete: < 100ms TTFT.
- What's the eval rubric? Faithfulness? Conversion? CSAT? Drives the metrics layer.
- Multi-tenant? Drives namespace isolation, cost attribution, rate limit.
Pattern cookbook — common AI design prompts
| Prompt | Headline answer | The trap |
|---|---|---|
| "Design a docs chatbot" | RAG with hybrid retrieval + reranker + LLM | Eval pipeline; doc updates; multi-tenancy |
| "Design a code review assistant" | MCP server exposing git + linter + test tools; agent reasoning over diffs | Cost per PR; false-positive fatigue |
| "Design customer support agent" | RAG over docs + KB + ticket history; agent with escalation tool | Hallucination on policy; PII handling |
| "Design an LLM-powered search" | Query rewriting → BM25 first-pass → embedding rerank → LLM summarisation | Latency budget; query intent classification |
| "Design a meeting-notes summariser" | Whisper for ASR → chunked LLM summary → action-item extraction → calendar tool | Speaker diarisation; PII redaction |
| "Design AI image moderation" | Vision model + policy classifier + human review queue + appeal flow | Adversarial inputs; false-positive cost |
Cost & latency — the production trade-offs.
| Lever | Impact on cost | Impact on latency | Impact on quality |
|---|---|---|---|
| Model tier (e.g. GPT-4o → GPT-4o-mini) | 10-30× cheaper | 2-4× faster | Mostly fine for non-reasoning tasks |
| Prompt caching | 50-90% off the cached portion of input tokens | Lower TTFT on cache hits | Neutral |
| Streaming | Neutral | Large win on time to first token; total generation time unchanged | Neutral (UX win) |
| Batch API | ~50% off | Results within ~24h (offline workloads only) | Neutral |
| Shorter context (RAG instead of stuffing whole documents into the prompt) | Lower per-call | Faster prefill | Depends on retrieval quality |
| Self-host open model | Higher fixed, lower per-call | Removes the external API round trip; speed depends on your own hardware | Often 10-20% below frontier; confirm on your own evals |
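The prompt-caching row is worth being able to compute on a whiteboard. A small worked example, assuming hypothetical prices of $3 per 1M input tokens and $15 per 1M output tokens and a 90% discount on cached input; it ignores any surcharge a provider may apply when the cache is first written.

```python
def call_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0, *,
              in_price: float, out_price: float,
              cache_discount: float = 0.0) -> float:
    """Cost in USD for one call; prices are USD per 1M tokens."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh = input_tokens - cached_tokens
    billable_input = fresh + cached_tokens * (1.0 - cache_discount)
    return (billable_input * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical prices; 10k-token prompt of which 8k is a stable system prefix.
no_cache = call_cost(10_000, 500, in_price=3.0, out_price=15.0)
with_cache = call_cost(10_000, 500, cached_tokens=8_000,
                       in_price=3.0, out_price=15.0, cache_discount=0.9)
print(round(no_cache, 4), round(with_cache, 4))
```

Under these assumptions the call drops from about $0.0375 to about $0.0159, and the remaining cost is dominated by output tokens, which caching does not touch.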
The cost-attribution problem
Every LLM call costs $X. Multi-tenant SaaS needs to attribute spend per customer. Pattern: tag every API call with customer_id + feature_id, log to a metrics store, roll up nightly. The bill from OpenAI/Anthropic is a single line item; your internal attribution is what makes pricing decisions defensible.
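The tagging-and-roll-up pattern can be sketched as follows. `CallRecord` and `roll_up` are illustrative names, and the in-memory list stands in for whatever metrics store the nightly job would actually read.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    customer_id: str
    feature_id: str
    cost_usd: float   # computed from token counts at call time


def roll_up(records: list[CallRecord]) -> dict[tuple[str, str], float]:
    """Sum spend per (customer, feature) pair."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for record in records:
        totals[(record.customer_id, record.feature_id)] += record.cost_usd
    return dict(totals)


records = [
    CallRecord("acme", "chat", 0.02),
    CallRecord("acme", "chat", 0.03),
    CallRecord("acme", "search", 0.01),
    CallRecord("globex", "chat", 0.05),
]
totals = roll_up(records)
print(totals)
```

Keeping `feature_id` in the key is what lets you answer "which feature is burning the budget" and not only "which customer".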
Caching strategy
- Exact-match cache — same prompt → same response. Hit rate < 5% in most products.
- Semantic cache — embed prompt; if nearest neighbor > 0.95 similarity, return cached. Hit rate 20-40%.
- Prompt caching (API-side) — Anthropic/OpenAI cache the prompt prefix; subsequent calls 50-90% off and faster. Critical for long system prompts.
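A semantic cache can be sketched with a linear scan and cosine similarity. The embeddings below are hand-written two-dimensional vectors purely for illustration, and `SemanticCache` is a hypothetical class; a real deployment would use an embedding model and a vector index instead of a list.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []  # (embedding, response)

    def get(self, embedding: list[float]):
        """Return the nearest cached response if it clears the threshold."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, embedding: list[float], response: str) -> None:
        self.entries.append((embedding, response))


cache = SemanticCache()
cache.put([1.0, 0.0], "answer A")
hit = cache.get([0.99, 0.05])    # near-duplicate prompt
miss = cache.get([0.7, 0.7])     # related but not close enough
print(hit, miss)
```

The threshold is the risky knob: set it too low and two prompts that differ in one important word share an answer.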
The 90-second articulation script.
Related design pages
- ML Engineering Interview Prep — feature stores, training/serving skew, model versioning, A/B
- Hot Shards & Data Skew — the same skew patterns apply to vector DB hot partitions
- Streaming Architecture — for online RAG indexing / change-data feeds into the vector store