▸ AI ENGINEERING · the DE-to-AIE bridge

AI engineering — from data pipelines to LLM-native systems.

The interview track every data engineer is being asked to add in 2026. RAG architecture, vector databases, embedding pipelines, fine-tuning vs prompting, eval pipelines, agent orchestration, cost & latency at scale. The shape rhymes with DE — but the failure modes, the eval rigor, and the cost model are different. This page is the bridge.

§ 01 — The shape map

DE-to-AIE — what transfers and what doesn't.

If you've built data pipelines, you have ~70% of the AI engineering skill set. The transferable pieces:

DE skill | AIE analogue | What's the same
ETL pipelines | Embedding pipelines | Schedule, retry, partial-failure recovery, idempotency, lineage
Schema migrations | Embedding model upgrades | Versioning, backfills, dual-write windows
Streaming aggregations | Online RAG indexing | Watermarks, eventual consistency, hot-key skew (popular docs)
Hot-shard handling | Vector DB hot partitions | Salting, isolation, per-tenant rate limits (see Hot Shards)
Cost monitoring | Token / inference cost | Per-tenant attribution, budget alerts, rate limiting

What's new in AIE and where DE intuition fails:

  • Non-deterministic outputs — same input, different output. Evals replace unit tests.
  • The data is the model — embedding drift, semantic search quality, retrieval precision/recall are first-class metrics.
  • Cost-per-call is the new latency-per-call — every architectural choice gets weighed against $/token.
  • The vendor matrix changes monthly — OpenAI, Anthropic, Mistral, Llama, Cohere, Gemini all have moved cost/quality frontiers in the last 90 days.
· · ·
§ 02 — RAG architecture

RAG — retrieval-augmented generation done right.

The default architecture for "give the LLM context from our docs." Three pieces that look simple and aren't:

A. The indexing pipeline

raw_docs → chunker → embedder → vector_db (+ metadata store)
                    ↓
                 BM25 / sparse index (for hybrid retrieval)

The chunking question dominates retrieval quality. Chunks that are too small (around 512 tokens) lose context; chunks that are too large (around 4096 tokens) bury the needle. A common production starting point: recursive character splitting with 800-1200 token chunks and 100 tokens of overlap, plus parent-document retrieval (chunks are indexed for matching; the full parent document is returned for context).
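As a rough illustration of overlap and parent-document retrieval, here is a minimal sketch. It uses a fixed sliding window over whitespace-separated words, which is simpler than a true recursive splitter, and it treats words as a stand-in for tokens. The names `Chunk`, `chunk_document`, and `parents_for_hits` are hypothetical, not taken from any library.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    parent_id: str  # id of the full document this chunk came from
    index: int      # position of the chunk within its parent
    text: str


def chunk_document(parent_id: str, text: str,
                   chunk_tokens: int = 1000, overlap: int = 100) -> list[Chunk]:
    """Split text into overlapping windows of whitespace-delimited words.

    Assumption: words approximate tokens. Swap in a real tokenizer
    before relying on the sizes.
    """
    if overlap >= chunk_tokens:
        raise ValueError("overlap must be smaller than chunk_tokens")
    words = text.split()
    step = chunk_tokens - overlap
    chunks = []
    for i, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_tokens]
        chunks.append(Chunk(parent_id, i, " ".join(window)))
        if start + chunk_tokens >= len(words):
            break  # this window already reached the end of the document
    return chunks


def parents_for_hits(hits: list[Chunk], parents: dict[str, str]) -> list[str]:
    """Parent-document retrieval: match on chunks, return each full parent once."""
    seen, docs = set(), []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            docs.append(parents[hit.parent_id])
    return docs
```

The overlap means the tail of one chunk repeats at the head of the next, so a sentence that straddles a boundary is still matchable from at least one chunk.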

B. Retrieval at query time

Pattern | What it solves | Cost
Dense vector only | Semantic match (synonyms, paraphrases) | 1 embedding call + vector search
Hybrid (dense + BM25) | Catches exact tokens (codes, names, jargon) that vectors blur | + sparse index call; rerank both
Reranker on top-K | Cross-encoder reorders retrieved chunks by query relevance | + reranker call (often 10-100ms)
HyDE / query expansion | Generate hypothetical answer first, embed THAT for retrieval | + extra LLM call
Multi-query / fusion | 3 query reformulations, RRF-fuse the results | 3× embedding + retrieval
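Reciprocal rank fusion (RRF), named in the last row, can be sketched in a few lines. Each document scores the sum of 1 / (k + rank) over every ranked list it appears in. The function name `rrf_fuse` is hypothetical, and k = 60 is a commonly used default rather than something the text above prescribes.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["a", "b", "c"]    # e.g. vector search results
sparse_hits = ["b", "d", "a"]   # e.g. BM25 results
fused = rrf_fuse([dense_hits, sparse_hits])
```

RRF only needs ranks, not raw scores, which is why it works for combining dense and sparse results whose scores are on different scales.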

C. Vector database choice

Choice | Best when | Watch out
pgvector | You already use Postgres; < 10M vectors | IVFFlat lists are fixed at build time, so recall can degrade as data grows until you reindex; HNSW handles inserts incrementally but builds slower and uses more memory
Pinecone / Weaviate / Qdrant | Managed; 10M-1B vectors; multi-tenant | Cost scales with vector count + dim
FAISS / Chroma local | Dev, single-node, < 1M vectors | No managed service; persistence is your job (a FAISS index must be explicitly saved and reloaded)
Elastic / OpenSearch | You already run it; want hybrid out of the box | kNN quality varies vs purpose-built engines

The L5 RAG trap. "Just dump docs into a vector DB" produces a demo that fails on production queries. The real questions: how do you handle updates (re-embed on doc change), deletes (soft-delete + reindex), multi-tenant isolation (namespace per tenant, never share index), and quality drift (eval pipeline on a held-out set every release). Most candidates skip these.
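The update, delete, and isolation concerns above can be sketched with an in-memory stand-in for a vector store. Everything here is hypothetical (`TenantIndex`, `Record`, and the toy `_embed` method are illustrative names, not any vendor's API); the point is the shape: hash the content so unchanged docs are not re-embedded, mark deletes instead of dropping rows, and key everything by tenant namespace.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Record:
    content_hash: str
    vector: list[float]
    deleted: bool = False


@dataclass
class TenantIndex:
    # tenant_id -> doc_id -> Record; tenants never share a namespace
    namespaces: dict[str, dict[str, Record]] = field(default_factory=dict)
    embed_calls: int = 0

    def _embed(self, text: str) -> list[float]:
        # Placeholder embedder (assumption): a real system calls a model here.
        self.embed_calls += 1
        return [float(len(text)), float(sum(map(ord, text)) % 97)]

    def upsert(self, tenant_id: str, doc_id: str, text: str) -> bool:
        """Re-embed only when content changed. Returns True if an embed ran."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        ns = self.namespaces.setdefault(tenant_id, {})
        existing = ns.get(doc_id)
        if existing and not existing.deleted and existing.content_hash == digest:
            return False  # unchanged: idempotent no-op
        ns[doc_id] = Record(digest, self._embed(text))
        return True

    def soft_delete(self, tenant_id: str, doc_id: str) -> None:
        record = self.namespaces.get(tenant_id, {}).get(doc_id)
        if record:
            record.deleted = True  # filtered at query time, purged on reindex

    def live_doc_ids(self, tenant_id: str) -> list[str]:
        ns = self.namespaces.get(tenant_id, {})
        return sorted(d for d, r in ns.items() if not r.deleted)
```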
· · ·
§ 03 — Evals

Eval pipelines — the hardest unsolved problem.

Exact-output unit tests don't work for generated text, because outputs are non-deterministic. The eval pyramid:

Layer | What it catches | Cost / caveat
Reference-based metrics | Exact match, F1, BLEU, ROUGE | Cheap; only works for tasks with known answers
Embedding similarity | Semantic similarity to a golden answer | Cheap; threshold tuning is hard
LLM-as-judge | "GPT-4 grades GPT-3.5's output", pairwise or absolute | Expensive; judge bias toward verbose/familiar outputs
Human eval | The ground truth; everything else is calibrated against it | Slowest, costliest; required for launch decisions
Online metrics (production) | User thumbs-up/down, follow-up rate, task completion | Free at scale but lags: by the time you see it, you've shipped a bad model
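For the cheapest layer, reference-based metrics, a minimal sketch of exact match and token-level F1 follows. It assumes lowercased whitespace tokenization, which is a simplification; real benchmarks usually also strip punctuation and articles. The function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive exact match after trimming whitespace."""
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against one reference."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A verbose but correct answer such as "paris is the capital" against the reference "paris" gets full recall but low precision, which is one reason these metrics only suit tasks with short, known answers.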

The golden test set

A held-out set of 100-1000 representative queries with expected behavior (not exact outputs). For each, log the model's response, run the LLM-as-judge against a rubric, store per-query scores. Track the aggregate across versions: quality regression is the single biggest reason model upgrades fail in production.
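Tracking per-query scores across versions might look like the sketch below. The judge itself is out of scope here: the inputs are assumed to be rubric scores in [0, 1] already produced by an LLM-as-judge run. The function name and the 0.05 tolerance are illustrative assumptions.

```python
from statistics import mean


def compare_versions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.05) -> dict:
    """Compare per-query judge scores for two model versions.

    Returns aggregate means plus the query ids whose score dropped
    by more than `tolerance`.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no overlapping query ids to compare")
    regressions = [q for q in shared
                   if candidate[q] < baseline[q] - tolerance]
    return {
        "baseline_mean": mean(baseline[q] for q in shared),
        "candidate_mean": mean(candidate[q] for q in shared),
        "regressions": regressions,
    }


report = compare_versions(
    baseline={"q1": 0.90, "q2": 0.80, "q3": 0.70},
    candidate={"q1": 0.90, "q2": 0.60, "q3": 0.75},
)
```

Keeping the per-query list matters because an aggregate can move only slightly while one class of query gets much worse.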

RAG-specific evals

Metric | What it measures
Retrieval recall@K | Did the right chunk appear in the top-K results?
Context precision | Of the K chunks retrieved, how many were actually relevant?
Faithfulness | Are the answer's claims supported by the retrieved context? (hallucination check)
Answer relevance | Did the answer actually address the question? (A technically correct answer can still miss it.)
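The two retrieval metrics can be computed directly from labeled chunk ids. This sketch follows the plain definitions in the table; eval libraries may define the same names differently (for example with rank weighting), so treat these formulas as an assumption rather than any tool's exact metric.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        raise ValueError("need at least one labeled relevant chunk")
    hits = relevant & set(retrieved[:k])
    return len(hits) / len(relevant)


def context_precision(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / len(top_k)


retrieved = ["c1", "c7", "c3", "c9"]   # ranked retrieval output
relevant = {"c3", "c4"}                # hand-labeled for this query
```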

RAGAS, Phoenix, LangSmith, Braintrust are the toolchain names interviewers expect you to know.

· · ·
§ 04 — Adaptation strategies

Fine-tuning vs prompting vs RAG — when each wins.

Method | Wins when | Loses when
Prompt engineering | Task fits in context; rapid iteration; no training infra | Need consistent format; specialized vocabulary; high volume cost
Few-shot prompting | Task pattern recognizable from 2-10 examples | Context window blows up; per-call cost climbs
RAG | Knowledge is dynamic; need citation; small org | Reasoning over many docs at once; latency budget tight
Fine-tuning (full or LoRA) | Format consistency; specialized domain; high QPS where prompt cost matters | Knowledge cutoff baked in; needs training data and an eval pipeline
Continued pretraining | Genuinely novel domain (legal, medical, codebase) | Expensive and skill-intensive; usually overkill

The 2026 default. Start with prompting + RAG. Only fine-tune when you've measured a specific failure mode that prompting can't fix (format drift, latency at high QPS, cost at scale). LoRA on a 7B-13B open model is the modern fine-tuning sweet spot: small training cost, fast inference, deployable in-house.
· · ·
§ 05 — Agents

Agent orchestration — tool calling, planning, memory.

"Agents" = LLMs with tools (functions they can call). The orchestration patterns:

Pattern | Description | Use when
Single-shot tool call | One LLM call, one tool invocation, done | Simple lookups (weather, calculator)
ReAct loop | Think → tool → observe → think → ... until done | Multi-step research, debugging tasks
Plan-and-execute | Generate full plan upfront, then execute steps | Long-horizon tasks where mid-loop drift hurts
Multi-agent | Specialist agents (researcher, coder, critic) cooperating | Tasks decomposable by role; expensive
Tree of thoughts | Explore N branches, prune, retry | Reasoning tasks with verifiable subgoals

Memory architecture

  • Conversation memory — sliding window or summary; tokens cost money
  • Vector memory — embed turns, retrieve relevant past context (long sessions)
  • Structured memory — extract facts (key-value), store in DB; refresh on use
  • Episodic memory — full session traces for replay/debugging
The agent failure mode. Loops. ReAct agents can spin: tool returns ambiguous result → LLM retries → tool returns same result → loop. The fixes: a max-iteration cap, deterministic exit conditions (e.g. "if the same tool is called with the same args twice, exit"), and a "scratchpad" the LLM has to reference before acting. Production agents fail gracefully (a clear "I couldn't complete this") rather than spinning forever.
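The loop guards described above can be sketched as a small driver. The LLM is abstracted as a `decide` callable that reads the scratchpad and returns either a tool action or a final answer; that interface, the action dict shape, and `run_agent` itself are assumptions for illustration, not any framework's API.

```python
import json
from typing import Callable


def run_agent(decide: Callable[[list[str]], dict],
              tools: dict[str, Callable[..., str]],
              max_iterations: int = 8) -> str:
    """Drive a ReAct-style loop with an iteration cap and a repeat-call exit."""
    scratchpad: list[str] = []               # observations the model must consult
    seen_calls: set[tuple[str, str]] = set()
    for _ in range(max_iterations):
        action = decide(scratchpad)
        if action["type"] == "final":
            return action["answer"]
        # Canonicalize args so {"a": 1, "b": 2} and {"b": 2, "a": 1} match.
        call_key = (action["tool"], json.dumps(action["args"], sort_keys=True))
        if call_key in seen_calls:
            return ("I couldn't complete this: "
                    "the same tool call repeated without progress.")
        seen_calls.add(call_key)
        observation = tools[action["tool"]](**action["args"])
        scratchpad.append(f"{action['tool']}({action['args']}) -> {observation}")
    return "I couldn't complete this: iteration limit reached."
```

Both exits return a plain message instead of raising, so the caller can surface a graceful failure to the user.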
· · ·
§ 06 — MCP

MCP — Model Context Protocol, the agent integration standard.

Anthropic's Model Context Protocol (open-sourced Nov 2024) is a JSON-RPC 2.0 standard for connecting LLM agents to tools, data sources, and prompts. The usual framing is "USB-C for AI": one connector spec that every host (Claude Desktop, Cursor, IDE plugins, custom agents) speaks, and through which every server (file system, GitHub, databases, custom APIs) exposes its capabilities.

The three primitives

Primitive | What it does | Who controls
Tools | Functions the LLM can call (with JSON schema for args) | LLM decides when to call
Resources | Read-only data the host can attach to context (files, DB rows, web pages) | User/host decides what to attach
Prompts | Templated prompt snippets the server provides | User triggers via slash command or UI

Why this matters for AIE interviews

  • Standardisation — before MCP, every agent shipped its own tool-calling protocol; integrations were N×M. Now it's N+M.
  • Discoverability — clients query tools/list and resources/list at startup; agents adapt without retraining.
  • Security boundary — MCP servers run in their own process, talk JSON-RPC over stdio or HTTP. Clean isolation; per-tool permissioning.
  • Ecosystem velocity — by Dec 2025, hundreds of community MCP servers exist (GitHub, Slack, Google Drive, Postgres, Salesforce). The whiteboard answer often is "I'd expose this as an MCP server" rather than "build a custom integration."

The MCP architecture in 30 seconds

┌──────────────┐    JSON-RPC over stdio    ┌──────────────────┐
│   MCP Host   │──────────  or HTTP  ─────▶│  MCP Server      │
│ (Claude,     │                           │  • tools/list    │
│  Cursor,     │◀──────────────────────────│  • tools/call    │
│  IDE, ...)   │                           │  • resources/*   │
└──────────────┘                           │  • prompts/*     │
                                           └──────────────────┘
                                                      │
                                                      ▼
                                          (DB, API, filesystem, ...)

Host = the app the user interacts with. Server = the integration provider. Client = the connector inside the host that holds a one-to-one connection to a specific server. One host typically connects to multiple servers (one per integration), with one client per server.
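To make the wire format concrete, here is a toy in-process handler for the two tool methods. It skips the initialization handshake and the transport entirely, and the `lookup_account` tool and its canned reply are hypothetical. The message shapes (JSON-RPC 2.0 envelope, `inputSchema` on tools, `content` blocks in results) follow the MCP specification as I understand it; check the spec before relying on them.

```python
TOOLS = [{
    "name": "lookup_account",  # hypothetical tool for illustration
    "description": "Return the status of a CRM account.",
    "inputSchema": {
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
}]


def _error(request_id, code: int, message: str) -> dict:
    return {"jsonrpc": "2.0", "id": request_id,
            "error": {"code": code, "message": message}}


def handle(request: dict) -> dict:
    """Answer a single JSON-RPC request the way a minimal MCP server might."""
    method = request["method"]
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call":
        params = request["params"]
        if params["name"] != "lookup_account":
            return _error(request["id"], -32602, f"Unknown tool: {params['name']}")
        account_id = params["arguments"]["account_id"]
        # Canned reply (assumption): a real server would query the CRM here.
        result = {"content": [{"type": "text",
                               "text": f"Account {account_id}: status=active"}],
                  "isError": False}
    else:
        return _error(request["id"], -32601, "Method not found")
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}


listing = handle({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
reply = handle({"jsonrpc": "2.0", "id": 2, "method": "tools/call",
                "params": {"name": "lookup_account",
                           "arguments": {"account_id": "A-17"}}})
```

A client discovers the tool and its argument schema from the `tools/list` reply, which is what lets a host adapt to new servers without code changes.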

When you'd build vs use an MCP server

Scenario | Build your own MCP server when | Use existing when
Internal company data (CRM, data warehouse) | ✓ (custom semantics, auth, permissions) | -
Standard SaaS (GitHub, Slack, GDrive) | If the official one is missing a feature | ✓ (official servers maintained by Anthropic / vendors)
Database / SQL access | For your specific dialect or auth layer | ✓ for vanilla Postgres / SQLite
Long-running background tasks | Yes; model as MCP tools with progress reporting | -

The interview pivot. If asked "how would you give your agent access to our company's Salesforce data", the L4 answer is "I'd write a tool that wraps the Salesforce API." The L5 answer is "I'd expose Salesforce as an MCP server with tools for query/update, resources for record attachment, and a permission boundary on the server side, so that any MCP-compatible host (our internal agent, Claude Desktop for ad-hoc analyst use, Cursor for the CRM engineers) can connect with the same auth model."
· · ·
§ 07 — AI system design

AI system design — the L5+ whiteboard.

The interview prompts you'll get: "Design a Q&A chatbot over our docs", "Design a code review assistant", "Design a customer support agent". The shape is consistent — clarify, sketch, dive on bottleneck. The senior signal isn't the architecture; it's what you weigh.

The 8-box AI system reference architecture

┌─────────────┐   ┌──────────────┐   ┌──────────────┐
│   Client    │──▶│ API gateway  │──▶│ Auth / quota │
│  (web/app)  │   │  (rate limit)│   │ (per-tenant) │
└─────────────┘   └──────┬───────┘   └──────┬───────┘
                         │                  │
                         ▼                  ▼
                   ┌──────────────────────────────┐
                   │   Orchestrator (agent / app) │
                   │   • prompt builder           │
                   │   • routing                  │
                   │   • caching layer            │
                   └──────────────┬───────────────┘
                                  │
            ┌─────────────────────┼───────────────────────┐
            ▼                     ▼                       ▼
     ┌────────────┐       ┌────────────┐         ┌──────────────┐
     │ Retrieval  │       │ Tools /    │         │  LLM call    │
     │ (vector DB │       │ MCP servers│         │ (model API)  │
     │  + BM25)   │       │            │         │              │
     └─────┬──────┘       └─────┬──────┘         └──────┬───────┘
           ▼                    ▼                       ▼
      ┌──────────────────────────────────────────────────────┐
      │  Observability: traces, eval signals, cost per call  │
      └──────────────────────────────────────────────────────┘

The 5 questions to clarify before drawing anything

  1. Read or write? Q&A → read; code-edit agent → write. Write demands transactions, audit, rollback.
  2. How fresh must the data be? Real-time (CDC into vector store) vs nightly (batch embed).
  3. What's the latency budget? Chatbot: 2-5s OK. Inline autocomplete: < 100ms TTFT.
  4. What's the eval rubric? Faithfulness? Conversion? CSAT? Drives the metrics layer.
  5. Multi-tenant? Drives namespace isolation, cost attribution, rate limit.

Pattern cookbook — common AI design prompts

Prompt | Headline answer | The trap
"Design a docs chatbot" | RAG with hybrid retrieval + reranker + LLM | Eval pipeline; doc updates; multi-tenancy
"Design a code review assistant" | MCP server exposing git + linter + test tools; agent reasoning over diffs | Cost per PR; false-positive fatigue
"Design customer support agent" | RAG over docs + KB + ticket history; agent with escalation tool | Hallucination on policy; PII handling
"Design an LLM-powered search" | Query rewriting → BM25 first pass → embedding rerank → LLM summarisation | Latency budget; query intent classification
"Design a meeting-notes summariser" | Whisper for ASR → chunked LLM summary → action-item extraction → calendar tool | Speaker diarisation; PII redaction
"Design AI image moderation" | Vision model + policy classifier + human review queue + appeal flow | Adversarial inputs; false-positive cost
· · ·
§ 08 — Cost & latency

Cost & latency — the production trade-offs.

Lever | Impact on cost | Impact on latency | Impact on quality
Model tier (e.g. GPT-4o → GPT-4o-mini) | Roughly 10-30× cheaper | 2-4× faster | Mostly fine for non-reasoning tasks
Prompt caching | 50-90% off the cached portion | Lower TTFT on cache hits | Neutral
Streaming | Neutral | Large TTFT win; total generation time unchanged | Neutral (UX win)
Batch API | ~50% off | Up to 24h turnaround (offline only) | Neutral
Shorter context (retrieve via RAG instead of stuffing full documents) | Lower per-call | Faster prefill | Depends on retrieval quality
Self-host open model | Higher fixed, lower per-call | Removes the external API hop | Often 10-20% below frontier; task-dependent

The cost-attribution problem

Every LLM call costs $X. Multi-tenant SaaS needs to attribute spend per customer. Pattern: tag every API call with customer_id + feature_id, log to a metrics store, roll up nightly. The bill from OpenAI/Anthropic is a single line item; your internal attribution is what makes pricing decisions defensible.
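The tag-and-roll-up pattern can be sketched as follows. The model names and per-million-token prices are made-up placeholders, not any vendor's price list, and `CallRecord`, `call_cost`, and `roll_up` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens (assumption, not real pricing).
PRICE_PER_MTOK = {
    "small-model": {"input": 0.50, "output": 1.50},
    "large-model": {"input": 5.00, "output": 15.00},
}


@dataclass(frozen=True)
class CallRecord:
    customer_id: str
    feature_id: str
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(record: CallRecord) -> float:
    """Cost of one call in USD from its token counts."""
    price = PRICE_PER_MTOK[record.model]
    return (record.input_tokens * price["input"]
            + record.output_tokens * price["output"]) / 1_000_000


def roll_up(records: list[CallRecord]) -> dict[tuple[str, str], float]:
    """Nightly roll-up: total spend per (customer_id, feature_id)."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for record in records:
        totals[(record.customer_id, record.feature_id)] += call_cost(record)
    return dict(totals)


log = [
    CallRecord("acme", "chat", "large-model", 1_000_000, 200_000),
    CallRecord("acme", "chat", "small-model", 2_000_000, 0),
    CallRecord("globex", "search", "small-model", 0, 1_000_000),
]
spend = roll_up(log)
```

Logging raw token counts rather than precomputed dollars lets you re-price history when vendor rates change.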

Caching strategy

  • Exact-match cache — same prompt → same response. Hit rate < 5% in most products.
  • Semantic cache — embed prompt; if nearest neighbor > 0.95 similarity, return cached. Hit rate 20-40%.
  • Prompt caching (API-side) — Anthropic/OpenAI cache the prompt prefix; subsequent calls 50-90% off and faster. Critical for long system prompts.
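A semantic cache along the lines of the second bullet might look like this sketch. Embeddings are passed in as plain float lists, so the embedding model is out of scope, and the lookup is a linear scan for clarity; a real deployment would more likely use a vector index. `SemanticCache` is a hypothetical name.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query_vec: list[float]) -> Optional[str]:
        """Return the cached response of the nearest prompt if it is close enough."""
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, query_vec: list[float], response: str) -> None:
        self.entries.append((query_vec, response))


cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer")
```

The threshold is the main tuning knob: set it too low and users get answers to a question they did not ask.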
· · ·
§ 09 — The interview script

The 90-second articulation script.

"AI engineering is data engineering with a non-deterministic compute layer. The pipelines look the same — extract, chunk, embed, index, serve — but the failure modes are different: hallucinations, embedding drift, retrieval misses, agent loops. The shape of the architecture is RAG (default for dynamic knowledge) vs fine-tuning (default for format / domain consistency) vs prompting (default for one-shot tasks); I'd start with prompting + RAG and only fine-tune when I have a measurable failure mode prompting can't fix. The eval pipeline is the hardest part — LLM-as-judge on a held-out set, RAG metrics (faithfulness, context precision, retrieval recall@K), and online metrics (thumbs, follow-up, task completion) feeding back into the training loop. Cost is the production constraint — model tiering, prompt caching, semantic cache, and per-tenant attribution let you ship at scale without the bill blowing up."
