ML Engineering Interview Prep — where DE meets ML.
The round when the prompt is "design a recommendation system" or "productionize this model". Feature stores, training/serving skew, model versioning, online vs offline metrics, A/B and incrementality. Seven sections — every concept articulable in two sentences, every trap pre-empted.
Contents
- The MLE opening — five-box ML system diagram
- Feature stores — offline + online + the consistency contract
- Training/serving skew — the #1 cause of "good model, bad production"
- Model versioning & rollouts (shadow → canary → A/B → 100%)
- Online vs offline metrics — why they disagree
- A/B testing & incrementality — the causal layer
- The 90-second articulation script
The MLE opening — five-box ML system diagram.
The data-platform diagram has 5 boxes (sources → ingest → storage → transform → serve). The ML diagram has the same 5 boxes plus a training loop on the side and a feature store in the middle. Draw this; everything else slots into it.
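One way to lay the boxes out before you start filling them in (a rough sketch; arrange it however reads best on your whiteboard):
sources → ingest → storage → transform → serve
                       │                    ▲
                       ▼                    │
               [ feature store ]────────────┤    offline (bulk training scans) + online (point lookups)
                       │                    │
                       ▼                    │
               [ training loop ] → registry ┘    model versions, rollout stages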
The seven questions to ask BEFORE drawing
- Online or batch inference? Real-time (recommendations on page load) vs batch (nightly fraud scoring) — completely different latency tier.
- How fresh do features need to be? Last-7-day clickstream is OK at hourly freshness; "items in your cart right now" is sub-second.
- How big is the feature space? 100 features = simple; 10K features = embeddings + dedicated infra.
- What's the label latency? Click feedback in seconds; conversion in days; LTV in months.
- How often does the model retrain? Continuous (online learning), daily, weekly, monthly?
- What are the online and offline metrics? Are they known to align?
- Is there a counterfactual? Can we measure incremental impact, or only attributed impact?
Feature stores — offline + online + the consistency contract.
The feature store is the most important infrastructure piece in modern MLE — and the easiest to skip in the interview. The single sentence to memorize: "a feature store guarantees the model sees the same feature definition at training and at serving."
Why feature stores exist — the 2018 problem
Before feature stores, every team built their own pipeline twice: once in Spark for training (read from S3, batch transform), once in Python for serving (read from KV cache, real-time transform). The two implementations drifted — different timezone handling, different null fills, different aggregation windows. Models that performed in offline eval would tank online. The feature store collapses both pipelines into one source of truth.
The dual-store architecture
| | Offline store | Online store |
|---|---|---|
| Storage | S3 / GCS / Iceberg | Redis / DynamoDB / Cassandra / Pinot |
| Read pattern | Bulk scan for training | Point lookup by entity_id |
| Latency | Seconds-minutes | Sub-10ms |
| Freshness | Hours-days | Seconds-minutes |
| Use case | Training set generation; offline batch scoring | Online inference at request time |
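To make the online side concrete, here's a minimal Feast-style point lookup at request time — the repo path and feature names are illustrative, and the same pattern applies to Tecton or a cloud feature store:
# Online store point read by entity_id — the sub-10ms path at inference time
from feast import FeatureStore

store = FeatureStore(repo_path=".")          # illustrative repo path
feature_vector = store.get_online_features(
    features=[
        "user_features:last_7d_purchases",
        "user_features:avg_order_value",
    ],
    entity_rows=[{"user_id": 12345}],        # point lookup, not a scan
).to_dict()
# The returned vector feeds model.predict() directly in the request path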
Point-in-time-correct training data — the must-have
A user's "last_7_day_purchase_count" today is not the same as it was when the model was being trained. Point-in-time joins reconstruct the feature value as of the prediction timestamp, not as of now:
# Tecton / Feast pattern — get_historical_features (point-in-time retrieval)
training_df = feature_store.get_historical_features(
    entity_df=labels_df,        # has user_id + label_ts
    features=[
        "user_features:last_7d_purchases",
        "user_features:avg_order_value",
        "session_features:current_session_clicks",
    ],
).to_df()                       # Feast returns a retrieval job; materialize it to a dataframe
# Each row in training_df: features computed AS OF the label's timestamp.
# Crucially: features that didn't exist at label_ts (because data was late-arriving)
# are NULL, not back-filled with "current" values — that would leak the future.
The leakage trap
The naive version — "join each label to the feature's current value" — leaks the future: the model trains on information that didn't exist at prediction time, which inflates offline metrics that never materialize online.
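A minimal pandas sketch of the difference, assuming labels_df carries user_id + label_ts and feature_snapshots carries user_id + snapshot_ts + feature values (all names illustrative):
import pandas as pd

# WRONG — joins the feature's CURRENT value onto every historical label (leaks the future)
leaky = labels_df.merge(current_features, on="user_id", how="left")

# RIGHT — point-in-time join: for each label, the latest snapshot at or before label_ts
pit = pd.merge_asof(
    labels_df.sort_values("label_ts"),
    feature_snapshots.sort_values("snapshot_ts"),
    left_on="label_ts",
    right_on="snapshot_ts",
    by="user_id",
    direction="backward",     # never look forward in time
)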
Open-source vs managed feature stores
| Tool | Type | Pick when |
|---|---|---|
| Feast | Open-source | You control the cloud; small-mid team; no $$$ for vendor |
| Tecton | Managed | Fast time-to-production; Spark + streaming features; $$$ |
| Databricks Feature Store | Managed (DBX) | All-in on Databricks; co-located with notebooks |
| Vertex AI / SageMaker FS | Managed (cloud-native) | Single-cloud shop; integrated with the rest of the ML stack |
| DIY | Built in-house | FAANG-scale; custom requirements; deep platform team |
Feature definition as code
# Decorator-style feature view in the spirit of Tecton's batch feature views
# (illustrative API; Feast declares FeatureView objects instead, but the contract
# is the same: one definition, materialized to both the offline and online store)
from datetime import timedelta
from pyspark.sql import functions as F, Window

@feature_view(
    entities=[user],
    ttl=timedelta(days=30),
    source=user_purchases_source,
)
def user_features_v1(df):
    # Rolling event-time windows — window functions belong in select(), not agg()
    w7 = Window.partitionBy("user_id").orderBy(F.col("event_ts").cast("long")).rangeBetween(-7 * 86400, 0)
    w30 = Window.partitionBy("user_id").orderBy(F.col("event_ts").cast("long")).rangeBetween(-30 * 86400, 0)
    return df.select(
        "user_id",
        "event_ts",
        F.count("*").over(w7).alias("last_7d_purchases"),
        F.avg("amount").over(w30).alias("avg_order_value"),
    )
# Same definition serves training (offline) and inference (online)
Training/serving skew — the #1 cause of "good model, bad production".
Even with a feature store, models drift between training and serving. The four classic skew types every senior MLE knows by heart:
The four skew types
| Skew type | What it is | Example |
|---|---|---|
| Schema skew | Training uses N features; serving sends N±1 | Training has user_age; serving forgot to populate it (null) → model handles NULL differently than zero |
| Distribution skew | Same features, different value distribution | Training data from last quarter (winter); serving on summer traffic |
| Concept drift | Underlying relationship changes over time | Pre-COVID purchase patterns vs post-COVID; user preferences shift |
| Computation skew | Same logical feature, different implementation | Training: pandas mean() ignores NaN; serving: Java average() propagates NaN |
Detection — what to monitor
| Skew | Detection |
|---|---|
| Schema | Compare feature schema at train-time vs request-time; alert on diff |
| Distribution | KL-divergence or PSI (Population Stability Index) per feature, train vs serving daily |
| Concept | Online metric (CTR, conversion) trending below offline-predicted |
| Computation | Replay features logged at serving time through the training transform; compare byte-for-byte |
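A minimal sketch of the schema check, assuming the training-time feature list was logged as a dict of {feature_name: dtype} (names illustrative):
# Schema-skew check — diff what the model was trained on vs what the request carries
def schema_diff(training_schema: dict, request_features: dict) -> dict:
    missing = set(training_schema) - set(request_features)      # model expects, request lacks
    unexpected = set(request_features) - set(training_schema)   # request sends, model never saw
    nulls = {f for f in training_schema if request_features.get(f) is None}
    return {"missing": missing, "unexpected": unexpected, "nulls": nulls}

# Alert if any of the three sets is non-empty — e.g. user_age silently arriving as None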
PSI — the production drift metric
# Population Stability Index — bucketed feature distribution shift
# train_dist / serving_dist: per-bucket proportions computed over the SAME bucket edges
import numpy as np

def psi(train_dist, serving_dist, eps=1e-6):
    return sum(
        (s - t) * np.log((s + eps) / (t + eps))
        for t, s in zip(train_dist, serving_dist)
    )

# PSI < 0.1  : no significant drift
# 0.1 - 0.25 : moderate drift, investigate
# > 0.25     : major drift, retrain
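Usage sketch — the inputs are per-bucket proportions, so bucket both samples on the training distribution's edges first (train_values / serving_values are assumed raw feature arrays):
import numpy as np

edges = np.histogram_bin_edges(train_values, bins=10)
train_dist = np.histogram(train_values, bins=edges)[0] / len(train_values)
serving_dist = np.histogram(serving_values, bins=edges)[0] / len(serving_values)
psi(train_dist, serving_dist)    # investigate above 0.1, retrain above 0.25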
Mitigation — the four moves
- Same code path — use the feature store's serving SDK, not a re-implementation.
- Shadow traffic logging — log served features; replay through training pipeline; assert byte-equal.
- Continuous retraining — daily/weekly retrains keep the model close to current distribution.
- Automatic drift alarm — page the team when PSI > 0.25 sustained.
The "training pipeline ≠ serving pipeline" smell
Model versioning & rollouts (shadow → canary → A/B → 100%).
You don't deploy a new model to 100% on day one. The four-stage rollout is the safety contract — every step has a failure boundary.
The four stages
| Stage | Traffic % | What gets validated | Time at this stage |
|---|---|---|---|
| Shadow | 0% (new model scores requests but doesn't serve) | Latency, error rate, prediction distribution | 1-7 days |
| Canary | 1-5% | Live online metrics (CTR, conversion) on small slice | 3-7 days |
| A/B | 50/50 (proper experiment) | Statistical significance vs old model | 1-4 weeks (until stat-sig) |
| 100% | 100% | Long-term stability, drift, retention impact | Until next model |
Shadow — the cheap safety net
New model receives every request and produces a prediction; the old model's output is what the user sees. The new prediction is logged alongside actual user behavior. This detects:
- Latency regression (new model takes 200ms; old takes 50ms)
- Error rate (new model crashes on certain inputs)
- Prediction distribution skew (new model heavily favors one class)
Canary — the small-blast-radius online test
1-5% of real users see the new model's predictions. Watch:
- Online metric (CTR, conversion) — directional only; not yet stat-sig
- User complaints / support tickets
- Downstream system load (does the new model cause spikes elsewhere?)
A/B — the proper experiment
50/50 (or 90/10 if risk-averse). Run until statistical significance on the primary metric. Pre-register:
- Primary metric — the one you'll commit on (e.g., 7-day retention)
- Guardrail metrics — must not regress (e.g., latency P99, support contact rate)
- Sample size — pre-computed for desired power (typically 80%) and minimum detectable effect (e.g., +1pp CTR); see the sketch after this list
- Stop conditions — early-stop on guardrail breach; otherwise wait for full sample
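A quick way to pre-compute that sample size for a proportion metric, using statsmodels power analysis (the baseline CTR and MDE below are illustrative):
# 80% power to detect +1pp on a 5% baseline CTR at alpha = 0.05, two-sided
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.06, 0.05)        # treatment CTR vs control CTR
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, power=0.80, alpha=0.05, ratio=1.0, alternative="two-sided"
)
print(round(n_per_arm))    # users needed in EACH arm before the test can conclude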
Model registry — the version-of-record
| Tool | Strength |
|---|---|
| MLflow Model Registry | Open-source; tracks artifacts + metadata + transitions |
| Weights & Biases / Comet | Experiment tracking + lineage; commercial |
| Vertex AI / SageMaker Model Registry | Cloud-native; integrated with deployment |
What every model version must store
model_v117:
model_artifact: s3://models/recsys/v117.pkl
training_run_id: wandb-run-abc123
training_data: s3://snapshots/2024-09-01/train.parquet (frozen)
feature_view: user_features_v3 (versioned!)
hyperparameters: { learning_rate: 0.001, n_layers: 4, ... }
offline_metrics: { auc: 0.81, recall@10: 0.42 }
deployed_at: 2024-09-12T10:00:00Z
rolled_back_from: null (or v116 if rollback)
Rollback — the must-have
Production model crashes / regresses. The rollback path must be: one config flip, < 60 seconds to revert. Two patterns:
- Blue-green — old model still loaded; flip routes between v117 and v116.
- Shadow keep-alive — old model continues to score in shadow during canary; instant revert if needed.
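A hypothetical router config showing what the one-flip contract looks like in practice (field names are made up for illustration):
# Blue-green routing config — rollback is editing one line, not redeploying
serving:
  active_model: recsys_v117     # flip to recsys_v116 to roll back; router hot-reloads the config
  standby_model: recsys_v116    # kept loaded in memory, so the revert has no cold start
  shadow_model: null            # populated during the shadow stage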
Online vs offline metrics — why they disagree.
The most disorienting moment in ML: offline AUC up 5%, online CTR down 2%. Either the offline metric is the wrong proxy, or the system is doing something different in production. Both are common.
Why they diverge — the four reasons
| Reason | Description |
|---|---|
| Selection bias in eval data | Offline data only contains items the OLD model surfaced — new model is evaluated on a self-selected set |
| Feedback-loop bias | Old model's outputs influenced past user behavior; future labels are conditioned on old model |
| Distribution shift | Training data is N days old; live traffic distribution has moved |
| Wrong metric | Offline AUC measures global ranking quality; online CTR measures clicks on what's actually shown at the top of the list — related but not the same |
Selection bias — the silent killer
Old recommender showed 10 items per page. Users clicked some, ignored others. Training data has labels only for shown items. New model trained on that data is evaluated on... the same shown items. It's never seen the items the old model didn't show. Online, the new model recommends some of those unshown items — and we have no idea how they'll perform.
Counterfactual evaluation — the fix
Reserve 5-10% of traffic for random or holdout recommendations. This data is unbiased — no model selection. Train on it (or use it to weight the rest). The TikTok FYP design (Scenario 14 in Data Modeling) does exactly this with fct_exploration_event.
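A minimal sketch of the exploration slice and propensity logging — serve(), log_event(), model.rank() and the 5% rate are all illustrative:
import random

EXPLORATION_RATE = 0.05    # 5% of requests get an unbiased, model-free ranking

def serve(request, candidates, model, k=10):
    if random.random() < EXPLORATION_RATE:
        ranking = random.sample(candidates, k=min(k, len(candidates)))
        propensity = min(k, len(candidates)) / len(candidates)   # per-item inclusion probability
        log_event(request, ranking, source="exploration", propensity=propensity)
    else:
        ranking = model.rank(candidates)[:k]
        log_event(request, ranking, source="model", propensity=None)
    return ranking

# Evaluate new models on source == "exploration" rows (unbiased), or use the logged
# propensities to inverse-propensity-weight the model-served rows.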
Picking the right offline metric
| Online metric | Best offline proxy | Trap |
|---|---|---|
| CTR | Top-K accuracy or NDCG@K | AUC measures ranking quality globally; CTR depends on top of list — different things |
| Conversion rate | Pointwise calibration + uplift modeling | Predicted probability ≠ actual rate; calibration matters |
| Watch time / engagement | Mean watched-time on holdout | Highly censored — many events don't have observed completion |
| Long-term retention | Lift on 7-day return | Multi-day signal; can't be measured offline at all without simulation |
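Since NDCG@K is the proxy named above, a self-contained implementation — relevances are the graded (or 0/1 click) labels of items in the model's ranked order:
import numpy as np

def ndcg_at_k(relevances, k=10):
    rel = np.asarray(relevances, dtype=float)[:k]
    if rel.size == 0:
        return 0.0
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))   # positions 1..k → log2(2)..log2(k+1)
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

ndcg_at_k([1, 0, 1, 0, 0], k=5)    # clicks at positions 1 and 3 of the served list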
When online metric goes negative — the diagnostic
- Check guardrails — did anything regress (latency, error rate)?
- Slice by user segment — is the regression concentrated (e.g., new users) or uniform?
- Check exploration data — does the new model perform worse on unbiased data too, or just on biased eval?
- Compare to shadow logs — is the new model's prediction distribution different from old?
- If all 4 are clean, suspect concept drift between training and serving.
A/B testing & incrementality — the causal layer.
The hardest interview reframe: attribution ≠ causality. "We attribute $50M of revenue to recommendations" is not the same as "recommendations CAUSED $50M of revenue". Most of those purchases would have happened anyway. Senior MLE separates the two.
Attribution vs incrementality — the definitions
| Concept | Definition |
|---|---|
| Attribution | Of all conversions, how many touched the model's output? (e.g., last-click, multi-touch) |
| Incrementality | Of all conversions, how many would NOT have happened without the model? |
Incrementality is almost always lower — often 30-50% of attribution. Incrementality, not attribution, is the model's actual value; the gap is over-claimed credit.
Measuring incrementality — three methods
| Method | Setup | Pros / cons |
|---|---|---|
| Holdout (best) | Random subset of users sees no model output (or default version) — measure their conversion vs treated users | Causal by construction; expensive (foregone revenue on holdout) |
| Geo / time split | Roll out model in some markets / weeks first; compare | No within-user leakage; weaker control for confounds |
| Causal modeling | Use ML / econometrics to estimate from observational data | No revenue cost; many assumptions (selection, confounding) |
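A worked holdout calculation with illustrative numbers — the point is how far incremental conversions sit below attributed ones:
# 5% random holdout sees no recommendations; everyone else is treated
treated_users, holdout_users = 950_000, 50_000
treated_cr, holdout_cr = 0.040, 0.028            # observed conversion rates

attributed = treated_users * treated_cr                  # 38,000 conversions (simplification:
                                                         # every treated-arm conversion counted as attributed)
incremental = treated_users * (treated_cr - holdout_cr)  # 11,400 would not have happened otherwise
print(incremental / attributed)                          # 0.30 — incrementality ≈ 30% of attribution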
A/B test — the standard tool
The 50/50 A/B test with random assignment is the gold standard. Pre-register everything:
experiment:
name: recsys_v117_vs_v116
hypothesis: "v117 will lift 7-day return by ≥ 1pp"
primary_metric: 7d_return_rate
guardrail_metrics:
- latency_p99 (must not increase > 5%)
- error_rate (must not increase > 0.1pp)
- support_contact_rate
treatment_split: 50/50
randomization_unit: user_id
pre_registered_n: 50,000 users per arm (80% power for +1pp MDE)
duration: 14 days minimum (capture weekly cycles)
stop_conditions:
- guardrail breach > 1 day
- primary metric stat-sig at p < 0.01 (Bonferroni-adjusted for interim looks)
Common A/B mistakes
| Mistake | Why it's wrong |
|---|---|
| Peeking — checking results daily and stopping when "significant" | Inflates false-positive rate; nominal p-value lies |
| Sample-ratio mismatch — actual treatment % drifts from 50% | Randomization is broken; results invalid |
| Novelty effect — measuring during the first week | Users react to "new" regardless of quality; revisit at 4 weeks |
| Heterogeneous treatment effects — averaging hides segment-specific harm | +5% on power users, -3% on new users; net positive but new-user damage is real |
| Post-treatment selection — comparing only users who "engaged" with treatment | Selects on the dependent variable; biased comparison |
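Sample-ratio mismatch from the table above is cheap to automate with a chi-square goodness-of-fit test (scipy; the counts are illustrative):
from scipy.stats import chisquare

observed = [50_412, 49_210]                 # users who actually landed in treatment / control
expected = [sum(observed) / 2] * 2          # what a true 50/50 split implies
stat, p = chisquare(observed, f_exp=expected)
if p < 0.001:                               # a commonly used SRM threshold — deliberately strict
    raise RuntimeError("Sample-ratio mismatch: randomization is broken, results are invalid")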
CUPED — variance reduction in A/B
For metrics with high variance (purchase amount, watch time), CUPED reduces required sample size by 30-50% by adjusting for pre-experiment user covariates:
Y_adjusted = Y - θ * (X_pre - mean(X_pre))
# Y = post-experiment metric
# X_pre = same metric measured in the pre-experiment period
# θ = cov(X_pre, Y) / var(X_pre) — the regression coefficient of Y on X_pre, computed on
#     pooled data (X_pre is pre-treatment, so it can't be affected by the arm)
# Compare Y_adjusted across arms instead of Y — same statistical power from a markedly smaller sample.
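The same adjustment in a few lines of numpy — y and x_pre are assumed to be aligned per-user arrays pooled across both arms:
import numpy as np

theta = np.cov(x_pre, y)[0, 1] / np.var(x_pre, ddof=1)
y_adj = y - theta * (x_pre - x_pre.mean())
# Run the usual test on y_adj per arm; its variance is lower than y's,
# so the same minimum detectable effect needs fewer users.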
When A/B isn't possible — switch-back / time-series
Some scenarios prevent random A/B:
- Network effects (your "treatment" affects untreated users via friend graph)
- Marketplace dynamics (treatment changes inventory available to control)
- Two-sided platforms (driver-side change affects rider-side metric)
Use switch-back tests (alternate treatment / control by time period) or geo-split tests (random by region). Acknowledge confounds explicitly.
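A minimal switch-back assignment sketch — the whole system flips between arms by time block instead of by user; the block length and salt are illustrative:
import hashlib

BLOCK_HOURS = 2    # everyone is in the same arm within a 2-hour block

def switchback_arm(ts, salt="recsys_v117"):
    block_id = int(ts.timestamp()) // (BLOCK_HOURS * 3600)    # ts is a datetime
    digest = hashlib.md5(f"{salt}:{block_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Analyze at the block level and expect carryover between adjacent blocks —
# name that confound explicitly rather than pretending blocks are independent.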
The 90-second articulation script.
▸ THE 90-SECOND SCRIPT
"For an ML system, I'd shape it as the standard five-box pipeline plus a feature store in the middle and a training loop on the side. The feature store guarantees training/serving consistency — same feature definition produces the same value at training time and at request time. Offline store on S3 / Iceberg for bulk training scans; online store on Redis / DynamoDB for sub-10ms point lookups."
"Training-data correctness: point-in-time joins are non-negotiable — features as-of-label-timestamp, not as-of-now. Naive 'look up features now' joins leak the future and inflate offline metrics."
"Training/serving skew is the #1 cause of 'good in eval, bad in prod'. Four flavors — schema, distribution, concept, computation — detected by feature-schema diff, PSI per feature, online-vs-offline metric divergence, and replay-byte-equality. Mitigation is single source of truth — feature store's serving SDK, never a re-implementation."
"Model rollouts go through four stages: shadow (0% traffic, validate latency / errors), canary (1-5%, real users on small slice), A/B (50/50 with pre-registered primary + guardrails), 100%. Model registry holds artifact + training data snapshot + feature view version + hyperparameters per version. Rollback is < 60 seconds via blue-green."
"Online vs offline metrics: divergence is normal. Causes are selection bias in eval, feedback-loop bias, distribution shift, or wrong proxy metric. Fix with 5-10% exploration traffic — random / holdout — for unbiased evaluation."
"Causality: attribution ≠ incrementality. Most attributed conversions would happen anyway. Gold standard for incrementality is a 5-10% holdout; switch-back or geo-split when user-level A/B isn't possible (network effects, marketplaces). Pre-register primary + guardrail + sample size; CUPED for variance reduction."
"Two risks I'd flag and mitigate first: training-serving computation skew (replay-and-compare in CI), and selection-bias-driven metric divergence (exploration table from day one). The system without these two is a ticking clock."
Three sentences that signal seniority — in any MLE round
- "Point-in-time correct training joins are the difference between a model that ships and a model that lies in offline eval."
- "Training/serving skew is computational at root — same code path, single source of truth, or you're rebuilding the same logic twice and they will drift."
- "Attribution measures association; incrementality measures causation. The 5-10% holdout cost is the price of knowing which one your model produces."