ML Engineering Interview Prep — where DE meets ML.
The round when the prompt is "design a recommendation system" or "productionize this model". Feature stores, training/serving skew, model versioning, online vs offline metrics, A/B and incrementality. Seven sections — every concept articulable in two sentences, every trap pre-empted.
Contents
- The MLE opening — five-box ML system diagram
- Feature stores — offline + online + the consistency contract
- Training/serving skew — the #1 cause of "good model, bad production"
- Model versioning & rollouts (shadow → canary → A/B → 100%)
- Online vs offline metrics — why they disagree
- A/B testing & incrementality — the causal layer
- The 90-second articulation script
The MLE opening — five-box ML system diagram.
The data-platform diagram has 5 boxes (sources → ingest → storage → transform → serve). The ML diagram has the same 5 boxes plus a training loop on the side and a feature store in the middle. Draw this; everything else slots into it.
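One way to lay the boxes out before you start filling them in (a rough sketch; arrange it however reads best on your whiteboard):
sources → ingest → storage → transform → serve
                       │                    ▲
                       ▼                    │
               [ feature store ]────────────┤    offline (bulk training scans) + online (point lookups)
                       │                    │
                       ▼                    │
               [ training loop ] → registry ┘    model versions, rollout stages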
The seven questions to ask BEFORE drawing
- Online or batch inference? Real-time (recommendations on page load) vs batch (nightly fraud scoring) — completely different latency tier.
- How fresh do features need to be? Last-7-day clickstream is OK at hourly freshness; "items in your cart right now" is sub-second.
- How big is the feature space? 100 features = simple; 10K features = embeddings + dedicated infra.
- What's the label latency? Click feedback in seconds; conversion in days; LTV in months.
- How often does the model retrain? Continuous (online learning), daily, weekly, monthly?
- What are the online and offline metrics? Are they known to align?
- Is there a counterfactual? Can we measure incremental impact, or only attributed impact?
Feature stores — offline + online + the consistency contract.
The feature store is the most important infrastructure piece in modern MLE — and the easiest to skip in the interview. The single sentence to memorize: "a feature store guarantees the model sees the same feature definition at training and at serving."
Why feature stores exist — the 2018 problem
Before feature stores, every team built their own pipeline twice: once in Spark for training (read from S3, batch transform), once in Python for serving (read from KV cache, real-time transform). The two implementations drifted — different timezone handling, different null fills, different aggregation windows. Models that performed in offline eval would tank online. The feature store collapses both pipelines into one source of truth.
The dual-store architecture
| | Offline store | Online store |
|---|---|---|
| Storage | S3 / GCS / Iceberg | Redis / DynamoDB / Cassandra / Pinot |
| Read pattern | Bulk scan for training | Point lookup by entity_id |
| Latency | Seconds-minutes | Sub-10ms |
| Freshness | Hours-days | Seconds-minutes |
| Use case | Training set generation; offline batch scoring | Online inference at request time |
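To make the online side concrete, here's a minimal Feast-style point lookup at request time — the repo path and feature names are illustrative, and the same pattern applies to Tecton or a cloud feature store:
# Online store point read by entity_id — the sub-10ms path at inference time
from feast import FeatureStore

store = FeatureStore(repo_path=".")          # illustrative repo path
feature_vector = store.get_online_features(
    features=[
        "user_features:last_7d_purchases",
        "user_features:avg_order_value",
    ],
    entity_rows=[{"user_id": 12345}],        # point lookup, not a scan
).to_dict()
# The returned vector feeds model.predict() directly in the request path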
Point-in-time-correct training data — the must-have
A user's "last_7_day_purchase_count" today is not the same as it was when the model was being trained. Point-in-time joins reconstruct the feature value as of the prediction timestamp, not as of now:
# Tecton / Feast pattern — get_historical_features (point-in-time retrieval)
training_df = feature_store.get_historical_features(
    entity_df=labels_df,        # has user_id + label_ts
    features=[
        "user_features:last_7d_purchases",
        "user_features:avg_order_value",
        "session_features:current_session_clicks",
    ],
).to_df()                       # Feast returns a retrieval job; materialize it to a dataframe
# Each row in training_df: features computed AS OF the label's timestamp.
# Crucially: features that didn't exist at label_ts (because data was late-arriving)
# are NULL, not back-filled with "current" values — that would leak the future.
The leakage trap
The naive version — "join each label to the feature's current value" — leaks the future: the model trains on information that didn't exist at prediction time, which inflates offline metrics that never materialize online.
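A minimal pandas sketch of the difference, assuming labels_df carries user_id + label_ts and feature_snapshots carries user_id + snapshot_ts + feature values (all names illustrative):
import pandas as pd

# WRONG — joins the feature's CURRENT value onto every historical label (leaks the future)
leaky = labels_df.merge(current_features, on="user_id", how="left")

# RIGHT — point-in-time join: for each label, the latest snapshot at or before label_ts
pit = pd.merge_asof(
    labels_df.sort_values("label_ts"),
    feature_snapshots.sort_values("snapshot_ts"),
    left_on="label_ts",
    right_on="snapshot_ts",
    by="user_id",
    direction="backward",     # never look forward in time
)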
Open-source vs managed feature stores
| Tool | Type | Pick when |
|---|---|---|
| Feast | Open-source | You control the cloud; small-mid team; no $$$ for vendor |
| Tecton | Managed | Fast time-to-production; Spark + streaming features; $$$ |
| Databricks Feature Store | Managed (DBX) | All-in on Databricks; co-located with notebooks |
| Vertex AI / SageMaker FS | Managed (cloud-native) | Single-cloud shop; integrated with the rest of the ML stack |
| DIY | Built in-house | FAANG-scale; custom requirements; deep platform team |
Feature definition as code
# Decorator-style feature view in the spirit of Tecton's batch feature views
# (illustrative API; Feast declares FeatureView objects instead, but the contract
# is the same: one definition, materialized to both the offline and online store)
from datetime import timedelta
from pyspark.sql import functions as F, Window

@feature_view(
    entities=[user],
    ttl=timedelta(days=30),
    source=user_purchases_source,
)
def user_features_v1(df):
    # Rolling event-time windows — window functions belong in select(), not agg()
    w7 = Window.partitionBy("user_id").orderBy(F.col("event_ts").cast("long")).rangeBetween(-7 * 86400, 0)
    w30 = Window.partitionBy("user_id").orderBy(F.col("event_ts").cast("long")).rangeBetween(-30 * 86400, 0)
    return df.select(
        "user_id",
        "event_ts",
        F.count("*").over(w7).alias("last_7d_purchases"),
        F.avg("amount").over(w30).alias("avg_order_value"),
    )
# Same definition serves training (offline) and inference (online)
Training/serving skew — the #1 cause of "good model, bad production".
Even with a feature store, models drift between training and serving. The four classic skew types every senior MLE knows by heart:
The four skew types
| Skew type | What it is | Example |
|---|---|---|
| Schema skew | Training uses N features; serving sends N±1 | Training has user_age; serving forgot to populate it (null) → model handles NULL differently than zero |
| Distribution skew | Same features, different value distribution | Training data from last quarter (winter); serving on summer traffic |
| Concept drift | Underlying relationship changes over time | Pre-COVID purchase patterns vs post-COVID; user preferences shift |
| Computation skew | Same logical feature, different implementation | Training: pandas mean() ignores NaN; serving: Java average() propagates NaN |
Detection — what to monitor
| Skew | Detection |
|---|---|
| Schema | Compare feature schema at train-time vs request-time; alert on diff |
| Distribution | KL-divergence or PSI (Population Stability Index) per feature, train vs serving daily |
| Concept | Online metric (CTR, conversion) trending below offline-predicted |
| Computation | Replay features logged at serving time through the training transform; compare byte-for-byte |
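A minimal sketch of the schema check, assuming the training-time feature list was logged as a dict of {feature_name: dtype} (names illustrative):
# Schema-skew check — diff what the model was trained on vs what the request carries
def schema_diff(training_schema: dict, request_features: dict) -> dict:
    missing = set(training_schema) - set(request_features)      # model expects, request lacks
    unexpected = set(request_features) - set(training_schema)   # request sends, model never saw
    nulls = {f for f in training_schema if request_features.get(f) is None}
    return {"missing": missing, "unexpected": unexpected, "nulls": nulls}

# Alert if any of the three sets is non-empty — e.g. user_age silently arriving as None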
PSI — the production drift metric
# Population Stability Index — bucketed feature distribution shift
# train_dist / serving_dist: per-bucket proportions computed over the SAME bucket edges
import numpy as np

def psi(train_dist, serving_dist, eps=1e-6):
    return sum(
        (s - t) * np.log((s + eps) / (t + eps))
        for t, s in zip(train_dist, serving_dist)
    )

# PSI < 0.1  : no significant drift
# 0.1 - 0.25 : moderate drift, investigate
# > 0.25     : major drift, retrain
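Usage sketch — the inputs are per-bucket proportions, so bucket both samples on the training distribution's edges first (train_values / serving_values are assumed raw feature arrays):
import numpy as np

edges = np.histogram_bin_edges(train_values, bins=10)
train_dist = np.histogram(train_values, bins=edges)[0] / len(train_values)
serving_dist = np.histogram(serving_values, bins=edges)[0] / len(serving_values)
psi(train_dist, serving_dist)    # investigate above 0.1, retrain above 0.25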
Mitigation — the four moves
- Same code path — use the feature store's serving SDK, not a re-implementation.
- Shadow traffic logging — log served features; replay through training pipeline; assert byte-equal.
- Continuous retraining — daily/weekly retrains keep the model close to current distribution.
- Automatic drift alarm — page the team when PSI > 0.25 sustained.
The "training pipeline ≠ serving pipeline" smell
Model versioning & rollouts (shadow → canary → A/B → 100%).
You don't deploy a new model to 100% on day one. The four-stage rollout is the safety contract — every step has a failure boundary.
The four stages
| Stage | Traffic % | What gets validated | Time at this stage |
|---|---|---|---|
| Shadow | 0% (new model scores requests but doesn't serve) | Latency, error rate, prediction distribution | 1-7 days |
| Canary | 1-5% | Live online metrics (CTR, conversion) on small slice | 3-7 days |
| A/B | 50/50 (proper experiment) | Statistical significance vs old model | 1-4 weeks (until stat-sig) |
| 100% | 100% | Long-term stability, drift, retention impact | Until next model |
Shadow — the cheap safety net
New model receives every request and produces a prediction; the old model's output is what the user sees. The new prediction is logged alongside actual user behavior. This detects:
- Latency regression (new model takes 200ms; old takes 50ms)
- Error rate (new model crashes on certain inputs)
- Prediction distribution skew (new model heavily favors one class)
Canary — the small-blast-radius online test
1-5% of real users see the new model's predictions. Watch:
- Online metric (CTR, conversion) — directional only; not yet stat-sig
- User complaints / support tickets
- Downstream system load (does the new model cause spikes elsewhere?)
A/B — the proper experiment
50/50 (or 90/10 if risk-averse). Run until statistical significance on the primary metric. Pre-register:
- Primary metric — the one you'll commit on (e.g., 7-day retention)
- Guardrail metrics — must not regress (e.g., latency P99, support contact rate)
- Sample size — pre-computed for desired power (typically 80%) and minimum detectable effect (e.g., +1pp CTR); see the sketch after this list
- Stop conditions — early-stop on guardrail breach; otherwise wait for full sample
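A quick way to pre-compute that sample size for a proportion metric, using statsmodels power analysis (the baseline CTR and MDE below are illustrative):
# 80% power to detect +1pp on a 5% baseline CTR at alpha = 0.05, two-sided
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.06, 0.05)        # treatment CTR vs control CTR
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, power=0.80, alpha=0.05, ratio=1.0, alternative="two-sided"
)
print(round(n_per_arm))    # users needed in EACH arm before the test can conclude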
Model registry — the version-of-record
| Tool | Strength |
|---|---|
| MLflow Model Registry | Open-source; tracks artifacts + metadata + transitions |
| Weights & Biases / Comet | Experiment tracking + lineage; commercial |
| Vertex AI / SageMaker Model Registry | Cloud-native; integrated with deployment |
What every model version must store
model_v117:
model_artifact: s3://models/recsys/v117.pkl
training_run_id: wandb-run-abc123
training_data: s3://snapshots/2024-09-01/train.parquet (frozen)
feature_view: user_features_v3 (versioned!)
hyperparameters: { learning_rate: 0.001, n_layers: 4, ... }
offline_metrics: { auc: 0.81, recall@10: 0.42 }
deployed_at: 2024-09-12T10:00:00Z
rolled_back_from: null (or v116 if rollback)
Rollback — the must-have
Production model crashes / regresses. The rollback path must be: one config flip, < 60 seconds to revert. Two patterns:
- Blue-green — old model still loaded; flip routes between v117 and v116.
- Shadow keep-alive — old model continues to score in shadow during canary; instant revert if needed.
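A hypothetical router config showing what the one-flip contract looks like in practice (field names are made up for illustration):
# Blue-green routing config — rollback is editing one line, not redeploying
serving:
  active_model: recsys_v117     # flip to recsys_v116 to roll back; router hot-reloads the config
  standby_model: recsys_v116    # kept loaded in memory, so the revert has no cold start
  shadow_model: null            # populated during the shadow stage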
Online vs offline metrics — why they disagree.
The most disorienting moment in ML: offline AUC up 5%, online CTR down 2%. Either the offline metric is the wrong proxy, or the system is doing something different in production. Both are common.
Why they diverge — the four reasons
| Reason | Description |
|---|---|
| Selection bias in eval data | Offline data only contains items the OLD model surfaced — new model is evaluated on a self-selected set |
| Feedback-loop bias | Old model's outputs influenced past user behavior; future labels are conditioned on old model |
| Distribution shift | Training data is N days old; live traffic distribution has moved |
| Wrong metric | Offline AUC measures global ranking quality; online CTR measures clicks on what's actually shown at the top of the list — related but not the same |
Selection bias — the silent killer
Old recommender showed 10 items per page. Users clicked some, ignored others. Training data has labels only for shown items. New model trained on that data is evaluated on... the same shown items. It's never seen the items the old model didn't show. Online, the new model recommends some of those unshown items — and we have no idea how they'll perform.
Counterfactual evaluation — the fix
Reserve 5-10% of traffic for random or holdout recommendations. This data is unbiased — no model selection. Train on it (or use it to weight the rest). The TikTok FYP design (Scenario 14 in Data Modeling) does exactly this with fct_exploration_event.
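A minimal sketch of the exploration slice and propensity logging — serve(), log_event(), model.rank() and the 5% rate are all illustrative:
import random

EXPLORATION_RATE = 0.05    # 5% of requests get an unbiased, model-free ranking

def serve(request, candidates, model, k=10):
    if random.random() < EXPLORATION_RATE:
        ranking = random.sample(candidates, k=min(k, len(candidates)))
        propensity = min(k, len(candidates)) / len(candidates)   # per-item inclusion probability
        log_event(request, ranking, source="exploration", propensity=propensity)
    else:
        ranking = model.rank(candidates)[:k]
        log_event(request, ranking, source="model", propensity=None)
    return ranking

# Evaluate new models on source == "exploration" rows (unbiased), or use the logged
# propensities to inverse-propensity-weight the model-served rows.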
Picking the right offline metric
| Online metric | Best offline proxy | Trap |
|---|---|---|
| CTR | Top-K accuracy or NDCG@K | AUC measures ranking quality globally; CTR depends on top of list — different things |
| Conversion rate | Pointwise calibration + uplift modeling | Predicted probability ≠ actual rate; calibration matters |
| Watch time / engagement | Mean watched-time on holdout | Highly censored — many events don't have observed completion |
| Long-term retention | Lift on 7-day return | Multi-day signal; can't be measured offline at all without simulation |
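Since NDCG@K is the proxy named above, a self-contained implementation — relevances are the graded (or 0/1 click) labels of items in the model's ranked order:
import numpy as np

def ndcg_at_k(relevances, k=10):
    rel = np.asarray(relevances, dtype=float)[:k]
    if rel.size == 0:
        return 0.0
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))   # positions 1..k → log2(2)..log2(k+1)
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

ndcg_at_k([1, 0, 1, 0, 0], k=5)    # clicks at positions 1 and 3 of the served list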
When online metric goes negative — the diagnostic
- Check guardrails — did anything regress (latency, error rate)?
- Slice by user segment — is the regression concentrated (e.g., new users) or uniform?
- Check exploration data — does the new model perform worse on unbiased data too, or just on biased eval?
- Compare to shadow logs — is the new model's prediction distribution different from old?
- If all 4 are clean, suspect concept drift between training and serving.
A/B testing & incrementality — the causal layer.
The hardest interview reframe: attribution ≠ causality. "We attribute $50M of revenue to recommendations" is not the same as "recommendations CAUSED $50M of revenue". Most of those purchases would have happened anyway. Senior MLE separates the two.
Attribution vs incrementality — the definitions
| Concept | Definition |
|---|---|
| Attribution | Of all conversions, how many touched the model's output? (e.g., last-click, multi-touch) |
| Incrementality | Of all conversions, how many would NOT have happened without the model? |
Incrementality is almost always lower — often 30-50% of attribution. Incrementality, not attribution, is the model's actual value; the gap is over-claimed credit.
Measuring incrementality — three methods
| Method | Setup | Pros / cons |
|---|---|---|
| Holdout (best) | Random subset of users sees no model output (or default version) — measure their conversion vs treated users | Causal by construction; expensive (foregone revenue on holdout) |
| Geo / time split | Roll out model in some markets / weeks first; compare | No within-user leakage; weaker control for confounds |
| Causal modeling | Use ML / econometrics to estimate from observational data | No revenue cost; many assumptions (selection, confounding) |
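A worked holdout calculation with illustrative numbers — the point is how far incremental conversions sit below attributed ones:
# 5% random holdout sees no recommendations; everyone else is treated
treated_users, holdout_users = 950_000, 50_000
treated_cr, holdout_cr = 0.040, 0.028            # observed conversion rates

attributed = treated_users * treated_cr                  # 38,000 conversions (simplification:
                                                         # every treated-arm conversion counted as attributed)
incremental = treated_users * (treated_cr - holdout_cr)  # 11,400 would not have happened otherwise
print(incremental / attributed)                          # 0.30 — incrementality ≈ 30% of attribution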
A/B test — the standard tool
The 50/50 A/B test with random assignment is the gold standard. Pre-register everything:
experiment:
name: recsys_v117_vs_v116
hypothesis: "v117 will lift 7-day return by ≥ 1pp"
primary_metric: 7d_return_rate
guardrail_metrics:
- latency_p99 (must not increase > 5%)
- error_rate (must not increase > 0.1pp)
- support_contact_rate
treatment_split: 50/50
randomization_unit: user_id
pre_registered_n: 50,000 users per arm (80% power for +1pp MDE)
duration: 14 days minimum (capture weekly cycles)
stop_conditions:
- guardrail breach > 1 day
- primary metric stat-sig at p < 0.01 (Bonferroni-adjusted for interim looks)
Common A/B mistakes
| Mistake | Why it's wrong |
|---|---|
| Peeking — checking results daily and stopping when "significant" | Inflates false-positive rate; nominal p-value lies |
| Sample-ratio mismatch — actual treatment % drifts from 50% | Randomization is broken; results invalid |
| Novelty effect — measuring during the first week | Users react to "new" regardless of quality; revisit at 4 weeks |
| Heterogeneous treatment effects — averaging hides segment-specific harm | +5% on power users, -3% on new users; net positive but new-user damage is real |
| Post-treatment selection — comparing only users who "engaged" with treatment | Selects on the dependent variable; biased comparison |
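Sample-ratio mismatch from the table above is cheap to automate with a chi-square goodness-of-fit test (scipy; the counts are illustrative):
from scipy.stats import chisquare

observed = [50_412, 49_210]                 # users who actually landed in treatment / control
expected = [sum(observed) / 2] * 2          # what a true 50/50 split implies
stat, p = chisquare(observed, f_exp=expected)
if p < 0.001:                               # a commonly used SRM threshold — deliberately strict
    raise RuntimeError("Sample-ratio mismatch: randomization is broken, results are invalid")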
CUPED — variance reduction in A/B
For metrics with high variance (purchase amount, watch time), CUPED reduces required sample size by 30-50% by adjusting for pre-experiment user covariates:
Y_adjusted = Y - θ * (X_pre - mean(X_pre))
# Y = post-experiment metric
# X_pre = same metric measured in the pre-experiment period
# θ = cov(X_pre, Y) / var(X_pre) — the regression coefficient of Y on X_pre, computed on
#     pooled data (X_pre is pre-treatment, so it can't be affected by the arm)
# Compare Y_adjusted across arms instead of Y — same statistical power from a markedly smaller sample.
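The same adjustment in a few lines of numpy — y and x_pre are assumed to be aligned per-user arrays pooled across both arms:
import numpy as np

theta = np.cov(x_pre, y)[0, 1] / np.var(x_pre, ddof=1)
y_adj = y - theta * (x_pre - x_pre.mean())
# Run the usual test on y_adj per arm; its variance is lower than y's,
# so the same minimum detectable effect needs fewer users.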
When A/B isn't possible — switch-back / time-series
Some scenarios prevent random A/B:
- Network effects (your "treatment" affects untreated users via friend graph)
- Marketplace dynamics (treatment changes inventory available to control)
- Two-sided platforms (driver-side change affects rider-side metric)
Use switch-back tests (alternate treatment / control by time period) or geo-split tests (random by region). Acknowledge confounds explicitly.
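A minimal switch-back assignment sketch — the whole system flips between arms by time block instead of by user; the block length and salt are illustrative:
import hashlib

BLOCK_HOURS = 2    # everyone is in the same arm within a 2-hour block

def switchback_arm(ts, salt="recsys_v117"):
    block_id = int(ts.timestamp()) // (BLOCK_HOURS * 3600)    # ts is a datetime
    digest = hashlib.md5(f"{salt}:{block_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Analyze at the block level and expect carryover between adjacent blocks —
# name that confound explicitly rather than pretending blocks are independent.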
The 90-second articulation script.
▸ THE 90-SECOND SCRIPT
"For an ML system, I'd shape it as the standard five-box pipeline plus a feature store in the middle and a training loop on the side. The feature store guarantees training/serving consistency — same feature definition produces the same value at training time and at request time. Offline store on S3 / Iceberg for bulk training scans; online store on Redis / DynamoDB for sub-10ms point lookups."
"Training-data correctness: point-in-time joins are non-negotiable — features as-of-label-timestamp, not as-of-now. Naive 'look up features now' joins leak the future and inflate offline metrics."
"Training/serving skew is the #1 cause of 'good in eval, bad in prod'. Four flavors — schema, distribution, concept, computation — detected by feature-schema diff, PSI per feature, online-vs-offline metric divergence, and replay-byte-equality. Mitigation is single source of truth — feature store's serving SDK, never a re-implementation."
"Model rollouts go through four stages: shadow (0% traffic, validate latency / errors), canary (1-5%, real users on small slice), A/B (50/50 with pre-registered primary + guardrails), 100%. Model registry holds artifact + training data snapshot + feature view version + hyperparameters per version. Rollback is < 60 seconds via blue-green."
"Online vs offline metrics: divergence is normal. Causes are selection bias in eval, feedback-loop bias, distribution shift, or wrong proxy metric. Fix with 5-10% exploration traffic — random / holdout — for unbiased evaluation."
"Causality: attribution ≠ incrementality. Most attributed conversions would happen anyway. Gold standard for incrementality is a 5-10% holdout; switch-back or geo-split when user-level A/B isn't possible (network effects, marketplaces). Pre-register primary + guardrail + sample size; CUPED for variance reduction."
"Two risks I'd flag and mitigate first: training-serving computation skew (replay-and-compare in CI), and selection-bias-driven metric divergence (exploration table from day one). The system without these two is a ticking clock."
Three sentences that signal seniority — in any MLE round
- "Point-in-time correct training joins are the difference between a model that ships and a model that lies in offline eval."
- "Training/serving skew is computational at root — same code path, single source of truth, or you're rebuilding the same logic twice and they will drift."
- "Attribution measures association; incrementality measures causation. The 5-10% holdout cost is the price of knowing which one your model produces."