▸ DESIGN · ML Engineering Interview Prep

ML Engineering Interview Prep — where DE meets ML.

The round where the prompt is "design a recommendation system" or "productionize this model". Feature stores, training/serving skew, model versioning, online vs offline metrics, A/B and incrementality. Seven sections — every concept articulable in two sentences, every trap pre-empted.

§ 01 — The opening

The MLE opening — five-box ML system diagram.

The data-platform diagram has 5 boxes (sources → ingest → storage → transform → serve). The ML diagram has the same 5 boxes plus a training loop on the side and a feature store in the middle. Draw this; everything else slots into it.

┌──────────┐    ┌──────────┐    ┌─────────────────┐    ┌──────────┐    ┌──────────┐
│ Events   │───▶│ Ingest   │───▶│ Feature Store   │───▶│ Online   │───▶│ Product  │
│ (clicks, │    │ (Kafka,  │    │ ┌─────────────┐ │    │ Inference│    │ surface  │
│ views,   │    │ CDC,     │    │ │ Offline (S3)│ │    │ (model   │    │ (UI, API,│
│ txns)    │    │ batch)   │    │ │ for train   │ │    │ server)  │    │ ML rec)  │
└──────────┘    └──────────┘    │ ├─────────────┤ │    └────┬─────┘    └────┬─────┘
                                │ │ Online (KV) │ │         │               │
                                │ │ for serve   │ │         │               │
                                │ └─────────────┘ │         │               │
                                └────────┬────────┘         │               │
                                         │                  │               │
                                         ▼                  │               │
                                ┌──────────────┐            │               │
                                │ Training     │◀───────────┘               │
                                │ pipeline     │  feedback events           │
                                │ (Spark ML,   │  (impression+label)        │
                                │ PyTorch,     │◀───────────────────────────┘
                                │ TF)          │
                                └──────┬───────┘
                                       │ trained model
                                       ▼
                                ┌──────────────┐
                                │ Model        │──▶ deployed to inference server
                                │ Registry     │    with version, metadata, A/B config
                                └──────────────┘

The seven questions to ask BEFORE drawing

  1. Online or batch inference? Real-time (recommendations on page load) vs batch (nightly fraud scoring) — completely different latency tier.
  2. How fresh do features need to be? Last-7-day clickstream is OK at hourly freshness; "items in your cart right now" is sub-second.
  3. How big is the feature space? 100 features = simple; 10K features = embeddings + dedicated infra.
  4. What's the label latency? Click feedback in seconds; conversion in days; LTV in months.
  5. How often does the model retrain? Continuous (online learning), daily, weekly, monthly?
  6. What are the online and offline metrics? Are they known to align?
  7. Is there a counterfactual? Can we measure incremental impact, or only attributed impact?
Senior signal. Don't ask all seven — ask the two that aren't pinned. "Before I dive in, is this online inference with sub-second latency, or batch scoring? And what's the label latency — clicks return in seconds, conversions in days?" Two questions, system shape decided.
§ 02 — Feature stores

Feature stores — offline + online + the consistency contract.

The feature store is the most important infrastructure piece in modern MLE — and the easiest to skip in the interview. The single sentence to memorize: "a feature store guarantees the model sees the same feature definition at training and at serving."

Why feature stores exist — the 2018 problem

Before feature stores, every team built their own pipeline twice: once in Spark for training (read from S3, batch transform), once in Python for serving (read from KV cache, real-time transform). The two implementations drifted — different timezone handling, different null fills, different aggregation windows. Models that performed in offline eval would tank online. The feature store collapses both pipelines into one source of truth.

The dual-store architecture

             | Offline store                                   | Online store
Storage      | S3 / GCS / Iceberg                              | Redis / DynamoDB / Cassandra / Pinot
Read pattern | Bulk scan for training                          | Point lookup by entity_id
Latency      | Seconds-minutes                                 | Sub-10ms
Freshness    | Hours-days                                      | Seconds-minutes
Use case     | Training set generation; offline batch scoring  | Online inference at request time
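The contract in the table can be sketched as a toy in-memory pair of stores. This is illustrative, not a real feature-store SDK: the offline side keeps full history for bulk scans, the online side keeps only the latest value per entity for point lookups.

```python
# Toy sketch of the dual-store read contract (illustrative only):
# one materialization path, two physical stores with different read patterns.
class FeatureStore:
    def __init__(self):
        self.offline = []   # stand-in for S3/Iceberg: append-only row log, bulk-scanned
        self.online = {}    # stand-in for Redis/Dynamo: entity_id -> latest feature dict

    def materialize(self, rows):
        """Write each row to the offline log and upsert the online KV view."""
        for row in rows:
            self.offline.append(row)
            self.online[row["user_id"]] = {k: v for k, v in row.items() if k != "user_id"}

    def get_historical_features(self):
        """Bulk scan for training-set generation (seconds-minutes tier)."""
        return list(self.offline)

    def get_online_features(self, user_id):
        """Point lookup at request time (sub-10ms tier)."""
        return self.online.get(user_id)

fs = FeatureStore()
fs.materialize([
    {"user_id": 1, "last_7d_purchases": 3},
    {"user_id": 1, "last_7d_purchases": 4},   # newer value wins online
    {"user_id": 2, "last_7d_purchases": 0},
])
print(len(fs.get_historical_features()))   # 3 — offline keeps history
print(fs.get_online_features(1))           # {'last_7d_purchases': 4} — online keeps latest
```

The asymmetry is the whole point: training wants every historical row, serving wants exactly one fresh row per entity.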

Point-in-time-correct training data — the must-have

A user's "last_7_day_purchase_count" today is not the same as it was at the moment the training label was generated. Point-in-time joins reconstruct the feature value as of the prediction timestamp, not as of now:

# Tecton / Feast pattern — get_historical_features
training_df = feature_store.get_historical_features(
  entity_df=labels_df,                      # has user_id + label_ts
  features=[
    "user_features:last_7d_purchases",
    "user_features:avg_order_value",
    "session_features:current_session_clicks"
  ]
)
# Each row in training_df: features computed AS OF the label's timestamp.
# Crucially: features that didn't exist at label_ts (because data was late-arriving)
# are NULL, not back-filled with "current" values — that would leak the future.

The leakage trap

Trap. Naive training join: "for each label, look up the feature now." Looks reasonable; catastrophically wrong. The "now" features include data from AFTER the label was generated — the model "predicts" using the future. Models look amazing in offline eval; tank in production. This is the #1 reason "good model, bad production" happens. Always point-in-time join.
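A minimal sketch of the correct vs. leaky join, using pandas `merge_asof` as a stand-in for a feature store's point-in-time logic (the frames and values are hypothetical):

```python
import pandas as pd

# Point-in-time join sketch: real feature stores do this at scale,
# but the semantics are exactly merge_asof "backward".
features = pd.DataFrame({
    "user_id": [1, 1, 1],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15"]),
    "last_7d_purchases": [2, 5, 9],
}).sort_values("ts")

labels = pd.DataFrame({
    "user_id": [1],
    "ts": pd.to_datetime(["2024-01-10"]),   # label generated on Jan 10
    "label": [1],
}).sort_values("ts")

# Correct: latest feature value AT OR BEFORE the label timestamp -> 5 (the Jan 8 row)
pit = pd.merge_asof(labels, features, on="ts", by="user_id", direction="backward")

# Leaky: naive "look up the feature now" grabs the latest row -> 9 (Jan 15, the future)
leaky = labels.merge(
    features.groupby("user_id").tail(1)[["user_id", "last_7d_purchases"]],
    on="user_id",
)

print(pit["last_7d_purchases"].iloc[0])    # 5
print(leaky["last_7d_purchases"].iloc[0])  # 9
```

The leaky variant trains the model on a feature value that did not exist when the label happened, which is exactly the inflated-offline-eval failure described above.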

Open-source vs managed feature stores

Tool                     | Type                   | Pick when
Feast                    | Open-source            | You control the cloud; small-mid team; no $$$ for vendor
Tecton                   | Managed                | Fast time-to-production; Spark + streaming features; $$$
Databricks Feature Store | Managed (DBX)          | All-in on Databricks; co-located with notebooks
Vertex AI / SageMaker FS | Managed (cloud-native) | Single-cloud shop; integrated with the rest of the ML stack
DIY                      | Built in-house         | FAANG-scale; custom requirements; deep platform team

Feature definition as code

# Feature view with offline + online materialization — illustrative sketch.
# (Decorator-style transforms like this are closest to Tecton's
# @batch_feature_view API; Feast feature views are declarative. Treat
# `user`, `user_purchases_source`, and the window bounds as placeholders.)
@feature_view(
  entities=[user],
  ttl=Duration(days=30),
  source=user_purchases_source,
)
def user_features_v1(df):
  # Pseudocode — window filtering elided; aggregate per user over the
  # trailing 7-day / 30-day windows
  return df.groupBy("user_id").agg(
    F.count("*").alias("last_7d_purchases"),
    F.avg("amount").alias("avg_order_value"),
  )

# The same definition serves training (offline) and inference (online)
Senior signal. "Feature stores guarantee training/serving consistency. Offline store is S3/Iceberg for bulk scan; online store is Redis/Dynamo for sub-10ms point lookup. Critical: point-in-time-correct joins for training data — features as-of-label-timestamp, not as-of-now. Otherwise the future leaks into training and offline eval lies to you."
§ 03 — The skew problem

Training/serving skew — the #1 cause of "good model, bad production".

Even with a feature store, models drift between training and serving. The four classic skew types every senior MLE knows by heart:

The four skew types

Skew type         | What it is                                    | Example
Schema skew       | Training uses N features; serving sends N±1   | Training has user_age; serving forgot to populate it (null) → model handles NULL differently than zero
Distribution skew | Same features, different value distribution   | Training data from last quarter (winter); serving on summer traffic
Concept drift     | Underlying relationship changes over time     | Pre-COVID purchase patterns vs post-COVID; user preferences shift
Computation skew  | Same logical feature, different implementation| Training: pandas mean() ignores NaN; serving: Java average() propagates NaN
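The computation-skew row is easy to demo in a few lines. Here numpy stands in for the strict (Java-like) implementation, and the values are hypothetical:

```python
import math
import numpy as np
import pandas as pd

# The "same" logical mean, three implementations, two answers.
values = [10.0, 20.0, float("nan")]

pandas_mean = pd.Series(values).mean()   # skipna=True by default -> 15.0
numpy_mean = np.mean(values)             # propagates NaN -> nan
manual_mean = sum(values) / len(values)  # Python sum propagates NaN too -> nan

print(pandas_mean)               # 15.0
print(math.isnan(numpy_mean))    # True
print(math.isnan(manual_mean))   # True
```

One missing value, and a model trained on the pandas number is served the NaN-propagating number. That silent disagreement is computation skew.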

Detection — what to monitor

Skew         | Detection
Schema       | Compare the feature schema at train time vs request time; alert on any diff
Distribution | KL divergence or PSI (Population Stability Index) per feature, train vs serving, computed daily
Concept      | Online metric (CTR, conversion) trending below the offline-predicted value
Computation  | Replay logged serving requests through the training transform; assert the feature values are byte-equal

PSI — the production drift metric

# Population Stability Index — bucketed feature distribution shift
import numpy as np

def psi(train_dist, serving_dist, eps=1e-6):
    """train_dist / serving_dist: per-bucket proportions (each sums to 1)."""
    return sum(
      (s - t) * np.log((s + eps) / (t + eps))
      for t, s in zip(train_dist, serving_dist)
    )

# PSI < 0.1   : no significant drift
# 0.1 - 0.25  : moderate drift, investigate
# > 0.25      : major drift, retrain
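In production, the inputs to `psi` are per-bucket proportions built from raw feature samples. A standalone sketch of the full loop, with bin edges fixed from the training sample and `psi` repeated so the snippet runs on its own:

```python
import numpy as np

def psi(train_dist, serving_dist, eps=1e-6):
    # Same PSI as above, repeated so this snippet is self-contained.
    return sum(
        (s - t) * np.log((s + eps) / (t + eps))
        for t, s in zip(train_dist, serving_dist)
    )

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)     # feature at training time (synthetic)
serving = rng.normal(1.0, 1.0, 10_000)   # same feature in production, shifted one sigma

# Bucket BOTH samples on bin edges fixed from the training distribution,
# then normalize counts to proportions — PSI compares proportions, not counts.
edges = np.histogram_bin_edges(train, bins=10)
edges[0], edges[-1] = -np.inf, np.inf    # catch serving values outside the train range
t_dist = np.histogram(train, bins=edges)[0] / len(train)
s_dist = np.histogram(serving, bins=edges)[0] / len(serving)

print(psi(t_dist, s_dist) > 0.25)        # True — major drift for a one-sigma mean shift
```

Fixing the bin edges from the training sample matters: re-bucketing each side on its own quantiles would hide exactly the shift you are trying to detect.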

Mitigation — the four moves

  1. Same code path — use the feature store's serving SDK, not a re-implementation.
  2. Shadow traffic logging — log served features; replay through training pipeline; assert byte-equal.
  3. Continuous retraining — daily/weekly retrains keep the model close to current distribution.
  4. Automatic drift alarm — page the team when PSI > 0.25 sustained.

The "training pipeline ≠ serving pipeline" smell

Diagnostic. If your team has a "train Spark job" and a "serve Python service" and the feature transforms exist in both, you have skew. Detect it: pick a recent training row, replay the user_id through the serving pipeline at the training timestamp, compare feature values. The diff is your skew. Fix it once, in the feature store.
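The diagnostic above can be sketched in a few lines; `compute_features_training` and `compute_features_serving` are hypothetical stand-ins for the Spark job and the Python service:

```python
# Replay-and-compare sketch (illustrative): run the same entity's events
# through both code paths and diff the outputs. The diff IS the skew.
def compute_features_training(user_events):
    # "Spark-side" logic: missing values are dropped before averaging
    amounts = [e["amount"] for e in user_events if e["amount"] is not None]
    return {"avg_order_value": sum(amounts) / len(amounts)}

def compute_features_serving(user_events):
    # "Service-side" logic: missing values silently treated as zero —
    # a re-implementation that drifted
    amounts = [e["amount"] or 0.0 for e in user_events]
    return {"avg_order_value": sum(amounts) / len(amounts)}

def skew_diff(events):
    train = compute_features_training(events)
    serve = compute_features_serving(events)
    return {k: (train[k], serve[k]) for k in train if train[k] != serve[k]}

events = [{"amount": 10.0}, {"amount": None}, {"amount": 20.0}]
print(skew_diff(events))   # {'avg_order_value': (15.0, 10.0)}
```

In a real pipeline this comparison runs in CI against logged production requests, so a drifted null policy fails a build instead of a model.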
Senior signal. "Training/serving skew kills more models than any other cause. Four flavors — schema, distribution, concept, computation — each detected differently. PSI per feature on a daily schedule for distribution drift; replay-and-compare for computation skew. Mitigation is single source of truth — the feature store's serving SDK, not a re-implementation. Most 'good in eval, bad in prod' bugs are computation skew."
§ 04 — Rollouts

Model versioning & rollouts (shadow → canary → A/B → 100%).

You don't deploy a new model to 100% on day one. The four-stage rollout is the safety contract — every step has a failure boundary.

The four stages

Stage  | Traffic %                                        | What gets validated                                  | Time at this stage
Shadow | 0% (new model scores requests but doesn't serve) | Latency, error rate, prediction distribution         | 1-7 days
Canary | 1-5%                                             | Live online metrics (CTR, conversion) on small slice | 3-7 days
A/B    | 50/50 (proper experiment)                        | Statistical significance vs old model                | 1-4 weeks (until stat-sig)
100%   | 100%                                             | Long-term stability, drift, retention impact         | Until next model

Shadow — the cheap safety net

The new model receives every request and produces a prediction, but the old model's output is what the user sees. The new prediction is logged alongside actual user behavior. Shadow detects:

  • Latency regression (new model takes 200ms; old takes 50ms)
  • Error rate (new model crashes on certain inputs)
  • Prediction distribution skew (new model heavily favors one class)
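A minimal shadow-mode sketch. The models here are hypothetical lambdas, and a real check would compare full distributions (e.g. with a KS test), not just means:

```python
import random
from statistics import mean

# Shadow mode: the old model serves, the new model scores the same requests
# into a log, and a daily job compares the two prediction distributions
# before any user ever sees the new model.
def serve_with_shadow(request, old_model, new_model, shadow_log):
    served = old_model(request)                        # this is what the user sees
    shadow_log.append((served, new_model(request)))    # new model scored, never served
    return served

old_model = lambda r: 0.1 + 0.8 * r["score"]   # hypothetical stand-in
new_model = lambda r: 0.9                      # degenerate: heavily favors one class

shadow_log = []
random.seed(0)
for _ in range(1_000):
    serve_with_shadow({"score": random.random()}, old_model, new_model, shadow_log)

old_mean = mean(p for p, _ in shadow_log)
new_mean = mean(p for _, p in shadow_log)
print(abs(new_mean - old_mean) > 0.2)   # True — flag the skew before canary
```

Because shadow serves 0% of traffic, this catch is free: no user ever saw the degenerate model's output.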

Canary — the small-blast-radius online test

1-5% of real users see the new model's predictions. Watch:

  • Online metric (CTR, conversion) — directional only; not yet stat-sig
  • User complaints / support tickets
  • Downstream system load (does the new model cause spikes elsewhere?)

A/B — the proper experiment

50/50 (or 90/10 if risk-averse). Run until statistical significance on the primary metric. Pre-register:

  • Primary metric — the one you'll commit on (e.g., 7-day retention)
  • Guardrail metrics — must not regress (e.g., latency P99, support contact rate)
  • Sample size — pre-computed for desired power (typically 80%) and minimum detectable effect (e.g., +1pp CTR)
  • Stop conditions — early-stop on guardrail breach; otherwise wait for full sample
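The sample-size pre-computation can be sketched with the standard normal approximation for two proportions. The numbers are hypothetical: 5% baseline CTR, +1pp minimum detectable effect, two-sided alpha 0.05, 80% power:

```python
from statistics import NormalDist

# Pre-registering n: normal approximation for a two-proportion test.
def n_per_arm(p_base, mde, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_new = p_base + mde
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    return int((z_a + z_b) ** 2 * var / mde ** 2) + 1

print(n_per_arm(0.05, 0.01))   # on the order of 8,000 users per arm
```

Two senior-sounding facts fall out of the formula: required n scales with the inverse square of the MDE (halving the detectable effect quadruples the sample), and low baseline rates need fewer users per point of lift than high ones.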

Model registry — the version-of-record

Tool                                 | Strength
MLflow Model Registry                | Open-source; tracks artifacts + metadata + stage transitions
Weights & Biases / Comet             | Experiment tracking + lineage; commercial
Vertex AI / SageMaker Model Registry | Cloud-native; integrated with deployment

What every model version must store

model_v117:
  model_artifact:    s3://models/recsys/v117.pkl
  training_run_id:   wandb-run-abc123
  training_data:     s3://snapshots/2024-09-01/train.parquet  (frozen)
  feature_view:      user_features_v3                          (versioned!)
  hyperparameters:   { learning_rate: 0.001, n_layers: 4, ... }
  offline_metrics:   { auc: 0.81, recall@10: 0.42 }
  deployed_at:       2024-09-12T10:00:00Z
  rolled_back_from:  null   (or v116 if rollback)

Rollback — the must-have

Production model crashes / regresses. The rollback path must be: one config flip, < 60 seconds to revert. Two patterns:

  1. Blue-green — old model still loaded; flip routes between v117 and v116.
  2. Shadow keep-alive — old model continues to score in shadow during canary; instant revert if needed.
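A blue-green router in miniature. The models are hypothetical stand-ins; the point is that rollback touches one field and reloads nothing:

```python
# Blue-green rollback sketch (illustrative): both model versions stay loaded;
# "rollback" is a single config flip — no redeploy, no model reload.
class ModelRouter:
    def __init__(self, models, active):
        self.models = models        # version -> loaded model, ALL kept warm
        self.active = active        # the one config value a rollback flips

    def predict(self, request):
        return self.models[self.active](request)

    def rollback(self, to_version):
        self.active = to_version    # < 60 seconds, because nothing reloads

router = ModelRouter(
    models={"v116": lambda r: 0.4, "v117": lambda r: 0.7},  # hypothetical stand-ins
    active="v117",
)
print(router.predict({}))   # 0.7 — new model serving
router.rollback("v116")
print(router.predict({}))   # 0.4 — instant revert
```

The cost of this pattern is double the memory (two models resident); the alternative, reloading v116 from the registry under incident pressure, is how 60 seconds becomes 60 minutes.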
Senior signal. "Four-stage rollout: shadow → canary → A/B → 100%, with a model registry holding artifact + training-data snapshot + feature-view version + hyperparameters per version. Rollback is one config flip, < 60 seconds. Pre-register primary + guardrail metrics + sample size before A/B starts — peeking at the experiment is how teams convince themselves of fake wins."
§ 05 — Metric divergence

Online vs offline metrics — why they disagree.

The most disorienting moment in ML: offline AUC up 5%, online CTR down 2%. Either the offline metric is the wrong proxy, or the system is doing something different in production. Both are common.

Why they diverge — the four reasons

Reason                      | Description
Selection bias in eval data | Offline data only contains items the OLD model surfaced — the new model is evaluated on a self-selected set
Feedback-loop bias          | The old model's outputs influenced past user behavior; future labels are conditioned on the old model
Distribution shift          | Training data is N days old; the live traffic distribution has moved
Wrong metric                | Offline AUC measures global ranking quality; online CTR depends on the top of the list — related but not the same quantity

Selection bias — the silent killer

Old recommender showed 10 items per page. Users clicked some, ignored others. Training data has labels only for shown items. New model trained on that data is evaluated on... the same shown items. It's never seen the items the old model didn't show. Online, the new model recommends some of those unshown items — and we have no idea how they'll perform.

Counterfactual evaluation — the fix

Reserve 5-10% of traffic for random or holdout recommendations. This data is unbiased — no model selection. Train on it (or use it to weight the rest). The TikTok FYP design (Scenario 14 in Data Modeling) does exactly this with fct_exploration_event.

Picking the right offline metric

Online metric           | Best offline proxy                      | Trap
CTR                     | Top-K accuracy or NDCG@K                | AUC measures ranking quality globally; CTR depends on the top of the list — different things
Conversion rate         | Pointwise calibration + uplift modeling | Predicted probability ≠ actual rate; calibration matters
Watch time / engagement | Mean watch time on a holdout            | Heavily censored — many events have no observed completion
Long-term retention     | Lift on 7-day return                    | Multi-day signal; can't be measured offline at all without simulation
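NDCG@K in a few lines, showing why it rewards the top of the list where a global metric would not:

```python
import math

# NDCG@K — the offline proxy for CTR when the top of the list is what matters.
# rels: relevance of each recommended item, in the ranked order served.
def dcg_at_k(rels, k):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Same items, same set of relevant results, different order at the top:
print(round(ndcg_at_k([1, 0, 0, 1, 0], k=3), 3))   # 0.613 — relevant item ranked first
print(round(ndcg_at_k([0, 0, 1, 1, 0], k=3), 3))   # 0.307 — relevant items buried
```

Both rankings contain the same two relevant items, so a position-blind metric scores them identically; NDCG@3 halves the second one. That gap is the CTR story the table is warning about.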

When online metric goes negative — the diagnostic

  1. Check guardrails — did anything regress (latency, error rate)?
  2. Slice by user segment — is the regression concentrated (e.g., new users) or uniform?
  3. Check exploration data — does the new model perform worse on unbiased data too, or just on biased eval?
  4. Compare to shadow logs — is the new model's prediction distribution different from old?
  5. If all 4 are clean, suspect concept drift between training and serving.
Senior signal. "Offline-online divergence is almost always one of: selection bias in eval (the old model picked the test set), feedback-loop bias, distribution shift, or wrong metric. The fix is exploration data — 5-10% random traffic — for unbiased evaluation. Without it, you can't tell whether the new model is better or just different."
§ 06 — Causality

A/B testing & incrementality — the causal layer.

The hardest interview reframe: attribution ≠ causality. "We attribute $50M of revenue to recommendations" is not the same as "recommendations CAUSED $50M of revenue". Most of those purchases would have happened anyway. Senior MLE separates the two.

Attribution vs incrementality — the definitions

Concept        | Definition
Attribution    | Of all conversions, how many touched the model's output? (e.g., last-click, multi-touch)
Incrementality | Of all conversions, how many would NOT have happened without the model?

Incrementality is always lower — often 30-50% of attribution. The gap is the model's actual value.

Measuring incrementality — three methods

Method           | Setup                                                                                                              | Pros / cons
Holdout (best)   | A random subset of users sees no model output (or the default version); compare their conversion vs treated users  | Causal by construction; expensive (foregone revenue on the holdout)
Geo / time split | Roll the model out in some markets / weeks first; compare                                                          | No within-user leakage; weaker control for confounds
Causal modeling  | Estimate from observational data with ML / econometrics                                                            | No revenue cost; many assumptions (selection, confounding)
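The attribution-vs-incrementality gap, with hypothetical holdout numbers:

```python
# Holdout incrementality sketch (all numbers hypothetical):
# attribution counts every conversion that touched a recommendation;
# incrementality is the lift over the holdout's organic conversion rate.
treated_users, treated_conversions = 1_000_000, 50_000   # 5.0% convert, all "attributed"
holdout_users, holdout_conversions = 100_000, 3_500      # 3.5% convert with NO model

treated_rate = treated_conversions / treated_users
organic_rate = holdout_conversions / holdout_users

incremental = (treated_rate - organic_rate) * treated_users   # conversions the model CAUSED
attributed = treated_conversions                              # conversions the model TOUCHED

print(round(incremental))                   # 15000
print(round(incremental / attributed, 2))   # 0.3 — incrementality is ~30% of attribution
```

A last-click report would claim all 50,000 conversions; the holdout says the model caused 15,000. The other 35,000 would have happened anyway.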

A/B test — the standard tool

The 50/50 A/B test with random assignment is the gold standard. Pre-register everything:

experiment:
  name:                 recsys_v117_vs_v116
  hypothesis:           "v117 will lift 7-day return by ≥ 1pp"
  primary_metric:       7d_return_rate
  guardrail_metrics:
    - latency_p99 (must not increase > 5%)
    - error_rate  (must not increase > 0.1pp)
    - support_contact_rate
  treatment_split:      50/50
  randomization_unit:   user_id
  pre-registered_n:     50,000 users per arm  (80% power for +1pp MDE)
  duration:             14 days minimum (capture weekly cycles)
  stop_conditions:
    - guardrail breach > 1 day
    - primary metric stat-sig at p < 0.01 (Bonferroni-corrected for interim looks)

Common A/B mistakes

Mistake                                                                          | Why it's wrong
Peeking — checking results daily and stopping when "significant"                 | Inflates the false-positive rate; the nominal p-value lies
Sample-ratio mismatch — actual treatment % drifts from 50%                       | Randomization is broken; results are invalid
Novelty effect — measuring during the first week                                 | Users react to "new" regardless of quality; re-measure at 4 weeks
Heterogeneous treatment effects — averaging hides segment-specific harm          | +5% on power users, -3% on new users; net positive, but the new-user damage is real
Post-treatment selection — comparing only users who "engaged" with treatment     | Selects on the dependent variable; biased comparison
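The sample-ratio-mismatch check is a one-function test: a two-sided z-test of the observed split against the intended 50/50, stdlib only:

```python
from statistics import NormalDist

# SRM check sketch: a tiny p-value here means the randomization itself is
# broken — stop reading the metrics, find the assignment bug.
def srm_p_value(n_treatment, n_control, expected=0.5):
    n = n_treatment + n_control
    se = (expected * (1 - expected) / n) ** 0.5
    z = (n_treatment / n - expected) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(srm_p_value(50_210, 49_790) > 0.05)   # True — 50.2/49.8 at n=100k is just noise
print(srm_p_value(51_500, 48_500) < 1e-6)   # True — 51.5/48.5 at n=100k is broken
```

The counterintuitive part: a 1.5pp imbalance looks tiny, but at 100k users it is many standard errors from 50/50, which is why SRM is checked automatically rather than eyeballed.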

CUPED — variance reduction in A/B

For metrics with high variance (purchase amount, watch time), CUPED reduces required sample size by 30-50% by adjusting for pre-experiment user covariates:

Y_adjusted = Y - θ * (X_pre - mean(X_pre))
# Y     = post-experiment metric
# X_pre = same metric measured in the pre-experiment period
# θ     = cov(Y, X_pre) / var(X_pre) — the OLS slope of Y on X_pre

# Compare Y_adjusted across arms instead of Y. Same statistical power, ~half the sample.
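End-to-end CUPED on synthetic data, with θ computed as the slope of Y on the pre-period covariate:

```python
import numpy as np

# CUPED sketch on synthetic data: the adjustment shrinks variance without
# moving the mean, so arm comparisons stay unbiased but need fewer users.
rng = np.random.default_rng(42)
n = 20_000
x_pre = rng.normal(100, 30, n)              # pre-experiment spend per user (synthetic)
y = 0.8 * x_pre + rng.normal(0, 10, n)      # post spend, strongly predicted by pre

theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre)   # OLS slope of Y on X_pre
y_adj = y - theta * (x_pre - x_pre.mean())

print(np.var(y_adj) / np.var(y) < 0.2)      # True — large variance reduction here
print(abs(y_adj.mean() - y.mean()) < 1e-9)  # True — the mean is untouched
```

The reduction scales with how well X_pre predicts Y; this synthetic covariate is unusually predictive, which is why the shrink is larger than the 30-50% typical of real metrics.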

When A/B isn't possible — switch-back / time-series

Some scenarios prevent random A/B:

  • Network effects (your "treatment" affects untreated users via friend graph)
  • Marketplace dynamics (treatment changes inventory available to control)
  • Two-sided platforms (driver-side change affects rider-side metric)

Use switch-back tests (alternate treatment / control by time period) or geo-split tests (random by region). Acknowledge confounds explicitly.

Senior signal. "Attribution measures the model's contact with conversions; incrementality measures conversions that wouldn't have happened without it. Always 30-50% lower. Gold standard is a 5-10% holdout — expensive but causal. A/B tests with pre-registered primary + guardrails + sample size are the standard tool; CUPED cuts required sample in half. Network-effect domains need switch-back or geo-split, not user-level random."
§ 07 — Articulation

The 90-second articulation script.

▸ THE 90-SECOND SCRIPT

"For an ML system, I'd shape it as the standard five-box pipeline plus a feature store in the middle and a training loop on the side. The feature store guarantees training/serving consistency — same feature definition produces the same value at training time and at request time. Offline store on S3 / Iceberg for bulk training scans; online store on Redis / DynamoDB for sub-10ms point lookups."

"Training-data correctness: point-in-time joins are non-negotiable — features as-of-label-timestamp, not as-of-now. Naive 'look up features now' joins leak the future and inflate offline metrics."

"Training/serving skew is the #1 cause of 'good in eval, bad in prod'. Four flavors — schema, distribution, concept, computation — detected by feature-schema diff, PSI per feature, online-vs-offline metric divergence, and replay-byte-equality. Mitigation is single source of truth — feature store's serving SDK, never a re-implementation."

"Model rollouts go through four stages: shadow (0% traffic, validate latency / errors), canary (1-5%, real users on small slice), A/B (50/50 with pre-registered primary + guardrails), 100%. Model registry holds artifact + training data snapshot + feature view version + hyperparameters per version. Rollback is < 60 seconds via blue-green."

"Online vs offline metrics: divergence is normal. Causes are selection bias in eval, feedback-loop bias, distribution shift, or wrong proxy metric. Fix with 5-10% exploration traffic — random / holdout — for unbiased evaluation."

"Causality: attribution ≠ incrementality. Most attributed conversions would happen anyway. Gold standard for incrementality is a 5-10% holdout; switch-back or geo-split when user-level A/B isn't possible (network effects, marketplaces). Pre-register primary + guardrail + sample size; CUPED for variance reduction."

"Two risks I'd flag and mitigate first: training-serving computation skew (replay-and-compare in CI), and selection-bias-driven metric divergence (exploration table from day one). The system without these two is a ticking clock."

Three sentences that signal seniority — in any MLE round

  1. "Point-in-time correct training joins are the difference between a model that ships and a model that lies in offline eval."
  2. "Training/serving skew is computational at root — same code path, single source of truth, or you're rebuilding the same logic twice and they will drift."
  3. "Attribution measures association; incrementality measures causation. The 5-10% holdout cost is the price of knowing which one your model produces."
· · ·
▸ Seven sections · the patterns hold across domains · go well.