▸ INTERVIEW SKILLS · the war stories

Production incident scenarios.

Every senior interview has a variant of "Tell me about a time you dealt with a production incident." What interviewers are really probing: how you diagnose under pressure, how you communicate, and what you learned. Here are 8 real-pattern scenarios with full diagnosis paths.

What interviewers are actually probing

THE HIDDEN RUBRIC

When an interviewer asks about a production incident, they're not looking for drama. They're grading on four signals:

  1. Diagnosis speed. Did you know where to look, or did you thrash? Engineers who've seen real incidents know the first 10 minutes — check the obvious things first, confirm the blast radius, stop the bleeding before root-causing.
  2. Communication quality. Did you keep stakeholders informed without overwhelming them? Senior engineers manage information flow as carefully as they manage the incident itself.
  3. Blamelessness. Did you own it, or deflect? The best incident stories name your own decisions that contributed — and what you changed.
  4. Prevention thinking. What in the system allowed this to break, not just what broke? The answer should always include systemic changes, not just the fix.
What tanks incident answers: Vague timelines ("it took a while to figure out") · Blaming upstream teams · No concrete numbers (how bad was it? how long?) · Jumping to the fix without diagnosing · No postmortem or prevention step.

The STAR+T framework for incidents

STRUCTURE YOUR ANSWER
Beat | Content | Time
Situation | What system, what was normal, what broke. One sentence each. | 20 sec
Task | Your role in the incident. Incident commander? On-call? Escalated in? | 10 sec
Action | Your actual diagnosis path: what you checked, when, why. This is the meat. | 60–90 sec
Result | Time to resolution, business impact, SLA outcome. Numbers required. | 20 sec
+Takeaway | What systemic change came out of it. This is what separates L5 from L6. | 20 sec
· · ·

8 Incident scenarios

REAL PATTERNS · WORKED THROUGH
P0 · Silent data loss in the revenue pipeline · Fintech / Stripe-pattern
The scenario
Finance notices the daily revenue reconciliation is $2.3M short. Your Kafka-to-Snowflake pipeline has been running green for 3 days. No alerts fired. The discrepancy covers 72 hours of production data.
Diagnosis path
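  • T+0 Confirm the blast radius. Compare Snowflake row counts against the number of messages produced to the Kafka topic (offset deltas), hour by hour; consumer lag alone shows what was read, not what landed. Find the divergence timestamp.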
  • T+8m Check consumer group lag. If lag is zero, the consumer processed the messages — the issue is downstream (transformation or load). If lag is nonzero, the consumer stalled.
  • T+15m Query the Snowflake load history. Look for failed COPY commands, partial loads, or microbatch gaps. Cross-reference with your Airflow DAG run history.
  • T+25m Find the gap: a schema change on a source table (new nullable column) caused the VARIANT parse to silently return NULL rather than fail. Rows loaded with amount = NULL instead of the real value.
  • T+40m Patch: replay from Kafka offset at the divergence point, fix the parser, backfill the 72-hour window.
Prevention
Added null-rate monitors on all financial columns — alert if null rate on amount changes by more than 0.1% relative to the prior 24h. Added schema evolution tests in CI that catch nullable column additions. Added a reconciliation check (row count + sum) to every DAG run, not just the nightly batch.
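A minimal sketch of the null-rate monitor described above. It assumes "0.1%" means an absolute shift of 0.1 percentage points against the prior-24h baseline; the function names and sample values are hypothetical, and a real monitor would read both windows from the warehouse.

```python
from typing import Optional, Sequence


def null_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of values that are None; 0.0 for an empty batch."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def null_rate_alert(
    current: Sequence[Optional[float]],
    baseline: Sequence[Optional[float]],
    max_delta: float = 0.001,  # assumed reading: 0.1 percentage points
) -> bool:
    """True when the null rate moved by more than max_delta versus the baseline."""
    return abs(null_rate(current) - null_rate(baseline)) > max_delta


# Hypothetical sample windows for the amount column.
baseline_amounts = [10.0, 25.5, 7.25, 99.0]  # prior 24h: no nulls
current_amounts = [12.0, None, None, 40.0]   # after the schema change
fired = null_rate_alert(current_amounts, baseline_amounts)
```

Comparing against a rolling baseline rather than a fixed threshold keeps the check usable on columns that are legitimately nullable some of the time.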
What the interviewer asks next
"Why didn't your monitoring catch this?" — They want to hear you acknowledge the monitoring gap and describe a systemic fix, not just "we added an alert."
P0 · Cascading failure from a bad dbt model promotion · Analytics platform
The scenario
A dbt model change went to production on Friday at 4pm. By Saturday morning, 40 downstream dashboards show wrong numbers. The change was a DISTINCT that was removed from a CTE — 10x fanout on a join made aggregate metrics look like growth was 900% WoW.
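If the fanout mechanism is hard to picture, it can be reproduced on a toy dataset. The sketch below uses an in-memory SQLite database with made-up tables (orders, customer_events); it is not the model from the incident, only an illustration of how dropping DISTINCT from a CTE inflates an aggregate after a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customer_events (customer_id INTEGER, segment TEXT);
    INSERT INTO orders VALUES (1, 100, 50.0), (2, 101, 30.0);
    -- customer 100 has three identical event rows, customer 101 has one
    INSERT INTO customer_events VALUES
        (100, 'pro'), (100, 'pro'), (100, 'pro'), (101, 'free');
""")

QUERY = """
    WITH segments AS (
        SELECT {distinct} customer_id, segment FROM customer_events
    )
    SELECT COUNT(*), SUM(o.amount)
    FROM orders o JOIN segments s USING (customer_id)
"""


def revenue(distinct: bool) -> tuple:
    """Return (joined row count, total amount) with or without DISTINCT."""
    sql = QUERY.format(distinct="DISTINCT" if distinct else "")
    return conn.execute(sql).fetchone()


with_distinct = revenue(True)      # one segment row per customer
without_distinct = revenue(False)  # duplicate rows fan out the join
```

With DISTINCT the join returns 2 rows totalling 80.0; without it, 4 rows totalling 180.0. The code diff is one word, which is why the reviewer has to look at row counts as well as logic.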
Diagnosis path
  • T+0 Identify the blast radius. Which dashboards? Which models? Run dbt ls --select affected_model+ to see all dependents (the trailing + selects downstream models; a leading + would select upstream parents instead).
  • T+10m Check dbt run history. Find the first run where row counts diverged — compare fct_orders row count before and after the deploy.
  • T+18m Isolate the commit. git log on the model. The DISTINCT removal is immediately visible in the diff.
  • T+25m Roll back the model. Push the revert, trigger a full dbt run. The 40 dashboards self-heal within the next refresh cycle.
  • T+2h Manually audit any reports that were pulled and shared externally during the window.
Prevention
Added row count and cardinality tests to every fact model in dbt; the tests block deployment if row count changes by more than 20%. Added a staging environment with a 24-hour bake period before production promotion. Added a no-Friday-afternoon-deploys rule to the runbook.
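The 20% row-count gate boils down to a small comparison. This is a plain-Python sketch of the logic, not dbt's test API; the function names are hypothetical and the counts would come from the warehouse before and after the candidate build.

```python
def row_count_change(before: int, after: int) -> float:
    """Relative change in row count; infinite if an empty table gained rows."""
    if before == 0:
        return 0.0 if after == 0 else float("inf")
    return abs(after - before) / before


def should_block_deploy(before: int, after: int, max_change: float = 0.20) -> bool:
    """Block promotion when the row count moved by more than max_change."""
    return row_count_change(before, after) > max_change


# A 10x fanout on a 1M-row fact table is blocked; 5% organic growth is not.
fanout_blocked = should_block_deploy(1_000_000, 10_000_000)
growth_blocked = should_block_deploy(1_000_000, 1_050_000)
```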
What the interviewer asks next
"How did this get through code review?" — Answer honestly. Usually: the reviewer looked at the logic but not the data impact. What changed: added data validation as a mandatory PR checklist item, not just code review.
P0 · Kafka consumer lag spike: real-time pipeline collapses · Streaming platform / Netflix-pattern
The scenario
Consumer lag on your event ingestion topic goes from 0 to 14 million messages in 20 minutes. Your Flink job is processing but not keeping up. SLA for freshness is 5 minutes; you're now at 47 minutes and climbing. Product teams are paging you.
Diagnosis path
  • T+0 Check producer throughput. Is this a traffic spike (normal) or a consumer slowdown (abnormal)? Kafka topic in/sec vs consumer out/sec in your metrics dashboard.
  • T+5m Traffic spike confirmed: a new product event (push notification batch) sent 50M events in 8 minutes — 10× the normal peak.
  • T+8m Check Flink task manager metrics — CPU at 100%, GC pressure. The job is not crashing but can't keep up with this volume.
  • T+12m Increase Flink parallelism by 3×. Scale up task managers. Consumer lag starts decreasing within 6 minutes.
  • T+38m Lag returns to zero. Total gap: 67 minutes of data was delayed, not lost — Kafka retained everything.
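Before committing to a scaling factor, it helps to estimate how long the backlog will take to drain. The sketch below is back-of-envelope arithmetic only: the rates are hypothetical, and it assumes throughput scales linearly with parallelism, which real Flink jobs rarely do exactly.

```python
def drain_minutes(lag: int, consume_per_min: float, produce_per_min: float) -> float:
    """Minutes until lag reaches zero; infinite if consumers are not gaining."""
    if lag <= 0:
        return 0.0
    net = consume_per_min - produce_per_min
    if net <= 0:
        return float("inf")
    return lag / net


# Hypothetical rates: producers at 500K msgs/min, consumers at 400K msgs/min.
before_scaling = drain_minutes(14_000_000, 400_000, 500_000)
# Assume 3x parallelism gives 3x consumer throughput (1.2M msgs/min).
after_scaling = drain_minutes(14_000_000, 1_200_000, 500_000)
```

Under these assumed rates the backlog never drains before scaling and takes 20 minutes after it. Quoting an estimate like this to the product teams paging you is part of the communication signal.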
Prevention
Added an auto-scaling policy on the Flink cluster, triggered when consumer lag exceeds 500K messages. Added a traffic spike coordination process: product teams must notify data infra before batch notification sends over 10M events. Added a second alert tier at 10 minutes of staleness, on top of the alert for breaching the 5-minute freshness SLA.
What the interviewer asks next
"What would you have done if scaling wasn't enough?" — Backpressure strategy: drop non-critical events, prioritize revenue-critical topics, shed load at the producer if necessary. Shows you think in layers.
P1 · SCD Type 2 dimension corrupted: history overwritten · E-commerce / Amazon-pattern
The scenario
A data engineer ran a backfill job on dim_customer without setting the is_current flag correctly. All historical rows now have is_current = TRUE. Point-in-time queries for the last 6 months of customer attributes return wrong results. 14 downstream models are affected.
Diagnosis path
  • T+0 Confirm: SELECT customer_id, COUNT(*) FROM dim_customer WHERE is_current = TRUE GROUP BY 1 HAVING COUNT(*) > 1. If any customer has more than one "current" row, it's confirmed.
  • T+10m Assess recovery options: (a) restore from last clean snapshot, (b) re-derive the correct SCD2 state from the source transaction log.
  • T+20m Option (a): restore dim_customer from the pre-backfill snapshot. Replay the 3 hours of legitimate changes that happened between the snapshot and now from the CDC log.
  • T+90m Dimension restored. Trigger downstream model refreshes. Validate with row count + unique key checks.
Prevention
Added a pre-commit hook on any DML touching dim_* tables that requires review sign-off. Added a data contract test: "for each customer_id, exactly one row has is_current = TRUE" — runs on every load. Added snapshot retention policy (72h minimum) so rollback is always available.
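The data contract can be written as a single query. The sketch below runs it against an in-memory SQLite table with made-up rows; it assumes is_current is stored as 0/1. Unlike the confirmation query in the diagnosis path, it also catches customers with no current row at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER, tier TEXT, is_current INTEGER);
    INSERT INTO dim_customer VALUES
        (1, 'basic', 0), (1, 'gold', 1),   -- healthy: one current row
        (2, 'basic', 1), (2, 'gold', 1),   -- corrupted: two current rows
        (3, 'basic', 0);                   -- corrupted: no current row
""")

# Contract: for each customer_id, exactly one row has is_current = 1.
CONTRACT_SQL = """
    SELECT customer_id, SUM(is_current) AS current_rows
    FROM dim_customer
    GROUP BY customer_id
    HAVING SUM(is_current) <> 1
    ORDER BY customer_id
"""

violations = conn.execute(CONTRACT_SQL).fetchall()
contract_holds = not violations
```

An empty result means the contract holds; here customers 2 and 3 are returned as violations.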
What the interviewer asks next
"How did you communicate this to downstream teams?" — They want to see triage communication: immediate alert with blast radius assessment, ETA to resolution, a post-incident report. Not silence followed by a fix.
P1 · Query cost explosion: Snowflake bill 40× normal · SaaS / Snowflake-heavy
The scenario
Overnight your Snowflake credits burned 40× the daily baseline. Finance pages the data team at 8am. A $180K monthly budget is now $12K into the first week. No obvious pipeline changes were deployed.
Diagnosis path
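  • T+0 Query QUERY_HISTORY for the last 24 hours and rank the top 10 queries by warehouse time (TOTAL_ELAPSED_TIME). CREDITS_USED_CLOUD_SERVICES covers only cloud services credits, not warehouse compute, so use WAREHOUSE_METERING_HISTORY to confirm which warehouse burned the credits.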
  • T+8m One query consumed 92% of the total: a new analyst wrote a self-join on a 4B-row table without a filter. The query ran 47 times via a scheduled BI tool refresh.
  • T+12m Kill any still-running instances. Suspend the offending warehouse. Fix the query (add date filter + limit self-join to the CTE).
  • T+20m Enable resource monitors with a credit limit and suspend-on-exceed policy. Should have been there from day one.
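The "one query, run 47 times" finding usually only shows up once you group the history by query text instead of looking at individual runs. A rough sketch of that aggregation follows, using hypothetical (query label, elapsed seconds) rows rather than real Snowflake output, with elapsed time as a stand-in for compute cost.

```python
from collections import defaultdict

# Hypothetical history rows: (query label, elapsed seconds).
history = [
    ("self_join_events", 3600.0),
    ("self_join_events", 3500.0),
    ("daily_kpis", 40.0),
    ("self_join_events", 3700.0),
    ("orders_by_region", 160.0),
]


def top_offenders(rows, limit=10):
    """Group runs by query and rank by total elapsed time.

    Returns (query, run_count, total_seconds, share_of_total) tuples.
    """
    totals = defaultdict(lambda: [0, 0.0])  # query -> [runs, total seconds]
    for query, seconds in rows:
        totals[query][0] += 1
        totals[query][1] += seconds
    grand_total = sum(t[1] for t in totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [
        (query, runs, secs, secs / grand_total if grand_total else 0.0)
        for query, (runs, secs) in ranked[:limit]
    ]


worst = top_offenders(history)[0]
```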
Prevention
Resource monitors on every warehouse with hard credit limits and email alerts at 50%/80%/100%. Query governance policy: analysts cannot run unfiltered full-table scans above 100M rows without a reviewer sign-off. Added a cost estimation step to the BI tool's query queue for queries above a threshold.
What the interviewer asks next
"How did you prevent blaming the analyst?" — Blameless postmortem: the analyst didn't know. The system had no guardrails. You owned the systemic failure.
P1 · Late data from a third-party source breaks ML feature freshness · ML platform
The scenario
Your feature store relies on a third-party credit bureau feed that arrives at 2am. Today it arrived at 9:15am. Your fraud model has been scoring transactions for 7 hours using 24-hour-old features. Risk team needs to know the blast radius before market open.
Diagnosis path
  • T+0 Quantify: how many transactions were scored in the 7-hour window? What's the expected model accuracy degradation with 24h-old credit features vs. 2h-old? Pull from model evaluation data.
  • T+15m Risk decision: the accuracy delta is 1.8% — below the risk team's threshold of 5% for manual review. Document and move on.
  • T+30m Re-score transactions in the window using the now-arrived fresh features. Flag any decisions that would have changed. 3 high-value transactions get a second look.
  • T+60m Notify risk team with the full analysis. No escalation needed.
Prevention
Added freshness SLA monitoring on all third-party data sources with alerting at T+30min past expected arrival. Added a fallback policy: if credit features are stale beyond threshold, the model switches to a degraded-mode feature set that uses internal-only signals. Negotiated an SLA with the data vendor with financial penalties for late delivery.
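The fallback policy can be sketched as a freshness check that picks a feature set. The names ("full", "internal_only") and the 24h30m cutoff are assumptions for illustration: a daily feed is at most 24 hours old just before the next delivery, plus the 30-minute grace period from the alerting rule.

```python
from datetime import datetime, timedelta


def feature_set_for(
    last_arrival: datetime,
    now: datetime,
    max_age: timedelta = timedelta(hours=24, minutes=30),  # assumed cutoff
) -> str:
    """Pick 'full' features while the feed is fresh, else 'internal_only'."""
    return "full" if now - last_arrival <= max_age else "internal_only"


# Hypothetical dates: the last good feed landed at 2:00am the previous day.
last_feed = datetime(2024, 1, 1, 2, 0)
at_0220 = feature_set_for(last_feed, datetime(2024, 1, 2, 2, 20))  # within grace
at_0900 = feature_set_for(last_feed, datetime(2024, 1, 2, 9, 0))   # feed is late
```

In this sketch the model would still use the full feature set at 2:20am and switch to internal-only signals well before the 9:15am arrival in the scenario.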
What the interviewer asks next
"What if the accuracy delta had been 8%?" — Shows you know the decision tree: at what threshold do you halt scoring, route to human review, or revert to a simpler model?
P0 · Duplicate transactions in the financial ledger · Payments / Stripe-pattern
The scenario
A retry storm during a brief network partition caused your idempotency layer to fail. 14,000 transactions were written twice to the ledger. Reconciliation is off by $8.7M. The window is 11 minutes.
Diagnosis path
  • T+0 Confirm duplicates: SELECT transaction_id, COUNT(*) FROM ledger GROUP BY 1 HAVING COUNT(*) > 1. Verify the count.
  • T+5m Check idempotency key table. The partition caused a brief window where the cache missed and keys were re-processed.
  • T+15m Quarantine the duplicate rows (soft-delete with audit flag, never hard-delete financial data).
  • T+30m Re-run reconciliation. Confirm the $8.7M delta is now explained and resolved.
  • T+2h Legal and compliance notified per protocol. Written incident report submitted.
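The quarantine step can be sketched as an UPDATE that voids every copy except the earliest one. The schema below (entry_id, voided, void_reason) is hypothetical and runs on in-memory SQLite; the point is that no row is deleted and the active total reconciles afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ledger (
        entry_id INTEGER PRIMARY KEY,
        transaction_id TEXT,
        amount REAL,
        voided INTEGER DEFAULT 0,
        void_reason TEXT
    );
    INSERT INTO ledger (transaction_id, amount) VALUES
        ('tx-1', 100.0), ('tx-1', 100.0), ('tx-2', 40.0);
""")

# Keep the earliest entry per transaction_id; mark later copies as void.
conn.execute("""
    UPDATE ledger
    SET voided = 1, void_reason = 'duplicate write during retry storm'
    WHERE entry_id NOT IN (
        SELECT MIN(entry_id) FROM ledger GROUP BY transaction_id
    )
""")

active_total = conn.execute(
    "SELECT SUM(amount) FROM ledger WHERE voided = 0"
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
voided_count = conn.execute(
    "SELECT COUNT(*) FROM ledger WHERE voided = 1"
).fetchone()[0]
```

All three rows are still in the table, one is flagged void with a reason, and the active total is back to 140.0.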
Prevention
Moved idempotency key storage from cache-only to a durable database with cache-aside pattern. Added a reconciliation check that runs every 15 minutes, not just nightly. Added circuit breaker on the retry logic: after 3 retries, route to a dead-letter queue for manual review rather than continuing to retry.
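One way to make idempotency survive a cache miss is to let a durable unique constraint be the source of truth. The sketch below uses SQLite as a stand-in for the durable store; the table and function names are hypothetical, and a production version would put a cache in front of the key lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE idempotency_keys (idem_key TEXT PRIMARY KEY);
    CREATE TABLE ledger (transaction_id TEXT, amount REAL);
""")


def apply_once(idem_key: str, transaction_id: str, amount: float) -> bool:
    """Write the ledger entry only if this key has never been seen."""
    try:
        with conn:  # one transaction: key claim and ledger write commit together
            conn.execute(
                "INSERT INTO idempotency_keys (idem_key) VALUES (?)", (idem_key,)
            )
            conn.execute(
                "INSERT INTO ledger VALUES (?, ?)", (transaction_id, amount)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate key: a retry of an already-applied request


# Simulate a retry storm: the same request arrives three times.
results = [apply_once("key-1", "tx-1", 100.0) for _ in range(3)]
ledger_rows = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
```

Because the key insert and the ledger insert share a transaction, a retry either sees the key and is rejected, or neither write happened.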
What the interviewer asks next
"Why did you soft-delete instead of hard-delete?" — Audit trail. Financial data must be immutable — you mark it void, you never delete it. Shows regulatory awareness.
P2 · Airflow DAG slowdown: morning jobs take 4× normal time · Analytics engineering
The scenario
Monday morning. Your 7am DAG run that normally completes by 8:30am isn't done at 10am. Dashboards are stale. Analysts are asking. No failures — just very, very slow.
Diagnosis path
  • T+0 Check Airflow task durations vs. historical baseline. Which tasks are slow vs. which are fast?
  • T+8m The dbt transformation tasks are normal. The slow tasks are the Snowflake loads. Check the Snowflake query history for the warehouse in use.
  • T+15m Warehouse is heavily queued — 47 concurrent queries waiting. Someone started a large ad-hoc analysis on the same warehouse at 6:55am.
  • T+18m Move the DAG to the dedicated transform warehouse. Jobs complete within 35 minutes of the switch.
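The first step, comparing task durations against a historical baseline, is easy to automate. A minimal sketch with hypothetical task names and durations in minutes; in practice the history would come from the Airflow metadata database.

```python
from statistics import median
from typing import Dict, List


def slow_tasks(
    history: Dict[str, List[float]],
    current: Dict[str, float],
    factor: float = 2.0,
) -> List[str]:
    """Tasks whose current duration exceeds factor x their historical median."""
    return sorted(
        task
        for task, duration in current.items()
        if task in history and duration > factor * median(history[task])
    )


# Hypothetical durations in minutes for the last three runs and today's run.
history = {"load_orders": [20.0, 22.0, 21.0], "dbt_run": [30.0, 31.0, 29.0]}
today = {"load_orders": 95.0, "dbt_run": 30.0}
flagged = slow_tasks(history, today)
```

Here only load_orders is flagged, which points at the Snowflake loads rather than the dbt transformations.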
Prevention
Separated Snowflake warehouses by workload type: ETL/transform, ad-hoc/analyst, BI/serving. Each warehouse has resource monitors. Added a "warehouse health" check as the first task in every critical DAG — if the warehouse queue depth is above threshold, alert and optionally re-route before the DAG proceeds.
What the interviewer asks next
"How did you catch this before analysts escalated?" — Honest answer: you didn't. The right answer adds: what monitoring would have caught it 30 minutes earlier.
· · ·

Postmortem template

WRITE THIS AFTER EVERY P0/P1

Having a postmortem template in your head tells the interviewer you've done this for real. Here's the structure that senior engineers recognize as correct:

Section | Content
Summary | One paragraph. What broke, when, for how long, business impact. No blame.
Timeline | Timestamped sequence from first signal to resolution. Include what you checked that turned out to be wrong; it shows honest diagnosis.
Root cause | The technical cause, one level deeper than "the query was slow": why was it slow, and why wasn't that caught?
Contributing factors | What made the impact worse than it needed to be. Usually: late detection, missing runbooks, or unclear ownership.
Action items | Named owner + due date for each item. Vague action items do not get done.
What went well | Blameless culture requires honoring what the team did right. Communication was good? Say so.
Interview tip: If you say "we wrote a postmortem," be ready to describe what was in it. If you can't, the interviewer knows you didn't own the incident — you were a bystander.
