Production incident scenarios.
Every senior interview has a variant of "Tell me about a time you dealt with a production incident." What interviewers are really probing: how you diagnose under pressure, how you communicate, and what you learned. Here are 8 real-pattern scenarios with full diagnosis paths.
Contents
What interviewers are actually probing
THE HIDDEN RUBRIC
When an interviewer asks about a production incident, they're not looking for drama. They're grading on four signals:
- Diagnosis speed. Did you know where to look, or did you thrash? Engineers who've seen real incidents know the first 10 minutes — check the obvious things first, confirm the blast radius, stop the bleeding before root-causing.
- Communication quality. Did you keep stakeholders informed without overwhelming them? Senior engineers manage information flow as carefully as they manage the incident itself.
- Blamelessness. Did you own it, or deflect? The best incident stories name your own decisions that contributed — and what you changed.
- Prevention thinking. What broke the system, not just what broke? The answer should always include systemic changes, not just the fix.
What tanks incident answers: Vague timelines ("it took a while to figure out") · Blaming upstream teams · No concrete numbers (how bad was it? how long?) · Jumping to the fix without diagnosing · No postmortem or prevention step.
The STAR+T framework for incidents
STRUCTURE YOUR ANSWER
| Beat | Content | Time |
|---|---|---|
| Situation | What system, what was normal, what broke. One sentence each. | 20 sec |
| Task | Your role in the incident. Incident commander? On-call? Escalated in? | 10 sec |
| Action | Your actual diagnosis path — what you checked, when, why. This is the meat. | 60–90 sec |
| Result | Time to resolution, business impact, SLA outcome. Numbers required. | 20 sec |
| +Takeaway | What systemic change came out of it. This is what separates L5 from L6. | 20 sec |
· · ·
8 Incident scenarios
REAL PATTERNS · WORKED THROUGH
P0
Silent data loss in the revenue pipeline
Fintech / Stripe-pattern
The scenario
Finance notices the daily revenue reconciliation is $2.3M short. Your Kafka-to-Snowflake pipeline has been running green for 3 days. No alerts fired. The discrepancy covers 72 hours of production data.
Diagnosis path
- T+0 Confirm the blast radius. Pull row counts from Snowflake vs Kafka consumer lag metrics. Find the divergence timestamp.
- T+8m Check consumer group lag. If lag is zero, the consumer processed the messages — the issue is downstream (transformation or load). If lag is nonzero, the consumer stalled.
- T+15m Query the Snowflake load history. Look for failed COPY commands, partial loads, or microbatch gaps. Cross-reference with your Airflow DAG run history.
- T+25m Find the gap: a schema change on a source table (new nullable column) caused a
VARIANTparse to silently null-coalesce rather than fail. Rows loaded withamount = NULLinstead of the real value. - T+40m Patch: replay from Kafka offset at the divergence point, fix the parser, backfill the 72-hour window.
Prevention
Added null-rate monitors on all financial columns — alert if null rate on
amount changes by more than 0.1% relative to the prior 24h. Added schema evolution tests in CI that catch nullable column additions. Added a reconciliation check (row count + sum) to every DAG run, not just the nightly batch.What the interviewer asks next
"Why didn't your monitoring catch this?" — They want to hear you acknowledge the monitoring gap and describe a systemic fix, not just "we added an alert."
P0
Cascading failure from a bad dbt model promotion
Analytics platform
The scenario
A dbt model change went to production on Friday at 4pm. By Saturday morning, 40 downstream dashboards show wrong numbers. The change was a DISTINCT that was removed from a CTE — 10x fanout on a join made aggregate metrics look like growth was 900% WoW.
Diagnosis path
- T+0 Identify the blast radius. Which dashboards? Which models? Run
dbt ls --select +affected_modelto see all dependents. - T+10m Check dbt run history. Find the first run where row counts diverged — compare
fct_ordersrow count before and after the deploy. - T+18m Isolate the commit.
git logon the model. The DISTINCT removal is immediately visible in the diff. - T+25m Rollback the model. Push the revert, trigger a full dbt run. 40 dashboards self-heal within the next refresh cycle.
- T+2h Manually audit any reports that were pulled and shared externally during the window.
Prevention
Added row count and cardinality tests to every fact model in dbt — tests block deployment if row count changes by more than 20%. Added a staging environment with a 24-hour bake period before production promotion. No Friday afternoon deploys rule added to the runbook.
What the interviewer asks next
"How did this get through code review?" — Answer honestly. Usually: the reviewer looked at the logic but not the data impact. What changed: added data validation as a mandatory PR checklist item, not just code review.
P0
Kafka consumer lag spike — real-time pipeline collapses
Streaming platform / Netflix-pattern
The scenario
Consumer lag on your event ingestion topic goes from 0 to 14 million messages in 20 minutes. Your Flink job is processing but not keeping up. SLA for freshness is 5 minutes; you're now at 47 minutes and climbing. Product teams are paging you.
Diagnosis path
- T+0 Check producer throughput. Is this a traffic spike (normal) or a consumer slowdown (abnormal)? Kafka topic in/sec vs consumer out/sec in your metrics dashboard.
- T+5m Traffic spike confirmed: a new product event (push notification batch) sent 50M events in 8 minutes — 10× the normal peak.
- T+8m Check Flink task manager metrics — CPU at 100%, GC pressure. The job is not crashing but can't keep up with this volume.
- T+12m Increase Flink parallelism by 3×. Scale up task managers. Consumer lag starts decreasing within 6 minutes.
- T+38m Lag returns to zero. Total gap: 67 minutes of data was delayed, not lost — Kafka retained everything.
Prevention
Added auto-scaling policy on the Flink cluster triggered by consumer lag > 500K messages. Added a traffic spike coordination process — product teams must notify data infra before batch notification sends over 10M events. Added lag SLA alerting at 10-minute freshness, not just 5-minute breach.
What the interviewer asks next
"What would you have done if scaling wasn't enough?" — Backpressure strategy: drop non-critical events, prioritize revenue-critical topics, shed load at the producer if necessary. Shows you think in layers.
P1
SCD Type 2 dimension corrupted — history overwritten
E-commerce / Amazon-pattern
The scenario
A data engineer ran a backfill job on
dim_customer without setting the is_current flag correctly. All historical rows now have is_current = TRUE. Point-in-time queries for the last 6 months of customer attributes return wrong results. 14 downstream models are affected.Diagnosis path
- T+0 Confirm:
SELECT customer_id, COUNT(*) FROM dim_customer WHERE is_current = TRUE GROUP BY 1 HAVING COUNT(*) > 1. If any customer has more than one "current" row, it's confirmed. - T+10m Assess recovery options: (a) restore from last clean snapshot, (b) re-derive the correct SCD2 state from the source transaction log.
- T+20m Option (a): restore
dim_customerfrom the pre-backfill snapshot. Replay the 3 hours of legitimate changes that happened between the snapshot and now from the CDC log. - T+90m Dimension restored. Trigger downstream model refreshes. Validate with row count + unique key checks.
Prevention
Added a pre-commit hook on any DML touching
dim_* tables that requires review sign-off. Added a data contract test: "for each customer_id, exactly one row has is_current = TRUE" — runs on every load. Added snapshot retention policy (72h minimum) so rollback is always available.What the interviewer asks next
"How did you communicate this to downstream teams?" — They want to see triage communication: immediate alert with blast radius assessment, ETA to resolution, a post-incident report. Not silence followed by a fix.
P1
Query cost explosion — Snowflake bill 40× normal
SaaS / Snowflake-heavy
The scenario
Overnight your Snowflake credits burned 40× the daily baseline. Finance pages the data team at 8am. A $180K monthly budget is now $12K into the first week. No obvious pipeline changes were deployed.
Diagnosis path
- T+0 Run
QUERY_HISTORY: find the top 10 queries by credits consumed in the last 24 hours. Sort byCREDITS_USED_CLOUD_SERVICES. - T+8m One query consumed 92% of the total: a new analyst wrote a self-join on a 4B-row table without a filter. The query ran 47 times via a scheduled BI tool refresh.
- T+12m Kill any still-running instances. Suspend the offending warehouse. Fix the query (add date filter + limit self-join to the CTE).
- T+20m Enable resource monitors with a credit limit and suspend-on-exceed policy. Should have been there from day one.
Prevention
Resource monitors on every warehouse with hard credit limits and email alerts at 50%/80%/100%. Query governance policy: analysts cannot run unfiltered full-table scans above 100M rows without a reviewer sign-off. Added a cost estimation step to the BI tool's query queue for queries above a threshold.
What the interviewer asks next
"How did you prevent blaming the analyst?" — Blameless postmortem: the analyst didn't know. The system had no guardrails. You owned the systemic failure.
P1
Late data from a third-party source breaks ML feature freshness
ML platform
The scenario
Your feature store relies on a third-party credit bureau feed that arrives at 2am. Today it arrived at 9:15am. Your fraud model has been scoring transactions for 7 hours using 24-hour-old features. Risk team needs to know the blast radius before market open.
Diagnosis path
- T+0 Quantify: how many transactions were scored in the 7-hour window? What's the expected model accuracy degradation with 24h-old credit features vs. 2h-old? Pull from model evaluation data.
- T+15m Risk decision: the accuracy delta is 1.8% — below the risk team's threshold of 5% for manual review. Document and move on.
- T+30m Re-score transactions in the window using the now-arrived fresh features. Flag any decisions that would have changed. 3 high-value transactions get a second look.
- T+60m Notify risk team with the full analysis. No escalation needed.
Prevention
Added freshness SLA monitoring on all third-party data sources with alerting at T+30min past expected arrival. Added a fallback policy: if credit features are stale beyond threshold, the model switches to a degraded-mode feature set that uses internal-only signals. Negotiated an SLA with the data vendor with financial penalties for late delivery.
What the interviewer asks next
"What if the accuracy delta had been 8%?" — Shows you know the decision tree: at what threshold do you halt scoring, route to human review, or revert to a simpler model?
P0
Duplicate transactions in the financial ledger
Payments / Stripe-pattern
The scenario
A retry storm during a brief network partition caused your idempotency layer to fail. 14,000 transactions were written twice to the ledger. Reconciliation is off by $8.7M. The window is 11 minutes.
Diagnosis path
- T+0 Confirm duplicates:
SELECT transaction_id, COUNT(*) FROM ledger GROUP BY 1 HAVING COUNT(*) > 1. Verify the count. - T+5m Check idempotency key table. The partition caused a brief window where the cache missed and keys were re-processed.
- T+15m Quarantine the duplicate rows (soft-delete with audit flag, never hard-delete financial data).
- T+30m Re-run reconciliation. Confirm the $8.7M delta is now explained and resolved.
- T+2h Legal and compliance notified per protocol. Written incident report submitted.
Prevention
Moved idempotency key storage from cache-only to a durable database with cache-aside pattern. Added a reconciliation check that runs every 15 minutes, not just nightly. Added circuit breaker on the retry logic: after 3 retries, route to a dead-letter queue for manual review rather than continuing to retry.
What the interviewer asks next
"Why did you soft-delete instead of hard-delete?" — Audit trail. Financial data must be immutable — you mark it void, you never delete it. Shows regulatory awareness.
P2
Airflow DAG slowdown — morning jobs take 4× normal time
Analytics engineering
The scenario
Monday morning. Your 7am DAG run that normally completes by 8:30am isn't done at 10am. Dashboards are stale. Analysts are asking. No failures — just very, very slow.
Diagnosis path
- T+0 Check Airflow task durations vs. historical baseline. Which tasks are slow vs. which are fast?
- T+8m The dbt transformation tasks are normal. The slow tasks are the Snowflake loads. Check the Snowflake query history for the warehouse in use.
- T+15m Warehouse is heavily queued — 47 concurrent queries waiting. Someone started a large ad-hoc analysis on the same warehouse at 6:55am.
- T+18m Move the DAG to the dedicated transform warehouse. Jobs complete within 35 minutes of the switch.
Prevention
Separated Snowflake warehouses by workload type: ETL/transform, ad-hoc/analyst, BI/serving. Each warehouse has resource monitors. Added a "warehouse health" check as the first task in every critical DAG — if the warehouse queue depth is above threshold, alert and optionally re-route before the DAG proceeds.
What the interviewer asks next
"How did you catch this before analysts escalated?" — Honest answer: you didn't. The right answer adds: what monitoring would have caught it 30 minutes earlier.
· · ·
Postmortem template
WRITE THIS AFTER EVERY P0/P1
Having a postmortem template in your head tells the interviewer you've done this for real. Here's the structure that senior engineers recognize as correct:
| Section | Content |
|---|---|
| Summary | One paragraph. What broke, when, for how long, business impact. No blame. |
| Timeline | Timestamped sequence from first signal to resolution. Include what you checked that turned out to be wrong — shows honest diagnosis. |
| Root cause | The technical cause, one level deeper than "the query was slow" — why was it slow, and why wasn't that caught? |
| Contributing factors | What made the impact worse than it needed to be. Usually: late detection, missing runbooks, or unclear ownership. |
| Action items | Named owner + due date for each item. Vague action items do not get done. |
| What went well | Blameless culture requires honoring what the team did right. Communication was good? Say so. |
Interview tip: If you say "we wrote a postmortem," be ready to describe what was in it. If you can't, the interviewer knows you didn't own the incident — you were a bystander.