▸ Featured · Senior & Staff DE · 2026
Interview Studio · the year's twenty-one

Hot Topics 2026 — what the senior DE loop is actually testing.

The format war is over (Iceberg won), so the questions moved up the stack — to the catalog, to cost at petabyte scale, to quality at the source, and to the data plumbing that production GenAI lives or dies on. Twenty-one topics, each with an architecture diagram and the one-line interview signal that separates an L4 answer from an L5/L6 one.

How to use this. Each topic is a self-contained whiteboard primer: why it's hot in 2026, a salient architecture diagram, the named techniques, and the senior signal — the sentence that tells the interviewer you've run this in production, not just read about it. Where a topic deepens something in another pillar, it links across to Performance, Analytics or Design.

The four macro-shifts driving the 2026 question set

Shift №1

Iceberg won → the catalog war

With the table format settled, the differentiation — and the hard problems — moved to the REST catalog and metadata layer.

Shift №2

FinOps at petabyte scale

Storage and compute bills are now a first-class engineering KPI. "Make it cheaper" is a design constraint, not an afterthought.

Shift №3

Quality shifts left

Contracts and assertions move enforcement to the producer and the ingestion edge — catch bad data before it poisons the lake.

Shift №4

GenAI needs plumbing

Feature stores, vector stores and retrieval pipelines — with freshness and lineage — are the new must-build, plus copilots in ops.

The twenty-one topics

  1. S3 Storage Lens & Cost Optimization — the FinOps entry point
  2. Data Lake Query Engines & Catalogs — Trino + Iceberg REST catalogs
  3. Storage Layout & Partitioning Strategy — the highest-leverage L5 topic
  4. Compute-to-Storage Skew Mitigation — killing the straggler
  5. Real-time Stream Partitioning & Ingestion — the partition-key decision
  6. Schema Evolution & Format Governance — registries & columnar IDs
  7. Data Quality, Anomaly Detection & Data Contracts — circuit breakers
  8. Data Lineage, Dependencies & Observability — OpenLineage / OTel
  9. Data Tiering & TTL Management — the compute-side of cost
  10. AI-Ready Infrastructure & Pipeline Copilots — the signature 2026 theme
  11. Multi-Region Data Architecture & Disaster Recovery — surviving a region loss
  12. Table Maintenance Automation — metadata failures, not compute
  13. Data Platform Reliability Engineering — SLOs, error budgets, SRE-for-data
  14. Lakehouse Security & Governance — RLS, masking, GDPR-as-code
  15. Change Data Capture (CDC) at Scale — OLTP → Kafka → Iceberg
  16. Cost-Aware Compute Optimization — cut 40% without breaking SLA
  17. Metadata Engineering — discovery & trust
  18. Event-Driven Architectures — sourcing, CQRS, outbox, sagas
  19. Vector Data Infrastructure — embeddings, ANN, RAG freshness
  20. Data Mesh & Domain Ownership — organizational scaling
  21. CI/CD & DataOps Optimization with AI — the intelligent delivery loop
Nº 01 · FinOps

S3 Storage Lens & Cost

Org-wide growth, waste surfacing, and lifecycle to Glacier / Deep Archive.

Nº 02 · Catalog

Query Engines & Catalogs

Trino + Iceberg REST catalogs — Polaris, Glue, Nessie; manifest pruning.

Nº 03 · Layout

Partitioning Strategy

Hidden partitioning, bucketing, sorting, Z-order; the small-file killer.

Nº 04 · Skew

Skew Mitigation

Salting, AQE, skew joins, isolating heavy hitters in Spark/Flink.

Nº 05 · Streaming

Stream Partitioning

Kafka/Flink partition keys, backpressure, rebalancing at device scale.

Nº 06 · Governance

Schema Evolution

Avro/Protobuf registries, compatibility modes, Iceberg ID-based evolution.

Nº 07 · Quality

Quality & Contracts

Assertions as circuit breakers; producer/consumer data contracts.

Nº 08 · Observability

Lineage & OTel

Auto-mapped dependencies, impact analysis, OpenLineage/OpenTelemetry.

Nº 09 · Tiering

Tiering & TTL

TTL frameworks, rollup cascades, retiring raw to cheaper tiers.

Nº 10 · AI-Ready ★

AI Infra & Copilots

Feature/vector stores, retrieval freshness, and copilots in operations.

Nº 11 · Resilience

Multi-Region & DR

Catalog recovery, RPO/RTO, active-active vs passive, cross-region replication.

Nº 12 · Iceberg Ops

Table Maintenance

Compaction, snapshot expiry, manifest rewrites, orphan removal, branching.

Nº 13 · DPRE

Reliability Engineering

Pipeline SLOs, error budgets, automated rollback, self-healing.

Nº 14 · Security

Security & Governance

Row/column security, dynamic masking, PII, GDPR-as-code, zero-trust.

Nº 15 · CDC

CDC at Scale

Debezium, ordering, idempotency, exactly-once upserts into Iceberg.

Nº 16 · FinOps

Compute Cost

AQE, autoscaling, spot, right-sizing, shuffle tuning, resource queues.

Nº 17 · Discovery

Metadata Engineering

Discovery, glossary, semantic layer, ownership, catalog performance.

Nº 18 · Staff design

Event-Driven Arch

Event sourcing, CQRS, the outbox pattern, sagas, event versioning.

Nº 19 · AI

Vector Infrastructure

Embedding pipelines, ANN, hybrid search, RAG freshness, re-embedding.

Nº 20 · Org

Data Mesh

Domain ownership, data products, federated governance, platform teams.

Nº 21 · DataOps ★

CI/CD with AI

Selective testing, change-risk scoring, anomaly-gated deploys, self-healing pipelines.

Nº 01 · FinOps · storage cost

S3 Storage Lens & Cost Optimization

Why it's hot in 2026: at petabyte scale the storage bill is an engineering KPI, and Storage Lens is the org-wide lens that turns "where did 8 PB come from?" into a dashboard — the natural entry point to a FinOps practice.

[Diagram: org-wide storage → waste surfacing → lifecycle tiering. Hundreds of buckets across N accounts (8 PB) feed S3 Storage Lens (advanced tier, prefix-level), which surfaces waste (orphaned objects, uncompressed data, incomplete multipart uploads, non-current versions) for lifecycle policies (transition, expire, abort MPU). Tier ladder, $/GB·month: Standard 0.023 → Standard-IA 0.0125 → Glacier Flexible 0.0036 → Glacier Deep Archive 0.00099 (~23× cheaper); colder is cheaper but slower to retrieve.]
Storage Lens → waste categories → lifecycle transitions down the cost/retrieval ladder

Storage Lens aggregates usage and activity metrics across every account and bucket in the org (the advanced tier adds prefix-level detail and recommendations), so growth and waste become visible in one place. The four waste categories it surfaces map directly to one-line lifecycle fixes.

The techniques

  • Surface the four wastes: orphaned objects (no owning table/job), uncompressed data, incomplete multipart uploads silently accruing, and non-current object versions in versioned buckets.
  • Automate lifecycle rules: transition cold prefixes Standard → IA → Glacier Flexible → Deep Archive; AbortIncompleteMultipartUpload; NoncurrentVersionExpiration; expire true temp data.
  • Intelligent-Tiering for unpredictable access patterns (auto-moves between tiers, no retrieval fee on frequent/infrequent), vs explicit lifecycle when you know the access curve.
  • FinOps wiring: cost-allocation tags + showback/chargeback so each team sees its own bill; treat $/TB-scanned and $/TB-stored as tracked KPIs.
Interview signal: name the real trade-offs. Deep Archive is ~23× cheaper than Standard, but standard retrievals take up to 12 hours (bulk up to 48) and there is a 180-day minimum-storage charge. Lifecycle transitions are billed per 1,000 objects, so transitioning millions of tiny files can cost more than it saves, which is exactly why this couples to layout/compaction (№03) and tiering/TTL (№09).
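The per-object transition trade-off can be made concrete with a small break-even calculation. This is a sketch under stated assumptions: the storage prices are the ones in the tier ladder above, while `TRANSITION_PER_1000` is an illustrative request price rather than a quoted one, and per-object metadata overhead and the minimum-storage charge are ignored.

```python
# Storage prices ($/GB-month) are the ones quoted in the tier ladder above.
STANDARD = 0.023
DEEP_ARCHIVE = 0.00099
# ASSUMPTION: illustrative lifecycle-transition price, in $ per 1,000 objects.
TRANSITION_PER_1000 = 0.05
GIB = 1024 ** 3


def break_even_months(object_bytes: int) -> float:
    """Months of storage savings needed to repay one object's transition fee."""
    monthly_saving = (object_bytes / GIB) * (STANDARD - DEEP_ARCHIVE)
    transition_fee = TRANSITION_PER_1000 / 1000
    return transition_fee / monthly_saving


if __name__ == "__main__":
    sizes = [("10 KiB", 10 * 1024), ("1 MiB", 1024 ** 2), ("128 MiB", 128 * 1024 ** 2)]
    for label, size in sizes:
        print(f"{label:>8}: {break_even_months(size):10.2f} months to break even")
```

Under these assumptions a compacted 128 MiB file repays its transition almost immediately, while a 10 KiB object takes years, which is the argument for compacting before tiering.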
✦ ✦ ✦
Nº 02 · catalog · query engines

Data Lake Query Engines & Catalogs

Why it's hot in 2026: Iceberg won the format war, so the catalog layer is the new battleground. The REST catalog and the metadata tree — not the data files — now decide how fast a petabyte query plans.

[Diagram: planning prunes the metadata tree before touching data. Engines (Trino/Presto, Spark, Flink) → Iceberg REST catalog (Polaris, Glue, Nessie) → metadata.json (pick snapshot) → manifest list (partition prune) → manifest files (column min/max skip) → Parquet data files; only the survivors are read.]
The Iceberg metadata tree — catalog → metadata.json → manifest list → manifests → data files

Once the table format is a commodity, the catalog is where vendors compete (Snowflake Polaris, Databricks Unity, AWS Glue, the git-like Nessie) and where query planning lives or dies. A query never scans data first — it walks the metadata tree, pruning at each level.

The techniques

  • Minimize metadata overhead: expire old snapshots, rewrite/compact manifests, keep metadata.json from ballooning; small, current metadata = fast planning.
  • Prune manifests efficiently: partition summaries in the manifest list let the planner skip whole manifests, and per-file column lower/upper bounds inside each manifest skip data files; Puffin statistics files add sketches (e.g., distinct-count estimates) for the optimizer.
  • REST catalog choice: Polaris/Unity/Glue/Nessie — Nessie adds git-style branches & tags for isolated writes and rollbacks; REST decouples engine from catalog implementation.
  • Engine tuning: Trino split sizing, dynamic filtering, table/column stats (ANALYZE), metadata caching; cost-based join ordering.
Interview signal — draw the metadata tree from memory and explain that a well-pruned plan touches a handful of manifests and a few data files out of millions — and that the classic failure mode is metadata bloat (un-expired snapshots, millions of tiny manifests) making planning the bottleneck, not the scan. Ties straight to layout (№03).
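The two pruning levels described above can be sketched in a few lines. This is a toy model, not Iceberg's actual planner: `Manifest`, `DataFile`, `plan` and the sample `TABLE` are hypothetical names, and the query shape (`day = ? AND amount >= ?`) is assumed for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DataFile:
    path: str
    min_amount: int   # per-file column lower bound
    max_amount: int   # per-file column upper bound


@dataclass
class Manifest:
    day_min: date     # partition summary, as kept in the manifest list
    day_max: date
    files: list = field(default_factory=list)


def plan(manifests, day, amount_at_least):
    """Return (manifests opened, data files to scan) for day = ? AND amount >= ?."""
    opened, survivors = 0, []
    for m in manifests:
        if not (m.day_min <= day <= m.day_max):
            continue  # level 1: skipped without opening the manifest
        opened += 1
        # level 2: skip files whose upper bound cannot satisfy the predicate
        survivors += [f.path for f in m.files if f.max_amount >= amount_at_least]
    return opened, survivors


TABLE = [
    Manifest(date(2026, 1, 1), date(2026, 1, 1),
             [DataFile("d1/a.parquet", 1, 50), DataFile("d1/b.parquet", 60, 900)]),
    Manifest(date(2026, 1, 2), date(2026, 1, 2),
             [DataFile("d2/a.parquet", 1, 40), DataFile("d2/b.parquet", 45, 700)]),
    Manifest(date(2026, 1, 3), date(2026, 1, 3),
             [DataFile("d3/a.parquet", 5, 80)]),
]
```

Calling `plan(TABLE, date(2026, 1, 2), 500)` opens one manifest out of three and keeps one data file out of five; metadata bloat hurts because the loop over manifests is exactly what grows.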
✦ ✦ ✦
Nº 03 · layout · the L5 lever

Storage Layout & Partitioning Strategy

Why it's hot in 2026: layout ties physical decisions directly to query cost, which makes it one of the highest-leverage topics in an L5 loop — get it wrong and you're paying for it on every query, forever.

[Diagram: layout decisions → bytes scanned. Over-partitioned: 1M × 1 MB files, open overhead far exceeds data, planning chokes, scans inflate, cardinality too high to prune. After compaction: right-sized ~128–512 MB files, partitioned by day(ts), bucketed by bucket(N, user_id), with Z-order/sort keys; hidden partitioning keeps queries clean and lets the partition scheme evolve without a rewrite, and multi-dimensional skipping scans only matching files.]
Kill the small-file problem with compaction; cut bytes scanned with hidden partitioning, bucketing & Z-order

Every layout knob trades write-time effort for read-time savings, paid back on every query. The senior move is to derive the layout from the dominant query shapes, not the other way round.

The techniques

  • Partition on a low-cardinality, frequently-filtered dimension (usually date) — over-partitioning is how you create the small-file problem.
  • Bucket on the join/group key to co-locate matches and remove shuffles; sort/cluster within files for min/max data skipping.
  • Iceberg hidden partitioning: transform-based (day(ts), bucket(16, id)) so queries don't reference partition columns, and partition evolution changes the scheme with no table rewrite.
  • Z-order / space-filling curves for multi-column skipping; compaction to ~128 MB–1 GB targets to kill small files.
Interview signal — tie each knob to a query: "we partition by day for the time filter, bucket by user_id for the join, and Z-order on country for the dashboard filter." Mention that Iceberg's hidden partitioning + evolution is what lets you fix a bad partition choice without rewriting petabytes. Deep mechanics in Performance Families 1 & 7.
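For the Z-order item, the underlying idea is bit interleaving (a Morton code): rows that are close in both columns end up close in sort order, so file-level min/max ranges stay tight on both. The sketch below shows only that idea for non-negative integers; real engines handle types, nulls and more than two columns in their own ways, and the function names are hypothetical.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z-order) code: bit i of x goes to position 2i, bit i of y to 2i+1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_sorted(rows):
    """Sort (x, y) rows by Z-value so neighbours in both dimensions cluster."""
    return sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))


if __name__ == "__main__":
    grid = [(x, y) for x in range(4) for y in range(4)]
    print(z_sorted(grid))
```

A plain sort on (x, y) keeps x ranges tight per file but spreads y across every file; the interleaved key trades a little x locality for usable y locality.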
✦ ✦ ✦
Nº 04 · skew · distributed systems

Compute-to-Storage Skew Mitigation

Why it's hot in 2026: classic distributed-systems interview territory that never goes away — a few hot keys bottleneck the whole cluster while every other task sits idle. The total work is fine; the distribution is the problem.

[Diagram: one hot key = one straggler, so spread it. Before (skewed shuffle): tasks t1 to t5, where t3 holds the whale key and the stage ends when t3 ends (hours). After salting/AQE: the hot key is split N ways (3a to 3d) and the stage ends when the median task ends (minutes).]
Partition-size histogram — the whale's task t3 dominates until salting / AQE splits it across workers

Skew is the optimization juniors miss because the plan "looks fine." Detection is half the skill — read task-duration and partition-size distributions, not just the average.

The techniques

  • Salting: append a random salt to the hot key to split it across N partitions; replicate the small side N ways to keep the join correct.
  • Adaptive Query Execution (Spark 3+): skewJoin splits oversized partitions at runtime, coalesces tiny ones, and can flip sort-merge → broadcast once it sees real sizes.
  • Isolate heavy hitters: a known hot set goes down a separate two-phase path; the long tail aggregates normally (no broad replication cost).
  • Flink: keyBy skew → local/pre-aggregation (two-phase), rebalance/rescale, key-group tuning.
Interview signal: say the line "total work is fine, the distribution isn't," then name AQE's spark.sql.adaptive.skewJoin.skewedPartitionFactor and explain why you salt only the hot keys (salting everything just inflates the replicated small side). Full operational treatment on the Hot Shards & Data Skew page and Performance Family 3.
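The salting technique above, restricted to hot keys, can be sketched without a cluster. This is an in-memory illustration, not Spark code: `HOT_KEYS`, `N_SALTS` and the sample rows are hypothetical, and a rotating salt replaces the usual random one so the result is reproducible.

```python
from collections import Counter
from itertools import count

N_SALTS = 4
HOT_KEYS = {"whale"}   # hypothetical hot set, found from partition-size stats


def salt_large_side(rows):
    """Tag hot-key rows with a rotating salt; the long tail stays on salt 0."""
    rotation = count()
    out = []
    for key, value in rows:
        salt = next(rotation) % N_SALTS if key in HOT_KEYS else 0
        out.append(((key, salt), value))
    return out


def replicate_small_side(rows):
    """Copy hot-key rows once per salt so every salted partition finds a match."""
    out = []
    for key, value in rows:
        salts = range(N_SALTS) if key in HOT_KEYS else [0]
        out += [((key, s), value) for s in salts]
    return out


def join(left, right):
    """Inner join on the first element; the right side has unique keys."""
    lookup = dict(right)
    return [(k, lv, lookup[k]) for k, lv in left if k in lookup]


FACTS = [("whale", i) for i in range(8)] + [("minnow", 100)]
DIMS = [("whale", "W"), ("minnow", "M")]
```

Joining the salted sides and dropping the salt gives the same rows as the plain join, while the largest join-key group shrinks from eight rows to two; only the one hot dimension row is replicated.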
✦ ✦ ✦
Nº 05 · streaming · ingestion

Real-time Stream Partitioning & Ingestion

Why it's hot in 2026: millions of concurrent devices, one wrong partition key, and you get a hot partition that lags the whole consumer group. The partition-key choice is the single most consequential decision in a streaming design.

[Diagram: partition key → balance or hot partition + lag. Millions of devices write to a Kafka topic with N partitions: key = country makes P0 hot and consumer lag climbs; key = hash(device_id) balances the load. Flink consumers run with parallelism = partitions and credit-based backpressure; the SLO is consumer lag.]
A low-cardinality key (country) creates a hot partition; a hashed high-cardinality key (device_id) balances the load

Throughput and balance are won or lost at the partition key. The trade-off to articulate: per-key ordering requires same-key→same-partition, which is exactly what creates hot partitions when a key is popular.

The techniques

  • Key design: high-cardinality, evenly-hashed keys (e.g., hash(device_id)); composite keys to break up hot tenants; never a low-cardinality key like country/region if it's skewed.
  • Throughput: right-size partition count, producer batching/linger.ms, idempotent/transactional producers for exactly-once.
  • Backpressure: Flink credit-based flow control + buffer debloating; watch Kafka consumer lag as the health SLO.
  • Rebalancing: cooperative/incremental sticky assignor to avoid stop-the-world rebalances when consumers join/leave.
Interview signal — lead with "the partition key is the most important decision," then name the ordering-vs-parallelism trade-off and consumer lag as the SLO. Bonus: cooperative rebalancing so scaling the consumer group doesn't pause ingestion. Streaming internals on the Streaming Architecture page.
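The effect of the key choice on balance can be simulated offline. This is a sketch with assumptions: SHA-256 stands in for the client's real partitioner, and the device fleet (10,000 devices, 90% in one country) is invented for illustration.

```python
import hashlib
from collections import Counter

N_PARTITIONS = 8


def partition_for(key: str) -> int:
    """Stable hash -> partition. SHA-256 is a stand-in for the real partitioner."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_PARTITIONS


def skew_ratio(keys) -> float:
    """Busiest partition's load divided by the mean load (1.0 = perfectly even)."""
    load = Counter(partition_for(k) for k in keys)
    mean = sum(load.values()) / N_PARTITIONS
    return max(load.values()) / mean


# Hypothetical fleet: 10,000 devices, 90% of them in one country.
DEVICES = [(f"device-{i}", "US" if i % 10 else "DE") for i in range(10_000)]

if __name__ == "__main__":
    print("key=country  :", round(skew_ratio(c for _, c in DEVICES), 2))
    print("key=device_id:", round(skew_ratio(d for d, _ in DEVICES), 2))
```

A ratio near 1 means every consumer does similar work; a ratio near the partition count means one consumer is doing almost all of it, which shows up as lag on that partition.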
✦ ✦ ✦
Nº 06 · governance · formats

Schema Evolution & Format Governance

Why it's hot in 2026: decoupled microservices deploy independently, so a producer's schema change must not break downstream consumers. The schema registry is the contract that makes that safe.

[Diagram: the registry gates changes before they ship. A producer service (Avro/Protobuf) submits schemas to the Schema Registry (compatibility: BACKWARD / FORWARD / FULL) through a CI gate on the PR: adding an OPTIONAL field is compatible and registered; renaming or removing a field is breaking and rejected at build. Consumers deploy independently.]
Additive (optional-field) changes pass; renames/removals are rejected at build time, before they can break consumers

Format governance is what lets independent teams move fast without breaking each other. The registry encodes which changes are safe in which direction, and columnar formats add their own evolution rules.

The techniques

  • Schema registry (Avro/Protobuf) enforcing BACKWARD/FORWARD/FULL compatibility as a CI gate — breaking changes fail the build, not production.
  • Safe vs breaking: adding an optional/defaulted field is compatible; renaming, removing, or retyping a required field is not.
  • Iceberg/Parquet evolution by field-ID: add/drop/rename/reorder columns with no data rewrite — vs Hive/positional schemas that break on reorder.
  • Direction matters: backward (new consumer reads old data) vs forward (old consumer reads new data) decides who can upgrade first.
Interview signal: explain compatibility direction in terms of deploy order ("BACKWARD means I can upgrade consumers first"), and why Iceberg's ID-based evolution is safe where positional (Hive-style) schema resolution silently misreads data after a column reorder. Connects to the data-contracts topic (№07).
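The direction rules can be reduced to a toy checker. This is a deliberately simplified model in the spirit of Avro's resolution rules, not a registry implementation: a schema is just a mapping from field name to "has a default", and types, aliases and promotions are ignored. All names are hypothetical.

```python
# A schema is {field_name: has_default}. Simplified: no types, aliases or promotion.
def backward_compatible(old: dict, new: dict) -> bool:
    """New readers can decode old data: every field they add needs a default."""
    return all(has_default for name, has_default in new.items() if name not in old)


def forward_compatible(old: dict, new: dict) -> bool:
    """Old readers can decode new data: every field dropped must have had a default."""
    return all(has_default for name, has_default in old.items() if name not in new)


def fully_compatible(old: dict, new: dict) -> bool:
    return backward_compatible(old, new) and forward_compatible(old, new)


V1 = {"id": False, "email": False}
ADD_OPTIONAL = {"id": False, "email": False, "plan": True}
RENAME = {"id": False, "email_address": False}
REMOVE_REQUIRED = {"id": False}
```

In this model, adding an optional field passes in both directions, a rename fails both, and removing a required field passes BACKWARD but fails FORWARD, which is why the direction decides who can deploy first.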
✦ ✦ ✦
Nº 07 · quality · contracts

Data Quality, Anomaly Detection & Data Contracts

Why it's hot in 2026: quality is shifting left. Instead of dashboards catching bad data days later, contracts and in-pipeline assertions act as circuit breakers that stop a malformed batch at the door.

[Diagram: circuit breaker (fail fast, quarantine, alert). Producer (contract enforced in CI) → assertions (Great Expectations, Soda, dbt): pass → land in curated tables; fail → quarantine to dead-letter and page on-call, so downstream never sees bad data. Cross-cutting anomaly detection on volume, freshness, null-rate and distribution drift; shift-left means the contract sits at the source.]
Contract + assertions gate every batch; bad data is quarantined and alerted, never propagated downstream

The shift is from detecting bad data downstream to preventing it at the source. Two complementary mechanisms: contracts (an agreement) and assertions (the enforcement).

The techniques

  • Data contracts: a versioned producer/consumer agreement — schema + semantics + SLAs — enforced in the producer's CI so a breaking change is caught in the PR.
  • In-pipeline assertions as circuit breakers: Great Expectations / Soda / dbt tests that fail the run and quarantine the batch (dead-letter) rather than poison downstream.
  • Anomaly detection on volume, freshness, null-rate and distribution drift — statistical or ML, alerting on deviation from the learned baseline.
  • Policy-as-code: quality rules live in version control, reviewed like any other code.
Interview signal — frame it as "fail fast at the source," distinguish a contract (the agreement, owned by the producer) from a test (the enforcement), and explain quarantine/dead-letter so one bad batch doesn't taint the lake. Builds on the Data Quality & Reliability page.
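The circuit-breaker behaviour (land the batch or quarantine it, never half of each) can be sketched without any framework. This is not how Great Expectations, Soda or dbt express checks; the contract fields and function names here are hypothetical.

```python
def check_batch(rows, contract):
    """Return the violated assertions; an empty list means the batch may land."""
    failures = []
    if len(rows) < contract["min_rows"]:
        failures.append("volume below contract minimum")
    for col in contract["required"]:
        nulls = sum(1 for r in rows if r.get(col) is None)
        if rows and nulls / len(rows) > contract["max_null_rate"]:
            failures.append(f"null rate too high in {col}")
    return failures


def route(rows, contract, curated, quarantine):
    """Circuit breaker: the whole batch lands or the whole batch is quarantined."""
    failures = check_batch(rows, contract)
    (quarantine if failures else curated).append(rows)
    return failures  # a non-empty list is what would page on-call


CONTRACT = {"min_rows": 2, "required": ["order_id"], "max_null_rate": 0.0}
GOOD = [{"order_id": 1}, {"order_id": 2}]
BAD = [{"order_id": 1}, {"order_id": None}]
```

Routing `GOOD` and then `BAD` leaves one batch in the curated list and one in quarantine, with the failure reasons returned for alerting.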
✦ ✦ ✦
Nº 08 · observability · lineage

Data Lineage, Dependencies & Observability

Why it's hot in 2026: lineage is the backbone of both incident response and change management — and it's increasingly built on vendor-neutral OpenLineage/OpenTelemetry standards rather than tool-specific silos.

[Diagram: lineage DAG → impact analysis + root cause. Nodes: source, stg_orders, stg_users, fct_revenue (failed), exec_dash, finance_mart. Reading downstream gives impact (who is affected); tracing upstream gives root cause. Jobs emit lineage events via OpenLineage / OTel.]
Every job emits lineage; the DAG answers "what breaks if this changes?" (down) and "why did this go stale?" (up)

Auto-mapped lineage turns a pile of jobs into a navigable dependency graph, which powers the two questions every on-call and every migration needs answered: impact (downstream) and root-cause (upstream).

The techniques

  • Auto lineage: parse SQL or emit OpenLineage events from Airflow/Spark/dbt — table-level and, ideally, column-level.
  • Impact analysis & cascading-failure handling: when a node fails or changes, highlight everything downstream before you ship/pause.
  • Scheduling optimization: dependency-aware orchestration runs jobs in true topological order, not on guessed timers.
  • Standardization: OpenTelemetry for vendor-neutral traces/metrics/logs and OpenLineage facets for lineage metadata, together covering the five data-observability pillars (freshness, volume, schema, distribution, lineage).
Interview signal — insist on column-level lineage (table-level is too coarse for real impact analysis) and explain the move to OpenLineage/OpenTelemetry as escaping vendor lock-in — one standard your whole stack emits, so observability isn't N disconnected tools.
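Impact analysis and root-cause tracing are the same graph walk in opposite directions. The sketch below uses the node names from the example DAG; the edge wiring is an assumption, and a real system would build the graph from emitted lineage events rather than a literal.

```python
from collections import deque

# Edges read producer -> consumers. The wiring is assumed for illustration.
EDGES = {
    "source": ["stg_orders", "stg_users"],
    "stg_orders": ["fct_revenue"],
    "stg_users": ["fct_revenue"],
    "fct_revenue": ["exec_dash", "finance_mart"],
}


def reachable(start, edges):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(edges):
    """Flip producer -> consumer edges into consumer -> producer edges."""
    inverted = {}
    for parent, children in edges.items():
        for child in children:
            inverted.setdefault(child, []).append(parent)
    return inverted


def impact(node):
    """Downstream: who is affected if this node fails or changes."""
    return reachable(node, EDGES)


def root_cause_candidates(node):
    """Upstream: where to look when this node is stale or wrong."""
    return reachable(node, invert(EDGES))
```

Column-level lineage is the same structure with (table, column) pairs as nodes, which is what makes the impact set small enough to act on.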
✦ ✦ ✦
Nº 09 · tiering · TTL

Data Tiering & TTL Management

Why it's hot in 2026: the compute-side complement to Storage-Lens cost work (№01) — strict TTLs and rollup cascades retire raw events to cheaper tiers, so you stop paying to keep (and scan) granular history forever.

[Diagram: rollup cascade + TTL → shrink the hot path. HOT: raw events, last 7–30 days, full granularity, expensive to store and scan; TTL expires raw. WARM: daily aggregates, last ~13 months, 1 row/day/dimension. COLD: monthly aggregates, years, Glacier. An operational data layer serves recent + aggregated data. Store sketches, not distinct counts, so each rollup stays additive and correct (→ Analytics №05).]
Granular raw → daily → monthly cascade; TTL retires raw to cheaper tiers while aggregates keep the history queryable

Tiering is the recognition that not all data deserves hot, granular, expensive storage. Access patterns decay with age, so storage and granularity should too.

The techniques

  • TTL frameworks: per-dataset retention enforced automatically — raw expires after N days once it's rolled up (also a GDPR/retention-compliance lever).
  • Rollup cascade: raw → hourly → daily → monthly aggregates; retire or delete the granular tier behind each rollup.
  • Operational vs analytical layers: a lean operational layer serves recent + aggregated data; deep history lives cold.
  • Additivity discipline: store sums/counts and HLL sketches (not pre-computed distinct counts or ratios) so rollups stay correct at every grain.
Interview signal — design the retention/rollup policy from access patterns + cost + compliance, and flag the additivity trap — you can't sum daily distinct counts, so you keep sketches. The compute-side bookend to Storage Lens (№01); mechanics in Analytics Lever A.
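The additivity trap is easy to demonstrate. In this sketch exact Python sets stand in for HLL sketches: both merge by union, but a real sketch stays small and returns an estimate. The sample data is invented.

```python
# Exact sets stand in for HLL sketches: both merge by union.
DAILY_USERS = {
    "2026-01-01": {"ann", "bob", "cy"},
    "2026-01-02": {"bob", "cy", "dee"},
    "2026-01-03": {"ann", "dee"},
}


def naive_rollup(daily):
    """Wrong: summing per-day distinct counts double-counts returning users."""
    return sum(len(users) for users in daily.values())


def mergeable_rollup(daily):
    """Right: merge the per-day structures first, then count once."""
    merged = set()
    for users in daily.values():
        merged |= users
    return len(merged)


if __name__ == "__main__":
    print("sum of daily distinct counts:", naive_rollup(DAILY_USERS))
    print("distinct over merged days   :", mergeable_rollup(DAILY_USERS))
```

The same reasoning applies to ratios: store the numerator and denominator at each grain and divide at query time.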
✦ ✦ ✦
Nº 10 · AI-ready · the signature 2026 theme ★

AI-Ready Infrastructure & Pipeline Copilots

Why it's hot in 2026: the defining theme. Two halves — (a) the data foundations production GenAI depends on, and (b) AI copilots embedded in the platform itself, pushing toward "autonomous" data operations.

[Diagram: (a) foundations GenAI depends on: sources (docs, events) → feature store (online/offline, point-in-time correct) and vector store (embeddings, ANN index) → retrieval (RAG) with freshness + lineage (doc → chunk → answer) → LLM app. (b) Copilots embedded in operations, toward an autonomous platform: auto-tune SQL/layout, flag risky schema changes, anomaly triage, self-healing.]
(a) feature + vector stores feed retrieval with freshness/lineage; (b) copilots auto-tune, gate changes, and triage anomalies

This is the topic that signals you're building for where the field is going. Both halves are squarely a data-engineering responsibility — the model is somebody else's; the plumbing is yours.

The techniques

  • Feature stores: online/offline parity and point-in-time correctness to kill training/serving skew, a common reason a model that tested well fails in production.
  • Vector stores & retrieval: embeddings + ANN index; chunking and embedding-refresh pipelines with freshness and lineage (doc → chunk → answer) so RAG can be trusted and audited.
  • Data quality as the RAG lever: retrieval is only as good as the freshness and quality of what's indexed — ties directly to contracts (№07) and lineage (№08).
  • Copilots in ops: text-to-SQL with guardrails, schema-change risk analysis, partition/clustering recommendations, anomaly triage — augmenting the engineer, trending toward autonomous platforms.
Interview signal — "GenAI is only as good as the data plumbing" — lead with point-in-time correctness in the feature store and freshness+lineage in the retrieval pipeline, and frame copilots as augmenting the engineer (auto-tuning, change-risk gating) rather than replacing judgment. This is the answer that reads as 2026 staff.
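Point-in-time correctness means each training label is joined to the feature value that was known at the label's timestamp, never a later one. A minimal sketch for one entity, with hypothetical data and function names:

```python
from bisect import bisect_right

# Hypothetical feature history for one entity: (event_time, value), sorted by time.
FEATURE_HISTORY = [(10, 0.2), (20, 0.5), (30, 0.9)]


def feature_as_of(history, label_time):
    """Latest feature value known at label_time; None if nothing was known yet."""
    times = [t for t, _ in history]
    idx = bisect_right(times, label_time)
    return history[idx - 1][1] if idx else None


def training_rows(history, labels):
    """Join each (label_time, label) to the feature value visible at that time."""
    return [(feature_as_of(history, t), y) for t, y in labels]
```

A naive "latest value" join would hand the 0.9 feature to a label observed at time 25, leaking the future into training; the as-of join returns 0.5.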
✦ ✦ ✦
Nº 11 · resilience · disaster recovery

Multi-Region Data Architecture & Disaster Recovery

Why it's hot in 2026: anyone can build a pipeline; far fewer can recover an Iceberg catalog after a region outage with a defined RPO/RTO. "Can you survive losing a region?" is becoming a bigger senior question than "can you write Spark?"

[Diagram: replicate state across regions and define RPO/RTO. PRIMARY (us-east) → DR (us-west): object store (S3) → replica via S3 CRR; Kafka commit log → mirrored topics via MirrorMaker 2; catalog (Glue / Nessie / Polaris) → catalog metadata replica via metadata sync. RPO = replication lag (data you can lose); RTO = time to promote DR and redirect. Active-passive: the DR region (B) is a warm standby; active-active: both regions serve and conflict resolution is required.]
Replicate object store, log and catalog metadata; RPO is your replication lag, RTO is your promotion time

DR for a lakehouse is mostly a metadata problem: the data files replicate cheaply via cross-region replication, but a query can't read them until the catalog that points at them is recovered and consistent.

The techniques

  • RPO/RTO first: agree the acceptable data-loss window (RPO) and recovery time (RTO) before choosing a topology — they drive everything else.
  • Active-passive vs active-active: warm standby (cheaper, simpler, slower RTO) vs both-regions-serving (fast, expensive, needs conflict resolution / single-writer per key).
  • Cross-region replication: S3 CRR for data, Kafka MirrorMaker 2 for the log, and catalog metadata replication — recovering Iceberg/Nessie/Glue is the part teams forget.
  • Prove it: game-day failover drills and chaos engineering — an untested DR plan is a hypothesis, not a capability.
Interview signal — lead with RPO/RTO, then stress that the catalog is the recovery bottleneck ("the data's already in us-west; the question is whether Nessie/Glue can point at it consistently"), and insist the plan is only real if it's been chaos-tested.
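The "catalog is the bottleneck" point can be made concrete: a snapshot is only usable in the DR region if every file it references has already replicated. The sketch below is a toy model with hypothetical names, not a catalog API; it picks the newest consistent snapshot and reports how much committed history would be lost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int      # seconds since an arbitrary epoch
    files: frozenset       # data files the snapshot references


def latest_recoverable(snapshots, replicated_files):
    """Newest snapshot whose files have all reached the DR region, else None."""
    usable = [s for s in snapshots if s.files <= replicated_files]
    return max(usable, key=lambda s: s.committed_at, default=None)


def lost_window_seconds(snapshots, replicated_files):
    """Commit time between the recoverable snapshot and the primary's last commit."""
    snap = latest_recoverable(snapshots, replicated_files)
    if snap is None:
        return None
    return max(s.committed_at for s in snapshots) - snap.committed_at


SNAPSHOTS = [
    Snapshot(1, 100, frozenset({"a"})),
    Snapshot(2, 200, frozenset({"a", "b"})),
    Snapshot(3, 300, frozenset({"a", "b", "c"})),
]
```

With files `a` and `b` replicated, the DR catalog must point at snapshot 2, and the 100 seconds of commits after it are the achieved RPO; pointing at snapshot 3 would produce a table with a missing file.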
✦ ✦ ✦
Nº 12 · Iceberg · table ops

Table Maintenance Automation

Why it's hot in 2026: Iceberg adoption is exploding, but many engineers stop at CREATE TABLE. A surprising share of production failures in 2026 are metadata failures, not compute failures — unmaintained tables that slowly strangle their own planning.

[Diagram: unmaintained table → scheduled maintenance → healthy table. Cruft accrues: thousands of snapshots, millions of tiny manifests, orphan files from failed writes, small data files, so planning slows to a crawl. Scheduled maintenance (policy-as-code): rewrite_data_files, expire_snapshots, rewrite_manifests, remove_orphan_files, plus branching/tagging for safe writes and rollback. Healthy table: few current snapshots, compact manifests, right-sized files, so planning is fast and cost is lower.]
Compaction, snapshot expiry, manifest rewrites and orphan removal keep planning fast and storage lean

An Iceberg table is a living thing: every write adds snapshots, manifests and files. Without scheduled maintenance, planning time and storage creep up until the table becomes the incident.

The techniques

  • Compaction (rewrite_data_files) to kill small files; manifest rewrites to keep the manifest tree shallow.
  • Snapshot expiration + orphan-file removal to reclaim storage and stop metadata bloat (the №02 planning killer).
  • Branching/tagging for isolated writes, audits and instant rollback; incremental processing off Iceberg snapshots/changelog to read only what changed.
  • Automate it as policy-as-code on a schedule — not a heroic manual cleanup after the table is already slow.
Interview signal — say "in 2026 the outage is usually metadata, not compute," then name the four maintenance ops and their cadence, and that you monitor manifest/snapshot counts as health metrics. Pairs with catalogs (№02) and layout (№03).
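Treating maintenance as policy-as-code can be as simple as mapping health metrics to the operations that are due. In this sketch the procedure names are the four named above, but every threshold and metric name is a hypothetical starting point, and the function only decides what to run; it does not call any engine.

```python
# Procedure names are the four listed above; every threshold is hypothetical.
POLICY = {
    "rewrite_data_files": lambda t: t["avg_file_mb"] < 64,
    "expire_snapshots": lambda t: t["snapshot_count"] > 500,
    "rewrite_manifests": lambda t: t["manifest_count"] > 1_000,
    "remove_orphan_files": lambda t: t["days_since_orphan_sweep"] >= 7,
}


def due_maintenance(table_health: dict) -> list:
    """Return the maintenance operations whose trigger condition is met."""
    return [op for op, is_due in POLICY.items() if is_due(table_health)]


HEALTHY = {"avg_file_mb": 256, "snapshot_count": 40,
           "manifest_count": 120, "days_since_orphan_sweep": 2}
NEGLECTED = {"avg_file_mb": 3, "snapshot_count": 9_000,
             "manifest_count": 250_000, "days_since_orphan_sweep": 90}
```

The same metrics that drive the policy (snapshot count, manifest count, average file size) are the ones worth alerting on, so the schedule and the health dashboard share one definition.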
✦ ✦ ✦
Nº 13 · DPRE · SRE-for-data

Data Platform Reliability Engineering

Why it's hot in 2026: companies increasingly expect data engineers to think like SREs — SLOs, error budgets, automated rollback and self-healing. "What happens if Kafka is down for 4 hours?" is now a standard senior question.

[Diagram: SLI → SLO → error budget → automated response. SLIs measured: freshness/lag, latency, completeness, error rate. SLO targets: e.g. 99% < 15-min freshness. Error budget: burn rate gates releases. Automated response: budget burned → freeze; auto-rollback a bad deploy; self-heal/runbook; page on-call.]
The SRE loop applied to data: measure SLIs, set SLOs, spend an error budget, automate the response

DPRE imports the SRE discipline into data: pipelines get reliability targets, an error budget that gates change, and automated responses so humans aren't the first line of defence.

The techniques

  • SLIs/SLOs for pipelines: freshness, end-to-end latency, completeness and error rate — with explicit targets, not vibes.
  • Error budgets that gate releases: burn the budget and you freeze risky changes until reliability recovers.
  • Automated rollback & self-healing: bad deploy or data → revert to the last good snapshot (Iceberg time-travel helps), retry/quarantine, runbook automation.
  • Degradation design: answer "Kafka down 4 hours?" with buffering, replay from offsets, and an explicit acceptable-data-loss stance.
Interview signal: answer the outage question with concrete numbers: "freshness SLO is 15 min; producers buffer (or the source log is retained) while Kafka is down, and Kafka retains 24 h for downstream replay, so a 4-hour outage is recoverable with zero data loss, but it is an SLO breach we'd spend error budget on." That specificity is the staff tell. Builds on Data Quality & Reliability.
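The outage answer is mostly arithmetic, which a short sketch makes explicit. The assumptions here are a 99% freshness SLO measured over a 30-day window and the whole outage counting as out-of-SLO time; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes per window the SLO allows the pipeline to be out of target."""
    return (1 - slo) * window_days * 24 * 60


def assess_outage(outage_hours, retention_hours, slo=0.99, window_days=30):
    """Can the gap be replayed, and what share of the error budget does it burn?"""
    budget = error_budget_minutes(slo, window_days)
    return {
        "replayable": outage_hours < retention_hours,
        "budget_burned": round(outage_hours * 60 / budget, 3),
    }


if __name__ == "__main__":
    print(assess_outage(outage_hours=4, retention_hours=24))
```

Under these assumptions the monthly budget is about 432 minutes, so a 4-hour outage is replayable but consumes a bit more than half of it, which is the number that justifies a release freeze.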
✦ ✦ ✦
Nº 14 · security · governance

Lakehouse Security & Governance

Why it's hot in 2026: most DE topic lists stop at schema and lineage, but interviews — especially in healthcare, finance and AI — increasingly probe row/column-level security, dynamic masking, PII handling and GDPR/CCPA deletion as policy-as-code.

[Diagram: role → policy engine → row filter + column mask → governed result. A user/role (zero-trust identity) queries through a policy engine (policy-as-code, PII auto-classified, RLS + masking rules). Governed view for the analyst role: (EU, ***-**-1234, a••@x.com), (EU, ***-**-9920, b••@y.com); US rows are filtered out (RLS) and ssn/email are masked (column mask). A GDPR/CCPA right-to-be-forgotten delete workflow sits alongside.]
One policy engine enforces row-level security + dynamic masking per role; deletion workflows satisfy GDPR/CCPA

Governance is shifting from documentation to enforcement: access rules live as code in the query path, and compliance (deletion, residency) is an engineered workflow, not a manual scramble.

The techniques

  • Row-level security (per-role row filters) and column-level / dynamic masking (mask SSNs, emails unless authorized) enforced centrally.
  • PII detection & tagging to drive masking automatically; policy-as-code so access rules are versioned and reviewed.
  • GDPR/CCPA deletion workflows: right-to-be-forgotten at scale on immutable files, using Iceberg row-level deletes, then compaction plus snapshot expiration to physically purge the old files.
  • Zero-trust data access: short-lived, attribute-based credentials; no standing broad grants.
Interview signal: explain how you do a GDPR delete on an immutable lakehouse (write delete files, rewrite/compact, then expire the snapshots that still reference the old data files, so the rows are truly purged and not just hidden), and put enforcement in a central policy engine rather than copy-pasted WHERE clauses. Connects to lineage (№08) for "where did this PII flow?"
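Central enforcement means one function applies the row filter and the column masks for every role, instead of each query carrying its own WHERE clause. The sketch below is an in-memory illustration with hypothetical roles, policies and rows; real engines express this as catalog-level policies.

```python
ROWS = [
    {"region": "EU", "ssn": "123-45-1234", "email": "ann@x.com"},
    {"region": "EU", "ssn": "987-65-9920", "email": "bob@y.com"},
    {"region": "US", "ssn": "555-11-0000", "email": "cy@z.com"},
]

# Hypothetical policy: one entry per role, evaluated centrally for every query.
POLICIES = {
    "eu_analyst": {"row_filter": lambda r: r["region"] == "EU",
                   "masked": {"ssn", "email"}},
    "auditor": {"row_filter": lambda r: True, "masked": set()},
}

MASKS = {
    "ssn": lambda v: "***-**-" + v[-4:],
    "email": lambda v: v[0] + "••@" + v.split("@", 1)[1],
}


def governed_view(rows, role):
    """Apply the role's row filter, then mask its restricted columns."""
    policy = POLICIES[role]
    return [
        {c: MASKS[c](v) if c in policy["masked"] else v for c, v in row.items()}
        for row in rows
        if policy["row_filter"](row)
    ]
```

An `eu_analyst` sees two EU rows with masked `ssn` and `email`; an `auditor` sees all three rows unmasked. Changing who sees what is then a reviewed change to `POLICIES`, not a hunt through queries.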
✦ ✦ ✦
Nº 15 · CDC · incremental ingestion

Change Data Capture (CDC) at Scale

Why it's hot in 2026: a surprising number of modern architectures are OLTP → CDC → Kafka → Iceberg → Trino rather than traditional batch ETL. Getting ordering, idempotency and exactly-once right is one of the hottest operational topics.

[Diagram: OLTP → CDC → Kafka → Iceberg MERGE → Trino. OLTP DB (WAL/binlog) → Debezium (log-based capture) → Kafka (ordered per key) → stream processor (dedup, idempotent) → Iceberg (MERGE upsert) → Trino (query). Ordering is guaranteed within a key (partition), not globally; exactly-once = idempotent upsert keyed on (pk) + dedup on op_seq, handling late and out-of-order events.]
Log-based CDC turns every OLTP change into an ordered event, merged idempotently into the lakehouse

CDC replaces nightly batch with a continuous, low-latency mirror of the source. The hard parts aren't the capture — they're the correctness guarantees on the way into the lake.

The techniques

  • Log-based capture (Debezium off the WAL/binlog) — low-impact and complete, vs query-based polling that misses deletes.
  • Ordering & idempotency: Kafka guarantees order within a key/partition; make the apply idempotent (upsert on PK, dedup on an op_seq/LSN) so replays are safe.
  • Exactly-once into Iceberg: in practice, at-least-once delivery plus an idempotent MERGE (upserts/deletes); handle late-arriving and out-of-order updates with a monotonic version tiebreaker.
  • Snapshots + deletes: tombstone handling so source deletes propagate, not just inserts/updates.
Interview signal — say "Kafka only orders within a key, so I key by primary key and make the upsert idempotent on an LSN/op_seq — replays and out-of-order events then can't corrupt state." That sentence covers ordering, idempotency and exactly-once at once. Mechanics in Performance №24 (MERGE).
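A minimal sketch of the idempotent apply described above, using an in-memory dict in place of the Iceberg table. The `ChangeEvent` and `Slot` types, the `"c"/"u"/"d"` op codes and the sample events are assumptions made for illustration; a real pipeline would express the same rule as a MERGE condition on the LSN.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    pk: int               # primary key; assumed to also be the Kafka message key
    lsn: int              # monotonic log position from the source (op_seq)
    op: str               # "c" create, "u" update, "d" delete
    row: Optional[dict]   # new row image; None for deletes


@dataclass
class Slot:
    lsn: int              # highest LSN applied for this key
    row: Optional[dict]   # None marks a tombstone (deleted at this LSN)


def apply_events(state: Dict[int, Slot], events: Iterable[ChangeEvent]) -> None:
    """Apply change events so that replays and reordering are harmless."""
    for ev in events:
        current = state.get(ev.pk)
        # Version tiebreaker: skip anything at or below the LSN already applied.
        if current is not None and ev.lsn <= current.lsn:
            continue
        # Keep a tombstone on delete so a late, older update cannot resurrect the row.
        state[ev.pk] = Slot(ev.lsn, None if ev.op == "d" else ev.row)


def live_rows(state: Dict[int, Slot]) -> Dict[int, dict]:
    """Rows visible to queries: everything that is not a tombstone."""
    return {pk: s.row for pk, s in state.items() if s.row is not None}


events = [
    ChangeEvent(1, 10, "c", {"status": "new"}),
    ChangeEvent(1, 30, "u", {"status": "shipped"}),
    ChangeEvent(1, 20, "u", {"status": "paid"}),   # arrives late, must be ignored
    ChangeEvent(2, 11, "c", {"status": "new"}),
    ChangeEvent(2, 40, "d", None),
    ChangeEvent(2, 35, "u", {"status": "paid"}),   # older than the delete
]
state: Dict[int, Slot] = {}
apply_events(state, events)
apply_events(state, events)   # replaying the whole batch changes nothing
print(live_rows(state))       # {1: {'status': 'shipped'}}
```

Keeping the tombstone with its LSN, rather than removing the key, is the design choice that lets a delete win over an update that shows up afterwards with a lower LSN.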
✦ ✦ ✦
Nº 16 · FinOps · compute

Cost-Aware Compute Optimization

Why it's hot in 2026: storage cost (№01) has a compute twin. "Cut this job's cost 40% without hurting the SLA" is a more realistic senior question than "write a Spark transformation" — it tests judgment, not syntax.

[Diagram: Same SLA, lower bill: turn the right knobs. Before (over-provisioned): fixed huge cluster, all on-demand, idle between stages; $$$$, SLA met. Cost knobs: Spark AQE, autoscaling, spot / preemptible, right-sized executors, shuffle tuning, resource queues; serverless vs cluster (pay-per-query convenience vs steady-state cost). After (right-sized): scales to the work, spot + on-demand base, no idle burn; −40%, same SLA.]
AQE, autoscaling, spot instances, right-sizing, shuffle tuning and queues — cost down, SLA held

Compute cost is a design constraint, not an afterthought. The senior skill is finding the slack — idle capacity, over-provisioning, needless shuffle — without putting the SLA at risk.

The techniques

  • Spark AQE (partition coalescing, skew-join splitting, dynamic join-strategy switching) and shuffle optimization to cut the most expensive stages.
  • Autoscaling + spot/preemptible with an on-demand base for the driver/critical path — the biggest lever, if your job tolerates interruption.
  • Right-size executors (cores/memory to the actual workload) and use resource queues so cheap batch doesn't starve interactive queries.
  • Serverless trade-off: pay-per-query convenience and zero idle vs cost at steady high utilization — choose per workload.
Interview signal — answer "cut cost 40%" with a prioritized plan — spot instances + autoscaling first (biggest win), then AQE/shuffle, then right-sizing — each with its SLA risk called out. That ordering-by-impact-and-risk is the staff move. Deep shuffle/skew mechanics in Performance Families 2–3.
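As a rough illustration, the Spark-side knobs above map onto a handful of configuration keys. The keys below are standard Spark settings; every value is a placeholder assumption to be tuned against the job's own metrics, and the spot/on-demand mix is set in the cluster manager or cloud provider rather than in Spark, so it does not appear here.

```python
# Standard Spark configuration keys; the values are assumed starting points.
COST_CONF = {
    # Adaptive Query Execution: re-plan at runtime from shuffle statistics.
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge tiny shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions
    # Dynamic allocation: release idle executors between stages.
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",    # assumed floor
    "spark.dynamicAllocation.maxExecutors": "50",   # assumed ceiling
    # Right-sizing: validate against observed CPU and memory use.
    "spark.executor.cores": "4",
    "spark.executor.memory": "8g",
}


def to_submit_args(conf: dict) -> list:
    """Render the settings as spark-submit arguments, sorted for stable diffs."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(COST_CONF)))
```

Keeping the settings in one reviewed dictionary makes each cost change a visible diff, which helps when every knob has to be justified against its SLA risk.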
✦ ✦ ✦
Nº 17 · discovery · metadata

Metadata Engineering

Why it's hot in 2026: it's becoming its own specialty. The industry is moving from "store data" to "make data discoverable and trustworthy" — discovery, a business glossary, ownership and a semantic layer over a fast metadata catalog.

[Diagram: From "store data" to "make it discoverable & trustworthy". Datasets (orders, users, events, …1000s more) → indexed metadata catalog (technical metadata, business glossary, ownership / domain, quality / freshness, plus lineage (№08) and classification (№14)) → trustworthy consumption (search & discovery, "Google for data"; semantic layer with governed metrics; data products with owners & SLAs).]
A fast, indexed metadata layer turns thousands of datasets into searchable, owned, trustworthy data products

When there are thousands of tables, the bottleneck isn't storage or compute — it's finding the right, trustworthy dataset. Metadata engineering makes the catalog a product in its own right.

The techniques

  • Data discovery: a searchable catalog ("Google for data") with ranking by popularity, freshness and quality signals.
  • Business glossary + semantic layer: shared definitions and governed metrics so "revenue" means one thing everywhere (ties to Analytics №04).
  • Ownership & data products: every dataset has an owner, an SLA and a contract — domain-driven, not orphaned.
  • Catalog performance: metadata indexing so discovery and column-level lineage queries are fast at scale.
Interview signal — frame metadata as a product with discovery, ownership and trust signals — not a wiki nobody updates — and connect it to lineage (№08), the semantic layer and data-product ownership (№20). The shift is "store data → make data findable and trustworthy."
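One way to picture discovery ranking is a score that blends the popularity, freshness and quality signals listed above. The formula, the seven-day half-life, the `DatasetMeta` fields and the sample catalog are all illustrative assumptions, not a description of any particular catalog product.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DatasetMeta:
    name: str
    owner: str
    queries_30d: int          # popularity signal
    last_updated: datetime    # freshness signal
    quality_pass_rate: float  # share of passing checks, 0.0 to 1.0


def trust_score(ds: DatasetMeta, now: datetime, half_life_days: float = 7.0) -> float:
    """Blend popularity, freshness and quality into one ranking score."""
    popularity = math.log1p(ds.queries_30d)                # damp very hot tables
    age_days = max((now - ds.last_updated).total_seconds() / 86400.0, 0.0)
    freshness = 0.5 ** (age_days / half_life_days)         # halves every half-life
    return popularity * freshness * ds.quality_pass_rate


def search(catalog, term: str, now: datetime):
    """Substring match on the name, ranked by trust score (best first)."""
    hits = [d for d in catalog if term.lower() in d.name.lower()]
    return sorted(hits, key=lambda d: trust_score(d, now), reverse=True)


now = datetime(2026, 1, 15, tzinfo=timezone.utc)
catalog = [
    DatasetMeta("orders_raw_copy", "unknown", 900, now - timedelta(days=60), 0.70),
    DatasetMeta("orders_curated", "orders-team", 400, now - timedelta(days=1), 0.99),
    DatasetMeta("users_dim", "identity-team", 300, now - timedelta(days=2), 0.95),
]
print([d.name for d in search(catalog, "orders", now)])
# ['orders_curated', 'orders_raw_copy']
```

The stale copy is queried more often, yet the fresh, owned, well-tested table ranks first, which is the behaviour that steers users toward the trustworthy dataset.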
✦ ✦ ✦
Nº 18 · staff system design · events

Event-Driven Architectures

Why it's hot in 2026: one layer above Kafka partitioning (№05) sit the patterns that appear constantly in staff-level system-design rounds — event sourcing, CQRS, the outbox pattern and sagas.

[Diagram: Events as source of truth → CQRS read models · sagas. Service (one transaction: write state + write outbox, atomic) → relay / CDC (outbox → log) → event log (append-only, versioned, source of truth) → CQRS projections (read models for search, analytics, cache). Saga: multi-service transaction with compensations.]
The outbox makes state+event atomic; the event log is the source of truth feeding CQRS projections and sagas

Event-driven design treats the stream of events as the system of record, with everything else a derived view. It's how you decouple services and rebuild state on demand.

The techniques

  • Event sourcing: persist state changes as an append-only log; current state is a fold over events, and you can replay/rebuild any projection.
  • CQRS: separate the write model from many purpose-built read models (search, analytics, cache) — each optimized for its query.
  • Outbox pattern / transactional messaging: write state and the event in one DB transaction, relay the outbox to the log — no dual-write inconsistency.
  • Sagas for multi-service transactions (with compensating actions), and event versioning for schema evolution of the log.
Interview signal — reach for the outbox pattern the moment someone says "write to the DB and publish to Kafka" — naming the dual-write problem and its fix is an instant staff signal. Then connect event versioning back to schema governance (№06).
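The outbox pattern can be sketched with SQLite standing in for the service database and a plain callback standing in for the Kafka producer. The table layout, the `OrderPlaced` event shape and the function names are assumptions for illustration only.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def place_order(conn, order_id):
    """State change and event commit together, or not at all."""
    with conn:  # one transaction: commits on success, rolls back on exception
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,))
        event = {"type": "OrderPlaced", "order_id": order_id, "version": 1}
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps(event)),
        )


def relay_outbox(conn, publish):
    """Forward unpublished rows in order; mark each only after publish succeeds."""
    rows = conn.execute(
        "SELECT seq, topic, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, topic, payload in rows:
        publish(topic, payload)  # may raise; the row then stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


log = []  # stand-in for the event log
place_order(conn, 1)
place_order(conn, 2)
relay_outbox(conn, lambda topic, payload: log.append((topic, json.loads(payload))))
print([event["order_id"] for _, event in log])  # [1, 2]
```

Because a row is marked published only after the publish call returns, a crash in between leads to a resend rather than a lost event. Delivery is therefore at-least-once, and consumers still need to be idempotent.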
✦ ✦ ✦
Nº 19 · AI · retrieval infrastructure

Vector Data Infrastructure

Why it's hot in 2026: the AI pillar's vector stores deserve their own deep-dive. Companies now expect data engineers to own the embedding pipelines, indexing and freshness that AI workloads consume.

[Diagram: two paths. Ingest: docs / data → chunk (split + clean) → embed (model → vectors) → vector index (ANN: HNSW/IVF), with re-embedding on model or content change for freshness. Query: user query → embed query (same model) → hybrid search (ANN + keyword/BM25) → re-rank (top-k context) → LLM. Multi-modal: text, image, audio.]
Embedding pipeline builds the index; queries do hybrid ANN + keyword search and re-rank, with re-embedding for freshness

A vector store is only as good as the pipeline that fills and refreshes it. This is squarely data-engineering work — ingestion, indexing, freshness and lineage — not model training.

The techniques

  • Embedding pipelines: chunk → embed → upsert into the index, with the same model on both ingest and query side.
  • Vector indexing & ANN search: HNSW/IVF trade-offs (recall vs latency vs memory); filtering by metadata alongside the vector.
  • Hybrid search (vector + keyword/BM25) and re-ranking, to recover relevance that pure-vector recall misses.
  • Freshness & re-embedding: re-embed on content or model change; track lineage (doc→chunk→vector) so RAG answers are auditable. Multi-modal retrieval where needed.
Interview signal — stress that RAG quality is a data-freshness and pipeline problem, not a prompt problem — and that re-embedding strategy (what triggers it, how you backfill without downtime) is the part most teams underbuild. Extends AI-Ready Infra (№10).
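The re-embedding trigger can be sketched as a comparison between what the index already holds and what the current content and model would produce. The fixed-width chunker, the `VectorRecord` fields and the model version strings are hypothetical; the sketch only plans the work and does not call any embedding model.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorRecord:
    doc_id: str          # lineage: which document the vector came from
    chunk_no: int        # lineage: which chunk of that document
    content_hash: str    # hash of the chunk text that was embedded
    model_version: str   # embedding model that produced the vector


def chunk(text: str, size: int = 200):
    """Naive fixed-width chunking; real pipelines usually split on structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def content_hash(chunk_text: str) -> str:
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()


def plan_reembed(doc_id: str, text: str, index, model_version: str):
    """Return (chunk numbers to embed again, chunk numbers to delete)."""
    existing = {r.chunk_no: r for r in index if r.doc_id == doc_id}
    chunks = chunk(text)
    stale = []
    for no, piece in enumerate(chunks):
        rec = existing.get(no)
        if (rec is None                                   # new chunk
                or rec.content_hash != content_hash(piece)  # content changed
                or rec.model_version != model_version):     # model changed
            stale.append(no)
    orphaned = sorted(no for no in existing if no >= len(chunks))
    return stale, orphaned


text_v1 = "a" * 200 + "b" * 200 + "c" * 100
index = [VectorRecord("doc-1", n, content_hash(c), "embed-v1")
         for n, c in enumerate(chunk(text_v1))]

print(plan_reembed("doc-1", text_v1, index, "embed-v1"))                # ([], [])
print(plan_reembed("doc-1", "a" * 200 + "B" * 200, index, "embed-v1"))  # ([1], [2])
print(plan_reembed("doc-1", text_v1, index, "embed-v2"))                # ([0, 1, 2], [])
```

Storing the content hash and model version next to each vector is what makes the plan cheap: an edit touches only the changed chunks, while a model upgrade flags everything for a backfill.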
✦ ✦ ✦
Nº 20 · organizational architecture

Data Mesh & Domain Ownership

Why it's hot in 2026: the hype cooled but the principles stuck. Staff-level interviews increasingly test organizational architecture — domain ownership, data products and federated governance — not just technical architecture.

[Diagram: Domains own data products · platform team enables · governance federates. Domains: Orders (owns the orders data product, with SLA + contract), Payments (owns the payments data product), Growth (owns the engagement data product). Beneath them: a self-serve data platform (storage, catalog, CI/CD, observability); domains build on the platform, and the platform team does not own the data. Across all: federated computational governance, with global standards (interop, security, quality) and local autonomy.]
Domains own data products; the platform team enables; governance is federated — global standards, local autonomy

Data mesh is an answer to an organizational scaling problem: a central team becomes the bottleneck as data needs outgrow it. The fix is ownership at the domain, on a shared platform.

The techniques

  • Domain ownership: the team that knows the data owns it end-to-end — no central team bottleneck for every change.
  • Data as a product: each domain ships discoverable, documented, SLA-backed data products with contracts (№07) and ownership in the catalog (№17).
  • Self-serve platform: a platform team provides the paved road (storage, catalog, CI/CD, observability) so domains move fast without reinventing infra.
  • Federated computational governance: global, automated standards (interop, security, quality) with local autonomy — not a central committee.
Interview signal — distinguish platform team vs domain team responsibilities crisply (platform enables, domains own), and note when mesh is wrong — a small org doesn't need it, and without strong platform + governance it just becomes silos. Knowing when not to apply it is the staff signal.
✦ ✦ ✦
Nº 21 · DataOps · AI-assisted delivery

CI/CD & DataOps Optimization with AI

Why it's hot in 2026: data-platform code and pipelines ship through CI/CD too — and AI is now optimizing the delivery loop itself. Selective testing, change-risk scoring and anomaly-gated deploys are turning CI/CD from a slow gate into an intelligent one.

[Diagram: AI augments every stage: selective tests · risk scoring · anomaly-gated deploy. Stages: PR / commit (the diff) → AI review (change-risk score, flag risky schema) → CI tests (impact-selected, flaky quarantine) → build (smart cache / DAG, predict failure) → canary deploy (AI anomaly gate on DQ + freshness) → prod (promote). On anomaly: auto-rollback + self-healing fix PR. Inputs: blast radius from lineage (№08), error-budget gating (№13), contract checks (№07).]
AI optimizes the loop (selective tests, risk scoring, predictive build) and closes it (anomaly-gated canary + auto-rollback)

CI/CD for a data platform gates on more than unit tests — it gates on data quality, contract compatibility and blast radius. AI both speeds the loop up and makes the deploy decision smarter.

The techniques

  • Test impact analysis / selective testing: run only the tests a diff actually affects (and pipeline tests for the touched models) — minutes instead of a full suite.
  • Flaky-test detection & quarantine: classify and auto-quarantine flaky tests so they stop blocking merges, and fix them on a separate track.
  • AI code review & change-risk scoring: auto-review PRs, score blast radius (from lineage, №08), and flag risky schema changes (№06) for human eyes; predict build failures and optimize caching/DAG parallelism.
  • Anomaly-gated progressive delivery: canary/blue-green where an AI watches data-quality, freshness and error metrics to auto-promote or auto-rollback — self-healing pipelines (auto-retry, revert, even auto-fix PRs), the DPRE loop (№13) closed automatically.
Interview signal — frame data CI/CD as gating on data quality + contract compatibility + blast radius, not just code tests — then describe AI optimizing the loop (selective testing, risk scoring) and closing it (anomaly-gated canary with auto-rollback). Tie it back to contracts (№07), lineage (№08) and error budgets (№13) — that systems view is the staff signal.
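Selective testing driven by lineage can be sketched as a graph walk: start from the models a diff touches, follow downstream edges, and run only the tests attached to what was reached. The model names, the `DOWNSTREAM` lineage map and the `TESTS` registry are hypothetical examples.

```python
from collections import deque

# Hypothetical lineage: model -> models that read from it.
DOWNSTREAM = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_orders", "dim_customers"],
    "fct_orders": ["revenue_daily"],
    "dim_customers": [],
    "revenue_daily": [],
    "raw_events": ["stg_events"],
    "stg_events": [],
}

# Hypothetical test registry: model -> tests that cover it.
TESTS = {
    "stg_orders": ["test_stg_orders_unique_id"],
    "fct_orders": ["test_fct_orders_amount_non_negative"],
    "revenue_daily": ["test_revenue_daily_matches_ledger"],
    "stg_events": ["test_stg_events_schema"],
}


def blast_radius(changed, downstream):
    """Breadth-first walk: the changed models plus everything downstream."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def select_tests(changed, downstream, tests):
    """Only the tests attached to models inside the blast radius."""
    affected = blast_radius(changed, downstream)
    return sorted(t for model in affected for t in tests.get(model, []))


changed = ["stg_orders"]
print(len(blast_radius(changed, DOWNSTREAM)))   # 4
print(select_tests(changed, DOWNSTREAM, TESTS))
# ['test_fct_orders_amount_non_negative', 'test_revenue_daily_matches_ledger',
#  'test_stg_orders_unique_id']
```

The size of the affected set doubles as a simple blast-radius figure that a change-risk score could take as one input; the events branch is untouched, so its test is skipped.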
✦ ✦ ✦
§ How the twenty-one connect

One thread runs through all twenty-one.

Read together, the 2026 set is a single argument: now that the format is settled, value moved up the stack — to the catalog, metadata and table internals (№02, №12, №17), to layout and cost on both storage and compute (№03, №01, №09, №16), to reliability, quality, security and governance enforced as code (№13, №07, №06, №14, №08), to the streaming, CDC and skew fundamentals that never stopped mattering (№04, №05, №15), to resilience across regions (№11), to the staff-level concerns of event-driven and organizational architecture (№18, №20), and to the AI-ready foundations — feature, vector and retrieval infra — that are the new table stakes (№10, №19), with AI now optimizing the delivery loop itself (№21). The senior signal in every one is the same: you tie the technical choice to cost, blast radius, or trust — not just correctness.

Where to go deeper → The mechanics behind these live in the other pillars: Performance (scan/shuffle/skew/incremental), Analytics (rollups, serving, additivity), and Design (the schemas, plus dedicated Hot Shards, Streaming and Data Quality deep-dives). Practice the query mechanics in Practice · Q&A and pressure-test yourself in Skill Check.
