PaddySpeaks · Systems at the Whiteboard · Nº 26

The Orchestration Problem

Design Kubernetes — the container orchestration system that runs most of the world's cloud workloads. What happens when you run kubectl apply -f deployment.yaml? From API server validation through etcd persistence, scheduler placement, kubelet container creation, and health probes — to how the control plane, autoscaler, and service discovery hold together at 10,000 nodes. The schema, the watch API as distributed pub/sub, the scheduling algorithm, and the data architecture behind the pods your code runs inside.

WORDS · PaddySpeaks Editorial
FIELD · Distributed Systems / Platform Engineering
READING · ~42 min

  📦 Not familiar with containers yet? Start with the container primer first: The Container Problem (what Docker is, images, layers, namespaces, cgroups; the foundation Kubernetes builds on).

§ 00 — BEFORE KUBERNETES: The container sprawl problem

Containers were a revolution — until you had hundreds of them. Before orchestration existed, teams discovered a new category of operational pain that Docker itself couldn't solve.

Problem (pre-K8s)	Symptom	K8s Solution
Container Sprawl	Hundreds of containers, no coordination, unknown health	ReplicaSets + controllers maintain desired count automatically
Manual deployment	SSH into servers, run `docker run` by hand: error-prone, not repeatable	Declarative manifests + `kubectl apply`: the API server is the single entry point for every deployment
No self-healing	Container crashes = downtime until someone notices and restarts it	kubelet restarts crashed containers; ReplicaSet replaces dead pods
No service discovery	Hardcoded IP addresses; a redeployed container gets a new IP, everything breaks	Services provide stable ClusterIPs + DNS names via CoreDNS

Question	60-second answer
Why Kubernetes exists	Bin-packing, self-healing, service discovery — schedule containers across a fleet and keep them running without human intervention.
Control plane in one sentence	API server is the front door; etcd is the truth store; scheduler assigns pods to nodes; controller manager reconciles desired vs actual state.
kubectl apply → pod running	API server validates → writes to etcd → Deployment controller creates ReplicaSet → scheduler binds pod to node → kubelet pulls image and starts container.
How service discovery works	DNS via CoreDNS; kube-proxy programs iptables/IPVS rules; Services are stable virtual IPs load-balanced across healthy pod endpoints.
10,000-node scheduling	Predicates filter ineligible nodes; scoring ranks feasible nodes; scheduler samples a subset (not all 10K) for latency. Binding is atomic via etcd compare-and-swap.


§ 01 — THE QUESTION: The container orchestration problem

Containers solve the "it works on my machine" problem. Kubernetes solves the "who runs the container on which machine, keeps it alive when the machine dies, and routes traffic to it" problem — at the scale of tens of thousands of machines and hundreds of thousands of containers, with zero human hand-holding.

Interview Prompt

"Design Kubernetes. Walk me through what happens when you run kubectl apply -f deployment.yaml — from API server to pod running on a node. Then explain how you'd design the scheduler, autoscaler, and service discovery to handle a 10,000-node cluster."

LEVEL · SENIOR / STAFF
DURATION · 45–60 MIN
FORMAT · WHITEBOARD

The question catches most candidates because Kubernetes looks like a deployment tool but is actually a distributed database with a reconciliation engine built on top. A weak answer names the components. A strong answer names the four forces that make this hard — and shows how every design decision exists to survive one of them.

There are exactly four forces:

THE DESIRED-STATE GAP: The cluster is always drifting away from what you declared. Nodes crash, containers OOM-kill, images fail to pull, network partitions split the cluster. The entire control plane architecture exists to continuously close the gap between "what the user declared" (etcd) and "what is actually running" (node status). This is reconciliation loops all the way down.
THE SCHEDULING HARDNESS: Optimal bin-packing is NP-hard. At 10,000 nodes with 50 constraints per pod, a brute-force scheduler would take minutes. Kubernetes solves this with a two-phase filter-then-score pipeline, plus sampling — the scheduler never evaluates all nodes. But "good enough in 10ms" requires deliberately accepting suboptimal placement.
THE ETCD BOTTLENECK: One consistent store for the entire cluster creates a write bottleneck. Every API object (pod, service, endpoint, configmap) is stored in etcd. Clients do not watch etcd directly: the API server holds the watches against etcd's gRPC watch stream and fans events out to the thousands of client watches registered with it. etcd itself sustains on the order of 1,000–3,000 writes/sec. At 10,000 nodes sending heartbeats every 5s, that's 10,000 / 5 = 2,000 writes/sec, so heartbeats alone consume most of the write budget. (The kubelet's default heartbeat interval is 10s, which halves that figure.)
THE NETWORK FLATNESS ASSUMPTION: Kubernetes assumes every pod can reach every other pod directly. The CNI spec delegates actual implementation to plugins (Calico, Cilium, Flannel) — Kubernetes itself has no routing table. This assumption powers flat pod networking but means network policy, encryption, and multi-cluster routing all have to be bolted on by the CNI.

Read those four again: they are all data and distributed systems problems. The interesting engineering is in the watch API as a distributed pub/sub system, the scheduler as an online bin-packing approximation algorithm, and the reconciliation loop as a control theory feedback system. The YAML is nearly boring.

Envelope math, volunteered:

Quantity	Estimate	Consequence
Nodes in large cluster	10,000	Node heartbeats every 5s = 2,000 writes/sec to etcd — near the ceiling
etcd write throughput	~1,000–3,000 req/s	Heartbeats are written as small Lease objects rather than full Node status updates, which reduces actual etcd load
Scheduler decisions/sec	~1,000 pods/s peak	Nodes are filtered and scored in parallel goroutines; each decision takes milliseconds with sampling
Watch connections to API server	~100K–1M	Watch multiplexing: the API server keeps one etcd watch per resource type and fans out to all clients
Pod startup latency (P99)	<5 seconds	Image pull is the bottleneck; pre-pull on nodes with DaemonSets
Kubernetes objects per cluster	~300K–500K	etcd key count limit; use namespaces + labels for logical separation
HPA scrape interval	15 seconds	Scale decision latency ~30–60s; cooldown prevents thrashing
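
The heartbeat row is simple enough to check by hand, but a few lines of Python make the assumption explicit. This is a back-of-envelope sketch only: the function names are made up for illustration, and the ceiling values are the table's rough estimates, not measured limits.

```python
def heartbeat_writes_per_sec(nodes: int, interval_s: float) -> float:
    """Each node writes one heartbeat per interval, so the rate is nodes / interval."""
    return nodes / interval_s


def budget_fraction(write_rate: float, etcd_ceiling: float) -> float:
    """Share of an assumed etcd write ceiling consumed by that rate."""
    return write_rate / etcd_ceiling


if __name__ == "__main__":
    rate = heartbeat_writes_per_sec(10_000, 5)
    for ceiling in (1_000, 3_000):  # low and high ends of the table's estimate
        share = budget_fraction(rate, ceiling)
        print(f"{rate:.0f} writes/s is {share:.0%} of a {ceiling}/s ceiling")
```

Doubling the interval to 10 seconds halves the rate, which is why heartbeat frequency is one of the first knobs to look at in a very large cluster.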

§ 02 — PLAIN LANGUAGE: The shipping fleet analogy

Before we dive into etcd and gRPC watches, here is Kubernetes in one analogy that a non-engineer will remember.

Imagine a global shipping fleet. You are the fleet manager. You don't care which specific ship carries which cargo — you care that the cargo arrives, is replaced if the ship sinks, and can be found by other ships that need to connect with it.

THE CAPTAIN (Control Plane): The control plane is the fleet captain — it knows every ship's capacity, what cargo is loaded on each, and what the manifest says should be loaded. It gives orders (schedule this pod here) but never touches the ships directly. The captain doesn't row.
THE SHIPS (Nodes): Nodes are the ships — physical or virtual machines that actually run your containers. Each ship has a first mate (kubelet) who receives orders from the captain and actually loads/unloads cargo. The ship reports its capacity and health back to the captain every 5 seconds.
THE CARGO (Pods): Pods are the shipping containers — the smallest unit of deployment. A pod contains one or more containers that share a network namespace and storage volumes. Just as a shipping container might carry multiple items, a pod might carry an app container plus a sidecar logging agent.
THE MANIFEST (Desired State): When you run kubectl apply, you hand the captain a manifest: "I want 3 copies of this container, each with 1 CPU and 2GB memory, listening on port 8080." The captain's entire job is to make reality match the manifest, and to keep matching it, forever, even as ships sink and cargo goes overboard.

What Kubernetes solves

Problem	Without Kubernetes	With Kubernetes
Bin-packing	You SSH into servers and start containers by hand. Half your CPU is wasted.	Scheduler places pods to maximize resource utilization across the fleet automatically.
Self-healing	Container crashes at 3 AM. On-call wakes up to restart it.	kubelet notices the container died and restarts it. ReplicaSet controller replaces the pod. You sleep.
Service discovery	You hardcode IP addresses. Server gets replaced, IPs change, everything breaks.	Services provide stable DNS names and virtual IPs. `http://payments-api:8080` always works.
Rolling deployments	Deploy new version = downtime, or complex blue-green scripting.	`kubectl set image` does a rolling update: new pods come up, old pods go down, zero downtime.
Scaling	You notice high CPU on the dashboard and manually add instances.	HPA watches metrics and scales replicas up/down automatically, typically within 30–60 seconds.

§ 03 — CONTROL PLANE: The brain of the cluster

The control plane is a set of processes that together maintain the desired state of the cluster. None of them run user workloads. All of them communicate exclusively through the API server — no component talks to etcd directly except the API server.

The five control plane components

Component	What it does	Where it runs
kube-apiserver	REST (HTTPS) front door. Validates all API objects. The only component that talks to etcd (over gRPC). Serves watch streams to clients. Enforces RBAC and admission control.	Control plane nodes (replicated 3×)
etcd	Distributed key-value store. Single source of truth for all cluster state. Raft consensus. Strong consistency guarantees. Every API object lives here.	Control plane nodes (replicated 3× or 5×)
kube-scheduler	Watches for unscheduled pods. Runs filter+score pipeline. Writes a Binding object to the API server. Does NOT start containers.	Control plane nodes (active-standby)
kube-controller-manager	Runs all the reconciliation loops: Deployment controller, ReplicaSet controller, Node controller, Endpoint controller, Job controller, and ~30 more. Each is a goroutine watching the API.	Control plane nodes (active-standby)
cloud-controller-manager	Talks to cloud provider APIs: provision LoadBalancer Services, attach persistent volumes, update Node objects with cloud metadata (instance type, region, zone).	Control plane nodes; optional if on-prem

How they communicate

The golden rule: every component communicates exclusively through the API server. The scheduler does not write to etcd. The controller manager does not call the scheduler. Every action is a write to the API server, which persists to etcd, which triggers a watch event, which wakes up the relevant controller or kubelet.

The API server is the only component that reads and writes etcd. Every other component (scheduler, controller manager, kubelet) communicates by watching the API server over long-lived HTTP streams and writing new objects back through the API. This indirection is not bureaucracy; it's how Kubernetes gets uniform validation, optimistic concurrency control, RBAC enforcement, and audit logging on every request. Delivery is not exactly-once: controllers are written to be idempotent, so a repeated or missed event is corrected on the next reconciliation.
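
The reconciliation idea behind every controller can be shown in a few lines. The sketch below is a toy, not the ReplicaSet controller's real logic: `Action`, `reconcile`, and the pod naming scheme are hypothetical, and the choice of which surplus pods to delete is arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str   # "create" or "delete"
    name: str


def reconcile(desired_replicas: int, running: list[str], prefix: str = "pod") -> list[Action]:
    """Compare desired state with observed state and return actions that close the gap.

    Level-triggered: the result depends only on current state, so calling it
    again after the actions have been applied returns an empty list.
    """
    actions: list[Action] = []
    missing = desired_replicas - len(running)
    if missing > 0:
        used = set(running)
        i = 0
        while missing > 0:
            name = f"{prefix}-{i}"
            if name not in used:          # reuse the first free name
                actions.append(Action("create", name))
                missing -= 1
            i += 1
    elif missing < 0:
        # Too many pods: delete the surplus (highest-sorted names first, an arbitrary rule).
        for name in sorted(running, reverse=True)[:-missing]:
            actions.append(Action("delete", name))
    return actions
```

Because the function looks only at the current state, a controller that crashes and restarts can simply run it again; there is no event history to replay.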

§ 04 — THE NODE: What happens on a worker node

Every worker node runs three processes that together receive a pod spec and turn it into running containers.

kubelet: The node agent. Watches the API server for pods scheduled to this node. For each pod: calls the container runtime (via CRI) to create the pod sandbox, pull the image, and create containers; mounts volumes; runs health probes; and reports status back to the API server every few seconds. The runtime, not the kubelet, invokes the CNI plugin to configure the pod's network namespace. The kubelet is the only component that instructs the runtime to start containers.
kube-proxy: Programs the node's network rules (iptables or IPVS) to implement Service routing. When a Service is created, kube-proxy watches its EndpointSlices and writes iptables/IPVS rules so packets destined for the Service's ClusterIP get redirected to one of the healthy backing pods. Some clusters use an eBPF data plane (for example Cilium) that can replace kube-proxy entirely.
container runtime (containerd / CRI-O): The low-level runtime that speaks the Container Runtime Interface (CRI) gRPC protocol to kubelet. kubelet says "create this container with this image, these env vars, this resource limit." containerd pulls the OCI image from the registry, creates the isolation (using Linux namespaces + cgroups), and returns the container ID. Kubernetes 1.24 removed dockershim, the built-in adapter for Docker Engine; images built with Docker still run unchanged because they are OCI images.

The kubectl apply → pod running sequence

kubectl apply -f deployment.yaml
   │
   ▼
1. API SERVER receives HTTPS request
   ├── Authentication: who are you? (x509 cert / bearer token / OIDC)
   ├── Authorization: are you allowed? (RBAC check)
   ├── Mutating admission: should this be changed? (MutatingWebhookConfiguration, e.g. defaults, sidecar injection)
   ├── Schema validation: is this a valid Deployment spec?
   ├── Validating admission: is this allowed by policy? (ValidatingWebhookConfiguration)
   └── Write to etcd: /registry/deployments/default/my-app  (metadata.resourceVersion changes)

2. DEPLOYMENT CONTROLLER (in controller-manager) notices new Deployment via watch
   └── Creates a ReplicaSet object:  /registry/replicasets/default/my-app-6d4f9c7b2

3. REPLICASET CONTROLLER notices new ReplicaSet
   └── Creates 3 Pod objects, each with spec.nodeName = "" (unscheduled)
       /registry/pods/default/my-app-6d4f9c7b2-xk9p2
       /registry/pods/default/my-app-6d4f9c7b2-m7q4r
       /registry/pods/default/my-app-6d4f9c7b2-p3n8s

4. SCHEDULER notices unscheduled pods via watch (spec.nodeName == "")
   ├── FILTER phase: remove nodes that cannot run this pod
   │     - PodFitsResources: does node have enough CPU/memory?
   │     - PodFitsHostPorts: is the host port available?
   │     - MatchNodeSelector: do labels match nodeSelector/nodeAffinity?
   │     - NoTaint / Toleration: are taints tolerated?
   │     - VolumeZoneConformance: is the PV zone compatible?
   ├── SCORE phase: rank remaining feasible nodes (0–100)
   │     - LeastRequestedPriority: prefer nodes with most free resources
   │     - BalancedResourceAllocation: balance CPU and memory usage
   │     - NodeAffinityPriority: prefer nodes matching preferred affinity
   └── BIND: writes Binding object → API server sets spec.nodeName = "node-42"

5. KUBELET on node-42 notices a pod bound to it via watch
   ├── Calls containerd via CRI: "create sandbox" (pause container for network namespace)
   ├── CNI plugin configures pod network (assigns IP from pod CIDR)
   ├── Calls containerd: "pull image" → overlay filesystem layers stacked
   ├── Runs init containers sequentially (if any); each must exit successfully
   ├── Calls containerd: "create containers" → cgroups + namespaces set up
   ├── Calls containerd: "start containers" → regular containers start once init containers finish
   ├── Starts liveness/readiness/startup probes
   └── Updates pod status in API server: phase=Running, podIP=10.244.42.15

6. ENDPOINT SLICE CONTROLLER notices pod is Running + Ready
   └── Adds pod IP to EndpointSlice for all matching Services

7. kube-proxy on every node notices updated EndpointSlice
   └── Reprograms iptables/IPVS rules — traffic can now reach the new pod

Total time from kubectl apply to pod running: ~2–10 seconds (image pre-pulled) or ~30–90s (cold pull)
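
The sequence above is a chain of watchers reacting to each other's writes. The following toy simulation shows that shape. Everything in it is an assumption for illustration: `ApiServer` is a hypothetical in-memory stand-in, callbacks fire synchronously (real watches are asynchronous streams), and the scheduler always picks the same node.

```python
from collections import defaultdict


class ApiServer:
    """Toy stand-in for the API server: stores objects and notifies watchers by kind."""

    def __init__(self):
        self.objects = {}                    # (kind, name) -> dict
        self.watchers = defaultdict(list)    # kind -> callbacks
        self.log = []                        # ordered record of writes

    def watch(self, kind, callback):
        self.watchers[kind].append(callback)

    def write(self, kind, name, obj):
        event = "MODIFIED" if (kind, name) in self.objects else "ADDED"
        self.objects[(kind, name)] = obj
        self.log.append((event, kind, name))
        for cb in list(self.watchers[kind]):
            cb(event, name, obj)


def wire_controllers(api, node="node-42"):
    def deployment_controller(event, name, dep):
        if event == "ADDED":
            api.write("ReplicaSet", f"{name}-rs", {"replicas": dep["replicas"]})

    def replicaset_controller(event, name, rs):
        if event == "ADDED":
            for i in range(rs["replicas"]):
                api.write("Pod", f"{name}-{i}", {"nodeName": "", "phase": "Pending"})

    def scheduler(event, name, pod):
        if pod["nodeName"] == "":            # unscheduled: bind it
            api.write("Pod", name, {**pod, "nodeName": node})

    def kubelet(event, name, pod):
        if pod["nodeName"] == node and pod["phase"] == "Pending":
            api.write("Pod", name, {**pod, "phase": "Running"})

    api.watch("Deployment", deployment_controller)
    api.watch("ReplicaSet", replicaset_controller)
    api.watch("Pod", scheduler)
    api.watch("Pod", kubelet)


if __name__ == "__main__":
    api = ApiServer()
    wire_controllers(api)
    api.write("Deployment", "my-app", {"replicas": 3})
    for entry in api.log:
        print(entry)
```

Note that no controller calls another directly: the only coupling is through writes to the shared store, which is the property the golden rule in § 03 describes.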

§ 05 — POD LIFECYCLE: States, probes, and init containers

A pod transitions through a well-defined set of states, and the probes that kubelet runs determine when a container is considered healthy, when it is restarted, and when the pod is removed from Service endpoints.

Init containers and sidecars

Container type	Runs when	Purpose	Failure behavior
Init container	Before any regular container, sequentially	Database migration, config pre-fetch, wait-for-dependency. Completes then exits.	Pod stays Pending until every init container succeeds; a failed one is retried according to the pod's restartPolicy
Sidecar container	Alongside main container (native in K8s 1.29+ as an init container with restartPolicy: Always; otherwise just a regular container)	Log forwarding (Fluentd), proxy (Envoy/Istio), metrics exporter, secret sync	Restarts independently. A regular-container sidecar keeps running after the main container exits; a native sidecar is shut down once the main containers finish
Ephemeral container	On demand, for live debugging	`kubectl debug` injects a debug container into a running pod without restarting it	Debugging only: added through the pod's ephemeralcontainers subresource, never restarted

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      initContainers:
        - name: wait-for-db
          image: busybox:1.36
          command: ['sh', '-c', 'until nc -z postgres:5432; do sleep 2; done']

      containers:
        - name: payments-api
          image: company/payments-api:v2.1.4
          ports:
            - containerPort: 8080
          resources:
            requests:               # used for scheduling bin-packing
              cpu: "500m"
              memory: "512Mi"
            limits:                 # enforced by cgroups: CPU is throttled, memory overage is OOM-killed
              cpu: "2"
              memory: "2Gi"
          livenessProbe:            # fail → restart container
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:           # fail → removed from Service endpoints
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          startupProbe:             # runs first: liveness and readiness are held off until /startup succeeds
            httpGet:
              path: /startup
              port: 8080
            failureThreshold: 30   # 30 × 10s = 5 minute grace window for slow start
            periodSeconds: 10

§ 06 — NETWORKING: Flat pods, Services, Ingress, DNS

Kubernetes networking rests on one core assumption: every pod has a unique IP, and every pod can reach every other pod directly, without NAT. Everything else — Services, Ingress, NetworkPolicy — is built on top of that flat network.

The four networking layers

Layer	What it provides	How it works
Pod networking (CNI)	Pod-to-pod IP reachability across nodes. Every pod gets a real IP from the cluster's pod CIDR (e.g., 10.244.0.0/16).	CNI plugin (Calico, Cilium, Flannel) assigns IPs, sets up routes or VXLAN overlays, programs the kernel. The container runtime invokes the CNI binary when a pod sandbox is created or deleted.
Service (ClusterIP)	Stable virtual IP + DNS name for a set of pods. `http://payments-api:8080` always resolves to the same ClusterIP, even as pods cycle.	kube-proxy watches EndpointSlices and programs iptables DNAT rules: ClusterIP → one of the healthy pod IPs (chosen at random in iptables mode). IPVS mode scales better and defaults to round-robin. An eBPF data plane (Cilium) is a third option.
NodePort / LoadBalancer	Expose a Service outside the cluster. NodePort binds a port on every node. LoadBalancer provisions a cloud LB in front.	NodePort: kube-proxy opens the port on all nodes. LoadBalancer: cloud-controller-manager calls the cloud API (AWS ELB, GCP LB) to create the load balancer and update its targets.
Ingress	Layer-7 HTTP routing: route `/api/` to one Service, `/static/` to another, by hostname. TLS termination.	An Ingress controller (nginx-ingress, Traefik, AWS ALB Ingress) watches Ingress objects and reprograms its own routing config. Kubernetes defines the Ingress API; you install the controller.
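
In iptables mode the backend for a ClusterIP is, as far as I know, picked by a chain of probabilistic rules rather than a strict rotation: with n endpoints the first rule matches with probability 1/n, the next with 1/(n-1), and the last always matches. The sketch below checks that this cascade gives every endpoint an equal share; the function names are invented for illustration.

```python
from fractions import Fraction


def rule_probabilities(n: int) -> list[Fraction]:
    """Per-rule match probability for n endpoints: 1/n, 1/(n-1), ..., 1/1."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_share(n: int) -> list[Fraction]:
    """Overall chance that each endpoint is chosen when rules are tried in order."""
    shares, not_yet_matched = [], Fraction(1)
    for p in rule_probabilities(n):
        shares.append(not_yet_matched * p)   # reached this rule AND it matched
        not_yet_matched *= 1 - p             # fell through to the next rule
    return shares


if __name__ == "__main__":
    print(rule_probabilities(3))
    print(effective_share(3))
```

The rules are evaluated linearly per connection, which is one reason the rule count becomes a cost with very large numbers of Services and endpoints.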

DNS: CoreDNS

Every pod gets /etc/resolv.conf pointing at CoreDNS (a cluster-internal DNS server). CoreDNS resolves:

payments-api → ClusterIP of the payments-api Service in the same namespace
payments-api.default.svc.cluster.local → ClusterIP (fully-qualified)
Headless Services (clusterIP: None) → A records for each pod IP directly (used for StatefulSets, Kafka brokers, etc.)

The readiness probe is the key to zero-downtime deployments. Until a new pod passes its readiness probe, it is not listed as ready in the Service's EndpointSlice, so traffic never reaches it. Only once it passes readiness does traffic flow. Combined with a rolling update strategy of maxUnavailable=0, every request is served by a healthy pod throughout the deployment.
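
Both ideas in this section, the DNS name and the readiness gate, reduce to small pure functions. This is a sketch under stated assumptions: the pod dictionaries are a made-up shape, and `cluster.local` is the common default cluster domain rather than a fixed value.

```python
def service_fqdn(name: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Fully-qualified DNS name that resolves to the Service's ClusterIP."""
    return f"{name}.{namespace}.svc.{cluster_domain}"


def ready_endpoints(pods: list[dict], selector: dict) -> list[str]:
    """IPs of pods that match the Service selector AND currently report Ready."""
    return [
        p["ip"]
        for p in pods
        if p["ready"] and all(p["labels"].get(k) == v for k, v in selector.items())
    ]


if __name__ == "__main__":
    pods = [
        {"ip": "10.244.1.5", "ready": True, "labels": {"app": "payments-api"}},
        {"ip": "10.244.2.9", "ready": False, "labels": {"app": "payments-api"}},  # still starting
    ]
    print(service_fqdn("payments-api", "default"))
    print(ready_endpoints(pods, {"app": "payments-api"}))
```

During a rollout the new pod sits in the second state until its readiness probe passes, so the list of routable IPs never contains a pod that cannot serve.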

§ 07 — SCHEMA DESIGN: etcd as a distributed database

Kubernetes stores every object in etcd as a key-value pair. Understanding the key schema and the watch mechanism is understanding why Kubernetes is architecturally elegant — and where its scalability limits lie.

The etcd key schema

# etcd key format: /registry/{resource-type}/{namespace}/{name}
# (a few core types keep legacy prefixes, as noted below)

/registry/pods/default/payments-api-6d4f9c7b2-xk9p2
/registry/pods/kube-system/coredns-74ff55c5b-4rz8k
/registry/deployments/default/payments-api
/registry/replicasets/default/payments-api-6d4f9c7b2
/registry/services/specs/default/payments-api     # Services live under services/specs
/registry/services/endpoints/default/payments-api # being replaced by EndpointSlices
/registry/endpointslices/default/payments-api-xz9k
/registry/configmaps/default/payments-api-config
/registry/secrets/default/payments-api-tls
/registry/minions/node-42                         # Nodes: cluster-scoped, legacy "minions" prefix
/registry/namespaces/production                   # cluster-scoped

# Value: serialized protobuf for built-in types (not JSON, despite the API accepting JSON)
# Each write gets a new etcd revision, surfaced as the object's metadata.resourceVersion

The watch mechanism — Kubernetes as a distributed pub/sub

This is the most important architectural insight in Kubernetes. The entire system is an event-driven, watch-driven pub/sub built on top of a key-value store.

# Kubernetes watch API (simplified)
# Every component establishes a long-lived HTTP watch (a streaming GET) to the API server

# The scheduler watches for unscheduled pods:
GET /api/v1/pods?fieldSelector=spec.nodeName=&watch=true&resourceVersion=12345
# → receives a stream of ADDED/MODIFIED/DELETED events

# The Deployment controller watches Deployments:
GET /apis/apps/v1/deployments?watch=true&resourceVersion=12345

# Each kubelet watches pods assigned to its node:
GET /api/v1/pods?fieldSelector=spec.nodeName=node-42&watch=true

# The critical insight: the API server maintains ONE etcd watch per resource type
# and fans out watch events from its in-memory watch cache to the potentially THOUSANDS of watchers.
# This "reflector + informer + workqueue" pattern appears in every K8s component:

# Informer lifecycle:
# 1. LIST all objects at startup (paginated, 500 at a time)
# 2. Sync the local in-memory cache
# 3. WATCH from the last seen resourceVersion
# 4. On reconnect, resume from last resourceVersion (no full re-list if etcd has the history)
# 5. If resourceVersion is too old (etcd compacted), fall back to full re-LIST

# resourceVersion is derived from the etcd revision, a cluster-wide monotonic integer.
# It enables optimistic concurrency:
# GET pod → resourceVersion=5000
# PUT pod (modify) with resourceVersion=5000
# → API server checks: is the stored object still at resourceVersion 5000? If not (someone else wrote it), return 409 Conflict
# This is etcd's compare-and-swap (a transaction on the key's mod revision) used to prevent lost updates.
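
The optimistic-concurrency check can be sketched with a tiny in-memory store. This is an illustration, not etcd's API: `Store`, `put`, and `Conflict` are hypothetical names, and the store keeps one global revision counter plus the revision of the last write to each key, which is the pair of values the comparison relies on.

```python
class Conflict(Exception):
    """Raised when the caller's revision is stale (HTTP 409 in the real API)."""


class Store:
    def __init__(self):
        self.revision = 0      # store-wide counter, bumped on every write
        self.data = {}         # key -> (value, revision of last write to that key)

    def get(self, key):
        return self.data[key]

    def put(self, key, value, expected_rev=None):
        """Write value; if expected_rev is given, succeed only if the key is unchanged since then."""
        current_rev = self.data[key][1] if key in self.data else 0
        if expected_rev is not None and expected_rev != current_rev:
            raise Conflict(f"{key}: expected rev {expected_rev}, found {current_rev}")
        self.revision += 1
        self.data[key] = (value, self.revision)
        return self.revision


if __name__ == "__main__":
    store = Store()
    store.put("pods/a", {"replicas": 1})
    _, rev = store.get("pods/a")
    store.put("pods/a", {"replicas": 2}, expected_rev=rev)      # succeeds
    try:
        store.put("pods/a", {"replicas": 3}, expected_rev=rev)  # stale revision
    except Conflict as exc:
        print("conflict:", exc)
```

The comparison is per key: a write to a different object advances the global counter but does not invalidate this object's revision, so unrelated writers never conflict with each other.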

What if you replaced etcd with PostgreSQL?

This is a real interview question. etcd is Kubernetes' weakest scalability point — it tops out around 8GB of data and ~3,000 writes/sec. The question forces you to articulate what etcd actually provides.

Feature	etcd provides	PostgreSQL equivalent
Consistent reads	Linearizable reads via Raft — always reads the latest committed value	`SET TRANSACTION ISOLATION LEVEL SERIALIZABLE` or `SELECT FOR UPDATE`
Watch / Change Notification	gRPC streaming watch: efficient long-lived watch on key prefixes	`LISTEN/NOTIFY` + logical replication slots; much more complex to fan out
Optimistic concurrency	resourceVersion CAS on write: `txn(compare(rev,5000)).put(k,v)`	`WHERE xmin = $1` optimistic locking or `SELECT FOR UPDATE`
Lease / TTL keys	etcd leases: key expires if lessor doesn't renew (heartbeat)	Scheduled job to expire rows; no native push notification on expiry
Cluster-wide monotonic revision	Every write increments a global revision counter	PostgreSQL `txid_current()` / sequences; monotonic per-table not cluster-wide

The K3s project built kine, a shim that lets Kubernetes use MySQL, PostgreSQL, or SQLite as the backing store instead of etcd. K3s uses SQLite for single-node installs via kine. The watch API is the hardest part to replicate: kine implements it via polling + NOTIFY.

etcd's watch API is what makes the reconciliation loop pattern efficient. Without long-lived watches, controllers would have to poll the API server every second — generating 100× the traffic. The watch stream means a controller receives exactly the events it cares about, the instant they happen, with no polling overhead. This is the core reason Kubernetes chose etcd over a relational database.

§ 08 — SCHEDULING: Placing pods at scale

The scheduler's job is to assign each pending pod to a node that can run it. At 10,000 nodes, evaluating every node for every pod would be too slow. The solution is a two-phase pipeline with sampling.

The filter → score → bind pipeline

For each unscheduled pod:

PHASE 1: FILTER (predicates — binary pass/fail)
  Remove nodes that CANNOT run this pod:
  ├── PodFitsResources        cpu+memory requests fit within allocatable capacity
  ├── PodFitsHostPorts        no port conflict on the node
  ├── MatchNodeSelector       node labels match pod's nodeSelector / nodeAffinity
  ├── NoDiskConflict          required volumes can be attached
  ├── NoVolumeZoneConflict    PV zone matches node zone
  ├── MaxEBSVolumeCount       AWS: max 39 EBS volumes per node
  ├── MatchInterPodAffinity   pod can coexist with existing pods on this node
  ├── PodToleratesNodeTaints  pod tolerates all NoSchedule taints on the node
  └── (... 20+ predicates total)

  Result: feasibleNodes (could be 0 → Pending, or 1–N)

PHASE 2: SCORE (priorities — 0–100 per node)
  Rank feasible nodes to pick the best:
  ├── LeastRequestedPriority      (100 - (cpu% + mem%) / 2) — prefer emptier nodes
  ├── BalancedResourceAllocation  penalize imbalanced cpu:memory ratio
  ├── NodeAffinityPriority        weight preferred node affinity rules
  ├── InterPodAffinityPriority    prefer/avoid nodes with specific pods
  ├── ImageLocalityPriority       bonus for nodes that already have the image cached
  └── (custom plugins via Scheduling Framework)

  Result: scored nodes sorted descending

PHASE 3: BIND
  Selected node → create Binding object via API server
  API server: SET spec.nodeName = "node-42" in etcd
  This is a compare-and-swap — if two schedulers race, one wins, one gets 409 Conflict

──────────────────────────────────────────────────────────────────────
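10,000-NODE PERFORMANCE TRICK: SAMPLING

  Instead of filtering and scoring ALL 10,000 nodes:
  percentageOfNodesToScore: 0    (default; 0 means "adaptive")
  → adaptive default: about 50% for a 100-node cluster, shrinking linearly to a 5% floor in very large clusters
  → the scheduler stops filtering once it has found that many feasible nodes (never fewer than 100)
  → at 10,000 nodes: scoring runs on ~500 nodes, not 10,000
  → worst case: slightly suboptimal placement
  → benefit: scheduling latency stays in the low milliseconds per pod

  Within one scheduling cycle, nodes are filtered and scored by parallel goroutines (parallelism defaults to 16).
  Pods are taken from the queue one at a time; binding runs asynchronously.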
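
The pipeline above can be condensed into a short sketch: one filter, one score, and an early-exit cut-off for large clusters. Treat it as an illustration only. The `Node` type and function names are hypothetical, a single resource filter stands in for the full predicate list, and the cut-off constants (50% base, one point per 125 nodes, 5% floor, minimum of 100 nodes) are my understanding of the upstream adaptive default rather than something this article states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    cpu_alloc: int      # allocatable millicores
    mem_alloc: int      # allocatable MiB
    cpu_req: int = 0    # already requested by pods on the node
    mem_req: int = 0


def fits(node: Node, cpu: int, mem: int) -> bool:
    """Filter: do the pod's requests fit in what is left on the node?"""
    return node.cpu_req + cpu <= node.cpu_alloc and node.mem_req + mem <= node.mem_alloc


def least_requested_score(node: Node, cpu: int, mem: int) -> float:
    """Score 0-100: average free fraction of CPU and memory after placing the pod."""
    cpu_free = (node.cpu_alloc - node.cpu_req - cpu) / node.cpu_alloc
    mem_free = (node.mem_alloc - node.mem_req - mem) / node.mem_alloc
    return 100 * (cpu_free + mem_free) / 2


def nodes_to_find(num_nodes: int, min_nodes: int = 100) -> int:
    """Adaptive cut-off (assumed constants): 50% minus 1 point per 125 nodes, floor 5%."""
    if num_nodes <= min_nodes:
        return num_nodes
    pct = max(50 - num_nodes // 125, 5)
    return max(num_nodes * pct // 100, min_nodes)


def schedule(nodes: list[Node], cpu: int, mem: int) -> Optional[str]:
    limit = nodes_to_find(len(nodes))
    feasible = []
    for node in nodes:
        if fits(node, cpu, mem):
            feasible.append(node)
            if len(feasible) >= limit:   # stop filtering early in big clusters
                break
    if not feasible:
        return None                      # pod stays Pending
    return max(feasible, key=lambda n: least_requested_score(n, cpu, mem)).name
```

The early exit is the whole trick: the cost per pod is bounded by the cut-off rather than by cluster size, at the price of sometimes missing the globally best node.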

Advanced scheduling constraints

# Node affinity — require or prefer nodes by label
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: [us-east-1a, us-east-1b]
    preferredDuringSchedulingIgnoredDuringExecution:  # soft preference
      - weight: 80
        preference:
          matchExpressions:
            - key: node-type
              operator: In
              values: [high-memory]

# Pod anti-affinity: spread replicas across nodes (prevent all replicas landing on one node)
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: payments-api
        topologyKey: kubernetes.io/hostname   # no two replicas on same node

# Topology spread constraints (K8s 1.19+) — more flexible than anti-affinity
topologySpreadConstraints:
  - maxSkew: 1                               # max 1 replica difference between zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: payments-api

# Taints and tolerations — reserve nodes for specific workloads
# Node: kubectl taint nodes gpu-node-01 dedicated=gpu:NoSchedule
# Pod tolerates the taint:
tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
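
The `maxSkew` rule in the topology spread example is easy to misread, so here is the arithmetic as a sketch. It assumes the usual definition (pods in the candidate zone after placement, minus the current minimum across zones) and ignores node eligibility and `minDomains`; the function names are made up.

```python
def skew_after_placement(pods_per_zone: dict[str, int], zone: str) -> int:
    """Skew = pods in the candidate zone after placement minus the minimum across zones."""
    after = pods_per_zone[zone] + 1
    return after - min(pods_per_zone.values())


def allowed_zones(pods_per_zone: dict[str, int], max_skew: int = 1) -> list[str]:
    """Zones where one more pod keeps skew within max_skew (DoNotSchedule semantics)."""
    return sorted(
        z for z in pods_per_zone if skew_after_placement(pods_per_zone, z) <= max_skew
    )


if __name__ == "__main__":
    print(allowed_zones({"us-east-1a": 2, "us-east-1b": 1, "us-east-1c": 1}))
```

With `whenUnsatisfiable: DoNotSchedule`, a pod whose every candidate zone exceeds the skew stays Pending, which is the trade-off against the softer `ScheduleAnyway` setting.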

§ 09 — AUTOSCALING: HPA, VPA, and Cluster Autoscaler

Kubernetes has three layers of autoscaling, each operating at a different timescale and on different resources.

Scaler	What it scales	Trigger	Latency
HPA (Horizontal Pod Autoscaler)	Number of replicas in a Deployment/StatefulSet	CPU utilization, memory, or custom metrics (KEDA for queue depth, RPS, etc.)	30–60 seconds (scrape interval 15s + decision cooldown)
VPA (Vertical Pod Autoscaler)	CPU and memory requests/limits of existing pods	Historical utilization; recommends right-sizing	Minutes to hours; usually requires pod restart
Cluster Autoscaler	Number of nodes in the cluster	Pods stuck in Pending (no room) → scale up; underutilized nodes → scale down	Scale-up: 1–3 minutes (node provisioning). Scale-down: 10+ minutes (safety margin)
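
The HPA's core calculation is a single ratio. The sketch below follows the documented formula, desired = ceil(current × metric / target), with a 10% tolerance band; the function name, the replica bounds, and the omission of stabilization windows and missing-metric handling are simplifications of my own.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_r: int = 1, max_r: int = 10, tolerance: float = 0.1) -> int:
    """desired = ceil(current * metric / target), skipped when the ratio is within tolerance."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current                      # close enough: avoid flapping
    return max(min_r, min(max_r, math.ceil(current * ratio)))


if __name__ == "__main__":
    # 3 replicas at 90% CPU against a 60% target
    print(desired_replicas(current=3, metric=90, target=60))
```

Because the result is proportional to the current replica count, the same overload ratio adds more pods to a large Deployment than to a small one.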

KEDA — extending HPA with external metrics

# KEDA (Kubernetes Event-Driven Autoscaling) scales on external signals:
# queue depth, Kafka lag, HTTP RPS, cron schedule, custom Prometheus metrics

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payments-worker
spec:
  scaleTargetRef:
    name: payments-worker
  minReplicaCount: 0             # scale to zero when queue is empty (save $$$)
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/payments-queue
        queueLength: "10"        # target: 10 messages per replica
        awsRegion: us-east-1
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        query: sum(payments_queue_depth)   # PromQL the scaler evaluates; the metric name is illustrative
        threshold: "100"         # scale up if >100 unprocessed payments per replica

Cluster Autoscaler — adding nodes

When pods are stuck in Pending because no node has enough capacity, the Cluster Autoscaler asks the cloud provider to provision a new node. When a node has been underutilized for 10+ minutes and all of its pods could be rescheduled to other nodes, it cordons and drains the node, then deletes it.

# Key Cluster Autoscaler behaviors:

# Scale-up trigger:
# pod is Pending AND can be scheduled if one more node of type X exists
# → call cloud provider API: "add 1 node to node group X"
# → node joins cluster in ~90 seconds
# → scheduler places the pending pod

# Scale-down trigger:
# sum of pod requests on the node < 50% of its allocatable capacity for 10 minutes
# AND all pods can be rescheduled elsewhere
# → cordon node (no new pods scheduled)
# → gracefully evict pods (respecting PodDisruptionBudgets)
# → cloud provider: "delete this node"

# PodDisruptionBudget — protect critical workloads from eviction:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
spec:
  minAvailable: 2              # always keep at least 2 pods running during disruptions
  selector:
    matchLabels:
      app: payments-api
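
The scale-down rule and the PodDisruptionBudget check can both be written as small predicates. This is a sketch under assumptions: utilization is taken as the larger of the CPU and memory request fractions (my understanding of the Cluster Autoscaler's behavior), the 50% and 10-minute values come from the comments above, and all function names are hypothetical.

```python
def node_utilization(requested: dict[str, int], allocatable: dict[str, int]) -> float:
    """Utilization as the larger of the CPU and memory request fractions."""
    return max(requested[r] / allocatable[r] for r in ("cpu", "memory"))


def scale_down_candidate(requested: dict[str, int], allocatable: dict[str, int],
                         unneeded_minutes: float, threshold: float = 0.5,
                         unneeded_time: float = 10) -> bool:
    """True when the node is below the threshold and has been so for long enough."""
    below = node_utilization(requested, allocatable) < threshold
    return below and unneeded_minutes >= unneeded_time


def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """How many pods a PodDisruptionBudget with minAvailable lets a drain evict right now."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    alloc = {"cpu": 4000, "memory": 8000}
    print(scale_down_candidate({"cpu": 1000, "memory": 2000}, alloc, unneeded_minutes=12))
    print(allowed_disruptions(healthy_pods=3, min_available=2))
```

With three healthy replicas and `minAvailable: 2`, a drain can evict one pod at a time and must wait for its replacement to become ready before evicting the next.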

§ 10 — RBAC & SECURITY: Who can do what to which resources

Kubernetes RBAC is a four-part system: Subjects (who), Verbs (what action), Resources (on what), and Scope (namespace or cluster). Every API request goes through RBAC before reaching etcd.

# ServiceAccount — the identity a pod runs as
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api
  namespace: payments
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/payments-api  # IRSA for AWS

---
# Role — namespaced permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-api-role
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["db-credentials", "stripe-key"]  # specific secrets only
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]

---
# RoleBinding — bind the Role to the ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-api-binding
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: payments-api
    namespace: payments
roleRef:
  kind: Role
  name: payments-api-role
  apiGroup: rbac.authorization.k8s.io

---
# NetworkPolicy — allow only specific pod-to-pod traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payments-api-netpol
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments-api
  policyTypes:
    - Ingress
    - Egress
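  ingress:
    - from:
        # A bare podSelector matches only pods in this namespace (payments);
        # add a namespaceSelector to this item if the controller runs elsewhere.
        - podSelector:
            matchLabels:
              app: ingress-nginx       # allow from ingress only
      ports:
        - port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres            # allow to postgres only
      ports:
        - port: 5432
    # Listing Egress in policyTypes denies all other egress, DNS included,
    # so allow lookups to cluster DNS (assumed to run in kube-system).
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP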

Security layers in depth

Layer	Mechanism	What it prevents
Authentication	X.509 client certs, bearer tokens, OIDC, service account tokens (JWT)	Unauthenticated API access
RBAC authorization	Roles + RoleBindings; ClusterRoles for cluster-scoped resources	Privilege escalation; a compromised pod can't read all secrets
Admission Control	ValidatingWebhookConfiguration, MutatingWebhookConfiguration, PodSecurity admission	Policy violations before they reach etcd; auto-inject sidecars
PodSecurity Standards	Baseline / Restricted / Privileged per namespace label	Containers running as root; host network access; privilege escalation
NetworkPolicy	L3/L4 allow/deny rules enforced by the CNI plugin	Lateral movement: compromised pod can't reach all other pods
Secrets encryption at rest	EncryptionConfiguration in kube-apiserver: AES-GCM or KMS provider (AWS KMS, GCP KMS)	Secrets readable if etcd backup is leaked
Audit logging	Every API request logged to file or webhook sink (SIEM)	Post-incident forensics; "what was running before the breach?"
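
Two of the rows above map to short configuration fragments. The sketch below assumes the payments namespace from the earlier manifests. The key name and key value are placeholders, and the EncryptionConfiguration is a file read by kube-apiserver (via its --encryption-provider-config flag), not an object you kubectl apply.

```yaml
# Pod Security admission: enforce the Restricted standard for one namespace
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject pods that violate the standard
    pod-security.kubernetes.io/warn: restricted      # also warn the client on violations
    pod-security.kubernetes.io/audit: restricted     # also annotate audit log entries
---
# Secrets encryption at rest: a config file for kube-apiserver
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aesgcm:                                        # first provider is used for new writes
          keys:
            - name: key1                               # illustrative key name
              secret: REPLACE_WITH_BASE64_32_BYTE_KEY  # placeholder, not a real key
      - identity: {}                                   # lets existing plaintext data still be read
```

Secrets written before the file was introduced stay unencrypted until they are rewritten, so a rollout of this kind usually ends with a pass that re-saves every Secret.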

§ 11 — Q&AFifteen tough interview questions

These questions separate the senior answer from the staff answer — etcd internals, split-brain, rolling strategies, StatefulSet vs Deployment, CRDs, and multi-tenancy.

Q 01

One node dies in a 3-node etcd cluster. Does etcd lose quorum, and what happens to the Kubernetes cluster?

With 2 of 3 members remaining, the Raft cluster still has quorum and keeps operating normally: etcd needs floor(n/2)+1 members, which is 2 when n = 3. If a second member dies (only 1 of 3 remaining), etcd loses quorum and rejects all writes, typically with etcdserver: request timed out. The API server may still answer some reads from its in-memory watch cache, but it cannot persist new state, so running pods keep running while no Deployment can be created or updated.

Q 02

Explain rolling update strategy. What are maxUnavailable and maxSurge?

maxUnavailable: how many pods may be unavailable during the update (default 25%, rounded down). maxSurge: how many extra pods above the desired count may exist during the update (default 25%, rounded up). Setting maxUnavailable: 0, maxSurge: 1 means: create one new pod, wait for it to be ready, then terminate one old pod. This keeps full serving capacity throughout, provided the readiness probe is accurate, but it needs spare capacity for the surge pod. maxUnavailable: 1, maxSurge: 0 never exceeds the replica count but temporarily reduces capacity.
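
As a minimal sketch of the zero-unavailable variant, the Deployment below sets both fields explicitly. The image name, port, and /healthz path are assumptions for illustration; the strategy block is the part that matters.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: payments
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # never drop below 4 ready pods
      maxSurge: 1              # allow a 5th pod while the rollout is in progress
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: api
          image: registry.example.com/payments-api:2.0.0   # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:          # a new pod must pass this before an old pod is removed
            httpGet:
              path: /healthz       # assumed health endpoint
              port: 8080
            periodSeconds: 5
```

Without a readiness probe the new pod counts as ready as soon as its container starts, so the surge setting alone does not protect traffic.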

Q 03

kubectl apply vs kubectl create — what's the difference?

kubectl create is imperative: create the object, fail if it already exists. kubectl apply is declarative: create the object if it is missing, otherwise patch it. By default it computes a client-side three-way merge between your manifest, the last-applied-configuration annotation, and the live state; with --server-side the API server performs the merge and tracks field ownership in managedFields instead of the annotation. Use apply in CI/CD pipelines; create is for one-off resource creation.

Q 04

What is a headless Service? When do you use one?

A headless Service (clusterIP: None) does not get a virtual IP. Instead, DNS returns A records for each pod IP directly. Used for: StatefulSets (each pod needs a stable DNS name: kafka-0.kafka-headless.default.svc.cluster.local), client-side load balancing (gRPC streams, Cassandra), and any case where you need to address individual pods rather than a random healthy pod.
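
A minimal sketch of the headless Service behind the kafka-0.kafka-headless.default.svc.cluster.local name used above. The app: kafka label and port 9092 are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kafka-headless
  namespace: default
spec:
  clusterIP: None            # headless: no virtual IP, DNS returns the pod IPs
  selector:
    app: kafka               # assumed pod label
  ports:
    - name: broker
      port: 9092             # assumed broker port
```

The per-pod names (kafka-0, kafka-1, ...) appear in DNS only when a StatefulSet references this Service through its serviceName field.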

Q 05

StatefulSet vs Deployment — when do you use each?

Deployment: stateless pods where any replica can serve any request — web servers, API services, batch workers. StatefulSet: pods with stable identity (persistent hostname, persistent volume, ordered startup/shutdown) — Kafka, ZooKeeper, MySQL replica sets, Elasticsearch. The key difference: StatefulSet pods have predictable names (kafka-0, kafka-1), and each gets its own PersistentVolumeClaim that follows the pod across reschedules.
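
The stable identity and per-pod volume come from two StatefulSet fields, serviceName and volumeClaimTemplates. The sketch below assumes a headless Service named kafka-headless, a placeholder image, and an illustrative mount path and volume size.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: default
spec:
  serviceName: kafka-headless      # headless Service that provides per-pod DNS names
  replicas: 3                      # creates kafka-0, kafka-1, kafka-2 in order
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: registry.example.com/kafka:placeholder   # hypothetical image
          ports:
            - containerPort: 9092
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka                   # assumed data directory
  volumeClaimTemplates:            # one PVC per pod: data-kafka-0, data-kafka-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi          # illustrative size
```

If kafka-1 is rescheduled to another node, it comes back with the same name and re-attaches the claim data-kafka-1.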

Q 06

Resource requests vs resource limits — what's the difference and why does it matter?

Requests are what the scheduler uses for bin-packing: a node is feasible only if its allocatable capacity, minus the requests of pods already placed there, covers the new pod's requests. Limits are enforced at runtime by cgroups: a container exceeding its CPU limit is throttled; one exceeding its memory limit is OOM-killed. Setting requests == limits for every container (Guaranteed QoS class) makes the pod the last candidate for eviction under node pressure, although a container that exceeds its own memory limit is still OOM-killed. Setting neither requests nor limits (BestEffort QoS class) makes the pod the first to be evicted when the node runs low on memory; setting requests below limits yields Burstable.
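
The QoS class is derived from the resources block, not set directly. A sketch with two hypothetical pods follows; the image name and sizes are placeholders.

```yaml
# Guaranteed: every container has requests equal to limits for both CPU and memory
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # hypothetical image
      resources:
        requests:
          cpu: 500m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 256Mi
---
# Burstable: requests are set but lower than limits
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # hypothetical image
      resources:
        requests:
          cpu: 100m          # the scheduler reserves only this much
          memory: 128Mi
        limits:
          cpu: "1"           # throttled above one core
          memory: 512Mi      # OOM-killed above this
```

The assigned class is visible after creation in the pod's status.qosClass field.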

Q 07

What is node pressure eviction? How does it differ from OOM kill?

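Node pressure eviction is the kubelet proactively terminating pods when the node crosses a resource threshold (memory, disk, PID count); soft thresholds honour a grace period, hard thresholds terminate immediately. The kubelet ranks candidates by whether usage exceeds requests, then by pod priority, then by how far usage exceeds requests, so in practice BestEffort pods go first, then Burstable pods running over their requests, and Guaranteed pods last. OOM kill is the Linux kernel forcibly killing a process with no graceful shutdown, either because the container hit its cgroup memory limit or because the node itself ran out of memory; the container is reported as OOMKilled. Eviction is the gentler, earlier mechanism; OOM kill is the last resort.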

Q 08

What is a CRD (Custom Resource Definition) and what is an Operator?

A CRD extends the Kubernetes API with your own resource types: you can define a KafkaCluster or PostgresDatabase custom resource and manage it via kubectl. An Operator is a controller that watches these custom resources and implements the runbook for managing the underlying software: provisioning, scaling, backup, failover. cert-manager (Certificate, ClusterIssuer), KEDA (ScaledObject), and Istio (VirtualService, AuthorizationPolicy) all follow this CRD-plus-controller pattern.
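
To make the KafkaCluster example concrete, here is a minimal sketch of a CRD and one instance of it. The example.com group, the v1alpha1 version, and the brokers field are all hypothetical; a real operator would define its own schema.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: kafkaclusters.example.com      # must be <plural>.<group>
spec:
  group: example.com                   # hypothetical API group
  scope: Namespaced
  names:
    plural: kafkaclusters
    singular: kafkacluster
    kind: KafkaCluster
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                brokers:               # hypothetical field
                  type: integer
                  minimum: 1
---
# An instance of the new type; the API server validates it against the schema above
apiVersion: example.com/v1alpha1
kind: KafkaCluster
metadata:
  name: orders
  namespace: default
spec:
  brokers: 3
```

On its own the CRD only stores and validates objects. Nothing happens to a KafkaCluster until a controller watching that type acts on it.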

Q 09

How do you handle multi-tenancy in Kubernetes — giving different teams isolated environments on the same cluster?

Soft multi-tenancy (namespaces + RBAC + NetworkPolicy + ResourceQuota): each team gets a namespace, can only touch its own resources, has a resource budget, and, once a restrictive NetworkPolicy is in place, its pods cannot be reached from other namespaces. Hard multi-tenancy (separate clusters): for PCI-DSS, HIPAA, or true blast-radius isolation, run a cluster per tenant managed from a shared management cluster (Cluster API), or completely separate clusters with federated tooling. Namespaces alone are not a strong security boundary for untrusted tenants, because tenants still share nodes, the kernel, and the control plane.
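
A sketch of the two per-namespace objects that do most of the work in the soft model. The team-a namespace and all quota numbers are hypothetical.

```yaml
# Resource budget for the tenant namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a            # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "100"
---
# Accept ingress only from pods in the same namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: team-a
spec:
  podSelector: {}              # applies to every pod in team-a
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}      # any pod in team-a; other namespaces are denied
```

The NetworkPolicy has an effect only when the cluster's CNI plugin enforces policies. Once the quota covers CPU and memory, pods that omit requests or limits are rejected unless a LimitRange supplies defaults.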

Q 10

What is a PodDisruptionBudget and when is it critical?

A PDB constrains how many pods matched by its selector can be unavailable at once due to voluntary disruptions that go through the Eviction API (kubectl drain, Cluster Autoscaler scale-down, node upgrades). A Deployment's own rolling update is governed by maxUnavailable and maxSurge, not by the PDB. minAvailable: 2 means at least 2 pods must stay available; the drain or autoscaler waits if evicting a pod would violate this. Critical for: stateful services, any service with a low replication factor, and services that need warmup time (for example, to refill database connection pools). Without PDBs, a drain can evict every replica that sits on the drained nodes at the same time.

Q 11

How does the Kubernetes watch API work at the protocol level? What is the informer pattern?

Watch is a long-lived HTTP streaming response (HTTP/1.1 chunked or HTTP/2; gRPC is used only between the API server and etcd): GET /api/v1/pods?watch=true returns a stream of JSON events (ADDED, MODIFIED, DELETED, plus optional BOOKMARK) until the connection closes or times out. The informer pattern (in client-go) wraps this: on startup, LIST all objects and populate a local in-memory cache; then WATCH from the returned resourceVersion and apply events to the cache; if that resourceVersion has expired (410 Gone), re-LIST and resume. Controllers therefore read from a local cache, with no API server round trip per read, and call the API server only to write, which dramatically reduces API server load.

Q 12

A pod is stuck in Pending. Walk me through your diagnosis.

kubectl describe pod → look at the Events section. Common causes: (1) Insufficient cpu/memory: all nodes are full; (2) no nodes matched nodeSelector: labels are wrong; (3) had taint X which pod did not tolerate: missing toleration; (4) pod has unbound immediate PersistentVolumeClaims: no PV available or StorageClass misconfigured; (5) node(s) had volume node affinity conflict: the PV lives in a different zone from the candidate nodes (multi-AZ clusters). Next steps: check kubectl get events --sort-by=.lastTimestamp and kubectl get nodes -o wide.

Q 13

What is the scheduler extender / scheduling framework and when would you use a custom scheduler plugin?

The Kubernetes Scheduling Framework defines extension points (PreFilter, Filter, PostFilter, PreScore, Score, Reserve, Permit, PreBind, Bind, PostBind) where you can inject custom logic as compiled-in plugins. Use cases: GPU topology awareness (place GPU pods on nodes sharing NVLink), gang scheduling (batch ML jobs must all start together), specialized hardware affinity (FPGA, RDMA), or cost-aware scheduling (prefer spot instances). The older scheduler extender (HTTP webhook) still works but adds an out-of-process round trip to scheduling decisions; prefer the in-process plugin framework.
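
Plugins are wired in through the scheduler's configuration file. The sketch below assumes a custom scheduler binary that has a Score plugin compiled in. The plugin name HypotheticalNvlinkScore, the profile name, and the image are invented for illustration.

```yaml
# Passed to the scheduler binary via --config
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 10               # score a sample of feasible nodes on large clusters
profiles:
  - schedulerName: gpu-aware-scheduler     # hypothetical profile name
    plugins:
      score:
        enabled:
          - name: HypotheticalNvlinkScore  # hypothetical plugin; must be compiled into the binary
            weight: 5
---
# A pod opts in to that profile by name
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
spec:
  schedulerName: gpu-aware-scheduler
  containers:
    - name: trainer
      image: registry.example.com/trainer:1.0   # hypothetical image
```

Pods that do not set schedulerName keep using the default profile, so a custom profile can be introduced for one workload class without affecting the rest of the cluster.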

Q 14

How does Kubernetes handle node failure? What is the difference between node NotReady and pod eviction?

When a node stops heartbeating for 40s (default node-monitor-grace-period), the node controller marks it NotReady (or Unknown) and adds a NoExecute taint: node.kubernetes.io/not-ready or node.kubernetes.io/unreachable. Pods are not evicted immediately, because every pod gets a default 300-second toleration for those taints; this taint-based mechanism replaces the older pod-eviction-timeout flag. Once the toleration expires the pods are deleted, the ReplicaSet controller sees the count drop, and it creates replacements that the scheduler places on healthy nodes. In EKS/GKE, the cloud provider often terminates the node before this timeout triggers. PDBs apply only to voluntary evictions, so a hard node failure bypasses them.
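
In current releases the five-minute wait is expressed per pod, as tolerations for the not-ready and unreachable taints, so it can be shortened for workloads that should fail over faster. The sketch below uses a hypothetical pod and image; the 30-second value is an arbitrary example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fast-failover
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0      # hypothetical image
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 30                    # evict 30s after the taint appears (default is 300)
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 30
```

Shorter values trade faster recovery for more churn during brief network partitions, so they suit stateless replicas better than pods with slow startup.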

Q 15

How does a Kubernetes Ingress controller work? What's the difference between an Ingress and a Gateway API?

An Ingress controller is a controller that watches Ingress objects and configures an actual reverse proxy (nginx, Envoy, HAProxy) to route L7 HTTP traffic. The Ingress API is intentionally limited: host and path routing plus TLS termination, with everything else pushed into controller-specific annotations. The Gateway API (Kubernetes SIG-Network, v1.0 GA since October 2023 and installed as CRDs rather than tied to a Kubernetes release) is the successor: richer routing (headers, weights, cross-namespace, TCP/UDP), multi-tenancy (separate Roles for infrastructure vs app teams), and expressiveness comparable to Istio VirtualService. Use Gateway API for new deployments; Ingress for mature stacks with existing nginx-ingress configurations.
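
A sketch of the kind of routing that needs annotations under Ingress but is first-class in the Gateway API: a header match plus a weighted split. The Gateway name, namespaces, hostname, and Service names are all hypothetical, and a Gateway API implementation must be installed for the route to do anything.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: payments-route
  namespace: payments
spec:
  parentRefs:
    - name: shared-gateway           # hypothetical Gateway owned by the infrastructure team
      namespace: infra               # the Gateway must allow routes from this namespace
  hostnames:
    - payments.example.com
  rules:
    - matches:                       # requests carrying the canary header go straight to the canary
        - path:
            type: PathPrefix
            value: /v1
          headers:
            - name: x-canary
              value: "true"
      backendRefs:
        - name: payments-api-canary
          port: 8080
    - matches:                       # everything else under /v1 is split 90/10
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - name: payments-api
          port: 8080
          weight: 90
        - name: payments-api-canary
          port: 8080
          weight: 10
```

The split between Gateway (infrastructure team) and HTTPRoute (application team) is what the multi-tenancy point above refers to: each object can be covered by a different RBAC Role.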

§ 12 — SUMMARYWhat the strong answer looks like

Dimension	Weak answer	Strong answer
kubectl apply → pod	"API server deploys it to a node"	Names every step: validate → etcd → Deployment ctrl → ReplicaSet ctrl → scheduler filter/score/bind → kubelet CRI → CNI network → readiness gates Service endpoint
Control plane	Lists the 5 components	Explains that ALL communication goes through API server; etcd watch as pub/sub; resourceVersion as optimistic concurrency
Scheduler at 10K nodes	"It assigns pods to nodes"	Two-phase filter+score pipeline; sampling (percentageOfNodesToScore); parallel goroutines; Binding as CAS; affinity / anti-affinity trade-offs
etcd schema	Vague key-value store	Key format /registry/{type}/{ns}/{name}; watch API as long-lived gRPC stream; resourceVersion CAS; informer pattern; why PostgreSQL is harder (LISTEN/NOTIFY vs native watch)
Autoscaling	"HPA scales pods"	Differentiates HPA/VPA/Cluster Autoscaler; explains cooldown periods; KEDA for external metrics; scale-to-zero; PDB protection during scale-down
Networking	"Services have a virtual IP"	CNI flat network assumption; kube-proxy iptables DNAT; EndpointSlice controller excludes not-ready pods; headless Services for StatefulSets; Ingress vs Gateway API
RBAC	"Namespace isolation"	ServiceAccount bound to Role via RoleBinding; least-privilege secrets access; NetworkPolicy as L3/L4 microsegmentation; admission webhooks for policy; encryption at rest via KMS provider

Kubernetes is not a deployment tool. It is a distributed database (etcd) with a reconciliation engine (controller manager) built on top of a distributed pub/sub system (the watch API), with a constraint-satisfaction scheduler, a flat IP network, and a distributed process manager (kubelet) on every node. Understanding it at that level — not as a YAML applier but as a distributed system — is what separates the staff-level answer from the senior one.


§ 09 — COMMON MISTAKESWhat trips up experienced engineers

Even engineers who understand Kubernetes conceptually make these operational mistakes. Recognise them before your interviewer does.

❌ "Kubernetes for every project"

K8s has significant operational overhead: control plane costs, upgrade cadence, CNI complexity, RBAC management. A startup with 3 services doesn't need K8s. A managed PaaS — Railway, Render, Fly.io — is faster to ship on, cheaper to run, and operationally simpler until you have real scaling needs. The right tool is the simplest one that meets your requirements.

❌ "Pods are permanent"

Pods are ephemeral. They can be evicted, rescheduled, OOM-killed, or replaced at any time. Never store state in a pod directly — a filesystem write inside a container disappears when the pod dies. Use PersistentVolumes for filesystem state, StatefulSets for stable pod identity, or external managed databases (RDS, DynamoDB) for the simplest path.

❌ "Setting no resource limits"

Without CPU and memory limits, a single noisy pod can consume all resources on a node and starve every other workload, in the worst case triggering a cascade of OOM kills. Always set both requests (for scheduler bin-packing) and limits (for cgroup enforcement). For production workloads, setting requests == limits (Guaranteed QoS class) gives the most predictable scheduling and eviction behaviour; a CPU limit is enforced by throttling, so size it from measured usage.
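
One way to make sure no container ships without requests and limits is a namespace-level LimitRange, which fills in defaults at admission time. This is a sketch only; the payments namespace comes from the earlier manifests and the numbers are illustrative.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources
  namespace: payments
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container declares no requests
        cpu: 100m
        memory: 128Mi
      default:                 # applied when a container declares no limits
        cpu: 500m
        memory: 256Mi
```

Defaults apply only to pods created after the LimitRange exists, and explicit values in a pod spec always take precedence.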

§ 10 — WHY NOT?Kubernetes isn't always the answer

Every system design interview expects trade-offs. Know when Kubernetes is the wrong choice — and be able to say so clearly.

✓ Use K8s When

✓ 10+ microservices needing independent deploys

✓ Traffic varies significantly (autoscaling needed)

✓ Multi-team platform with shared infrastructure

✓ Need sophisticated rolling deployments

✓ Team already has K8s expertise

✓ Multi-region, multi-cluster federation

✗ Skip K8s When

✗ Small team (< 5 engineers)

✗ Simple monolith or 2–3 services

✗ Tight budget (K8s control plane adds cost)

✗ Team lacks K8s expertise

✗ Speed-to-market is the primary constraint

✗ No scaling requirements yet

The strong answer in an interview names the alternatives: Railway, Render, Fly.io for small teams; ECS on Fargate for AWS-native simple services; managed Heroku-style platforms for early startups. Defaulting to K8s without acknowledging its complexity cost is a junior signal.

§ 11 — ONE-MINUTE ANSWERIf you only remember one thing

When an interviewer asks "Why does Kubernetes exist?" you need to be able to answer in 60 seconds — clearly, with the insight, not just the features.

Question: "Why does Kubernetes exist?"

As teams adopted containers, they faced a new problem: container sprawl. Running hundreds of containers manually — deciding which server hosts which container, restarting crashed containers, routing traffic to healthy instances — became operationally unsustainable. Kubernetes solves this with a declarative model: you describe desired state (3 replicas of the API service), and Kubernetes continuously reconciles actual state toward it. The control loop — observe, diff, act — is the core insight. Kubernetes doesn't just run containers; it makes distributed systems self-healing.

The One-Sentence Version
Kubernetes is a reconciliation engine: you declare desired state, and it continuously closes the gap between what you want and what is running — even as machines fail, containers crash, and traffic spikes.

§ 12 — INTERVIEWER'S MINDWhat they're really testing

When an interviewer asks about Kubernetes, they're probing four levels of understanding. Here's what they want to hear at each level.

Declarative vs Imperative

Can you explain why desired-state beats imperative commands at scale? The key insight: at thousands of containers, reconciliation loops are more reliable than runbooks. Imperative commands fail silently; declarative state is continuously verified.

Control Loop Pattern

Can you describe the reconciliation loop? Watch → compare → act. The controller reads both desired state (spec) and observed state (status) through the API server, which is the only component that talks to etcd, diffs the two, and takes the minimal action to close the gap. This pattern appears in every K8s component.

Scheduler Understanding

What factors does the scheduler consider? Resources (CPU/memory requests), node affinity, pod anti-affinity, taints and tolerations, topology spread constraints, volume zone compatibility. At 10K nodes, it scores only a sample of feasible nodes (percentageOfNodesToScore) rather than evaluating every node, which bounds scheduling latency.

Production Readiness

Do you know about PodDisruptionBudgets (protect replicas during node drain), resource limits (prevent noisy-neighbor OOM cascades), liveness vs readiness probes (restart vs traffic exclusion), and HPA cooldown periods (prevent scale thrashing)?

§ 13 — THE EVOLUTIONHow we got here

Kubernetes didn't appear in a vacuum. It is the product of a decade of industry convergence around the right abstraction for running software at scale.

§ 14 — WHAT'S NEXT?The frontier beyond orchestration

Kubernetes solved the orchestration problem. Each solution created the next problem. Here is the chain of constraints as the industry has moved up the abstraction ladder.

Kubernetes solved orchestration → Next problem: developer experience. K8s is powerful but hard. Platform Engineering (Backstage, Port) builds internal developer portals so teams can deploy without writing YAML.

Platform Engineering solved self-service → Next problem: deployment consistency. Manual kubectl apply drifts. GitOps (ArgoCD, Flux) continuously reconciles cluster state from a Git repository — no manual ops.

GitOps solved deployment → Next problem: policy and governance. At scale, who can deploy what? Policy-as-code (OPA Gatekeeper, Kyverno) enforces rules at admission time — before anything reaches etcd.

AI agents will solve toil → Next problem: autonomous operations. AI-driven cluster management (k8sgpt, Robusta) diagnoses incidents, suggests fixes, and eventually applies remediations autonomously. The control loop becomes AI-assisted.

Layer	Problem Solved	Key Technologies	Status
Containers	Environment consistency	Docker, OCI, containerd	Mature
Orchestration	Running containers at scale	Kubernetes, EKS/GKE/AKS	Industry standard
Service Mesh	Observability, mTLS, traffic control	Istio, Linkerd, Cilium	Maturing
GitOps	Deployment consistency	ArgoCD, Flux, Crossplane	Maturing
Platform Engineering	Developer self-service	Backstage, Port, Cortex	Early mainstream
Policy-as-Code	Governance at scale	OPA, Kyverno, Styra	Growing
AI-Native Infra	Autonomous toil elimination	k8sgpt, Robusta, AI agents	Emerging

The trajectory is clear: each layer abstracts the complexity of the layer below. Kubernetes abstracts nodes. Platform Engineering abstracts Kubernetes. GitOps abstracts deployment operations. AI will abstract the rest. The engineer who understands why each layer exists — and what problem it actually solves — will thrive at every level of this stack.

