Design Kubernetes — the container orchestration system that runs most of the world's cloud workloads. What happens when you run kubectl apply -f deployment.yaml? From API server validation through etcd persistence, scheduler placement, kubelet container creation, and health probes — to how the control plane, autoscaler, and service discovery hold together at 10,000 nodes. The schema, the watch API as distributed pub/sub, the scheduling algorithm, and the data architecture behind the pods your code runs inside.
Containers were a revolution — until you had hundreds of them. Before orchestration existed, teams discovered a new category of operational pain that Docker itself couldn't solve.
| Problem (pre-K8s) | Symptom | K8s Solution |
|---|---|---|
| Container sprawl | Hundreds of containers, no coordination, unknown health | ReplicaSets + controllers maintain desired count automatically |
| Manual deployment | SSH into servers, run docker run by hand: error-prone, not repeatable | Declarative manifests + kubectl apply: the API server is the single deployment interface |
| No self-healing | Container crashes = downtime until someone notices and restarts it | kubelet restarts crashed containers; ReplicaSet replaces dead pods |
| No service discovery | Hardcoded IP addresses; a redeployed container gets a new IP, everything breaks | Services provide stable ClusterIPs + DNS names via CoreDNS |
| Question | 60-second answer |
|---|---|
| Why Kubernetes exists | Bin-packing, self-healing, service discovery — schedule containers across a fleet and keep them running without human intervention. |
| Control plane in one sentence | API server is the front door; etcd is the truth store; scheduler assigns pods to nodes; controller manager reconciles desired vs actual state. |
| kubectl apply → pod running | API server validates → writes to etcd → Deployment controller creates ReplicaSet → ReplicaSet controller creates Pods → scheduler binds each pod to a node → kubelet pulls image and starts containers. |
| How service discovery works | DNS via CoreDNS; kube-proxy programs iptables/IPVS rules; Services are stable virtual IPs load-balanced across healthy pod endpoints. |
| 10,000-node scheduling | Predicates filter ineligible nodes; scoring ranks feasible nodes; scheduler samples a subset (not all 10K) for latency. Binding is atomic via etcd compare-and-swap. |
Containers solve the "it works on my machine" problem. Kubernetes solves the "who runs the container on which machine, keeps it alive when the machine dies, and routes traffic to it" problem — at the scale of tens of thousands of machines and hundreds of thousands of containers, with zero human hand-holding.
"Design Kubernetes. Walk me through what happens when you run kubectl apply -f deployment.yaml — from API server to pod running on a node. Then explain how you'd design the scheduler, autoscaler, and service discovery to handle a 10,000-node cluster."
The question catches most candidates because Kubernetes looks like a deployment tool but is actually a distributed database with a reconciliation engine built on top. A weak answer names the components. A strong answer names the four forces that make this hard — and shows how every design decision exists to survive one of them.
There are exactly four forces:
Envelope math, volunteered:
| Quantity | Estimate | Consequence |
|---|---|---|
| Nodes in large cluster | 10,000 | Node heartbeats every 5s = 2,000 writes/sec to etcd — near the ceiling |
| etcd write throughput | ~1,000–3,000 req/s | Heartbeat aggregation in kubelet + lease objects reduce actual etcd load |
| Scheduler decisions/sec | ~1,000 pods/s peak | Filtering and scoring fan out across goroutines; node sampling keeps each decision short |
| Watch connections to API server | ~100K–1M | Watch multiplexing: the API server keeps roughly one etcd watch per resource type in its watch cache and fans events out to all clients |
| Pod startup latency (P99) | <5 seconds | Image pull is the bottleneck; pre-pull on nodes with DaemonSets |
| Kubernetes objects per cluster | ~300K–500K | etcd key count limit; use namespaces + labels for logical separation |
| HPA scrape interval | 15 seconds | Scale decision latency ~30–60s; cooldown prevents thrashing |
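The heartbeat row in the table is simple division, and it helps to be able to redo it when the interviewer changes the numbers. The sketch below is illustrative only: the function names are made up for this example, and the 3,000 writes/sec ceiling is the upper end of the estimate in the table, not a measured figure.

```python
def heartbeat_writes_per_sec(nodes: int, interval_s: float) -> float:
    """Steady-state writes/sec if every node writes one heartbeat per interval."""
    return nodes / interval_s


def fraction_of_ceiling(writes_per_sec: float, ceiling: float) -> float:
    """Share of an assumed etcd write budget consumed by heartbeats alone."""
    return writes_per_sec / ceiling


if __name__ == "__main__":
    # 10,000 nodes at the 5s interval used in the table above.
    writes = heartbeat_writes_per_sec(10_000, 5)
    print(f"{writes:.0f} writes/sec, {fraction_of_ceiling(writes, 3_000):.0%} of a 3,000/s budget")
```

Doubling the interval halves the write rate, which is why stretching or batching heartbeats is the first lever to mention.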
Before we dive into etcd and gRPC watches, here is Kubernetes in one analogy that a non-engineer will remember.
Imagine a global shipping fleet. You are the fleet manager. You don't care which specific ship carries which cargo — you care that the cargo arrives, is replaced if the ship sinks, and can be found by other ships that need to connect with it.
When you run kubectl apply, you hand the captain a manifest: "I want 3 copies of this container, each with 1 CPU and 2 GB of memory, listening on port 8080." The captain's entire job is to make reality match the manifest, and to keep matching it, forever, even as ships sink and cargo goes overboard.

| Problem | Without Kubernetes | With Kubernetes |
|---|---|---|
| Bin-packing | You SSH into servers and start containers by hand. Half your CPU is wasted. | Scheduler places pods to maximize resource utilization across the fleet automatically. |
| Self-healing | Container crashes at 3 AM. On-call wakes up to restart it. | kubelet notices the container died and restarts it. ReplicaSet controller replaces the pod. You sleep. |
| Service discovery | You hardcode IP addresses. Server gets replaced, IPs change, everything breaks. | Services provide stable DNS names and virtual IPs. http://payments-api:8080 always works. |
| Rolling deployments | Deploy new version = downtime, or complex blue-green scripting. | kubectl set image does a rolling update: new pods come up, old pods go down, zero downtime. |
| Scaling | You notice high CPU on the dashboard and manually add instances. | HPA watches metrics and scales replicas up/down automatically, typically within 30–60 seconds. |
The control plane is a set of processes that together maintain the desired state of the cluster. None of them run user workloads. All of them communicate exclusively through the API server — no component talks to etcd directly except the API server.
| Component | What it does | Where it runs |
|---|---|---|
| kube-apiserver | REST front door over HTTPS. Validates all API objects. The only component that reads from or writes to etcd. Serves the watch stream. Enforces RBAC and admission control. | Control plane nodes (replicated 3×) |
| etcd | Distributed key-value store. Single source of truth for all cluster state. Raft consensus. Strong consistency guarantees. Every API object lives here. | Control plane nodes (replicated 3× or 5×) |
| kube-scheduler | Watches for unscheduled pods. Runs filter+score pipeline. Writes a Binding object to the API server. Does NOT start containers. | Control plane nodes (active-standby) |
| kube-controller-manager | Runs all the reconciliation loops: Deployment controller, ReplicaSet controller, Node controller, Endpoint controller, Job controller, and ~30 more. Each is a goroutine watching the API. | Control plane nodes (active-standby) |
| cloud-controller-manager | Talks to cloud provider APIs: provision LoadBalancer Services, configure cloud routes, update Node objects with cloud metadata (instance type, region, zone). | Control plane nodes; optional if on-prem |
The golden rule: every component communicates exclusively through the API server. The scheduler does not write to etcd. The controller manager does not call the scheduler. Every action is a write to the API server, which persists to etcd, which triggers a watch event, which wakes up the relevant controller or kubelet.
Every worker node runs three processes (kubelet, kube-proxy, and a container runtime such as containerd) that together receive a pod spec, turn it into running containers, and route traffic to them.
kubectl apply -f deployment.yaml
│
▼
1. API SERVER receives HTTPS request
├── Authentication: who are you? (x509 cert / bearer token / OIDC)
├── Authorization: are you allowed? (RBAC check)
├── Mutating admission: webhooks may modify the object (MutatingWebhookConfiguration)
├── Schema validation: is this a valid Deployment spec?
├── Validating admission: policy checks on the final object (ValidatingWebhookConfiguration)
└── Write to etcd: /registry/deployments/default/my-app (resourceVersion changes)
2. DEPLOYMENT CONTROLLER (in controller-manager) notices new Deployment via watch
└── Creates a ReplicaSet object: /registry/replicasets/default/my-app-6d4f9c7b2
3. REPLICASET CONTROLLER notices new ReplicaSet
└── Creates 3 Pod objects, each with spec.nodeName = "" (unscheduled)
/registry/pods/default/my-app-6d4f9c7b2-xk9p2
/registry/pods/default/my-app-6d4f9c7b2-m7q4r
/registry/pods/default/my-app-6d4f9c7b2-p3n8s
4. SCHEDULER notices unscheduled pods via watch (spec.nodeName == "")
├── FILTER phase: remove nodes that cannot run this pod
│ - PodFitsResources: does node have enough CPU/memory?
│ - PodFitsHostPorts: is the host port available?
│ - MatchNodeSelector: do labels match nodeSelector/nodeAffinity?
│ - NoTaint / Toleration: are taints tolerated?
│ - VolumeZoneConformance: is the PV zone compatible?
├── SCORE phase: rank remaining feasible nodes (0–100)
│ - LeastRequestedPriority: prefer nodes with most free resources
│ - BalancedResourceAllocation: balance CPU and memory usage
│ - NodeAffinityPriority: prefer nodes matching preferred affinity
└── BIND: writes Binding object → API server sets spec.nodeName = "node-42"
5. KUBELET on node-42 notices a pod bound to it via watch
├── Calls containerd via CRI: "create sandbox" (pause container for network namespace)
├── CNI plugin (invoked by the runtime) configures pod network (assigns IP from pod CIDR)
├── Calls containerd: "pull image" → layers unpacked for the overlay filesystem
├── Runs init containers sequentially (if any); each must exit successfully before the next step
├── Calls containerd: "create containers" → cgroups + namespaces set up
├── Calls containerd: "start containers" → regular containers start
├── Starts liveness/readiness/startup probes
└── Updates pod status in API server: phase=Running, podIP=10.244.42.15
6. ENDPOINT SLICE CONTROLLER notices pod is Running + Ready
└── Adds pod IP to EndpointSlice for all matching Services
7. kube-proxy on every node notices updated EndpointSlice
└── Reprograms iptables/IPVS rules — traffic can now reach the new pod
Total time from kubectl apply to pod running: ~2–10 seconds (image pre-pulled) or ~30–90s (cold pull)
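Steps 2 and 3 of the flow are reconciliation loops: compare desired state with observed state and emit only the difference. A minimal sketch of that idea follows. The function name and the pod-name generator are hypothetical, and the real ReplicaSet controller is considerably more careful about which pods it deletes first.

```python
import itertools
from typing import Iterator


def reconcile_replicas(
    desired: int, running: list[str], names: Iterator[str]
) -> tuple[list[str], list[str]]:
    """Return (pods_to_create, pods_to_delete) that move `running` toward `desired`."""
    if len(running) < desired:
        missing = desired - len(running)
        return [next(names) for _ in range(missing)], []
    if len(running) > desired:
        surplus = len(running) - desired
        # Simplification: delete the first `surplus` pods in list order.
        return [], running[:surplus]
    return [], []  # already converged: nothing to do


if __name__ == "__main__":
    names = (f"my-app-{i}" for i in itertools.count())
    print(reconcile_replicas(3, [], names))
```

Because the function looks only at current state, it can be re-run after a crash or a missed event and still converge, which is the property that makes the control plane self-healing.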
A pod transitions through a well-defined set of states, and the probes that kubelet runs determine when a pod is considered healthy, when it receives Service traffic, and when its containers are restarted.
| Container type | Runs when | Purpose | Failure behavior |
|---|---|---|---|
| Init container | Before any regular container, sequentially | Database migration, config pre-fetch, wait-for-dependency. Completes then exits. | Pod stays Pending; RestartPolicy applies to init containers too |
| Sidecar container | Alongside main container (K8s 1.29+ native; otherwise just a regular container) | Log forwarding (Fluentd), proxy (Envoy/Istio), metrics exporter, secret sync | Restarts independently. A native sidecar is shut down after the main containers exit; a sidecar declared as a regular container keeps running |
| Ephemeral container | On-demand, live container debug | kubectl debug injects a debug container into a running pod without restarting it | Debugging only — not in pod spec, not restarted |
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      initContainers:
        - name: wait-for-db
          image: busybox:1.36
          command: ['sh', '-c', 'until nc -z postgres:5432; do sleep 2; done']
      containers:
        - name: payments-api
          image: company/payments-api:v2.1.4
          ports:
            - containerPort: 8080
          resources:
            requests: # used for scheduling bin-packing
              cpu: "500m"
              memory: "512Mi"
            limits: # enforced by cgroups; OOM-kill at limit
              cpu: "2"
              memory: "2Gi"
          livenessProbe: # fail → restart container
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe: # fail → removed from Service endpoints
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          startupProbe: # one-time: disables liveness until /startup returns 200
            httpGet:
              path: /startup
              port: 8080
            failureThreshold: 30 # 30 × 10s = 5 minute grace window for slow start
            periodSeconds: 10
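The probe fields in the manifest reduce to two small rules: the startup grace window is failureThreshold × periodSeconds, and a probe only trips after that many consecutive failures. The sketch below models just those two rules; the function names are invented for illustration, and it ignores initialDelaySeconds, timeouts, and successThreshold.

```python
def startup_grace_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Longest time a container may take to pass its startup probe."""
    return failure_threshold * period_seconds


def probe_trips(results: list[bool], failure_threshold: int) -> bool:
    """True once `failure_threshold` consecutive failures appear in `results`."""
    consecutive = 0
    for ok in results:
        consecutive = 0 if ok else consecutive + 1  # any success resets the counter
        if consecutive >= failure_threshold:
            return True
    return False


if __name__ == "__main__":
    print(startup_grace_seconds(30, 10))               # values from the startupProbe
    print(probe_trips([True, False, False, False], 3))  # liveness with failureThreshold: 3
```

A single success resets the counter, so a flapping endpoint that fails two checks out of every three never triggers a restart with failureThreshold: 3.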
Kubernetes networking rests on one core assumption: every pod has a unique IP, and every pod can reach every other pod directly, without NAT. Everything else — Services, Ingress, NetworkPolicy — is built on top of that flat network.
| Layer | What it provides | How it works |
|---|---|---|
| Pod networking (CNI) | Pod-to-pod IP reachability across nodes. Every pod gets a real IP from the cluster's pod CIDR (e.g., 10.244.0.0/16). | CNI plugin (Calico, Cilium, Flannel) assigns IPs, sets up routes or VXLAN overlays, programs the kernel. The container runtime invokes the CNI binary when kubelet asks it to create or delete a pod sandbox. |
| Service (ClusterIP) | Stable virtual IP + DNS name for a set of pods. http://payments-api:8080 always resolves to the same ClusterIP, even as pods cycle. | kube-proxy watches EndpointSlices and programs iptables DNAT rules: ClusterIP → one of the ready pod IPs, chosen at random. IPVS mode scales better and defaults to round-robin. eBPF dataplanes (Cilium) can replace kube-proxy entirely. |
| NodePort / LoadBalancer | Expose a Service outside the cluster. NodePort binds a port on every node. LoadBalancer provisions a cloud LB in front. | NodePort: kube-proxy opens the port on all nodes. LoadBalancer: cloud-controller-manager calls the cloud API (AWS ELB, GCP LB) to create the load balancer and update its targets. |
| Ingress | Layer-7 HTTP routing: route /api/* to one Service, /static/* to another, by hostname. TLS termination. | An Ingress controller (nginx-ingress, Traefik, AWS ALB Ingress) watches Ingress objects and reprograms its own routing config. Kubernetes defines the Ingress API; you install the controller. |
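In iptables mode, kube-proxy is generally described as spreading traffic by chaining one rule per endpoint, each with a random-match probability of 1/n, 1/(n-1), and so on down to 1. The sketch below checks that this chain gives every endpoint an equal share. It is arithmetic only: the function names are hypothetical and no iptables rules are read or written.

```python
from fractions import Fraction


def rule_probabilities(n: int) -> list[Fraction]:
    """Per-rule match probability for n endpoints: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, n - i) for i in range(n)]


def endpoint_shares(n: int) -> list[Fraction]:
    """Probability that a new connection lands on each endpoint, in rule order."""
    remaining = Fraction(1)  # probability that no earlier rule matched
    shares = []
    for p in rule_probabilities(n):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


if __name__ == "__main__":
    print(rule_probabilities(3))
    print(endpoint_shares(3))
```

The last rule always matches, so every connection is assigned somewhere, and the selection is per connection rather than per request.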
Every pod gets /etc/resolv.conf pointing at CoreDNS (a cluster-internal DNS server). CoreDNS resolves:
- payments-api → ClusterIP of the payments-api Service in the same namespace
- payments-api.default.svc.cluster.local → ClusterIP (fully-qualified)
- Headless Service (clusterIP: None) → A records for each pod IP directly (used for StatefulSets, Kafka brokers, etc.)

Kubernetes stores every object in etcd as a key-value pair. Understanding the key schema and the watch mechanism is understanding why Kubernetes is architecturally elegant, and where its scalability limits lie.
# etcd key format: /registry/{resource-type}/{namespace}/{name}
/registry/pods/default/payments-api-6d4f9c7b2-xk9p2
/registry/pods/kube-system/coredns-74ff55c5b-4rz8k
/registry/deployments/default/payments-api
/registry/replicasets/default/payments-api-6d4f9c7b2
/registry/services/specs/default/payments-api # Services live under services/specs
/registry/services/endpoints/default/payments-api # being replaced by EndpointSlices
/registry/endpointslices/default/payments-api-xz9k
/registry/configmaps/default/payments-api-config
/registry/secrets/default/payments-api-tls
/registry/minions/node-42 # cluster-scoped, no namespace (Nodes keep the legacy "minions" prefix)
/registry/namespaces/production # cluster-scoped
# Value: serialized protobuf for built-in types (not JSON, despite the API accepting JSON)
# Each write advances the etcd revision, a cluster-wide monotonic counter from which resourceVersion is derived
This is the most important architectural insight in Kubernetes. The entire system is an event-driven, watch-driven pub/sub built on top of a key-value store.
# API server watch API (simplified)
# Every component establishes a long-lived HTTP watch (a streaming response) to the API server
# The scheduler watches for unscheduled pods:
GET /api/v1/pods?fieldSelector=spec.nodeName=&watch=true&resourceVersion=12345
# → receives a stream of ADDED/MODIFIED/DELETED events
# The Deployment controller watches Deployments:
GET /apis/apps/v1/deployments?watch=true&resourceVersion=12345
# Each kubelet watches pods assigned to its node:
GET /api/v1/pods?fieldSelector=spec.nodeName=node-42&watch=true
# The critical insight: the API server keeps roughly ONE etcd watch per resource type
# (its watch cache) and fans out events to the potentially THOUSANDS of watchers.
# This "reflector + informer + workqueue" pattern appears in every K8s component:
# Informer lifecycle:
# 1. LIST all objects at startup (paginated, 500 at a time)
# 2. Sync the local in-memory cache
# 3. WATCH from the last seen resourceVersion
# 4. On reconnect, resume from last resourceVersion (no full re-list if etcd has the history)
# 5. If resourceVersion is too old (etcd compacted), fall back to full re-LIST
# resourceVersion is derived from the etcd revision, a cluster-wide monotonic integer.
# Clients should treat it as opaque. It enables optimistic concurrency:
# GET pod → resourceVersion=5000
# PUT pod (modify) with resourceVersion=5000
# → API server checks: is the object's stored resourceVersion still 5000? If not (someone else wrote), return 409 Conflict
# This maps to an etcd compare-and-swap on the key's mod revision and prevents lost updates.
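The optimistic-concurrency and fan-out ideas above fit in a few lines of in-memory code. This is a toy model, not etcd or the API server: `VersionedStore`, `Conflict`, and the callback signature are all invented for the example, and there is no persistence, compaction, or networking.

```python
class Conflict(Exception):
    """Raised when a write carries a stale version (the 409 Conflict case)."""


class VersionedStore:
    def __init__(self) -> None:
        self.revision = 0   # store-wide monotonic counter
        self._data = {}     # key -> (value, revision of last write)
        self._watchers = []  # callbacks invoked on every committed write

    def watch(self, callback) -> None:
        self._watchers.append(callback)

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value, expected_version=None) -> int:
        current = self._data.get(key)
        if expected_version is not None:
            stored = current[1] if current else 0
            if stored != expected_version:  # someone else wrote in between
                raise Conflict(f"{key}: expected {expected_version}, found {stored}")
        self.revision += 1
        self._data[key] = (value, self.revision)
        event = "MODIFIED" if current else "ADDED"
        for notify in self._watchers:  # one write, many subscribers
            notify(event, key, value, self.revision)
        return self.revision


if __name__ == "__main__":
    store = VersionedStore()
    store.watch(lambda ev, key, value, rev: print(ev, key, rev))
    rv = store.put("/registry/pods/default/my-app", {"nodeName": ""})
    store.put("/registry/pods/default/my-app", {"nodeName": "node-42"}, expected_version=rv)
```

A second writer that still holds the old version gets `Conflict` and must re-read before retrying, which is the behavior the scheduler relies on when two bindings race.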
Why etcd and not PostgreSQL? This is a real interview question. etcd is Kubernetes' weakest scalability point: it tops out around 8 GB of data and ~3,000 writes/sec. The question forces you to articulate what etcd actually provides.
| Feature | etcd provides | PostgreSQL equivalent |
|---|---|---|
| Consistent reads | Linearizable reads via Raft — always reads the latest committed value | SET TRANSACTION ISOLATION LEVEL SERIALIZABLE or SELECT FOR UPDATE |
| Watch / Change Notification | gRPC streaming watch: efficient long-lived watch on key prefixes | LISTEN/NOTIFY + logical replication slots; much more complex to fan out |
| Optimistic concurrency | resourceVersion CAS on write: txn(compare(rev,5000)).put(k,v) | WHERE xmin = $1 optimistic locking or SELECT FOR UPDATE |
| Lease / TTL keys | etcd leases: keys attached to a lease are deleted if the client stops renewing it (keep-alive heartbeat) | Scheduled job to expire rows; no native push notification on expiry |
| Cluster-wide monotonic revision | Every write increments a global revision counter | PostgreSQL txid_current() / sequences; monotonic per-table not cluster-wide |
The K8s community built kine — a shim that lets Kubernetes use MySQL, PostgreSQL, or SQLite as the backing store instead of etcd. K3s uses SQLite for single-node installs via kine. The watch API is the hardest part to replicate — kine implements it via polling + NOTIFY.
The scheduler's job is to assign each pending pod to a node that can run it. At 10,000 nodes, evaluating every node for every pod would be too slow. The solution is a two-phase pipeline with sampling.
For each unscheduled pod:
PHASE 1: FILTER (predicates — binary pass/fail)
Remove nodes that CANNOT run this pod:
├── PodFitsResources cpu+memory requests fit within allocatable capacity
├── PodFitsHostPorts no port conflict on the node
├── MatchNodeSelector node labels match pod's nodeSelector / nodeAffinity
├── NoDiskConflict required volumes can be attached
├── NoVolumeZoneConflict PV zone matches node zone
├── MaxEBSVolumeCount AWS: max 39 EBS volumes per node
├── MatchInterPodAffinity pod can coexist with existing pods on this node
├── PodToleratesNodeTaints pod tolerates all NoSchedule taints on the node
└── (... 20+ predicates total)
Result: feasibleNodes (could be 0 → Pending, or 1–N)
PHASE 2: SCORE (priorities — 0–100 per node)
Rank feasible nodes to pick the best:
├── LeastRequestedPriority (100 - (cpu% + mem%) / 2) — prefer emptier nodes
├── BalancedResourceAllocation penalize imbalanced cpu:memory ratio
├── NodeAffinityPriority weight preferred node affinity rules
├── InterPodAffinityPriority prefer/avoid nodes with specific pods
├── ImageLocalityPriority bonus for nodes that already have the image cached
└── (custom plugins via Scheduling Framework)
Result: scored nodes sorted descending
PHASE 3: BIND
Selected node → create Binding object via API server
API server: SET spec.nodeName = "node-42" in etcd
This is a compare-and-swap — if two schedulers race, one wins, one gets 409 Conflict
──────────────────────────────────────────────────────────────────────
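10,000-NODE PERFORMANCE TRICK: SAMPLING
Instead of filtering and scoring ALL 10,000 nodes:
percentageOfNodesToScore: 0 (the default, meaning "adaptive")
→ adaptive rule: about 50% for a 100-node cluster, shrinking as the cluster grows, with a floor of 5% and never fewer than 100 nodes
→ the scheduler stops filtering as soon as it has found that many feasible nodes (about 500 at 10,000 nodes)
→ scoring phase runs on those ~500 nodes, not 10,000
→ worst case: slightly suboptimal placement
→ benefit: scheduling latency stays <5ms per pod
The scheduler handles one pod per scheduling cycle, but filtering and scoring fan out across parallel goroutines (16 by default), and binding runs asynchronously.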
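The filter-then-score pipeline can be sketched in a few functions. Everything here is a simplification under stated assumptions: `Node`, `fits`, and `schedule` are hypothetical names, only CPU and memory are considered, and `nodes_to_score` reproduces the adaptive sampling rule as I understand it from kube-scheduler (50% minus one point per 125 nodes, floor of 5%, minimum of 100 nodes), which is worth verifying against the version you run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    cpu_alloc: int    # allocatable millicores
    mem_alloc: int    # allocatable MiB
    cpu_req: int = 0  # sum of requests already placed on the node
    mem_req: int = 0


def fits(node: Node, cpu: int, mem: int) -> bool:
    """FILTER: do the pod's requests fit into what is still unrequested?"""
    return node.cpu_req + cpu <= node.cpu_alloc and node.mem_req + mem <= node.mem_alloc


def least_requested_score(node: Node, cpu: int, mem: int) -> float:
    """SCORE: 100 - (cpu% + mem%) / 2, computed as if the pod were placed."""
    cpu_pct = 100 * (node.cpu_req + cpu) / node.cpu_alloc
    mem_pct = 100 * (node.mem_req + mem) / node.mem_alloc
    return 100 - (cpu_pct + mem_pct) / 2


def nodes_to_score(total_nodes: int, configured_pct: int = 0) -> int:
    """How many feasible nodes to collect before scoring (0 means adaptive)."""
    if total_nodes < 100:
        return total_nodes
    pct = configured_pct or max(50 - total_nodes // 125, 5)
    if pct >= 100:
        return total_nodes
    return max(total_nodes * pct // 100, 100)


def schedule(nodes: list[Node], cpu: int, mem: int) -> Optional[str]:
    """Return the chosen node name, or None if the pod must stay Pending."""
    limit = nodes_to_score(len(nodes))
    feasible = []
    for node in nodes:
        if fits(node, cpu, mem):
            feasible.append(node)
            if len(feasible) >= limit:  # stop early: enough candidates found
                break
    if not feasible:
        return None
    return max(feasible, key=lambda n: least_requested_score(n, cpu, mem)).name
```

The early exit in `schedule` is where the latency saving comes from: at 10,000 nodes the loop stops after finding about 500 feasible candidates instead of evaluating the whole fleet.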
# Node affinity: require or prefer nodes by label
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution: # hard requirement
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: [us-east-1a, us-east-1b]
    preferredDuringSchedulingIgnoredDuringExecution: # soft preference
      - weight: 80
        preference:
          matchExpressions:
            - key: node-type
              operator: In
              values: [high-memory]
# Pod anti-affinity: spread pods across nodes (prevent all replicas on one node)
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: payments-api
        topologyKey: kubernetes.io/hostname # no two replicas on same node
# Topology spread constraints (K8s 1.19+): more flexible than anti-affinity
topologySpreadConstraints:
  - maxSkew: 1 # max 1 replica difference between zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: payments-api
# Taints and tolerations: reserve nodes for specific workloads
# Node: kubectl taint nodes gpu-node-01 dedicated=gpu:NoSchedule
# Pod tolerates the taint:
tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
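The maxSkew rule in the topology spread constraint is easier to reason about with numbers. A minimal sketch, assuming skew is defined as the matching-pod count in the candidate zone after placement minus the smallest count across zones; the function names are hypothetical and node eligibility, minDomains, and ScheduleAnyway are ignored.

```python
def skew_if_placed(counts: dict[str, int], zone: str) -> int:
    """Skew that would result from adding one matching pod to `zone`."""
    return counts[zone] + 1 - min(counts.values())


def zones_allowed(counts: dict[str, int], max_skew: int) -> list[str]:
    """Zones where one more replica keeps the skew within max_skew."""
    return sorted(z for z in counts if skew_if_placed(counts, z) <= max_skew)


if __name__ == "__main__":
    replicas_per_zone = {"us-east-1a": 2, "us-east-1b": 1, "us-east-1c": 1}
    print(zones_allowed(replicas_per_zone, max_skew=1))
```

With whenUnsatisfiable: DoNotSchedule, an empty result means the pod stays Pending rather than landing in an already crowded zone.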
Kubernetes has three layers of autoscaling, each operating at a different timescale and on different resources.
| Scaler | What it scales | Trigger | Latency |
|---|---|---|---|
| HPA (Horizontal Pod Autoscaler) | Number of replicas in a Deployment/StatefulSet | CPU utilization, memory, or custom metrics (KEDA for queue depth, RPS, etc.) | 30–60 seconds (scrape interval 15s + decision cooldown) |
| VPA (Vertical Pod Autoscaler) | CPU and memory requests/limits of existing pods | Historical utilization; recommends right-sizing | Minutes to hours; usually requires pod restart |
| Cluster Autoscaler | Number of nodes in the cluster | Pods stuck in Pending (no room) → scale up; underutilized nodes → scale down | Scale-up: 1–3 minutes (node provisioning). Scale-down: 10+ minutes (safety margin) |
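The HPA decision itself is one formula: desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a small tolerance of 1. The sketch below assumes the commonly documented 10% default tolerance; the function name and the clamping parameters are illustrative, and stabilization windows are not modeled.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Replica count an HPA-style controller would ask for."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: avoid thrashing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 3 replicas averaging 90% CPU against a 60% target.
    print(desired_replicas(3, current_metric=90, target_metric=60))
```

Rounding up means the controller errs toward extra capacity, and the tolerance band is what keeps a metric hovering near the target from flapping the replica count.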
# KEDA (Kubernetes Event-Driven Autoscaling) scales on external signals:
# queue depth, Kafka lag, HTTP RPS, cron schedule, custom Prometheus metrics
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payments-worker
spec:
  scaleTargetRef:
    name: payments-worker
  minReplicaCount: 0 # scale to zero when queue is empty (save $$$)
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/payments-queue
        queueLength: "10" # target: 10 messages per replica
        awsRegion: us-east-1
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: payments_queue_depth
        query: sum(payments_queue_depth) # assumed PromQL; the prometheus trigger needs a query
        threshold: "100" # scale up if >100 unprocessed payments per replica
When pods are stuck in Pending because no node has enough capacity, the Cluster Autoscaler asks the cloud provider to provision a new node. When nodes have been underutilized for 10+ minutes and all pods could be rescheduled to other nodes, it cordons and drains the node then deletes it.
# Key Cluster Autoscaler behaviors:
# Scale-up trigger:
# pod is Pending AND can be scheduled if one more node of type X exists
# → call cloud provider API: "add 1 node to node group X"
# → node joins cluster in ~90 seconds
# → scheduler places the pending pod
# Scale-down trigger:
# node utilization < 50% of requests for 10 minutes
# AND all pods can be rescheduled elsewhere
# → cordon node (no new pods scheduled)
# → gracefully evict pods (respecting PodDisruptionBudgets)
# → cloud provider: "delete this node"
# PodDisruptionBudget — protect critical workloads from eviction:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
spec:
  minAvailable: 2 # always keep at least 2 pods running during disruptions
  selector:
    matchLabels:
      app: payments-api
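The budget above boils down to a single subtraction that the eviction path checks before removing each pod. A minimal sketch with hypothetical function names; it handles an absolute minAvailable only, not percentages or maxUnavailable.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """How many more pods may be voluntarily evicted right now."""
    return max(healthy_pods - min_available, 0)


def drain_order(pods_on_node: int, healthy_pods: int, min_available: int) -> list[str]:
    """Simulate draining a node pod by pod, with no replacements becoming ready."""
    decisions = []
    for _ in range(pods_on_node):
        if disruptions_allowed(healthy_pods, min_available) > 0:
            decisions.append("evict")
            healthy_pods -= 1
        else:
            decisions.append("wait")  # drain blocks until a replacement is Ready
    return decisions


if __name__ == "__main__":
    # 3 healthy replicas, minAvailable: 2, two of them on the node being drained.
    print(drain_order(pods_on_node=2, healthy_pods=3, min_available=2))
```

The second eviction waits until a replacement pod is Ready elsewhere, which is why a drain can stall when the cluster has no spare capacity.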
Kubernetes RBAC is a four-part system: Subjects (who), Verbs (what action), Resources (on what), and Scope (namespace or cluster). Every API request goes through RBAC before reaching etcd.
# ServiceAccount — the identity a pod runs as
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api
  namespace: payments
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/payments-api # IRSA for AWS
---
# Role: namespaced permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-api-role
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["db-credentials", "stripe-key"] # specific secrets only
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
# RoleBinding: bind the Role to the ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-api-binding
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: payments-api
    namespace: payments
roleRef:
  kind: Role
  name: payments-api-role
  apiGroup: rbac.authorization.k8s.io
---
# NetworkPolicy: allow only specific pod-to-pod traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payments-api-netpol
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments-api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: ingress-nginx # allow from ingress only
      ports:
        - port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres # allow to postgres only
      ports:
        - port: 5432
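RBAC evaluation is additive: a request is allowed if any rule matches its API group, resource, verb, and (when listed) resource name. The sketch below applies that logic to the rules from the Role above. It is a deliberate simplification with a hypothetical function name: wildcards, subresources, ClusterRoles, and aggregation are not handled.

```python
def is_allowed(rules, verb, resource, name=None, api_group=""):
    """Return True if any rule grants `verb` on `resource` (optionally a named object)."""
    for rule in rules:
        if api_group not in rule["apiGroups"]:
            continue
        if resource not in rule["resources"]:
            continue
        if verb not in rule["verbs"]:
            continue
        names = rule.get("resourceNames")
        if names and name not in names:
            continue  # rule is restricted to specific objects
        return True
    return False  # no rule matched: deny by default


# The two rules from payments-api-role, written as plain dictionaries.
PAYMENTS_ROLE_RULES = [
    {
        "apiGroups": [""],
        "resources": ["secrets"],
        "resourceNames": ["db-credentials", "stripe-key"],
        "verbs": ["get"],
    },
    {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get", "list", "watch"]},
]

if __name__ == "__main__":
    print(is_allowed(PAYMENTS_ROLE_RULES, "get", "secrets", "db-credentials"))
    print(is_allowed(PAYMENTS_ROLE_RULES, "list", "secrets"))
```

There are no deny rules in RBAC, so least privilege comes from granting narrowly, as the resourceNames restriction on secrets does here.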
| Layer | Mechanism | What it prevents |
|---|---|---|
| Authentication | X.509 client certs, bearer tokens, OIDC, service account tokens (JWT) | Unauthenticated API access |
| RBAC authorization | Roles + RoleBindings; ClusterRoles for cluster-scoped resources | Privilege escalation; a compromised pod can't read all secrets |
| Admission Control | ValidatingWebhookConfiguration, MutatingWebhookConfiguration, PodSecurity admission | Policy violations before they reach etcd; auto-inject sidecars |
| PodSecurity Standards | Baseline / Restricted / Privileged per namespace label | Containers running as root; host network access; privilege escalation |
| NetworkPolicy | L3/L4 allow/deny rules enforced by the CNI plugin | Lateral movement: compromised pod can't reach all other pods |
| Secrets encryption at rest | EncryptionConfiguration in kube-apiserver: AES-GCM or KMS provider (AWS KMS, GCP KMS) | Secrets readable if etcd backup is leaked |
| Audit logging | Every API request logged to file or webhook sink (SIEM) | Post-incident forensics; "what was running before the breach?" |
These questions separate the senior answer from the staff answer — etcd internals, split-brain, rolling strategies, StatefulSet vs Deployment, CRDs, and multi-tenancy.
One node dies in a 3-node etcd cluster. Does etcd lose quorum, and what happens to the Kubernetes cluster?
With 2 of 3 nodes remaining, the Raft cluster still has quorum and continues operating normally — etcd requires floor(n/2)+1 nodes. If a second node dies (only 1 of 3 remaining), etcd loses quorum: all writes are rejected with etcdserver: request timed out. The API server can still serve reads from its in-memory cache but cannot persist new state — pods keep running but no new deployments can be created or updated.
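The quorum rule in the answer, floor(n/2)+1, is worth computing for a few cluster sizes because it explains why etcd clusters use odd member counts. A tiny sketch with illustrative function names:

```python
def quorum(members: int) -> int:
    """Votes needed to commit a write: floor(n/2) + 1."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Members that can be lost while the cluster still accepts writes."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(n, quorum(n), failures_tolerated(n))
```

Four members tolerate no more failures than three, so the extra member adds replication cost without adding availability.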
Explain rolling update strategy. What are maxUnavailable and maxSurge?
maxUnavailable: how many pods can be unavailable during the update (defaults to 25%). maxSurge: how many extra pods above the desired count can exist during the update (defaults to 25%). Setting maxUnavailable: 0, maxSurge: 1 means: create one new pod, wait for it to be ready, then terminate one old pod — guarantees zero downtime but requires extra capacity. maxUnavailable: 1, maxSurge: 0 never exceeds the replica count but temporarily reduces capacity.
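Percentages for these two fields are resolved against the replica count with different rounding: maxSurge rounds up and maxUnavailable rounds down, and if both come out as zero the controller treats maxUnavailable as 1 so the rollout can progress. The sketch below encodes that understanding; the helper names are invented and only integer or "N%" values are accepted.

```python
import math


def _resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an int or an 'N%' string into an absolute pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%") -> dict:
    """Upper bound on total pods and lower bound on available pods during a rollout."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        unavailable = 1  # otherwise the rollout could never make progress
    return {"max_total": replicas + surge, "min_available": replicas - unavailable}


if __name__ == "__main__":
    print(rollout_bounds(3))         # defaults on a 3-replica Deployment
    print(rollout_bounds(3, 0, 1))   # never exceed the replica count
```

With the defaults on 3 replicas, the rollout adds one extra pod and never takes an old one away before its replacement is Ready.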
kubectl apply vs kubectl create — what's the difference?
kubectl create is imperative: create the object, fail if it already exists. kubectl apply is declarative: by default it computes a client-side three-way merge between your manifest, the last-applied-configuration annotation, and the live object, then sends a patch; with --server-side, the API server performs the merge and tracks field ownership in managedFields. Either way, it creates the object if missing and updates it if changed. Prefer apply in CI/CD pipelines; create is for one-off resource creation.
What is a headless Service? When do you use one?
A headless Service (clusterIP: None) does not get a virtual IP. Instead, DNS returns A records for each pod IP directly. Used for: StatefulSets (each pod needs a stable DNS name: kafka-0.kafka-headless.default.svc.cluster.local), client-side load balancing (gRPC streams, Cassandra), and any case where you need to address individual pods rather than a random healthy pod.
StatefulSet vs Deployment — when do you use each?
Deployment: stateless pods where any replica can serve any request — web servers, API services, batch workers. StatefulSet: pods with stable identity (persistent hostname, persistent volume, ordered startup/shutdown) — Kafka, ZooKeeper, MySQL replica sets, Elasticsearch. The key difference: StatefulSet pods have predictable names (kafka-0, kafka-1), and each gets its own PersistentVolumeClaim that follows the pod across reschedules.
Resource requests vs resource limits — what's the difference and why does it matter?
Requests are what the scheduler uses for bin-packing: a node is feasible only if its allocatable capacity, minus the requests of pods already placed there, covers the new pod's requests. Limits are enforced at runtime by cgroups: a container exceeding its CPU limit is throttled; one exceeding its memory limit is OOM-killed. Setting requests == limits for every container (Guaranteed QoS class) makes the pod the last candidate for node-pressure eviction, although a container is still OOM-killed if it exceeds its own memory limit. Setting neither requests nor limits (BestEffort QoS class) means the pod is the first evicted when the node runs low on memory. Anything in between is Burstable.
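QoS class is derived from the pod spec, not set directly. A minimal sketch of the classification rules as commonly documented: Guaranteed when every container has CPU and memory limits with equal requests, BestEffort when nothing is set, Burstable otherwise. The function name is hypothetical, quantities are compared as plain strings (so "500m" and "0.5" would not be treated as equal), and init containers are ignored.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' `requests` and `limits` dictionaries."""
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits:
                guaranteed = False
            # A missing request defaults to the limit, so only a differing one disqualifies.
            elif requests.get(resource, limits[resource]) != limits[resource]:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    payments_api = {
        "requests": {"cpu": "500m", "memory": "512Mi"},
        "limits": {"cpu": "2", "memory": "2Gi"},
    }
    print(qos_class([payments_api]))
```

The payments-api container from the earlier manifest lands in Burstable because its requests are lower than its limits.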
What is node pressure eviction? How does it differ from OOM kill?
Node pressure eviction is kubelet gracefully terminating pods when the node approaches a resource threshold (memory, disk, PID count). kubelet orders eviction by QoS class (BestEffort first, then Burstable, finally Guaranteed) and by how much the pod exceeds its requests. OOM kill is the Linux kernel forcibly killing a process without any graceful shutdown, either because a container exceeded its cgroup memory limit or because the node ran out of physical memory; the container is reported as OOMKilled. Eviction is gentler; OOM kill is a last resort.
What is a CRD (Custom Resource Definition) and what is an Operator?
A CRD extends the Kubernetes API with your own resource types: you can define a KafkaCluster or PostgresDatabase custom resource and manage it via kubectl. An Operator is a controller that watches these custom resources and implements the runbook for managing the underlying software: provisioning, scaling, backup, failover. cert-manager (Certificate, ClusterIssuer), KEDA (ScaledObject), and Istio (VirtualService, AuthorizationPolicy) all follow this CRD-plus-controller pattern.
How do you handle multi-tenancy in Kubernetes — giving different teams isolated environments on the same cluster?
Soft multi-tenancy (namespaces + RBAC + NetworkPolicy + ResourceQuota): each team gets a namespace, can only touch their own resources, has a resource budget, and pods cannot reach other namespaces' pods. Hard multi-tenancy (separate clusters): for PCI-DSS, HIPAA, or true blast-radius isolation — use cluster-per-tenant with shared control plane (Cluster API) or completely separate clusters with federated tooling. The Kubernetes docs explicitly say namespaces are not a security boundary for untrusted tenants.
What is a PodDisruptionBudget and when is it critical?
A PDB constrains how many pods of a workload can be simultaneously unavailable during voluntary disruptions (node drain, Cluster Autoscaler scale-down, and other evictions through the Eviction API). Rolling updates are paced by the Deployment's own maxUnavailable rather than by the PDB. minAvailable: 2 means at least 2 pods must be running; the drain/autoscaler will wait if draining a node would violate this. Critical for: stateful services, any service with low replication factor, database connection pools that need warmup time. Without PDBs, a node drain could evict all replicas simultaneously.
How does the Kubernetes watch API work at the protocol level? What is the informer pattern?
Watch is a long-lived HTTP streaming response (chunked transfer encoding on HTTP/1.1, a stream on HTTP/2; gRPC is used between the API server and etcd, not by API clients): GET /api/v1/pods?watch=true returns a stream of JSON events (ADDED, MODIFIED, DELETED, plus optional BOOKMARK) until the server times the request out and the client reconnects from its last resourceVersion. The informer pattern (in client-go) wraps this: on startup, LIST all objects and populate a local in-memory cache; then WATCH from the returned resourceVersion and apply events to the cache. Controllers therefore read from a local cache (no API server calls for reads) and only contact the API server to write, which dramatically reduces API server load.
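The list-then-watch sequence can be sketched with an in-memory stand-in for the API server. Everything here is hypothetical scaffolding (`FakeApiServer`, `Informer`, and their methods are not client-go APIs); the point is only the order of operations: one LIST, then events replayed from the returned resourceVersion.

```python
from dataclasses import dataclass, field


@dataclass
class FakeApiServer:
    """Hypothetical in-memory stand-in for the API server."""
    objects: dict = field(default_factory=dict)
    resource_version: int = 0
    log: list = field(default_factory=list)  # (resourceVersion, type, name, spec)

    def apply(self, event_type, name, spec=None):
        # Every write bumps the global resourceVersion and is appended to the log.
        self.resource_version += 1
        if event_type == "DELETED":
            self.objects.pop(name, None)
        else:
            self.objects[name] = spec
        self.log.append((self.resource_version, event_type, name, spec))

    def list_objects(self):
        # LIST returns a snapshot plus the resourceVersion it is valid at.
        return dict(self.objects), self.resource_version

    def watch(self, since):
        # WATCH returns every event newer than the given resourceVersion.
        return [event for event in self.log if event[0] > since]


class Informer:
    def __init__(self, server):
        self.server = server
        # Step 1: LIST once to seed the local cache.
        self.cache, self.resource_version = server.list_objects()

    def sync(self):
        # Step 2: WATCH from the last seen version and fold events into the cache.
        for version, event_type, name, spec in self.server.watch(self.resource_version):
            if event_type == "DELETED":
                self.cache.pop(name, None)
            else:
                self.cache[name] = spec
            self.resource_version = version


server = FakeApiServer()
server.apply("ADDED", "pod-a", {"image": "web:1"})
informer = Informer(server)
server.apply("MODIFIED", "pod-a", {"image": "web:2"})
server.apply("ADDED", "pod-b", {"image": "worker:1"})
server.apply("DELETED", "pod-a")
informer.sync()
print(informer.cache, informer.resource_version)
```

After the sync, reads such as `informer.cache["pod-b"]` are served locally; only the initial LIST and the event stream touched the server.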
A pod is stuck in Pending. Walk me through your diagnosis.
kubectl describe pod → look at the Events section. Common causes: (1) Insufficient cpu/memory: no node has enough unreserved capacity; (2) node(s) didn't match the pod's nodeSelector/affinity: labels wrong; (3) node(s) had a taint that the pod did not tolerate: missing toleration; (4) pod has unbound immediate PersistentVolumeClaims: no PV available or StorageClass misconfigured; (5) node(s) had volume node affinity conflict: the PV is in a different zone from the candidate nodes in a multi-AZ cluster. Next checks: kubectl get events --sort-by=.lastTimestamp and kubectl get nodes -o wide.
What is the scheduler extender / scheduling framework and when would you use a custom scheduler plugin?
The Kubernetes Scheduling Framework defines extension points (PreFilter, Filter, PostFilter, PreScore, Score, Reserve, Permit, PreBind, Bind, PostBind, among others) where you can inject custom logic as plugins compiled into the scheduler binary. Use cases: GPU topology awareness (place GPU pods on nodes sharing NVLink), gang scheduling (batch ML jobs must all start together), specialized hardware affinity (FPGA, RDMA), or cost-aware scheduling (prefer spot instances). The older scheduler extender (HTTP webhook) adds a network round-trip to each scheduling decision and can hook into fewer points; prefer the in-process plugin framework.
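A toy version of the Filter and Score extension points, assuming a drastically simplified scheduler. Real plugins are written in Go against the scheduler framework interfaces; the `Node` and `Pod` types, the plugin functions, and the score values below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    free_gpus: int
    spot: bool


@dataclass
class Pod:
    name: str
    cpu_millis: int
    gpus: int = 0


def filter_fits_resources(pod, node):
    """Filter plugin: reject nodes that cannot hold the pod."""
    return node.free_cpu_millis >= pod.cpu_millis and node.free_gpus >= pod.gpus


def score_prefer_spot(pod, node):
    """Score plugin: cost-aware preference for spot capacity."""
    return 100 if node.spot else 0


def schedule(pod, nodes, filters, scorers):
    """Run all filters, then pick the feasible node with the highest total score."""
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # the pod would stay Pending
    best = max(feasible, key=lambda n: sum(s(pod, n) for s in scorers))
    return best.name


nodes = [
    Node("node-a", free_cpu_millis=500, free_gpus=0, spot=True),
    Node("node-b", free_cpu_millis=4000, free_gpus=2, spot=False),
    Node("node-c", free_cpu_millis=2000, free_gpus=1, spot=True),
]
placed = schedule(Pod("train", 1000, gpus=1), nodes, [filter_fits_resources], [score_prefer_spot])
unplaced = schedule(Pod("huge", 1000, gpus=8), nodes, [filter_fits_resources], [score_prefer_spot])
print(placed, unplaced)
```

Filters are hard constraints and scores are preferences: `node-a` is spot but is removed by the filter before scoring, so the preference never gets a chance to override feasibility.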
How does Kubernetes handle node failure? What is the difference between node NotReady and pod eviction?
When the node controller has not seen a heartbeat from a node for the node-monitor-grace-period (40s by default in most releases), it sets the node's Ready condition to Unknown (shown as NotReady) and taints the node with node.kubernetes.io/unreachable:NoExecute. Pods tolerate that taint for 300 seconds by default (tolerationSeconds, injected at admission; this replaced the older pod-eviction-timeout flag), after which they are evicted. The ReplicaSet controller behind each Deployment then creates replacements on healthy nodes. In EKS/GKE, the cloud provider often replaces the node before this timeout triggers. PDB constraints are respected only for voluntary evictions; a hard node failure bypasses PDB.
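The resulting worst-case timeline is plain addition. A small sketch, assuming the 40-second grace period and 5-minute eviction delay quoted above; the function and parameter names are hypothetical.

```python
def failure_timeline(last_heartbeat_s, grace_period_s=40, eviction_delay_s=300):
    """Return (time the node is marked NotReady, time its pods are evicted)."""
    not_ready_at = last_heartbeat_s + grace_period_s
    evicted_at = not_ready_at + eviction_delay_s
    return not_ready_at, evicted_at


not_ready_at, evicted_at = failure_timeline(last_heartbeat_s=0)
print(not_ready_at, evicted_at)

# A latency-sensitive workload could tolerate the taint for only 30 seconds.
fast_not_ready_at, fast_evicted_at = failure_timeline(0, eviction_delay_s=30)
print(fast_not_ready_at, fast_evicted_at)
```

With the defaults, replacements start almost six minutes after the last heartbeat, which is why workloads that cannot wait that long shorten the delay per pod.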
How does a Kubernetes Ingress controller work? What's the difference between Ingress and the Gateway API?
An Ingress controller is a controller that watches Ingress objects and configures an actual reverse proxy (nginx, Envoy, HAProxy) to route L7 HTTP traffic. The Ingress API is intentionally limited: host and path routing plus TLS termination, with anything richer pushed into controller-specific annotations. The Gateway API (Kubernetes SIG-Network, v1.0 GA in October 2023, installed as CRDs rather than tied to a Kubernetes release) is the successor: richer routing (headers, weights, cross-namespace references, TCP/UDP via additional route types), role-oriented resources (GatewayClass and Gateway for infrastructure teams, HTTPRoute for app teams), and expressiveness comparable to Istio VirtualService. Use Gateway API for new deployments; keep Ingress for mature stacks with existing nginx-ingress configurations.
| Dimension | Weak answer | Strong answer |
|---|---|---|
| kubectl apply → pod | "API server deploys it to a node" | Names every step: validate → etcd → Deployment ctrl → ReplicaSet ctrl → scheduler filter/score/bind → kubelet CRI → CNI network → readiness gates Service endpoint |
| Control plane | Lists the 5 components | Explains that ALL communication goes through API server; etcd watch as pub/sub; resourceVersion as optimistic concurrency |
| Scheduler at 10K nodes | "It assigns pods to nodes" | Two-phase filter+score pipeline; sampling (percentageOfNodesToScore); parallel goroutines; Binding as CAS; affinity / anti-affinity trade-offs |
| etcd schema | Vague key-value store | Key format /registry/{type}/{ns}/{name}; watch API as long-lived gRPC stream; resourceVersion CAS; informer pattern; why PostgreSQL is harder (LISTEN/NOTIFY vs native watch) |
| Autoscaling | "HPA scales pods" | Differentiates HPA/VPA/Cluster Autoscaler; explains cooldown periods; KEDA for external metrics; scale-to-zero; PDB protection during scale-down |
| Networking | "Services have a virtual IP" | CNI flat network assumption; kube-proxy iptables DNAT; EndpointSlice controller excludes not-ready pods; headless Services for StatefulSets; Ingress vs Gateway API |
| RBAC | "Namespace isolation" | ServiceAccount bound to Role via RoleBinding; least-privilege secrets access; NetworkPolicy as L3/L4 microsegmentation; admission webhooks for policy; encryption at rest via KMS provider |
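The "resourceVersion as optimistic concurrency" point from the table can be illustrated with a compare-and-swap store. This is a hedged sketch: `Store` and `Conflict` are hypothetical names, and the real API server signals the same condition with an HTTP 409 Conflict response.

```python
class Conflict(Exception):
    """Raised when the caller's resourceVersion is stale."""


class Store:
    def __init__(self):
        self._data = {}     # name -> (resource_version, value)
        self._revision = 0  # monotonically increasing, shared by all keys

    def get(self, name):
        return self._data[name]

    def create(self, name, value):
        self._revision += 1
        self._data[name] = (self._revision, value)
        return self._revision

    def update(self, name, value, expected_version):
        current_version, _ = self._data[name]
        # Compare-and-swap: write only if nobody else has written since our read.
        if current_version != expected_version:
            raise Conflict(f"{name}: stored {current_version}, caller sent {expected_version}")
        self._revision += 1
        self._data[name] = (self._revision, value)
        return self._revision


store = Store()
store.create("web", {"replicas": 3})

# Two controllers read the same version, then both try to write.
version_a, _ = store.get("web")
version_b, _ = store.get("web")
store.update("web", {"replicas": 5}, version_a)
try:
    store.update("web", {"replicas": 1}, version_b)
    second_write_applied = True
except Conflict:
    second_write_applied = False

print(store.get("web"), second_write_applied)
```

The losing writer is expected to re-read the object and retry against the new version, so no update is silently overwritten and no lock is held between read and write.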
Even engineers who understand Kubernetes conceptually make these operational mistakes. Recognise them before your interviewer does.
Set both requests (for scheduler bin-packing) and limits (for cgroup enforcement). For production workloads, set requests == limits (Guaranteed QoS class) to prevent throttling surprises.
Every system design interview expects trade-offs. Know when Kubernetes is the wrong choice, and be able to say so clearly.
When an interviewer asks "Why does Kubernetes exist?" you need to be able to answer in 60 seconds — clearly, with the insight, not just the features.
| The One-Sentence Version |
|---|
| Kubernetes is a reconciliation engine: you declare desired state, and it continuously closes the gap between what you want and what is running — even as machines fail, containers crash, and traffic spikes. |
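The reconciliation idea fits in a few lines. The sketch below is a toy model, assuming pods are just names in a set; `reconcile` and `apply_actions` are hypothetical helpers, not controller-manager code.

```python
def reconcile(desired_replicas, running_pods, next_id):
    """Return the actions that move the running set toward the desired count."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        return [("create", f"pod-{next_id + i}") for i in range(diff)]
    if diff < 0:
        return [("delete", name) for name in sorted(running_pods)[:-diff]]
    return []  # actual state already matches desired state


def apply_actions(running_pods, actions):
    """Apply create/delete actions and return the new set of running pods."""
    pods = set(running_pods)
    for verb, name in actions:
        if verb == "create":
            pods.add(name)
        else:
            pods.discard(name)
    return pods


# Initial rollout: nothing is running, three replicas are desired.
pods = apply_actions(set(), reconcile(3, set(), next_id=0))

# A container crashes; the next pass notices the gap and closes it.
pods.discard("pod-1")
actions = reconcile(3, pods, next_id=3)
pods = apply_actions(pods, actions)
print(actions, sorted(pods))
```

The loop never records what went wrong; it only compares desired with actual on every pass, which is what makes the same code handle crashes, node loss, and scale changes.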
When an interviewer asks about Kubernetes, they're probing four levels of understanding. Here's what they want to hear at each level.
Kubernetes didn't appear in a vacuum. It is the product of a decade of industry convergence around the right abstraction for running software at scale.
Kubernetes solved the orchestration problem. Each solution created the next problem. Here is the chain of constraints as the industry has moved up the abstraction ladder.
| Layer | Problem Solved | Key Technologies | Status |
|---|---|---|---|
| Containers | Environment consistency | Docker, OCI, containerd | Mature |
| Orchestration | Running containers at scale | Kubernetes, EKS/GKE/AKS | Industry standard |
| Service Mesh | Observability, mTLS, traffic control | Istio, Linkerd, Cilium | Maturing |
| GitOps | Deployment consistency | ArgoCD, Flux, Crossplane | Maturing |
| Platform Engineering | Developer self-service | Backstage, Port, Cortex | Early mainstream |
| Policy-as-Code | Governance at scale | OPA, Kyverno, Styra | Growing |
| AI-Native Infra | Autonomous toil elimination | k8sgpt, Robusta, AI agents | Emerging |