Design a container runtime and image registry at Docker/GitHub Container Registry scale — what a container image actually is, how OverlayFS layers stack and copy-on-write, the OCI image spec and content-addressable blob storage, the push/pull protocol, container networking from veth pairs to VXLAN overlays, what runc and containerd actually do when you run docker run, image security scanning, and how you serve 10 billion pulls per day with CDN layers, P2P distribution, and lazy-loading.
Containers did not appear in a vacuum. Each step in infrastructure history created a new problem that the next step solved — until containers created their own problem: scale management.
| Question | 60-second answer |
|---|---|
| What is a container? | A process with its own filesystem (OverlayFS), network namespace, and cgroup limits — not a VM; shares the host kernel. |
| What is an image? | A stack of read-only content-addressable layers (SHA256 blobs) plus a JSON config — described by an OCI manifest. |
| How does a registry work? | Content-addressable blob store + manifest index. Pull: check local cache → HEAD blob → GET blob chunks. Deduplication is free — same SHA256 blob serves thousands of images. |
| How is networking isolated? | Linux network namespaces + veth pairs + a bridge (docker0) + iptables NAT. Multi-host: VXLAN overlay encapsulation. |
| 10B pulls/day? | CDN edge caching of blobs by SHA256, P2P distribution (Dragonfly/Kraken), lazy loading (eStargz streaming pulls), multi-region blob replication. |
A container looks like isolation. It is, in fact, a regular Linux process using kernel namespaces, cgroups, and a union filesystem to create the illusion of a private machine — without the hypervisor overhead that makes VMs take 30+ seconds to boot.
"Design a container runtime and image registry at Docker/GitHub Container Registry scale. Walk me through what a container image actually is, how layers work, how networking is isolated, and how you'd build a registry that serves 10 billion pulls per day."
The question catches most candidates because containers sit at an uncomfortable intersection: it's operating systems, distributed storage, networking, and a high-scale CDN problem all at once. A weak answer draws a box labeled "Docker" and describes docker run. A strong answer names the four forces that make the design hard — then shows how every later decision exists to survive one of them.
…docker run command itself.
Envelope math, volunteered:
| Quantity | Estimate | Consequence |
|---|---|---|
| Docker Hub pulls / day | ~10B | CDN is not optional; origin cannot absorb this traffic |
| Avg image size (layers) | ~200 MB (compressed) | 10B pulls × 200 MB = 2 EB/day data transfer (CDN cache hit rate must be >99%) |
| Distinct layer blobs (Docker Hub) | ~50–100M | Content-addressed; each stored once regardless of how many images reference it |
| Typical layer count per image | 5–15 layers | Each layer is a separate blob pull; shallow caches help less than deep blob deduplication |
| Image pull latency target (CI) | <10 s | Warm cache: check-layer-exists (HEAD) → skip unchanged layers; cold: parallel chunk download |
| Registry HA requirement | 99.99% uptime | ~52 min/year downtime; multi-region active-active with blob replication |
| Garbage collection window | 7–30 days | Dangling layers deleted after all manifests referencing them are GC'd; soft-delete then sweep |
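The arithmetic behind the table can be checked in a few lines. This is a minimal sketch using decimal units; the 99% hit rate is an assumed floor taken from the table's own requirement, not a measured figure.

```python
# Back-of-envelope check of the figures in the table above (decimal units).
PULLS_PER_DAY = 10_000_000_000
AVG_IMAGE_BYTES = 200 * 10**6        # ~200 MB compressed
CDN_HIT_RATE = 0.99                  # assumption: the table asks for >99%
SECONDS_PER_DAY = 86_400

total_bytes = PULLS_PER_DAY * AVG_IMAGE_BYTES
origin_bytes = total_bytes * (1 - CDN_HIT_RATE)
origin_gbit_per_s = origin_bytes * 8 / SECONDS_PER_DAY / 10**9
downtime_min_per_year = (1 - 0.9999) * 365 * 24 * 60

print(f"total transfer: {total_bytes / 10**18:.1f} EB/day")
print(f"origin egress at 99% hit rate: {origin_bytes / 10**15:.0f} PB/day")
print(f"origin sustained bandwidth: {origin_gbit_per_s:.0f} Gbit/s")
print(f"99.99% downtime budget: {downtime_min_per_year:.1f} min/year")
```

Even at a 99% hit rate the origin still carries on the order of terabits per second, which is why the hit-rate target is stated as a hard requirement.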
Before the engineering, the mental model. Three analogies that actually hold up under scrutiny.
A virtual machine is a full computer running inside your computer: it has its own kernel, its own memory allocator, its own device drivers. Starting one takes 30 seconds or more because the OS is actually booting. A container is not a virtual machine. It is a process that thinks it has its own computer: it runs directly on the host kernel, uses the host kernel's network stack (through its own network namespace), and draws on the host's memory subject to cgroup limits. The kernel creates an illusion of isolation using namespaces. The illusion is convincing, but it is not a hard wall.
| | Virtual Machine | Container |
|---|---|---|
| Kernel | Its own (Guest OS) | Shared host kernel |
| Startup time | 30–120 seconds | 100–500 ms |
| Memory overhead | 100–500 MB (OS overhead) | ~1 MB (just the process) |
| Isolation level | Hardware-level (hypervisor) | Kernel-level (namespaces) |
| Security boundary | Very hard to escape | Kernel vulnerabilities can escape |
| Use case | Full OS, different kernels, strong multi-tenancy | Microservices, CI, ephemeral workloads |
Before Docker, deploying software meant: (1) install the right version of Java/Python/Node on the server, (2) hope it matches your laptop, (3) debug the production difference at 2 AM. Docker's insight was simple: ship the filesystem, not just the code. Package the app, its runtime, its libraries, and its configuration into an image — and that image runs identically on your laptop, in CI, and in production. The image is the deployment unit. "Works on my machine" becomes "this is the machine."
Before standardized shipping containers, loading a cargo ship meant custom-stacking thousands of different-shaped parcels — slow, lossy, requiring specialists at every port. After: everything fits into an ISO-standard container. The crane operator doesn't care what's inside. The port doesn't care about the contents. The ship is fully interchangeable.
Docker containers work identically. The runtime (containerd/runc) doesn't care if your container runs Python, Go, or a Postgres database. The orchestrator (Kubernetes) doesn't care about your app. The cloud (ECS, GKE, AKS) doesn't care about your runtime. The standard interface — OCI Image Spec + OCI Runtime Spec — is the interoperability layer, exactly like ISO 668 for shipping.
An image is a stack of immutable, content-addressed filesystem layers plus a JSON config. The OCI Image Spec formalizes what Docker invented. Understanding the anatomy is the prerequisite for understanding the registry, the runtime, and security.
When you run a container, Linux mounts a union filesystem using OverlayFS. OverlayFS takes multiple directory trees ("lower layers") and presents them as a single unified view. Each image layer is a tar archive of filesystem changes. Layers are stacked bottom-up. A container gets a writable "upper" layer on top — and copy-on-write means files from lower layers are only copied to the writable layer when modified.
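The lookup and copy-on-write rules can be modeled with plain dictionaries. This is a toy sketch, not the kernel implementation: `OverlayView`, `WHITEOUT`, and the sample paths are hypothetical names, and real OverlayFS works on directory trees with whiteout device files rather than Python objects.

```python
WHITEOUT = object()  # marks a path as deleted by the upper layer

class OverlayView:
    """Toy model of an OverlayFS mount: read-only lowers plus one writable upper."""

    def __init__(self, lowers):
        self.lowers = lowers   # list of dicts, bottom layer first; never mutated
        self.upper = {}        # per-container writable layer

    def read(self, path):
        # Search top-down: upper first, then lowers from newest to oldest.
        for layer in [self.upper, *reversed(self.lowers)]:
            if path in layer:
                data = layer[path]
                return None if data is WHITEOUT else data
        return None

    def append(self, path, data):
        # Copy-up: the first modification copies the file into the upper layer.
        if path not in self.upper or self.upper[path] is WHITEOUT:
            self.upper[path] = self.read(path) or b""
        self.upper[path] += data

    def delete(self, path):
        # Lower layers are immutable, so deletion is recorded as a whiteout.
        self.upper[path] = WHITEOUT

base = {"/etc/os-release": b"ubuntu"}
app = {"/app/main.py": b"print(1)"}
view = OverlayView([base, app])
view.append("/etc/os-release", b"!")
view.delete("/app/main.py")
```

After these calls the container sees the modified `/etc/os-release` and no `/app/main.py`, while both image layers are byte-for-byte unchanged and can still be shared with other containers.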
…docker images. It does not describe what the image does; it describes how to run it.
# OCI Manifest (simplified)
{
"schemaVersion": 2,
"mediaType": "application/vnd.oci.image.manifest.v1+json",
"config": {
"mediaType": "application/vnd.oci.image.config.v1+json",
"digest": "sha256:abc123...",
"size": 7023
},
"layers": [
{ "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:1a2b3c...", "size": 30428672 }, // ubuntu base
{ "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:8b1f2e...", "size": 45678901 }, // python runtime
{ "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:3c8a9d...", "size": 12345678 }, // app deps
{ "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:f9e2a1...", "size": 234567 } // app code
]
}
# The digest is the SHA256 of the compressed layer tar.
# The diffID (in config) is the SHA256 of the uncompressed tar.
# Content-addressable: if sha256 matches, the blob is valid — no trust required.
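The difference between a layer's digest and its diffID is easy to see with the standard library. A minimal sketch, assuming a stand-in byte string in place of a real layer tar; `sha256_digest` and `verify_blob` are illustrative helper names.

```python
import gzip
import hashlib

def sha256_digest(data: bytes) -> str:
    """Format a digest the way OCI manifests do: 'sha256:<hex>'."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

def verify_blob(expected_digest: str, blob: bytes) -> bool:
    """A client can verify any blob from any source by rehashing it."""
    return sha256_digest(blob) == expected_digest

layer_tar = b"stand-in for an uncompressed layer tar"  # hypothetical content
compressed = gzip.compress(layer_tar, mtime=0)          # fixed mtime keeps bytes reproducible

digest = sha256_digest(compressed)   # goes in the manifest's "layers" entry
diff_id = sha256_digest(layer_tar)   # goes in the config's rootfs.diff_ids

print("digest :", digest)
print("diffID :", diff_id)
print("valid  :", verify_blob(digest, compressed))
```

Because the digest covers the compressed bytes, recompressing the same tar with different settings yields a new digest but the same diffID.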
A container registry is a content-addressable blob store plus a metadata index. The OCI Distribution Spec formalizes the API. The interesting engineering is deduplication, CDN caching, multi-arch support, and garbage collection.
Every blob in a registry is addressed by its SHA256 digest: sha256:1a2b3c.... This has three profound consequences:
# docker manifest inspect ubuntu:22.04 --verbose
# An OCI Image Index ("fat manifest") points to per-platform manifests
{
"mediaType": "application/vnd.oci.image.index.v1+json",
"manifests": [
{ "digest": "sha256:amd64...", "platform": {"os":"linux","architecture":"amd64"} },
{ "digest": "sha256:arm64...", "platform": {"os":"linux","architecture":"arm64"} },
{ "digest": "sha256:armv7...", "platform": {"os":"linux","architecture":"arm","variant":"v7"} }
]
}
# docker pull resolves the correct manifest for the host platform automatically.
# buildx --platform linux/amd64,linux/arm64 builds and pushes all platforms in one push.
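Platform resolution against an image index can be sketched as a simple match on the `platform` object. This is a simplified illustration: `select_manifest` is a hypothetical helper, and real clients apply more lenient matching rules (default variants, OS version, fallbacks) than the exact comparison used here.

```python
def select_manifest(index, os_name, arch, variant=None):
    """Return the digest of the first manifest whose platform matches exactly."""
    for entry in index["manifests"]:
        platform = entry["platform"]
        if (platform["os"] == os_name
                and platform["architecture"] == arch
                and platform.get("variant") == variant):
            return entry["digest"]
    return None  # no image published for this platform

index = {
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {"digest": "sha256:amd64...", "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:arm64...", "platform": {"os": "linux", "architecture": "arm64"}},
        {"digest": "sha256:armv7...",
         "platform": {"os": "linux", "architecture": "arm", "variant": "v7"}},
    ],
}

print(select_manifest(index, "linux", "arm64"))
print(select_manifest(index, "linux", "arm", "v7"))
```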
Container networking is Linux networking — namespaces, veth pairs, bridges, iptables NAT. Multi-host adds VXLAN overlay encapsulation. Understanding the primitives is the prerequisite for debugging any networking issue in Kubernetes.
…ip netns exec container1 ip addr shows only that container's interfaces. The host namespace and container namespace are completely separate: a port conflict in one is invisible to the other.
…eth0); the other is placed in the host namespace (appears as vethXXXX) and attached to the docker0 bridge. Traffic flows between them: whatever enters one end exits the other. This is the only "wire" between the container and the host network.
…172.17.0.1/16. All container veth pairs attach here. Container-to-container traffic within the same bridge is direct: no NAT, no IP routing, just a virtual switch forward. The bridge also has an IP on the host, allowing the host to reach containers.
…-p 8080:80): iptables DNAT rewrites destination from host:8080 to 172.17.0.x:80. This is why docker run -p works: it's just iptables rules.
Single-host bridge networking doesn't cross machines. For multi-host clusters (Docker Swarm, Kubernetes), an overlay network encapsulates container frames inside UDP packets using VXLAN (Virtual Extensible LAN). When container 10.244.1.2 on host A talks to container 10.244.2.3 on host B, the frame is wrapped in a VXLAN UDP packet (destination: host B's physical IP, port 4789) and unwrapped on arrival. Flannel, Calico, Cilium, and Weave each offer a variation of this encapsulation model, and some also support unencapsulated routing.
# VXLAN packet structure (simplified)
Outer Ethernet frame:
src MAC: host-A NIC
dst MAC: host-B NIC
Outer IP header:
src: 10.0.0.1 (host A physical IP)
dst: 10.0.0.2 (host B physical IP)
UDP header:
dst port: 4789 (VXLAN)
VXLAN header:
VNI: 100 (virtual network identifier — which overlay network)
Inner Ethernet frame:
src MAC: veth in container-A namespace
dst MAC: veth in container-B namespace
Inner IP:
src: 10.244.1.2 (pod/container IP on host A)
dst: 10.244.2.3 (pod/container IP on host B)
# The kernel's VXLAN driver on host B strips the outer headers and
# delivers the inner frame to the correct container namespace.
# To the container, it looks like a normal Ethernet packet.
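The VXLAN header itself is only 8 bytes, and packing it by hand makes the layout concrete. A minimal sketch following the RFC 7348 layout (8 flag bits, 24 reserved bits, 24-bit VNI, 8 reserved bits); the function names are illustrative, and the overhead figure assumes an IPv4 underlay without VLAN tags.

```python
import struct

VXLAN_PORT = 4789
VXLAN_FLAG_VNI_VALID = 0x08   # the "I" flag: VNI field is valid

# Outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN (8)
VXLAN_OVERHEAD = 14 + 20 + 8 + 8

def vxlan_header(vni: int) -> bytes:
    """Pack the 8-byte VXLAN header for a 24-bit virtual network identifier."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)

def parse_vni(header: bytes) -> int:
    """Extract the VNI, rejecting headers without the VNI-valid flag."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return vni_word >> 8

header = vxlan_header(100)
print(header.hex())                 # 0800000000006400
print(parse_vni(header))            # 100
print(1500 - VXLAN_OVERHEAD)        # 1450
```

The 50 bytes of encapsulation are why overlay interfaces are commonly configured with an MTU of 1450 on a 1500-byte underlay.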
…curl http://db:5432) resolves via Docker's embedded DNS, which maps service names to container IPs on user-defined networks (the default bridge does not get automatic name resolution). In Kubernetes, CoreDNS does the same job for pod-to-service resolution: my-service.default.svc.cluster.local → 10.96.x.x.
The interesting design is layer deduplication (blobs are shared across images), garbage collection (collect blobs only when no manifest references them), quota enforcement per repository, and pull event analytics. Each table serves a distinct operational requirement.
-- Repositories (e.g. library/ubuntu, ghcr.io/org/app)
CREATE TABLE repository (
repo_id BIGSERIAL PRIMARY KEY,
registry TEXT NOT NULL, -- 'docker.io', 'ghcr.io', 'gcr.io'
namespace TEXT NOT NULL, -- 'library', 'myorg'
name TEXT NOT NULL, -- 'ubuntu', 'myapp'
is_public BOOLEAN NOT NULL DEFAULT TRUE,
owner_id BIGINT NOT NULL,
storage_bytes BIGINT NOT NULL DEFAULT 0, -- updated by trigger on blob insert
quota_bytes BIGINT, -- NULL = unlimited
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE (registry, namespace, name)
);
-- Content-addressed blobs — shared across ALL images/repositories
CREATE TABLE blob (
blob_id BIGSERIAL PRIMARY KEY,
digest TEXT NOT NULL UNIQUE, -- sha256:hex — THE primary key conceptually
media_type TEXT NOT NULL, -- 'application/vnd.oci.image.layer.v1.tar+gzip' etc.
size_bytes BIGINT NOT NULL,
storage_path TEXT NOT NULL, -- s3://blobs/{digest[0:2]}/{digest[2:4]}/{digest}
uploaded_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
last_referenced_at TIMESTAMPTZ, -- updated on pull; used by GC
gc_eligible BOOLEAN NOT NULL DEFAULT FALSE -- set true when ref_count → 0
);
-- OCI Manifests (one per image+platform)
CREATE TABLE manifest (
manifest_id BIGSERIAL PRIMARY KEY,
repo_id BIGINT NOT NULL REFERENCES repository(repo_id),
digest TEXT NOT NULL, -- sha256 of the manifest JSON
media_type TEXT NOT NULL, -- OCI manifest or manifest list
raw_json JSONB NOT NULL, -- full manifest content
config_digest TEXT REFERENCES blob(digest), -- null for manifest lists
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE (repo_id, digest)
);
-- Manifest ↔ blob association (many-to-many — core deduplication)
CREATE TABLE manifest_blob (
manifest_id BIGINT NOT NULL REFERENCES manifest(manifest_id),
blob_id BIGINT NOT NULL REFERENCES blob(blob_id),
layer_order INT NOT NULL, -- 0-based order in manifest layers array
PRIMARY KEY (manifest_id, blob_id)
);
-- Tags (mutable pointers to manifests)
CREATE TABLE tag (
tag_id BIGSERIAL PRIMARY KEY,
repo_id BIGINT NOT NULL REFERENCES repository(repo_id),
name TEXT NOT NULL, -- 'latest', 'v1.2.3', 'main'
manifest_id BIGINT NOT NULL REFERENCES manifest(manifest_id),
pushed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
pushed_by TEXT NOT NULL,
UNIQUE (repo_id, name)
);
-- Image index (multi-arch / manifest list → per-platform manifests)
CREATE TABLE image_index_entry (
parent_manifest_id BIGINT NOT NULL REFERENCES manifest(manifest_id),
child_manifest_id BIGINT NOT NULL REFERENCES manifest(manifest_id),
platform_os TEXT NOT NULL, -- 'linux'
platform_arch TEXT NOT NULL, -- 'amd64', 'arm64', 'arm'
platform_variant TEXT, -- 'v7' for arm/v7
PRIMARY KEY (parent_manifest_id, child_manifest_id)
);
-- Pull events (analytics + rate limiting)
CREATE TABLE pull_event (
event_id BIGSERIAL,
repo_id BIGINT NOT NULL REFERENCES repository(repo_id),
manifest_id BIGINT REFERENCES manifest(manifest_id),
tag_name TEXT,
puller_id BIGINT, -- authenticated user / org, null for anon
client_ip INET NOT NULL,
pulled_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
bytes_transferred BIGINT NOT NULL DEFAULT 0,
was_cache_hit BOOLEAN NOT NULL DEFAULT FALSE,
PRIMARY KEY (event_id, pulled_at) -- PostgreSQL requires the partition key in any PK on a partitioned table
) PARTITION BY RANGE (pulled_at); -- monthly partitions; archive to S3 after 90 days
CREATE TABLE pull_event_2026_01 PARTITION OF pull_event
FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
-- Indexes
CREATE INDEX idx_blob_digest ON blob (digest);
CREATE INDEX idx_manifest_repo ON manifest (repo_id, created_at DESC);
CREATE INDEX idx_tag_repo_name ON tag (repo_id, name);
CREATE INDEX idx_manifest_blob_blob ON manifest_blob (blob_id); -- for GC ref-count
CREATE INDEX idx_pull_event_repo ON pull_event (repo_id, pulled_at DESC);
CREATE INDEX idx_blob_gc ON blob (gc_eligible, last_referenced_at)
WHERE gc_eligible = TRUE;
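-- Garbage collection query: mark blobs that no manifest references,
-- either as a layer (manifest_blob) or as an image config (manifest.config_digest)
-- Run periodically after tag deletes / manifest deletes
UPDATE blob b SET gc_eligible = TRUE
WHERE NOT EXISTS (SELECT 1 FROM manifest_blob mb WHERE mb.blob_id = b.blob_id)
AND NOT EXISTS (SELECT 1 FROM manifest m WHERE m.config_digest = b.digest)
AND b.uploaded_at < NOW() - INTERVAL '7 days'; -- soft-delete grace period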
…gc_eligible = TRUE, then a GC worker sweeps blobs with no references and no recent push. This prevents races between concurrent push and delete operations.
Quota enforcement: storage_bytes on repository is updated by a trigger on manifest_blob insert. Before accepting a push, check storage_bytes + new_layer_sizes <= quota_bytes. Deduplication means shared layers don't count against individual repo quotas; only layers unique to that repo do.
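The mark phase of the soft-delete garbage collector can be modeled in memory. This is a sketch under stated assumptions: `sweepable_blobs` is a hypothetical helper, blobs are represented as a digest-to-upload-time mapping, and config blobs are counted as references alongside layers.

```python
from datetime import datetime, timedelta, timezone

def sweepable_blobs(blobs, manifests, now, grace=timedelta(days=7)):
    """Return digests that no manifest references and that are older than the grace period.

    blobs:     {digest: uploaded_at}
    manifests: [{"config": digest, "layers": [digest, ...]}, ...]
    """
    referenced = set()
    for manifest in manifests:
        referenced.add(manifest["config"])      # config blobs are references too
        referenced.update(manifest["layers"])
    return {
        digest
        for digest, uploaded_at in blobs.items()
        if digest not in referenced and now - uploaded_at > grace
    }

now = datetime(2026, 1, 31, tzinfo=timezone.utc)
blobs = {
    "sha256:base": now - timedelta(days=90),    # shared layer, still referenced
    "sha256:cfg": now - timedelta(days=90),     # config of the live manifest
    "sha256:old": now - timedelta(days=30),     # orphaned long ago
    "sha256:fresh": now - timedelta(hours=1),   # upload in flight, manifest not pushed yet
}
manifests = [{"config": "sha256:cfg", "layers": ["sha256:base"]}]
print(sweepable_blobs(blobs, manifests, now))
```

The grace period is what protects `sha256:fresh`: a client uploads blobs before it pushes the manifest that references them, so a brand-new unreferenced blob is expected, not garbage.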
…docker run
Eight steps from command to running process. Each step is a distinct subsystem: registry pull, layer unpacking, rootfs construction, namespace setup, cgroup limits, and exec. runc is the last mile: it executes exactly one process.
| Component | Responsibility | Interface |
|---|---|---|
| runc | OCI runtime — creates namespaces, applies cgroup limits, execs PID 1. Does exactly one thing: start a container. Exits afterward. | OCI Runtime Spec (config.json) |
| containerd | Container lifecycle, image pull/store, snapshot management, shim management. Does NOT know about pods or services. | gRPC CRI (from Kubernetes), containerd client API |
| CRI | Container Runtime Interface — the Kubernetes API for talking to container runtimes. Kubelet speaks CRI; containerd implements it. | gRPC (RunPodSandbox, CreateContainer, StartContainer…) |
| dockerd | Docker daemon — legacy wrapper that translates Docker API calls to containerd calls. Not needed in Kubernetes (kubelet speaks CRI directly). | Docker Engine API (HTTP/Unix socket) |
| containerd shim | Per-container process that bridges containerd and runc. Keeps running after runc exits to hold stdio and report exit status. | ttrpc (a lightweight gRPC-style protocol between containerd and the shim) |
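What runc consumes is a bundle directory containing a rootfs and a `config.json`. The sketch below builds a heavily trimmed config to show where namespaces, cgroup limits, and the PID 1 command live; `minimal_runtime_config` is a hypothetical helper, the values are illustrative, and a usable bundle also needs mounts such as /proc and /dev that are omitted here.

```python
import json

def minimal_runtime_config(args, memory_limit_bytes, rootfs="rootfs"):
    """Build a trimmed OCI runtime config (field names follow the OCI Runtime Spec)."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "terminal": False,
            "user": {"uid": 1001, "gid": 1001},   # non-root, illustrative IDs
            "args": list(args),                    # becomes PID 1 in the container
            "cwd": "/",
            "noNewPrivileges": True,
        },
        "root": {"path": rootfs, "readonly": True},
        "linux": {
            # One entry per namespace runc should create for the process.
            "namespaces": [
                {"type": t} for t in ("pid", "network", "mount", "ipc", "uts")
            ],
            # Translated by runc into cgroup settings.
            "resources": {"memory": {"limit": memory_limit_bytes}},
        },
    }

config = minimal_runtime_config(["/app/server"], 256 * 1024 * 1024)
print(json.dumps(config, indent=2))
```

containerd generates a file of this shape from the image config plus the caller's options, then hands the bundle path to runc through the shim.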
Container security has two distinct layers: image security (what's in the image before it runs) and runtime security (what a running container can do). Both require explicit design decisions — the defaults are permissive.
Image scanning tools (Trivy, Grype, Snyk Container) inspect the image's layer contents for known CVEs in OS packages (apt/yum), language runtimes (pip, npm, maven), and application dependencies. They do not run the image — they parse the filesystem statically.
| | Trivy | Grype |
|---|---|---|
| Vulnerability DB | NVD, GHSA, OS advisories (Ubuntu, Alpine, RHEL…) | NVD, GHSA + Anchore Feed |
| SBOM output | CycloneDX, SPDX, Syft JSON | CycloneDX, SPDX |
| CI integration | GitHub Actions, GitLab CI, Tekton, GitHub Advanced Security | GitHub Actions, Jenkins |
| Secret scanning | Yes (finds hardcoded secrets in image layers) | No |
| Registry scan (without pull) | Yes (remote scan via registry API) | Yes |
An image tag (nginx:latest) is mutable — someone can push a new image and overwrite the tag. Content trust creates a cryptographic signature over a manifest digest, stored alongside the image in the registry. The consumer verifies the signature before pulling.
# cosign (Sigstore) — keyless signing via OIDC (no long-lived key)
# Sign in CI with GitHub Actions OIDC token:
cosign sign --yes ghcr.io/myorg/myapp@sha256:abc123...
# cosign uploads a signature as a separate OCI artifact in the registry
# Verify before pull (in production admission controller):
# The certificate identity is the workflow path plus the ref that triggered it (refs/heads/main assumed here)
cosign verify --certificate-identity=https://github.com/myorg/myapp/.github/workflows/build.yaml@refs/heads/main \
--certificate-oidc-issuer=https://token.actions.githubusercontent.com \
ghcr.io/myorg/myapp:latest
# Notary v2 (nv2) — policy-based: only trusted images run
# OPA Gatekeeper or Kyverno enforces policy at admission:
# "any Pod spec.containers[].image must have a valid cosign signature"
| Control | How | What it prevents |
|---|---|---|
| Read-only root filesystem | docker run --read-only or securityContext.readOnlyRootFilesystem: true | Malware can't write to the container's root FS, which blocks many persistence and pivot techniques |
| Non-root user | Dockerfile USER 1001 or securityContext.runAsNonRoot: true | Reduces blast radius if the container is compromised; without user namespaces, root in the container is root on the host once an escape path is found |
| No new privileges | --security-opt=no-new-privileges or securityContext.allowPrivilegeEscalation: false | Blocks setuid binaries from escalating privileges inside the container |
| Seccomp profile | Docker default seccomp blocks ~44 syscalls. RuntimeDefault in Kubernetes. | Limits available kernel attack surface; blocks ptrace, kexec_load, etc. |
| Drop capabilities | --cap-drop=ALL --cap-add=NET_BIND_SERVICE | Removes CAP_SYS_ADMIN, CAP_NET_ADMIN, etc. — the dangerous capabilities that enable container escapes |
| Rootless containers | Podman rootless / rootless containerd / Docker rootless mode | Container root maps to an unprivileged host user via a user namespace, so an escape lands as that unprivileged user, not as host root |
| Distroless images | Google's gcr.io/distroless — no shell, no package manager | No shell = no interactive exploits; nothing to exec into. Attack surface reduced to the app binary only. |
Docker Hub serves ~10B pulls/day. GitHub Container Registry, GCR, and ECR handle similar scale. The architecture has three layers: CDN edge caching, P2P distribution for large-scale cluster pulls, and lazy loading for eliminating unnecessary data transfer entirely.
Blobs are addressed by SHA256 digest — they are immutable and can be cached indefinitely. The registry redirects blob GETs to a CDN presigned URL (S3/GCS + CloudFront/Fastly). Cache hit rates for popular base images (ubuntu, alpine, python, node) exceed 99%. This means the origin registry handles only cache misses — roughly 1% of 10B = 100M requests/day.
# Registry blob GET → 307 redirect to CDN URL
GET /v2/library/ubuntu/blobs/sha256:1a2b3c...
→ 307 Location: https://cdn.hub.docker.com/v2/blobs/sha256:1a2b3c...?X-Amz-Expires=3600&X-Amz-Signature=...
# CDN cache key = sha256 digest (immutable — no TTL needed)
# Popular blobs (ubuntu, alpine base layers) served from CDN PoP nearest to client
# Cold miss → CDN fetches from S3, caches permanently
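One detail the redirect hides: the signed query string differs on every request, so the CDN has to key its cache on the digest in the path and ignore the signature. A minimal sketch of that normalization; `cdn_cache_key` and the `cdn.example.com` host are hypothetical, and how a given CDN is configured to do this varies by vendor.

```python
import re
from urllib.parse import urlsplit

DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")

def cdn_cache_key(url: str) -> str:
    """Derive the cache key from the digest in the URL path, ignoring the query string."""
    match = DIGEST_RE.search(urlsplit(url).path)
    if match is None:
        raise ValueError("URL path does not contain a sha256 digest")
    return match.group(0)

digest = "sha256:" + "ab" * 32   # placeholder digest for illustration
url_a = f"https://cdn.example.com/v2/blobs/{digest}?X-Amz-Expires=3600&X-Amz-Signature=aaa"
url_b = f"https://cdn.example.com/v2/blobs/{digest}?X-Amz-Expires=3600&X-Amz-Signature=bbb"

print(cdn_cache_key(url_a) == cdn_cache_key(url_b))
```

If the signature were part of the cache key, every client would produce a cache miss and the hit rate would collapse toward zero.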
When a Kubernetes cluster autoscales from 10 pods to 1000 pods at once, every node that schedules one of them pulls the same image. Without P2P, up to 1000 nodes hit the CDN/registry simultaneously. With P2P distribution (Dragonfly, which originated at Alibaba, or Uber's Kraken), nodes form a BitTorrent-like swarm: each node that has pulled a chunk seeds it to peers. The registry/CDN serves only the initial seeders, and the swarm handles the fan-out.
| Solution | Architecture | Best for |
|---|---|---|
| Dragonfly (CNCF) | Manager + Scheduler + Seed Peer + Dfdaemon agent on each node. Content splits into P2P chunks. Used by Alibaba, Ant Group. | Large-scale clusters (1000+ nodes), frequent mass-scale deployments |
| Kraken (Uber) | Tracker + Origin + Proxy + Agent. Uses BitTorrent protocol. Written in Go. | Uber's scale — 1M+ container starts/day across hundreds of clusters |
| containerd mirror | Simpler: configure a local registry mirror per cluster (Harbor, Nexus). Node pulls from mirror; mirror pulls from Docker Hub once. | Single cluster, simpler ops, acceptable latency |
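Why a swarm scales is visible in an idealized doubling model: if every node that holds the content uploads to a fixed number of new peers per round, coverage grows geometrically. This is a deliberately simplified sketch (whole-image transfers, uniform bandwidth, perfect scheduling); `rounds_to_saturate` is a hypothetical helper and does not model Dragonfly's or Kraken's actual schedulers.

```python
def rounds_to_saturate(nodes: int, seeders: int = 1, fanout: int = 1) -> int:
    """Count transfer rounds until every node holds the image.

    Each round, every node that already has the image uploads it to
    `fanout` nodes that do not.
    """
    have, rounds = seeders, 0
    while have < nodes:
        have += have * fanout
        rounds += 1
    return rounds

print(rounds_to_saturate(1000))             # 10 rounds with one upload per holder
print(rounds_to_saturate(1000, fanout=3))   # 5 rounds with three uploads per holder
```

The origin serves the image once in this model, compared with 1000 times when every node pulls directly.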
The fundamental insight: most containers start after pulling 30% of their layers, because only the files accessed during startup are needed immediately. eStargz (CNCF Stargz Snapshotter) reformats image layers so individual files are addressable and fetchable on-demand. Container startup begins before the image is fully downloaded.
# eStargz: seekable gzip layer format
# Each file within the layer is compressed independently, so it can be fetched with an HTTP Range request
# Each layer blob ends with a footer that points to a TOC (table of contents) listing file offsets
# Startup latency comparison (Node.js app, 500 MB image):
# Traditional pull: download 500 MB → unpack → start ≈ 60 s on cold node
# eStargz lazy: download TOC (1 MB) → start → fetch on demand ≈ 3 s
# Enable in containerd config.toml:
[proxy_plugins]
[proxy_plugins.stargz]
type = "snapshot"
address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"
# Build eStargz-compatible image:
ctr-remote image optimize --oci \
nginx:alpine \
ghcr.io/myorg/nginx:alpine-sgz
| Component | Architecture | Why |
|---|---|---|
| Blob storage | Multi-region S3/GCS with cross-region replication. Blobs are immutable — eventual consistency is fine. | CDN pulls from nearest region; durability via replication; no single point of failure |
| Manifest/tag DB | PostgreSQL (primary + replicas per region) or CockroachDB for global active-active. Tags are mutable — strong consistency needed for writes. | Tag write must be globally consistent to avoid split-brain (two regions serving different digests for same tag) |
| Registry API | Stateless containers behind ALB in each region. Auto-scale on CPU. | Horizontal scaling; no session state; blob redirect offloads bandwidth |
| CDN | CloudFront / Fastly global PoP network. Cache blobs by SHA256 digest permanently. | Last-mile latency; absorbs 99%+ of blob traffic without touching origin |
| Rate limiting | Redis (or DynamoDB for multi-region) token bucket per IP/authenticated user. Unauthenticated Docker Hub: 100 pulls/6 hrs. | Protect origin from unauthenticated scraping; incentivize authentication |
These separate the staff-level answer from the senior one — layer caching, BuildKit, distroless images, and container escapes.
How does layer caching work in CI, and why does it break?
Dockerfile instructions are cached by their checksum + parent layer digest — if neither changes, the layer is reused. Cache breaks when any earlier layer changes (e.g., COPY . . before RUN pip install means every code change invalidates the pip layer). Fix: put COPY requirements.txt + RUN pip install before COPY . ..
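The invalidation rule can be shown with a chained hash: each step's cache key depends on its parent's key, so a change at step N invalidates every step after it. This is a simplified model, not BuildKit's actual key derivation; `cache_keys`, `reused_steps`, and the content labels are hypothetical.

```python
import hashlib

def cache_keys(steps):
    """steps: list of (instruction, hash_of_input_files). Returns one chained key per step."""
    keys, parent = [], ""
    for instruction, inputs in steps:
        parent = hashlib.sha256(f"{parent}|{instruction}|{inputs}".encode()).hexdigest()
        keys.append(parent)
    return keys

def reused_steps(old_keys, new_keys):
    """Number of leading steps whose cache entries survive."""
    count = 0
    for old, new in zip(old_keys, new_keys):
        if old != new:
            break
        count += 1
    return count

def bad_order(src):
    return [("FROM python:3.12", ""), ("COPY . .", src), ("RUN pip install -r requirements.txt", "")]

def good_order(src, reqs="reqs-v1"):
    return [("FROM python:3.12", ""), ("COPY requirements.txt .", reqs),
            ("RUN pip install -r requirements.txt", ""), ("COPY . .", src)]

# A code-only change: the source hash moves from v1 to v2, requirements stay the same.
print(reused_steps(cache_keys(bad_order("src-v1")), cache_keys(bad_order("src-v2"))))
print(reused_steps(cache_keys(good_order("src-v1")), cache_keys(good_order("src-v2"))))
```

With the bad ordering only the FROM step is reused and pip install reruns; with the good ordering the first three steps are reused and only the final COPY is rebuilt.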
What are multi-stage builds and why do they matter?
Multi-stage builds use multiple FROM instructions — a "builder" stage compiles the binary, a "runtime" stage copies only the compiled artifact. The final image contains no compiler, no build tools, no source code — just the binary and its runtime deps. A Go app goes from 1.2 GB (with Go toolchain) to 15 MB (scratch + binary).
What is a distroless image and when should you use one?
Distroless images (gcr.io/distroless) contain only the application runtime (Java 17, Python 3.11, etc.) and no shell, no package manager, no OS utilities. docker exec fails — there is nothing to exec into. Use when you want to minimize attack surface in production; accept the operational cost: debugging requires ephemeral debug sidecar containers (kubectl debug).
How do you minimize image size?
Use an alpine or distroless base, multi-stage builds, combined RUN commands to reduce layer count, no package caches (apk add --no-cache; for apt, --no-install-recommends plus removing /var/lib/apt/lists/* in the same RUN), .dockerignore to exclude test files and docs, and docker scout or dive to inspect layer contents. Target: <100 MB for most production services; <20 MB for Go/Rust binaries on scratch/distroless.
What is BuildKit and how does it improve on the classic Docker build engine?
BuildKit (moby/buildkit) replaces the sequential layer-by-layer builder with a DAG executor: independent build stages run in parallel, build cache can be exported externally (registry-based, shared across CI workers), secrets can be passed as tmpfs mounts (never baked into layers), and output supports multiple exporters (OCI, Docker, local). Enable it with DOCKER_BUILDKIT=1 or docker buildx build; it is the default builder in Docker Engine 23.0 and later.
What is an SBOM and why does it matter for containers?
An SBOM (Software Bill of Materials) is a machine-readable inventory of every package in an image: name, version, license, source repo. In CycloneDX or SPDX format. Executive Order 14028 (US federal) requires SBOMs for federal software procurement. In practice: store the SBOM alongside the image in the registry (as a cosign-attached OCI artifact), and query it when a new CVE drops to find all images containing the vulnerable package.
What is a container escape and how does it happen?
A container escape is when a process inside a container gains access to the host. Three common paths: (1) a kernel or runtime vulnerability that bypasses isolation (CVE-2019-5736, the runc binary overwrite; CVE-2022-0185, a kernel heap overflow); (2) misconfigured capabilities, since CAP_SYS_ADMIN is nearly equivalent to root on the host; (3) a privileged container (--privileged), which disables almost all of the security mechanisms. Defence: rootless containers, seccomp, drop all caps, no privileged containers in production.
What is cgroup v2 and what does it change for containers?
cgroup v2 (unified hierarchy) replaces cgroup v1's split per-controller hierarchy. Key changes: memory.oom.group for coordinated OOM killing of a whole container (not just one victim process), PSI (Pressure Stall Information) for detecting resource contention, and safer delegation of cgroup management to the container runtime. Kubernetes promoted cgroup v2 support to GA in 1.25 (it is supported, not required); Amazon Linux 2023 and Ubuntu 22.04 default to cgroup v2.
What is the difference between OCI format and Docker image format?
Docker's image format (schema 2) and OCI Image Spec 1.0 are nearly identical — OCI was derived from Docker's format and standardized by the OCI working group. Key differences: OCI uses different media types (application/vnd.oci.image.manifest.v1+json vs Docker's application/vnd.docker.distribution.manifest.v2+json), OCI uses "image index" where Docker uses "manifest list." All modern runtimes (containerd, Podman, buildah) speak both formats transparently.
How do you design registry HA? What fails when a region goes down?
Blobs are in multi-region S3 with CRR — CDN serves from nearest region, unaffected. Tag writes (manifest PUT) need cross-region consistency — use CockroachDB or geo-distributed Postgres with synchronous replication for writes, or accept that tag writes fail during a region outage and pulls continue serving stale-but-consistent cached manifests. The worst failure mode: a region outage during a mass autoscale event causes thundering-herd on the remaining region's API tier — rate limit by cluster/namespace, not by IP.
Walk me through what happens when Kubernetes pulls an image.
Kubelet detects a new Pod, calls CRI (containerd) PullImage. containerd checks its snapshot store for existing layers. For missing layers: GET manifest from registry, compare layer digests against local cache, GET missing blobs (via CDN redirect). Unpack blobs via snapshotter (OverlayFS). Report ImagePulled back to kubelet. Kubelet then calls CreateContainer and StartContainer. imagePullPolicy: IfNotPresent skips manifest fetch if image is already in containerd's image store.
How does Docker's rate limiting (100 pulls/6 hrs) work technically?
Rate limiting is applied per authenticated user (or per IP for anonymous). Each manifest GET (not blob GET) counts as a pull. The counter is stored in a Redis token bucket keyed by user/IP, decremented per manifest pull, refilled over 6 hours. Solution: authenticate in CI (docker login), or run a registry mirror (Harbor) in your cluster that pulls from Docker Hub once and serves internally unlimited.
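The token bucket described above fits in a few lines. This is an in-memory sketch with an injectable clock; Docker Hub's actual implementation is not public, so the class, its parameters, and the continuous-refill behavior are assumptions for illustration. A production version would keep the state in a shared store such as Redis and update it atomically.

```python
class TokenBucket:
    """Per-client pull limiter: `capacity` pulls, refilled evenly over `window_s` seconds."""

    def __init__(self, capacity: int = 100, window_s: float = 6 * 3600, now: float = 0.0):
        self.capacity = capacity
        self.rate = capacity / window_s      # tokens regained per second
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now: float) -> bool:
        """Call once per manifest GET; blob GETs are not counted."""
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # the caller would answer with HTTP 429

bucket = TokenBucket()
burst = [bucket.allow(0.0) for _ in range(101)]
print(burst.count(True), burst[-1])   # 100 allowed, the 101st rejected
print(bucket.allow(300.0))            # some tokens have refilled after 5 minutes
```

Keying the bucket by authenticated user rather than IP avoids penalizing many CI runners that share one NAT address.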
What is the difference between docker stop and docker kill?
docker stop sends SIGTERM to PID 1 in the container, waits 10 seconds for graceful shutdown, then sends SIGKILL. docker kill sends SIGKILL immediately (or a specified signal). Apps should handle SIGTERM for graceful shutdown (drain connections, flush buffers). Kubernetes Pod termination follows the same pattern: SIGTERM → grace period → SIGKILL, with configurable terminationGracePeriodSeconds.
What is the FROM scratch base image and when do you use it?
FROM scratch is literally an empty filesystem: no OS, no libc, nothing. A single statically-linked binary is the entire container. Go (with CGO_ENABLED=0) and Rust (with a musl target) can produce static binaries; a Go HTTP server on scratch is ~5 MB. The tradeoff: no sh, no wget, no debugging tools, so a curl-based healthcheck doesn't work. Use distroless instead of scratch when you need TLS root CAs, timezone data, or a minimal libc.
How do you handle secrets in containers — what are the anti-patterns?
Anti-patterns: ENV vars (visible in docker inspect, process env, and image history), ARGs baked into layers (visible in image history), secret files in image layers (visible in layer extraction). Correct patterns: Kubernetes Secrets mounted as tmpfs-backed volumes (a volumes entry with secret.secretName plus a volumeMount, not env.valueFrom.secretKeyRef, which puts the value back into an environment variable), Vault Agent Injector sidecar, AWS Secrets Manager with IRSA (IAM Roles for Service Accounts), or BuildKit secret mounts (--mount=type=secret) during build, which make secrets available in RUN but never in the final image.
| Dimension | Weak answer | Strong answer |
|---|---|---|
| What is a container? | "An isolated process" | Linux namespaces + cgroups + OverlayFS union mount. Shares host kernel. Not a VM. |
| Image anatomy | "A Docker image with layers" | OCI manifest + config + content-addressed layer blobs. SHA256 digest is the trust anchor and cache key. |
| Registry design | "Store images in S3" | Blob store keyed by SHA256 digest + manifest DB + CDN redirect on blob GET. Deduplication is free via content addressing. |
| Networking | "Containers have their own network" | Network namespaces + veth pairs + docker0 bridge + iptables NAT. Multi-host: VXLAN overlay. DNS via embedded resolver. |
| Schema | "Tables for images and tags" | blob (deduplicated), manifest, manifest_blob (junction), tag (mutable), pull_event (partitioned). GC via manifest_blob ref-count. |
| Runtime | "Docker runs the container" | dockerd → containerd (gRPC API; the kubelet reaches containerd through CRI instead) → containerd-shim → runc → namespace + cgroup setup → exec PID 1. runc exits after start. |
| 10B pulls/day | "Use a CDN" | CDN for blob serving (immutable by SHA256), P2P (Dragonfly/Kraken) for cluster autoscale bursts, eStargz lazy loading for cold start latency. |
| Security | "Scan for vulnerabilities" | Trivy scanning in CI + cosign signing + rootless containers + drop all caps + read-only rootfs + no-new-privileges + distroless base. |
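The "deduplication is free" and garbage-collection claims in the Registry design and Schema rows can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not a registry implementation: `BlobStore` and its methods are hypothetical names, and instead of maintaining per-blob reference counters it recomputes the referenced set from the manifests on each collection.

```python
import hashlib
from typing import Dict, List


class BlobStore:
    """Toy content-addressed store: blobs keyed by digest, manifests list digests."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}
        self.manifests: Dict[str, List[str]] = {}  # manifest name -> layer digests

    def put_blob(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # identical content is stored once
        return digest

    def put_manifest(self, name: str, layers: List[bytes]) -> None:
        self.manifests[name] = [self.put_blob(layer) for layer in layers]

    def delete_manifest(self, name: str) -> None:
        self.manifests.pop(name, None)

    def collect_garbage(self) -> int:
        """Delete blobs no manifest references; return how many were removed."""
        referenced = {d for digests in self.manifests.values() for d in digests}
        unreferenced = [d for d in self.blobs if d not in referenced]
        for digest in unreferenced:
            del self.blobs[digest]
        return len(unreferenced)
```

Two images that share a base layer store that layer once, and deleting one image only frees the layers the other does not reference.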
Three misconceptions that show up repeatedly in interviews — and how to correct them on the spot.
❌ "Containers replace VMs"
Wrong: they solve different problems. Containers share the host OS kernel, so isolation is at the process level via namespaces. VMs each run a full separate kernel, so isolation is at the hypervisor level, enforced in hardware. Use VMs when you need security boundaries between tenants (multi-tenant SaaS, compliance workloads, untrusted code). Containers are faster and lighter, but a kernel vulnerability can cross the container boundary, whereas a VM guest can only escape through a hypervisor vulnerability. They coexist: in production you typically run containers inside VMs.
❌ "One container = one service" is always the right rule
True at the logical level — and it is the right default. But teams routinely ship containers with supervisord running multiple processes (e.g., nginx + app server + cron), causing ops nightmares: a crashed child process is invisible to the container runtime; logs are tangled; health checks can't distinguish which process failed; you can't scale processes independently; init signal handling breaks. The rule is one concern per container, not a literal process limit — but in practice one PID 1 is the correct implementation.
❌ "Docker is Kubernetes" / "Docker and Kubernetes are the same thing"
Docker builds images (Dockerfile → docker build) and runs individual containers locally (docker run). It is a developer tool and a local runtime. Kubernetes orchestrates containers at scale — scheduling pods onto nodes, managing replicas, rolling deployments, service discovery, autoscaling, self-healing. You can run Docker without Kubernetes (local dev, CI). In production at any meaningful scale you almost certainly need Kubernetes (or a managed equivalent: ECS, Cloud Run, Fargate) — not because Docker is inadequate but because Kubernetes solves a different problem: fleet management, not container execution.
The senior answer includes knowing when NOT to use containers. Trade-offs matter more than defaults.
Every interviewer will ask a variant of this. Have the answer internalized, not memorized.
"Why did containers become popular?"
Containers solved the 'works on my machine' problem by packaging an application with all its dependencies into a single portable unit. Unlike VMs, they share the host OS kernel, making them lightweight (MBs vs GBs) and fast to start (seconds vs minutes). This enables consistent behavior across dev, test, and prod environments. The tradeoff: at scale, managing hundreds of containers manually becomes chaos — which is exactly the problem Kubernetes was built to solve.
FOLLOW-UP READY: If they ask "what problem does Kubernetes solve?" — scheduling, self-healing, rolling deployments, service discovery, autoscaling. See The Orchestration Problem.
The question is about containers. The test is about depth of systems thinking. Four axes that separate junior from staff.
Do you know the difference between process isolation (containers — Linux namespaces, shared kernel) and kernel isolation (VMs — hypervisor, separate kernel)? Can you explain when each is appropriate? Candidates who say "containers are more secure than VMs" fail this test.
Can you explain layered image builds, multi-stage builds, and why image size matters? Do you know the Dockerfile instruction-ordering rule (copying the full source tree before the dependency-install RUN invalidates that layer's cache on every code change)? Staff engineers think about build cache as CI performance infrastructure.
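The ordering rule can be made concrete with a simplified model of layer caching. It assumes each layer's cache key is a hash of its parent key, the instruction text, and the content the instruction reads; this is an illustration, not Docker's actual cache-key algorithm, and `layer_keys`, `cache_hits`, `BAD_ORDER`, and `GOOD_ORDER` are hypothetical names.

```python
import hashlib
from typing import Dict, List, Tuple

Step = Tuple[str, str]  # (instruction text, name of the input it reads, or "")

BAD_ORDER: List[Step] = [
    ("COPY . /app", "source"),
    ("RUN pip install -r requirements.txt", ""),
]
GOOD_ORDER: List[Step] = [
    ("COPY requirements.txt /app/", "requirements"),
    ("RUN pip install -r requirements.txt", ""),
    ("COPY . /app", "source"),
]


def layer_keys(steps: List[Step], inputs: Dict[str, bytes]) -> List[str]:
    """Chain each layer key from the parent key, the instruction, and its input."""
    keys: List[str] = []
    parent = ""
    for instruction, input_name in steps:
        h = hashlib.sha256()
        h.update(parent.encode())
        h.update(instruction.encode())
        h.update(inputs.get(input_name, b""))
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def cache_hits(steps: List[Step], before: Dict[str, bytes],
               after: Dict[str, bytes]) -> List[bool]:
    """For each layer, report whether a rebuild with `after` reuses the old layer."""
    old, new = layer_keys(steps, before), layer_keys(steps, after)
    return [a == b for a, b in zip(old, new)]
```

With a source-only change, `cache_hits(BAD_ORDER, ...)` reports a miss on every layer, including the dependency install, while `cache_hits(GOOD_ORDER, ...)` misses only on the final COPY.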
Bridge networks (single-host, docker0), overlay networks (multi-host, VXLAN), port mapping (iptables DNAT) — do you know when each applies? Can you explain why -p 8080:80 works? Networking questions distinguish people who've debugged real container issues from those who only used them.
Root vs non-root containers (running as root in a container is nearly root on the host without user namespaces), read-only filesystems, capability dropping (--cap-drop=ALL), seccomp profiles. Staff-level: rootless containers via user namespaces, and why distroless eliminates most of the attack surface.
--privileged is essentially no isolation. Ask about their registry rate-limiting situation unprompted.
Containers are one milestone in a longer arc. Knowing where they sit, and what came before and after, is the context that makes all other answers land.
Each generation of infrastructure solves one problem and creates the next. Understanding this chain is what makes a strong candidate's answer feel historically grounded rather than tool-focused.
Containers solved packaging
One portable unit with all dependencies. Works everywhere. "Works on my machine" is solved. But now you have 500 containers — who starts them, restarts them when they die, and balances load?
Kubernetes — scheduling pods onto nodes, self-healing via ReplicaSets, rolling deployments, service discovery via CoreDNS, autoscaling via HPA. The container is the atom; Kubernetes manages the molecule.
Kubernetes solved scheduling
Fleet management at scale — reliable, declarative. But developers still need to understand YAML, Helm charts, network policies, ingress controllers, CRDs. The cognitive overhead is enormous. Who builds the abstractions?
Platform Engineering — internal developer platforms (IDPs) that hide Kubernetes complexity. Tools like Backstage, Crossplane, Humanitec. The platform team owns the infrastructure; app teams get golden paths: backstage new-service → CI/CD + Kubernetes + observability, zero YAML.
Platform engineering solves developer experience — abstracting infrastructure into self-service workflows. The next frontier: AI agents that provision, scale, and heal infrastructure without human operators.
Instead of a developer triggering a deploy pipeline, an AI agent reads the service's SLO, observes current traffic patterns, provisions the right size of infrastructure, deploys the container, and adjusts replicas in real time — without a human writing a Helm chart or tuning an HPA. The container is still the execution unit. The agent is the new operator.
The through-line: every generation of infrastructure abstracts away the complexity of the previous one. Physical servers → VMs → cloud → containers → Kubernetes → platforms → agents. Understanding this arc — not just the current tool — is what a staff engineer brings to the interview.