PaddySpeaks · Systems at the Whiteboard · Nº 25

The Container Problem

Design a container runtime and image registry at Docker/GitHub Container Registry scale — what a container image actually is, how OverlayFS layers stack and copy-on-write, the OCI image spec and content-addressable blob storage, the push/pull protocol, container networking from veth pairs to VXLAN overlays, what runc and containerd actually do when you run docker run, image security scanning, and how you serve 10 billion pulls per day with CDN layers, P2P distribution, and lazy-loading.

WORDS · PaddySpeaks Editorial FIELD · Container Infrastructure / Systems READING · ~35 min

§ 00 — BEFORE CONTAINERS
The evolution that made containers inevitable

Containers did not appear in a vacuum. Each step in infrastructure history created a new problem that the next step solved — until containers created their own problem: scale management.

PHYSICAL SERVER: Bare metal: one application per server. Utilization typically 5–15%. A new app means provisioning a new server — weeks of lead time. Upgrade a library and you risk breaking the OS or another tenant. Hardware is expensive; idle hardware is waste.
VIRTUAL MACHINES: Better utilization — run multiple VMs on one host. But each VM carries a full OS kernel: 0.5–2 GB of overhead, 30–120 seconds to boot. Spinning up 100 VMs for a traffic spike takes minutes. VM images are gigabytes; CI pipelines become slow.
DEPENDENCY HELL: "Works on my machine" — the definitive failure mode. App A needs Python 3.9 + OpenSSL 1.1. App B needs Python 3.11 + OpenSSL 3.0. On a shared server, one breaks the other. Virtualenv, rbenv, nvm are band-aids. The problem is the environment is not portable.
CONTAINERS: Package the application with its dependencies — not the full OS — into a portable unit. MBs instead of GBs. Seconds to start instead of minutes. Run identically on a developer's laptop, in CI, and in production. Docker's 2013 launch democratized what Google had been doing internally (Borg, lmctfy) for years.
CONTAINER SPRAWL: Containers solved packaging and isolation. But 500 containers across 20 hosts — how do you restart a failed container? How do you roll out a new version without downtime? How do you balance load? Manual management at scale is chaos. This is the problem Kubernetes was built to solve.

Question	60-second answer
What is a container?	A process with its own filesystem (OverlayFS), network namespace, and cgroup limits — not a VM; shares the host kernel.
What is an image?	A stack of read-only content-addressable layers (SHA256 blobs) plus a JSON config — described by an OCI manifest.
How does a registry work?	Content-addressable blob store + manifest index. Pull: check local cache → HEAD blob → GET blob chunks. Deduplication is free — same SHA256 blob serves thousands of images.
How is networking isolated?	Linux network namespaces + veth pairs + a bridge (docker0) + iptables NAT. Multi-host: VXLAN overlay encapsulation.
10B pulls/day?	CDN edge caching of blobs by SHA256, P2P distribution (Dragonfly/Kraken), lazy loading (eStargz streaming pulls), multi-region blob replication.

§ 01 — THE QUESTION
Containers are not VMs

A container looks like an isolated machine. It is, in fact, a regular Linux process using kernel namespaces, cgroups, and a union filesystem to create the illusion of a private machine, without the hypervisor and guest-OS boot that make VMs take 30+ seconds to start.

Interview Prompt

"Design a container runtime and image registry at Docker/GitHub Container Registry scale. Walk me through what a container image actually is, how layers work, how networking is isolated, and how you'd build a registry that serves 10 billion pulls per day."

LEVEL · SENIOR / STAFF   DURATION · 45–60 MIN   FORMAT · WHITEBOARD

The question catches most candidates because containers sit at an uncomfortable intersection: operating systems, distributed storage, networking, and a high-scale CDN problem all at once. A weak answer draws a box labeled "Docker" and describes docker run. A strong answer names the four forces that make the design hard, then shows how every later decision exists to survive one of them.

THE LAYER SHARING PROBLEM: Images are expensive; layers are cheap. A base Ubuntu layer (100 MB) used by 10,000 images should be stored and transferred exactly once — not once per image. The whole registry design flows from this: content-addressable storage by SHA256 digest means a layer appears once in the blob store no matter how many images reference it. Pull deduplication is free.
THE ISOLATION ILLUSION: Containers share the host kernel — they don't own it. Network, PID, mount, UTS, and IPC namespaces create the appearance of isolation. cgroups enforce resource limits. OverlayFS creates the private filesystem view. If any of these kernel primitives is misconfigured, containers can escape to the host. This is why container security is fundamentally harder than VM security.
THE 10B PULLS PROBLEM: A registry is the hottest cache in infrastructure. Every CI job, every pod startup, every autoscale event pulls layers. The bottleneck is not compute — it's bandwidth, and specifically the last-mile latency between the registry and the cluster. CDN, P2P, and lazy-loading each exist to solve a different part of this bandwidth cliff.
THE FORMAT FRAGMENTATION PROBLEM: Docker format ≠ OCI format — but compatibility is expected. Docker's v2 manifest schema and the OCI Image Spec are nearly identical but differ in media types. Multi-arch image support (arm64, amd64, arm/v7) requires a manifest list (OCI: image index). BuildKit, Podman, containerd, and the CRI all have their own quirks. A registry must handle all of them seamlessly.

Read those four again. Every one is a data and architecture problem, not a container-feature problem. The interesting engineering is in the content-addressable blob store, the union filesystem, the namespace orchestration, and the CDN hierarchy — not the docker run command itself.

Envelope math, volunteered:

Quantity	Estimate	Consequence
Docker Hub pulls / day	~10B	CDN is not optional; origin cannot absorb this traffic
Avg image size (layers)	~200 MB (compressed)	10B pulls × 200 MB = 2 EB/day data transfer (CDN cache hit rate must be >99%)
Distinct layer blobs (Docker Hub)	~50–100M	Content-addressed; each stored once regardless of how many images reference it
Typical layer count per image	5–15 layers	Each layer is a separate blob pull; shallow caches help less than deep blob deduplication
Image pull latency target (CI)	<10 s	Warm cache: check-layer-exists (HEAD) → skip unchanged layers; cold: parallel chunk download
Registry HA requirement	99.99% uptime	~52 min/year downtime; multi-region active-active with blob replication
Garbage collection window	7–30 days	Dangling layers deleted after all manifests referencing them are GC'd; soft-delete then sweep
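
The bandwidth rows above are worth working through once. The sketch below reproduces the arithmetic from the table's own estimates. It assumes every pull transfers the full ~200 MB, which is an upper bound because warm caches skip unchanged layers; the function and key names are illustrative.

```python
# Back-of-envelope registry traffic, using the estimates from the table above.
PULLS_PER_DAY = 10_000_000_000   # ~10B pulls/day
AVG_IMAGE_BYTES = 200 * 10**6    # ~200 MB compressed per pull (upper bound)
CDN_HIT_RATE = 0.99              # assumed edge cache hit rate
SECONDS_PER_DAY = 86_400


def envelope(pulls=PULLS_PER_DAY, image_bytes=AVG_IMAGE_BYTES, hit_rate=CDN_HIT_RATE):
    """Return daily volume, request rate, edge bandwidth, and origin bandwidth."""
    total_bytes = pulls * image_bytes
    bytes_per_sec = total_bytes / SECONDS_PER_DAY
    origin_bytes_per_sec = bytes_per_sec * (1 - hit_rate)
    return {
        "total_eb_per_day": total_bytes / 10**18,
        "pulls_per_sec": pulls / SECONDS_PER_DAY,
        "edge_tbit_per_sec": bytes_per_sec * 8 / 10**12,
        "origin_gbyte_per_sec": origin_bytes_per_sec / 10**9,
    }


if __name__ == "__main__":
    for name, value in envelope().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the edge carries roughly 185 Tbit/s on average, and even a 99% hit rate leaves the origin serving a couple of hundred gigabytes per second. That is why the hit rate target sits above 99% rather than at it.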

§ 02 — PLAIN LANGUAGE
What is a container, really?

Before the engineering, the mental model. Three analogies that actually hold up under scrutiny.

The VM analogy (and why it breaks)

A virtual machine is a full computer running inside your computer: it has its own kernel, its own memory allocator, its own device drivers. Starting one takes 30 seconds or more because the OS is actually booting. A container is not a virtual machine. It is a process that thinks it has its own computer. It runs directly on the host kernel, so it shares the kernel's scheduler, memory manager, and networking code with every other process on the host. Namespaces give it a private view of interfaces, PIDs, and mounts, and cgroups cap how much CPU and memory it can use. The illusion of isolation is convincing, but it is not a hard wall.

	Virtual Machine	Container
Kernel	Its own (Guest OS)	Shared host kernel
Startup time	30–120 seconds	100–500 ms
Memory overhead	100–500 MB (OS overhead)	~1 MB (just the process)
Isolation level	Hardware-level (hypervisor)	Kernel-level (namespaces)
Security boundary	Very hard to escape	Kernel vulnerabilities can escape
Use case	Full OS, different kernels, strong multi-tenancy	Microservices, CI, ephemeral workloads

What Docker solves — "works on my machine"

Before Docker, deploying software meant: (1) install the right version of Java/Python/Node on the server, (2) hope it matches your laptop, (3) debug the production difference at 2 AM. Docker's insight was simple: ship the filesystem, not just the code. Package the app, its runtime, its libraries, and its configuration into an image — and that image runs identically on your laptop, in CI, and in production. The image is the deployment unit. "Works on my machine" becomes "this is the machine."

The shipping container analogy

Before standardized shipping containers, loading a cargo ship meant custom-stacking thousands of different-shaped parcels — slow, lossy, requiring specialists at every port. After: everything fits into an ISO-standard container. The crane operator doesn't care what's inside. The port doesn't care about the contents. The ship is fully interchangeable.

Docker containers work identically. The runtime (containerd/runc) doesn't care if your container runs Python, Go, or a Postgres database. The orchestrator (Kubernetes) doesn't care about your app. The cloud (ECS, GKE, AKS) doesn't care about your runtime. The standard interface — OCI Image Spec + OCI Runtime Spec — is the interoperability layer, exactly like ISO 668 for shipping.

The shipping container transformed global trade not by changing what was shipped, but by standardizing how it was moved. Docker transformed software deployment not by changing how apps are written, but by standardizing how they are packaged and run. The analogy is not metaphorical — it is precisely structural.

§ 03 — IMAGE ANATOMY
What's actually in an OCI image

An image is a stack of immutable, content-addressed filesystem layers plus a JSON config. The OCI Image Spec formalizes what Docker invented. Understanding the anatomy is the prerequisite for understanding the registry, the runtime, and security.

OverlayFS: the union filesystem

When you run a container, Linux mounts a union filesystem using OverlayFS. OverlayFS takes multiple directory trees ("lower layers") and presents them as a single unified view. Each image layer is a tar archive of filesystem changes. Layers are stacked bottom-up. A container gets a writable "upper" layer on top — and copy-on-write means files from lower layers are only copied to the writable layer when modified.
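
The lookup and copy-on-write rules are easier to see in a toy model. The sketch below is not how the kernel implements OverlayFS; it models layers as dictionaries and uses a hypothetical `OverlayView` class and `WHITEOUT` marker to show the three behaviors that matter: top-down lookup, copy-up on modification, and deletion by whiteout.

```python
WHITEOUT = object()  # marker meaning "this path was deleted in this layer"


class OverlayView:
    """Toy overlay mount: read-only lower layers plus one writable upper layer."""

    def __init__(self, lower_layers):
        # lower_layers is ordered bottom-up, like image layers; each is {path: bytes}.
        self.lowers = list(lower_layers)
        self.upper = {}

    def read(self, path):
        # Search the upper layer first, then the lower layers from top to bottom.
        for layer in [self.upper, *reversed(self.lowers)]:
            if path in layer:
                if layer[path] is WHITEOUT:
                    raise FileNotFoundError(path)
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Writes only ever touch the upper layer.
        self.upper[path] = data

    def append(self, path, data):
        # Copy-up: bring the file into the upper layer, then modify it there.
        self.upper[path] = self.read(path) + data

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if the path is not visible
        self.upper[path] = WHITEOUT  # hide it without touching lower layers


if __name__ == "__main__":
    base = {"/etc/os-release": b"ubuntu\n", "/bin/sh": b"#!sh"}
    view = OverlayView([base])
    view.append("/etc/os-release", b"patched\n")
    view.delete("/bin/sh")
    print(view.read("/etc/os-release"), base["/etc/os-release"])
```

The lower dictionaries are never mutated, which is the property that lets thousands of containers share one copy of a base layer on disk.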

The OCI Image Spec — three artifacts

MANIFEST: A JSON document that lists the image config blob (SHA256) and all layer blobs (SHA256) with their sizes. This is what a registry returns when you request an image by tag. It is the "table of contents" of the image. For multi-arch, an image index (manifest list) contains multiple manifests, one per platform.
CONFIG: A JSON blob containing environment variables, entrypoint, working directory, exposed ports, and the history of commands used to build the image. The config SHA256 is the image ID you see in docker images. It does not describe what the image does — it describes how to run it.
LAYERS: Each layer is a gzip-compressed tar archive of filesystem changes (adds, modifies, deletes as whiteout files). Layers are content-addressed by the SHA256 of the compressed blob. The same Ubuntu base layer — pulled once — is shared by every image that uses it. This is where the storage efficiency comes from.

# OCI Manifest (simplified)
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:abc123...",
    "size": 7023
  },
  "layers": [
    { "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:1a2b3c...", "size": 30428672 },   // ubuntu base
    { "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:8b1f2e...", "size": 45678901 },   // python runtime
    { "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:3c8a9d...", "size": 12345678 },   // app deps
    { "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:f9e2a1...", "size": 234567  }     // app code
  ]
}

# The digest is the SHA256 of the compressed layer tar.
# The diffID (in config) is the SHA256 of the uncompressed tar.
# Content-addressable: if sha256 matches, the blob is valid — no trust required.
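
The digest/diffID distinction trips people up, so here is a minimal sketch that builds a one-file layer in memory and computes both. The file name and helper names are illustrative; only the Python standard library is used.

```python
import gzip
import hashlib
import io
import tarfile


def build_layer(files):
    """Return (uncompressed_tar, gzipped_tar) for a {path: bytes} mapping."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for path, data in sorted(files.items()):
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    tar_bytes = raw.getvalue()
    # mtime=0 keeps the gzip header deterministic, so the digest is reproducible.
    return tar_bytes, gzip.compress(tar_bytes, mtime=0)


def sha256_digest(blob):
    return "sha256:" + hashlib.sha256(blob).hexdigest()


tar_bytes, gz_bytes = build_layer({"app/main.py": b"print('hello')\n"})
layer_digest = sha256_digest(gz_bytes)  # goes in the manifest "layers" array
diff_id = sha256_digest(tar_bytes)      # goes in the config under rootfs.diff_ids

if __name__ == "__main__":
    print("digest:", layer_digest)
    print("diffID:", diff_id)
```

Recompressing the same tar with different gzip settings changes the digest but not the diffID, which is why the runtime checks the diffID after unpacking.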

Copy-on-write means a modified file in a running container is NOT written back to the image layer. It is copied into the writable upper layer and modified there. The lower layer is unchanged. The upper layer survives a stop and restart, but it is discarded when the container is removed (docker rm), and a new container from the same image starts with a fresh, empty upper layer. This is why containers are ephemeral by default, and why you mount a volume for anything you want to persist.

§ 04 — THE REGISTRY
How Docker Hub and GHCR work

A container registry is a content-addressable blob store plus a metadata index. The OCI Distribution Spec formalizes the API. The interesting engineering is deduplication, CDN caching, multi-arch support, and garbage collection.

The push/pull protocol (OCI Distribution Spec)

Content-addressable storage — why SHA256 is the key

Every blob in a registry is addressed by its SHA256 digest: sha256:1a2b3c.... This has three profound consequences:

DEDUPLICATION: Any two images that share a layer — identical Ubuntu base, identical Python runtime — share exactly one copy of that layer in the blob store. Storage cost is O(unique layers), not O(images × layers).
CONTENT INTEGRITY: Before using a blob, recompute its SHA256. If it doesn't match the manifest's declared digest, the download is corrupt or tampered. No signature required for layer integrity — the hash is the trust anchor.
CACHE-FRIENDLY CDN: Blobs addressed by SHA256 are immutable — the same digest always means the same content. CDN edge nodes can cache blobs indefinitely with no TTL. A cache hit for sha256:ubuntu-base saves 100 MB of origin bandwidth for every pull worldwide.
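
All three consequences fall out of one small data structure. The sketch below is an in-memory stand-in for the blob store, assuming a hypothetical `BlobStore` class; a real registry would back it with object storage, but the deduplication and verification logic has the same shape.

```python
import hashlib


class DigestMismatch(Exception):
    """Raised when content does not hash to the digest it was announced under."""


def digest_of(data):
    return "sha256:" + hashlib.sha256(data).hexdigest()


class BlobStore:
    def __init__(self):
        self._blobs = {}  # digest -> bytes; one entry per unique content

    def put(self, data, expected_digest=None):
        digest = digest_of(data)
        if expected_digest is not None and digest != expected_digest:
            raise DigestMismatch(f"expected {expected_digest}, got {digest}")
        self._blobs.setdefault(digest, data)  # re-uploading the same bytes is a no-op
        return digest

    def has(self, digest):
        # Equivalent of the HEAD check a client makes before transferring a blob.
        return digest in self._blobs

    def get(self, digest):
        data = self._blobs[digest]
        if digest_of(data) != digest:  # detect corruption at rest
            raise DigestMismatch(digest)
        return data

    def stored_bytes(self):
        return sum(len(b) for b in self._blobs.values())


if __name__ == "__main__":
    store = BlobStore()
    base = b"ubuntu base layer bytes"
    first = store.put(base)
    second = store.put(base)  # a second image pushing the same layer
    print(first == second, store.stored_bytes() == len(base))
```

Because the key is derived from the content, there is no separate deduplication pass and no reference to reconcile: two pushes of the same layer collapse into one entry by construction.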

Multi-arch: manifest lists (OCI image index)

# docker manifest inspect ubuntu:22.04 --verbose
# An OCI Image Index ("fat manifest") points to per-platform manifests

{
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    { "digest": "sha256:amd64...", "platform": {"os":"linux","architecture":"amd64"} },
    { "digest": "sha256:arm64...", "platform": {"os":"linux","architecture":"arm64"} },
    { "digest": "sha256:armv7...", "platform": {"os":"linux","architecture":"arm","variant":"v7"} }
  ]
}

# docker pull resolves the correct manifest for the host platform automatically.
# docker buildx build --platform linux/amd64,linux/arm64 --push builds and pushes all platforms in one step.
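
Platform resolution is a small matching problem. The sketch below is a simplified version of what a client does with an image index: match on OS and architecture, prefer an exact variant match, and fall back to an entry that declares no variant. Real clients apply more normalization and fallback rules than this; `select_manifest` is a hypothetical name.

```python
def select_manifest(index, os_name, arch, variant=None):
    """Return the digest of the per-platform manifest for the requested platform."""
    matches = [
        m for m in index["manifests"]
        if m["platform"]["os"] == os_name
        and m["platform"]["architecture"] == arch
    ]
    # First choice: the variant matches exactly (both absent also counts).
    for m in matches:
        if m["platform"].get("variant") == variant:
            return m["digest"]
    # Fallback: an entry that does not pin a variant at all.
    for m in matches:
        if "variant" not in m["platform"]:
            return m["digest"]
    raise LookupError(f"no manifest for {os_name}/{arch}/{variant}")


INDEX = {
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {"digest": "sha256:amd64...", "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:arm64...", "platform": {"os": "linux", "architecture": "arm64"}},
        {"digest": "sha256:armv7...",
         "platform": {"os": "linux", "architecture": "arm", "variant": "v7"}},
    ],
}

if __name__ == "__main__":
    print(select_manifest(INDEX, "linux", "arm", "v7"))
```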

§ 05 — CONTAINER NETWORKING
From veth pairs to VXLAN overlays

Container networking is Linux networking — namespaces, veth pairs, bridges, iptables NAT. Multi-host adds VXLAN overlay encapsulation. Understanding the primitives is the prerequisite for debugging any networking issue in Kubernetes.

Single-host: bridge networking (docker0)

Network namespaces and veth pairs — the mechanics

NETWORK NAMESPACE: Each container gets its own network stack: interfaces, routing table, iptables rules, sockets — fully isolated. ip netns exec container1 ip addr shows only that container's interfaces. The host namespace and container namespace are completely separate — a port conflict in one is invisible to the other.
VETH PAIR: A virtual Ethernet cable with two ends. One end is placed inside the container namespace (appears as eth0); the other is placed in the host namespace (appears as vethXXXX) and attached to the docker0 bridge. Traffic flows between them — whatever enters one end exits the other. This is the only "wire" between the container and the host network.
DOCKER0 BRIDGE: A virtual switch at 172.17.0.1/16. All container veth pairs attach here. Container-to-container traffic within the same bridge is direct — no NAT, no kernel routing, just a virtual switch forward. The bridge also has an IP on the host, allowing the host to reach containers.
IPTABLES MASQUERADE: For container-to-internet traffic: the kernel NATs the source IP from 172.17.0.x to the host's public IP. For inbound port publishing (-p 8080:80): iptables DNAT rewrites destination from host:8080 to 172.17.0.x:80. This is why docker run -p works — it's just iptables rules.
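
The two NAT rules can be modeled as pure functions over a packet's address fields. This is a sketch of the rewriting only, not of iptables itself: the host IP is a hypothetical address from the documentation range, the `PUBLISHED` table stands in for `-p 8080:80`, and connection tracking (which maps replies back) is left out.

```python
import ipaddress
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


HOST_IP = "203.0.113.10"                       # hypothetical host public IP
BRIDGE_NET = ipaddress.ip_network("172.17.0.0/16")
PUBLISHED = {8080: ("172.17.0.2", 80)}         # docker run -p 8080:80


def dnat_inbound(pkt):
    """Inbound: rewrite host:published_port to container_ip:container_port."""
    if pkt.dst_ip == HOST_IP and pkt.dst_port in PUBLISHED:
        ip, port = PUBLISHED[pkt.dst_port]
        return replace(pkt, dst_ip=ip, dst_port=port)
    return pkt


def masquerade_outbound(pkt):
    """Outbound: hide a bridge-subnet source behind the host IP when leaving the bridge."""
    src_local = ipaddress.ip_address(pkt.src_ip) in BRIDGE_NET
    dst_local = ipaddress.ip_address(pkt.dst_ip) in BRIDGE_NET
    if src_local and not dst_local:
        return replace(pkt, src_ip=HOST_IP)
    return pkt


if __name__ == "__main__":
    print(dnat_inbound(Packet("198.51.100.7", 51000, HOST_IP, 8080)))
    print(masquerade_outbound(Packet("172.17.0.2", 40000, "93.184.216.34", 443)))
```

Container-to-container traffic on the same bridge passes through `masquerade_outbound` unchanged, which matches the "no NAT on the bridge" behavior described above.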

Multi-host: VXLAN overlay networking

Single-host bridge networking doesn't cross machines. For multi-host clusters (Docker Swarm, Kubernetes), an overlay network encapsulates container frames inside UDP packets using VXLAN (Virtual Extensible LAN). Each host gets its own container subnet. When container 10.244.1.2 on host A talks to container 10.244.2.3 on host B, the frame is wrapped in a VXLAN UDP packet (destination: host B's physical IP, port 4789) and unwrapped on arrival. Flannel, Calico, Cilium, and Weave all support VXLAN or a close variation of this model; some (Calico, for example) can also route pod traffic natively without encapsulation.

# VXLAN packet structure (simplified)
Outer Ethernet frame:
  src MAC: host-A NIC
  dst MAC: host-B NIC

Outer IP header:
  src: 10.0.0.1  (host A physical IP)
  dst: 10.0.0.2  (host B physical IP)

UDP header:
  dst port: 4789 (VXLAN)

VXLAN header:
  VNI: 100  (virtual network identifier — which overlay network)

Inner Ethernet frame:
  src MAC: veth in container-A namespace
  dst MAC: veth in container-B namespace

Inner IP:
  src: 10.244.1.2  (pod/container IP on host A)
  dst: 10.244.2.3  (pod/container IP on host B)

# The kernel's VXLAN driver on host B strips the outer headers and
# delivers the inner frame to the correct container namespace.
# To the container, it looks like a normal Ethernet packet.
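
The VXLAN header itself is only eight bytes, and packing it by hand makes the layout concrete. The sketch below follows the header layout from RFC 7348 (flags byte with the VNI-valid bit, 24-bit VNI, reserved fields zeroed); the function names are illustrative, and the outer Ethernet/IP/UDP headers are left to the kernel.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid
VXLAN_UDP_PORT = 4789

# Bytes added around the inner frame: outer Ethernet + IPv4 + UDP + VXLAN.
OVERHEAD = 14 + 20 + 8 + 8


def pack_vxlan_header(vni):
    """Build the 8-byte VXLAN header for a 24-bit virtual network identifier."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits). Word 2: VNI (24 bits) + reserved (8 bits).
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vxlan_header(header):
    """Return the VNI from a VXLAN header, rejecting headers without the VNI flag."""
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return word2 >> 8


def encapsulate(inner_frame, vni):
    """Prefix an inner Ethernet frame with its VXLAN header (the UDP payload)."""
    return pack_vxlan_header(vni) + inner_frame


if __name__ == "__main__":
    print(pack_vxlan_header(100).hex(), 1500 - OVERHEAD)
```

The 50 bytes of overhead are why overlay networks on a 1500-byte underlay typically run the container interfaces at an MTU of 1450.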

DNS inside containers: on user-defined networks, Docker runs an embedded DNS resolver at 127.0.0.11. Container-to-container lookups by name (e.g., connecting to db:5432) resolve via this embedded DNS, which maps container and service names to their IPs on that network. The default docker0 bridge does not get this name resolution. In Kubernetes, CoreDNS does the same job for pod-to-service resolution: my-service.default.svc.cluster.local → 10.96.x.x.

§ 06 — SCHEMA DESIGN
The registry database

The interesting design is layer deduplication (blobs are shared across images), garbage collection (GC blobs only when no manifest references them), quota enforcement per repository, and pull event analytics. Each table serves a distinct operational requirement.

-- Repositories (e.g. library/ubuntu, ghcr.io/org/app)
CREATE TABLE repository (
    repo_id         BIGSERIAL PRIMARY KEY,
    registry        TEXT NOT NULL,              -- 'docker.io', 'ghcr.io', 'gcr.io'
    namespace        TEXT NOT NULL,             -- 'library', 'myorg'
    name             TEXT NOT NULL,             -- 'ubuntu', 'myapp'
    is_public        BOOLEAN NOT NULL DEFAULT TRUE,
    owner_id         BIGINT NOT NULL,
    storage_bytes    BIGINT NOT NULL DEFAULT 0, -- updated by trigger on manifest_blob insert
    quota_bytes      BIGINT,                    -- NULL = unlimited
    created_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (registry, namespace, name)
);

-- Content-addressed blobs — shared across ALL images/repositories
CREATE TABLE blob (
    blob_id          BIGSERIAL PRIMARY KEY,
    digest           TEXT NOT NULL UNIQUE,      -- sha256:hex — THE primary key conceptually
    media_type       TEXT NOT NULL,             -- 'application/vnd.oci.image.layer.v1.tar+gzip' etc.
    size_bytes       BIGINT NOT NULL,
    storage_path     TEXT NOT NULL,             -- s3://blobs/{digest[0:2]}/{digest[2:4]}/{digest}
    uploaded_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_referenced_at TIMESTAMPTZ,             -- updated on pull; used by GC
    gc_eligible      BOOLEAN NOT NULL DEFAULT FALSE  -- set true by the GC job when no manifest references the blob
);

-- OCI Manifests (one per image+platform)
CREATE TABLE manifest (
    manifest_id      BIGSERIAL PRIMARY KEY,
    repo_id          BIGINT NOT NULL REFERENCES repository(repo_id),
    digest           TEXT NOT NULL,             -- sha256 of the manifest JSON
    media_type       TEXT NOT NULL,             -- OCI manifest or manifest list
    raw_json         JSONB NOT NULL,            -- full manifest content
    config_digest    TEXT REFERENCES blob(digest),  -- null for manifest lists
    created_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (repo_id, digest)
);

-- Manifest ↔ blob association (many-to-many — core deduplication)
CREATE TABLE manifest_blob (
    manifest_id      BIGINT NOT NULL REFERENCES manifest(manifest_id),
    blob_id          BIGINT NOT NULL REFERENCES blob(blob_id),
    layer_order      INT NOT NULL,              -- 0-based order in manifest layers array
    PRIMARY KEY (manifest_id, layer_order)      -- a manifest may list the same blob at more than one position
);

-- Tags (mutable pointers to manifests)
CREATE TABLE tag (
    tag_id           BIGSERIAL PRIMARY KEY,
    repo_id          BIGINT NOT NULL REFERENCES repository(repo_id),
    name             TEXT NOT NULL,             -- 'latest', 'v1.2.3', 'main'
    manifest_id      BIGINT NOT NULL REFERENCES manifest(manifest_id),
    pushed_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    pushed_by        TEXT NOT NULL,
    UNIQUE (repo_id, name)
);

-- Image index (multi-arch / manifest list → per-platform manifests)
CREATE TABLE image_index_entry (
    parent_manifest_id BIGINT NOT NULL REFERENCES manifest(manifest_id),
    child_manifest_id  BIGINT NOT NULL REFERENCES manifest(manifest_id),
    platform_os        TEXT NOT NULL,           -- 'linux'
    platform_arch      TEXT NOT NULL,           -- 'amd64', 'arm64', 'arm'
    platform_variant   TEXT,                    -- 'v7' for arm/v7
    PRIMARY KEY (parent_manifest_id, child_manifest_id)
);

-- Pull events (analytics + rate limiting)
CREATE TABLE pull_event (
    event_id         BIGSERIAL PRIMARY KEY,
    repo_id          BIGINT NOT NULL REFERENCES repository(repo_id),
    manifest_id      BIGINT REFERENCES manifest(manifest_id),
    tag_name         TEXT,
    puller_id        BIGINT,                    -- authenticated user / org, null for anon
    client_ip        INET NOT NULL,
    pulled_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    bytes_transferred BIGINT NOT NULL DEFAULT 0,
    was_cache_hit    BOOLEAN NOT NULL DEFAULT FALSE
) PARTITION BY RANGE (pulled_at);  -- monthly partitions; archive to S3 after 90 days

CREATE TABLE pull_event_2026_01 PARTITION OF pull_event
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

-- Indexes
-- blob(digest) and tag(repo_id, name) are already indexed by their UNIQUE constraints.
CREATE INDEX idx_manifest_repo       ON manifest (repo_id, created_at DESC);
CREATE INDEX idx_manifest_blob_blob  ON manifest_blob (blob_id);  -- for GC reference check
CREATE INDEX idx_pull_event_repo     ON pull_event (repo_id, pulled_at DESC);
CREATE INDEX idx_blob_gc             ON blob (gc_eligible, last_referenced_at)
    WHERE gc_eligible = TRUE;

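-- Garbage collection query: find blobs with no manifest references
-- Run periodically after tag deletes / manifest deletes
-- Config blobs are referenced from manifest.config_digest, not manifest_blob, so check both.
UPDATE blob b SET gc_eligible = TRUE
WHERE NOT EXISTS (SELECT 1 FROM manifest_blob mb WHERE mb.blob_id = b.blob_id)
  AND NOT EXISTS (SELECT 1 FROM manifest m WHERE m.config_digest = b.digest)
  AND b.uploaded_at < NOW() - INTERVAL '7 days';  -- soft-delete grace period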

GC design: never delete a blob immediately when a manifest is deleted — there is a grace period (7–30 days) during which a push can restore the reference. Soft-delete: mark gc_eligible = TRUE, then a GC worker sweeps blobs with no references and no recent push. This prevents races between concurrent push and delete operations.
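
The two-phase shape of that design can be sketched in a few lines. This is an in-memory model under stated assumptions: blobs are a `{digest: uploaded_at}` dict, manifests are `{digest: {"config": ..., "layers": [...]}}`, and the 7-day grace period comes from the text above. The key detail is that the sweep re-checks references right before deleting, which is what closes the race with a concurrent push.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(days=7)


def referenced(manifests):
    """All blob digests reachable from any manifest (config blobs and layers)."""
    refs = set()
    for manifest in manifests.values():
        refs.add(manifest["config"])
        refs.update(manifest["layers"])
    return refs


def mark(blobs, manifests, now, grace=GRACE):
    """Phase 1: flag blobs that are unreferenced and older than the grace period."""
    live = referenced(manifests)
    return {d for d, uploaded in blobs.items() if d not in live and now - uploaded > grace}


def sweep(blobs, marked, manifests):
    """Phase 2: delete marked blobs that are still unreferenced at sweep time."""
    live = referenced(manifests)  # re-check: a push may have restored a reference
    deleted = [d for d in sorted(marked) if d not in live]
    for digest in deleted:
        del blobs[digest]
    return deleted


if __name__ == "__main__":
    now = datetime(2026, 1, 31, tzinfo=timezone.utc)
    blobs = {"sha256:old": now - timedelta(days=30), "sha256:new": now - timedelta(days=1)}
    print(sweep(blobs, mark(blobs, {}, now), {}))
```

In a real registry the re-check and the delete would need to happen under a lock or inside one transaction; the sketch only shows the ordering.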

Quota enforcement: storage_bytes on repository is updated by a trigger on manifest_blob insert. Before accepting a push, check storage_bytes + new_layer_sizes <= quota_bytes. Deduplication means shared layers don't count against individual repo quotas — only layers unique to that repo.

§ 07 — RUNTIME INTERNALS
What happens when you run `docker run`

Eight steps from command to running process. Each step is a distinct subsystem: registry pull, layer unpacking, rootfs construction, namespace setup, cgroup limits, and exec. runc is the last mile — it executes exactly one process.

runc, containerd, CRI — the component stack

Component	Responsibility	Interface
runc	OCI runtime — creates namespaces, applies cgroup limits, execs PID 1. Does exactly one thing: start a container. Exits afterward.	OCI Runtime Spec (config.json)
containerd	Container lifecycle, image pull/store, snapshot management, shim management. Does NOT know about pods or services.	gRPC CRI (from Kubernetes), containerd client API
CRI	Container Runtime Interface — the Kubernetes API for talking to container runtimes. Kubelet speaks CRI; containerd implements it.	gRPC (RunPodSandbox, CreateContainer, StartContainer…)
dockerd	Docker daemon — legacy wrapper that translates Docker API calls to containerd calls. Not needed in Kubernetes (kubelet speaks CRI directly).	Docker Engine API (HTTP/Unix socket)
containerd shim	Per-container process that bridges containerd and runc. Keeps running after runc exits to hold stdio and report exit status.	ttrpc (a lightweight gRPC-style RPC between containerd and shim)
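
What containerd actually hands to runc is a bundle: a rootfs directory plus a config.json. The sketch below builds a deliberately incomplete config.json using field names from the OCI Runtime Spec (process, root, linux.namespaces, linux.resources). A usable config also needs mounts such as /proc, which `runc spec` generates; the function name and default values here are illustrative.

```python
import json


def minimal_runtime_config(args, rootfs="rootfs", memory_limit_bytes=None):
    """Build a pared-down OCI runtime config for one process."""
    config = {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(args),  # the command that becomes PID 1
            "cwd": "/",
            "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
            "user": {"uid": 0, "gid": 0},
        },
        "root": {"path": rootfs, "readonly": True},  # the OverlayFS mount point
        "linux": {
            # One entry per namespace runc should create for the container.
            "namespaces": [
                {"type": t} for t in ("pid", "network", "ipc", "uts", "mount")
            ],
        },
    }
    if memory_limit_bytes is not None:
        # Becomes a cgroup memory limit.
        config["linux"]["resources"] = {"memory": {"limit": memory_limit_bytes}}
    return config


if __name__ == "__main__":
    cfg = minimal_runtime_config(["/bin/sh"], memory_limit_bytes=256 * 1024 * 1024)
    print(json.dumps(cfg, indent=2))
```

Everything higher in the stack (image pull, snapshotting, CRI) exists to produce this one document and the directory it points at.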

§ 08 — SECURITY
Image scanning, content trust, and container hardening

Container security has two distinct layers: image security (what's in the image before it runs) and runtime security (what a running container can do). Both require explicit design decisions — the defaults are permissive.

Image scanning: Trivy and Grype

Image scanning tools (Trivy, Grype, Snyk Container) inspect the image's layer contents for known CVEs in OS packages (apt/yum), language runtimes (pip, npm, maven), and application dependencies. They do not run the image — they parse the filesystem statically.

	Trivy	Grype
Vulnerability DB	NVD, GHSA, OS advisories (Ubuntu, Alpine, RHEL…)	NVD, GHSA + Anchore Feed
SBOM output	CycloneDX, SPDX	CycloneDX, SPDX, Syft JSON (SBOMs come from its companion tool Syft)
CI integration	GitHub Actions, GitLab CI, Tekton, GitHub Advanced Security	GitHub Actions, Jenkins
Secret scanning	Yes (finds hardcoded secrets in image layers)	No
Registry scan (without pull)	Yes (remote scan via registry API)	Yes

Content trust: Notary v2 / cosign (Sigstore)

An image tag (nginx:latest) is mutable — someone can push a new image and overwrite the tag. Content trust creates a cryptographic signature over a manifest digest, stored alongside the image in the registry. The consumer verifies the signature before pulling.

# cosign (Sigstore) — keyless signing via OIDC (no long-lived key)
# Sign in CI with GitHub Actions OIDC token:
cosign sign --yes ghcr.io/myorg/myapp@sha256:abc123...
# cosign uploads a signature as a separate OCI artifact in the registry

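# Verify before deploy (in production admission controller), pinned to the signed digest.
# The certificate identity includes the workflow ref; refs/heads/main is assumed here.
cosign verify --certificate-identity=https://github.com/myorg/myapp/.github/workflows/build.yaml@refs/heads/main \
              --certificate-oidc-issuer=https://token.actions.githubusercontent.com \
              ghcr.io/myorg/myapp@sha256:abc123...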

# Notary v2 (nv2) — policy-based: only trusted images run
# OPA Gatekeeper or Kyverno enforces policy at admission:
# "any Pod spec.containers[].image must have a valid cosign signature"

Runtime hardening — the defense-in-depth checklist

Control	How	What it prevents
Read-only root filesystem	`docker run --read-only` or `securityContext.readOnlyRootFilesystem: true`	Malware can't write to the container's root FS, which blocks many persistence and pivot techniques
Non-root user	Dockerfile `USER 1001` or `securityContext.runAsNonRoot: true`	Reduces blast radius if container is compromised; root in container = root on host with some escape paths
No new privileges	`--security-opt=no-new-privileges` or `securityContext.allowPrivilegeEscalation: false`	Blocks setuid binaries from escalating privileges inside the container
Seccomp profile	Docker default seccomp blocks ~44 syscalls. RuntimeDefault in Kubernetes.	Limits available kernel attack surface; blocks ptrace, kexec_load, etc.
Drop capabilities	`--cap-drop=ALL --cap-add=NET_BIND_SERVICE`	Removes CAP_SYS_ADMIN, CAP_NET_ADMIN, etc. — the dangerous capabilities that enable container escapes
Rootless containers	Podman rootless / rootless containerd / Docker rootless mode	Container root maps to an unprivileged host user via a user namespace, so an escape lands as that unprivileged user rather than host root
Distroless images	Google's gcr.io/distroless — no shell, no package manager	No shell = no interactive exploits; nothing to exec into. Attack surface reduced to the app binary only.
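
Several rows of the checklist combine into a single docker run invocation. The sketch below assembles one from the flags listed in the table; the helper name, the default UID, and the image reference are illustrative, and seccomp is left at Docker's default profile.

```python
import shlex


def hardened_run_command(image, user="1001", add_caps=("NET_BIND_SERVICE",)):
    """Return a docker run argv with the checklist's flags applied."""
    cmd = [
        "docker", "run",
        "--read-only",                       # read-only root filesystem
        "--user", user,                      # non-root user
        "--security-opt=no-new-privileges",  # block setuid escalation
        "--cap-drop=ALL",                    # start from zero capabilities
    ]
    cmd += [f"--cap-add={cap}" for cap in add_caps]  # add back only what is needed
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    print(shlex.join(hardened_run_command("ghcr.io/myorg/myapp:1.0")))
```

Apps that write scratch files usually need a writable tmpfs (for example `--tmpfs /tmp`) alongside `--read-only`.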

The most important container security insight: running as root inside a container is nearly the same as running as root on the host if there is a kernel vulnerability or misconfigured capability. Rootless containers (user namespaces) are one of the few defenses that still help after a namespace escape, because even an escaped attacker has only unprivileged host user access.

§ 09 — SCALE
Serving 10 billion pulls per day

Take ~10B pulls/day as the design target for a Docker Hub-class registry; GitHub Container Registry, GCR, and ECR face the same class of problem. The architecture has three layers: CDN edge caching, P2P distribution for large-scale cluster pulls, and lazy loading for eliminating unnecessary data transfer entirely.

Layer 1: CDN edge caching

Blobs are addressed by SHA256 digest — they are immutable and can be cached indefinitely. The registry redirects blob GETs to a CDN presigned URL (S3/GCS + CloudFront/Fastly). Cache hit rates for popular base images (ubuntu, alpine, python, node) exceed 99%. This means the origin registry handles only cache misses — roughly 1% of 10B = 100M requests/day.

# Registry blob GET → 307 redirect to CDN URL
GET /v2/library/ubuntu/blobs/sha256:1a2b3c...
→ 307 Location: https://cdn.hub.docker.com/v2/blobs/sha256:1a2b3c...?X-Amz-Expires=3600&X-Amz-Signature=...

# CDN cache key = sha256 digest (immutable — no TTL needed)
# Popular blobs (ubuntu, alpine base layers) served from CDN PoP nearest to client
# Cold miss → CDN fetches from S3, caches permanently

Layer 2: P2P distribution (Dragonfly / Kraken)

When a Kubernetes cluster autoscales from 10 pods to 1000 pods simultaneously, each pod tries to pull the same image. Without P2P, 1000 nodes all hit the CDN/registry simultaneously. With P2P distribution (Alibaba's Dragonfly or Uber's Kraken), nodes form a BitTorrent-like swarm: each node that has pulled a chunk seeds it to peers. The registry/CDN serves only the initial seeder, and the swarm handles the fan-out.

Solution	Architecture	Best for
Dragonfly (CNCF)	Manager + Scheduler + Seed Peer + Dfdaemon agent on each node. Content splits into P2P chunks. Used by Alibaba, Ant Group.	Large-scale clusters (1000+ nodes), frequent mass-scale deployments
Kraken (Uber)	Tracker + Origin + Proxy + Agent. Uses BitTorrent protocol. Written in Go.	Uber's scale — 1M+ container starts/day across hundreds of clusters
containerd mirror	Simpler: configure a local registry mirror per cluster (Harbor, Nexus). Node pulls from mirror; mirror pulls from Docker Hub once.	Single cluster, simpler ops, acceptable latency
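
The fan-out argument is just doubling. The sketch below is a deliberately crude model, not Dragonfly's or Kraken's scheduler: it assumes one node fetches the blob from the registry, and in each round every node that already holds it uploads to a fixed number of peers. Real systems split blobs into pieces so nodes can upload while still downloading, which does better than this.

```python
def rounds_to_distribute(nodes, uploads_per_round=1):
    """Rounds until every node holds the blob, starting from one seeded node."""
    have, rounds = 1, 0  # the first node pulled from the registry/CDN
    while have < nodes:
        # Every current holder serves `uploads_per_round` new peers.
        have = min(nodes, have * (1 + uploads_per_round))
        rounds += 1
    return rounds


def origin_fetches(nodes, p2p):
    """How many full copies the registry/CDN must serve."""
    return 1 if p2p else nodes


if __name__ == "__main__":
    n = 1000
    print("without P2P:", origin_fetches(n, p2p=False), "origin fetches")
    print("with P2P:   ", origin_fetches(n, p2p=True), "origin fetch,",
          rounds_to_distribute(n), "rounds")
```

For 1000 nodes the swarm finishes in 10 rounds with one upload per holder per round, while origin load drops from 1000 copies to one.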

Layer 3: Lazy loading — eStargz and streaming pulls

The fundamental insight: most containers can start after fetching only a fraction of their image data (roughly 30% in this article's working estimate), because only the files accessed during startup are needed immediately. eStargz (CNCF Stargz Snapshotter) reformats image layers so individual files are addressable and fetchable on demand. Container startup begins before the image is fully downloaded.

# eStargz: seekable gzip layer format
# Each file within the layer is compressed as its own gzip member, so it is independently fetchable
# The layer blob ends with a TOC (table of contents) and a footer that points to it

# Startup latency comparison (Node.js app, 500 MB image):
# Traditional pull:  download 500 MB → unpack → start    ≈ 60 s on cold node
# eStargz lazy:      download TOC (1 MB) → start → fetch on demand  ≈ 3 s

# Enable in containerd config.toml:
[proxy_plugins]
  [proxy_plugins.stargz]
    type = "snapshot"
    address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"

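# Build an eStargz image. ctr needs fully qualified references,
# and the source image must already be in containerd's store:
ctr-remote image pull docker.io/library/nginx:alpine
ctr-remote image optimize --oci \
  docker.io/library/nginx:alpine \
  ghcr.io/myorg/nginx:alpine-sgz
ctr-remote image push ghcr.io/myorg/nginx:alpine-sgz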
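
The lazy-loading idea above reduces to a lookup: find the file in the TOC, then ask the registry for just those bytes. The sketch below assumes a simplified TOC that maps a path to an offset and compressed size. The real eStargz TOC is a richer JSON document, and `TocEntry` and `range_header` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TocEntry:
    name: str
    offset: int  # byte offset of the file's compressed chunk in the layer blob
    size: int    # compressed length in bytes


def range_header(toc: dict[str, TocEntry], path: str) -> str:
    """Build the HTTP Range header value that fetches one file from a layer blob."""
    entry = toc[path]
    last_byte = entry.offset + entry.size - 1  # HTTP byte ranges are inclusive
    return f"bytes={entry.offset}-{last_byte}"


# Hypothetical TOC for a layer containing two files.
toc = {
    "usr/bin/node": TocEntry("usr/bin/node", offset=1024, size=4096),
    "app/server.js": TocEntry("app/server.js", offset=5120, size=512),
}

print(range_header(toc, "usr/bin/node"))   # bytes=1024-5119
print(range_header(toc, "app/server.js"))  # bytes=5120-5631
```

A snapshotter built on this idea serves file reads through FUSE and issues such range requests only for files the process actually opens.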

Registry HA architecture

Component	Architecture	Why
Blob storage	Multi-region S3/GCS with cross-region replication. Blobs are immutable — eventual consistency is fine.	CDN pulls from nearest region; durability via replication; no single point of failure
Manifest/tag DB	PostgreSQL (primary + replicas per region) or CockroachDB for global active-active. Tags are mutable — strong consistency needed for writes.	Tag write must be globally consistent to avoid split-brain (two regions serving different digests for same tag)
Registry API	Stateless containers behind ALB in each region. Auto-scale on CPU.	Horizontal scaling; no session state; blob redirect offloads bandwidth
CDN	CloudFront / Fastly global PoP network. Cache blobs by SHA256 digest permanently.	Last-mile latency; absorbs 99%+ of blob traffic without touching origin
Rate limiting	Redis (or DynamoDB for multi-region) token bucket per IP/authenticated user. Unauthenticated Docker Hub: 100 pulls/6 hrs.	Protect origin from unauthenticated scraping; incentivize authentication

The 10B pulls/day number becomes tractable once you accept that 99% of it is CDN traffic serving immutable blobs by SHA256. The registry origin handles tags (mutable, ~1% of requests), new blob uploads (push path, rare), and cache misses. The hard problems are tag consistency across regions and garbage collection correctness — not raw throughput.
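
A quick back-of-the-envelope check of that claim, assuming for simplicity that one pull is one request and that load is spread evenly over the day (real traffic is burstier, and a pull fans out into several manifest and blob requests):

```python
PULLS_PER_DAY = 10_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60
ORIGIN_SHARE = 0.01  # the ~1% of traffic that reaches the registry origin


def average_rps(per_day: float, share: float = 1.0) -> float:
    """Average requests per second for a given share of daily traffic."""
    return per_day * share / SECONDS_PER_DAY


total_rps = average_rps(PULLS_PER_DAY)
origin_rps = average_rps(PULLS_PER_DAY, ORIGIN_SHARE)

print(f"total:  {total_rps:,.0f} req/s")   # total:  115,741 req/s
print(f"origin: {origin_rps:,.0f} req/s")  # origin: 1,157 req/s
```

Roughly a thousand requests per second at the origin is an ordinary stateless-API workload, which is why the hard problems sit elsewhere.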

§ 10 — Q&AFifteen questions the loop actually asks

These separate the staff-level answer from the senior one — layer caching, BuildKit, distroless images, and container escapes.

Q 01

How does layer caching work in CI, and why does it break?

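Each Dockerfile instruction is cached under a key derived from the parent layer plus the instruction itself; for COPY and ADD the key also includes a checksum of the copied files. If nothing in the key changes, the layer is reused. The cache breaks for every instruction after the first one whose key changes (e.g., COPY . . before RUN pip install means every code change invalidates the pip layer). Fix: put COPY requirements.txt and RUN pip install before COPY . . so that only a dependency change rebuilds the install layer.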
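
The ordering effect can be sketched with a toy cache-key chain. This is an assumption-laden model (a SHA-256 over parent key, instruction text, and copied content), not BuildKit's actual cache-key algorithm, and `layer_cache_keys` and `reused_layers` are illustrative names.

```python
import hashlib


def layer_cache_keys(steps: list[str], file_contents: dict[str, str]) -> list[str]:
    """Toy model: each key hashes the parent key, the instruction, and any copied content."""
    keys = []
    parent = "scratch"
    for step in steps:
        material = parent + "\n" + step
        if step.startswith("COPY "):
            source = step.split()[1]
            material += "\n" + hashlib.sha256(file_contents[source].encode()).hexdigest()
        parent = hashlib.sha256(material.encode()).hexdigest()
        keys.append(parent)
    return keys


def reused_layers(old_keys: list[str], new_keys: list[str]) -> int:
    """Count leading layers whose keys are unchanged between two builds."""
    count = 0
    for old, new in zip(old_keys, new_keys):
        if old != new:
            break
        count += 1
    return count


bad_order = ["FROM python:3.12-slim", "COPY . .", "RUN pip install -r requirements.txt"]
good_order = ["FROM python:3.12-slim", "COPY requirements.txt .",
              "RUN pip install -r requirements.txt", "COPY . ."]

before = {"requirements.txt": "flask", ".": "code v1"}
after = {"requirements.txt": "flask", ".": "code v2"}  # only application code changed

bad_reuse = reused_layers(layer_cache_keys(bad_order, before), layer_cache_keys(bad_order, after))
good_reuse = reused_layers(layer_cache_keys(good_order, before), layer_cache_keys(good_order, after))

print(bad_reuse)   # 1: only the FROM layer survives, pip install reruns
print(good_reuse)  # 3: FROM, COPY requirements.txt, and pip install are reused
```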

Q 02

What are multi-stage builds and why do they matter?

Multi-stage builds use multiple FROM instructions — a "builder" stage compiles the binary, a "runtime" stage copies only the compiled artifact. The final image contains no compiler, no build tools, no source code — just the binary and its runtime deps. A Go app goes from 1.2 GB (with Go toolchain) to 15 MB (scratch + binary).

Q 03

What is a distroless image and when should you use one?

Distroless images (gcr.io/distroless) contain only the application runtime (Java 17, Python 3.11, etc.) and its direct dependencies: no shell, no package manager, no OS utilities. docker exec into a shell fails because there is no shell binary to run. Use them when you want to minimize attack surface in production, and accept the operational cost: debugging requires an ephemeral debug container attached to the pod (kubectl debug).

Q 04

How do you minimize image size?

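Use an alpine or distroless base, multi-stage builds, and a .dockerignore that excludes test files and docs. Keep package-manager caches out of layers (apk add --no-cache; for apt, --no-install-recommends plus rm -rf /var/lib/apt/lists/* in the same RUN), and combine related RUN commands so cleanup happens in the layer that created the files. Inspect layer contents with dive; docker scout covers the vulnerability side. Target: <100 MB for most production services; <20 MB for Go/Rust binaries on scratch/distroless.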

Q 05

What is BuildKit and how does it improve on the classic Docker build engine?

BuildKit (moby/buildkit) replaces the sequential layer-by-layer builder with a DAG executor: independent build stages run in parallel, the build cache can be exported and imported (registry-based, shared across CI workers), secrets can be passed as tmpfs mounts (never baked into layers), and output supports multiple exporters (OCI, Docker, local). It is the default builder since Docker Engine 23.0; on older engines, enable it with DOCKER_BUILDKIT=1 or use docker buildx build.

Q 06

What is an SBOM and why does it matter for containers?

An SBOM (Software Bill of Materials) is a machine-readable inventory of every package in an image (name, version, license, source repo), usually in CycloneDX or SPDX format. Executive Order 14028 (US federal) requires SBOMs for federal software procurement. In practice: store the SBOM alongside the image in the registry (as a cosign-attached OCI artifact), and query it when a new CVE drops to find all images containing the vulnerable package.
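
The "query it when a new CVE drops" step is essentially a filter over stored inventories. A minimal sketch, assuming SBOMs have already been parsed into plain dictionaries; the image names, the package `libexample`, and its versions are made up for illustration, and real CycloneDX or SPDX documents carry far more fields:

```python
def images_with_package(sboms: dict[str, list[dict]], name: str, bad_versions: set[str]) -> list[str]:
    """Return the images whose SBOM lists `name` at one of the affected versions."""
    hits = []
    for image, packages in sboms.items():
        if any(p["name"] == name and p["version"] in bad_versions for p in packages):
            hits.append(image)
    return sorted(hits)


# Hypothetical parsed SBOMs keyed by image reference.
sboms = {
    "ghcr.io/myorg/api:1.4": [{"name": "libexample", "version": "2.0.1"}],
    "ghcr.io/myorg/web:3.2": [{"name": "libexample", "version": "2.1.0"}],
    "ghcr.io/myorg/worker:0.9": [{"name": "otherlib", "version": "2.0.1"}],
}

affected = images_with_package(sboms, "libexample", {"2.0.0", "2.0.1"})
print(affected)  # ['ghcr.io/myorg/api:1.4']
```

At fleet scale the same query usually runs against an index of SBOMs rather than a loop, but the shape of the question is the same.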

Q 07

What is a container escape and how does it happen?

A container escape is when a process inside a container gains access to the host. Three common paths: (1) a kernel or runtime vulnerability that breaks isolation (CVE-2019-5736, where a container could overwrite the host runc binary; CVE-2022-0185, a kernel heap overflow); (2) misconfigured capabilities: CAP_SYS_ADMIN is nearly equivalent to root on the host; (3) a privileged container (--privileged), which grants all capabilities and host device access and turns off the default seccomp and AppArmor confinement. Defence: rootless containers, seccomp, drop all caps, no privileged containers in production.

Q 08

What is cgroup v2 and what does it change for containers?

cgroup v2 (unified hierarchy) replaces cgroup v1's split per-controller hierarchies. Key changes: memory.oom.group for killing all processes in a container together on OOM (instead of a single victim process), PSI (Pressure Stall Information) for detecting resource contention, and safe delegation of cgroup subtrees to the container runtime. Kubernetes support for cgroup v2 went GA in 1.25; cgroup v1 still works but has been in maintenance mode since 1.31. Amazon Linux 2023 and Ubuntu 22.04 default to cgroup v2.
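
A common way to tell which hierarchy a host runs is to look for the cgroup.controllers file at the cgroup root, which exists only on the unified hierarchy. The sketch below wraps that check; the function name is an assumption, and it deliberately reports anything that is not v2 (including hybrid setups) as 1.

```python
import tempfile
from pathlib import Path


def cgroup_version(root: str = "/sys/fs/cgroup") -> int:
    """Return 2 if `root` looks like a cgroup v2 unified hierarchy, else 1.

    Assumption: anything that is not v2 is reported as 1, including hybrid
    setups and hosts with no cgroup filesystem mounted at `root`.
    """
    return 2 if (Path(root) / "cgroup.controllers").is_file() else 1


# Demo against fake directory trees so the sketch runs on any machine.
with tempfile.TemporaryDirectory() as fake_v2, tempfile.TemporaryDirectory() as fake_v1:
    (Path(fake_v2) / "cgroup.controllers").write_text("cpu io memory pids\n")
    detected_v2 = cgroup_version(fake_v2)
    detected_v1 = cgroup_version(fake_v1)

print(detected_v2, detected_v1)  # 2 1
```

On a real node, calling `cgroup_version()` with no argument checks the live mount.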

Q 09

What is the difference between OCI format and Docker image format?

Docker's image format (schema 2) and OCI Image Spec 1.0 are nearly identical: OCI was derived from Docker's format and standardized by the Open Container Initiative. Key differences: OCI uses different media types (application/vnd.oci.image.manifest.v1+json vs Docker's application/vnd.docker.distribution.manifest.v2+json), and OCI uses "image index" where Docker uses "manifest list." All modern tools (containerd, Podman, Buildah) speak both formats transparently.
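
In client code the difference mostly shows up as content negotiation. A small sketch, assuming a client that wants to accept both families; the media type strings are the published ones, while the helper names are illustrative:

```python
# Docker schema 2 media types and their OCI counterparts.
DOCKER_TO_OCI = {
    "application/vnd.docker.distribution.manifest.v2+json":
        "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json":
        "application/vnd.oci.image.index.v1+json",
}

MULTI_PLATFORM_TYPES = {
    "application/vnd.docker.distribution.manifest.list.v2+json",  # Docker "manifest list"
    "application/vnd.oci.image.index.v1+json",                    # OCI "image index"
}


def accept_header() -> str:
    """Accept header value for a manifest GET that tolerates both formats."""
    return ", ".join(sorted(set(DOCKER_TO_OCI) | set(DOCKER_TO_OCI.values())))


def is_multi_platform(media_type: str) -> bool:
    """True if the manifest is a per-platform index rather than a single image."""
    return media_type in MULTI_PLATFORM_TYPES


print(accept_header())
print(is_multi_platform("application/vnd.oci.image.index.v1+json"))  # True
```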

Q 10

How do you design registry HA? What fails when a region goes down?

Blobs are in multi-region S3 with CRR — CDN serves from nearest region, unaffected. Tag writes (manifest PUT) need cross-region consistency — use CockroachDB or geo-distributed Postgres with synchronous replication for writes, or accept that tag writes fail during a region outage and pulls continue serving stale-but-consistent cached manifests. The worst failure mode: a region outage during a mass autoscale event causes thundering-herd on the remaining region's API tier — rate limit by cluster/namespace, not by IP.

Q 11

Walk me through what happens when Kubernetes pulls an image.

Kubelet sees a Pod bound to its node and, through the CRI, asks the runtime (containerd) whether the image is already present; with imagePullPolicy: IfNotPresent and a local hit, it skips the pull entirely. Otherwise it calls PullImage. containerd resolves the tag to a manifest (GET manifest from the registry), compares the layer digests against what it already holds locally, and GETs only the missing blobs (via CDN redirect). It unpacks them through the snapshotter (OverlayFS) and returns the image reference to kubelet, which records a Pulled event and then calls CreateContainer and StartContainer.
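
The "compare digests, fetch only what is missing" step is a set difference that preserves layer order. A minimal sketch with shortened, made-up digests; `plan_pull` is a hypothetical helper, not a containerd API:

```python
def plan_pull(manifest_layers: list[str], local_digests: set[str]) -> list[str]:
    """Return the layer digests to download, in manifest order, without duplicates."""
    seen = set(local_digests)
    missing = []
    for digest in manifest_layers:
        if digest not in seen:
            missing.append(digest)
            seen.add(digest)  # a digest repeated in the manifest is fetched once
    return missing


# Shortened, made-up digests for illustration.
manifest_layers = ["sha256:aaa", "sha256:bbb", "sha256:ccc", "sha256:bbb"]
local_digests = {"sha256:aaa"}  # base layer already on the node from another image

to_fetch = plan_pull(manifest_layers, local_digests)
print(to_fetch)  # ['sha256:bbb', 'sha256:ccc']
```

This is why a shared base image makes the second pull on a node much cheaper than the first.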

Q 12

How does Docker's rate limiting (100 pulls/6 hrs) work technically?

Rate limiting is applied per authenticated user (or per IP for anonymous). Each manifest GET counts as a pull; blob GETs and manifest HEAD requests do not. Docker does not publish the implementation, but a natural design is a token bucket in Redis keyed by user/IP, decremented per manifest pull and refilled over the 6-hour window. Remaining quota is reported in the ratelimit-limit and ratelimit-remaining response headers. Solution: authenticate in CI (docker login), or run a registry mirror (Harbor) in your cluster that pulls from Docker Hub once and serves internally unlimited.
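
For illustration, here is an in-memory token bucket with the 100-per-6-hours shape described above. It is a sketch of the general technique under the assumption of continuous refill, not Docker Hub's actual implementation, and a production version would keep the state in Redis or a similar shared store.

```python
class TokenBucket:
    """Token bucket: `capacity` tokens, refilled continuously over `window_seconds`."""

    def __init__(self, capacity: int, window_seconds: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = capacity / window_seconds  # tokens per second
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now: float) -> bool:
        """Consume one token for a manifest pull at time `now` if one is available."""
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(capacity=100, window_seconds=6 * 3600)

burst = [bucket.allow(now=0.0) for _ in range(101)]
print(burst.count(True))       # 100 pulls allowed in the initial burst
print(burst[-1])               # False: the 101st is rejected
print(bucket.allow(now=300.0)) # True: about 1.4 tokens refilled after 5 minutes
```

Passing `now` explicitly keeps the sketch deterministic; a real limiter would read a clock.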

Q 13

What is the difference between docker stop and docker kill?

docker stop sends SIGTERM (or the image's STOPSIGNAL) to PID 1 in the container, waits 10 seconds by default (-t changes this) for graceful shutdown, then sends SIGKILL. docker kill sends SIGKILL immediately (or a specified signal). Apps should handle SIGTERM for graceful shutdown (drain connections, flush buffers). Kubernetes Pod termination follows the same pattern: SIGTERM → grace period → SIGKILL, with terminationGracePeriodSeconds (default 30) setting the grace period.
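
On the application side, handling SIGTERM usually means flipping a flag that the main loop checks before taking new work. A minimal sketch, assuming a Python service running as PID 1; the class name and the drain strategy are illustrative:

```python
import signal


class GracefulShutdown:
    """Records that SIGTERM arrived so the main loop can drain and exit."""

    def __init__(self) -> None:
        self.requested = False

    def install(self) -> None:
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame) -> None:
        # Keep the handler tiny: set a flag and let the main loop do the work.
        self.requested = True


def serve(shutdown: GracefulShutdown, jobs: list[str]) -> list[str]:
    """Process jobs until shutdown is requested; return what was completed."""
    done = []
    for job in jobs:
        if shutdown.requested:
            break  # stop taking new work; in-flight work has already finished
        done.append(job)
    return done


shutdown = GracefulShutdown()
print(serve(shutdown, ["a", "b"]))  # ['a', 'b']
```

A real service would call `shutdown.install()` at startup and finish draining well inside the grace period, since SIGKILL cannot be caught.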

Q 14

What is the FROM scratch base image and when do you use it?

FROM scratch is literally an empty filesystem: no OS, no libc, nothing. A single statically linked binary is the entire container. Go (with CGO_ENABLED=0) and Rust (with a musl target) can produce static binaries; a Go HTTP server on scratch is ~5 MB. The tradeoff: no sh, no wget, no debugging tools, so a curl-based healthcheck doesn't work. Use distroless instead of scratch when you need TLS root CAs, timezone data, or a minimal libc.

Q 15

How do you handle secrets in containers — what are the anti-patterns?

Anti-patterns: ENV vars (visible in docker inspect, process env, and image history), ARGs baked into layers (visible in image history), secret files in image layers (visible in layer extraction). Correct patterns: Kubernetes Secrets mounted as tmpfs-backed volumes (a secret volume plus volumeMounts, rather than env with secretKeyRef), Vault Agent Injector sidecar, AWS Secrets Manager + IAM IRSA, or BuildKit secret mounts (--mount=type=secret) during build, where the secret is available in RUN but never in the final image.

§ 11 — SUMMARYWhat the strong answer looks like

Dimension	Weak answer	Strong answer
What is a container?	"An isolated process"	Linux namespaces + cgroups + OverlayFS union mount. Shares host kernel. Not a VM.
Image anatomy	"A Docker image with layers"	OCI manifest + config + content-addressed layer blobs. SHA256 digest is the trust anchor and cache key.
Registry design	"Store images in S3"	Blob store keyed by SHA256 digest + manifest DB + CDN redirect on blob GET. Deduplication is free via content addressing.
Networking	"Containers have their own network"	Network namespaces + veth pairs + docker0 bridge + iptables NAT. Multi-host: VXLAN overlay. DNS via embedded resolver.
Schema	"Tables for images and tags"	blob (deduplicated), manifest, manifest_blob (junction), tag (mutable), pull_event (partitioned). GC via manifest_blob ref-count.
Runtime	"Docker runs the container"	dockerd → containerd (gRPC API; kubelet reaches containerd through CRI instead) → containerd-shim → runc → namespace + cgroup setup → exec PID 1. runc exits after start.
10B pulls/day	"Use a CDN"	CDN for blob serving (immutable by SHA256), P2P (Dragonfly/Kraken) for cluster autoscale bursts, eStargz lazy loading for cold start latency.
Security	"Scan for vulnerabilities"	Trivy scanning in CI + cosign signing + rootless containers + drop all caps + read-only rootfs + no-new-privileges + distroless base.
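
The registry and schema rows both lean on content addressing, so a compact sketch may help: blobs are stored under their SHA-256 digest, which makes deduplication automatic, and garbage collection removes blobs that no manifest references. The class and method names are assumptions for illustration, and a real registry must also guard against racing pushes during GC.

```python
import hashlib


class BlobStore:
    """In-memory content-addressed store with manifest reference counts."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}  # digest -> content
        self.refs: dict[str, int] = {}     # digest -> manifests referencing it

    def put(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # identical content is stored once
        self.refs.setdefault(digest, 0)
        return digest

    def add_manifest(self, layer_digests: list[str]) -> None:
        for digest in layer_digests:
            self.refs[digest] += 1

    def remove_manifest(self, layer_digests: list[str]) -> None:
        for digest in layer_digests:
            self.refs[digest] -= 1

    def collect_garbage(self) -> list[str]:
        dead = sorted(d for d, count in self.refs.items() if count == 0)
        for digest in dead:
            del self.blobs[digest]
            del self.refs[digest]
        return dead


store = BlobStore()
base = store.put(b"base layer")
app = store.put(b"app layer")
base_again = store.put(b"base layer")  # same bytes pushed by another image

store.add_manifest([base, app])   # image A
store.add_manifest([base_again])  # image B shares the base layer
store.remove_manifest([base, app])  # image A deleted

removed = store.collect_garbage()
print(base == base_again, len(store.blobs))  # True 1
print(removed == [app])                      # True
```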

A container is not magic isolation. It is a Linux process with a private view of the filesystem (OverlayFS), a private network stack (namespace + veth pair), and resource limits (cgroup). Every container problem — networking bugs, security escapes, slow pulls, fat images — traces back to understanding which of these three primitives is involved. Start there, not with Docker commands.


§ 12 — COMMON MISTAKESWhat candidates get wrong

Three misconceptions that show up repeatedly in interviews — and how to correct them on the spot.

❌ "Containers replace VMs"

Wrong: they solve different problems. Containers share the host OS kernel — isolation is at the process level via namespaces. VMs have a full separate kernel — isolation is at the hypervisor level, enforced in hardware. Use VMs when you need security boundaries between tenants (multi-tenant SaaS, compliance workloads, untrusted code). Containers are faster and lighter but a kernel vulnerability can cross the container boundary. VMs cannot escape the hypervisor without a hypervisor CVE. They coexist: in production you run containers inside VMs.

❌ "One container = one service" is always the right rule

True at the logical level, and it is the right default. But teams routinely ship containers with supervisord running multiple processes (e.g., nginx + app server + cron), causing ops nightmares: a crashed child process is invisible to the container runtime; logs are tangled; health checks can't distinguish which process failed; you can't scale processes independently; signal handling through the supervisor breaks. The rule is one concern per container, not a literal process limit; in practice the cleanest implementation is one main process running as PID 1, so it receives signals directly and the runtime sees its exit.

❌ "Docker is Kubernetes" / "Docker and Kubernetes are the same thing"

Docker builds images (Dockerfile → docker build) and runs individual containers locally (docker run). It is a developer tool and a local runtime. Kubernetes orchestrates containers at scale — scheduling pods onto nodes, managing replicas, rolling deployments, service discovery, autoscaling, self-healing. You can run Docker without Kubernetes (local dev, CI). In production at any meaningful scale you almost certainly need Kubernetes (or a managed equivalent: ECS, Cloud Run, Fargate) — not because Docker is inadequate but because Kubernetes solves a different problem: fleet management, not container execution.

§ 13 — WHY NOT?Containers are not always the answer

The senior answer includes knowing when NOT to use containers. Trade-offs matter more than defaults.

Use Containers When

✓ Microservices architecture — each service independently deployable
✓ Multiple languages / runtimes in the same system
✓ CI/CD pipeline consistency — eliminate "works on my machine"
✓ Cloud-native deployment targeting Kubernetes / ECS
✓ Rapid horizontal scaling of stateless services
✓ Ephemeral workloads: CI jobs, batch processing, serverless

Skip Containers When

✗ Legacy monolith tightly coupled to a specific OS configuration
✗ Compliance regimes where your auditors require VM-level isolation (some PCI-DSS or HIPAA environments)
✗ GPU-intensive ML training on dedicated hosts, where matching driver, CUDA, and container-toolkit versions adds operational work without a packaging benefit
✗ Small team, simple deployment, single monolith — ops overhead exceeds benefit
✗ Applications needing kernel module access or custom kernel builds
✗ Hard real-time workloads where cgroup CPU throttling and shared-kernel scheduling jitter are unacceptable

The most common mistake: defaulting to containers because they are modern, not because they solve a real problem in the current context. A two-person startup deploying a Rails monolith to a single server should ship a VM image and a deploy script — not a Kubernetes cluster. Complexity must earn its keep.

§ 14 — ONE-MINUTE ANSWERThe answer that ends the follow-up

Every interviewer will ask a variant of this. Have the answer internalized, not memorized.

QUESTION
"Why did containers become popular?"
ANSWER
Containers solved the 'works on my machine' problem by packaging an application with all its dependencies into a single portable unit. Unlike VMs, they share the host OS kernel, making them lightweight (MBs vs GBs) and fast to start (seconds vs minutes). This enables consistent behavior across dev, test, and prod environments. The tradeoff: at scale, managing hundreds of containers manually becomes chaos — which is exactly the problem Kubernetes was built to solve.

FOLLOW-UP READY: If they ask "what problem does Kubernetes solve?" — scheduling, self-healing, rolling deployments, service discovery, autoscaling. See The Orchestration Problem.

§ 15 — INTERVIEWER'S MINDWhat they are actually testing

The question is about containers. The test is about depth of systems thinking. Four axes that separate junior from staff.

01 · ISOLATION UNDERSTANDING

Do you know the difference between process isolation (containers — Linux namespaces, shared kernel) and kernel isolation (VMs — hypervisor, separate kernel)? Can you explain when each is appropriate? Candidates who say "containers are more secure than VMs" fail this test.

02 · IMAGE HYGIENE

Can you explain layered image builds, multi-stage builds, and why image size matters? Do you know the Dockerfile instruction order rules (COPY . . before the dependency-install RUN breaks the layer cache)? Staff engineers think about build cache as CI performance infrastructure.

03 · NETWORKING BASICS

Bridge networks (single-host, docker0), overlay networks (multi-host, VXLAN), port mapping (iptables DNAT) — do you know when each applies? Can you explain why -p 8080:80 works? Networking questions distinguish people who've debugged real container issues from those who only used them.

04 · SECURITY POSTURE

Root vs non-root containers (without user namespaces, root in a container is UID 0 on the host too, held back only by capabilities, seccomp, and namespaces), read-only filesystems, capability dropping (--cap-drop=ALL), seccomp profiles. Staff-level: rootless containers via user namespaces, and why distroless eliminates most of the attack surface.

The interviewer's unstated question is always: "Has this person operated containers in production and hit the edges, or did they just follow tutorials?" The way to demonstrate the former: volunteer the failure modes before being asked. Name the container escape vectors. Mention that --privileged is essentially no isolation. Ask about their registry rate-limiting situation unprompted.

§ 16 — THE EVOLUTIONContainer infrastructure in context

Containers are one milestone in a longer arc. Knowing where they sit — and what came before and after — is the context that makes all other answers land.

§ 17 — WHAT'S NEXT?The problems containers created — and what solves them

Each generation of infrastructure solves one problem and creates the next. Understanding this chain is what makes a strong candidate's answer feel historically grounded rather than tool-focused.

STEP 01

Containers solved packaging

One portable unit with all dependencies. Works everywhere. "Works on my machine" is solved. But now you have 500 containers — who starts them, restarts them when they die, and balances load?

→

NEXT PROBLEM → ORCHESTRATION

Kubernetes — scheduling pods onto nodes, self-healing via ReplicaSets, rolling deployments, service discovery via CoreDNS, autoscaling via HPA. The container is the atom; Kubernetes manages the molecule.

STEP 02

Kubernetes solved scheduling

Fleet management at scale — reliable, declarative. But developers still need to understand YAML, Helm charts, network policies, ingress controllers, CRDs. The cognitive overhead is enormous. Who builds the abstractions?

→

NEXT PROBLEM → DEVELOPER EXPERIENCE

Platform Engineering: internal developer platforms (IDPs) that hide Kubernetes complexity. Tools like Backstage, Crossplane, Humanitec. The platform team owns the infrastructure; app teams get golden paths, for example a Backstage software template that scaffolds a new service with CI/CD + Kubernetes + observability, zero YAML.

STEP 03 → WHAT COMES AFTER PLATFORM ENGINEERING
Platform engineering solves developer experience — abstracting infrastructure into self-service workflows. The next frontier: AI agents that provision, scale, and heal infrastructure without human operators.
Instead of a developer triggering a deploy pipeline, an AI agent reads the service's SLO, observes current traffic patterns, provisions the right size of infrastructure, deploys the container, and adjusts replicas in real time — without a human writing a Helm chart or tuning an HPA. The container is still the execution unit. The agent is the new operator.

The through-line: every generation of infrastructure abstracts away the complexity of the previous one. Physical servers → VMs → cloud → containers → Kubernetes → platforms → agents. Understanding this arc — not just the current tool — is what a staff engineer brings to the interview.
