Designing DigiCert — PKI, X.509 & Certificates at Kubernetes Scale

System Design Deep-Dive · Security Infrastructure

From the byte layout of a certificate to a CA issuing a million certs a day on Kubernetes — the full stack, built out in schema, Python, cert-manager YAML, and real architecture decisions that keep the internet's TLS working.

Paddy Iyer · Jun 23, 2026 · 38 min read · Technology
~9B: Certs issued (DigiCert lifetime)
8B+: OCSP checks / day
90 days: Max TLS cert lifetime (2026 CA/B Forum)
47 ms: Median TLS handshake (p50, modern TLS 1.3)
>500: Root CA programs (Mozilla, Apple, MS, Google)

Advanced / Staff-Level · L6–L7 Interview Prep

🧒 4th Grader: A certificate is like a secret handshake card that proves a website is really who it says it is, like a school ID badge, but for the internet.
🎓 CS Student: An X.509 certificate binds a public key to an identity, signed by a CA. TLS uses it to authenticate the server before encrypting the session.
👨‍💻 Engineer: cert-manager + Let's Encrypt auto-renews certs in Kubernetes. The tricky part is OCSP stapling, CT log compliance, and not letting certs expire silently.
🏗 Staff/Principal: PKI at scale means HSM-backed intermediate CAs, SPIFFE/SPIRE mesh identity, multi-region OCSP infra, and CA distrust event runbooks.
👔 CEO / CTO: Every HTTPS request in the world depends on this infrastructure. A single CA compromise (DigiNotar, 2011) can break the entire internet's trust model overnight.
Question | 60-second answer
Why certificates exist | Establish trust between strangers on the internet: prove a server is who it claims to be, and encrypt everything in transit.
Who issues them | Certificate Authorities (CAs): trusted third parties like DigiCert, Let's Encrypt, or your internal private PKI. Root CAs are air-gapped; intermediates do the actual signing.
How trust works | Chain of trust: Root → Intermediate → Leaf. Your OS ships with ~150 trusted root CAs. A leaf cert is trusted if every signature in the chain traces back to one of them.
How certificates fail | Expiry (forgotten renewal), compromise (stolen private key), misissuance (CA error), or CA distrust (DigiNotar 2011, Symantec 2018). Any of these → red padlock.
Modern solution | Automated certificate management: ACME protocol + cert-manager on Kubernetes = zero-touch issuance, renewal every 60–90 days, no human in the loop.
⚡ If you only remember 5 things
  1. Root CA stays offline. It signs intermediates once, then goes into a vault. Never exposed to network requests.
  2. Intermediate CA signs certs. Your leaf cert is signed by an intermediate, not the root. Chain: Root → Intermediate → Leaf.
  3. TLS validates chain + hostname + expiration. Browser checks all three in ~50 ms. Fail any one → red padlock.
  4. Revocation uses OCSP / CRL / CRLite. Certs can't be "unsigned" — revocation is a separate signal CAs broadcast to clients.
  5. Kubernetes automates everything. cert-manager + ACME = auto-issue, auto-renew, auto-rotate. Manual cert management doesn't scale past 50 services.

When your browser shows the padlock on https://paddyspeaks.com, a certificate was presented, a chain of trust was walked, and a symmetric session key was negotiated — all before a single byte of HTTP was exchanged. DigiCert is the infrastructure behind that padlock: issuing, renewing, revoking, and auditing hundreds of millions of certificates at sub-second latency, with zero tolerance for error.

· · ·
Part 1 — The Fundamentals

What Is a Certificate?

A digital certificate is a signed data structure that binds a public key to an identity — like a passport stamped by a recognized authority. Certificates follow the X.509 v3 standard (RFC 5280):

[Figure: HTTP vs. HTTPS. Over plain HTTP anyone on the path can read "password=hunter2" (⚠ no certificate); over HTTPS only sender and receiver can read the traffic (✓ TLS certificate + encryption).]

X.509 v3 Certificate (VALID ✓)
  Serial       A3:0F:81:C2:…
  Issuer       DigiCert TLS RSA SHA256 CA1
  SAN          paddyspeaks.com, *.paddyspeaks.com    ← who the cert is for; SAN is canonical
  Not Before   2026-01-01
  Not After    2026-04-01 (90 days)                  ← validity window; the CA/B Forum maximum drops to 47 days in 2029
  Public Key   ECDSA P-256 [64 bytes]
  EKU          TLS Web Server Auth
  OCSP         http://ocsp.digicert.com              ← revocation check endpoint
  CT SCTs      2 × Signed Timestamps
  Signature    SHA256withRSA [256 b]                 ← CA private key signs all fields above; the CA's tamper-proof seal

Certificate lifecycle (90-day cert):
📜 Issued (CSR signed by CA) → Active (days 1–60) → ⚠️ Expiring (days 61–90, renew now) → 🔄 Renewed (cert-manager auto-rotates)
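The lifecycle stages above can be sketched as a small classifier. This is a minimal illustration, assuming a 30-day renewal window on a 90-day certificate; the function name `lifecycle_state` and the state labels are hypothetical, not part of any CA or cert-manager API.

```python
from datetime import datetime, timedelta, timezone


def lifecycle_state(not_before: datetime, not_after: datetime, now: datetime,
                    renew_before: timedelta = timedelta(days=30)) -> str:
    """Classify a certificate by where `now` falls in its validity window."""
    if now < not_before:
        return "not_yet_valid"
    if now >= not_after:
        return "expired"
    if now >= not_after - renew_before:
        return "expiring"  # inside the renewal window: rotate now
    return "active"


# The example certificate from the figure: 2026-01-01 to 2026-04-01 (90 days).
issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
expires = datetime(2026, 4, 1, tzinfo=timezone.utc)
print(lifecycle_state(issued, expires, datetime(2026, 2, 1, tzinfo=timezone.utc)))   # active
print(lifecycle_state(issued, expires, datetime(2026, 3, 15, tzinfo=timezone.utc)))  # expiring
```

With these assumptions the "expiring" stage starts on day 61 (2026-03-02), which matches the days 61–90 window in the lifecycle.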
Interview Q&A

What's the difference between a self-signed cert and a CA-signed cert? When would you use each?

A self-signed cert is signed by the same key it describes — browsers reject it because no root store vouches for the signer. Use them for local dev, internal testing, or private PKI where you control the trust anchor. A CA-signed cert has a trusted third party (DigiCert, Let's Encrypt) verify the domain (DV), organization (OV), or identity (EV) before signing — required for anything public-facing.

· · ·
Part 2 — Chain of Trust

Root CA → Intermediate CA → Leaf

No browser trusts a leaf certificate directly. Trust flows through a chain because Root CAs stay offline and never issue directly to websites — Intermediates carry the operational risk.

[Figure: chain of trust; the browser walks up from leaf to root to verify 🔒]

ROOT CA (offline / air-gapped / HSM): DigiCert Global Root G2
  RSA 2048 · Self-signed · Valid 30+ yrs · In every OS trust store
    ↓ signs Intermediate CA cert
INTERMEDIATE CA (online / HSM-protected): DigiCert TLS RSA SHA256 2020 CA1
  Signed by Root G2 · Valid 10 years · Issues leaf certs daily
    ↓ signs Leaf cert (90-day)
LEAF / END-ENTITY CERT (served by web server): paddyspeaks.com
  P-256 · Signed by Intermediate · Valid 90 days · SAN: paddyspeaks.com
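The "walk up" step can be sketched as path building over issuer/subject names. This is a simplified model, assuming certificates are reduced to a hypothetical `CertRecord` with only subject and issuer; a real validator also verifies each signature, validity dates, and constraints, which this sketch deliberately omits.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CertRecord:
    """Hypothetical stand-in for a parsed certificate (names only, no crypto)."""
    subject: str
    issuer: str


def build_path(leaf: CertRecord, intermediates: List[CertRecord],
               trust_store: Dict[str, CertRecord],
               max_depth: int = 8) -> Optional[List[CertRecord]]:
    """Follow issuer links from the leaf until a trusted root is reached.

    Returns [leaf, ..., root], or None if the chain is broken or loops.
    """
    by_subject = {c.subject: c for c in intermediates}
    path = [leaf]
    current = leaf
    for _ in range(max_depth):
        if current.issuer in trust_store:
            path.append(trust_store[current.issuer])
            return path
        parent = by_subject.get(current.issuer)
        if parent is None or parent in path:
            return None  # missing intermediate, or a cycle
        path.append(parent)
        current = parent
    return None


root = CertRecord("DigiCert Global Root G2", "DigiCert Global Root G2")
inter = CertRecord("DigiCert TLS RSA SHA256 2020 CA1", "DigiCert Global Root G2")
leaf = CertRecord("paddyspeaks.com", "DigiCert TLS RSA SHA256 2020 CA1")

chain = build_path(leaf, [inter], {root.subject: root})
print([c.subject for c in chain])
```

Calling `build_path(leaf, [], {root.subject: root})` returns `None`, which is the "server forgot to send the intermediate" failure in miniature.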
Interview Q&A

Why do CAs use intermediate certificates instead of signing everything from the Root?

Root private keys are the crown jewels — compromise means browser vendors remove the root and millions of sites go dark. Keeping the root offline limits blast radius to an intermediate, which can be revoked and replaced without touching OS trust stores.

✅ Interviewer Expects at L5+

The interviewer wants blast radius management, not just "security." An intermediate compromise is survivable; a root compromise requires browser vendors to push emergency updates to every device on Earth.

⚠️ Common Candidate Mistake

Saying "the certificate encrypts the connection." It does not; it authenticates the server. In TLS 1.3 the certificate's private key only signs the handshake (CertificateVerify). The symmetric session key (e.g. AES-256-GCM) is derived from the ephemeral key_share exchange, and all actual data travels under that symmetric key.

· · ·
Part 3 — Trust Stores

Where Certificates Are Stored

Knowing which trust store an application reads is what explains failures like "cert works in Chrome but not in curl": each runtime maintains its own anchor list.

Platform / Store | Location / Tool | Who reads it | User modifiable?
macOS Keychain | System Keychain (/Library/Keychains/System.keychain) + System Roots | Safari, curl (via SecureTransport), most apps | Admin can add to System; users to Login
macOS Login Keychain | ~/Library/Keychains/login.keychain-db | User-space apps, Keychain Access.app | Yes
Windows Certificate Store | Registry-backed: Cert:\LocalMachine\Root, Cert:\CurrentUser\Root | IE, Edge, Chrome (on Windows), WinHTTP, .NET, PowerShell | Admin for LocalMachine; user for CurrentUser
Mozilla NSS | Firefox-bundled cert9.db in profile folder; certutil (libnss) | Firefox, Thunderbird, LibreOffice on Linux | Yes, per-profile
Linux System (ca-certificates) | /etc/ssl/certs/ca-certificates.crt (Debian/Ubuntu) or /etc/pki/tls/certs/ca-bundle.crt (RHEL) | OpenSSL-linked apps, curl, wget, Python requests | Root only; update-ca-certificates
JVM KeyStore (JKS/PKCS12) | $JAVA_HOME/lib/security/cacerts; custom via -Djavax.net.ssl.trustStore= | Java, Kafka, Elasticsearch, Hadoop | keytool (admin)
Node.js | Bundles its own Mozilla root store (npm config get cafile); bypasses OS | Node.js apps, npm | NODE_EXTRA_CA_CERTS env var
Kubernetes Secret | kubectl get secret tls-cert -n default; mounted at /etc/ssl/certs/ | Pods that mount the secret | Yes, via kubectl / cert-manager
Browser (Chrome) | macOS/Windows: delegates to OS. Linux: bundles its own NSS ~/.pki/nssdb | Chrome, Chromium | Settings → Certificates; certutil on Linux
iOS / iPadOS | Settings → General → About → Certificate Trust Settings | Safari, native apps, WKWebView | Users can toggle trust; MDM can push roots
"The cert works in Chrome but not in curl" — 9 times out of 10 this is because Chrome delegates to the macOS System Keychain while curl uses OpenSSL's ca-certificates bundle, and the intermediate wasn't installed in the right store.
Interview Q&A

A Java microservice throws PKIX path building failed: unable to find valid certification path to requested target. The same endpoint works in curl. Walk me through your diagnosis.

Java uses its own cacerts keystore, not the OS bundle. The intermediate CA is almost certainly missing from the JVM store. Fix: (1) openssl s_client -connect host:443 -showcerts to dump the chain. (2) keytool -importcert -alias digicert-inter -file inter.crt -cacerts -storepass changeit to install it. In Kubernetes, update the trust bundle ConfigMap and restart pods — prefer injecting the CA bundle as a volume over modifying the base image.
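The same "which store is this runtime actually reading?" question can be asked of Python itself. The sketch below uses only the standard `ssl` module; the helper name `describe_default_trust` is made up for illustration, and the paths it prints depend on how the local OpenSSL was built.

```python
import ssl


def describe_default_trust() -> dict:
    """Report where this Python/OpenSSL build looks for trusted roots."""
    paths = ssl.get_default_verify_paths()
    return {
        "cafile": paths.cafile,                      # bundle file in use, or None
        "capath": paths.capath,                      # hashed cert directory, or None
        "cafile_env_var": paths.openssl_cafile_env,  # env var that overrides cafile
        "capath_env_var": paths.openssl_capath_env,  # env var that overrides capath
    }


for name, value in describe_default_trust().items():
    print(f"{name}: {value}")
```

Running this inside the failing container and comparing it with the JVM's cacerts path is a quick way to confirm that two runtimes on the same host trust different anchors.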

· · ·
Part 4 — The Protocol

The TLS 1.3 Handshake

47 ms: Median TLS 1.3 handshake (p50)
1 RTT: Round trips in TLS 1.3 (down from 2 in TLS 1.2)
0 bytes: HTTP sent until auth + key exchange complete
0-RTT: Session resumption with early data (TLS 1.3)

TLS 1.3 reduced the handshake to 1 round-trip. Every message exchanged:

TLS 1.3 Handshake (1 round trip)

CLIENT (browser)                              SERVER
────────────────                              ──────
ClientHello ──────────────►
  - supported versions: TLS 1.3               (TLS version, cipher suites,
  - key_share (X25519 ECDH pub)                key_share: client's ephemeral
  - supported_groups                           Diffie-Hellman public key,
  - signature_algorithms                       random bytes)

                ◄────────────── ServerHello
                                  - selected cipher: TLS_AES_256_GCM_SHA384
                                  - key_share: server's DH pub key

      ← BOTH SIDES NOW DERIVE session keys →
        (no more plaintext after this point)

                ◄ ─ ─ ─ ─ ─ ─   EncryptedExtensions (ALPN, SNI)
                ◄ ─ ─ ─ ─ ─ ─   Certificate (leaf + intermediates)
                ◄ ─ ─ ─ ─ ─ ─   CertificateVerify (server signs handshake hash)
                ◄ ─ ─ ─ ─ ─ ─   Finished (HMAC over entire handshake)

CLIENT verifies:
  1. Chain: leaf → intermediate → trusted root
  2. SANs match the hostname
  3. Cert not expired
  4. Signature on CertificateVerify valid
  5. OCSP staple or online check → not revoked
  6. CT SCTs present (Chrome requires 2+)

Finished ──────────────►                      (client confirms receipt)

═══════════════ Application Data (HTTP/2 frames, encrypted) ═══════════════

Total new RTTs: 1 (+ 0-RTT resumption possible)
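Step 2 of the client checks, matching the hostname against the SAN list, is easy to get subtly wrong with wildcards. The sketch below is a simplified matcher in the spirit of RFC 6125: a `*` may only stand for the entire leftmost label. The function names are illustrative, and internationalized names and IP SANs are out of scope.

```python
from typing import Iterable


def san_matches(hostname: str, san: str) -> bool:
    """Exact match, or a single leftmost '*' label covering exactly one label."""
    host_labels = hostname.lower().rstrip(".").split(".")
    san_labels = san.lower().rstrip(".").split(".")
    if len(host_labels) != len(san_labels):
        return False  # a wildcard never spans multiple labels
    if san_labels[0] == "*":
        # Require at least two fixed labels so "*.com" can never match.
        return len(san_labels) >= 3 and host_labels[1:] == san_labels[1:]
    return host_labels == san_labels


def hostname_in_sans(hostname: str, sans: Iterable[str]) -> bool:
    return any(san_matches(hostname, s) for s in sans)


sans = ["paddyspeaks.com", "*.paddyspeaks.com"]
print(hostname_in_sans("www.paddyspeaks.com", sans))      # True
print(hostname_in_sans("a.b.paddyspeaks.com", sans))      # False
```

This is why the example certificate lists both `paddyspeaks.com` and `*.paddyspeaks.com`: the wildcard alone does not cover the apex domain.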
Interview Q&A

What is OCSP stapling and why does it matter at DigiCert scale?

Without stapling, the browser calls DigiCert's OCSP responder per handshake — a privacy leak and latency hit at 8B checks/day. Stapling inverts this: the server pre-fetches a CA-signed OCSP response and attaches it to the TLS handshake, giving the browser revocation status at zero extra RTT. Nginx: ssl_stapling on; ssl_stapling_verify on;.

· · ·
Part 5 — Schema Design

The Certificate Issuance Database

The core schema tracks every issued cert — for billing, revocation, audit, and CA/B Forum compliance:

SQL — Core PKI Schema (PostgreSQL)
-- Certificate Authorities (roots + intermediates)
CREATE TABLE certificate_authority (
    ca_id           BIGSERIAL PRIMARY KEY,
    common_name     TEXT NOT NULL,
    subject_dn      TEXT NOT NULL,                      -- full distinguished name
    ca_type         TEXT NOT NULL CHECK (ca_type IN ('root','intermediate','issuing')),
    parent_ca_id    BIGINT REFERENCES certificate_authority(ca_id),
    public_key_sha256 BYTEA NOT NULL UNIQUE,            -- key fingerprint
    cert_pem        TEXT NOT NULL,
    valid_from      TIMESTAMPTZ NOT NULL,
    valid_until     TIMESTAMPTZ NOT NULL,
    is_active       BOOLEAN NOT NULL DEFAULT TRUE,
    hsm_slot_id     TEXT,                               -- HSM partition reference
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Issued certificates (leaf + subordinate CAs)
CREATE TABLE certificate (
    cert_id         BIGSERIAL PRIMARY KEY,
    serial_number   BYTEA NOT NULL,                     -- 20 bytes, CA-unique
    issuing_ca_id   BIGINT NOT NULL REFERENCES certificate_authority(ca_id),
    subject_cn      TEXT NOT NULL,
    subject_dn      TEXT NOT NULL,
    cert_type       TEXT NOT NULL CHECK (cert_type IN ('dv','ov','ev','code_sign','s_mime','client','device')),
    san_list        TEXT[] NOT NULL DEFAULT '{}',       -- DNS names, IPs, emails
    public_key_algo TEXT NOT NULL,                      -- RSA, EC, Ed25519
    key_size_bits   INT,                                -- 2048, 4096; null for EC
    ec_curve        TEXT,                               -- P-256, P-384; null for RSA
    public_key_sha256 BYTEA NOT NULL,
    cert_pem        TEXT NOT NULL,
    valid_from      TIMESTAMPTZ NOT NULL,
    valid_until     TIMESTAMPTZ NOT NULL,
    issued_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    customer_id     BIGINT NOT NULL,
    order_id        BIGINT NOT NULL,
    domain_validated_at TIMESTAMPTZ,                    -- DCV timestamp
    status          TEXT NOT NULL DEFAULT 'active'
                    CHECK (status IN ('active','revoked','expired','hold')),
    ct_log_ids      TEXT[] NOT NULL DEFAULT '{}',       -- SCT log IDs
    UNIQUE (issuing_ca_id, serial_number)
);

-- Revocation events
CREATE TABLE revocation (
    revocation_id   BIGSERIAL PRIMARY KEY,
    cert_id         BIGINT NOT NULL REFERENCES certificate(cert_id),
    revoked_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    reason_code     INT NOT NULL,                       -- RFC 5280 CRLReason
    reason_text     TEXT,
    requested_by    TEXT NOT NULL,                      -- customer / admin / auto
    ocsp_next_update TIMESTAMPTZ,                       -- OCSP response cache TTL
    crl_published_at TIMESTAMPTZ
);

-- Domain Control Validation
CREATE TABLE dcv_event (
    dcv_id          BIGSERIAL PRIMARY KEY,
    cert_id         BIGINT REFERENCES certificate(cert_id),
    domain          TEXT NOT NULL,
    method          TEXT NOT NULL CHECK (method IN ('http-01','dns-01','tls-alpn-01','email')),
    challenge_token TEXT NOT NULL,
    challenge_value TEXT NOT NULL,
    validated_at    TIMESTAMPTZ,
    expires_at      TIMESTAMPTZ NOT NULL,               -- reuse window (825 days → 90 days)
    ip_logged       INET                                -- requester IP for audit
);

-- OCSP response cache (hot read path)
CREATE TABLE ocsp_response_cache (
    cert_id         BIGINT NOT NULL REFERENCES certificate(cert_id),
    ca_id           BIGINT NOT NULL REFERENCES certificate_authority(ca_id),
    this_update     TIMESTAMPTZ NOT NULL,
    next_update     TIMESTAMPTZ NOT NULL,
    ocsp_status     TEXT NOT NULL CHECK (ocsp_status IN ('good','revoked','unknown')),
    signed_response BYTEA NOT NULL,                     -- DER-encoded OCSPResponse
    PRIMARY KEY (cert_id)
);

-- Indexes
-- (issuing_ca_id, serial_number) lookups already use the index behind the
-- UNIQUE constraint on certificate, so no separate idx_cert_serial is needed.
CREATE INDEX idx_cert_customer  ON certificate (customer_id, status, valid_until);
CREATE INDEX idx_cert_san       ON certificate USING GIN (san_list);
CREATE INDEX idx_cert_expiring  ON certificate (valid_until) WHERE status = 'active';
CREATE INDEX idx_rev_published  ON revocation (crl_published_at) WHERE crl_published_at IS NULL;
CREATE INDEX idx_ocsp_next_upd  ON ocsp_response_cache (next_update);
Interview Q&A

How would you design the serial number generation to guarantee uniqueness across a distributed CA cluster?

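The CA/B Forum Baseline Requirements (§7.1) require non-sequential serials containing at least 64 bits of CSPRNG output. Generate 20 random bytes (os.urandom(20)), clear the top bit so the DER INTEGER stays positive and within RFC 5280's 20-octet limit, and store the result as BYTEA. Insert with ON CONFLICT DO NOTHING against UNIQUE (issuing_ca_id, serial_number) and retry if no row was written; that catches the statistically negligible collision. Never use sequential integers: they leak issuance volume to anyone reading CT logs. A ULID-like scheme (48-bit timestamp + 80 bits randomness) gives monotonic sort order without leaking counts.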
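A minimal sketch of the random-serial approach, using the standard `secrets` module as the CSPRNG. The function name `new_serial_number` is hypothetical, and the collision handling (the database uniqueness check) is assumed to live elsewhere.

```python
import secrets


def new_serial_number() -> bytes:
    """Return a 20-byte, positive, non-zero certificate serial number.

    The top bit is cleared so the DER INTEGER encoding stays positive without
    needing a 21st padding byte; 159 random bits remain, far more than the
    64-bit minimum.
    """
    while True:
        raw = bytearray(secrets.token_bytes(20))
        raw[0] &= 0x7F
        if any(raw):  # serial must be greater than zero
            return bytes(raw)


serial = new_serial_number()
print(serial.hex(":"))
```

Each node in the CA cluster can call this independently; uniqueness then rests on the random space plus the UNIQUE (issuing_ca_id, serial_number) constraint rather than on any shared counter.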

· · ·
Part 6 — Issuance Pipeline

From CSR to Signed Certificate

Every certificate starts as a CSR (PKCS#10) generated by the applicant. The CA validates it, then an HSM-held private key signs and returns the certificate.

Applicant (e.g., customer ACME Corp or cert-manager operator)
  │
  │ 1. Generate key pair locally: openssl genrsa -out key.pem 2048
  │ 2. Create CSR: openssl req -new -key key.pem -out csr.pem
  │ 3. POST /certificate {csr_pem, order_id, dcv_method}
  │
  ▼
DigiCert RA (Registration Authority): validates identity
  │
  ├─ DV: Did they control the domain?
  │    ├─ http-01: GET /.well-known/acme-challenge/{token} → must return {token}.{account_thumbprint}
  │    ├─ dns-01: TXT _acme-challenge.example.com = base64url(sha256(keyAuthorization))
  │    └─ tls-alpn: TLS handshake on port 443 with acmeValidation-v1 OID in SAN
  │
  ├─ OV/EV: Additional org checks (WHOIS, phone, Dun & Bradstreet, LEI, GLEIF)
  │
  ├─ Policy checks: key size ≥ 2048 RSA / P-256+ EC; SHA-256+ only; SAN present
  │
  ├─ CT pre-certificate submission → get SCTs from 2+ independent logs (Google Argon, Cloudflare Nimbus)
  │
  └─ HSM sign: CA private key (never leaves HSM; PKCS#11 or AWS CloudHSM API)
  │
  ▼
Return signed cert_pem (leaf + intermediate bundle)
Write to DB: certificate table, CT log IDs, dcv_event.validated_at
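The "Policy checks" stage of the pipeline can be sketched as a pure function over a few CSR facts. This is an illustration only: `CsrSummary`, `policy_violations`, and the allowed-curve and allowed-hash sets are assumptions made for the example, not DigiCert's actual policy engine.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed policy sets, mirroring the thresholds named in the pipeline above.
ALLOWED_CURVES = {"P-256", "P-384"}
ALLOWED_HASHES = {"SHA256", "SHA384", "SHA512"}


@dataclass
class CsrSummary:
    """Hypothetical digest of the fields a policy check needs from a CSR."""
    key_algo: str                      # "RSA" or "EC"
    signature_hash: str                # e.g. "SHA256"
    key_bits: Optional[int] = None     # RSA only
    ec_curve: Optional[str] = None     # EC only
    sans: List[str] = field(default_factory=list)


def policy_violations(csr: CsrSummary) -> List[str]:
    """Return human-readable reasons to reject; an empty list means pass."""
    problems = []
    if csr.key_algo == "RSA":
        if (csr.key_bits or 0) < 2048:
            problems.append("RSA key shorter than 2048 bits")
    elif csr.key_algo == "EC":
        if csr.ec_curve not in ALLOWED_CURVES:
            problems.append(f"EC curve not allowed: {csr.ec_curve}")
    else:
        problems.append(f"unsupported key algorithm: {csr.key_algo}")
    if csr.signature_hash.upper() not in ALLOWED_HASHES:
        problems.append(f"weak signature hash: {csr.signature_hash}")
    if not csr.sans:
        problems.append("no subjectAltName entries")
    return problems


good = CsrSummary("EC", "SHA256", ec_curve="P-256", sans=["paddyspeaks.com"])
bad = CsrSummary("RSA", "SHA1", key_bits=1024)
print(policy_violations(good))
print(policy_violations(bad))
```

Returning every violation at once, rather than failing on the first, gives the applicant a single actionable rejection message.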
Python — Automated ACME issuance (cert-manager style logic)
import base64
import hashlib
import time

from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.x509.oid import NameOID

def generate_csr(domain: str, sans: list[str]) -> tuple[bytes, bytes]:
    """Generate EC P-256 private key + CSR. Returns (private_key_pem, csr_pem)."""
    key = ec.generate_private_key(ec.SECP256R1())

    csr = (
        x509.CertificateSigningRequestBuilder()
        .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, domain)]))
        .add_extension(
            x509.SubjectAlternativeName([x509.DNSName(s) for s in sans]),
            critical=False,
        )
        .sign(key, hashes.SHA256())
    )
    return (
        key.private_bytes(serialization.Encoding.PEM,
                          serialization.PrivateFormat.TraditionalOpenSSL,
                          serialization.NoEncryption()),
        csr.public_bytes(serialization.Encoding.PEM),
    )


def dns01_key_authorization(token: str, account_key_thumbprint: str) -> str:
    """Build the DNS TXT record value for dns-01 challenge."""
    key_auth = f"{token}.{account_key_thumbprint}"
    digest = hashlib.sha256(key_auth.encode()).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode()


def poll_order(acme_client, order_url: str, timeout: int = 120) -> dict:
    """Poll ACME order until valid or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        order = acme_client.get(order_url)
        if order["status"] == "valid":
            return order
        if order["status"] == "invalid":
            raise RuntimeError(f"ACME order invalid: {order.get('error')}")
        time.sleep(3)
    raise TimeoutError("ACME order did not complete in time")


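def build_cert_bundle(leaf_pem: bytes, intermediates: list[bytes]) -> bytes:
    """Concatenate leaf + chain (leaf first) for nginx ssl_certificate."""
    # Normalize trailing newlines so adjacent PEM blocks never run together.
    return b"".join(pem.strip() + b"\n" for pem in [leaf_pem, *intermediates])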


# ── Usage ──────────────────────────────────────────────────────────
if __name__ == "__main__":
    domain = "paddyspeaks.com"
    sans   = ["paddyspeaks.com", "www.paddyspeaks.com"]

    key_pem, csr_pem = generate_csr(domain, sans)
    print(csr_pem.decode())

    # In a real flow:
    # 1. Submit csr_pem to ACME newOrder → get challenge URLs
    # 2. Publish DNS TXT = dns01_key_authorization(token, thumbprint)
    # 3. Tell ACME server challenge is ready → poll_order(client, order_url)
    # 4. Download cert → build_cert_bundle(leaf, [intermediate_pem])
    # 5. Store key_pem in a secret (Vault, K8s Secret) — never log it
Interview Q&A

What is Certificate Transparency and why can't DigiCert issue a cert without it?

CT (RFC 6962; RFC 9162 specifies v2) is a public, append-only Merkle-tree log of issued certificates. Before Chrome accepts a cert, the CA must submit a pre-certificate to 2+ independent logs and embed the resulting Signed Certificate Timestamps (SCTs) in the final cert. If DigiCert or a rogue CA issues a cert for google.com without consent, Google's CT monitors can catch it as soon as the entry appears in a log. Before CT, incidents like DigiNotar (2011) went undetected for weeks.
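The "append-only Merkle tree" part can be made concrete with the Merkle Tree Hash from RFC 6962 §2.1: leaves are hashed with a 0x00 prefix, interior nodes with a 0x01 prefix, and the list is split at the largest power of two smaller than its length. The sketch below computes only the root; inclusion and consistency proofs are omitted, and the entry bytes are placeholders.

```python
import hashlib
from typing import List


def leaf_hash(entry: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: List[bytes]) -> bytes:
    """Merkle Tree Hash over an ordered list of log entries (RFC 6962 §2.1)."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    k = 1
    while k * 2 < n:  # largest power of two strictly less than n
        k *= 2
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))


log = [b"precert-1", b"precert-2", b"precert-3"]
before = merkle_root(log)
after = merkle_root(log + [b"precert-4"])
print(before.hex())
print(after.hex())
```

Because every entry feeds into the root, a log cannot quietly drop or rewrite a certificate without changing the signed tree head that monitors compare.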

· · ·
Part 7 — Revocation Infrastructure

CRL, OCSP, and the Revocation Crisis

When a private key is compromised, DigiCert must make the revocation available to every browser within 24 hours (CA/B Forum Baseline Requirements). Two mechanisms exist:

Certificate Revocation List (CRL)

A CA-signed list of revoked serial numbers, published as a file at the CRL Distribution Point URL. Clients download it periodically. It can grow very large for big CAs; DigiCert's intermediate CRLs are partitioned to stay under 10 MB. Update cycle: typically 24–48 hours. The Baseline Requirements allow up to 7 days between updates for CRLs covering subscriber certs and up to 12 months for CRLs covering CA certificates.

OCSP — Online Certificate Status Protocol

Real-time HTTP query per cert: "Is serial X from CA Y good or revoked?" DigiCert's OCSP responders are globally distributed via Anycast and target responses in under 75 ms; the Baseline Requirements themselves only require a response within 10 seconds under normal operating conditions. Signed responses are cached for 24–48 h. The OCSP URL is embedded in every leaf cert's AIA extension.

The Revocation Problem

Browsers mostly soft-fail on OCSP errors (if DigiCert's OCSP is down, the cert is accepted), which breaks the revocation guarantee. Chrome stopped making online OCSP/CRL checks by default in 2012 and relies on CRLSets, a curated revocation list pushed to every Chrome install. Firefox uses CRLite, a compact probabilistic filter covering revocations across CAs, plus OneCRL for intermediates.

CRLite / CRLSets

Google's CRLSets carry a curated subset of high-priority revocations; Mozilla's CRLite aims to cover every revoked serial among CT-logged certs in a compressed Bloom-filter cascade (~2 MB). Both are pushed to browsers through their update channels: zero OCSP latency, zero privacy leak, and they work offline. This is the future of revocation. CAs still need to publish revocations within 24 h, but clients no longer need a live OCSP call.
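To show why a filter can answer "is this serial revoked?" locally, here is a single plain Bloom filter keyed on (issuer, serial). It is a teaching sketch only: the class name and sizes are arbitrary, and a plain Bloom filter has false positives, which production designs such as CRLite remove by cascading several filters.

```python
import hashlib
from typing import Iterator


class RevocationFilter:
    """Plain Bloom filter over (issuer_id, serial) pairs. May report false positives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 5):
        if size_bits % 8:
            raise ValueError("size_bits must be a multiple of 8")
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, issuer_id: bytes, serial: bytes) -> Iterator[int]:
        for i in range(self.num_hashes):
            # Length-prefix the issuer so (issuer, serial) pairs cannot collide by concatenation.
            data = bytes([i]) + len(issuer_id).to_bytes(2, "big") + issuer_id + serial
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, issuer_id: bytes, serial: bytes) -> None:
        for pos in self._positions(issuer_id, serial):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_be_revoked(self, issuer_id: bytes, serial: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(issuer_id, serial))


flt = RevocationFilter()
flt.add(b"digicert-ca1", bytes.fromhex("a30f81c2"))
print(flt.might_be_revoked(b"digicert-ca1", bytes.fromhex("a30f81c2")))
print(len(flt.bits), "bytes shipped to the client")
```

A "no" from the filter is definitive, so the common case (cert not revoked) needs no network call at all; only a "maybe" would need a second-level check.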

Python — OCSP Request / Response (checking cert status)
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.x509 import ocsp
import urllib.request

def check_ocsp_status(cert_pem: bytes, issuer_pem: bytes) -> str:
    """Check OCSP status of a leaf cert against its issuer."""
    cert   = x509.load_pem_x509_certificate(cert_pem)
    issuer = x509.load_pem_x509_certificate(issuer_pem)

    # Build OCSP request
    builder = ocsp.OCSPRequestBuilder()
    builder = builder.add_certificate(cert, issuer, hashes.SHA256())
    req = builder.build()
    req_der = req.public_bytes(serialization.Encoding.DER)

    # Find OCSP URL from cert's AIA extension
    aia = cert.extensions.get_extension_for_class(x509.AuthorityInformationAccess)
    ocsp_url = next(
        ad.access_location.value
        for ad in aia.value
        if ad.access_method == x509.AuthorityInformationAccessOID.OCSP
    )

    # HTTP POST OCSP request
    http_req = urllib.request.Request(
        ocsp_url,
        data=req_der,
        headers={"Content-Type": "application/ocsp-request"},
        method="POST",
    )
    with urllib.request.urlopen(http_req, timeout=5) as resp:
        resp_der = resp.read()

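    # Parse OCSP response
    ocsp_resp = ocsp.load_der_ocsp_response(resp_der)

    # certificate_status raises ValueError unless the responder reported success
    if ocsp_resp.response_status != ocsp.OCSPResponseStatus.SUCCESSFUL:
        return f"RESPONDER_ERROR ({ocsp_resp.response_status.name})"

    # NOTE: this sketch does not verify the response signature or freshness
    # (this_update / next_update); a production check must do both.
    status_map = {
        ocsp.OCSPCertStatus.GOOD: "GOOD",
        ocsp.OCSPCertStatus.REVOKED: f"REVOKED (reason={ocsp_resp.revocation_reason})",
        ocsp.OCSPCertStatus.UNKNOWN: "UNKNOWN",
    }
    return status_map.get(ocsp_resp.certificate_status, "PARSE_ERROR")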


# ── Quick cert inspection utility ─────────────────────────────────
def inspect_cert(cert_pem: bytes) -> dict:
    """Return key fields from a PEM certificate."""
    cert = x509.load_pem_x509_certificate(cert_pem)
    try:
        san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
        sans = san.value.get_values_for_type(x509.DNSName)
    except x509.ExtensionNotFound:
        sans = []
    return {
        "subject":      cert.subject.rfc4514_string(),
        "issuer":       cert.issuer.rfc4514_string(),
        "serial":       hex(cert.serial_number),
        "not_before":   cert.not_valid_before_utc.isoformat(),
        "not_after":    cert.not_valid_after_utc.isoformat(),
        "sans":         sans,
        "key_algo":     cert.public_key().__class__.__name__,
        "fingerprint":  cert.fingerprint(hashes.SHA256()).hex(),
    }
⚠️ Common Candidate Mistake

Saying "we can just revoke the certificate" as a complete answer. Revocation is broken in practice: most browsers don't make live CRL/OCSP checks by default (too slow), and OCSP stapling helps but requires server support. The real answer is short certificate lifetimes. CA/B Forum is reducing max cert lifetime to 47 days by 2029 precisely because revocation is unreliable: a cert that expires quickly is far safer than a revoked one that browsers keep accepting.

· · ·
Part 8 — Kubernetes Architecture

PKI at Kubernetes Scale

Modern Kubernetes PKI runs through cert-manager (cert lifecycle), SPIFFE/SPIRE (workload identity), and service meshes like Istio / Linkerd that provide transparent mTLS between every pod — no manual cert management.

Layer 1 — Cert Lifecycle
cert-manager

The Kubernetes operator for certificate issuance. Watches Certificate CRDs, submits ACME challenges, renews 30 days before expiry, and writes results into Kubernetes Secrets.

  • Issuers: ACME (Let's Encrypt, DigiCert ACME), Vault PKI, DigiCert CertCentral API, self-signed, CA bundle
  • Supports DNS-01 via Route53, Cloud DNS, Cloudflare solvers
  • Rotation: automatic; zero-downtime via grace period + pod annotation
Layer 2 — Workload Identity
SPIFFE / SPIRE

Standard for cryptographic workload identity. Each pod gets a SPIFFE ID (spiffe://cluster/ns/default/sa/payments-svc) and a short-lived X.509 SVID (SPIFFE Verifiable Identity Document). No secrets in images or env vars.

  • SPIRE Server: validates node attestation (TPM, k8s SA token)
  • SPIRE Agent: DaemonSet; delivers SVIDs via Unix socket
  • SVIDs auto-rotate every 1 hour; apps re-read via Workload API
Layer 3 — Service Mesh mTLS
Istio / Linkerd

Transparent mutual TLS between every pod-to-pod connection. Sidecar proxies (Envoy in Istio, micro-proxy in Linkerd) intercept all traffic and perform TLS termination/origination, no app code changes required.

  • Istio's istiod (formerly Citadel) acts as internal CA; it signs the CSRs sent by the istio-agent running in each sidecar
  • Linkerd identity component issues P-256 certs valid 24 h, issued from a trust anchor
  • AuthorizationPolicy: deny unless cert SAN matches expected workload identity
Layer 4 — Secrets Management
HashiCorp Vault PKI

Vault's PKI secrets engine acts as a subordinate CA. Services call Vault's API to get short-lived certs (e.g., 24 h). No long-lived certs in etcd. Integrates with cert-manager via its Vault issuer type.

  • Audit log of every cert issued; correlate with workload identity
  • Dynamic credentials: cert tied to k8s service account token
  • Supports HSM backend for Vault's own CA key
YAML — cert-manager Certificate CRD (DigiCert ACME issuer)
---
# ClusterIssuer — DigiCert ACME endpoint
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: digicert-acme
spec:
  acme:
    server: https://acme.digicert.com/v2/acme/directory
    email: ops@example.com
    privateKeySecretRef:
      name: digicert-acme-account-key
    solvers:
      - dns01:
          route53:
            region: us-east-1
            accessKeyIDSecretRef:
              name: route53-creds
              key: access-key-id
            secretAccessKeySecretRef:
              name: route53-creds
              key: secret-access-key

---
# Certificate — requested certificate resource
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: paddyspeaks-tls
  namespace: default
spec:
  secretName: paddyspeaks-tls-secret     # written as kubernetes.io/tls Secret
  issuerRef:
    name: digicert-acme
    kind: ClusterIssuer
  commonName: paddyspeaks.com
  dnsNames:
    - paddyspeaks.com
    - "*.paddyspeaks.com"
  duration: 2160h                        # 90 days (DigiCert max)
  renewBefore: 720h                      # renew 30 days before expiry
  privateKey:
    algorithm: ECDSA
    size: 256                            # P-256
  usages:
    - digital signature
    - key encipherment
    - server auth

---
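# Ingress: serves the Secret written by the Certificate above
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: paddyspeaks-ingress
  annotations:
    # No cert-manager.io/cluster-issuer annotation here: the explicit Certificate
    # above already owns paddyspeaks-tls-secret, and the annotation would make
    # ingress-shim create a second Certificate for the same Secret.
    nginx.ingress.kubernetes.io/ssl-redirect: "true"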
spec:
  tls:
    - hosts:
        - paddyspeaks.com
      secretName: paddyspeaks-tls-secret
  rules:
    - host: paddyspeaks.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: paddyspeaks-app
                port:
                  number: 80
YAML — SPIFFE/SPIRE workload identity for microservices
---
# SPIRE Server — runs as StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: spire-server
  namespace: spire
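spec:
  serviceName: spire-server               # StatefulSets need a governing Service (name assumed)
  replicas: 1
  selector:
    matchLabels: { app: spire-server }
  template:
    metadata:
      labels: { app: spire-server }       # must match the selector above
    spec: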
      containers:
        - name: spire-server
          image: ghcr.io/spiffe/spire-server:1.9
          args:
            - -config
            - /run/spire/config/server.conf
          volumeMounts:
            - name: spire-config
              mountPath: /run/spire/config/
            - name: spire-data
              mountPath: /run/spire/data/

---
# SPIRE Agent — DaemonSet (one per node)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spire-agent
  namespace: spire
spec:
  selector:
    matchLabels: { app: spire-agent }
  template:
    metadata:
      labels: { app: spire-agent }          # must match the selector above
    spec:
      hostPID: true                         # lets the k8s workload attestor resolve caller PIDs
      containers:
        - name: spire-agent
          image: ghcr.io/spiffe/spire-agent:1.9
          volumeMounts:
            - name: spire-agent-socket
              mountPath: /run/spire/sockets  # Workload API socket
      volumes:
        - name: spire-agent-socket
          hostPath:
            path: /run/spire/sockets
            type: DirectoryOrCreate

---
# Registration Entry — bind k8s service account → SPIFFE ID
# kubectl exec -n spire spire-server-0 -- /opt/spire/bin/spire-server entry create \
#   -spiffeID spiffe://cluster.local/ns/default/sa/payments-svc \
#   -parentID spiffe://cluster.local/k8s-workload-registrar/node \
#   -selector k8s:ns:default \
#   -selector k8s:sa:payments-svc \
#   -x509SVIDTTL 3600    # 1-hour X.509-SVIDs
  • spiffe://cluster.local/ns/payments/sa/checkout-api (checkout-api service)
  • spiffe://cluster.local/ns/payments/sa/fraud-scorer (fraud-scorer service)
  • spiffe://cluster.local/ns/data/sa/warehouse-writer (warehouse-writer service)
  • spiffe://cluster.local/ns/infra/sa/kafka-consumer (kafka-consumer service)

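Each workload gets its own SPIFFE ID, derived here from its namespace and service account (pods that share a service account share the ID). An Istio AuthorizationPolicy can allow only checkout-api's SPIFFE ID to call fraud-scorer, enforcing zero-trust service-to-service auth: no passwords, no API keys, just cert identity.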
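The allow-list decision described above can be sketched in a few lines of Python. This only illustrates the check a policy engine performs once mTLS has authenticated the caller; it is not Istio's implementation. The `ALLOWED_CALLERS` table and both helper names are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical policy table: destination service -> SPIFFE IDs allowed to call it.
ALLOWED_CALLERS = {
    "fraud-scorer": {"spiffe://cluster.local/ns/payments/sa/checkout-api"},
}


def parse_spiffe_id(spiffe_id: str) -> tuple[str, str]:
    """Split a SPIFFE ID into (trust_domain, path); raise ValueError if malformed."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc or not parsed.path:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    return parsed.netloc, parsed.path


def is_allowed(destination: str, caller_spiffe_id: str) -> bool:
    """Allow the call only if the caller's exact SPIFFE ID is on the list."""
    try:
        parse_spiffe_id(caller_spiffe_id)
    except ValueError:
        return False
    # Unknown destinations have an empty allow-list, so the default is deny.
    return caller_spiffe_id in ALLOWED_CALLERS.get(destination, set())
```

The default-deny behaviour for unknown destinations is the design choice that matters here: a missing policy entry blocks traffic rather than opening it.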

Interview Q&A

cert-manager is renewing a cert but pods keep serving the old cert. What happened?

cert-manager writes the new cert into the Kubernetes Secret, but running pods don't automatically reload it. Three failure modes: (1) Volume mount: kubelet syncs the file eventually (60 s sync period by default, plus cache delay), but the app must re-read it; nginx reloads on SIGHUP. (2) Env var: env vars never update in running pods, so a restart is required. (3) Cached TLS context: Go's tls.Config.GetCertificate callback runs on every handshake, so it can return a freshly loaded cert, but a cert loaded once at startup (Go's tls.Config.Certificates, Java's SSLContext) stays in memory until the process rebuilds it. Fix: terminate TLS in Envoy/Istio, which rotates certs via xDS without a restart, or terminate at the ingress controller, which watches the Secret and picks up the new cert automatically.
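For an application that terminates TLS itself, failure mode (3) can be avoided by rebuilding the TLS context whenever the mounted file changes. The sketch below assumes the Secret is mounted as files; the `ReloadingContext` class and `build_server_context` helper are hypothetical names, and only standard-library calls are used.

```python
import os
import ssl
from typing import Callable


def build_server_context(cert_path: str, key_path: str) -> ssl.SSLContext:
    """Load the cert chain and key from disk into a fresh server-side context."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(cert_path, key_path)
    return ctx


class ReloadingContext:
    """Return a TLS context that is rebuilt when the cert file changes on disk."""

    def __init__(self, cert_path: str, key_path: str,
                 build: Callable[[str, str], object] = build_server_context):
        self.cert_path = cert_path
        self.key_path = key_path
        self._build = build
        self._mtime = None
        self._ctx = None

    def current(self):
        # os.stat follows symlinks, so the atomic symlink swap that kubelet
        # performs on Secret volumes shows up as a changed modification time.
        mtime = os.stat(self.cert_path).st_mtime_ns
        if mtime != self._mtime:
            self._ctx = self._build(self.cert_path, self.key_path)
            self._mtime = mtime
        return self._ctx
```

A server would call `current()` for each accepted connection and wrap the socket with the returned context, so new handshakes pick up the renewed cert while existing connections are left alone.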

· · ·
Part 9 — Designing at DigiCert Scale

The Architecture Decisions

Issuance API
Horizontally Scaled REST + ACME

Stateless issuance service pods behind a load balancer. The expensive work (DCV validation, HSM signing) is done async via a queue. Issuance latency budget: <300 ms for DV-auto (no human review). Key design: idempotent order IDs — if the client retries, return the same cert if DCV already passed.

  • Rate limiting: per-customer, per-domain (abuse prevention)
  • ACME replay-nonce: Postgres-backed nonce store, TTL 1 h
  • HSM pool: PKCS#11 load-balanced across 8 HSM partitions
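The idempotent order ID idea can be sketched as follows. This is a minimal illustration, assuming the ID is derived from the request contents; `make_order_id` and `OrderStore` are hypothetical names, and the in-memory dict stands in for a real order table.

```python
import hashlib


def make_order_id(account_id: str, csr_pem: str, domains: list[str]) -> str:
    """Derive a deterministic order ID from the request contents."""
    # Normalise the domain list so ordering and case do not create new orders.
    names = ",".join(sorted(d.lower() for d in domains))
    canonical = "\n".join([account_id, csr_pem.strip(), names])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]


class OrderStore:
    """In-memory stand-in for the order table (an assumption of this sketch)."""

    def __init__(self) -> None:
        self.orders: dict[str, dict] = {}

    def submit(self, account_id: str, csr_pem: str, domains: list[str]) -> dict:
        order_id = make_order_id(account_id, csr_pem, domains)
        # setdefault returns the existing order on a retry, so a client that
        # resubmits after DCV passed gets the same order and the same cert.
        return self.orders.setdefault(
            order_id, {"id": order_id, "status": "pending", "cert": None}
        )
```

Because the ID is a pure function of the request, any stateless pod behind the load balancer computes the same key for a retry without coordinating with the pod that handled the first attempt.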
OCSP Infrastructure
Anycast OCSP Responders

OCSP must respond in <75 ms globally. DigiCert runs OCSP responders in every major cloud region, behind BGP Anycast IPs. Responses are pre-signed and cached in Redis. Cache miss → look up ocsp_response_cache table → sign → store.

  • Cache TTL: 24–48 h (matches next_update in OCSP response)
  • Revocation write-through: revoke → immediately update Redis + publish to CRL queue
  • SLA: 99.99% availability (4 nines); OCSP downtime = soft-fail = security hole
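The read-through and write-through behaviour above can be sketched like this. Plain dicts stand in for Redis and the status table, and the `sign` callable stands in for the HSM-backed responder; the `OcspCache` class and its method names are hypothetical.

```python
import time
from typing import Callable


class OcspCache:
    """Cache pre-signed OCSP responses; refresh on miss, expiry, or revocation."""

    def __init__(self, sign: Callable[[str, str], bytes],
                 ttl_seconds: int = 24 * 3600,
                 clock: Callable[[], float] = time.time):
        self._sign = sign
        self._ttl = ttl_seconds
        self._clock = clock
        self._status: dict[str, str] = {}   # serial -> "good" | "revoked" (DB stand-in)
        self._cache: dict[str, tuple[float, bytes]] = {}  # serial -> (expiry, response)

    def issue(self, serial: str) -> None:
        self._status[serial] = "good"

    def get(self, serial: str) -> bytes:
        entry = self._cache.get(serial)
        if entry and entry[0] > self._clock():
            return entry[1]                  # cache hit: no signing needed
        return self._refresh(serial)         # miss or expired: look up, sign, store

    def revoke(self, serial: str) -> None:
        self._status[serial] = "revoked"
        self._refresh(serial)                # write-through: never serve stale "good"

    def _refresh(self, serial: str) -> bytes:
        status = self._status.get(serial, "unknown")
        response = self._sign(serial, status)
        self._cache[serial] = (self._clock() + self._ttl, response)
        return response
```

The write-through in `revoke` is the important part: with a 24 to 48 h TTL, waiting for natural expiry would leave a revoked cert reported as good for up to two days.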
CT Log Submission
Pre-cert → SCT → Final Cert

The CA must submit the precertificate to 2+ independent logs and receive SCTs before signing the final cert. Google Argon and Cloudflare Nimbus are the dominant logs. Each submission is an HTTP POST; the returned SCT is a small signed timestamp (roughly 120 bytes with an ECDSA P-256 signature). DigiCert pre-fetches SCTs in parallel before the final HSM signing step.

  • Retry logic: CT log outages require fallback to alternate logs
  • Log-qualified: SCTs must come from logs that browsers treat as qualified; logs are temporally sharded by expiry year, so each precert goes to the shard covering its notAfter date
  • Monitoring: CT monitor scrapes all logs, alerts on any cert for customer domains
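Parallel SCT collection with fallback to alternate logs can be sketched as below. The `submit` callable is a hypothetical stand-in for the HTTP POST to a log and is expected to enforce its own timeout; `collect_scts` and the log names used with it are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def collect_scts(precert: bytes, logs: list[str],
                 submit: Callable[[str, bytes], bytes],
                 required: int = 2) -> dict[str, bytes]:
    """Submit the precert to every candidate log in parallel and return the
    first `required` SCTs; raise if too few logs answer."""
    scts: dict[str, bytes] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(logs))) as pool:
        futures = {pool.submit(submit, log, precert): log for log in logs}
        for fut in as_completed(futures):
            try:
                scts[futures[fut]] = fut.result()
            except Exception:
                continue  # this log is down or slow: rely on the others
            if len(scts) >= required:
                break     # the pool still waits for in-flight calls on exit
    if len(scts) < required:
        raise RuntimeError(f"only {len(scts)} of {required} required SCTs obtained")
    return scts
```

Submitting to more logs than strictly required is what turns a single log outage into a non-event: the final signing step proceeds as soon as enough SCTs arrive.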
Compliance + Audit
WebTrust / ETSI Audit Trail

Every issuance event is immutably logged (append-only Kafka topic → S3 + Iceberg table). CA/B Forum Baseline Requirements mandate multi-year retention of audit logs (at least two years in current versions; earlier versions required seven). External auditors (e.g., KPMG for WebTrust) get read-only access to the audit trail. HSM audit logs are stored separately on write-once media.

  • CRL/OCSP SLA monitoring: alerting if CRL not refreshed within 90% of validity window
  • Misissuance detection: automated linting (pkilint, zlint) on every cert before issuance
  • Incident response: revoke subscriber certs within 24 h of confirmed key compromise (BR §4.9.1.1); a compromised subordinate CA certificate must be revoked within 7 days (BR §4.9.1.2)
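The CRL freshness alert in the first bullet reduces to a small time calculation. A minimal sketch, assuming the monitor already has the CRL's thisUpdate and nextUpdate timestamps; the function name and the default threshold parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def crl_refresh_overdue(this_update: datetime, next_update: datetime,
                        now: datetime, threshold: float = 0.9) -> bool:
    """Return True once `threshold` of the CRL validity window has elapsed
    without a newer CRL being published."""
    window = next_update - this_update
    if window <= timedelta(0):
        raise ValueError("next_update must be after this_update")
    elapsed = now - this_update
    # Dividing two timedeltas yields a plain float fraction of the window.
    return elapsed / window >= threshold
```

For a CRL with a 10-day window, this fires after 9 days, leaving one day to fix the publisher before relying parties see an expired CRL.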
Python — Cert expiry monitoring (Prometheus metrics style)
"""
Scan a list of hostnames, check TLS cert expiry, emit Prometheus metrics.
Run as a Kubernetes CronJob (writing to a Pushgateway or textfile collector)
or wrap in a long-running exporter Deployment.
"""
import ssl, socket, datetime, logging
from dataclasses import dataclass
from typing import Iterator

@dataclass
class CertInfo:
    hostname: str
    port: int
    subject_cn: str
    issuer: str
    not_after: datetime.datetime
    days_remaining: int
    serial: str

def check_cert(hostname: str, port: int = 443, timeout: int = 10) -> CertInfo:
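    # The default context verifies chain and hostname, so an expired or
    # untrusted cert raises here and is logged as a warning by scan_fleet.
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as ssock:
            info = ssock.getpeercert()

    # ssl.cert_time_to_seconds parses notAfter independently of the locale.
    not_after = datetime.datetime.fromtimestamp(
        ssl.cert_time_to_seconds(info["notAfter"]), tz=datetime.timezone.utc
    )
    now = datetime.datetime.now(datetime.timezone.utc)
    days = (not_after - now).days  # floors to whole days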

    subject = dict(x[0] for x in info["subject"])
    issuer  = dict(x[0] for x in info["issuer"])
    serial  = str(info.get("serialNumber", "unknown"))

    return CertInfo(
        hostname=hostname,
        port=port,
        subject_cn=subject.get("commonName", hostname),
        issuer=issuer.get("organizationName", "unknown"),
        not_after=not_after,
        days_remaining=days,
        serial=serial,
    )


def scan_fleet(hostnames: list[str]) -> Iterator[CertInfo]:
    for h in hostnames:
        try:
            yield check_cert(h)
        except Exception as e:
            logging.warning("cert check failed for %s: %s", h, e)


def emit_prometheus_metrics(results: list[CertInfo]) -> str:
    """Return Prometheus text format metrics."""
    lines = [
        "# HELP tls_cert_days_remaining Days until TLS certificate expires",
        "# TYPE tls_cert_days_remaining gauge",
    ]
    for r in results:
        label = f'hostname="{r.hostname}",issuer="{r.issuer}",serial="{r.serial}"'
        lines.append(f"tls_cert_days_remaining{{{label}}} {r.days_remaining}")
    return "\n".join(lines)


# ── Usage ─────────────────────────────────────────────────────────
if __name__ == "__main__":
    fleet = [
        "paddyspeaks.com", "google.com", "github.com",
        "api.stripe.com", "s3.amazonaws.com",
    ]
    results = list(scan_fleet(fleet))
    for r in sorted(results, key=lambda x: x.days_remaining):
        status = "CRITICAL" if r.days_remaining < 14 else ("WARNING" if r.days_remaining < 30 else "OK")
        print(f"[{status:8s}] {r.hostname:30s}  {r.days_remaining:3d} days  issuer={r.issuer}")
    print()
    print(emit_prometheus_metrics(results))
Sample Output (illustrative; sorted by days remaining, soonest expiry first)
[CRITICAL] internal-legacy-api.corp        11 days  issuer=Self-signed
[WARNING ] api.stripe.com                  28 days  issuer=DigiCert Inc
[OK      ] google.com                      58 days  issuer=Google Trust Services
[OK      ] github.com                      61 days  issuer=DigiCert Inc
[OK      ] paddyspeaks.com                 72 days  issuer=DigiCert Inc

# HELP tls_cert_days_remaining Days until TLS certificate expires
# TYPE tls_cert_days_remaining gauge
tls_cert_days_remaining{hostname="paddyspeaks.com",issuer="DigiCert Inc",...} 72
tls_cert_days_remaining{hostname="api.stripe.com",issuer="DigiCert Inc",...} 28
tls_cert_days_remaining{hostname="internal-legacy-api.corp",issuer="Self-signed",...} 11
Interview Q&A

The CA/B Forum is shortening max TLS cert lifetimes to 47 days by 2029 (200 days in 2026, 100 days in 2027). How does this change DigiCert's architecture?

Shorter lifetimes force automation first: manual renewal every 47 days is not viable at scale. For DigiCert: (1) the same cert population must be reissued far more often (398/47 ≈ 8.5× the issuance rate of one-year certs, more when clients renew early), so the issuance pipeline must scale with zero human review for DV; (2) ACME becomes mandatory in practice, because customers who don't automate face constant outages; (3) the DCV reuse window shrinks, pushing toward per-issuance domain validation. On the positive side, a stolen key is exploitable for at most 47 days, which reduces reliance on OCSP/CRL revocation infrastructure.
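The renewal load is easy to estimate with a little arithmetic. A rough sketch, assuming each cert is replaced after a fixed fraction of its lifetime; both function names and the two-thirds renewal point are illustrative assumptions, not figures from DigiCert.

```python
def issuances_per_year(lifetime_days: float, renew_at_fraction: float = 1.0) -> float:
    """Certs issued per hostname per year when each cert is replaced after
    `renew_at_fraction` of its lifetime has elapsed."""
    return 365 / (lifetime_days * renew_at_fraction)


def load_multiplier(old_lifetime_days: float, new_lifetime_days: float) -> float:
    """Factor by which issuance volume grows for a fixed cert population."""
    return old_lifetime_days / new_lifetime_days


if __name__ == "__main__":
    for old in (398, 90):
        print(f"{old}-day -> 47-day: {load_multiplier(old, 47):.1f}x issuance volume")
    print(f"47-day certs renewed at 2/3 lifetime: "
          f"{issuances_per_year(47, 2 / 3):.1f} issuances per hostname per year")
```

The multiplier depends heavily on the starting point: fleets already on 90-day ACME certs see roughly a doubling, while fleets on one-year certs see the largest jump.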

· · ·
When PKI Goes Wrong

Four Times the Certificate System Broke the Internet

Every architectural decision in this article traces back to one of these four failures.

2011

DigiNotar — The CA That Destroyed Itself

Iranian hackers issued 500+ fraudulent certs, including one for *.google.com that enabled MITM attacks on roughly 300,000 Gmail users, mostly in Iran. Mozilla, Google, and Microsoft removed DigiNotar from their root programs within days. The company was declared bankrupt in September 2011, within weeks of the breach becoming public.

💡 Lesson encoded in modern PKI: Certificate Transparency (CT) logs make every issued cert publicly auditable, typically within minutes and at most within a log's 24-hour maximum merge delay. A fraudulent cert for google.com would surface in Google's CT monitoring almost immediately today.
2017

Symantec Distrust — A Two-Year Slow-Motion Removal

Symantec (then the world's largest CA, ~30% of web certs) had misissued thousands of certificates, including test certs for domains it didn't own and certs issued without proper domain validation. Chrome 70 (Oct 2018) removed trust in all Symantec roots after a phased distrust. DigiCert acquired Symantec's PKI business in 2017 and had to re-issue the affected certs.

💡 Lesson encoded in modern PKI: Root programs require annual third-party audits (WebTrust or ETSI) against the CA/B Forum Baseline Requirements. The ballot process governs exactly what CAs can and cannot issue, and violations result in distrust, not warnings.
2021

Let's Encrypt Root Expiry — 1.5 Million Sites Briefly Broken

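The IdenTrust DST Root CA X3 root expired on Sep 30, 2021. Clients without the newer ISRG Root X1 in their trust store (older macOS and iOS releases, embedded and IoT devices) and clients built on OpenSSL 1.0.2 stopped trusting Let's Encrypt chains; Android <7.1.1 kept working only because of a specially extended cross-signature. Estimated 1.5M+ sites showed cert errors.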

💡 Lesson encoded in modern PKI: Short-lived certs (90 days) + automated renewal (ACME) mean the blast radius of any single cert failure is bounded. Devices with frozen OS trust stores are the long tail of PKI risk.
2022

TrustCor Removal — A National Security Distrust

All major root programs removed TrustCor after investigations revealed links between its parent companies and a US defense contractor involved in spyware. Unlike DigiNotar (hack) or Symantec (misissuance), this was a pure policy distrust — triggered by ownership, not technical failure.

💡 Lesson encoded in modern PKI: Root store membership is a privilege, not a right. The five root programs (Mozilla, Apple, Google/Chrome, Microsoft, Oracle Java) each independently gate who gets trusted. A CA can be technically perfect and still get removed.
✅ Interviewer Expects

At L6+, explain at least two CA distrust events, their causes, and the architectural responses. CT logs, ACME, and the CA/B Forum ballot process all exist because of specific, named failures.

· · ·
Putting It Together

The DigiCert Design Summary

PKI is the cryptographic backbone of all trust on the internet. Every lock icon, API token, and mTLS handshake traces back to a signed X.509 structure and a chain of organizations that agreed to trust each other.

X.509 v3
The Lingua Franca

Every cert — web TLS, code signing, email, IoT device, Kubernetes pod — is an X.509 structure with the same field layout. Learn to read a cert (serial, SAN, AKI, EKU, OCSP URL) and you can debug any TLS problem.

Chain
Trust is Delegated

Root → Intermediate → Leaf. Roots never go online. Intermediates carry operational risk. This is defense-in-depth: a compromised intermediate is contained and revocable without touching OS trust stores.

cert-manager
Automation Wins

Manual certificate management doesn't survive 90-day (soon 47-day) lifetimes. cert-manager + ACME is the answer for Kubernetes. SPIFFE/SPIRE + Istio is the answer for workload-to-workload identity.

HSM + CT
The Two Pillars of CA Trust

HSMs prevent private key exfiltration even under full server compromise. CT logs prevent silent misissuance. Together they are why browsers can trust a CA, and why a rogue issuance is detectable within minutes.

The padlock is not a feature. It is a ceremony — a chain of cryptographic signatures, a walk through a trust hierarchy, a query to a revocation service, and a key negotiation — all in under 50 milliseconds. DigiCert's job is to make that ceremony reliable for a billion connections a day.