Designing DigiCert
PKI, X.509 & Certificates
at Kubernetes Scale
From the byte layout of a certificate to a CA issuing a million certs a day on Kubernetes — the full stack, built out in schema, Python, cert-manager YAML, and real architecture decisions that keep the internet's TLS working.
Advanced / Staff-Level · L6–L7 Interview Prep
| Question | 60-second answer |
|---|---|
| Why certificates exist | Establish trust between strangers on the internet — prove a server is who it claims to be, and encrypt everything in transit. |
| Who issues them | Certificate Authorities (CAs) — trusted third parties like DigiCert, Let's Encrypt, or your internal private PKI. Root CAs are air-gapped; intermediates do the actual signing. |
| How trust works | Chain of trust: Root → Intermediate → Leaf. Your OS ships with ~150 trusted root CAs. A leaf cert is trusted if every signature in the chain traces back to one of them. |
| How certificates fail | Expiry (forgotten renewal), compromise (stolen private key), misissuance (CA error), or CA distrust (DigiNotar 2011, Symantec 2018). Any of these → red padlock. |
| Modern solution | Automated certificate management: ACME protocol + cert-manager on Kubernetes = zero-touch issuance, renewal every 60–90 days, no human in the loop. |
- Root CA stays offline. It signs intermediates once, then goes into a vault. Never exposed to network requests.
- Intermediate CA signs certs. Your leaf cert is signed by an intermediate, not the root. Chain: Root → Intermediate → Leaf.
- TLS validates chain + hostname + expiration. Browser checks all three in ~50 ms. Fail any one → red padlock.
- Revocation uses OCSP / CRL / CRLite. Certs can't be "unsigned" — revocation is a separate signal CAs broadcast to clients.
- Kubernetes automates everything. cert-manager + ACME = auto-issue, auto-renew, auto-rotate. Manual cert management doesn't scale past 50 services.
When your browser shows the padlock on https://paddyspeaks.com, a certificate was presented, a chain of trust was walked, and a symmetric session key was negotiated — all before a single byte of HTTP was exchanged. DigiCert is the infrastructure behind that padlock: issuing, renewing, revoking, and auditing hundreds of millions of certificates at sub-second latency, with zero tolerance for error.
What Is a Certificate?
A digital certificate is a signed data structure that binds a public key to an identity — like a passport stamped by a recognized authority. Certificates follow the X.509 v3 standard (RFC 5280):
What's the difference between a self-signed cert and a CA-signed cert? When would you use each?
A self-signed cert is signed by the same key it describes — browsers reject it because no root store vouches for the signer. Use them for local dev, internal testing, or private PKI where you control the trust anchor. A CA-signed cert has a trusted third party (DigiCert, Let's Encrypt) verify the domain (DV), organization (OV), or identity (EV) before signing — required for anything public-facing.
Root CA → Intermediate CA → Leaf
No browser trusts a leaf certificate directly. Trust flows through a chain because Root CAs stay offline and never issue directly to websites — Intermediates carry the operational risk.
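The chain walk can be sketched with a toy model. This is only an illustration: `ToyCert`, `build_chain`, and the subject/issuer strings are hypothetical names, and real path validation (RFC 5280) also verifies signatures, validity periods, key usage, and name constraints at every hop.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyCert:
    subject: str   # who this cert describes
    issuer: str    # who signed it

def build_chain(leaf: ToyCert, intermediates: list[ToyCert],
                trusted_roots: dict[str, ToyCert], max_depth: int = 8) -> list[ToyCert]:
    """Walk issuer links from the leaf until a trusted root is reached."""
    by_subject = {c.subject: c for c in intermediates}
    chain = [leaf]
    current = leaf
    for _ in range(max_depth):
        if current.issuer in trusted_roots:          # anchored in the root store
            return chain + [trusted_roots[current.issuer]]
        parent = by_subject.get(current.issuer)
        if parent is None or parent in chain:        # missing intermediate or a loop
            raise ValueError(f"no path to a trusted root from {current.subject!r}")
        chain.append(parent)
        current = parent
    raise ValueError("chain exceeds max_depth")

root = ToyCert("Example Root CA", "Example Root CA")
inter = ToyCert("Example Issuing CA", "Example Root CA")
leaf = ToyCert("paddyspeaks.com", "Example Issuing CA")
print([c.subject for c in build_chain(leaf, [inter], {root.subject: root})])
```

The failure branch mirrors the most common production problem: a server that forgets to send its intermediate leaves the client with no path to an anchor.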
Why do CAs use intermediate certificates instead of signing everything from the Root?
Root private keys are the crown jewels — compromise means browser vendors remove the root and millions of sites go dark. Keeping the root offline limits blast radius to an intermediate, which can be revoked and replaced without touching OS trust stores.
The interviewer wants blast radius management, not just "security." An intermediate compromise is survivable; a root compromise requires browser vendors to push emergency updates to every device on Earth.
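Saying "the certificate encrypts the connection." It does not: it authenticates the server. In TLS 1.3 the certificate's private key only signs the handshake transcript (CertificateVerify); the symmetric session keys (e.g., AES-256-GCM) come from an ephemeral (EC)DHE exchange, and all application data travels under those keys. Only the legacy RSA key exchange in TLS 1.2 and earlier used the cert's public key to encrypt a session secret.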
Where Certificates Are Stored
Knowing which trust store an application reads explains failures like "cert works in Chrome but not in curl": each runtime maintains its own anchor list.
| Platform / Store | Location / Tool | Who reads it | User modifiable? |
|---|---|---|---|
| macOS Keychain | System Keychain (/Library/Keychains/System.keychain) + System Roots | Safari, curl (via SecureTransport), most apps | Admin can add to System; users to Login |
| macOS Login Keychain | ~/Library/Keychains/login.keychain-db | User-space apps, Keychain Access.app | Yes |
| Windows Certificate Store | Registry-backed: Cert:\LocalMachine\Root, Cert:\CurrentUser\Root | IE, Edge, Chrome (on Windows), WinHTTP, .NET, PowerShell | Admin for LocalMachine; user for CurrentUser |
| Mozilla NSS | Firefox-bundled cert9.db in profile folder; certutil (libnss) | Firefox, Thunderbird, LibreOffice on Linux | Yes, per-profile |
| Linux System (ca-certificates) | /etc/ssl/certs/ca-certificates.crt (Debian/Ubuntu) or /etc/pki/tls/certs/ca-bundle.crt (RHEL) | OpenSSL-linked apps, curl, wget, Python requests | Root only; update-ca-certificates |
| JVM KeyStore (JKS/PKCS12) | $JAVA_HOME/lib/security/cacerts; custom via -Djavax.net.ssl.trustStore= | Java, Kafka, Elasticsearch, Hadoop | keytool (admin) |
| Node.js | Bundles its own Mozilla root store (npm config get cafile); bypasses OS | Node.js apps, npm | NODE_EXTRA_CA_CERTS env var |
| Kubernetes Secret | kubectl get secret tls-cert -n default; mounted at /etc/ssl/certs/ | Pods that mount the secret | Yes, via kubectl / cert-manager |
| Browser (Chrome) | macOS/Windows: delegates to OS. Linux: bundles its own NSS ~/.pki/nssdb | Chrome, Chromium | Settings → Certificates; certutil on Linux |
| iOS / iPadOS | Settings → General → About → Certificate Trust Settings | Safari, native apps, WKWebView | Users can toggle trust; MDM can push roots |
A Java microservice throws PKIX path building failed: unable to find valid certification path to requested target. The same endpoint works in curl. Walk me through your diagnosis.
Java uses its own cacerts keystore, not the OS bundle. The intermediate CA is almost certainly missing from the JVM store. Fix: (1) openssl s_client -connect host:443 -showcerts to dump the chain. (2) keytool -importcert -alias digicert-inter -file inter.crt -cacerts -storepass changeit to install it. In Kubernetes, update the trust bundle ConfigMap and restart pods — prefer injecting the CA bundle as a volume over modifying the base image.
The TLS 1.3 Handshake
TLS 1.3 reduced the handshake to 1 round-trip. Every message exchanged:
What is OCSP stapling and why does it matter at DigiCert scale?
Without stapling, the browser calls DigiCert's OCSP responder per handshake — a privacy leak and latency hit at 8B checks/day. Stapling inverts this: the server pre-fetches a CA-signed OCSP response and attaches it to the TLS handshake, giving the browser revocation status at zero extra RTT. Nginx: ssl_stapling on; ssl_stapling_verify on;.
The Certificate Issuance Database
The core schema tracks every issued cert — for billing, revocation, audit, and CA/B Forum compliance:
-- Certificate Authorities (roots + intermediates)
CREATE TABLE certificate_authority (
ca_id BIGSERIAL PRIMARY KEY,
common_name TEXT NOT NULL,
subject_dn TEXT NOT NULL, -- full distinguished name
ca_type TEXT NOT NULL CHECK (ca_type IN ('root','intermediate','issuing')),
parent_ca_id BIGINT REFERENCES certificate_authority(ca_id),
public_key_sha256 BYTEA NOT NULL UNIQUE, -- key fingerprint
cert_pem TEXT NOT NULL,
valid_from TIMESTAMPTZ NOT NULL,
valid_until TIMESTAMPTZ NOT NULL,
is_active BOOLEAN NOT NULL DEFAULT TRUE,
hsm_slot_id TEXT, -- HSM partition reference
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Issued certificates (leaf + subordinate CAs)
CREATE TABLE certificate (
cert_id BIGSERIAL PRIMARY KEY,
serial_number BYTEA NOT NULL, -- 20 bytes, CA-unique
issuing_ca_id BIGINT NOT NULL REFERENCES certificate_authority(ca_id),
subject_cn TEXT NOT NULL,
subject_dn TEXT NOT NULL,
cert_type TEXT NOT NULL CHECK (cert_type IN ('dv','ov','ev','code_sign','s_mime','client','device')),
san_list TEXT[] NOT NULL DEFAULT '{}', -- DNS names, IPs, emails
public_key_algo TEXT NOT NULL, -- RSA, EC, Ed25519
key_size_bits INT, -- 2048, 4096; null for EC
ec_curve TEXT, -- P-256, P-384; null for RSA
public_key_sha256 BYTEA NOT NULL,
cert_pem TEXT NOT NULL,
valid_from TIMESTAMPTZ NOT NULL,
valid_until TIMESTAMPTZ NOT NULL,
issued_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
customer_id BIGINT NOT NULL,
order_id BIGINT NOT NULL,
domain_validated_at TIMESTAMPTZ, -- DCV timestamp
status TEXT NOT NULL DEFAULT 'active'
CHECK (status IN ('active','revoked','expired','hold')),
ct_log_ids TEXT[] NOT NULL DEFAULT '{}', -- SCT log IDs
UNIQUE (issuing_ca_id, serial_number)
);
-- Revocation events
CREATE TABLE revocation (
revocation_id BIGSERIAL PRIMARY KEY,
cert_id BIGINT NOT NULL REFERENCES certificate(cert_id),
revoked_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
reason_code INT NOT NULL, -- RFC 5280 CRLReason
reason_text TEXT,
requested_by TEXT NOT NULL, -- customer / admin / auto
ocsp_next_update TIMESTAMPTZ, -- OCSP response cache TTL
crl_published_at TIMESTAMPTZ
);
-- Domain Control Validation
CREATE TABLE dcv_event (
dcv_id BIGSERIAL PRIMARY KEY,
cert_id BIGINT REFERENCES certificate(cert_id),
domain TEXT NOT NULL,
method TEXT NOT NULL CHECK (method IN ('http-01','dns-01','tls-alpn-01','email')),
challenge_token TEXT NOT NULL,
challenge_value TEXT NOT NULL,
validated_at TIMESTAMPTZ,
expires_at TIMESTAMPTZ NOT NULL, -- reuse window (825 days → 90 days)
ip_logged INET -- requester IP for audit
);
-- OCSP response cache (hot read path)
CREATE TABLE ocsp_response_cache (
cert_id BIGINT NOT NULL REFERENCES certificate(cert_id),
ca_id BIGINT NOT NULL REFERENCES certificate_authority(ca_id),
this_update TIMESTAMPTZ NOT NULL,
next_update TIMESTAMPTZ NOT NULL,
ocsp_status TEXT NOT NULL CHECK (ocsp_status IN ('good','revoked','unknown')),
signed_response BYTEA NOT NULL, -- DER-encoded OCSPResponse
PRIMARY KEY (cert_id)
);
-- Indexes
-- Lookups by (issuing_ca_id, serial_number) are already served by the index behind the UNIQUE constraint on certificate.
CREATE INDEX idx_cert_customer ON certificate (customer_id, status, valid_until);
CREATE INDEX idx_cert_san ON certificate USING GIN (san_list);
CREATE INDEX idx_cert_expiring ON certificate (valid_until) WHERE status = 'active';
CREATE INDEX idx_rev_published ON revocation (crl_published_at) WHERE crl_published_at IS NULL;
CREATE INDEX idx_ocsp_next_upd ON ocsp_response_cache (next_update);
How would you design the serial number generation to guarantee uniqueness across a distributed CA cluster?
The CA/B Forum Baseline Requirements (§7.1) require at least 64 bits of CSPRNG output per serial. Generate 20 random bytes (os.urandom(20)), clear the top bit so the DER INTEGER stays positive and within the 20-octet limit, store as BYTEA, and use ON CONFLICT DO NOTHING to detect the statistically-impossible collision. Never use sequential integers: they leak issuance volume to anyone reading CT logs. A ULID-like scheme (48-bit timestamp + 80 bits randomness) gives monotonic sort order without leaking counts.
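A minimal sketch of the random-serial approach, assuming a 20-byte serial stored big-endian. The function name `new_serial` is hypothetical; the top bit is cleared because a DER INTEGER with the high bit set would be read as negative.

```python
import secrets

def new_serial() -> bytes:
    """Return a 20-byte big-endian serial: positive, non-zero, 159 random bits."""
    while True:
        raw = bytearray(secrets.token_bytes(20))   # CSPRNG output
        raw[0] &= 0x7F                             # clear the sign bit
        if any(raw):                               # reject the all-zero value
            return bytes(raw)

serial = new_serial()
print(serial.hex())
```

Uniqueness across a distributed cluster then rests on the database constraint rather than on coordination between signers: each node generates independently, and the insert fails in the vanishingly rare case of a duplicate.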
From CSR to Signed Certificate
Every certificate starts as a CSR (PKCS#10) generated by the applicant. The CA validates the request, signs the certificate with an HSM-held private key, and returns it to the applicant.
import base64, hashlib, time
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.x509.oid import NameOID

def generate_csr(domain: str, sans: list[str]) -> tuple[bytes, bytes]:
    """Generate EC P-256 private key + CSR. Returns (private_key_pem, csr_pem)."""
    key = ec.generate_private_key(ec.SECP256R1())
    csr = (
        x509.CertificateSigningRequestBuilder()
        .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, domain)]))
        .add_extension(
            x509.SubjectAlternativeName([x509.DNSName(s) for s in sans]),
            critical=False,
        )
        .sign(key, hashes.SHA256())
    )
    return (
        key.private_bytes(serialization.Encoding.PEM,
                          serialization.PrivateFormat.TraditionalOpenSSL,
                          serialization.NoEncryption()),
        csr.public_bytes(serialization.Encoding.PEM),
    )

def dns01_key_authorization(token: str, account_key_thumbprint: str) -> str:
    """Build the DNS TXT record value for dns-01 challenge."""
    key_auth = f"{token}.{account_key_thumbprint}"
    digest = hashlib.sha256(key_auth.encode()).digest()
    # base64url without padding, as RFC 8555 requires
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode()

def poll_order(acme_client, order_url: str, timeout: int = 120) -> dict:
    """Poll ACME order until valid or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        order = acme_client.get(order_url)
        if order["status"] == "valid":
            return order
        if order["status"] == "invalid":
            raise RuntimeError(f"ACME order invalid: {order.get('error')}")
        time.sleep(3)
    raise TimeoutError("ACME order did not complete in time")

def build_cert_bundle(leaf_pem: bytes, intermediates: list[bytes]) -> bytes:
    """Concatenate leaf + chain for nginx ssl_certificate."""
    # Normalize trailing newlines so PEM blocks never run together
    return b"".join(p.rstrip(b"\n") + b"\n" for p in [leaf_pem, *intermediates])

# ── Usage ──────────────────────────────────────────────────────────
if __name__ == "__main__":
    domain = "paddyspeaks.com"
    sans = ["paddyspeaks.com", "www.paddyspeaks.com"]
    key_pem, csr_pem = generate_csr(domain, sans)
    print(csr_pem.decode())
    # In a real flow:
    # 1. Submit csr_pem to ACME newOrder → get challenge URLs
    # 2. Publish DNS TXT = dns01_key_authorization(token, thumbprint)
    # 3. Tell ACME server challenge is ready → poll_order(client, order_url)
    # 4. Download cert → build_cert_bundle(leaf, [intermediate_pem])
    # 5. Store key_pem in a secret (Vault, K8s Secret); never log it
What is Certificate Transparency and why can't DigiCert issue a cert without it?
CT (RFC 6962; version 2 is specified in RFC 9162) is a public, append-only Merkle-tree log of every certificate issued. Before Chrome accepts a cert, the CA must submit a pre-certificate to 2+ independent logs and embed the resulting Signed Certificate Timestamps (SCTs) in the final cert. If DigiCert or a rogue CA issues a cert for google.com without consent, Google's CT monitor catches it within minutes. Before CT, incidents like DigiNotar 2011 went undetected for weeks.
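What "append-only Merkle-tree log" buys a monitor can be sketched in a few lines. This is an illustration rather than a CT client: the hashing follows the RFC 6962 tree definition (0x00 prefix for leaves, 0x01 for interior nodes) and the verification loop follows the inclusion-proof algorithm described in RFC 9162, but the function names and the toy entries are assumptions.

```python
import hashlib

def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()

def _split(n: int) -> int:
    """Largest power of two strictly less than n (n >= 2)."""
    k = 1
    while k * 2 < n:
        k *= 2
    return k

def merkle_root(entries: list[bytes]) -> bytes:
    if not entries:
        return hashlib.sha256(b"").digest()
    if len(entries) == 1:
        return leaf_hash(entries[0])
    k = _split(len(entries))
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))

def audit_path(index: int, entries: list[bytes]) -> list[bytes]:
    """Sibling hashes needed to recompute the root from entries[index]."""
    if len(entries) <= 1:
        return []
    k = _split(len(entries))
    if index < k:
        return audit_path(index, entries[:k]) + [merkle_root(entries[k:])]
    return audit_path(index - k, entries[k:]) + [merkle_root(entries[:k])]

def verify_inclusion(entry: bytes, index: int, tree_size: int,
                     path: list[bytes], root: bytes) -> bool:
    if index >= tree_size:
        return False
    fn, sn = index, tree_size - 1
    r = leaf_hash(entry)
    for p in path:
        if sn == 0:
            return False
        if (fn & 1) or fn == sn:
            r = node_hash(p, r)
            if not (fn & 1):
                # skip levels where this node has no right sibling
                while (fn & 1) == 0 and fn != 0:
                    fn >>= 1
                    sn >>= 1
        else:
            r = node_hash(r, p)
        fn >>= 1
        sn >>= 1
    return sn == 0 and r == root

log = [f"precert-{i}".encode() for i in range(7)]
root = merkle_root(log)
print(verify_inclusion(log[4], 4, len(log), audit_path(4, log), root))
```

The proof is logarithmic in the log size, which is why a browser or monitor can check that a certificate is in a log of billions of entries without downloading the log.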
CRL, OCSP, and the Revocation Crisis
When a private key is compromised, DigiCert must make the revocation available to every browser within 24 hours (CA/B Forum Baseline Requirements). Two mechanisms exist:
Certificate Revocation List (CRL)
A CA-signed list of revoked serial numbers, published as a file at the CRL Distribution Point URL. Clients download it periodically. It can grow very large for big CAs, so DigiCert's intermediate CRLs are partitioned to stay under 10 MB. Update cycle: typically 24–48 hours; the Baseline Requirements allow up to 7 days for CRLs covering subscriber certs and up to 12 months for CRLs covering CA certificates.
OCSP — Online Certificate Status Protocol
Real-time HTTP query per cert: "Is serial X from CA Y good or revoked?" DigiCert's OCSP responders are globally distributed via Anycast and target responses in under 75 ms (the BRs themselves only require a response within 10 seconds under normal conditions). Signed responses are cached for 24–48 h. The OCSP URL is embedded in every leaf cert's AIA extension.
The Revocation Problem
Browsers mostly soft-fail on OCSP errors (if DigiCert's OCSP responder is unreachable, the cert is accepted). This breaks the revocation guarantee, because an attacker who can block OCSP traffic can keep using a revoked cert. Chrome stopped making online CRL/OCSP checks by default in 2012 and relies on CRLSets, a curated revocation list pushed to every Chrome browser. Firefox uses CRLite for leaf certs plus OneCRL for intermediates.
CRLite / CRLSets
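These are two push-based designs. Google's CRLSets ship a small, curated subset of revocations (emergency and high-impact cases) through Chrome's update channel. Mozilla's CRLite compresses the revocation status of every CT-logged cert into a cascade of Bloom filters a few megabytes in size, refreshed with frequent small updates. Both give zero OCSP latency, zero privacy leak, and work offline. This is the future of revocation: CAs still need to publish updated CRLs within 24 h of a revocation, but clients no longer need OCSP.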
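The building block behind the compressed-filter idea is an ordinary Bloom filter. The sketch below is a single-level filter with made-up parameters and serials, so it can return false positives; CRLite, as publicly described, stacks further filter levels to remove them, which this sketch does not attempt.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array with k hash positions per item."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k positions by salting SHA-256 with the hash index
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Hypothetical revoked serials, for illustration only
revoked = [bytes.fromhex("0a1b2c"), bytes.fromhex("deadbeef")]
flt = BloomFilter()
for serial in revoked:
    flt.add(serial)

print(bytes.fromhex("deadbeef") in flt)   # revoked serials always match
```

A revoked serial is never missed; an unrevoked one is occasionally flagged. That asymmetry is what makes the structure usable for revocation once the false positives are handled by a second stage.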
import urllib.request
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.x509 import ocsp
from cryptography.x509.oid import AuthorityInformationAccessOID

def check_ocsp_status(cert_pem: bytes, issuer_pem: bytes) -> str:
    """Check OCSP status of a leaf cert against its issuer."""
    cert = x509.load_pem_x509_certificate(cert_pem)
    issuer = x509.load_pem_x509_certificate(issuer_pem)
    # Build OCSP request
    builder = ocsp.OCSPRequestBuilder()
    builder = builder.add_certificate(cert, issuer, hashes.SHA256())
    req = builder.build()
    req_der = req.public_bytes(serialization.Encoding.DER)
    # Find OCSP URL from cert's AIA extension
    aia = cert.extensions.get_extension_for_class(x509.AuthorityInformationAccess)
    ocsp_url = next(
        ad.access_location.value
        for ad in aia.value
        if ad.access_method == AuthorityInformationAccessOID.OCSP
    )
    # HTTP POST OCSP request
    http_req = urllib.request.Request(
        ocsp_url,
        data=req_der,
        headers={"Content-Type": "application/ocsp-request"},
        method="POST",
    )
    with urllib.request.urlopen(http_req, timeout=5) as resp:
        resp_der = resp.read()
    # Parse OCSP response (note: the response signature is not verified here)
    ocsp_resp = ocsp.load_der_ocsp_response(resp_der)
    # certificate_status is only readable on a SUCCESSFUL response
    if ocsp_resp.response_status != ocsp.OCSPResponseStatus.SUCCESSFUL:
        return f"RESPONDER_ERROR ({ocsp_resp.response_status.name})"
    status_map = {
        ocsp.OCSPCertStatus.GOOD: "GOOD",
        ocsp.OCSPCertStatus.REVOKED: f"REVOKED (reason={ocsp_resp.revocation_reason})",
        ocsp.OCSPCertStatus.UNKNOWN: "UNKNOWN",
    }
    return status_map.get(ocsp_resp.certificate_status, "PARSE_ERROR")

# ── Quick cert inspection utility ─────────────────────────────────
def inspect_cert(cert_pem: bytes) -> dict:
    """Return key fields from a PEM certificate."""
    cert = x509.load_pem_x509_certificate(cert_pem)
    try:
        san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
        sans = san.value.get_values_for_type(x509.DNSName)
    except x509.ExtensionNotFound:
        sans = []
    return {
        "subject": cert.subject.rfc4514_string(),
        "issuer": cert.issuer.rfc4514_string(),
        "serial": hex(cert.serial_number),
        "not_before": cert.not_valid_before_utc.isoformat(),
        "not_after": cert.not_valid_after_utc.isoformat(),
        "sans": sans,
        "key_algo": cert.public_key().__class__.__name__,
        "fingerprint": cert.fingerprint(hashes.SHA256()).hex(),
    }
Saying "we can just revoke the certificate" as a complete answer. Revocation is broken in practice: most browsers don't make online CRL/OCSP checks at all (too slow), and OCSP stapling helps but requires server support. The real answer is short certificate lifetimes. The CA/B Forum is reducing the maximum TLS cert lifetime in steps to 47 days by March 2029 (ballot SC-081) precisely because revocation is unreliable: an expired cert is rejected by every client, while a revoked one may still be accepted by browsers that never check.
PKI at Kubernetes Scale
Modern Kubernetes PKI runs through cert-manager (cert lifecycle), SPIFFE/SPIRE (workload identity), and service meshes like Istio / Linkerd that provide transparent mTLS between every pod — no manual cert management.
cert-manager
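The Kubernetes operator for certificate issuance. Watches Certificate CRDs, solves ACME challenges, renews ahead of expiry (by default at two-thirds of the cert's lifetime; the manifests in this article set 30 days via renewBefore), and writes results into Kubernetes Secrets.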
- Issuers: ACME (Let's Encrypt, DigiCert ACME), Vault PKI, DigiCert CertCentral API, self-signed, CA bundle
- Supports DNS-01 via Route53, Cloud DNS, Cloudflare solvers
- Rotation: automatic; zero-downtime via grace period + pod annotation
SPIFFE / SPIRE
Standard for cryptographic workload identity. Each pod gets a SPIFFE ID (spiffe://cluster/ns/default/sa/payments-svc) and a short-lived X.509 SVID (SPIFFE Verifiable Identity Document). No secrets in images or env vars.
- SPIRE Server: validates node attestation (TPM, k8s SA token)
- SPIRE Agent: DaemonSet; delivers SVIDs via Unix socket
- SVIDs auto-rotate every 1 hour; apps re-read via Workload API
Istio / Linkerd
Transparent mutual TLS between every pod-to-pod connection. Sidecar proxies (Envoy in Istio, micro-proxy in Linkerd) intercept all traffic and perform TLS termination/origination, no app code changes required.
- Istio Citadel (now istiod) acts as internal CA; issues workload certs via CSR to API server
- Linkerd identity component issues P-256 certs valid 24 h, issued from a trust anchor
- AuthorizationPolicy: deny unless cert SAN matches expected workload identity
HashiCorp Vault PKI
Vault's PKI secrets engine acts as a subordinate CA. Services call Vault's API to get short-lived certs (e.g., 24 h). No long-lived certs in etcd. Integrates with cert-manager through its Vault issuer type.
- Audit log of every cert issued; correlate with workload identity
- Dynamic credentials: cert tied to k8s service account token
- Supports HSM backend for Vault's own CA key
---
# ClusterIssuer: DigiCert ACME endpoint
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: digicert-acme
spec:
  acme:
    server: https://acme.digicert.com/v2/acme/directory
    email: ops@example.com
    privateKeySecretRef:
      name: digicert-acme-account-key
    solvers:
      - dns01:
          route53:
            region: us-east-1
            accessKeyIDSecretRef:
              name: route53-creds
              key: access-key-id
            secretAccessKeySecretRef:
              name: route53-creds
              key: secret-access-key
---
# Certificate: requested certificate resource
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: paddyspeaks-tls
  namespace: default
spec:
  secretName: paddyspeaks-tls-secret   # written as kubernetes.io/tls Secret
  issuerRef:
    name: digicert-acme
    kind: ClusterIssuer
  commonName: paddyspeaks.com
  dnsNames:
    - paddyspeaks.com
    - "*.paddyspeaks.com"
  duration: 2160h      # 90 days requested; the issuing CA may cap or ignore it
  renewBefore: 720h    # renew 30 days before expiry
  privateKey:
    algorithm: ECDSA
    size: 256          # P-256
  usages:
    - digital signature
    - key encipherment
    - server auth
---
# Ingress: references the cert Secret for TLS termination
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: paddyspeaks-ingress
  annotations:
    cert-manager.io/cluster-issuer: "digicert-acme"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
    - hosts:
        - paddyspeaks.com
      secretName: paddyspeaks-tls-secret
  rules:
    - host: paddyspeaks.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: paddyspeaks-app
                port:
                  number: 80
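---
# SPIRE Server: runs as StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: spire-server
  namespace: spire
spec:
  replicas: 1
  serviceName: spire-server            # required field; assumes a headless Service with this name
  selector:
    matchLabels: { app: spire-server }
  template:
    metadata:
      labels: { app: spire-server }    # must match the selector above
    spec:
      containers:
        - name: spire-server
          image: ghcr.io/spiffe/spire-server:1.9
          args:
            - -config
            - /run/spire/config/server.conf
          volumeMounts:
            - name: spire-config
              mountPath: /run/spire/config/
            - name: spire-data
              mountPath: /run/spire/data/
      # volumes / volumeClaimTemplates for spire-config and spire-data are not shown here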
---
# SPIRE Agent: DaemonSet (one per node)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spire-agent
  namespace: spire
spec:
  selector:
    matchLabels: { app: spire-agent }
  template:
    metadata:
      labels: { app: spire-agent }     # must match the selector above
    spec:
      hostPID: true                    # lets the agent inspect calling workload processes for attestation
      containers:
        - name: spire-agent
          image: ghcr.io/spiffe/spire-agent:1.9
          volumeMounts:
            - name: spire-agent-socket
              mountPath: /run/spire/sockets   # Workload API socket
      volumes:
        - name: spire-agent-socket
          hostPath:
            path: /run/spire/sockets
            type: DirectoryOrCreate
---
# Registration Entry: bind k8s service account → SPIFFE ID
# kubectl exec -n spire spire-server-0 -- spire-server entry create \
#   -spiffeID spiffe://cluster.local/ns/default/sa/payments-svc \
#   -parentID spiffe://cluster.local/k8s-workload-registrar/node \
#   -selector k8s:ns:default \
#   -selector k8s:sa:payments-svc \
#   -ttl 3600   # 1-hour SVIDs
Each pod gets a unique SPIFFE ID. Istio AuthorizationPolicy can allow only checkout-api's SPIFFE ID to call fraud-scorer, enforcing zero-trust service-to-service auth — no passwords, no API keys, just cert identity.
cert-manager is renewing a cert but pods keep serving the old cert. What happened?
cert-manager writes the new cert into the Kubernetes Secret, but running pods don't automatically pick it up. Three failure modes: (1) Volume mount: kubelet syncs the file eventually (60 s default) but the app must re-read it; nginx reloads on SIGHUP. (2) Env var: env vars never update in running pods; restart required. (3) Cached TLS context: a Go server can serve a new cert per handshake if its tls.Config.GetCertificate callback re-reads the files, whereas a Java SSLContext built at startup keeps the old key material until it is rebuilt. Fix: deploy Envoy/Istio as TLS terminator and reload via xDS, or terminate at the ingress controller, which reloads automatically on Secret change.
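For the volume-mount case, the app-side fix is a small watcher that notices the file changed and rebuilds whatever holds the key material. A minimal sketch, assuming the Secret is mounted as two files; `CertReloader` and its callback are hypothetical names, and the sketch polls mtimes instead of using inotify.

```python
import os
from typing import Callable

class CertReloader:
    """Calls on_change(cert_path, key_path) whenever either file's mtime changes."""

    def __init__(self, cert_path: str, key_path: str,
                 on_change: Callable[[str, str], None]):
        self.cert_path = cert_path
        self.key_path = key_path
        self._on_change = on_change
        self._last_stamp = None

    def _stamp(self) -> tuple[int, int]:
        # os.stat follows symlinks, so the symlink swap kubelet performs
        # on a Secret volume shows up as a new mtime
        return (os.stat(self.cert_path).st_mtime_ns,
                os.stat(self.key_path).st_mtime_ns)

    def check(self) -> bool:
        """Poll once; returns True if a reload was triggered."""
        stamp = self._stamp()
        if stamp == self._last_stamp:
            return False
        self._on_change(self.cert_path, self.key_path)
        self._last_stamp = stamp
        return True
```

In a Python server the callback could call `load_cert_chain` on the listening `ssl.SSLContext`, so that handshakes started afterwards pick up the new pair; call `check()` from a timer every few seconds. Treat that wiring as an assumption to verify against your runtime.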
The Architecture Decisions
Horizontally Scaled REST + ACME
Stateless issuance service pods behind a load balancer. The expensive work (DCV validation, HSM signing) is done async via a queue. Issuance latency budget: <300 ms for DV-auto (no human review). Key design: idempotent order IDs — if the client retries, return the same cert if DCV already passed.
- Rate limiting: per-customer, per-domain (abuse prevention)
- ACME replay-nonce: Postgres-backed nonce store, TTL 1 h
- HSM pool: PKCS#11 load-balanced across 8 HSM partitions
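One way to get idempotent orders is to derive the order key from the request itself, so a retry maps to the same record. A sketch under assumptions: `order_idempotency_key` and `OrderStore` are hypothetical, and the in-memory dict stands in for a database table with a unique constraint on the key.

```python
import hashlib
import json
from typing import Callable

def order_idempotency_key(account_id: str, identifiers: list[str], csr_der: bytes) -> str:
    """Same account + same names + same CSR → same key, regardless of name order."""
    canonical = json.dumps(
        {
            "account": account_id,
            "identifiers": sorted(i.lower().rstrip(".") for i in identifiers),
            "csr_sha256": hashlib.sha256(csr_der).hexdigest(),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

class OrderStore:
    """In-memory stand-in for the orders table."""

    def __init__(self) -> None:
        self._orders: dict[str, dict] = {}

    def get_or_create(self, key: str, create: Callable[[], dict]) -> dict:
        # A retry with the same key returns the existing order instead of re-issuing
        if key not in self._orders:
            self._orders[key] = create()
        return self._orders[key]

store = OrderStore()
key = order_idempotency_key("acct-1", ["www.example.com", "example.com"], b"csr-bytes")
first = store.get_or_create(key, lambda: {"status": "pending"})
retry = store.get_or_create(key, lambda: {"status": "pending"})
print(first is retry)
```

Including the CSR hash means a retry with a new key pair is treated as a new order, which is usually what you want: the signed certificate binds a specific public key.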
Anycast OCSP Responders
OCSP must respond in <75 ms globally. DigiCert runs OCSP responders in every major cloud region, behind BGP Anycast IPs. Responses are pre-signed and cached in Redis. Cache miss → look up ocsp_response_cache table → sign → store.
- Cache TTL: 24–48 h (matches next_update in OCSP response)
- Revocation write-through: revoke → immediately update Redis + publish to CRL queue
- SLA: 99.99% availability (4 nines); OCSP downtime = soft-fail = security hole
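The two cache rules above (expire at next_update, overwrite immediately on revocation) can be sketched with an in-memory stand-in. `OcspCache` is a hypothetical class; a real deployment would hold signed DER responses in Redis, not status strings in a dict.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

class OcspCache:
    """Maps cert_id → (status, next_update); entries expire at next_update."""

    def __init__(self) -> None:
        self._entries: dict[int, tuple[str, datetime]] = {}

    def put(self, cert_id: int, status: str, next_update: datetime) -> None:
        self._entries[cert_id] = (status, next_update)

    def get(self, cert_id: int, now: Optional[datetime] = None) -> Optional[str]:
        now = now or datetime.now(timezone.utc)
        entry = self._entries.get(cert_id)
        if entry is None or entry[1] <= now:
            # Miss or stale: the caller falls back to the database and re-signs
            self._entries.pop(cert_id, None)
            return None
        return entry[0]

    def revoke(self, cert_id: int, now: Optional[datetime] = None,
               validity: timedelta = timedelta(hours=24)) -> None:
        # Write-through: replace any cached "good" answer right away
        now = now or datetime.now(timezone.utc)
        self.put(cert_id, "revoked", now + validity)
```

Without the write-through step, a cached "good" response would keep being served until its TTL ran out, which is exactly the window a revocation is supposed to close.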
Pre-cert → SCT → Final Cert
The CA must submit the pre-certificate to 2+ independent logs and receive SCTs before signing the final cert. Google Argon and Cloudflare Nimbus are the dominant logs. Each submission is an HTTP POST; the SCT that comes back is a small signed timestamp (on the order of 100 bytes). DigiCert collects SCTs from several logs in parallel before the HSM signing step.
- Retry logic: CT log outages require fallback to alternate logs
- Log-qualified: certs must remain qualified for their lifetime (5-year logs for 5-year EV)
- Monitoring: CT monitor scrapes all logs, alerts on any cert for customer domains
WebTrust / ETSI Audit Trail
Every issuance event is immutably logged (append-only Kafka topic → S3 + Iceberg table). CA/B Forum Baseline Requirements mandate 7-year retention. External auditors (e.g., KPMG for WebTrust) get read-only access to the audit trail. HSM audit logs are separately stored on write-once media.
- CRL/OCSP SLA monitoring: alerting if CRL not refreshed within 90% of validity window
- Misissuance detection: automated linting (pkilint, zlint) on every cert before issuance
- Incident response: revoke all certs from compromised intermediate within 24 h (BR §4.9.1.1)
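The CRL freshness alert reduces to one comparison. A minimal sketch, assuming the monitor has already parsed thisUpdate and nextUpdate from the published CRL; the function name and the 0.9 default are illustrative.

```python
from datetime import datetime, timedelta, timezone

def crl_is_stale(this_update: datetime, next_update: datetime,
                 now: datetime, threshold: float = 0.9) -> bool:
    """True once `threshold` of the CRL's validity window has elapsed."""
    window = next_update - this_update
    return now >= this_update + window * threshold

issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=10)
print(crl_is_stale(issued, expires, issued + timedelta(days=8)))
print(crl_is_stale(issued, expires, issued + timedelta(days=9, hours=12)))
```

Alerting at a fraction of the window instead of at expiry leaves time to re-sign and republish before any client sees an expired CRL.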
"""
Scan a list of hostnames, check TLS cert expiry, emit Prometheus metrics.
Deploy as a CronJob in Kubernetes or as a Prometheus exporter DaemonSet.
"""
import ssl, socket, datetime, logging
from dataclasses import dataclass
from typing import Iterator

@dataclass
class CertInfo:
    hostname: str
    port: int
    subject_cn: str
    issuer: str
    not_after: datetime.datetime
    days_remaining: int
    serial: str

def check_cert(hostname: str, port: int = 443, timeout: int = 10) -> CertInfo:
    # Default context verifies the chain and hostname; untrusted certs raise
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as ssock:
            info = ssock.getpeercert()
    not_after = datetime.datetime.strptime(
        info["notAfter"], "%b %d %H:%M:%S %Y %Z"
    ).replace(tzinfo=datetime.timezone.utc)
    now = datetime.datetime.now(datetime.timezone.utc)
    days = (not_after - now).days
    # getpeercert() returns names as nested tuples of (field, value) pairs
    subject = dict(x[0] for x in info["subject"])
    issuer = dict(x[0] for x in info["issuer"])
    serial = str(info.get("serialNumber", "unknown"))
    return CertInfo(
        hostname=hostname,
        port=port,
        subject_cn=subject.get("commonName", hostname),
        issuer=issuer.get("organizationName", "unknown"),
        not_after=not_after,
        days_remaining=days,
        serial=serial,
    )

def scan_fleet(hostnames: list[str]) -> Iterator[CertInfo]:
    for h in hostnames:
        try:
            yield check_cert(h)
        except Exception as e:
            logging.warning("cert check failed for %s: %s", h, e)

def emit_prometheus_metrics(results: list[CertInfo]) -> str:
    """Return Prometheus text format metrics."""
    lines = [
        "# HELP tls_cert_days_remaining Days until TLS certificate expires",
        "# TYPE tls_cert_days_remaining gauge",
    ]
    for r in results:
        label = f'hostname="{r.hostname}",issuer="{r.issuer}",serial="{r.serial}"'
        lines.append(f"tls_cert_days_remaining{{{label}}} {r.days_remaining}")
    return "\n".join(lines)

# ── Usage ─────────────────────────────────────────────────────────
if __name__ == "__main__":
    fleet = [
        "paddyspeaks.com", "google.com", "github.com",
        "api.stripe.com", "s3.amazonaws.com",
    ]
    results = list(scan_fleet(fleet))
    for r in sorted(results, key=lambda x: x.days_remaining):
        status = "CRITICAL" if r.days_remaining < 14 else ("WARNING" if r.days_remaining < 30 else "OK")
        print(f"[{status:8s}] {r.hostname:30s} {r.days_remaining:3d} days issuer={r.issuer}")
    print()
    print(emit_prometheus_metrics(results))
[CRITICAL] internal-legacy-api.corp        11 days issuer=Self-signed
[WARNING ] api.stripe.com                  28 days issuer=DigiCert Inc
[OK      ] google.com                      58 days issuer=Google Trust Services
[OK      ] github.com                      61 days issuer=DigiCert Inc
[OK      ] paddyspeaks.com                 72 days issuer=DigiCert Inc
# HELP tls_cert_days_remaining Days until TLS certificate expires
# TYPE tls_cert_days_remaining gauge
tls_cert_days_remaining{hostname="paddyspeaks.com",issuer="DigiCert Inc",...} 72
tls_cert_days_remaining{hostname="api.stripe.com",issuer="DigiCert Inc",...} 28
tls_cert_days_remaining{hostname="internal-legacy-api.corp",issuer="Self-signed",...} 11
The CA/B Forum is shortening max TLS cert lifetimes to 47 days by 2029. How does this change DigiCert's architecture?
Shorter lifetimes force automation first: manual renewal every 47 days is impossible at scale. For DigiCert: (1) the same cert count with several times the renewal frequency (up to roughly 8× for customers coming from 398-day certs) means the issuance pipeline must scale with zero human review for DV; (2) ACME becomes mandatory in practice, because customers who don't automate face constant outages; (3) the DCV reuse window shrinks as well (to 10 days under the same ballot), pushing toward per-issuance domain validation. On the positive side, a stolen key is only exploitable for 47 days max, reducing reliance on OCSP/CRL revocation infrastructure.
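The load increase is simple arithmetic. The sketch below assumes every cert is renewed at a fixed fraction of its lifetime (two-thirds here, an assumption, not a DigiCert figure) and compares lifetimes of 398, 90, and 47 days.

```python
def renewals_per_year(lifetime_days: int, renew_fraction: float = 2 / 3) -> float:
    """Issuances per cert per year if renewal happens at renew_fraction of the lifetime."""
    return 365 / (lifetime_days * renew_fraction)

baseline = renewals_per_year(398)
for days in (398, 90, 47):
    rate = renewals_per_year(days)
    print(f"{days:3d}-day certs: {rate:5.2f} issuances/cert/year "
          f"({rate / baseline:.1f}x the 398-day load)")
```

The multiplier depends only on the ratio of lifetimes, so the renewal fraction cancels out: a fleet moving from 398-day to 47-day certs issues about 398 / 47 times as often.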
Four Times the Certificate System Broke the Internet
Every architectural decision in this article traces back to one of these four failures.
DigiNotar — The CA That Destroyed Itself
Attackers attributed to Iran obtained 500+ fraudulent certs including *.google.com, enabling MITM attacks on roughly 300,000 Gmail users, most of them in Iran. Mozilla, Google, and Microsoft removed DigiNotar from all root programs within days of disclosure. The company was declared bankrupt within a month.
Symantec Distrust — A Two-Year Slow-Motion Removal
Symantec (then world's largest CA, ~30% of web certs) had misissued thousands of certificates — test certs for domains they didn't own, certs without proper DV. Chrome 70 (Oct 2018) removed all Symantec roots after a phased distrust. DigiCert acquired Symantec's PKI business and had to re-issue all affected certs.
Let's Encrypt Root Expiry — 1.5 Million Sites Briefly Broken
The IdenTrust DST Root CA X3 cross-signature expired Sep 30, 2021. Android <7.1.1 and IoT devices without the newer ISRG Root X1 stopped trusting Let's Encrypt certs. Estimated 1.5M+ sites showed cert errors.
TrustCor Removal — A National Security Distrust
All major root programs removed TrustCor after investigations revealed links between its parent companies and a US defense contractor involved in spyware. Unlike DigiNotar (hack) or Symantec (misissuance), this was a pure policy distrust — triggered by ownership, not technical failure.
At L6+, explain at least two CA distrust events, their causes, and the architectural responses. CT logs, ACME, and the CA/B Forum ballot process all exist because of specific, named failures.
The DigiCert Design Summary
PKI is the cryptographic backbone of all trust on the internet. Every lock icon, API token, and mTLS handshake traces back to a signed X.509 structure and a chain of organizations that agreed to trust each other.
The Lingua Franca
Every cert — web TLS, code signing, email, IoT device, Kubernetes pod — is an X.509 structure with the same field layout. Learn to read a cert (serial, SAN, AKI, EKU, OCSP URL) and you can debug any TLS problem.
Trust is Delegated
Root → Intermediate → Leaf. Roots never go online. Intermediates carry operational risk. This is defense-in-depth: a compromised intermediate is contained and revocable without touching OS trust stores.
Automation Wins
Manual certificate management doesn't survive 90-day (soon 47-day) lifetimes. cert-manager + ACME is the answer for Kubernetes. SPIFFE/SPIRE + Istio is the answer for workload-to-workload identity.
The Two Pillars of CA Trust
HSMs prevent private key exfiltration even under full server compromise. CT logs prevent silent misissuance. Together they are why browsers can trust a CA, and why a rogue issuance is detectable within minutes.