The AWS Problem

Weak answer	Strong answer
"Use EC2 and RDS"	Names specific instance families, justifies with workload shape
Ignores AZ topology	3-AZ active-active, explains what fails if one AZ goes dark
"S3 for storage"	Maps workload to the right storage class with cost and retrieval tradeoff
Generic "use IAM"	Explains evaluation order: explicit Deny → SCP → Resource → Identity → Boundary
No DR plan	RPO/RTO targets, pilot light vs warm standby vs multi-region active-active

Service	Type	Primary use case	Throughput	Cost tier (per GB-month)
S3	Object	Static files, backups, data lake, ML datasets	Very high (parallel)	Lowest ($0.001–$0.023/GB)
EBS	Block	EC2 root volumes, databases, single-AZ apps	Up to 256K IOPS (io2 BE)	Medium ($0.08–$0.125/GB)
EFS	NFS (managed)	Shared filesystem across multiple EC2 / Lambda	10 GB/s burst	Higher ($0.30/GB)
FSx for Lustre	HPC filesystem	ML training I/O, HPC simulations	1 TB/s+	High ($0.14+/GB)
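
To make the cost column concrete, here is a rough sketch that prices a dataset at the per-GB figures from the table above. It assumes those figures are per GB-month list prices and ignores request, retrieval, and provisioned-throughput charges. The dictionary keys and function names are illustrative, not AWS identifiers.

```python
# Per-GB-month prices copied from the table above (assumed list prices;
# request, retrieval, and throughput charges are deliberately ignored).
PRICE_PER_GB_MONTH = {
    "s3_cheapest": 0.001,   # low end of the S3 range
    "s3_standard": 0.023,   # high end of the S3 range
    "ebs_low": 0.08,
    "ebs_high": 0.125,
    "efs": 0.30,
}


def monthly_cost(size_gb: float, tier: str) -> float:
    """Storage-only monthly cost in USD for size_gb on the given tier."""
    return size_gb * PRICE_PER_GB_MONTH[tier]


def cost_ratio(tier_a: str, tier_b: str) -> float:
    """How many times more expensive tier_a is than tier_b per GB."""
    return PRICE_PER_GB_MONTH[tier_a] / PRICE_PER_GB_MONTH[tier_b]


if __name__ == "__main__":
    for tier in PRICE_PER_GB_MONTH:
        print(f"{tier:12s} 10 TB -> ${monthly_cost(10_000, tier):,.2f}/month")
    print(f"S3 high vs low end: {cost_ratio('s3_standard', 's3_cheapest'):.0f}x")
```

The ratio helper is the useful part in an interview: stating the multiple between two tiers is usually more persuasive than quoting absolute prices.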

Entity	What it is	When to use
IAM User	Long-term credentials (access key + secret) for a human or app	Avoid for apps; use for CI/CD bootstrap only
IAM Role	Temporary credentials assumed by a principal (EC2, Lambda, human)	Always prefer roles over users for AWS services
IAM Group	Collection of users sharing the same policies	Organize human users; roles don't join groups
IAM Policy	JSON document specifying Allow/Deny on actions/resources	Attach to role, user, group, or resource
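
A minimal sketch of how Allow and Deny statements combine, under strong simplifying assumptions: a single account, no resource-based or session policies, no conditions or principals, and case-sensitive wildcard matching via `fnmatchcase` (real IAM matches action names case-insensitively). It models only three rules: an explicit Deny always wins, access needs an Allow, and guardrail layers (SCP, permissions boundary) can only narrow what the identity policy grants. The function names and policy shape are hypothetical; this is not the real IAM engine.

```python
from fnmatch import fnmatchcase
from typing import Optional

# A policy is a list of statements shaped like:
# {"Effect": "Allow" | "Deny", "Action": [patterns], "Resource": [patterns]}
Policy = list


def _matches(stmt: dict, action: str, resource: str) -> bool:
    """True if the statement's patterns cover this action and resource."""
    return any(fnmatchcase(action, a) for a in stmt["Action"]) and any(
        fnmatchcase(resource, r) for r in stmt["Resource"]
    )


def _has(effect: str, policy: Policy, action: str, resource: str) -> bool:
    """True if the policy contains a matching statement with this effect."""
    return any(
        s["Effect"] == effect and _matches(s, action, resource) for s in policy
    )


def is_allowed(
    action: str,
    resource: str,
    identity: Policy,
    scp: Optional[Policy] = None,
    boundary: Optional[Policy] = None,
) -> bool:
    layers = [p for p in (identity, scp, boundary) if p is not None]
    # Rule 1: an explicit Deny in any layer ends evaluation.
    if any(_has("Deny", p, action, resource) for p in layers):
        return False
    # Rules 2 and 3: every layer that is present must contain an Allow,
    # so guardrails can restrict but never grant on their own.
    return all(_has("Allow", p, action, resource) for p in layers)
```

For example, an identity policy allowing `s3:*` combined with a boundary allowing only `s3:GetObject` yields read access and nothing else, because both layers must agree.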

§ 09 — SCHEMA: Multi-tenant SaaS on AWS

Five tables that capture the resource tagging and cost allocation design for an AWS-managed multi-tenant SaaS platform.

-- Multi-tenant SaaS schema — AWS resource tracking & cost allocation

-- 1. Account: tenant root
CREATE TABLE account (
    account_id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aws_account_id      TEXT UNIQUE NOT NULL,   -- 12-digit AWS account
    name                TEXT NOT NULL,
    tier                TEXT NOT NULL CHECK (tier IN ('free','pro','enterprise')),
    primary_region      TEXT NOT NULL DEFAULT 'us-east-1',
    created_at          TIMESTAMPTZ NOT NULL DEFAULT now(),
    status              TEXT NOT NULL DEFAULT 'active'
);

-- 2. Region deployment: where a tenant is deployed
CREATE TABLE region_deployment (
    deployment_id       UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    account_id          UUID NOT NULL REFERENCES account(account_id),
    region              TEXT NOT NULL,   -- e.g. 'us-east-1', 'eu-west-1'
    vpc_id              TEXT NOT NULL,   -- AWS VPC ID
    is_primary          BOOLEAN NOT NULL DEFAULT false,
    deployed_at         TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (account_id, region)
);

-- 3. Resource: any AWS resource owned by a tenant
CREATE TABLE resource (
    resource_id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    account_id          UUID NOT NULL REFERENCES account(account_id),
    deployment_id       UUID NOT NULL REFERENCES region_deployment(deployment_id),
    resource_type       TEXT NOT NULL,   -- 'ec2_instance','rds_cluster','s3_bucket',...
    aws_resource_id     TEXT NOT NULL,   -- actual AWS ARN or ID
    name                TEXT,
    tags                JSONB NOT NULL DEFAULT '{}',  -- propagated to AWS tags
    created_at          TIMESTAMPTZ NOT NULL DEFAULT now(),
    deleted_at          TIMESTAMPTZ,
    UNIQUE (aws_resource_id)
);
CREATE INDEX resource_account_idx ON resource(account_id, resource_type);
CREATE INDEX resource_tags_idx    ON resource USING gin(tags);

-- 4. Cost allocation: daily cost per resource (from AWS Cost Explorer)
CREATE TABLE cost_allocation (
    cost_id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    resource_id         UUID NOT NULL REFERENCES resource(resource_id),
    account_id          UUID NOT NULL REFERENCES account(account_id),
    usage_date          DATE NOT NULL,
    service             TEXT NOT NULL,   -- 'AmazonEC2','AmazonS3','AWSLambda'
    usage_type          TEXT NOT NULL,   -- 'BoxUsage:c6i.xlarge','DataTransfer-Out'
    cost_usd            NUMERIC(14,6) NOT NULL,
    blended_rate        NUMERIC(14,8),
    usage_quantity      NUMERIC(14,4),
    UNIQUE (resource_id, usage_date, usage_type)  -- one row per resource, day, and usage type (a table can have only one PRIMARY KEY)
);
CREATE INDEX cost_account_date_idx ON cost_allocation(account_id, usage_date);

-- 5. Quota: per-account service limits
CREATE TABLE quota (
    quota_id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    account_id          UUID NOT NULL REFERENCES account(account_id),
    service             TEXT NOT NULL,   -- 'ec2','rds','lambda'
    quota_name          TEXT NOT NULL,   -- 'max_instances','storage_gb'
    limit_value         NUMERIC NOT NULL,
    current_value       NUMERIC NOT NULL DEFAULT 0,
    unit                TEXT NOT NULL DEFAULT 'count',
    updated_at          TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (account_id, service, quota_name)
);

Cost allocation rows flow from AWS Cost Explorer via an overnight Glue job. The tags JSONB column on resource mirrors the AWS tag set — tag propagation is enforced at deploy time so every dollar is attributable to a tenant.
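
One way the deploy-time tag enforcement described above could look is a pre-deploy check that rejects any resource record whose `tags` object lacks the keys used for cost attribution. This is a sketch only: the required keys `tenant_id` and `environment` are a hypothetical convention, and the record shape simply mirrors the `resource` table columns.

```python
# Hypothetical tagging convention; the source does not name the required keys.
REQUIRED_TAG_KEYS = frozenset({"tenant_id", "environment"})


def missing_tags(tags: dict) -> set:
    """Return the required keys that are absent or blank in tags."""
    return {k for k in REQUIRED_TAG_KEYS if not str(tags.get(k, "")).strip()}


def validate_resource(resource: dict) -> None:
    """Raise ValueError if a resource record cannot be attributed to a tenant.

    resource mirrors a row of the resource table: it needs at least
    'aws_resource_id' and a 'tags' mapping.
    """
    missing = missing_tags(resource.get("tags", {}))
    if missing:
        raise ValueError(
            f"{resource['aws_resource_id']}: missing required tags {sorted(missing)}"
        )


if __name__ == "__main__":
    validate_resource(
        {
            "aws_resource_id": "i-0123456789abcdef0",  # made-up instance ID
            "tags": {"tenant_id": "acme", "environment": "prod"},
        }
    )
    print("tags ok")
```

Failing the deploy is stricter than tagging after the fact, but it is what keeps untagged spend from ever appearing in `cost_allocation`.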

§ 10 — COMMON MISTAKES: What trips up even experienced engineers

These are the most common misconceptions that surface in AWS design interviews and real production incidents. Each one has cost teams real money or uptime.

❌ "EC2 is always cheaper than managed services"

RDS costs more per hour than self-hosting MySQL on EC2, but the premium buys automated backups, Multi-AZ failover, automated patching, minor version upgrades, and monitoring. Once you factor in engineering time, on-call burden, and failure recovery, total cost of ownership favors managed services for most teams. The comparison is not cost per instance-hour; it is cost per hour of reliable service, including the people needed to keep it running.

❌ "Multi-AZ = disaster recovery"

Multi-AZ protects against Availability Zone failures: hardware faults, power outages, and network partitions within a region. It does NOT protect against regional failures, data corruption from application bugs, or accidental deletes (which replicate instantly to the standby). Corruption and accidental deletes need point-in-time backups, because replication faithfully copies the bad write. True disaster recovery also requires a multi-region strategy: Aurora Global Database, DynamoDB Global Tables, S3 Cross-Region Replication, and a tested failover runbook with defined RPO and RTO targets.

❌ "Security groups are stateless"

Security groups are STATEFUL. If you allow inbound traffic on port 443, the return traffic on ephemeral ports is automatically allowed — you don't need an explicit outbound rule for it. NACLs are stateless: you must explicitly allow both the inbound request and the outbound response. Confusing these two causes misconfigured VPCs where either traffic is blocked unexpectedly or security is weaker than intended.
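
The difference is easier to see in a toy model. The sketch below is not how AWS implements either control; it is a deliberately tiny simulation with hypothetical class names, in which the stateful filter remembers each accepted inbound flow and lets the reply out, while the stateless filter checks every packet against its rules in isolation. The stateful filter is modeled with no outbound rules at all, to show that replies still pass.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Packet:
    src: str
    src_port: int
    dst: str
    dst_port: int


@dataclass
class StatefulFilter:
    """Security-group-like: tracks accepted inbound flows."""
    inbound_ports: set
    _flows: set = field(default_factory=set, init=False)

    def allow_inbound(self, p: Packet) -> bool:
        if p.dst_port in self.inbound_ports:
            self._flows.add((p.src, p.src_port, p.dst, p.dst_port))
            return True
        return False

    def allow_outbound(self, p: Packet) -> bool:
        # A reply is the tracked flow with source and destination swapped.
        return (p.dst, p.dst_port, p.src, p.src_port) in self._flows


@dataclass
class StatelessFilter:
    """NACL-like: every packet is judged only against explicit rules."""
    inbound_ports: set
    outbound_ports: Optional[range] = None  # None means no outbound rule

    def allow_inbound(self, p: Packet) -> bool:
        return p.dst_port in self.inbound_ports

    def allow_outbound(self, p: Packet) -> bool:
        return self.outbound_ports is not None and p.dst_port in self.outbound_ports


if __name__ == "__main__":
    request = Packet("203.0.113.7", 51000, "10.0.1.5", 443)
    reply = Packet("10.0.1.5", 443, "203.0.113.7", 51000)

    sg = StatefulFilter(inbound_ports={443})
    nacl = StatelessFilter(inbound_ports={443})
    print(sg.allow_inbound(request), sg.allow_outbound(reply))
    print(nacl.allow_inbound(request), nacl.allow_outbound(reply))
```

Giving the stateless filter `outbound_ports=range(1024, 65536)` is the toy equivalent of the ephemeral-port outbound rule a NACL needs.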

§ 11 — WHY NOT? AWS vs the alternatives

AWS is the default choice for most teams — but not for every team. Understanding the tradeoffs makes you a stronger systems designer and shows the interviewer you think beyond the happy path.

Choose AWS When
✓ Large ecosystem: 200+ services covering most use cases
✓ Largest community, documentation, and third-party tooling
✓ Best enterprise support tiers and compliance certifications (FedRAMP, HIPAA, SOC2)
✓ AWS-specific services with no drop-in equivalent elsewhere: DynamoDB, Bedrock, SageMaker, Aurora
Consider Alternatives When
✗ GCP for big data and ML workloads — BigQuery and Vertex AI are best-in-class
✗ Azure for Microsoft-stack enterprises — Active Directory integration and .NET tooling are native
✗ Cloudflare Workers for edge-first applications — global deployment with zero cold starts
✗ Regulatory constraints requiring data residency in countries where AWS has no region

§ 12 — ONE-MINUTE ANSWER: The elevator pitch for AWS

Practice this until you can deliver it cold. It covers the history, the insight, the mechanism, and the tradeoff — everything an interviewer needs to hear.

Interview Question
"Why did AWS change the industry?"
Strong Answer

Before AWS, launching a product required weeks of procurement, upfront capital for hardware, and teams dedicated to managing infrastructure. AWS introduced pay-as-you-go cloud computing starting with S3 (March 2006) and EC2 (August 2006), allowing a startup to serve global traffic with zero upfront cost.

The key insight was the shared responsibility model: AWS manages the hardware, power, cooling, and network; you manage the software and data. This shifted infrastructure from a capital expense to an operational one, so a startup could now scale from zero to millions of users without owning a single server.

The tradeoff: vendor lock-in and pricing complexity that requires dedicated FinOps expertise at scale. At $10M+/year of AWS spend, a company needs engineers who do nothing but optimize cloud costs.

§ 13 — INTERVIEWER'S MIND: What they're actually testing

AWS interviews rarely test rote knowledge of service names. They test depth of judgment. Here's what's behind each category of question.

Cost Awareness
Can you estimate rough costs before committing to a design? Do you know the difference between Reserved, On-Demand, and Spot pricing models — and when each is appropriate? Have you used AWS Cost Explorer or set up billing alarms?
Service Selection Judgment
When do you use SQS vs SNS vs EventBridge? RDS vs DynamoDB vs Aurora? Lambda vs ECS vs EC2? Can you articulate tradeoffs rather than just naming services?
Security Model
Can you explain IAM roles vs users vs policies and when to use each? What is the principle of least privilege and how do you enforce it at scale? What's the difference between a resource policy and an identity policy?
Architecture Patterns
Multi-AZ vs multi-region — what's the difference in failure scope? When does Lambda beat EC2 (and vice versa)? What makes a good VPC design — subnet sizing, NAT placement, security group layering?

§ 15 — WHAT'S NEXT? The frontier beyond AWS

Each generation of cloud infrastructure solved one hard problem — and created the conditions for the next one. Understanding this arc is what separates senior engineers from cloud operators.

AWS solved infrastructure provisioning → Then: containers (Docker + Kubernetes) solved packaging and portability across any cloud

Serverless solved per-function scaling → Then: AI APIs (OpenAI, Bedrock, Vertex) solved model deployment, with no GPU cluster required

Current frontier: AI-native infrastructure, meaning GPU clusters, vector databases (Pinecone, pgvector), inference optimization (quantization, speculative decoding), and observability for non-deterministic systems

The next problem: "How do you build and operate AI workloads at the cost of traditional software?" GPU hours cost 10–100× more than CPU hours. Inference latency is non-deterministic. Output quality degrades in ways that are hard to monitor. This is the unsolved infrastructure problem of 2024–2026.

§ 16 — Q&A: Twelve rapid-fire questions

1. S3 vs EBS vs EFS, when to use each?: S3 for object storage (files, backups, data lake): cheapest, effectively unlimited scale, HTTP access. EBS for block storage attached to a single EC2 instance (databases, OS volumes): low latency and provisioned IOPS. EFS for a shared POSIX filesystem across multiple EC2 instances or Lambda functions: you pay for what you store and it scales automatically, but at the list prices above it costs roughly 2.4–3.75× more per GB than EBS ($0.30 vs $0.08–$0.125).
2. How does Aurora differ from RDS?: Aurora uses a custom distributed storage layer that keeps 6 copies of the data across 3 AZs, with storage decoupled from compute. Standard RDS stores data on EBS volumes attached to the instance, with an optional synchronous standby in a second AZ (Multi-AZ). AWS claims up to 5× the throughput of stock MySQL and up to 3× that of stock PostgreSQL; Aurora failover typically completes in under 30 seconds versus roughly 60–120 seconds for RDS Multi-AZ.
3. What is VPC peering vs Transit Gateway?: VPC peering is a 1:1 private connection between two VPCs (non-transitive — A↔B and B↔C does not give A↔C). Transit Gateway is a hub-and-spoke that connects thousands of VPCs and on-premises networks transitively — use it when you have more than a handful of VPCs to connect.
4. How does Lambda handle concurrency?: Each concurrent request needs its own Lambda execution environment, so 1,000 concurrent invocations means 1,000 environments (1,000 is also the default concurrency limit per account per region). Reserved concurrency both guarantees and caps capacity for a function; Provisioned Concurrency pre-warms N environments to eliminate cold starts. Throttled synchronous invocations return HTTP 429; implement exponential backoff and put an SQS buffer in front.
5. What is the difference between SQS and SNS?: SQS is a pull-based queue: consumers poll for messages, messages persist until a consumer deletes them or the retention period expires (14-day maximum), and each message is processed by one consumer. SNS is a push-based topic: it publishes to all subscribers simultaneously (Lambda, SQS, HTTP, email) with no persistence. Use SNS→SQS fan-out for durable, parallel processing.
6. How do you secure an S3 bucket?: Turn on all four Block Public Access settings at the account level. Use bucket policies to restrict access to specific IAM principals or VPC endpoints. Enable server-side encryption (SSE-S3 or SSE-KMS). Enable versioning + MFA delete for critical data. Use IAM Access Analyzer for S3 to detect unintended public access, and CloudTrail S3 data events for audit.
7. What is an IAM role vs IAM user?: An IAM user has long-term static credentials (access key + secret) tied to a person or app — a credential leak is permanent until rotated. An IAM role has temporary credentials (15min–12hr) issued by STS, assumed by a trusted principal (EC2, Lambda, another account, federated user). Always prefer roles; users are a last resort for bootstrapping.
8. How does Auto Scaling work with ALB?: An Auto Scaling Group adds or removes EC2 instances based on CloudWatch metrics (CPU, custom, or ALB request count per target); ECS services scale their task count the same way through Application Auto Scaling. The ALB's target group registers and deregisters targets as capacity changes; connection draining (deregistration delay) lets in-flight requests finish before a removed instance is terminated.
9. What is CloudFront and when do you use it?: CloudFront is AWS's CDN. It caches responses at hundreds of edge locations worldwide, reducing latency and origin load. Use it for static assets (S3), API acceleration (API Gateway), streaming (S3 or MediaPackage), and DDoS mitigation (AWS Shield Standard is included automatically). CloudFront also terminates viewer TLS at the edge; set the origin protocol policy so the edge-to-origin hop is re-encrypted over HTTPS rather than sent in the clear.
10. How do you design for multi-region failover?: Define RPO and RTO first. Pilot light (data replicated to the DR region, compute switched off until failover, so near-zero compute cost) suits recovery targets of tens of minutes to hours; warm standby (a scaled-down but running copy in the DR region) suits targets of minutes; active-active (Route 53 latency/health routing, Aurora Global Database, DynamoDB Global Tables) suits targets of seconds. Always test failover with real exercises such as chaos engineering, not just diagrams.
11. What is the difference between NACLs and Security Groups?: Security Groups are stateful (allow inbound → return traffic automatically allowed), operate at the instance/ENI level, and support Allow rules only. NACLs are stateless (you must explicitly allow both inbound and return traffic), operate at the subnet level, and support both Allow and Deny — useful for blocking specific IP ranges at the perimeter.
12. How does DynamoDB handle hot partitions?: DynamoDB distributes data by partition key hash. A hot partition occurs when one key receives a disproportionate share of requests and exhausts the per-partition limit of 3,000 RCU or 1,000 WCU. Mitigations: add a random suffix (write sharding, then scatter-gather on reads), use DAX for read-heavy hot keys, use SQS buffering to absorb write bursts, or redesign the access pattern to spread load.
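
The write-sharding mitigation from question 12 can be sketched without any AWS dependency. The example below assumes a fixed shard count of 10 and a `base#shard` key format, both hypothetical choices, and uses a caller-supplied `fetch` callable as a stand-in for a DynamoDB Query against one partition key value. It is an illustration of the pattern, not a production client.

```python
import random

SHARD_COUNT = 10  # hypothetical; size it to the write rate you need to spread


def sharded_key(base_key: str, shard: int) -> str:
    """Partition key for one shard of a logical key, e.g. 'event-42#3'."""
    return f"{base_key}#{shard}"


def write_key(base_key: str, rng=random) -> str:
    """Pick a random shard so writes to one hot key spread across partitions."""
    return sharded_key(base_key, rng.randrange(SHARD_COUNT))


def all_read_keys(base_key: str) -> list:
    """Every shard key a reader must visit to see the whole logical key."""
    return [sharded_key(base_key, s) for s in range(SHARD_COUNT)]


def scatter_gather(base_key: str, fetch) -> list:
    """Read all shards and merge the results.

    fetch(key) returns a list of items; it stands in for a per-key query.
    """
    items = []
    for key in all_read_keys(base_key):
        items.extend(fetch(key))
    return items


if __name__ == "__main__":
    table = {}  # in-memory stand-in: partition key -> list of items
    for i in range(100):
        table.setdefault(write_key("event-42"), []).append(i)
    merged = scatter_gather("event-42", lambda k: table.get(k, []))
    print(len(merged), "items read back from", len(table), "shards")
```

The tradeoff is visible in the code: writes touch one shard, but every read fans out to all `SHARD_COUNT` keys, so the pattern suits write-heavy keys far better than read-heavy ones.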

§ 17 — SUMMARY: Weak vs strong answers

Topic	Weak answer	Strong answer
Compute	"Use EC2"	Names instance family by workload shape; mentions Graviton for cost; proposes containers (ECS/EKS) for operational leverage
Storage	"S3 for everything"	Maps access pattern to storage class; quantifies 23× cost delta Standard→Deep Archive; proposes Lifecycle policies
Networking	"Put it in a VPC"	Draws public/private subnet split; explains IGW vs NAT; distinguishes SG (stateful) from NACL (stateless)
Database	"Use RDS"	Uses decision tree by data shape; explains Aurora 6-copy storage vs RDS EBS; names DynamoDB for >10M ops/s
Serverless	"Lambda is cheap"	Explains cold start anatomy; proposes Provisioned Concurrency for latency-sensitive; SQS buffer for bursty ingress
IAM	"Use IAM policies"	Walks the 5-step evaluation order; recommends roles over users; mentions SCPs for account guardrails
Multi-region DR	"We have backups"	Defines RPO/RTO → selects pilot light / warm standby / active-active; names Aurora Global + DynamoDB Global Tables; plans chaos testing
Cost	Ignores cost	Mentions Savings Plans, Reserved for baseline; Spot for stateless batch; right-sizes with Graviton; Cost Explorer + tag propagation for attribution
