Design a multi-region, fault-tolerant SaaS platform on AWS. Compute families, S3 storage classes, VPC anatomy with subnets and security groups, the full database decision tree, Lambda cold-start mechanics, IAM policy evaluation flow, a multi-tenant schema, and twelve rapid-fire interview questions — every layer visual-first.
Before 2006, launching a software product meant buying servers months in advance. AWS didn't just reduce cost — it fundamentally changed who could build and ship software at scale.
"Design a multi-region, fault-tolerant SaaS platform on AWS. Walk me through your choices for compute, storage, networking, and database tiers — and how you'd handle DR, cost optimization, and security at scale."
| Weak answer | Strong answer |
|---|---|
| "Use EC2 and RDS" | Names specific instance families, justifies with workload shape |
| Ignores AZ topology | 3-AZ active-active, explains what fails if one AZ goes dark |
| "S3 for storage" | Maps workload to the right storage class with cost and retrieval tradeoff |
| Generic "use IAM" | Explains evaluation order: explicit Deny → SCP → resource policy → identity policy → permissions boundary |
| No DR plan | RPO/RTO targets, pilot light vs warm standby vs multi-region active-active |
AWS runs 33+ regions, each with 3+ isolated Availability Zones connected by private fiber. Edge Locations (400+) serve CloudFront CDN and Route 53 DNS.
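The reason three AZs are usually preferred over two comes down to capacity arithmetic: to survive the loss of one AZ, the remaining AZs must still carry peak load. The sketch below works that out; the function names and the 120-unit peak are illustrative assumptions, not figures from AWS.

```python
import math

def per_az_capacity(peak_units: int, az_count: int, az_failures_tolerated: int = 1) -> int:
    """Capacity each AZ must hold so the surviving AZs still cover peak load."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("need at least one surviving AZ")
    return math.ceil(peak_units / surviving)

def overprovision_ratio(peak_units: int, az_count: int) -> float:
    """Total provisioned capacity divided by peak demand."""
    return per_az_capacity(peak_units, az_count) * az_count / peak_units

# Hypothetical peak of 120 capacity units (instances, tasks, pods).
two_az = overprovision_ratio(120, 2)    # each of 2 AZs must hold the full peak
three_az = overprovision_ratio(120, 3)  # each of 3 AZs must hold half the peak
```

Under these assumptions, two AZs require provisioning twice the peak, while three AZs require one and a half times the peak for the same single-AZ fault tolerance.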
Choose the instance family by workload shape, not by reflex. The decision tree below saves you from paying for memory you don't use.
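One way to make "workload shape" concrete is the memory-to-vCPU ratio. Current-generation C, M, and R families ship roughly 2, 4, and 8 GiB per vCPU respectively, so a minimal sketch of the first branch of such a decision tree might look like this. The function name and cutoffs are assumptions for illustration; a real choice would also weigh network, storage, accelerators, and Graviton availability.

```python
def pick_family(vcpus: int, memory_gib: float) -> str:
    """Pick the leanest family whose memory-per-vCPU ratio covers the workload."""
    if vcpus <= 0 or memory_gib <= 0:
        raise ValueError("vcpus and memory_gib must be positive")
    ratio = memory_gib / vcpus
    if ratio <= 2:
        return "c"  # compute optimized, about 2 GiB per vCPU
    if ratio <= 4:
        return "m"  # general purpose, about 4 GiB per vCPU
    return "r"      # memory optimized, about 8 GiB per vCPU

# A service needing 4 vCPUs and 8 GiB fits a compute-optimized shape;
# the same CPU with 32 GiB of heap points at a memory-optimized one.
api_tier = pick_family(4, 8)
cache_tier = pick_family(4, 32)
```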
S3 is object storage, not a filesystem. Choose the storage class by access frequency: the per-GB storage price of Standard is roughly 23× that of Deep Archive, and the saving is paid for with retrieval latency measured in hours.
| Service | Type | Primary use case | Throughput | Cost tier (per GB-month) |
|---|---|---|---|---|
| S3 | Object | Static files, backups, data lake, ML datasets | Very high (parallel) | Lowest ($0.001–$0.023/GB) |
| EBS | Block | EC2 root volumes, databases, single-AZ apps | Up to 256K IOPS (io2 BE) | Medium ($0.08–$0.125/GB) |
| EFS | NFS (managed) | Shared filesystem across multiple EC2 / Lambda | 10 GB/s burst | Higher ($0.30/GB) |
| FSx for Lustre | HPC filesystem | ML training I/O, HPC simulations | 1 TB/s+ | High ($0.14+/GB) |
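The 23× figure for S3 can be reproduced from the price range in the table. The sketch below uses the two endpoints of that range ($0.023 and $0.001 per GB) as assumed per-GB-month prices for Standard and Deep Archive; the 100 TB data set is hypothetical, and request, retrieval, and minimum-duration charges are ignored.

```python
# Assumed storage prices in USD per GB-month, taken from the table's range.
STANDARD_PER_GB = 0.023
DEEP_ARCHIVE_PER_GB = 0.001

def monthly_storage_cost(size_gb: float, price_per_gb: float) -> float:
    """Storage-only cost; ignores request, retrieval, and early-deletion fees."""
    return size_gb * price_per_gb

ratio = STANDARD_PER_GB / DEEP_ARCHIVE_PER_GB

# Hypothetical 100 TB archive (100,000 GB).
standard_bill = monthly_storage_cost(100_000, STANDARD_PER_GB)
archive_bill = monthly_storage_cost(100_000, DEEP_ARCHIVE_PER_GB)
```

With these inputs the same data costs about $2,300 a month in Standard and about $100 a month in Deep Archive, which is why lifecycle policies matter for cold data.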
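A VPC is a logically isolated network. Public subnets route to the internet through an Internet Gateway; private subnets reach the internet outbound-only through a NAT gateway that itself sits in a public subnet. Security Groups are stateful (return traffic is allowed automatically); NACLs are stateless (both directions must be allowed explicitly).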
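The public/private split is easier to reason about with concrete CIDR blocks. The following sketch uses Python's standard `ipaddress` module to carve one public and one private subnet per AZ out of a VPC range. The `10.0.0.0/16` range, the `/20` subnet size, and the AZ names are illustrative assumptions.

```python
import ipaddress

def plan_subnets(vpc_cidr: str, azs: list[str], new_prefix: int = 20) -> dict[str, dict[str, str]]:
    """Assign one public and one private subnet per AZ, in address order."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=new_prefix))
    if len(blocks) < 2 * len(azs):
        raise ValueError("VPC range too small for two subnets per AZ")
    plan = {}
    for i, az in enumerate(azs):
        plan[az] = {
            "public": str(blocks[2 * i]),       # route table points at the Internet Gateway
            "private": str(blocks[2 * i + 1]),  # route table points at a NAT gateway
        }
    return plan

plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b", "us-east-1c"])
```

A `/16` split into `/20` blocks yields 16 subnets, so six are used here and ten remain free for data tiers or future AZs.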
AWS has 15+ managed database services. Start with the data shape, not the brand name.
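A "data shape first" decision can be sketched as a simple lookup. The mapping below is an assumed, deliberately coarse first cut that pairs common data shapes with the AWS service usually considered first for each; it is not the article's original decision tree, and real choices also depend on scale, consistency needs, and team familiarity.

```python
# Assumed first-pass mapping from data shape to a candidate AWS service.
DATA_SHAPE_TO_SERVICE = {
    "relational": "Aurora or RDS",
    "key_value": "DynamoDB",
    "document": "DocumentDB",
    "in_memory_cache": "ElastiCache",
    "graph": "Neptune",
    "time_series": "Timestream",
    "analytics_warehouse": "Redshift",
    "full_text_search": "OpenSearch Service",
}

def choose_database(data_shape: str) -> str:
    """Return the first candidate service for a data shape, or fail loudly."""
    try:
        return DATA_SHAPE_TO_SERVICE[data_shape]
    except KeyError:
        known = ", ".join(sorted(DATA_SHAPE_TO_SERVICE))
        raise ValueError(f"unknown data shape {data_shape!r}; expected one of: {known}")

orders_store = choose_database("relational")
session_store = choose_database("key_value")
```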
Lambda is billed per 1 ms of execution. The cold start penalty (typically 100–500 ms) only hits when a new execution environment is initialized; mitigate it with Provisioned Concurrency or SnapStart (launched for Java, since extended to Python and .NET).
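Per-millisecond billing can be turned into a quick cost estimate. The sketch below assumes commonly published x86 list prices ($0.0000166667 per GB-second and $0.20 per million requests); both constants are assumptions that should be checked against current pricing, and the free tier is ignored.

```python
import math

# Assumed list prices; verify against the current Lambda pricing page.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.20 / 1_000_000

def invocation_cost(duration_ms: float, memory_mb: int) -> float:
    """Cost of one invocation; duration is rounded up to the next millisecond."""
    billed_ms = math.ceil(duration_ms)
    gb_seconds = (memory_mb / 1024) * (billed_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST

def monthly_cost(invocations: int, avg_duration_ms: float, memory_mb: int) -> float:
    """Monthly bill for a function, ignoring free tier and data transfer."""
    return invocations * invocation_cost(avg_duration_ms, memory_mb)

# Hypothetical workload: 1M invocations per month, 100 ms each, 1024 MB memory.
estimate = monthly_cost(1_000_000, 100, 1024)
```

Under these assumptions the workload costs a little under two dollars a month, which shows why memory size and duration, not request count, usually dominate the bill.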
IAM evaluates policies in a defined order. An explicit Deny in any policy overrides any Allow. The full chain: explicit Deny → SCPs → resource-based policies → identity-based policies → permissions boundaries.
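The evaluation flow can be sketched as a small function. This is a simplified model of a same-account request that follows the flow chart in the AWS IAM documentation, under several loud assumptions: actions match by exact name (no wildcards, resources, or conditions), session policies are ignored, and an Allow in a resource-based policy is treated as sufficient on its own, which holds only for some principal types. The `Statement`, `allow`, `deny`, and `evaluate` names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Statement:
    effect: str         # "Allow" or "Deny"
    actions: frozenset  # exact action names; no wildcards in this sketch

def allow(*actions: str) -> Statement:
    return Statement("Allow", frozenset(actions))

def deny(*actions: str) -> Statement:
    return Statement("Deny", frozenset(actions))

def _has(policy, effect: str, action: str) -> bool:
    return any(s.effect == effect and action in s.actions for s in policy)

def evaluate(action, identity=(), resource=(), scp=None, boundary=None) -> str:
    """Return 'Allow', 'ExplicitDeny', or 'ImplicitDeny' for one action."""
    every = [*identity, *resource, *(scp or ()), *(boundary or ())]
    if _has(every, "Deny", action):            # 1. explicit Deny anywhere wins
        return "ExplicitDeny"
    if scp is not None and not _has(scp, "Allow", action):
        return "ImplicitDeny"                  # 2. the SCP must allow the action
    if _has(resource, "Allow", action):
        return "Allow"                         # 3. resource-based policy (simplified)
    if not _has(identity, "Allow", action):
        return "ImplicitDeny"                  # 4. identity-based policy must allow
    if boundary is not None and not _has(boundary, "Allow", action):
        return "ImplicitDeny"                  # 5. boundary caps identity permissions
    return "Allow"

reader = [allow("s3:GetObject")]
plain = evaluate("s3:GetObject", identity=reader)
capped = evaluate("s3:GetObject", identity=reader, boundary=[allow("s3:PutObject")])
```

The second call shows the point interviewers usually probe: an identity policy that allows an action is not enough when a permissions boundary does not also allow it.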
| Entity | What it is | When to use |
|---|---|---|
| IAM User | Long-term credentials (access key + secret) for a human or app | Avoid for apps; use for CI/CD bootstrap only |
| IAM Role | Temporary credentials assumed by a principal (EC2, Lambda, human) | Always prefer roles over users for AWS services |
| IAM Group | Collection of users sharing the same policies | Organize human users; roles don't join groups |
| IAM Policy | JSON document specifying Allow/Deny on actions/resources | Attach to role, user, group, or resource |
Five tables that capture the resource tagging and cost allocation design for an AWS-managed multi-tenant SaaS platform.
-- Multi-tenant SaaS schema — AWS resource tracking & cost allocation
-- 1. Account: tenant root
CREATE TABLE account (
account_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
aws_account_id TEXT UNIQUE NOT NULL, -- 12-digit AWS account
name TEXT NOT NULL,
tier TEXT NOT NULL CHECK (tier IN ('free','pro','enterprise')),
primary_region TEXT NOT NULL DEFAULT 'us-east-1',
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
status TEXT NOT NULL DEFAULT 'active'
);
-- 2. Region deployment: where a tenant is deployed
CREATE TABLE region_deployment (
deployment_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
account_id UUID NOT NULL REFERENCES account(account_id),
region TEXT NOT NULL, -- e.g. 'us-east-1', 'eu-west-1'
vpc_id TEXT NOT NULL, -- AWS VPC ID
is_primary BOOLEAN NOT NULL DEFAULT false,
deployed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (account_id, region)
);
-- 3. Resource: any AWS resource owned by a tenant
CREATE TABLE resource (
resource_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
account_id UUID NOT NULL REFERENCES account(account_id),
deployment_id UUID NOT NULL REFERENCES region_deployment(deployment_id),
resource_type TEXT NOT NULL, -- 'ec2_instance','rds_cluster','s3_bucket',...
aws_resource_id TEXT NOT NULL, -- actual AWS ARN or ID
name TEXT,
tags JSONB NOT NULL DEFAULT '{}', -- propagated to AWS tags
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
deleted_at TIMESTAMPTZ,
UNIQUE (aws_resource_id)
);
CREATE INDEX resource_account_idx ON resource(account_id, resource_type);
CREATE INDEX resource_tags_idx ON resource USING gin(tags);
-- 4. Cost allocation: daily cost per resource (from AWS Cost Explorer)
CREATE TABLE cost_allocation (
cost_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
resource_id UUID NOT NULL REFERENCES resource(resource_id),
account_id UUID NOT NULL REFERENCES account(account_id),
usage_date DATE NOT NULL,
service TEXT NOT NULL, -- 'AmazonEC2','AmazonS3','AWSLambda'
usage_type TEXT NOT NULL, -- 'BoxUsage:c6i.xlarge','DataTransfer-Out'
cost_usd NUMERIC(14,6) NOT NULL,
blended_rate NUMERIC(14,8),
usage_quantity NUMERIC(14,4),
UNIQUE (resource_id, usage_date, usage_type) -- one row per resource, day, and usage type; a table cannot declare two primary keys
);
CREATE INDEX cost_account_date_idx ON cost_allocation(account_id, usage_date);
-- 5. Quota: per-account service limits
CREATE TABLE quota (
quota_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
account_id UUID NOT NULL REFERENCES account(account_id),
service TEXT NOT NULL, -- 'ec2','rds','lambda'
quota_name TEXT NOT NULL, -- 'max_instances','storage_gb'
limit_value NUMERIC NOT NULL,
current_value NUMERIC NOT NULL DEFAULT 0,
unit TEXT NOT NULL DEFAULT 'count',
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (account_id, service, quota_name)
);
Cost allocation rows flow from AWS Cost Explorer via an overnight Glue job. The tags JSONB column on resource mirrors the AWS tag set — tag propagation is enforced at deploy time so every dollar is attributable to a tenant.
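To illustrate how the cost table answers "what did each tenant spend", here is a minimal sketch that rolls up cost by tenant and service. It uses Python's built-in `sqlite3` with a cut-down version of the `account` and `cost_allocation` tables (text IDs instead of UUIDs, fewer columns) so it runs anywhere; the tenant names and cost rows are made-up sample data, and `cost_by_tenant` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (account_id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE cost_allocation (
    resource_id TEXT NOT NULL,
    account_id  TEXT NOT NULL REFERENCES account(account_id),
    usage_date  TEXT NOT NULL,          -- ISO date, so string comparison sorts correctly
    service     TEXT NOT NULL,
    usage_type  TEXT NOT NULL,
    cost_usd    REAL NOT NULL,
    UNIQUE (resource_id, usage_date, usage_type)
);
""")
# Made-up sample rows for two hypothetical tenants.
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a1", "acme"), ("a2", "globex")])
conn.executemany(
    "INSERT INTO cost_allocation VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("r1", "a1", "2024-01-01", "AmazonEC2", "BoxUsage:c6i.xlarge", 12.50),
        ("r1", "a1", "2024-01-02", "AmazonEC2", "BoxUsage:c6i.xlarge", 7.50),
        ("r2", "a1", "2024-01-01", "AmazonS3", "TimedStorage-ByteHrs", 1.25),
        ("r3", "a2", "2024-01-01", "AWSLambda", "Lambda-GB-Second", 0.75),
    ],
)

def cost_by_tenant(conn, start: str, end: str):
    """Total cost per tenant and service for an inclusive date range."""
    return conn.execute(
        """
        SELECT a.name, c.service, ROUND(SUM(c.cost_usd), 2)
        FROM cost_allocation AS c
        JOIN account AS a ON a.account_id = c.account_id
        WHERE c.usage_date BETWEEN ? AND ?
        GROUP BY a.name, c.service
        ORDER BY a.name, c.service
        """,
        (start, end),
    ).fetchall()

report = cost_by_tenant(conn, "2024-01-01", "2024-01-02")
```

In the PostgreSQL schema the same query shape would be served by the `(account_id, usage_date)` index on `cost_allocation`.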
These are the most common misconceptions that surface in AWS design interviews and real production incidents. Each one has cost teams real money or uptime.
AWS is the default choice for most teams — but not for every team. Understanding the tradeoffs makes you a stronger systems designer and shows the interviewer you think beyond the happy path.
Practice this until you can deliver it cold. It covers the history, the insight, the mechanism, and the tradeoff — everything an interviewer needs to hear.
AWS interviews rarely test rote knowledge of service names. They test depth of judgment. Here's what's behind each category of question.
AWS didn't emerge in a vacuum. Understanding the full arc from physical servers to AI-native infrastructure helps you situate every architectural decision in its historical context.
Each generation of cloud infrastructure solved one hard problem — and created the conditions for the next one. Understanding this arc is what separates senior engineers from cloud operators.
| Topic | Weak answer | Strong answer |
|---|---|---|
| Compute | "Use EC2" | Names instance family by workload shape; mentions Graviton for cost; proposes containers (ECS/EKS) for operational leverage |
| Storage | "S3 for everything" | Maps access pattern to storage class; quantifies 23× cost delta Standard→Deep Archive; proposes Lifecycle policies |
| Networking | "Put it in a VPC" | Draws public/private subnet split; explains IGW vs NAT; distinguishes SG (stateful) from NACL (stateless) |
| Database | "Use RDS" | Uses decision tree by data shape; explains Aurora 6-copy storage vs RDS EBS; names DynamoDB for >10M ops/s |
| Serverless | "Lambda is cheap" | Explains cold start anatomy; proposes Provisioned Concurrency for latency-sensitive; SQS buffer for bursty ingress |
| IAM | "Use IAM policies" | Walks the 5-step evaluation order; recommends roles over users; mentions SCPs for account guardrails |
| Multi-region DR | "We have backups" | Defines RPO/RTO → selects pilot light / warm standby / active-active; names Aurora Global + DynamoDB Global Tables; plans chaos testing |
| Cost | Ignores cost | Mentions Savings Plans or Reserved Instances for baseline load; Spot for stateless batch; right-sizes with Graviton; Cost Explorer + tag propagation for attribution |