Design the data model behind Airbnb — search and availability, the request→accepted→stayed→reviewed lifecycle, cancellation refunds, host payouts and bilateral reviews. It reads like a CRUD app and hides a time machine: a pricing team must be able to ask what a listing cost on July 4th last year, as the site showed it on June 1st. One decision separates the senior answer — the calendar is not a state you mutate, it is a fact you snapshot. A full working-through: data flow, the dimensional schema, the snapshot-key invariant, append-only ingestion, the refund close, three analytics queries, and the dashboard that proves it.
Most candidates hear "model Airbnb" and reach for three tables and a status column. The question is harder than it sounds, and the difficulty is hiding in a single innocent-looking sentence near the end of the prompt: what was the price of this listing, on that night, as we showed it back then?
"Design a data model for Airbnb supporting search/availability lookup, the booking lifecycle — request → accepted → stayed → reviewed — cancellation-policy refund logic, host payouts after stay completion, bilateral guest-and-host reviews, and dynamic pricing, with full point-in-time queryability: what was the price of listing X on July 4 last year? How would you scope it?"
The prompt names five subsystems and tests whether you notice they pull in opposite directions. Search wants a fat, denormalized read path — "Paris, July 4–7, 2 guests" must scan millions of listing-night cells and return in milliseconds. The booking lifecycle wants a strict, transactional spine where a confirmed reservation is sacred and a refund is arithmetic, not opinion. And dynamic pricing wants history — not the current price, but the price as it stood on every past day, because that is the only data a pricing experiment can be measured against. A naïve schema can serve any one of these. A senior schema notices that the last requirement quietly rewrites the first two: the calendar cannot be a mutable grid of "is this night free and what does it cost." It has to be an accumulating record of what was true, and what was shown, on each day it was observed.
The weak answer is a calendar table keyed by (listing_id, night_date) that you UPDATE as prices change and nights get booked. It can answer "what is the price tonight" and nothing else. The instant a host nudges a price or an algorithm re-prices a weekend, yesterday's number is gone, and the pricing team's most important question becomes un-answerable. The strong answer treats the calendar the way a warehouse treats any fast-moving dimension that must be auditable: snapshot it. So before any boxes, the working frame for the session — three grains, three tempos:
Scope is the first scored dimension, so name it before any schema. In scope: the calendar-as-fact with daily snapshots, the booking lifecycle and its append-only modification chain, refund computation locked at cancel time, host payouts after stay, bilateral reviews, and multi-currency with the FX rate frozen at booking. Out of scope, stated aloud: the search ranking model itself (we feed it; we do not design it), the maps and geocoding stack, the messaging and payments rails beyond the entries that represent them, and fraud scoring (a downstream consumer of this model, not part of it). One discipline I will state up front: nothing in the booking spine is ever updated in place except the three timestamps that mark a state transition. Price changes, cancellations, modifications — all of them move the system forward by adding a row, never by erasing one.
Then the envelope math, volunteered rather than extracted. Airbnb-shaped numbers, and one row that reshapes the whole design:
| Quantity | Estimate | Consequence |
|---|---|---|
| Active listings | ~7 M | The width of every calendar snapshot |
| Future nights tracked / listing | ~365 | How far the bookable horizon extends |
| Calendar rows / day | 7M × 365 ≈ 2.5 B | One daily snapshot — the number that shapes the whole storage strategy |
| Bookings / day | ~2 M | Tiny by comparison; a conventional row store |
| Search queries / s (peak) | ~10 K | Served from Elasticsearch, not the warehouse |
| Snapshot retention | 90 d raw, then weekly | Full grain is unaffordable forever; downsample old snapshots |
| Reviews / booking | 0–2 | Bilateral; both sides may stay silent |
Notice the asymmetry. Bookings are a rounding error — two million rows a day is gigabytes. The calendar is the whole problem: a daily snapshot is billions of rows, and naïvely keeping every snapshot forever is petabytes a year for a question most listings will never be asked. That single row of the table dictates the partitioning, the retention ladder, and the entire reason the calendar looks the way it does. The rest of this article follows the calendar.
One transactional spine, one snapshot tributary, two read shapes. Hosts and guests mutate a small operational store; a nightly job freezes the calendar into the warehouse; search reads a fast index, and the pricing team reads the frozen past.
Three properties of this picture do most of the interview's work. First, search never touches the warehouse — it reads a rebuildable Elasticsearch index fed from the change log, because a guest typing "Paris, 2 guests" needs a 50 ms answer over live availability, not a scan of a 2.5-billion-row fact table. Second, the snapshot job is the only bridge between the two worlds, and it runs exactly once per night: it reads the entire live calendar, stamps every row with today's snapshot_date, and appends — it never updates a prior snapshot, because a prior snapshot is, by definition, what was true then. Third, the live calendar is allowed to forget. It holds only the current state; the moment a price changes, the old number is overwritten in OLTP and preserved only in the warehouse. The operational store optimizes for the booking flow; the fact table carries the memory.
The operational store may lie about the past; the warehouse may never lie about it. If the search index is stale by a minute, a guest occasionally sees a night that was just booked — annoying, self-healing, recoverable at request time by the booking service's transactional check. But if a snapshot is wrong, or a refund is restated because a policy changed, the system has corrupted evidence — and disputes, pricing experiments, and host trust all rest on that evidence. So freshness is negotiable on the read path and non-negotiable in the fact: snapshots are written once, refunds are frozen at cancel time, and FX is locked at booking. The past is a contract.
The schema falls out of the three tempos. A snapshotted calendar fact for the hot path and the time machine; an append-only booking fact for the lifecycle; reviews and payouts for settlement; and slowly-changing dimensions so a booking made last year still joins to the listing as it was, not as it is.
Everything turns on the grain of one table. fct_calendar_nights carries one row per (listing_id, night_date, snapshot_date) — and that triple is the primary key. The first two columns say which night of which listing; the third says which day we are looking at it from. Put snapshot_date in the key and time-travel becomes a WHERE clause; leave it out and the question "what did this cost on July 4th, as shown on June 1st" becomes an archaeology project against SCD2 history and booking events — slow, brittle, often impossible after data drift. The three price columns earn their place: base_price_local is what the host set, smart_price_local is what the algorithm suggested, and effective_price_local is what a guest would actually have been charged. Keeping all three lets the pricing team measure the algorithm against the host against reality.
A booking is a small record at trivial volume, so a conventional row store is the right home — but it is governed by one rule that separates seniors from juniors: append-only. A modification — a date change, an extra guest, a renegotiated price — does not UPDATE the original row; it inserts a new booking that points back via parent_booking_id, and the original's terminal_state becomes 'modified'. The chain is the audit trail. The only in-place writes permitted are the three lifecycle timestamps and the cancellation fields, because those record when a transition happened, which is itself a fact, not a rewrite of history.
Reviews are bilateral and typed, so one table with a review_type discriminator carries both directions without a self-join nightmare. Payouts fire after a completed stay and reference the booking they settle. And the cancellation policy is a real dimension, not an enum scattered through code: dim_cancellation_policies stores a refund_curve_json mapping days-to-check-in to the refundable fraction, so the refund math is data, versioned and auditable, not a buried CASE statement.
A booking made in July must, a year later, still join to the listing's price, capacity, and policy as they were in July — not as they are now. That is the textbook job of a Type-2 slowly-changing dimension. dim_listings and dim_hosts carry validity windows, so an as-of join recovers the historical row. The calendar snapshot already freezes the price; SCD2 freezes everything else about the listing and host.
Every system has one place its correctness lives. Here it is not a transaction or a lock — it is a column in a primary key. Put snapshot_date in the key and the past becomes addressable; the time machine is not a feature you build, it is a consequence of the grain you chose.
Consider the two designs side by side. In the mutable calendar, the price of listing 42 on July 4th is a single cell, repeatedly overwritten — by the host on June 1st, by the algorithm on June 3rd, by a last-minute discount on July 2nd. Ask "what did it cost on June 1st" and the only honest answer is a shrug. In the snapshotted calendar, that same night exists as many rows — one per day it was observed — and the question is a coordinate lookup: fix the listing, fix the night, fix the day you are asking from. The price is wherever those three lines cross. The cost of this is real and must be named: roughly 2.5 billion rows per nightly snapshot. The payback is that an entire class of question — pricing experiments, dispute resolution, "show me what the guest saw" — moves from impossible to a single indexed read.
The same forward-only discipline governs the booking. A reservation walks a state machine, and a modification does not mutate it — it forks it. CANCELLED and MODIFIED are terminal; the chain of parent_booking_id links reconstructs the full history of a stay that changed three times.
The single most senior move in the whole problem is the refund. When a guest cancels, the refundable amount is computed once, from the policy's curve and the days remaining until check-in, and the result is written into refund_amount_usd and never recomputed. This is not laziness — it is correctness. Cancellation policies change; a host moves from 'moderate' to 'strict' next quarter. If refunds were computed on read, that change would silently restate every historical refund, and a guest's settled $496.80 would quietly become a different number. Locking the refund at cancel time means the past stays the past.
Three programs carry the load: the nightly snapshot that freezes the calendar, the booking writer that appends modifications instead of editing them, and the search-index updater that keeps live availability fresh. Each is small; the judgment is in what they refuse to do.
The snapshot runs once a night and has exactly one rule: it appends rows stamped with today's date and never touches a prior snapshot. It is idempotent on (listing_id, night_date, snapshot_date), so a retried run after a crash is a no-op rather than a duplicate. The reasoning is the whole design: a snapshot is a photograph, and you do not edit a photograph — you take a new one tomorrow.
One carve-out, always stated: the retention thinning runs only after the new snapshot is confirmed durable. Deleting old history before the new photograph is safely written would risk a window where the system has neither — the kind of ordering bug that looks fine in code review and loses a quarter of pricing data the one night the job is killed mid-run.
The booking writer enforces the append-only law in code so no caller can forget it. A modification reads the parent, computes the new totals with FX frozen at this moment, inserts a fresh booking pointing back, and retires the parent — all in one transaction. The parent is never edited except to set its terminal state, which is itself an immutable fact about when it was superseded.
The slow loop has one job and one promise: money moves to a host after a completed stay, exactly once, net of any refund, with the payout-time FX recorded distinctly from the booking-time FX. The craft is in the idempotency and in keeping two exchange rates honest.
A payout is not a read-time calculation; it is a daily close that materializes a row into fct_host_payouts. The trigger is a state, not a timer: a booking becomes eligible only when its terminal_state reaches 'stayed' and the guest's check-out has passed. The amount is the host's earnings — the subtotal less the platform's service fee — and, critically, it is converted at the FX rate true on the payout day, which is generally not the rate locked at booking. The system therefore carries two FX facts per stay: one frozen at booking (what the guest was charged) and one frozen at payout (what the host received). The gap between them is realized currency drift, and a senior model surfaces it rather than hiding it inside a single blended number.
The same daily rhythm closes the reviews window. Airbnb's reviews are double-blind: neither side sees the other's until both have submitted or a fourteen-day window expires. The aggregation that drives the superhost signal therefore reads only revealed reviews, recomputes a host's trailing rating and response rate, and writes a new dim_hosts version when the status flips — SCD2 again, so "was this host a superhost when the booking was made" stays answerable forever.
The frozen tables are where the system explains itself. Three queries an interviewer loves, because each one carries a classic SQL pattern on its back: the as-of join, the funnel by conditional aggregation, and the window-function gap detector.
Search is a join against one chosen snapshot. "Paris, July 4–7, 2 guests" filters the freshest snapshot_date to available nights, joins the listing dimension as-of the same instant, and demands the listing have all three nights free — the HAVING COUNT(*) = 3 that turns "some nights available" into "the whole stay is bookable."
"Where do bookings die?" The lifecycle timestamps make the funnel a single pass of COUNT(*) FILTER: requested, accepted, stayed, plus the host's median time-to-accept. The same conditional-aggregation pattern that powers every funnel in analytics, applied to one append-only fact.
"How much did a listing's price move in the run-up to the night, and did the smart-pricing engine chase demand?" This is the snapshot table's signature query: for one night, walk its successive snapshots in date order and LAG to measure day-over-day change. It is the time machine doing what only it can — and it doubles as a price-dispute resolver.
A senior design ends with observability, because every frozen moment above is invisible without it. The dashboard watches the three loops separately — supply and search, the booking funnel, and settlement & trust — each with its own definition of healthy.
Read the warning tiles together and the dashboard narrates a marketplace under demand pressure. The effective-vs-base spread widening is the smart-pricing engine doing its job — and the snapshot table is the only reason that line can be drawn at all, because it remembers every price the algorithm ever showed. Meanwhile the red host-cancel tile is the trust-and-safety alarm from §07 firing in real time: that is the query that costs a host their superhost status, and it is visible here the moment it crosses the line.
Strip the listings and the reviews away and the question was testing five judgments, each of which generalizes far beyond travel: