PaddySpeaks · Systems at the Whiteboard · Nº 20

The Recurrence Problem

Design Google Calendar — recurring events with per-instance exceptions, IANA time zones that survive daylight saving, invitations that fan out to other people's calendars, and "find a time" across a room of attendees. One row can mean infinite meetings, and a single clock change can silently move a million stand-ups. The decision that separates a senior answer: store the rule, derive the instant. A full working-through — data flow, the schema, the local-time invariant, DST-correct expansion in Python, the free/busy bitmap, the change-stream pipeline, and the dashboard that proves it.

§ 01 — THE QUESTIONOne row, infinite meetings

A calendar looks like the easiest app you will ever build: a list of events with a start and an end, sorted by time. It is, in fact, one of the quietest correctness traps in consumer software — and the question reveals fast whether someone has shipped against time, or only read about it.

Interview Prompt

"Design the data model behind Google Calendar — recurring events with per-instance exceptions, IANA time zones with daylight-saving correctness, invitations that accept/decline/tentative, and 'find a time' free/busy across many attendees. And it has to answer point-in-time: what did my Tuesday look like as Calendar showed it last Monday? How would you scope it?"

LEVEL · SENIOR / STAFFDURATION · 45 MINFORMAT · WHITEBOARD

Two facts make this hard, and both are facts about data, not about the web tier. First, a recurring event is one row that denotes an unbounded number of occurrences — "every weekday, forever" is a single record and an infinite series. Second, the correct time of an event is a moving target: a 9:00 AM stand-up has to stay 9:00 AM after the clocks change, which means you cannot freeze it to a UTC instant and walk away. Layer invitations and "find a time across ten people" on top, and you have a read-heavy, latency-bound system where the bugs are measured in missed meetings. The weak answer draws a load balancer and three app servers. The strong answer names the forces that will deform the design first, because every later decision exists to survive one of them.

There are exactly four forces, and they are not the usual suspects. So before any boxes and arrows, the working frame for the whole session:

THE EXPLOSION
Recurrence-as-rule. "Daily stand-up, every weekday, no end date" is one row that denotes an infinite set. Materialize occurrences as rows and you sign up for a forever-running backfill and a thousand-row rewrite per series edit. Expand the rule on read and writes stay O(1) — but a malicious FREQ=MINUTELY rule tries to emit a million instances for a one-month view.
THE MOVING CLOCK
Time is not a number. Storing an event as a single UTC instant is correct for a one-off and wrong for anything that recurs. The rule has to live in local wall-clock time beside an IANA zone, and UTC is derived per occurrence — never frozen, or the day the clocks change the whole company's stand-up jumps an hour.
THE MANY CALENDARS
One event, ten partitions. A meeting I organize belongs to my calendar and to all nine attendees' at once. Read-time fan-in turns the hottest path into a scatter-gather; write-time fan-out is the answer — until a 50,000-person all-hands turns one edit into 50,000 writes.
THE SECOND QUERY SYSTEM
Free/busy is not a calendar read. "When is everyone free?" is a privacy-preserving projection — opaque busy intervals, never titles — computed across many people and hammered by schedulers, room booking, and now AI agents polling availability. Re-expanding everyone's recurrences on every call recomputes a deterministic answer forever.
Read those four again: every one is a data problem, not a web-tier problem. That is the tell of the question — the interesting engineering is in the model and the pipeline, and the request path is almost boring.

Scoping out loud

Scope is the first scored dimension, and most candidates skip it. In scope: event CRUD, recurrence with per-instance exceptions, invitations and RSVP, and free/busy. Out of scope, said explicitly: email-delivery internals, video conferencing, working-hours and appointment-slot features, and sharing ACLs — modeled-for, not designed in detail. One tension I state up front: this is read-heavy with a hard correctness requirement around time. A stale calendar view for a few seconds is fine, so availability beats strict consistency for reads — but RSVPs need read-your-writes, and free/busy must not double-book in obvious ways. Time gets stored three ways at once, which I will defend with the DDL.

Then the envelope math, volunteered rather than extracted. Google-class numbers:

QuantityEstimateConsequence
Monthly / daily active users~500M / ~180MSets the read fan-out; everyone opens "this week" at once
Read : write ratio~10 : 1Make the common read nearly free; pay costs on the rare write
Occurrences per series row1 row → ∞The materialize-vs-expand call that shapes the whole architecture
Event writes / day50–80M~1k writes/s average, spiking to 5–10k/s, business-hours-spiky
Peak reads / s100k+Sync clients + calendar opens; a latency problem, not a volume one
Free/busy slot15 min · 96 bits/day12 bytes/calendar-day; ten people for two weeks < 2 KB
Storage over yearssingle-digit PBNot storage-bound. Correctness and read latency are the game

Notice the row unlike the others: a single physical row denotes an unbounded number of occurrences. That is not a detail — it dictates the storage layout, the write cost, and the entire free/busy design. The rest of this piece is a consequence of resolving it deliberately: store the rule, expand for the visible window only, and derive UTC per occurrence. Everything else follows the data.


§ 02 — DATA FLOWFollowing an edit through the building

One source of truth, four projections off one ordered log. The event store owns truth; attendee copies, the free/busy bitmap, the rendered-window cache, and client sync are all derived consumers of a single per-calendar change stream — fed by a transactional outbox so the stream can never miss a committed change or replay one that rolled back.

WRITE PATH · ONE TX, RULE + OUTBOX PROJECTIONS · DERIVED, VERSIONED, REPLAYABLE CLIENT create · move · RSVP + expected version EVENT WRITE SVC CAS on version store rule, not occurrences local time is the truth EVENT STORE events + outbox, one TX sharded by calendar_id the source of truth CHANGE STREAM (CDC) Debezium tails WAL → Kafka one ordered partition / calendar seq IS the sync token FAN-OUT CONSUMER copy into attendee shards > threshold → reference-only FREE/BUSY MATERIALIZER expand series to horizon stamp 15-min bitmaps only touched days CACHE-BUMP rotate calendar version keys roll over, no purge CLIENT SYNC "changed since token?" usually an empty delta READ PATH · render window 2 SELECTs + expand, cache by version L1 sync · L2 window cache · L3 replicas — most reads fetch nothing SOLID — the write path · DASHED — derived projections & cached reads · One log is written once; everything else is recomputed.
FIG. 1 — One source of truth, four jobs off one ordered log. The event row and its outbox row commit together; the change stream fans out to attendee copies, the free/busy bitmap, cache versioning, and client sync — each replayable from a stored offset.

Three properties of this picture carry the interview. First, the write stores the rule, not the occurrences — editing a series is one row touched, not thousands rewritten, so writes stay O(1) no matter how long the series runs. Second, the event row and its changelog row commit in one transaction, so a CDC connector turns committed rows into exactly one ordered stream message — no dual-write window where the bitmaps and attendee copies silently disagree with truth. Third, everything below the dashed line is derived: copies, bitmaps, the rendered-window cache, the client sync feed. Drop the whole projection layer and you have lost only cache — the event store survives, and every projection rebuilds by replaying the log.

The Failure Philosophy, In One Rule

The store is truth; everything else degrades to stale, never to wrong. Because copies, bitmaps, and caches are versioned projections off one ordered log, every failure has the same shape: a consumer falls behind and a guest sees an invitation a few seconds late, or free/busy lags a write — both acceptable, because availability is advisory. A write degrades in exactly one direction: it either commits the rule plus its outbox row atomically, or it commits nothing. There is no half-fanned-out event and no orphaned bitmap, because the projections never hold authority — they hold a replayable copy of it.


§ 03 — DATA MODELOne overloaded table, and the columns that earn their keep

The entity skeleton is unremarkable: users, calendars, events, and an attendee join table. Every interesting decision is inside events — where singles, recurring masters, and per-instance overrides all live in one table, disambiguated by which columns are populated, and time is stored three ways on purpose.

The small tables — identity and the unit of sharing

A user owns several calendars; the calendar is the unit of ownership and of sharing. The kind column lets rooms and resources reuse the exact same machinery as people, and the version counter — bumped on any write — becomes the cache key that makes invalidation a non-event.

DDL · IDENTITY & CALENDARS (POSTGRESQL)
-- A login identity. home_tz is the user's default zone, used when an -- event is created without an explicit one. Always an IANA id, never an offset. CREATE TABLE users ( user_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, email CITEXT UNIQUE NOT NULL, display_name TEXT NOT NULL, home_tz TEXT NOT NULL DEFAULT 'UTC', -- IANA id created_at TIMESTAMPTZ NOT NULL DEFAULT now() ); -- A calendar is the unit of ownership AND of sharing/ACLs. 'kind' lets -- rooms and resources reuse the exact same machinery as people's calendars. CREATE TABLE calendars ( calendar_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, owner_id BIGINT NOT NULL REFERENCES users(user_id), summary TEXT NOT NULL, default_tz TEXT NOT NULL DEFAULT 'UTC', kind TEXT NOT NULL DEFAULT 'primary' CHECK (kind IN ('primary','secondary','resource','room')), version BIGINT NOT NULL DEFAULT 1, -- bumped on any write; used as a cache key created_at TIMESTAMPTZ NOT NULL DEFAULT now() );

The heart — single, master, and override in one table

The single most consequential choice: a row with an rrule and no parent is a master; a row with a parent_event_id and a recurrence_id is an override of one occurrence; a plain row is a one-off. One table keeps the read path to two queries and lets an override carry its own attendees for free — which is what lets someone decline a single moved instance while staying accepted on the series. And time is stored three ways: *_utc for range scans and one-off correctness, *_local as bare wall-clock, and tz_id as the IANA zone from which UTC for a recurrence is derived.

DDL · EVENTS — THE OVERLOADED CORE
CREATE TABLE events ( event_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, calendar_id BIGINT NOT NULL REFERENCES calendars(calendar_id), organizer_id BIGINT NOT NULL REFERENCES users(user_id), uid TEXT NOT NULL, -- RFC 5545 UID, stable across edits + ingest -- content. On an OVERRIDE row, NULL means "inherit from the master" -- (sparse override — see §04). On a master/single it is the value. title TEXT, description TEXT, location TEXT, -- time, stored THREE ways on purpose: -- *_utc -> the instant, for range scans and one-off correctness -- *_local -> wall-clock with NO zone, e.g. 2026-06-09 09:00 -- tz_id -> IANA zone; UTC for a recurrence is DERIVED from these two start_utc TIMESTAMPTZ, end_utc TIMESTAMPTZ, start_local TIMESTAMP, end_local TIMESTAMP, tz_id TEXT, all_day BOOLEAN NOT NULL DEFAULT false, -- recurrence (MASTER rows only) rrule TEXT, -- 'FREQ=WEEKLY;BYDAY=TU' rdate TIMESTAMP[], -- explicit extra occurrences exdate TIMESTAMP[], -- cheap inline cancellations rrule_until_utc TIMESTAMPTZ, -- DENORMALIZED series end; NULL = open-ended -- exception linkage (OVERRIDE rows only) parent_event_id BIGINT REFERENCES events(event_id), recurrence_id TIMESTAMP, -- original wall-clock start this row replaces overridden_cols TEXT[], -- field mask: cols the user explicitly set status TEXT NOT NULL DEFAULT 'confirmed' CHECK (status IN ('confirmed','tentative','cancelled')), transparency TEXT NOT NULL DEFAULT 'opaque' -- opaque=busy, transparent=free CHECK (transparency IN ('opaque','transparent')), version BIGINT NOT NULL DEFAULT 1, -- optimistic concurrency (compare-and-swap) sequence INT NOT NULL DEFAULT 0, -- RFC 5545 SEQUENCE, bumped on material change created_at TIMESTAMPTZ NOT NULL DEFAULT now(), updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), -- an override must name the occurrence it replaces CONSTRAINT override_names_occurrence CHECK (parent_event_id IS NULL OR recurrence_id IS NOT NULL), -- a row is a recurring master XOR an override, never both CONSTRAINT master_xor_override CHECK (NOT (rrule IS NOT NULL AND parent_event_id IS NOT NULL)) ); -- Range scan: "singles + masters that could touch [lo, hi)" on one calendar. -- Partial index keeps cancelled rows and overrides out of the hot path. CREATE INDEX idx_events_cal_window ON events (calendar_id, start_utc) WHERE status <> 'cancelled' AND parent_event_id IS NULL; -- Open-ended series stay filterable thanks to the denormalized horizon. CREATE INDEX idx_events_cal_series ON events (calendar_id, rrule_until_utc) WHERE rrule IS NOT NULL; -- Overlay path: fetch every override/cancel for a set of masters; and an -- occurrence can only be overridden once per series. CREATE UNIQUE INDEX uq_override ON events (parent_event_id, recurrence_id) WHERE parent_event_id IS NOT NULL;
Why three time columns, never a raw offset

An offset like -08:00 is a fact about an instant, not a place. Store it and you have thrown away the rule that produced it, so you cannot re-derive the right wall-clock after a DST transition or a government zone change. The IANA zone id (America/Los_Angeles) is a pointer into the tz database, which does know the rules. So start_local + tz_id is the source of truth for anything that recurs, start_utc is a derived convenience for range scans and one-offs, and the two are kept consistent by the app, not by the user.

Attendees get their own table because RSVP state is per-person and, crucially, because an override row can have its own attendee set. External guests live by email and have a NULL attendee_id; the inbox index answers "what am I invited to, and have I replied?" in one seek.

DDL · EVENT_ATTENDEES
CREATE TABLE event_attendees ( event_id BIGINT NOT NULL REFERENCES events(event_id) ON DELETE CASCADE, attendee_id BIGINT REFERENCES users(user_id), -- NULL for external guests email CITEXT NOT NULL, -- external attendees live by email response TEXT NOT NULL DEFAULT 'needs_action' CHECK (response IN ('needs_action','accepted','declined','tentative')), optional BOOLEAN NOT NULL DEFAULT false, is_organizer BOOLEAN NOT NULL DEFAULT false, PRIMARY KEY (event_id, email) ); -- "what am I invited to, and have I replied?" — the invitation inbox. CREATE INDEX idx_attendee_inbox ON event_attendees (attendee_id, response);

That is the whole model the system rides on: two small tables, one carefully overloaded big one, and a join table that quietly carries the per-instance RSVP magic. The analytics warehouse mirrors this as a star — dim_events (SCD2, the recurrence master whose title and organizer drift over time), dim_calendars, dim_users, a dim_timezone that is the DST source of truth, and the occurrence, exception, invitation, and free/busy facts we interrogate in §07. The operational row shape and the dimensional grain are the same idea told twice.


§ 04 — THE CORE INVARIANTThe rule lives in local time; UTC is derived

Every correctness guarantee in this system collapses to one ordering: a recurrence is iterated in local wall-clock time, and the UTC instant is computed per occurrence afterward. Get that ordering backward — freeze each occurrence to UTC up front — and the next daylight-saving transition silently slides the whole series by an hour.

Consider the canonical case. A 9:00 AM Pacific stand-up is 17:00 UTC in winter and 16:00 UTC in summer, because America/Los_Angeles is UTC−8 under standard time and UTC−7 under daylight time. If the system had stored each occurrence as a UTC instant, the spring-forward boundary would leave half the series at 17:00Z and half at 16:00Z — and on the morning the clocks changed, everyone would show up an hour wrong. Storing local time and deriving UTC makes the right answer automatic: the wall-clock never moves, and the offset is recomputed from the zone database, which is the only thing that knows the transition dates.

The three edits, one primitive

A recurring event is one logical record but unbounded physical occurrences, and the model has to survive three kinds of edit. All three reduce to the same primitive — an exception row, a child pointing at the master via parent_event_id, tagged with the recurrence_id of the occurrence it is about.

MASTER · RRULE EXPAND (local) DERIVE UTC OVERLAY EXCEPTIONS OCCURRENCE LIST

A cancel one instance is an override row with status = cancelled for that recurrence_id — iCalendar's EXDATE, but as a row so it can carry metadata. A move or edit one instance is an override row with its own start and, because it is a full row, its own attendees and RSVPs. This and all following is a series split: set the old master's UNTIL to just before the split and create a new master from the split point. That third one is what people get wrong — they try to version a rule in place; the right move is two writes, history preserved, old exceptions still attached to the old master.

The exception lives in an append-only ledger keyed by recurrence_id: a cancel tombstones one occurrence, a move carries its own new start, an edit records which columns the user pinned. Recurrence health then becomes a GROUP BY series — and nothing ever UPDATEs the series in place.

SQL · THE OPTIMISTIC WRITE (COMPARE-AND-SWAP ON VERSION)
-- The organizer fixing the title and an attendee moving the instance can -- race. With sparse overrides they mostly write different rows, but both are -- guarded by a CAS on version: 0 rows affected => someone beat us; 409 to the -- client. Last-write-wins on the same row is never silently allowed. UPDATE events SET title = :title, location = :location, version = version + 1, sequence = sequence + 1, updated_at = now() WHERE event_id = :id AND version = :expected_version -- the guard IS the lock RETURNING version;
Expand in local time, convert per occurrence. That single ordering is the entire daylight-saving fix — and the reason a clock change can't move a million stand-ups.RECURRENCE RULE Nº 1 — STORE THE RULE, DERIVE THE INSTANT

There is a second invariant hiding inside the first, and it is what makes the system bounded: the per-query window cap and the max_instances guardrail. A recurrence yields occurrences in order, so expansion stops the instant it crosses the window's far edge — and a crafted FREQ=SECONDLY rule that tries to emit millions of instances dies at the cap instead of melting a worker. Local time keeps the books honest; the cap keeps a single read from being weaponized. Together they are the entire correctness story of the read path.


§ 05 — INGESTION & STREAMSPython that refuses to freeze time

Three programs carry the read and write loops: the expansion that turns one master into concrete occurrences without ever freezing UTC, the two-query read path that overlays exceptions in memory, and the iCalendar ingestion that maps the outside world onto the same columns. Each is small; the judgment is in what it refuses to do.

1 · The expansion — DST-correct by construction

This is the procedure that makes or breaks the design. The rule is walked in local time; each occurrence is converted to UTC only after it is generated. The guardrail is not decoration — it is the line that stands between a worker and a FREQ=MINUTELY abuse rule. And the merge of an override is sparse: it supplies its own time plus only the columns it explicitly set, so a later typo-fix on the series still reaches the moved Wednesday instance.

PYTHON · EXPAND ONE MASTER INTO OCCURRENCES (DST-CORRECT)
from datetime import timedelta from zoneinfo import ZoneInfo from dateutil.rrule import rrulestr UTC = ZoneInfo("UTC") INHERITABLE = ("title", "description", "location", "transparency") def expand_series(master, win_lo, win_hi, overrides, max_instances=730): """Expand a recurring MASTER into occurrences within [win_lo, win_hi) UTC, applying cancellations and per-instance overrides. The rule is walked in LOCAL wall-clock time and each occurrence is then converted to UTC. That order is the daylight-saving fix: 09:00 local maps to 17:00Z in winter and 16:00Z in summer, automatically.""" tz = ZoneInfo(master["tz_id"]) dtstart = master["start_local"].replace(tzinfo=tz) duration = master["end_local"] - master["start_local"] rule = rrulestr(master["rrule"], dtstart=dtstart) # index overrides by the occurrence they replace (naive local start) ov_by_rid = {o["recurrence_id"]: o for o in overrides} exdates = set(master.get("exdate") or ()) out, count = [], 0 for local_start in rule: if count >= max_instances: break # guardrail vs FREQ=MINUTELY abuse start_utc = local_start.astimezone(UTC) if start_utc >= win_hi: break # rrule yields in order; safe to stop end_utc = (local_start + duration).astimezone(UTC) if end_utc <= win_lo: continue # before the window; keep walking rid = local_start.replace(tzinfo=None) # match key == recurrence_id if rid in exdates: continue # cheap inline cancel ov = ov_by_rid.get(rid) if ov is None: out.append(_plain(master, start_utc, end_utc, rid)) elif ov["status"] != "cancelled": out.append(_merge_override(master, ov)) # moved / edited instance # else: cancelled override -> drop silently count += 1 return out def _merge_override(master, ov): """Sparse merge: the override supplies its own time plus only the columns it explicitly set (overridden_cols); everything else falls through to the master's CURRENT value, so a later typo-fix on the series still reaches it.""" merged = {"event_id": ov["event_id"], "start_utc": ov["start_utc"], "end_utc": ov["end_utc"], "recurrence_id": ov["recurrence_id"]} overridden = set(ov.get("overridden_cols") or ()) for col in INHERITABLE: merged[col] = ov[col] if col in overridden else master[col] return merged

Notice the sparse-merge discipline. The naive instinct is a full snapshot — the override copies every field — which is simple and wrong, because the typo-fix never reaches the moved instance. The opposite extreme is a field-mask fan-out, which writes to every override on every master edit and can race or partially fail at occurrence 4,000 of 9,000. The sparse override makes propagation correct by construction: NULL means inherit, the merge is in-memory because we already loaded the master, and a series rename reaches the override for free.

2 · The read path — two queries, not one

"Give me this calendar for June" is two SQL statements. First, pull the singles and masters that could touch the window — the denormalized rrule_until_utc is exactly what keeps open-ended series index-filterable. Then pull every override for those masters and overlay them in the application after expansion.

SQL · STEP 1 — SINGLES + MASTERS OVERLAPPING THE WINDOW
SELECT event_id, title, location, transparency, all_day, tz_id, start_utc, end_utc, start_local, end_local, rrule, rdate, exdate FROM events WHERE calendar_id = :cal AND status <> 'cancelled' AND parent_event_id IS NULL -- masters + singles; overrides overlaid in step 2 AND ( -- a one-off that overlaps [lo, hi) (rrule IS NULL AND start_utc < :hi AND end_utc > :lo) OR -- a series whose lifespan overlaps the window; the denormalized -- rrule_until_utc is what keeps open-ended series index-filterable (rrule IS NOT NULL AND start_utc < :hi AND (rrule_until_utc IS NULL OR rrule_until_utc >= :lo)) ); -- step 2: overrides & cancellations for those masters (hits uq_override) SELECT parent_event_id, recurrence_id, status, overridden_cols, title, location, description, transparency, start_utc, end_utc FROM events WHERE parent_event_id = ANY(:master_ids); -- app then calls expand_series(master, lo, hi, overrides[master_id]); the two -- SELECTs merge into one ordered, exception-corrected occurrence list.

3 · Ingesting the outside world

Half of "invitations" is interop — an Outlook export, a subscribed holiday feed, a room-booking system, all of which speak iCalendar. The payoff of modeling recurrence on RFC 5545 in the first place is that ingestion is almost a straight column mapping: RRULE, EXDATE, and RECURRENCE-ID land on the columns we already have, and the uid makes re-ingesting the same feed a no-op.

PYTHON · iCALENDAR (.ics) INGESTION — IDEMPOTENT UPSERT
from icalendar import Calendar def ingest_ics(feed_bytes, calendar_id, db): """Ingest an external iCalendar feed into events. RRULE / EXDATE / RECURRENCE-ID map straight onto our columns; (calendar_id, uid, recurrence_id) is the idempotency key so re-pulling a feed is a no-op.""" cal, rows = Calendar.from_ical(feed_bytes), [] for c in cal.walk("VEVENT"): start = c.decoded("dtstart") end = c.decoded("dtend", start) # default to start if absent rid = c.get("recurrence-id") # present => this VEVENT is an override rows.append({ "calendar_id": calendar_id, "uid": str(c["uid"]), "title": str(c.get("summary", "")), "location": str(c.get("location", "")), "start_local": to_wall(start), "end_local": to_wall(end), "tz_id": tzid_of(start), # 'floating' -> calendar default "rrule": rrule_text(c.get("rrule")), "exdate": exdates(c.get("exdate")), "recurrence_id": to_wall(rid.dt) if rid else None, "status": str(c.get("status", "CONFIRMED")).lower(), }) # ON CONFLICT (calendar_id, uid, recurrence_id) DO UPDATE ... db.bulk_upsert_events(rows, conflict=("calendar_id", "uid", "recurrence_id")) return len(rows)

The carve-out an interviewer listens for: the heavyweight case. When the master's rule or time changes, the occurrence grid moves out from under every recurrence_id, and there is no clean automatic answer to "where does the cancelled-Tuesday exception go in a Thursday series." So that one edit is treated as a rebuild — warn the user that instance-level changes will be discarded, then split or recreate the series. Everything else is a sparse merge; only rule/time changes earn the heavy hammer.


§ 06 — AGGREGATIONFree/busy as a bitmap, and the change stream as the spine

Two derived layers sit atop the event store: the free/busy index, which materializes the very thing we refused to materialize for events, and the change-stream pipeline, which keeps every projection in lockstep. Both exist for speed; both can be thrown away and rebuilt from the log.

For ten people over two weeks, computing free/busy live is genuinely fine — fan out to the ten calendars, expand each, project to busy intervals, merge per attendee, and intersect across attendees with a sweep line. So why not stop there? Three reasons: tail latency (one slow shard out of ten makes an interactive query slow, and people drag the duration slider, re-firing it), volume (schedulers, room booking, and AI agents hammer this far beyond the UI), and waste (recurrence expansion is deterministic, so re-expanding everyone on every call recomputes the same answer). The fix is a materialized free/busy index, separate from the source-of-truth event store.

Per calendar, per UTC day, store a bitmap at 15-minute granularity — 96 bits, 12 bytes. Two weeks for one user is 168 bytes; ten users is under 2 KB. The query collapses to: fetch the tiny per-day bitmaps, OR them together across attendees, and scan for a run of zero bits at least the meeting length long. What just came back through the side door is materialization of recurrences — the very thing rejected for the event store — and here it is right, because free/busy needs only coverage, not event identity, and the bitmap is fixed-size no matter how many events overlap a slot.

PYTHON · FREE/BUSY MATERIALIZER (EVENT CHANGE → BUSY BITMAPS)
from datetime import datetime, timedelta, time from collections import defaultdict from math import ceil SLOT_MIN, SLOTS = 15, 96 # 24h / 15min def stamp(mask: bytearray, start_utc, end_utc, day): """Set the 15-min bits [start, end) covers within one UTC day.""" day0 = datetime.combine(day, time.min, tzinfo=UTC) lo, hi = max(start_utc, day0), min(end_utc, day0 + timedelta(days=1)) if hi <= lo: return first = int((lo - day0).total_seconds()) // (SLOT_MIN * 60) last = ceil(int((hi - day0).total_seconds()) / (SLOT_MIN * 60)) for s in range(first, min(last, SLOTS)): mask[s >> 3] |= 1 << (s & 7) def on_change_freebusy(change, db, horizon_days=365): # recompute ONLY the UTC days this change touched, out to a rolling horizon cal = change["calendar_id"] lo = change["win_lo"] hi = min(change["win_hi"], lo + timedelta(days=horizon_days)) # authoritative busy intervals (singles + expanded series) for the span busy = load_busy_intervals(db, cal, lo, hi) # calls expand_series by_day = defaultdict(lambda: bytearray(SLOTS // 8)) for s, e in busy: d = s.date() while d <= e.date(): stamp(by_day[d], s, e, d) d += timedelta(days=1) for day, mask in by_day.items(): db.upsert_freebusy(cal, day, bytes(mask), version=change["new_version"])

Only one thing makes those four projections consistent: the spine. Three different systems must react to one event write — attendee copies must fan out, free/busy bitmaps must rebuild, cache keys must rotate. Doing them inline makes "create event" latency hostage to three downstream systems, and a partial failure leaves the world inconsistent. The fix is a single ordered log per calendar fed by a transactional outbox: every mutation writes the event row and a changelog row in the same database transaction, so the stream can never miss a committed change or replay one that rolled back.

SQL · THE TRANSACTIONAL OUTBOX — EVENT + CHANGELOG COMMIT ATOMICALLY
CREATE TABLE calendar_changelog ( seq BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- IS the sync token calendar_id BIGINT NOT NULL, event_id BIGINT NOT NULL, op TEXT NOT NULL CHECK (op IN ('insert','update','delete')), new_version BIGINT NOT NULL, win_lo TIMESTAMPTZ NOT NULL, -- affected span, so the f/b consumer knows win_hi TIMESTAMPTZ NOT NULL, -- which days to recompute (whole horizon for a series) payload JSONB NOT NULL, -- the changed row, for consumers that don't re-read created_at TIMESTAMPTZ NOT NULL DEFAULT now() ); -- The write: the rule edit and its intent-to-publish are ONE commit. BEGIN; UPDATE events SET title = :title, location = :location, version = version + 1, sequence = sequence + 1, updated_at = now() WHERE event_id = :id AND version = :expected -- optimistic CAS from §04 RETURNING version INTO :new_version; INSERT INTO calendar_changelog (calendar_id, event_id, op, new_version, win_lo, win_hi, payload) VALUES (:cal, :id, 'update', :new_version, :win_lo, :win_hi, :payload); COMMIT; -- a CDC connector (Debezium tailing the WAL) ships committed changelog rows -- to a Kafka topic keyed by calendar_id, preserving per-calendar order.
Dual-writing is the difference between a pipeline that is eventually consistent and one that is eventually wrong. The outbox makes the change and its intent-to-publish one atomic commit; CDC turns committed rows into stream messages exactly once.RECURRENCE RULE Nº 2 — ONE TRANSACTION, OR NONE

From there the log forks into independent consumers that share nothing but the stream. Fan-out replicates an invitation into each attendee's calendar partition, carrying the version so a stale copy is detectable and self-repairing — and above a threshold (a 50,000-person all-hands) it flips to reference-only, one source row resolved at read time the way feeds handle celebrity accounts. Free/busy runs the materializer above. Cache-bump rotates the calendar's version so rendered-window keys roll over and stale entries simply age out — no explicit invalidation. And client sync is the cheapest leverage of all: the changelog seq is the sync token, so "what changed since X?" turns most phone wake-ups into a tiny, usually empty delta. One spine, four jobs — and the deep cut: when IANA ships a new tz release, you diff which zones changed over which future ranges and enqueue a synthetic "touch" through the very same stream, so the materializer and cache-bump do the rest with code that already exists.


§ 07 — ANALYTICS SQLInterrogating the schedule

The occurrence, free/busy, invitation, and exception facts are where the system explains itself. Three queries an interviewer loves, each carrying a classic pattern on its back: the as-of read against materialized occurrences, gaps-and-islands for "find a time," and conditional aggregation over the append-only exception ledger.

The calendar view — occurrence-as-fact, exceptions already applied

The user's question, and the hot path of the whole product: my stand-up over the next month, cancellations dropped and moves honored. Because occurrences are materialized to a rolling horizon, this is a range scan on fct_event_occurrences — no RRULE re-expansion at read time, because the override row already carries its moved time and the cancelled occurrence already carries its tombstone flag.

SQL · MY OCCURRENCES, MOVES HONORED, CANCELS DROPPED
SELECT o.recurrence_id AS original_slot, o.occurrence_start_local AS shows_at_local, o.occurrence_start_utc AS instant_utc, CASE WHEN o.is_override THEN 'moved' ELSE 'series' END AS kind FROM fct_event_occurrences o WHERE o.event_sk = 501 AND o.is_cancelled = FALSE AND o.occurrence_start_local >= '2026-06-09' AND o.occurrence_start_local < '2026-07-09' ORDER BY o.occurrence_start_local; -- 4 rows: 6/9, 6/16, (6/23 dropped), 7/1 14:00 'moved', 7/7. No re-expansion. -- 09:00 PDT shows as 16:00Z all summer; the same series lands at 17:00Z in winter.

Find a time — gaps and islands over the slot bitmap

The scheduler's question: the first 60-minute slot where Alice, Bren, and Chen are all free. The pattern is the canonical gaps-and-islands move — OR the per-attendee masks (here a SUM across calendars), keep only the all-free slots, and let a running count of busy slots label each free run; the first island long enough wins. A 60-minute meeting needs four consecutive zero slots.

SQL · FIRST COMMON 60-MIN OPENING (GAPS-AND-ISLANDS)
WITH combined AS ( SELECT slot_start_utc, sum(is_busy) AS busy_heads FROM fct_freebusy_slots WHERE calendar_sk IN (77,78,79) AND slot_start_utc >= '2026-07-01 15:00+00' -- 08:00 PT AND slot_start_utc < '2026-07-01 21:00+00' -- 14:00 PT GROUP BY slot_start_utc ), runs AS ( -- the islands trick: count busy slots seen so far to label free runs SELECT slot_start_utc, count(*) FILTER (WHERE busy_heads > 0) OVER (ORDER BY slot_start_utc) AS busy_group FROM combined WHERE busy_heads = 0 ) SELECT min(slot_start_utc) AS open_from, min(slot_start_utc) + INTERVAL '60 min' AS open_to FROM runs GROUP BY busy_group HAVING count(*) >= 4 -- 4 x 15-min = 60 min, all heads free ORDER BY open_from LIMIT 1; -- -> 18:00Z = 11:00 PT

RSVP funnel and recurrence health — conditional aggregation

The organizer's question and the workspace admin's, in one shape. Because invitations are per-occurrence and exceptions are an append-only ledger, the response mix is a GROUP BY response on one occurrence, and recurrence health is conditional aggregation over fct_event_exceptions — a series with cancels every week is a "zombie" the workspace nudges to delete.

SQL · RESPONSE MIX + WHICH SERIES CHURN THEIR OCCURRENCES
-- RSVP funnel for one occurrence (the moved Wednesday instance) SELECT response, count(*) AS heads, round(avg(latency_min)) AS avg_latency_min FROM fct_invitations WHERE occurrence_sk = 504 GROUP BY response; -- accepted 2, declined 1 (Chen can't do Wed) -- recurrence health: which series churn their occurrences? SELECT series_sk, count(*) AS exceptions, count(*) FILTER (WHERE exception_type = 'cancel') AS cancels, count(*) FILTER (WHERE exception_type = 'move') AS moves FROM fct_event_exceptions GROUP BY series_sk HAVING count(*) >= 2 ORDER BY exceptions DESC; -- many cancels => a 'zombie' to nudge

§ 08 — THE DASHBOARDProving the schedule is alive

A senior design ends with observability, because every clever degradation above is invisible without it. The calendar dashboard watches three things that fail differently: the read path that must feel instant, the pipeline that keeps projections honest, and the time-correctness machinery that no one notices until it is wrong.

READ PATH
render p95 for a window, sync-delta hit rate (how often "what changed?" returns nothing — the cheapest query is the one that fetches nothing), window-cache hit %, and expansion guardrail trips — a spike means someone shipped a FREQ=MINUTELY rule into a hot calendar.
THE PIPELINE
fan-out consumer lag (seconds of staleness on attendee copies), free/busy lag, dead-letter / reconcile count, and stale-copy repairs — proof that versioned copies are detecting and healing themselves instead of silently disagreeing with truth.
TIME & RECURRENCE
DST-affected occurrences pending recompute after a tz-database bump, exceptions per series and the zombie tail, find-a-time p99, and RSVP response latency — the late-arriving truth the model is built to absorb.
Calendar Ops — Global MON 09:12 UTC · ALL ZONES · 30s REFRESH
Render p95
38ms
Sync-Delta Empty %
91%
Window-Cache Hit
87%
Fan-out Lag
4.1s
Find-a-Time — 3 attendees, 08:00–14:00 PT · OR the masks, first zero-run wins
08101214 A B C FREE 11:00 PT = 18:00Z · 4 slots
Free/Busy Lag
2.3s
Guardrail Trips
3/h
DST Recompute Backlog
61k
Stale-Copy Repairs
820/s
Zombie Series
1.2k
RSVP Latency p50
2.4h
FIG. 2 — The story a healthy tz-bump tells: a DST recompute backlog draining through the same change stream, fan-out lag drifting amber under the burst, find-a-time still landing 11:00 PT in microseconds — and render p95 flat, because reads come off bitmaps and version-keyed caches, untouched by the churn behind them.

Read the amber tiles together and the dashboard narrates the deep cut from §06. IANA shipped a zone change overnight; the bump job diffed the affected zones and ranges and enqueued synthetic touches, so 61k future occurrences are being re-derived and their bitmaps restamped — and because every projection is versioned and replayable, fan-out merely lagged rather than corrupted, stale copies are repairing themselves at 820/s, and not one user's render got slower. That is what a designed degradation looks like from the operator's chair: the time machinery churns, and the front of the house stays boring.


§ 09 — THE RUBRICWhat was actually being tested

Strip the calendar details away and the question was probing five judgments, each of which generalizes far beyond scheduling:

RULE OVER ROWS
Seeing that one row can denote an unbounded set, and storing the generator rather than the output — so writes stay O(1) and materialization is an optimization for hot windows, never the model.
DERIVE THE INSTANT
Storing local wall-clock plus an IANA zone and computing UTC per occurrence, never a frozen offset — because the only thing that knows the transition dates is the zone database, and a clock change must not move the meeting.
CORRECTIONS AS A LEDGER
Modeling cancels and moves as sparse, append-only override rows keyed by the occurrence they replace — so a series edit propagates by inheritance, per-instance RSVPs fall out for free, and recurrence health is a GROUP BY.
MATERIALIZE WHERE IDENTITY DOESN'T MATTER
Refusing materialization for events but embracing it for free/busy, because availability needs only coverage — a fixed-size bitmap you OR across attendees turns "find a time" into a popcount cheap enough for agents to poll.
ONE LOG, MANY JOBS
Making the change stream the spine: a transactional outbox feeds fan-out, bitmaps, cache versioning, and client sync, so every projection is derived, versioned, and replayable — and every failure degrades to stale, never to wrong.
The calendar is a data-modeling problem wearing a UI. Get the row shape and the change stream right and the request path becomes almost boring — which, for something a billion people open every morning, is exactly what you want.— CLOSING ARGUMENT