The Outbox Pattern: Atomic Side-Effect Intent
The Reconciler has been running fine. The expenses table fills up cleanly. Idempotency keeps retries safe. Claims keep two Workers off the same row. Tuesday night, everything works.
Wednesday night, the email service is down for forty minutes.
The Reconciler writes Alice's expense row at 23:43. The row commits. The Reconciler then calls the email API. The email API times out. The Worker logs the error and moves on. Alice has an expense in the database and no email in her inbox. Friday she calls the bank to dispute a charge she "never got notified about".
You think: send the email first, then write the row. The next week the email service is up but a transient connection drop kills the database write. Alice gets the email saying her expense was recorded. The expense is not recorded. Now the books say zero and the inbox says one. Worse problem.
There is no order of these two calls that is safe. Two networks. Two failure points. One process trying to span both.
This lesson teaches the smallest fix that closes the gap: the outbox.
- Dual-write problem: A single logical action that requires writes to two systems (database plus email service, database plus message queue) cannot be made atomic by ordering alone.
- Outbox: A table inside the same database as the business data. The Worker writes a row to it as part of the same transaction that writes the business row.
- Side-effect intent: A row in the outbox that describes the external action to take (recipient, subject, body). It is not the action itself, only the intent.
- Relay: A separate process that reads pending outbox rows and performs the external action (sends the email, posts the message, makes the API call).
- Dispatch: The act of the relay actually performing the side effect.
- Dedup key: A short string carried in the outbox row that the receiver (or the relay) uses to recognise duplicates.
The Dual-Write Problem in One Picture
There are exactly two orderings the Reconciler can try without the outbox. Both leak.
| Order | Step 1 | Step 2 | Failure case | Durable state after failure |
|---|---|---|---|---|
| A | INSERT expense row | Call email API | Email API down | Expense exists, no email sent. User disputes. |
| B | Call email API | INSERT expense row | DB connection dies | Email sent, no expense row. Books are wrong. |
The problem is the same in both directions. Two systems, two networks, two clocks. The Worker can succeed in the first system and fail in the second. There is no COMMIT that spans both.
The outbox dodges the problem instead of solving it. Both writes go to the same database. The database's own transaction is atomic. One COMMIT, two rows.
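The two orderings in the table can be replayed in a few lines. This is only an illustrative sketch: the in-memory counters stand in for the real database and email service, and `run_order`, `ServiceDown`, and `failing_step` are hypothetical names invented for the example.

```python
class ServiceDown(Exception):
    """Raised by a simulated dependency that is unavailable."""


def run_order(order, failing_step):
    """Run the two writes in the given order; the step named failing_step raises."""
    state = {"expense_rows": 0, "emails_sent": 0}

    def insert_expense():
        if failing_step == "db":
            raise ServiceDown("db connection died")
        state["expense_rows"] += 1

    def send_email():
        if failing_step == "email":
            raise ServiceDown("email API timed out")
        state["emails_sent"] += 1

    steps = [insert_expense, send_email] if order == "A" else [send_email, insert_expense]
    try:
        for step in steps:
            step()
    except ServiceDown:
        pass  # the Worker logs the error and moves on; step 1 has already happened
    return state


print(run_order("A", failing_step="email"))  # {'expense_rows': 1, 'emails_sent': 0}
print(run_order("B", failing_step="db"))     # {'expense_rows': 0, 'emails_sent': 1}
```

Whichever step runs first is already durable by the time the second one fails, which is the leak in both rows of the table.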
What the Outbox Does
The recipe is small.
- The Worker opens one transaction.
- Inside that transaction it writes the business row (the `expenses` row).
- Inside the same transaction it writes a row to `outbox_messages` describing the side-effect intent (the email payload, not the email itself).
- It COMMITs.
Two outcomes are possible. Either both rows exist or neither does. The COMMIT is the seatbelt.
Then a separate process called the relay runs on its own schedule. It claims the next pending row from outbox_messages using SELECT ... FOR UPDATE SKIP LOCKED (the claim pattern you read in Lesson 4). It dispatches the side effect. On success, it marks the row dispatched. On failure, it bumps an attempt counter and leaves the row pending for the next pass.
The receiver (the email service in this case) or the relay itself deduplicates by the dedup_key carried in the row. That dedup key is built the same way as the idempotency key in Lesson 3: deterministic, derived from inputs, no clocks, no random ids.
Two disciplines, both already in your toolkit, recombined. Same shape.
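A deterministic dedup key might be derived along these lines. This is a minimal sketch, not the lesson's prescribed derivation: `build_dedup_key` is a hypothetical helper, the choice of SHA-256 over a JSON-encoded list is an assumption, and the field values are made up.

```python
import hashlib
import json


def build_dedup_key(topic, *stable_fields):
    """Hash the topic plus stable Worker inputs. No clocks, no random ids."""
    # JSON-encoding the list keeps field boundaries unambiguous,
    # so ("ab", "c") and ("a", "bc") never produce the same key.
    canonical = json.dumps([topic, *stable_fields], separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first_try = build_dedup_key("confirmation_email", "run-17", "expense-9001")
retry = build_dedup_key("confirmation_email", "run-17", "expense-9001")
other = build_dedup_key("confirmation_email", "run-17", "expense-9002")
print(first_try == retry, first_try == other)  # True False
```

A retry of the same Worker action rebuilds the same key from the same inputs, so the database or the receiver can recognise it as a duplicate.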
What the Outbox Table Looks Like
The agent writes the migration. You read it.
CREATE TABLE outbox_messages (
id UUID PRIMARY KEY,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
topic TEXT NOT NULL,
payload JSONB NOT NULL,
dedup_key TEXT NOT NULL UNIQUE,
status TEXT NOT NULL DEFAULT 'pending',
attempts INT NOT NULL DEFAULT 0,
last_error TEXT,
dispatched_at TIMESTAMPTZ
);
Six things to read.
- `topic` is what kind of side effect this is (`confirmation_email`, `slack_post`, `webhook_call`). One outbox table can carry many topics.
- `payload JSONB` is the body of the intent. For an email: recipient, subject, body, template variables.
- `dedup_key TEXT NOT NULL UNIQUE` is the most important column. The UNIQUE constraint stops a retried Worker from queuing the same intent twice. Same contract you read in Lesson 3.
- `status` is `pending`, `dispatched`, or `failed`. The relay flips it.
- `attempts` and `last_error` are the relay's bookkeeping. They live in the database, not in the relay's memory.
- `dispatched_at` is filled in only on success. A NULL `dispatched_at` plus a `pending` status is what the relay scans for.
If the agent's migration is missing the UNIQUE constraint on dedup_key, the outbox is a hint, not a contract. Reject and ask for the constraint.
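To see why the constraint matters, the sketch below queues the same intent twice, once with and once without UNIQUE on `dedup_key`. It uses Python's built-in SQLite as a stand-in for the lesson's database (an assumption made so the example runs anywhere), a cut-down two-column table, and SQLite's `INSERT OR IGNORE` in place of `ON CONFLICT ... DO NOTHING`. The helper name `rows_after_retry` is hypothetical.

```python
import sqlite3


def rows_after_retry(with_unique: bool) -> int:
    """Queue the same intent twice and report how many outbox rows survive."""
    conn = sqlite3.connect(":memory:")
    constraint = " UNIQUE" if with_unique else ""
    conn.execute(
        "CREATE TABLE outbox_messages ("
        " id TEXT PRIMARY KEY,"
        f" dedup_key TEXT NOT NULL{constraint})"
    )
    # A retried Worker generates a fresh row id but the same dedup_key.
    for attempt_id in ("attempt-1", "attempt-2"):
        conn.execute(
            "INSERT OR IGNORE INTO outbox_messages (id, dedup_key) VALUES (?, ?)",
            (attempt_id, "same-intent"),
        )
    return conn.execute("SELECT COUNT(*) FROM outbox_messages").fetchone()[0]


print(rows_after_retry(with_unique=True))   # 1
print(rows_after_retry(with_unique=False))  # 2
```

Without the constraint the retry silently queues a second email intent, which is the "hint, not a contract" failure.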
What the Worker Writes
One transaction, two inserts. This is the whole point of the lesson.
BEGIN;
INSERT INTO expenses (worker_id, user_id, category, amount, spent_at, idempotency_key)
VALUES ($worker_id, $user_id, $category, $amount, $spent_at, $idempotency_key)
ON CONFLICT (worker_id, idempotency_key) DO NOTHING;
INSERT INTO outbox_messages (id, topic, payload, dedup_key)
VALUES ($outbox_id, 'confirmation_email', $payload::jsonb, $dedup_key)
ON CONFLICT (dedup_key) DO NOTHING;
COMMIT;
Four things to read.
- Both inserts live between the same `BEGIN` and `COMMIT`. If the COMMIT fails, neither row exists. If it succeeds, both rows exist.
- The first insert uses the idempotency key contract from Lesson 3. The retry case is a no-op.
- The second insert uses the same `ON CONFLICT DO NOTHING` shape on `dedup_key`. A retried Worker does not queue a second email intent.
- The Worker does not call the email API inside this block. The relay does that, later, with no time pressure.
After the COMMIT, the row in outbox_messages is the contract: "an email must be sent for this expense, eventually." The Worker is done.
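The both-or-neither behaviour can be checked with a small script. This sketch assumes SQLite (through Python's `sqlite3` module) as a stand-in for the real database, trims both tables to a few columns, and uses hypothetical names (`record_expense`, `row_counts`, `crash_before_commit`) and made-up values.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE expenses (
    worker_id TEXT NOT NULL,
    idempotency_key TEXT NOT NULL,
    amount TEXT NOT NULL,
    UNIQUE (worker_id, idempotency_key)
);
CREATE TABLE outbox_messages (
    id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    dedup_key TEXT NOT NULL UNIQUE,
    status TEXT NOT NULL DEFAULT 'pending'
);
"""


def record_expense(conn, idempotency_key, outbox_id, dedup_key, crash_before_commit=False):
    """Write the expense row and the email intent in one transaction."""
    payload = json.dumps({"to": "alice@example.com", "subject": "Expense recorded"})
    with conn:  # COMMIT on normal exit, ROLLBACK if an exception escapes
        conn.execute(
            "INSERT INTO expenses (worker_id, idempotency_key, amount) VALUES (?, ?, ?) "
            "ON CONFLICT (worker_id, idempotency_key) DO NOTHING",
            ("reconciler", idempotency_key, "42.00"),
        )
        conn.execute(
            "INSERT INTO outbox_messages (id, topic, payload, dedup_key) VALUES (?, ?, ?, ?) "
            "ON CONFLICT (dedup_key) DO NOTHING",
            (outbox_id, "confirmation_email", payload, dedup_key),
        )
        if crash_before_commit:
            raise RuntimeError("simulated crash before COMMIT")


def row_counts(conn):
    """Return (expense rows, outbox rows)."""
    expenses = conn.execute("SELECT COUNT(*) FROM expenses").fetchone()[0]
    outbox = conn.execute("SELECT COUNT(*) FROM outbox_messages").fetchone()[0]
    return expenses, outbox


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

try:
    record_expense(conn, "idem-1", "o1", "dedup-1", crash_before_commit=True)
except RuntimeError:
    pass
after_crash = row_counts(conn)

record_expense(conn, "idem-1", "o2", "dedup-1")
record_expense(conn, "idem-1", "o3", "dedup-1")  # retried Worker: both inserts are no-ops
after_retry = row_counts(conn)

print(after_crash, after_retry)  # (0, 0) (1, 1)
```

The crash leaves neither row behind, and the retry leaves exactly one of each.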
What the Relay Does
The agent writes this. You read it.
BEGIN;
-- Claim the oldest pending row; rows locked by other relays are skipped.
SELECT id, topic, payload, dedup_key
FROM outbox_messages
WHERE status = 'pending'
ORDER BY created_at, id
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- The application sends the side effect (calls the email API).
-- On success: mark the row and commit the claim.
UPDATE outbox_messages
SET status = 'dispatched',
dispatched_at = now()
WHERE id = $id;
COMMIT;
-- On failure: roll back the claim instead, then record the attempt
-- in a separate transaction.
ROLLBACK;
BEGIN;
UPDATE outbox_messages
SET attempts = attempts + 1,
last_error = $error_message
WHERE id = $id;
COMMIT;
Three things to read.
- The SELECT carries `FOR UPDATE SKIP LOCKED`. The relay uses the exact claim pattern you read in Lesson 4. Two relay instances racing each other do not collide.
- The actual side effect (the email API call) happens between the SELECT and the UPDATE. If the API call fails, the relay rolls back the claim transaction, so the success UPDATE never runs, and writes the failure UPDATE in a fresh transaction. The row stays `pending` for the next pass.
- The receiver (the email service) receives the `dedup_key` that the relay sends along with the payload. If it sees the same `dedup_key` twice, it short-circuits the second delivery. This is why the email service does not send two emails when the relay retries.
The relay never invents work. It only dispatches intents that the Worker wrote. If the Worker did not write a row, the relay does nothing. If the Worker wrote a row, the relay will eventually dispatch it.
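A single relay pass could look roughly like the sketch below. It is a simplification under stated assumptions: SQLite stands in for the real database, so the `FOR UPDATE SKIP LOCKED` claim is omitted (SQLite has no equivalent) and only one relay instance is modelled. `relay_once` and `make_flaky_sender` are hypothetical names, and the sender is a fake email client.

```python
import sqlite3


def relay_once(conn, send):
    """Run one relay pass: pick the oldest pending row, dispatch it, record the outcome."""
    row = conn.execute(
        "SELECT id, payload, dedup_key FROM outbox_messages "
        "WHERE status = 'pending' ORDER BY created_at, id LIMIT 1"
    ).fetchone()
    if row is None:
        return "idle"  # nothing pending: the relay never invents work
    msg_id, payload, dedup_key = row
    try:
        send(payload, dedup_key)  # the side effect itself
    except Exception as exc:
        with conn:  # failure bookkeeping in its own transaction
            conn.execute(
                "UPDATE outbox_messages SET attempts = attempts + 1, last_error = ? WHERE id = ?",
                (str(exc), msg_id),
            )
        return "failed"
    with conn:
        conn.execute(
            "UPDATE outbox_messages SET status = 'dispatched', "
            "dispatched_at = CURRENT_TIMESTAMP WHERE id = ?",
            (msg_id,),
        )
    return "dispatched"


def make_flaky_sender(failures):
    """Hypothetical email client that fails `failures` times, then succeeds."""
    delivered = []

    def send(payload, dedup_key):
        nonlocal failures
        if failures > 0:
            failures -= 1
            raise ConnectionError("email API timed out")
        delivered.append(dedup_key)

    return send, delivered


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox_messages ("
    " id TEXT PRIMARY KEY, created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
    " payload TEXT NOT NULL, dedup_key TEXT NOT NULL UNIQUE,"
    " status TEXT NOT NULL DEFAULT 'pending', attempts INT NOT NULL DEFAULT 0,"
    " last_error TEXT, dispatched_at TEXT)"
)
with conn:
    conn.execute("INSERT INTO outbox_messages (id, payload, dedup_key) VALUES ('m1', '{}', 'k1')")

send, delivered = make_flaky_sender(failures=1)
results = [relay_once(conn, send) for _ in range(3)]
final_row = conn.execute("SELECT status, attempts FROM outbox_messages").fetchone()

print(results)    # ['failed', 'dispatched', 'idle']
print(final_row)  # ('dispatched', 1)
print(delivered)  # ['k1']
```

The failed pass leaves the row `pending` with `attempts = 1`, the second pass dispatches it, and the third finds nothing to do.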
PRIMM-AI+ Practice: Force a Relay Failure
Predict [AI-FREE]
The Reconciler is about to write one expense row for Bob and queue one confirmation email through the outbox. The relay will dispatch the email. You will then force the relay to fail on its first attempt and succeed on its second.
Before running anything, write down:
- After the Worker COMMITs, how many rows exist in `expenses` for this transaction? How many in `outbox_messages` with `status = 'pending'`?
- After the relay's first (failed) attempt, what is the `status` of the outbox row, and what does `attempts` read?
- After the relay's second (successful) attempt, what is the `status`, and what does `dispatched_at` read?
- How many real emails did the receiver actually deliver?
- Your confidence score from 1 to 5.
You should be able to answer all four before running anything.
Run
Ask Claude Code to write a small relay simulator: it runs the Worker transaction (two inserts, one COMMIT), then runs the relay block twice. On the first relay run, raise a forced exception before the success UPDATE. On the second relay run, let the dispatch succeed. After both runs, read SELECT id, status, attempts, dispatched_at, last_error FROM outbox_messages WHERE id = $id and show the result.
What you should see:
| Stage | expenses rows | outbox status | attempts | dispatched_at | Emails delivered |
|---|---|---|---|---|---|
| After Worker COMMIT | 1 | pending | 0 | NULL | 0 |
| After relay fail #1 | 1 | pending | 1 | NULL | 0 |
| After relay success #2 | 1 | dispatched | 1 | filled | 1 |
The expense row never duplicated. The intent never disappeared. The email arrived once. That is the contract.
Investigate
Write your own one-paragraph explanation:
- The Worker's transaction committed both rows or neither. There is no world where the expense exists without an intent next to it.
- The relay's first attempt failed cleanly. The row stayed `pending`. The intent was not lost because it lives in the database, not in the relay's memory.
- The relay's second attempt succeeded. The `dispatched_at` column is the visible proof that the side effect happened.
- The receiver deduplicated by `dedup_key`. Even if the relay had retried five times after partial successes, only one email would have arrived.
Then ask the agent:
- "If I dropped the outbox row entirely and called the email API directly from the Worker transaction, what would the failure mode be? Walk through both orderings."
- "Why is the dedup_key built from Worker inputs and not from the outbox row's UUID? What goes wrong if I use the UUID?"
- "What happens if the relay crashes between the email API success and the UPDATE that sets `status = 'dispatched'`? Trace the next relay pass."
The third question is the subtle one. Without the receiver-side dedup_key, a relay crash mid-flight would resend the email on the next pass. With the dedup_key, the receiver short-circuits the second delivery. The outbox row eventually becomes dispatched on a later successful run.
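The crash-between-send-and-UPDATE case can be traced with a purely in-memory model. Everything here is hypothetical scaffolding for illustration: `Receiver` stands in for an email service that remembers the dedup keys it has seen, a plain dict stands in for the outbox row, and `RelayCrash` simulates the relay dying mid-flight.

```python
class Receiver:
    """Hypothetical email service that remembers every dedup_key it has delivered."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def deliver(self, dedup_key, body):
        if dedup_key in self.seen:
            return "duplicate_ignored"  # short-circuit the repeat delivery
        self.seen.add(dedup_key)
        self.delivered.append(body)
        return "delivered"


class RelayCrash(Exception):
    """Simulates the relay dying after the send but before the success UPDATE."""


def relay_pass(row, receiver, crash_after_send=False):
    """Dispatch one outbox row; mark it dispatched only if the relay survives."""
    outcome = receiver.deliver(row["dedup_key"], row["payload"])
    if crash_after_send:
        raise RelayCrash("relay died before the success UPDATE")
    row["status"] = "dispatched"
    return outcome


row = {"dedup_key": "k1", "payload": "Your expense was recorded", "status": "pending"}
receiver = Receiver()

try:
    relay_pass(row, receiver, crash_after_send=True)
except RelayCrash:
    pass
status_after_crash = row["status"]

second_outcome = relay_pass(row, receiver)

print(status_after_crash, second_outcome, row["status"], len(receiver.delivered))
# pending duplicate_ignored dispatched 1
```

The row is still `pending` after the crash, so the next pass resends it. The receiver ignores the repeat, and the row finally flips to `dispatched` with exactly one email delivered.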
Modify
Change one rule. Drop the outbox row entirely and call the email service directly from the Worker transaction.
Predict what happens when the email service times out. Then run the demo.
What you should see: the Worker's transaction either commits before the timeout (expense exists, no email, no record that an email was even intended) or rolls back the expense after the timeout (no expense, no email, no record that anything was tried). Either way the durable state is incomplete. There is nothing for a relay to pick up. The intent is gone.
The outbox is the seam that lets the database commit be the contract.
Make [Mastery Gate]
The brief in business English:
"Whenever the Reconciler categorises a transaction with a confidence score below 0.6, it should publish a
`low_confidence_flag` event to a `notifications` topic so the on-call analyst can review. The notification must arrive at most once per `(run_id, pending_transaction_id)` even if the Worker retries."
Hand this brief to the agent. Ask for:
- The `topic` string and the shape of the JSON payload (named fields, not free-form).
- The exact derivation of the `dedup_key` (which Worker fields it hashes).
- The placement of the new `INSERT INTO outbox_messages` inside the existing Worker transaction.
- Confirmation that `ON CONFLICT (dedup_key) DO NOTHING` is present.
You read each piece. The gate passes when you can point at the dedup_key derivation and explain, in business terms, which retry it refuses. If the agent uses a UUID or a timestamp in the dedup_key, you reject: same key contract as Lesson 3, no clocks, no random ids.
This is what the outbox prevents: a side effect that the database commit said would happen, but did not.
Try With AI
Prompt 1: Dual-Write Recognition
I will describe three Worker designs. For each one, tell me whether
the design has the dual-write problem and what the failure mode is.
A) Worker writes an expense row, then calls the email API in the same
process, no transaction wrapping either.
B) Worker calls the email API first, then writes the expense row in
a transaction that commits on success.
C) Worker writes an expense row and an outbox row in one transaction,
then a separate relay process dispatches the email later.
For each one, name the durable failure if the second step fails.
What you're learning: Recognition. A and B both leak. C does not. Building this reflex is what lets you reject a Worker design before it ships, instead of finding out at 2am.
Prompt 2: Dedup Key Derivation
I am about to outbox a confirmation_email for the Expense Reconciler.
The Worker has `run_id`, `turn_seq`, `expense_id`, `user_id`, `amount`,
`occurred_at` available. List the fields you would hash to build the
`dedup_key`, in order, and explain in one sentence each why each field
belongs. Then list two fields that look tempting but do NOT belong.
What you're learning: Same discipline as the idempotency key in Lesson 3, applied to the outbox row. Same hashing rules. Same prohibitions: no timestamps, no UUIDs. The key must be the same on every retry of the same Worker action.
Prompt 3: Relay-Failure Drill
Walk me through the durable state of `outbox_messages` and the real
world after each of these events:
1. Worker COMMITs the two-insert transaction.
2. Relay claims the row with FOR UPDATE SKIP LOCKED.
3. Relay calls the email API. The API returns a 500 error.
4. Relay's success UPDATE never runs. The transaction rolls back.
5. Relay starts a new transaction. Bumps `attempts`. Sets `last_error`.
6. Relay claims the same row again on the next pass. Calls the API.
The API returns 200 this time.
7. Relay's success UPDATE runs. The transaction commits.
For each step, name the value of `status`, `attempts`, `dispatched_at`,
and whether a real email has been delivered. End with the receiver's
view of how many emails it saw with this dedup_key.
What you're learning: Tracing a relay timeline is the proof that the side-effect intent survives a partial failure. The visible repeatability across the seven steps is the contract.
Checkpoint
- I can name the dual-write problem and explain why neither ordering of "write row, send email" is atomic.
- I can read a Worker transaction that writes both an `expenses` row and an `outbox_messages` row and explain why both succeed or both fail.
- I can read a relay block and identify the `FOR UPDATE SKIP LOCKED` claim and the success-vs-failure UPDATEs.
- I can explain why the `dedup_key` is built from Worker inputs and not from the outbox row's UUID.
- I have traced a relay failure-then-success scenario and confirmed the email arrived exactly once.
- I rejected at least one agent-proposed outbox design during practice for missing the UNIQUE constraint on `dedup_key`.