
The Outbox Pattern: Atomic Side-Effect Intent

The Reconciler has been running fine. The expenses table fills up cleanly. Idempotency keeps retries safe. Claims keep two Workers off the same row. Tuesday night, everything works.

Wednesday night, the email service is down for forty minutes.

The Reconciler writes Alice's expense row at 23:43. The row commits. The Reconciler then calls the email API. The email API times out. The Worker logs the error and moves on. Alice has an expense in the database and no email in her inbox. Friday she calls the bank to dispute a charge she "never got notified about".

You think: send the email first, then write the row. The next week the email service is up but a transient connection drop kills the database write. Alice gets the email saying her expense was recorded. The expense is not recorded. Now the books say zero and the inbox says one. Worse problem.

There is no order of these two calls that is safe. Two networks. Two failure points. One process trying to span both.

This lesson teaches the smallest fix that closes the gap: the outbox.

Key Terms for This Lesson
  • Dual-write problem: A single logical action that requires writes to two systems (database plus email service, database plus message queue) cannot be made atomic by ordering alone.
  • Outbox: A table inside the same database as the business data. The Worker writes a row to it as part of the same transaction that writes the business row.
  • Side-effect intent: A row in the outbox that describes the external action to take (recipient, subject, body). It is not the action itself, only the intent.
  • Relay: A separate process that reads pending outbox rows and performs the external action (sends the email, posts the message, makes the API call).
  • Dispatch: The act of the relay actually performing the side effect.
  • Dedup key: A short string carried in the outbox row that the receiver (or the relay) uses to recognise duplicates.

The Dual-Write Problem in One Picture

There are exactly two orderings the Reconciler can try without the outbox. Both leak.

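Order | Step 1             | Step 2             | Failure case       | Durable state after failure
----- | ------------------ | ------------------ | ------------------ | --------------------------------------------
A     | INSERT expense row | Call email API     | Email API down     | Expense exists, no email sent. User disputes.
B     | Call email API     | INSERT expense row | DB connection dies | Email sent, no expense row. Books are wrong.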

The problem is the same in both directions. Two systems, two networks, two clocks. The Worker can succeed in the first system and fail in the second. There is no COMMIT that spans both.

The outbox dodges the problem instead of solving it. Both writes go to the same database. The database's own transaction is atomic. One COMMIT, two rows.
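The two leaking orderings can be reproduced in a few lines. This is a minimal sketch, assuming an in-memory SQLite database in place of the real one and hypothetical `email_down` / `email_up` callables in place of the email API. None of these names come from the Reconciler itself.

```python
import sqlite3


class EmailDown(Exception):
    """Raised by the stand-in email API when the service is unavailable."""


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE expenses (user_id TEXT, amount REAL)")
    return conn


def order_a(conn, send_email):
    """Ordering A: commit the row first, then call the email API."""
    conn.execute("INSERT INTO expenses VALUES ('alice', 42.0)")
    conn.commit()
    send_email("alice")  # if this raises, the row is already durable


def order_b(conn, send_email, db_fails=False):
    """Ordering B: call the email API first, then write the row."""
    send_email("alice")  # the email is out and cannot be recalled
    if db_fails:
        raise sqlite3.OperationalError("simulated connection drop")
    conn.execute("INSERT INTO expenses VALUES ('alice', 42.0)")
    conn.commit()


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM expenses").fetchone()[0]


sent = []  # every email the stand-in API accepted


def email_down(user):
    raise EmailDown("timeout")


def email_up(user):
    sent.append(user)


# Ordering A with the email service down.
conn_a = make_db()
try:
    order_a(conn_a, email_down)
except EmailDown:
    pass
rows_a, emails_a = count_rows(conn_a), len(sent)

# Ordering B with the database write failing.
conn_b = make_db()
try:
    order_b(conn_b, email_up, db_fails=True)
except sqlite3.OperationalError:
    pass
rows_b, emails_b = count_rows(conn_b), len(sent)

print("A:", rows_a, "row(s),", emails_a, "email(s)")
print("B:", rows_b, "row(s),", emails_b, "email(s)")
```

In both runs the two counts disagree, and nothing in the database records that the other half is still owed.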

What the Outbox Does

The recipe is small.

  1. The Worker opens one transaction.
  2. Inside that transaction it writes the business row (the expenses row).
  3. Inside the same transaction it writes a row to outbox_messages describing the side-effect intent (the email payload, not the email itself).
  4. It COMMITs.

Two outcomes are possible. Either both rows exist or neither does. The COMMIT is the seatbelt.

Then a separate process called the relay runs on its own schedule. It claims the next pending row from outbox_messages using SELECT ... FOR UPDATE SKIP LOCKED (the claim pattern you read in Lesson 4). It dispatches the side effect. On success, it marks the row dispatched. On failure, it bumps an attempt counter and leaves the row pending for the next pass.

The receiver (the email service in this case) or the relay itself deduplicates by the dedup_key carried in the row. That dedup key is built the same way as the idempotency key in Lesson 3: deterministic, derived from inputs, no clocks, no random ids.

Two disciplines, both already in your toolkit, recombined. Same shape.
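One way to build such a key, sketched in Python. The function name `build_dedup_key`, the separator, the 32-character truncation, and the sample values are all illustrative assumptions. Which Worker fields belong in the key is the design decision this lesson asks you to review, so the sketch takes them as arguments rather than choosing for you.

```python
import hashlib


def build_dedup_key(topic: str, *parts: object) -> str:
    """Hash a fixed, ordered list of Worker inputs into a short key.

    No clock reads and no random ids: the same inputs always give the same key.
    """
    # The unit-separator character keeps ("ab", "c") and ("a", "bc")
    # from collapsing into the same string before hashing.
    material = "\x1f".join([topic, *(str(p) for p in parts)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:32]


# Illustrative values only.
first_try = build_dedup_key("confirmation_email", "run-17", 4, "exp-901")
retry = build_dedup_key("confirmation_email", "run-17", 4, "exp-901")
other = build_dedup_key("confirmation_email", "run-17", 5, "exp-901")

print(first_try == retry)   # a retry of the same action produces the same key
print(first_try == other)   # a different action produces a different key
```

Because the key depends only on its arguments, a retried Worker recomputes the same string and the UNIQUE constraint on dedup_key can refuse the second row.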

What the Outbox Table Looks Like

The agent writes the migration. You read it.

CREATE TABLE outbox_messages (
    id            UUID        PRIMARY KEY,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    topic         TEXT        NOT NULL,
    payload       JSONB       NOT NULL,
    dedup_key     TEXT        NOT NULL UNIQUE,
    status        TEXT        NOT NULL DEFAULT 'pending',
    attempts      INT         NOT NULL DEFAULT 0,
    last_error    TEXT,
    dispatched_at TIMESTAMPTZ
);

Six things to read.

  1. topic is what kind of side effect this is (confirmation_email, slack_post, webhook_call). One outbox table can carry many topics.
  2. payload JSONB is the body of the intent. For an email: recipient, subject, body, template variables.
  3. dedup_key TEXT NOT NULL UNIQUE is the most important column. The UNIQUE constraint stops a retried Worker from queuing the same intent twice. Same contract you read in Lesson 3.
  4. status is pending, dispatched, or failed. The relay flips it.
  5. attempts and last_error are the relay's bookkeeping. They live in the database, not in the relay's memory.
  6. dispatched_at is filled in only on success. The relay scans for rows with status = 'pending'; on those rows dispatched_at is still NULL.

If the agent's migration is missing the UNIQUE constraint on dedup_key, the outbox is a hint, not a contract. Reject and ask for the constraint.

What the Worker Writes

One transaction, two inserts. This is the whole point of the lesson.

BEGIN;

INSERT INTO expenses (worker_id, user_id, category, amount, spent_at, idempotency_key)
VALUES ($worker_id, $user_id, $category, $amount, $spent_at, $idempotency_key)
ON CONFLICT (worker_id, idempotency_key) DO NOTHING;

INSERT INTO outbox_messages (id, topic, payload, dedup_key)
VALUES ($outbox_id, 'confirmation_email', $payload::jsonb, $dedup_key)
ON CONFLICT (dedup_key) DO NOTHING;

COMMIT;

Four things to read.

  1. Both inserts live between the same BEGIN and COMMIT. If the COMMIT fails, neither row exists. If it succeeds, both rows exist.
  2. The first insert uses the idempotency key contract from Lesson 3. The retry case is a no-op.
  3. The second insert uses the same ON CONFLICT DO NOTHING shape on dedup_key. A retried Worker does not queue a second email intent.
  4. The Worker does not call the email API inside this block. The relay does that, later, with no time pressure.

After the COMMIT, the row in outbox_messages is the contract: "an email must be sent for this expense, eventually." The Worker is done.
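The both-or-neither behaviour can be checked locally. The sketch below is an assumption-heavy stand-in: it uses Python's built-in `sqlite3` instead of PostgreSQL (so `payload` is TEXT rather than JSONB, and the `ON CONFLICT` clause needs SQLite 3.24 or newer), trims the tables to the columns it needs, and introduces a hypothetical `record_expense` helper with a `fail_before_commit` switch to simulate a crash.

```python
import json
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE expenses (
    worker_id TEXT NOT NULL,
    user_id TEXT NOT NULL,
    amount REAL NOT NULL,
    idempotency_key TEXT NOT NULL,
    UNIQUE (worker_id, idempotency_key)
);
CREATE TABLE outbox_messages (
    id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    dedup_key TEXT NOT NULL UNIQUE,
    status TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0
);
"""


def record_expense(conn, worker_id, user_id, amount, idem_key, dedup_key,
                   fail_before_commit=False):
    """Write the expense row and the email intent in one transaction."""
    payload = json.dumps({"to": user_id, "subject": "Expense recorded",
                          "amount": amount})
    try:
        with conn:  # commits on normal exit, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO expenses (worker_id, user_id, amount, idempotency_key) "
                "VALUES (?, ?, ?, ?) "
                "ON CONFLICT (worker_id, idempotency_key) DO NOTHING",
                (worker_id, user_id, amount, idem_key))
            conn.execute(
                "INSERT INTO outbox_messages (id, topic, payload, dedup_key) "
                "VALUES (?, 'confirmation_email', ?, ?) "
                "ON CONFLICT (dedup_key) DO NOTHING",
                (str(uuid.uuid4()), payload, dedup_key))
            if fail_before_commit:
                raise RuntimeError("simulated crash before COMMIT")
        return True
    except RuntimeError:
        return False  # the rollback has already undone both inserts


def counts(conn):
    """Return (expenses rows, outbox rows)."""
    e = conn.execute("SELECT COUNT(*) FROM expenses").fetchone()[0]
    o = conn.execute("SELECT COUNT(*) FROM outbox_messages").fetchone()[0]
    return e, o


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

record_expense(conn, "w1", "bob", 12.5, "idem-1", "dedup-1",
               fail_before_commit=True)
after_crash = counts(conn)    # neither row survived the rollback

record_expense(conn, "w1", "bob", 12.5, "idem-1", "dedup-1")
after_commit = counts(conn)   # both rows exist

record_expense(conn, "w1", "bob", 12.5, "idem-1", "dedup-1")  # Worker retry
after_retry = counts(conn)    # still one of each

print(after_crash, after_commit, after_retry)
```

Note that the outbox `id` is a fresh random UUID on every call, while `dedup_key` is passed in unchanged. The retry is refused because of the key, not the id.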

What the Relay Does

The agent writes this. You read it.

-- Transaction 1: claim one pending row, dispatch it, record success.
BEGIN;

SELECT id, topic, payload, dedup_key
FROM outbox_messages
WHERE status = 'pending'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- The application sends the side effect (calls the email API).

-- On success:
UPDATE outbox_messages
SET status = 'dispatched',
    dispatched_at = now()
WHERE id = $id;

COMMIT;

-- On failure: ROLLBACK transaction 1 instead of committing it,
-- then record the failed attempt in a separate transaction.
BEGIN;

UPDATE outbox_messages
SET attempts = attempts + 1,
    last_error = $error_message
WHERE id = $id;

COMMIT;

Three things to read.

  1. The SELECT carries FOR UPDATE SKIP LOCKED. The relay uses the exact claim pattern you read in Lesson 4. Two relay instances racing each other do not collide.
  2. The actual side effect (the email API call) happens between the SELECT and the UPDATE. If the API call fails, the success UPDATE never runs: the relay rolls back the claiming transaction and writes the failure UPDATE in a fresh transaction. The row stays pending for the next pass.
  3. The relay passes dedup_key along with the dispatch, and the receiver (the email service) checks it. If it sees the same dedup_key twice, it short-circuits the second delivery. This is why the email service does not send two emails when the relay retries.

The relay never invents work. It only dispatches intents that the Worker wrote. If the Worker did not write a row, the relay does nothing. If the Worker wrote a row, the relay will eventually dispatch it.
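A single relay pass, sketched in Python under loud assumptions. SQLite stands in for PostgreSQL and has no `FOR UPDATE SKIP LOCKED`, so the claim here is a plain SELECT that is only safe with one relay instance. The names `relay_once` and `flaky_send` are hypothetical, and the stand-in sender fails once and then succeeds.

```python
import sqlite3


def relay_once(conn, send):
    """One relay pass: pick one pending row, dispatch it, record the outcome."""
    # Assumption: a plain SELECT replaces the FOR UPDATE SKIP LOCKED claim.
    row = conn.execute(
        "SELECT id, topic, payload, dedup_key FROM outbox_messages "
        "WHERE status = 'pending' ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return "idle"  # the relay never invents work
    row_id, topic, payload, dedup_key = row
    try:
        with conn:  # commit on success, roll back if send() or the UPDATE raises
            send(topic, payload, dedup_key)
            conn.execute(
                "UPDATE outbox_messages "
                "SET status = 'dispatched', dispatched_at = datetime('now') "
                "WHERE id = ?", (row_id,))
        return "dispatched"
    except Exception as exc:
        with conn:  # fresh transaction for the failure bookkeeping
            conn.execute(
                "UPDATE outbox_messages "
                "SET attempts = attempts + 1, last_error = ? WHERE id = ?",
                (str(exc), row_id))
        return "failed"


def read_state(conn):
    return conn.execute(
        "SELECT status, attempts, dispatched_at FROM outbox_messages "
        "WHERE id = 'm1'").fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE outbox_messages (
    id TEXT PRIMARY KEY, topic TEXT NOT NULL, payload TEXT NOT NULL,
    dedup_key TEXT NOT NULL UNIQUE, status TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0, last_error TEXT, dispatched_at TEXT)""")
conn.execute("INSERT INTO outbox_messages (id, topic, payload, dedup_key) "
             "VALUES ('m1', 'confirmation_email', '{}', 'k1')")
conn.commit()

delivered = []
outcomes = iter([RuntimeError("email API timeout"), None])


def flaky_send(topic, payload, dedup_key):
    """Stand-in email API: fails on the first call, succeeds on the second."""
    err = next(outcomes)
    if err is not None:
        raise err
    delivered.append(dedup_key)


first = relay_once(conn, flaky_send)
state_after_first = read_state(conn)
second = relay_once(conn, flaky_send)
state_after_second = read_state(conn)
third = relay_once(conn, flaky_send)  # nothing pending, so nothing is sent

print(first, state_after_first)
print(second, state_after_second)
print(third, len(delivered))
```

The attempt counter and the last error survive between passes because they are written to the row, not held in the relay process.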

PRIMM-AI+ Practice: Force a Relay Failure

Predict [AI-FREE]

The Reconciler is about to write one expense row for Bob and queue one confirmation email through the outbox. The relay will dispatch the email. You will then force the relay to fail on its first attempt and succeed on its second.

Before running anything, write down:

  • After the Worker COMMITs, how many rows exist in expenses for this transaction? How many in outbox_messages with status = 'pending'?
  • After the relay's first (failed) attempt, what is the status of the outbox row, and what does attempts read?
  • After the relay's second (successful) attempt, what is the status, and what does dispatched_at read?
  • How many real emails did the receiver actually deliver?
  • Your confidence score from 1 to 5.

You should be able to answer all four before running anything.

Run

Ask Claude Code to write a small relay simulator: it runs the Worker transaction (two inserts, one COMMIT), then runs the relay block twice. On the first relay run, raise a forced exception before the success UPDATE. On the second relay run, let the dispatch succeed. After both runs, read SELECT id, status, attempts, dispatched_at, last_error FROM outbox_messages WHERE id = $id and show the result.

What you should see:

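Stage                  | expenses rows | outbox status | attempts | dispatched_at | Emails delivered
---------------------- | ------------- | ------------- | -------- | ------------- | ----------------
After Worker COMMIT    | 1             | pending       | 0        | NULL          | 0
After relay fail #1    | 1             | pending       | 1        | NULL          | 0
After relay success #2 | 1             | dispatched    | 1        | filled        | 1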

The expense row never duplicated. The intent never disappeared. The email arrived once. That is the contract.

Investigate

Write your own one-paragraph explanation:

  1. The Worker's transaction committed both rows or neither. There is no world where the expense exists without an intent next to it.
  2. The relay's first attempt failed cleanly. The row stayed pending. The intent was not lost because it lives in the database, not in the relay's memory.
  3. The relay's second attempt succeeded. The dispatched_at column is the visible proof that the side effect happened.
  4. The receiver deduplicated by dedup_key. Even if the relay had retried five times after partial successes, only one email would have arrived.

Then ask the agent:

  1. "If I dropped the outbox row entirely and called the email API directly from the Worker transaction, what would the failure mode be? Walk through both orderings."
  2. "Why is the dedup_key built from Worker inputs and not from the outbox row's UUID? What goes wrong if I use the UUID?"
  3. "What happens if the relay crashes between the email API success and the UPDATE that sets status = 'dispatched'? Trace the next relay pass."

The third question is the subtle one. Without the receiver-side dedup_key, a relay crash mid-flight would resend the email on the next pass. With the dedup_key, the receiver short-circuits the second delivery. The outbox row eventually becomes dispatched on a later successful run.
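The crash-between-send-and-UPDATE case can be traced with a toy receiver. Everything below is hypothetical: the `Receiver` class is an in-memory stand-in for an email service that remembers dedup keys, the outbox row is a plain dict, and `crash_after_send` simulates the relay dying before it can mark the row.

```python
class Receiver:
    """Hypothetical email service that remembers the dedup keys it has delivered."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def deliver(self, dedup_key, message):
        if dedup_key in self.seen:
            return "duplicate"  # short-circuit: no second email goes out
        self.seen.add(dedup_key)
        self.delivered.append(message)
        return "delivered"


def relay_pass(row, receiver, crash_after_send=False):
    """Dispatch one outbox row; optionally 'crash' before marking it."""
    if row["status"] != "pending":
        return "idle"
    result = receiver.deliver(row["dedup_key"], "Expense recorded")
    if crash_after_send:
        return "crashed"  # the status update never runs; the row stays pending
    # Both "delivered" and "duplicate" mean the email exists, so the row is done.
    row["status"] = "dispatched"
    return result


receiver = Receiver()
outbox_row = {"dedup_key": "k-bob-1", "status": "pending"}

first = relay_pass(outbox_row, receiver, crash_after_send=True)
status_after_first = outbox_row["status"]

second = relay_pass(outbox_row, receiver)
status_after_second = outbox_row["status"]

print(first, status_after_first)
print(second, status_after_second)
print("emails delivered:", len(receiver.delivered))
```

The design choice worth noticing is that the relay treats a "duplicate" answer as success. The email already went out on the crashed pass, so the only work left is to flip the row to dispatched.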

Modify

Change one rule. Drop the outbox row entirely and call the email service directly from the Worker transaction.

Predict what happens when the email service times out. Then run the demo.

What you should see: the Worker's transaction either commits before the timeout (expense exists, no email, no record that an email was even intended) or rolls back the expense after the timeout (no expense, no email, no record that anything was tried). Either way the durable state is incomplete. There is nothing for a relay to pick up. The intent is gone.

The outbox is the seam that lets the database commit be the contract.

Make [Mastery Gate]

The brief in business English:

"Whenever the Reconciler categorises a transaction with a confidence score below 0.6, it should publish a low_confidence_flag event to a notifications topic so the on-call analyst can review. The notification must arrive at most once per (run_id, pending_transaction_id) even if the Worker retries."

Hand this brief to the agent. Ask for:

  1. The topic string and the shape of the JSON payload (named fields, not free-form).
  2. The exact derivation of the dedup_key (which Worker fields it hashes).
  3. The placement of the new INSERT INTO outbox_messages inside the existing Worker transaction.
  4. Confirmation that ON CONFLICT (dedup_key) DO NOTHING is present.

You read each piece. The gate passes when you can point at the dedup_key derivation and explain, in business terms, which retry it refuses. If the agent uses a UUID or a timestamp in the dedup_key, you reject: same key contract as Lesson 3, no clocks, no random ids.

This is what the outbox prevents: a side effect that the database commit said would happen, but did not.

Try With AI

Prompt 1: Dual-Write Recognition

I will describe three Worker designs. For each one, tell me whether
the design has the dual-write problem and what the failure mode is.

A) Worker writes an expense row, then calls the email API in the same
process, no transaction wrapping either.
B) Worker calls the email API first, then writes the expense row in
a transaction that commits on success.
C) Worker writes an expense row and an outbox row in one transaction,
then a separate relay process dispatches the email later.

For each one, name the durable failure if the second step fails.

What you're learning: Recognition. A and B both leak. C does not. Building this reflex is what lets you reject a Worker design before it ships, instead of finding out at 2am.

Prompt 2: Dedup Key Derivation

I am about to outbox a confirmation_email for the Expense Reconciler.
The Worker has `run_id`, `turn_seq`, `expense_id`, `user_id`, `amount`,
`occurred_at` available. List the fields you would hash to build the
`dedup_key`, in order, and explain in one sentence each why each field
belongs. Then list two fields that look tempting but do NOT belong.

What you're learning: Same discipline as the idempotency key in Lesson 3, applied to the outbox row. Same hashing rules. Same prohibitions: no timestamps, no UUIDs. The key must be the same on every retry of the same Worker action.

Prompt 3: Relay-Failure Drill

Walk me through the durable state of `outbox_messages` and the real
world after each of these events:

1. Worker COMMITs the two-insert transaction.
2. Relay claims the row with FOR UPDATE SKIP LOCKED.
3. Relay calls the email API. The API returns a 500 error.
4. Relay's success UPDATE never runs. The transaction rolls back.
5. Relay starts a new transaction. Bumps `attempts`. Sets `last_error`.
6. Relay claims the same row again on the next pass. Calls the API.
The API returns 200 this time.
7. Relay's success UPDATE runs. The transaction commits.

For each step, name the value of `status`, `attempts`, `dispatched_at`,
and whether a real email has been delivered. End with the receiver's
view of how many emails it saw with this dedup_key.

What you're learning: Tracing a relay timeline is the proof that the side-effect intent survives a partial failure. If you can state status, attempts, and dispatched_at after each of the seven steps and get the same answer every time, the contract holds.

Checkpoint

  • I can name the dual-write problem and explain why neither ordering of "write row, send email" is atomic.
  • I can read a Worker transaction that writes both an expenses row and an outbox_messages row and explain why both succeed or both fail.
  • I can read a relay block and identify the FOR UPDATE SKIP LOCKED claim and the success-vs-failure UPDATEs.
  • I can explain why the dedup_key is built from Worker inputs and not from the outbox row's UUID.
  • I have traced a relay failure-then-success scenario and confirmed the email arrived exactly once.
  • I rejected at least one agent-proposed outbox design during practice for missing the UNIQUE constraint on dedup_key.
