The Worker Execution Journal

Last night the Reconciler crashed at 11:42pm. You open psql first thing in the morning and try to figure out what happened.

The expenses table looks normal. The pending_transactions table has fewer rows than yesterday, but more than zero. Some pending rows were processed. Some were not. Which ones? In what order? Why did the Worker stop?

The database has the inputs and the outputs. It does not have the work-in-progress.

If you restart the Worker now, you risk processing some rows twice. If you skip ahead, you risk missing rows. If you do nothing, the boss will ask why bank statements from Tuesday have not landed yet.

The gap is the execution journal: the table (or set of tables) that records the Worker's intent and progress, not just the result.

This lesson adds the three tables that close the gap.

Key Terms for This Lesson

Run: one invocation of the Worker. Has a start time, a finish time (eventually), a status, and an input reference.
Turn: one step inside a run. The Reconciler typically does one turn per pending transaction it is processing.
Tool call: one external action a turn made: a SELECT, an INSERT, an LLM call, an MCP tool invocation. Has its own start, finish, status, and payload.
Status enum: a fixed list of allowed status values: pending, running, completed, failed, cancelled. The database refuses any other value.
Journal (or ledger): the three-table record of runs/turns/tool_calls together. Reading it reconstructs what the Worker did.

Three Tables That Make a Worker Inspectable

The schema is small. The agent will write the migration. You read it.

Table 1: `runs`

CREATE TABLE runs (
    id           uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
    worker_id    text        NOT NULL,
    started_at   timestamptz NOT NULL DEFAULT now(),
    finished_at  timestamptz,
    status       text        NOT NULL DEFAULT 'pending'
                 CHECK (status IN ('pending','running','completed','failed','cancelled')),
    input_ref    text,
    error        text
);

Read the columns:

id: the stable, addressable handle for this run. You will refer to it everywhere else.
worker_id: which Worker this is (e.g. expense-reconciler).
started_at: set when the run begins. NOT NULL.
finished_at: nullable, set only when the run reaches a terminal state.
status: bound by a CHECK constraint to five values. The database refuses anything else.
input_ref: a pointer to what the run was working on (e.g. "pending_transactions for 2026-05-10").
error: a short text field, set only on failure.

Table 2: `turns`

CREATE TABLE turns (
    id           uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
    run_id       uuid        NOT NULL REFERENCES runs(id) ON DELETE CASCADE,
    seq          integer     NOT NULL,
    started_at   timestamptz NOT NULL DEFAULT now(),
    finished_at  timestamptz,
    status       text        NOT NULL DEFAULT 'pending'
                 CHECK (status IN ('pending','running','completed','failed','cancelled')),
    UNIQUE (run_id, seq)
);

Read the columns:

run_id: the parent run.
seq: turn 1, turn 2, turn 3 within that run. The UNIQUE (run_id, seq) constraint prevents two turn-3s on the same run.
started_at, finished_at, status: same shape as runs.

Table 3: `tool_calls`

CREATE TABLE tool_calls (
    id           uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
    turn_id      uuid        NOT NULL REFERENCES turns(id) ON DELETE CASCADE,
    tool_name    text        NOT NULL,
    input_json   jsonb,
    output_json  jsonb,
    status       text        NOT NULL DEFAULT 'pending'
                 CHECK (status IN ('pending','running','completed','failed','cancelled')),
    started_at   timestamptz NOT NULL DEFAULT now(),
    finished_at  timestamptz,
    error        text
);

Read the columns:

turn_id: the parent turn.
tool_name: e.g. categorise_transaction, insert_expense, queue_email.
input_json and output_json: what the tool was called with, and what it returned. Stored as JSONB so you can query into them later.
status, started_at, finished_at, error: same shape as the other two tables.

The three tables together form one tree per run: one run, many turns, many tool_calls per turn.

The Status Enum

The same five values appear in all three tables. They mean the same thing everywhere:

Status	When It Is Set
`pending`	The row exists but the work has not started yet. Useful for queues.
`running`	The work has started. `started_at` is set. `finished_at` is still NULL.
`completed`	The work finished successfully. `finished_at` is set.
`failed`	The work finished with an error. `finished_at` is set. `error` is populated.
`cancelled`	The work was stopped before finishing. `finished_at` is set. No `error` required.

A row with status = 'running' and finished_at = NULL is the signature of a crash. The Worker started the work and never wrote the closing record. The journal tells you exactly where the crash happened.

Resume Story

Here is what last night's crashed Reconciler looks like when you read the journal in the morning.

SELECT id, worker_id, started_at, finished_at, status
FROM runs
WHERE worker_id = 'expense-reconciler'
ORDER BY started_at DESC
LIMIT 1;

Output:

id        | worker_id           | started_at          | finished_at | status
----------+---------------------+---------------------+-------------+---------
8a3f...   | expense-reconciler  | 2026-05-10 23:00    | NULL        | running

Translation: the run started, never finished, and the database still thinks it is running. That is your crash signal.

Now read the turns:

SELECT seq, started_at, finished_at, status
FROM turns
WHERE run_id = '8a3f...'
ORDER BY seq;

seq | started_at        | finished_at       | status
----+-------------------+-------------------+-----------
  1 | 23:00:01          | 23:00:04          | completed
  2 | 23:00:04          | 23:00:08          | completed
  3 | 23:00:08          | NULL              | running

Turns 1 and 2 finished cleanly. Turn 3 was in flight when the crash happened. Now the tool_calls for turn 3:

SELECT tool_name, status, started_at, finished_at, error
FROM tool_calls
WHERE turn_id = '<turn-3 id>'
ORDER BY started_at;

tool_name              | status     | started_at | finished_at | error
-----------------------+------------+------------+-------------+--------
categorise_transaction | completed  | 23:00:08   | 23:00:09    | NULL
insert_expense         | running    | 23:00:09   | NULL        | NULL

The Worker categorised the third transaction. It then started inserting the expense row and crashed before recording the result. The expenses table itself either has the new row or does not, depending on whether the database commit landed before the process died.

You now have a real choice, made from rows, not guesses:

Resume strategy: re-run turn 3 with the same input, but use idempotency (the next lesson) so the second insert_expense does not duplicate the row.
Restart strategy: mark this run failed, start a fresh run, let it pick up the same pending transactions, and rely on idempotency to not double-process turns 1 and 2.

Either decision is defensible. The point is that the journal made the decision possible. Without it, you were guessing.

PRIMM-AI+ Practice: Record One Reconciler Run

Predict [AI-FREE]

The Reconciler is going to process exactly three pending transactions for Alice in one run. Each transaction needs three tool calls: one categorise_transaction, one insert_expense, one queue_email.

Before running, write down:

How many rows will appear in runs after the run completes?
How many rows in turns?
How many rows in tool_calls?
What status values will exist on the run, on each turn, and on each tool_call?
Which column on runs is set last? Why?
Confidence (1-5) on each answer.

Expected answers, to check yourself: 1 run row, 3 turn rows, 9 tool_call rows, all completed. finished_at on runs is set last because it cannot be set until every turn has finished.

Run

Use the agent to simulate one Reconciler run end to end. The migrations from Lesson 1 plus the three tables above are assumed in place.

On the journal-demo Neon branch, simulate one Reconciler run for Alice
processing three pending transactions. Insert one row in runs, three in
turns, nine in tool_calls. Walk every status from pending -> running ->
completed and timestamp it correctly. After the simulation, run a SELECT
that joins the three tables and shows me the run as a flat table.

What you should see:

1 row in runs, status completed, finished_at set.
3 rows in turns, all completed, all with seq values 1, 2, 3.
9 rows in tool_calls, three per turn, all completed.
The flat join shows you the full ordered trace of the run.

Investigate

Write your own explanation first.

Why does finished_at on runs get set last? (Hint: a run cannot be complete before its last turn is.)
Why are there exactly 9 tool_calls, not 3? (Hint: each turn has three tool calls.)
What does ON DELETE CASCADE on the foreign keys mean for what happens if you delete the run? (Hint: turns and tool_calls go with it.)

Then ask the agent these targeted questions:

"Show me the SQL that ordered the rows correctly when the run finished. Why was runs.finished_at set after turns.finished_at for turn 3?"
"If I dropped tool_calls first by accident, would the foreign key prevent it? What error would I see?"
"Walk me through the smallest change to add a cost_cents column to tool_calls for tracking LLM spend."

You should be able to answer your version of (3) before reading the agent's: it is a column-add migration, nullable, no backfill needed.

Modify

Simulate a crash. Manually update the last turn's status to running and clear its finished_at. Update the last tool_call's status to running and clear its finished_at. Now query the journal as if you had just woken up.

Predict before you run:

Which run has the crash signature (status = 'running' AND finished_at IS NULL)?
Which turn is in-flight?
Which tool_call is in-flight?
If the resume Worker re-ran the in-flight tool_call, what risk would it carry without idempotency? (Hint: if the original tool call already wrote to expenses, the resume call would write again.)

Run the queries. Compare with your prediction. Decide whether to mark the run failed and start fresh, or to resume from the in-flight turn.

Make [Mastery Gate]

Brief in business English:

"The Reconciler should record which model and version it used per run, so that if a model upgrade changes categorisation behaviour, we can answer 'which runs used the old model and which used the new one?' from the journal alone."

Hand the brief to the agent. Ask for the migration file.

Read the file. Confirm:

The migration adds two new columns on runs: model_name text and model_version text.
Both are nullable, because old runs do not have this information and we do not want to backfill arbitrary values.
The downgrade() drops both columns.
The agent did not "improve" the migration by also dropping the existing error column or renaming anything. Read for scope creep.

Apply on the branch. Insert one new run with model fields populated and confirm a SELECT can filter by model_name. Roll back. Approve in writing or reject with reasons.

The gate passes when you can read three tables, narrate one run from end to end, and either approve a follow-up migration or refuse it because the agent overreached.

What Failure This Prevents

This lesson kills two failure modes: silent loss of work-in-progress when the Worker crashes (because the journal records intent and progress as separate facts), and inability to answer "what did the Worker do?" hours or days after the fact (because the rows are still there to be read). The Reconciler can now crash safely.

Try With AI

Prompt 1: Read a Journal

Show me three sample rows: one runs row, one of its turns rows, and one
of that turn's tool_calls rows. Pretty-print them. Do not explain. After
that, I will narrate, in plain English, what the Worker was doing on
that run. You then tell me what I missed.

What you're learning: Reading the journal as a tree (run -> turns -> tool_calls) before you ever try to write code that produces it. The Worker writes; you read; the agent confirms. Same pattern as 21A.

Prompt 2: Crash Diagnosis

I have one runs row with status='running' and finished_at IS NULL. The
last turns row for that run has status='running' and finished_at IS
NULL. The last tool_calls row for that turn has tool_name='insert_expense',
status='running', finished_at IS NULL.

Tell me the two possible states of the underlying expenses table after
this crash, and what one query I should run to find out which one I am
in.

What you're learning: Forensic reasoning. The journal narrows the problem to a single tool_call in flight. Whether the database commit landed is a separate question, answerable by querying the expenses table for a row that matches the tool_call's input_json.

Prompt 3: Cost Tracking Add-On

We want to add LLM cost tracking. Propose the smallest change to the
journal: which table gets a new column, what type, nullable or not,
default or not. Do not write the migration yet. Just argue for or
against your choice in three sentences.

What you're learning: Forcing the agent to argue before writing code is the cheapest way to surface bad design choices. If the argument is weak, the migration will be weak. By the time you say "okay, write it," you should already have agreed on the shape.

Checkpoint

I can read the runs/turns/tool_calls schema and explain the foreign-key tree in plain English.
I can name the five status values and explain when each is set.
I can read a partially-completed journal and recognise the crash signature.
I simulated one full run end to end and read the rows back.
I can argue for either resume or restart given a specific journal state, and name the risk of each.
I am ready to make these writes idempotent in the next lesson.

Three Tables That Make a Worker Inspectable​

Table 1: runs​

Table 2: turns​

Table 3: tool_calls​

The Status Enum​

Resume Story​

PRIMM-AI+ Practice: Record One Reconciler Run​

Predict [AI-FREE]​

Run​

Investigate​

Modify​

Make [Mastery Gate]​

What Failure This Prevents​

Try With AI​

Prompt 1: Read a Journal​

Prompt 2: Crash Diagnosis​

Prompt 3: Cost Tracking Add-On​

Checkpoint​

Flashcards Study Aid​

Three Tables That Make a Worker Inspectable

Table 1: `runs`

Table 2: `turns`

Table 3: `tool_calls`

The Status Enum

Resume Story

PRIMM-AI+ Practice: Record One Reconciler Run

Predict [AI-FREE]

Run

Investigate

Modify

Make [Mastery Gate]

What Failure This Prevents

Try With AI

Prompt 1: Read a Journal

Prompt 2: Crash Diagnosis

Prompt 3: Cost Tracking Add-On

Checkpoint

Flashcards Study Aid