The Worker Execution Journal
Last night the Reconciler crashed at 11:42pm. You open psql first thing in the morning and try to figure out what happened.
The expenses table looks normal. The pending_transactions table has fewer rows than yesterday, but more than zero. Some pending rows were processed. Some were not. Which ones? In what order? Why did the Worker stop?
The database has the inputs and the outputs. It does not have the work-in-progress.
If you restart the Worker now, you risk processing some rows twice. If you skip ahead, you risk missing rows. If you do nothing, the boss will ask why bank statements from Tuesday have not landed yet.
The gap is the execution journal: the table (or set of tables) that records the Worker's intent and progress, not just the result.
This lesson adds the three tables that close the gap.
- Run: one invocation of the Worker. Has a start time, a finish time (eventually), a status, and an input reference.
- Turn: one step inside a run. The Reconciler typically does one turn per pending transaction it is processing.
- Tool call: one external action a turn made: a SELECT, an INSERT, an LLM call, an MCP tool invocation. Has its own start, finish, status, and payload.
- Status enum: a fixed list of allowed status values:
pending,running,completed,failed,cancelled. The database refuses any other value. - Journal (or ledger): the three-table record of runs/turns/tool_calls together. Reading it reconstructs what the Worker did.
Three Tables That Make a Worker Inspectable
The schema is small. The agent will write the migration. You read it.
Table 1: runs
CREATE TABLE runs (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
worker_id text NOT NULL,
started_at timestamptz NOT NULL DEFAULT now(),
finished_at timestamptz,
status text NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending','running','completed','failed','cancelled')),
input_ref text,
error text
);
Read the columns:
id: the stable, addressable handle for this run. You will refer to it everywhere else.worker_id: which Worker this is (e.g.expense-reconciler).started_at: set when the run begins. NOT NULL.finished_at: nullable, set only when the run reaches a terminal state.status: bound by a CHECK constraint to five values. The database refuses anything else.input_ref: a pointer to what the run was working on (e.g. "pending_transactions for 2026-05-10").error: a short text field, set only on failure.
Table 2: turns
CREATE TABLE turns (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
run_id uuid NOT NULL REFERENCES runs(id) ON DELETE CASCADE,
seq integer NOT NULL,
started_at timestamptz NOT NULL DEFAULT now(),
finished_at timestamptz,
status text NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending','running','completed','failed','cancelled')),
UNIQUE (run_id, seq)
);
Read the columns:
run_id: the parent run.seq: turn 1, turn 2, turn 3 within that run. TheUNIQUE (run_id, seq)constraint prevents two turn-3s on the same run.started_at,finished_at,status: same shape asruns.
Table 3: tool_calls
CREATE TABLE tool_calls (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
turn_id uuid NOT NULL REFERENCES turns(id) ON DELETE CASCADE,
tool_name text NOT NULL,
input_json jsonb,
output_json jsonb,
status text NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending','running','completed','failed','cancelled')),
started_at timestamptz NOT NULL DEFAULT now(),
finished_at timestamptz,
error text
);
Read the columns:
turn_id: the parent turn.tool_name: e.g.categorise_transaction,insert_expense,queue_email.input_jsonandoutput_json: what the tool was called with, and what it returned. Stored as JSONB so you can query into them later.status,started_at,finished_at,error: same shape as the other two tables.
The three tables together form one tree per run: one run, many turns, many tool_calls per turn.
The Status Enum
The same five values appear in all three tables. They mean the same thing everywhere:
| Status | When It Is Set |
|---|---|
pending | The row exists but the work has not started yet. Useful for queues. |
running | The work has started. started_at is set. finished_at is still NULL. |
completed | The work finished successfully. finished_at is set. |
failed | The work finished with an error. finished_at is set. error is populated. |
cancelled | The work was stopped before finishing. finished_at is set. No error required. |
A row with status = 'running' and finished_at = NULL is the signature of a crash. The Worker started the work and never wrote the closing record. The journal tells you exactly where the crash happened.
Resume Story
Here is what last night's crashed Reconciler looks like when you read the journal in the morning.
SELECT id, worker_id, started_at, finished_at, status
FROM runs
WHERE worker_id = 'expense-reconciler'
ORDER BY started_at DESC
LIMIT 1;
Output:
id | worker_id | started_at | finished_at | status
----------+---------------------+---------------------+-------------+---------
8a3f... | expense-reconciler | 2026-05-10 23:00 | NULL | running
Translation: the run started, never finished, and the database still thinks it is running. That is your crash signal.
Now read the turns:
SELECT seq, started_at, finished_at, status
FROM turns
WHERE run_id = '8a3f...'
ORDER BY seq;
seq | started_at | finished_at | status
----+-------------------+-------------------+-----------
1 | 23:00:01 | 23:00:04 | completed
2 | 23:00:04 | 23:00:08 | completed
3 | 23:00:08 | NULL | running
Turns 1 and 2 finished cleanly. Turn 3 was in flight when the crash happened. Now the tool_calls for turn 3:
SELECT tool_name, status, started_at, finished_at, error
FROM tool_calls
WHERE turn_id = '<turn-3 id>'
ORDER BY started_at;
tool_name | status | started_at | finished_at | error
-----------------------+------------+------------+-------------+--------
categorise_transaction | completed | 23:00:08 | 23:00:09 | NULL
insert_expense | running | 23:00:09 | NULL | NULL
The Worker categorised the third transaction. It then started inserting the expense row and crashed before recording the result. The expenses table itself either has the new row or does not, depending on whether the database commit landed before the process died.
You now have a real choice, made from rows, not guesses:
- Resume strategy: re-run turn 3 with the same input, but use idempotency (the next lesson) so the second
insert_expensedoes not duplicate the row. - Restart strategy: mark this run
failed, start a fresh run, let it pick up the same pending transactions, and rely on idempotency to not double-process turns 1 and 2.
Either decision is defensible. The point is that the journal made the decision possible. Without it, you were guessing.
PRIMM-AI+ Practice: Record One Reconciler Run
Predict [AI-FREE]
The Reconciler is going to process exactly three pending transactions for Alice in one run. Each transaction needs three tool calls: one categorise_transaction, one insert_expense, one queue_email.
Before running, write down:
- How many rows will appear in
runsafter the run completes? - How many rows in
turns? - How many rows in
tool_calls? - What
statusvalues will exist on the run, on each turn, and on each tool_call? - Which column on
runsis set last? Why? - Confidence (1-5) on each answer.
Expected answers, to check yourself: 1 run row, 3 turn rows, 9 tool_call rows, all completed. finished_at on runs is set last because it cannot be set until every turn has finished.
Run
Use the agent to simulate one Reconciler run end to end. The migrations from Lesson 1 plus the three tables above are assumed in place.
On the journal-demo Neon branch, simulate one Reconciler run for Alice
processing three pending transactions. Insert one row in runs, three in
turns, nine in tool_calls. Walk every status from pending -> running ->
completed and timestamp it correctly. After the simulation, run a SELECT
that joins the three tables and shows me the run as a flat table.
What you should see:
- 1 row in
runs, statuscompleted,finished_atset. - 3 rows in
turns, allcompleted, all withseqvalues 1, 2, 3. - 9 rows in
tool_calls, three per turn, allcompleted. - The flat join shows you the full ordered trace of the run.
Investigate
Write your own explanation first.
- Why does
finished_atonrunsget set last? (Hint: a run cannot be complete before its last turn is.) - Why are there exactly 9 tool_calls, not 3? (Hint: each turn has three tool calls.)
- What does
ON DELETE CASCADEon the foreign keys mean for what happens if you delete the run? (Hint: turns and tool_calls go with it.)
Then ask the agent these targeted questions:
- "Show me the SQL that ordered the rows correctly when the run finished. Why was
runs.finished_atset afterturns.finished_atfor turn 3?" - "If I dropped
tool_callsfirst by accident, would the foreign key prevent it? What error would I see?" - "Walk me through the smallest change to add a
cost_centscolumn totool_callsfor tracking LLM spend."
You should be able to answer your version of (3) before reading the agent's: it is a column-add migration, nullable, no backfill needed.
Modify
Simulate a crash. Manually update the last turn's status to running and clear its finished_at. Update the last tool_call's status to running and clear its finished_at. Now query the journal as if you had just woken up.
Predict before you run:
- Which run has the crash signature (
status = 'running' AND finished_at IS NULL)? - Which turn is in-flight?
- Which tool_call is in-flight?
- If the resume Worker re-ran the in-flight tool_call, what risk would it carry without idempotency? (Hint: if the original tool call already wrote to
expenses, the resume call would write again.)
Run the queries. Compare with your prediction. Decide whether to mark the run failed and start fresh, or to resume from the in-flight turn.
Make [Mastery Gate]
Brief in business English:
"The Reconciler should record which model and version it used per run, so that if a model upgrade changes categorisation behaviour, we can answer 'which runs used the old model and which used the new one?' from the journal alone."
Hand the brief to the agent. Ask for the migration file.
Read the file. Confirm:
- The migration adds two new columns on
runs:model_name textandmodel_version text. - Both are nullable, because old runs do not have this information and we do not want to backfill arbitrary values.
- The
downgrade()drops both columns. - The agent did not "improve" the migration by also dropping the existing
errorcolumn or renaming anything. Read for scope creep.
Apply on the branch. Insert one new run with model fields populated and confirm a SELECT can filter by model_name. Roll back. Approve in writing or reject with reasons.
The gate passes when you can read three tables, narrate one run from end to end, and either approve a follow-up migration or refuse it because the agent overreached.
What Failure This Prevents
This lesson kills two failure modes: silent loss of work-in-progress when the Worker crashes (because the journal records intent and progress as separate facts), and inability to answer "what did the Worker do?" hours or days after the fact (because the rows are still there to be read). The Reconciler can now crash safely.
Try With AI
Prompt 1: Read a Journal
Show me three sample rows: one runs row, one of its turns rows, and one
of that turn's tool_calls rows. Pretty-print them. Do not explain. After
that, I will narrate, in plain English, what the Worker was doing on
that run. You then tell me what I missed.
What you're learning: Reading the journal as a tree (run -> turns -> tool_calls) before you ever try to write code that produces it. The Worker writes; you read; the agent confirms. Same pattern as 21A.
Prompt 2: Crash Diagnosis
I have one runs row with status='running' and finished_at IS NULL. The
last turns row for that run has status='running' and finished_at IS
NULL. The last tool_calls row for that turn has tool_name='insert_expense',
status='running', finished_at IS NULL.
Tell me the two possible states of the underlying expenses table after
this crash, and what one query I should run to find out which one I am
in.
What you're learning: Forensic reasoning. The journal narrows the problem to a single tool_call in flight. Whether the database commit landed is a separate question, answerable by querying the expenses table for a row that matches the tool_call's input_json.
Prompt 3: Cost Tracking Add-On
We want to add LLM cost tracking. Propose the smallest change to the
journal: which table gets a new column, what type, nullable or not,
default or not. Do not write the migration yet. Just argue for or
against your choice in three sentences.
What you're learning: Forcing the agent to argue before writing code is the cheapest way to surface bad design choices. If the argument is weak, the migration will be weak. By the time you say "okay, write it," you should already have agreed on the shape.
Checkpoint
- I can read the runs/turns/tool_calls schema and explain the foreign-key tree in plain English.
- I can name the five status values and explain when each is set.
- I can read a partially-completed journal and recognise the crash signature.
- I simulated one full run end to end and read the rows back.
- I can argue for either resume or restart given a specific journal state, and name the risk of each.
- I am ready to make these writes idempotent in the next lesson.