Observability and Forensics: Answer "What Happened?"
It is 8:14 on a Wednesday morning. The Reconciler had a bad Tuesday night. The boss wants a report by 9.
You did not write the Worker code. You did not design the retry policy. You did not pick the model. You did build the SoR shape across this chapter: migrations, journal, idempotency, claims, outbox, events, RLS, MCP boundary. That means the answers exist.
They are in five tables. Your job is to read them in the right order.
- Forensic report: a one-page narrative of one Worker run, assembled from rows. No re-run. No guessing.
- Run inspection: reading the `runs`, `turns`, and `tool_calls` rows for one `run_id` in order.
- Outbox dashboard: a grouped read on `outbox_messages` that shows what is pending, dispatched, or stuck.
- EXPLAIN plan: Postgres's description of how it will execute a query. Recognition only in this lesson.
- Seq Scan vs Index Scan: the two plan shapes you must distinguish at sight. Seq Scan reads every row; Index Scan jumps to the matching rows via an index.
The Four Operational Queries
Every forensic question routes to one of these tables. Memorise the routing.
| Question | Where to look |
|---|---|
| What runs ran last night, and which failed? | runs filtered by status and time range |
| For a failed run, which tool calls completed and which did not? | tool_calls joined to turns joined to runs |
| What side effects are stuck or retried? | outbox_messages grouped by status |
| What did the Worker actually decide step by step? | agent_events ordered by id |
One short sample for each, recognition only.
Q1: Which runs failed last night?
SELECT id, worker_id, started_at, finished_at, status, error
FROM runs
WHERE worker_id = 'expense-reconciler'
AND started_at >= '2026-05-13 20:00'
AND started_at < '2026-05-14 08:00'
ORDER BY started_at;
Read the columns. The status column tells you which runs finished, which failed, and which are still marked running (the crash signature from Lesson 2).
Q2: For one failed run, which tool calls completed?
SELECT t.seq, tc.tool_name, tc.status, tc.started_at, tc.finished_at, tc.error
FROM tool_calls tc
JOIN turns t ON t.id = tc.turn_id
JOIN runs r ON r.id = t.run_id
WHERE r.id = '8a3f...'
ORDER BY t.seq, tc.started_at;
The join walks the run-to-turn-to-tool_call tree from Lesson 2. The output is a chronological list of what the Worker tried, in order.
Q3: What side effects are stuck?
SELECT status, COUNT(*) AS message_count
FROM outbox_messages
WHERE created_at >= now() - interval '24 hours'
GROUP BY status;
A healthy outbox shows most rows in dispatched. A spike in pending means the relay is behind. A spike in failed means the relay tried and could not deliver. The grouped count is your dashboard in one query.
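When the pending count spikes, the natural follow-up is how long those messages have been waiting. A minimal sketch, assuming `outbox_messages` carries an `attempt_count` column for retries (the column name is an assumption; adapt it to your schema):

```sql
-- Oldest pending messages first: how far behind is the relay?
-- attempt_count is an assumed column name; adjust to your outbox schema.
SELECT id, created_at,
       now() - created_at AS waiting_for,
       attempt_count
FROM outbox_messages
WHERE status = 'pending'
ORDER BY created_at
LIMIT 10;
```

If the oldest pending row is minutes old, the relay is merely behind; if it is hours old, something is stuck.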
Q4: What did the Worker actually decide?
SELECT occurred_at, event_type, subject_type, subject_id, payload
FROM agent_events
WHERE run_id = '8a3f...'
ORDER BY id;
The event stream from Lesson 6, in order, is the diary of one run. The payload carries the decisions: which category, what confidence, which idempotency key.
A Forensic Report Template
A one-page forensic report for one Worker run contains exactly six lines.
- run_id: the stable handle for this run (from `runs.id`).
- timing: `started_at`, `finished_at` (or NULL), and final `status`.
- last completed turn: the highest `seq` in `turns` with `status = 'completed'` for this `run_id`.
- last tool_call status: the most recent `tool_calls` row for this run, with `tool_name`, `status`, and `error` if present.
- outbox state: rows in `outbox_messages` tied to this run, grouped as pending, dispatched, failed.
- event summary: the count of `agent_events` rows for this run and the notable `event_type` values (for example `low_confidence_flag`, `expense_written`, `email_dispatched`).
Six lines. No prose padding. The boss reads it in thirty seconds.
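Each template field is one small query. As an illustration, the "last completed turn" field reduces to a single aggregate; a sketch using the same table and column names as the queries above:

```sql
-- Last completed turn for one run: one template field, one query.
SELECT max(seq) AS last_completed_turn
FROM turns
WHERE run_id = '8a3f...'
  AND status = 'completed';
```

The other five fields follow the same pattern: one filter on the run, one column or aggregate out.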
A Glance at EXPLAIN
Postgres can tell you how it plans to run a query before it runs it. Prefix any SELECT with EXPLAIN and you get a plan. In this lesson you only need to recognise two shapes.
EXPLAIN SELECT * FROM agent_events WHERE run_id = '8a3f...';
Two possible outputs, in plain English:
Seq Scan on agent_events (cost=0.00..234.00 rows=5 width=...)
Filter: (run_id = '8a3f...'::uuid)
vs
Index Scan using agent_events_run_id_idx on agent_events (cost=0.42..8.50 rows=5 width=...)
Index Cond: (run_id = '8a3f...'::uuid)
The first reads every row in agent_events and keeps the ones that match. Fine on a small table. Painful on a large one. The second jumps directly to the matching rows through an index on run_id. Fast at any size.
Recognition only. You do not need to read costs. You do not need to tune anything in this chapter. You only need to spot which shape the agent's query is hitting, decide whether the plan matches the release risk, and flag the need for an index when the evidence is obvious.
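When you do flag a Seq Scan on a table that will grow, the eventual fix is usually one index. A sketch of what that fix looks like, using the index name shown in the plan above (applying it is outside this lesson's scope; the point is to know what you are asking for):

```sql
-- Turn the Seq Scan into an Index Scan for run-scoped lookups.
-- IF NOT EXISTS makes the statement safe to re-run.
CREATE INDEX IF NOT EXISTS agent_events_run_id_idx
  ON agent_events (run_id);
```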
What This Chapter Is NOT Teaching
This is the floor of observability, not the ceiling.
- Deep query tuning: rewriting slow queries, picking a full indexing strategy, and reading cost numbers in detail.
- Partitioning of agent_events: long-running event tables can be partitioned by month or by tenant.
- pg_stat_statements: the extension whose view tracks query frequency and cost across the whole database.
- Point-in-time recovery (PITR): replaying WAL logs to restore the database to a specific second.
- Cross-region replication: read replicas in other regions, failover, and lag monitoring.
The five tables and the four operational queries are enough to answer "what happened last night?" That is the bar this lesson sets.
PRIMM-AI+ Practice: Build the Tuesday-Night Forensic Report
Predict [AI-FREE]
You are handed one run_id: the Reconciler run that died at 23:51 on Tuesday. Before opening any tool, write down what you expect to find.
- In `runs`: which `status` will the row carry?
- In `tool_calls`: how many rows will carry `status = 'running'` with `finished_at IS NULL`?
- In `outbox_messages` tied to this run: any rows in `pending`? In `failed`?
- In `agent_events` for this run: which event types should be present, and which absent?
- Confidence on each, 1 to 5.
The expected pattern from Lessons 2-6: one runs row in running, at most one tool_calls row still running, possibly one or two outbox_messages rows in pending, and an event stream that stops before run_completed.
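The crash signature can also be hunted across all runs at once, not just the one you were handed. A hedged sketch that lists runs still marked running after a staleness cutoff (the one-hour threshold is an arbitrary assumption; tune it to your Workers' typical run lengths):

```sql
-- Runs that claim to be running but have gone quiet: likely crashes.
-- The one-hour threshold is an assumption, not a rule.
SELECT id, worker_id, started_at
FROM runs
WHERE status = 'running'
  AND started_at < now() - interval '1 hour'
ORDER BY started_at;
```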
Run
Ask your agent to execute the four operational queries scoped to the one run_id and return the rows. Adjust the prompt below only if your runtime's command differs.
On the Neon dev branch, given run_id '8a3f...', run the four operational
queries: (1) the runs row, (2) the tool_calls joined to turns and runs,
(3) the outbox_messages grouped by status, (4) the agent_events ordered
by id. Return each result as a small table. Do not interpret the results.
I will narrate the report from the rows.
What you should see:
- One `runs` row with `status = 'running'` and `finished_at IS NULL`. The crash signature.
- A `tool_calls` trace that stops in the middle of one turn, with the last tool call still `running`.
- An `outbox_messages` grouping with at least one row in `pending` (the email never got dispatched because the run died).
- An `agent_events` stream that contains `claimed_transaction` and `categorised_with_confidence`, but not `run_completed`.
Investigate
Write your own one-page report first, using the six-line template above. Fill in run_id, timing, last completed turn, last tool_call status, outbox state, and event summary. Do not consult the agent yet.
Then ask the agent three targeted questions to test the gaps in your queries:
- "If the `tool_calls` table for this run has two rows with `status = 'running'` and the second one's `started_at` is later, which one is the actual crash point? Why?"
- "The outbox query showed 1 pending row. Which event in `agent_events` should have followed an `email_dispatched` insert? What does its absence tell me?"
- "Show me an EXPLAIN of query Q4 against `agent_events` filtered by this `run_id`. Is the plan Seq Scan or Index Scan? Do not tune it, just label it."
Read the agent's answers against your own. The skill is finishing the report from the rows, then using the agent to sanity-check, not the other way round.
Modify
Now change one filter. Instead of scoping by run_id = '8a3f...', scope all four queries by user_id = 'alice' for the same Tuesday-night window.
Predict before you run:
- Which queries change shape? (Hint: any query that already filtered by `run_id` needs a different join to reach `user_id`.)
- Which queries return more rows than the run-scoped version? (Hint: a user has had many runs.)
- Does the outbox grouping still make sense, or does it need a join to `runs` first? (Hint: outbox rows reference the run that produced them, not the user directly.)
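The outbox hint can be made concrete. A sketch of the user-scoped grouping, assuming `runs` carries a `user_id` column and `outbox_messages` carries a `run_id` foreign key (both are assumptions about this chapter's schema; adapt the join to your columns):

```sql
-- User-scoped outbox dashboard: join through runs to reach user_id.
-- Assumes outbox_messages.run_id and runs.user_id exist.
SELECT o.status, COUNT(*) AS message_count
FROM outbox_messages o
JOIN runs r ON r.id = o.run_id
WHERE r.user_id = 'alice'
  AND o.created_at >= now() - interval '24 hours'
GROUP BY o.status;
```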
Run the modified queries. Compare your prediction with the rows that come back. Decide whether a user-scoped report is more or less useful than a run-scoped one for a Tuesday-night triage. There is no single right answer. The point is to read both.
Make [Mastery Gate]
Brief in business English:
"Produce a one-page forensic report for run_id `8a3f...` containing what happened, what is pending, what was retried, and what evidence exists. Six lines, no prose padding. I will read it before I decide whether to retry the run or mark it failed."
Hand the brief to the agent. Ask it to assemble the report by running the four operational queries and filling the six-line template. You do not approve the report by trusting the agent. You approve it by:
- Reading the agent's queries and confirming each one matches the table-to-question map above.
- Re-running one of the four queries yourself and comparing the rows the agent returned.
- Checking that the six template fields are all populated (no "unknown" entries) and the `run_id` is the right one.
The gate passes when you can hand the boss a six-line report at 8:55am, name the failure mode in one sentence, and point to the row that proves it.
What Failure This Prevents
This lesson kills the worst possible answer to "what happened last night?": "we cannot tell." With the journal, outbox, and event stream in place, the answer always exists. You just have to read the rows in the right order.
Try With AI
Prompt 1: Forensic Report Drill
I will give you a run_id. Run the four operational queries against the
Neon dev branch: the runs row, tool_calls joined to turns and runs,
outbox_messages grouped by status, and agent_events ordered by id.
Return each result as a small table. Do not interpret. After I read the
rows, I will write the forensic report myself.
What you're learning: Treating the agent as a query runner, not as an interpreter. The rows are the truth; your job is to read them. Same pattern as 21A, scaled to five tables.
Prompt 2: EXPLAIN Recognition
Show me the EXPLAIN output for `SELECT * FROM agent_events WHERE run_id
= 'abc...'` on the current Neon branch. Do not tune anything. Just tell
me whether the plan is using a Seq Scan or an Index Scan, in one
sentence. No cost numbers. No suggestions.
What you're learning: Recognition over tuning. You do not need to be a query tuner to do forensics. You need to spot the plan shape and know when to call for help.
Prompt 3: Cross-Tenant Audit Thought Experiment
Today the forensic report is scoped to one run_id. Tomorrow the auditor
asks: "Across all Workers, show me every run that touched tenant Bob in
the last 30 days, and any outbox row that was retried more than twice."
Do not write the queries yet. Tell me in three sentences what changes
in the report shape and which existing tables already carry that
information.
What you're learning: Forensics is a reading skill, not a tooling skill. The tables already carry the answer. The question is whether you can name the joins that get you there and identify when the evidence is strong enough to block release or ask for a focused performance fix.
Checkpoint
- I can name the four operational questions and the table that answers each one.
- I can run the four queries against one `run_id` and read the rows back without interpretation.
- I can fill in the six-line forensic report template from the rows alone.
- I can distinguish Seq Scan from Index Scan in a short EXPLAIN output.
- I can name two observability topics this lesson did NOT teach (deep tuning, partitioning, PITR, replication, `pg_stat_statements`) and explain why they are outside this chapter's minimum SoR scope.