Observability and Forensics: Answer "What Happened?"

It is 8:14 on a Wednesday morning. The Reconciler had a bad Tuesday night. The boss wants a report by 9.

You did not write the Worker code. You did not design the retry policy. You did not pick the model. You did build the SoR shape across this chapter: migrations, journal, idempotency, claims, outbox, events, RLS, MCP boundary. That means the answers exist.

They are in five tables. Your job is to read them in the right order.

Key Terms for This Lesson
  • Forensic report: a one-page narrative of one Worker run, assembled from rows. No re-run. No guessing.
  • Run inspection: reading the runs, turns, and tool_calls rows for one run_id in order.
  • Outbox dashboard: a grouped read on outbox_messages that shows what is pending, dispatched, or stuck.
  • EXPLAIN plan: Postgres's description of how it will execute a query. Recognition only in this lesson.
  • Seq Scan vs Index Scan: the two plan shapes you must distinguish at sight. Seq Scan reads every row; Index Scan jumps to the matching rows via an index.

The Four Operational Queries

Every forensic question routes to one of these tables. Memorise the routing.

Question → Where to look
  • What runs ran last night, and which failed? → runs, filtered by status and time range
  • For a failed run, which tool calls completed and which did not? → tool_calls joined to turns joined to runs
  • What side effects are stuck or retried? → outbox_messages grouped by status
  • What did the Worker actually decide, step by step? → agent_events ordered by occurred_at

One short sample for each, recognition only.

Q1: Which runs failed last night?

SELECT id, worker_id, started_at, finished_at, status, error
FROM runs
WHERE worker_id = 'expense-reconciler'
AND started_at >= '2026-05-13 20:00'
AND started_at < '2026-05-14 08:00'
ORDER BY started_at;

Read the columns. The status column tells you which runs finished, which failed, and which are still marked running (the crash signature from Lesson 2).
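The crash signature can also be isolated directly. A hedged sketch, assuming the same runs columns as Q1 — the one-hour threshold is illustrative, not canonical:

```sql
-- Sketch: runs still marked 'running' long after they started.
-- These are the crash candidates from Lesson 2.
SELECT id, worker_id, started_at
FROM runs
WHERE worker_id = 'expense-reconciler'
  AND status = 'running'
  AND started_at < now() - interval '1 hour'
ORDER BY started_at;
```

A run that genuinely takes an hour would also match, so treat the result as a shortlist, not a verdict.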

Q2: For one failed run, which tool calls completed?

SELECT t.seq, tc.tool_name, tc.status, tc.started_at, tc.finished_at, tc.error
FROM tool_calls tc
JOIN turns t ON t.id = tc.turn_id
JOIN runs r ON r.id = t.run_id
WHERE r.id = '8a3f...'
ORDER BY t.seq, tc.started_at;

The join walks the run-to-turn-to-tool_call tree from Lesson 2. The output is a chronological list of what the Worker tried, in order.

Q3: What side effects are stuck?

SELECT status, COUNT(*) AS message_count
FROM outbox_messages
WHERE created_at >= now() - interval '24 hours'
GROUP BY status;

A healthy outbox shows most rows in dispatched. A spike in pending means the relay is behind. A spike in failed means the relay tried and could not deliver. The grouped count is your dashboard in one query.
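When pending spikes, the next question is how stale the backlog is. A hedged sketch, assuming the same outbox_messages columns as Q3:

```sql
-- Sketch: age of the oldest undelivered message.
-- A large pending_age means the relay has been behind for a while,
-- not just momentarily busy.
SELECT min(created_at)         AS oldest_pending,
       now() - min(created_at) AS pending_age
FROM outbox_messages
WHERE status = 'pending';
```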

Q4: What did the Worker actually decide?

SELECT occurred_at, event_type, subject_type, subject_id, payload
FROM agent_events
WHERE run_id = '8a3f...'
ORDER BY id;

The event stream from Lesson 6, in order, is the diary of one run. The payload carries the decisions: which category, what confidence, which idempotency key.
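When the full stream is too long to read at 8:14am, a grouped count gives the shape of the run at a glance. A sketch, assuming the same agent_events columns as Q4:

```sql
-- Sketch: one-glance event summary for a run.
SELECT event_type, COUNT(*) AS events
FROM agent_events
WHERE run_id = '8a3f...'
GROUP BY event_type
ORDER BY COUNT(*) DESC;
```

A missing run_completed row in this summary is the same crash signature you saw in runs, read from a different table.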

A Forensic Report Template

A one-page forensic report for one Worker run contains exactly six lines.

  • run_id: the stable handle for this run (from runs.id).
  • timing: started_at, finished_at (or NULL), and final status.
  • last completed turn: the highest seq in turns with status = 'completed' for this run_id.
  • last tool_call status: the most recent tool_calls row for this run, with tool_name, status, and error if present.
  • outbox state: rows in outbox_messages tied to this run, grouped as pending, dispatched, failed.
  • event summary: the count of agent_events rows for this run and the notable event_type values (for example low_confidence_flag, expense_written, email_dispatched).

Six lines. No prose padding. The boss reads it in thirty seconds.
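Each template line is a one-row query. For example, the "last completed turn" line, assuming turns carries the status column the template describes:

```sql
-- Sketch: the "last completed turn" line of the report.
SELECT max(seq) AS last_completed_turn
FROM turns
WHERE run_id = '8a3f...'
  AND status = 'completed';
```

If your turns table tracks completion differently, derive the same fact from the last tool_calls row with status = 'completed' instead.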

A Glance at EXPLAIN

Postgres can tell you how it plans to run a query before it runs it. Prefix any SELECT with EXPLAIN and you get a plan. In this lesson you only need to recognise two shapes.

EXPLAIN SELECT * FROM agent_events WHERE run_id = '8a3f...';

Two possible outputs, in plain English:

Seq Scan on agent_events  (cost=0.00..234.00 rows=5 width=...)
Filter: (run_id = '8a3f...'::uuid)

vs

Index Scan using agent_events_run_id_idx on agent_events  (cost=0.42..8.50 rows=5 width=...)
Index Cond: (run_id = '8a3f...'::uuid)

The first reads every row in agent_events and keeps the ones that match. Fine on a small table. Painful on a large one. The second jumps directly to the matching rows through an index on run_id. Fast at any size.

Recognition only. You do not need to read the cost numbers, and you do not need to tune anything in this chapter. You only need to spot which shape the agent's query is hitting, judge whether that plan is acceptable for the release at hand, and flag the need for an index when the evidence is obvious.
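When you do flag it, the fix can be named concretely. A hedged sketch — the index name mirrors the plan output above, and you should first confirm no equivalent index already exists:

```sql
-- Sketch: the index that turns the Seq Scan into an Index Scan.
-- CONCURRENTLY builds the index without blocking writes; it must
-- run outside a transaction block.
CREATE INDEX CONCURRENTLY agent_events_run_id_idx
  ON agent_events (run_id);
```

Flagging is your job in this lesson; running the CREATE INDEX is a decision for whoever owns the schema.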

What This Chapter Is NOT Teaching

This is the floor of observability, not the ceiling.

  • Deep query tuning: rewriting slow queries, picking a full indexing strategy, and reading cost numbers in detail.
  • Partitioning of agent_events: long-running event tables can be partitioned by month or by tenant.
  • pg_stat_statements: the catalog view that tracks query frequency and cost across the whole database.
  • Point-in-time recovery (PITR): replaying WAL logs to restore the database to a specific second.
  • Cross-region replication: read replicas in other regions, failover, and lag monitoring.

The five tables and the four operational queries are enough to answer "what happened last night?" That is the bar this lesson sets.

PRIMM-AI+ Practice: Build the Tuesday-Night Forensic Report

Predict [AI-FREE]

You are handed one run_id: the Reconciler run that died at 23:51 on Tuesday. Before opening any tool, write down what you expect to find.

  1. In runs: which status will the row carry?
  2. In tool_calls: how many rows will carry status = 'running' with finished_at IS NULL?
  3. In outbox_messages tied to this run: any rows in pending? In failed?
  4. In agent_events for this run: which event types should be present, and which absent?
  5. Confidence on each, 1 to 5.

The expected pattern from Lessons 2-6: one runs row in running, at most one tool_calls row still running, possibly one or two outbox_messages rows in pending, and an event stream that stops before run_completed.

Run

Ask your agent to execute the four operational queries scoped to the one run_id and return the rows. Adapt the wording only if your runtime's command differs.

On the Neon dev branch, given run_id '8a3f...', run the four operational
queries: (1) the runs row, (2) the tool_calls joined to turns and runs,
(3) the outbox_messages grouped by status, (4) the agent_events ordered
by id. Return each result as a small table. Do not interpret the results.
I will narrate the report from the rows.

What you should see:

  • One runs row with status = 'running' and finished_at IS NULL. The crash signature.
  • A tool_calls trace that stops in the middle of one turn, with the last tool call still running.
  • An outbox_messages grouping with at least one row in pending (the email never got dispatched because the run died).
  • An agent_events stream that contains claimed_transaction and categorised_with_confidence, but not run_completed.

Investigate

Write your own one-page report first, using the six-line template above. Fill in run_id, timing, last completed turn, last tool_call status, outbox state, and event summary. Do not consult the agent yet.

Then ask the agent three targeted questions to test the gaps in your queries:

  1. "If the tool_calls table for this run has two rows with status = 'running' and the second one's started_at is later, which one is the actual crash point? Why?"
  2. "The outbox query showed 1 pending row. Which event in agent_events should have followed an email_dispatched insert? What does its absence tell me?"
  3. "Show me an EXPLAIN of query Q4 against agent_events filtered by this run_id. Is the plan Seq Scan or Index Scan? Do not tune it, just label it."

Read the agent's answers against your own. The skill is finishing the report from the rows, then using the agent to sanity-check, not the other way round.

Modify

Now change one filter. Instead of scoping by run_id = '8a3f...', scope all four queries by user_id = 'alice' for the same Tuesday-night window.

Predict before you run:

  • Which queries change shape? (Hint: any query that already filtered by run_id needs a different join to reach user_id.)
  • Which queries return more rows than the run-scoped version? (Hint: a user has had many runs.)
  • Does the outbox grouping still make sense, or does it need a join to runs first? (Hint: outbox rows reference the run that produced them, not the user directly.)

Run the modified queries. Compare your prediction with the rows that come back. Decide whether a user-scoped report is more or less useful than a run-scoped one for a Tuesday-night triage. There is no single right answer. The point is to read both.
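If the outbox grouping is the sticking point, here is a hedged sketch of the rescoped Q3. It assumes outbox_messages carries a run_id column and runs carries a user_id column, per the hints above; adjust the join if the user link lives elsewhere:

```sql
-- Sketch: Q3 rescoped from one run to one user, via a join to runs.
-- Outbox rows reference the run that produced them, not the user.
SELECT om.status, COUNT(*) AS message_count
FROM outbox_messages om
JOIN runs r ON r.id = om.run_id
WHERE r.user_id = 'alice'
  AND om.created_at >= '2026-05-13 20:00'
GROUP BY om.status;
```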

Make [Mastery Gate]

Brief in business English:

"Produce a one-page forensic report for run_id 8a3f... containing what happened, what is pending, what was retried, and what evidence exists. Six lines, no prose padding. I will read it before I decide whether to retry the run or mark it failed."

Hand the brief to the agent. Ask it to assemble the report by running the four operational queries and filling the six-line template. You do not approve the report by trusting the agent. You approve it by:

  1. Reading the agent's queries and confirming each one matches the table-to-question map above.
  2. Re-running one of the four queries yourself and comparing the rows the agent returned.
  3. Checking that the six template fields are all populated (no "unknown" entries) and the run_id is the right one.

The gate passes when you can hand the boss a six-line report at 8:55am, name the failure mode in one sentence, and point to the row that proves it.

What Failure This Prevents

This lesson kills the worst possible answer to "what happened last night?": "we cannot tell." With the journal, outbox, and event stream in place, the answer always exists. You just have to read the rows in the right order.

Try With AI

Prompt 1: Forensic Report Drill

I will give you a run_id. Run the four operational queries against the
Neon dev branch: the runs row, tool_calls joined to turns and runs,
outbox_messages grouped by status, and agent_events ordered by id.
Return each result as a small table. Do not interpret. After I read the
rows, I will write the forensic report myself.

What you're learning: Treating the agent as a query runner, not as an interpreter. The rows are the truth; your job is to read them. Same pattern as 21A, scaled to five tables.

Prompt 2: EXPLAIN Recognition

Show me the EXPLAIN output for `SELECT * FROM agent_events WHERE run_id
= 'abc...'` on the current Neon branch. Do not tune anything. Just tell
me whether the plan is using a Seq Scan or an Index Scan, in one
sentence. No cost numbers. No suggestions.

What you're learning: Recognition over tuning. You do not need to be a query tuner to do forensics. You need to spot the plan shape and know when to call for help.

Prompt 3: Cross-Tenant Audit Thought Experiment

Today the forensic report is scoped to one run_id. Tomorrow the auditor
asks: "Across all Workers, show me every run that touched tenant Bob in
the last 30 days, and any outbox row that was retried more than twice."
Do not write the queries yet. Tell me in three sentences what changes
in the report shape and which existing tables already carry that
information.

What you're learning: Forensics is a reading skill, not a tooling skill. The tables already carry the answer. The question is whether you can name the joins that get you there and identify when the evidence is strong enough to block release or ask for a focused performance fix.

Checkpoint

  • I can name the four operational questions and the table that answers each one.
  • I can run the four queries against one run_id and read the rows back without interpretation.
  • I can fill in the six-line forensic report template from the rows alone.
  • I can distinguish Seq Scan from Index Scan in a short EXPLAIN output.
  • I can name two observability topics this lesson did NOT teach (deep tuning, partitioning, PITR, replication, pg_stat_statements) and explain why they are outside this chapter's minimum SoR scope.
