From Agent to Digital FTE: A 4-Hour Hands-On Workshop
15 Concepts, 80% of Real Use - Skills, System of Record, and MCP
A continuation crash course. This is course #4 in the agentic-coding track. The previous course, Build AI Agents with the OpenAI Agents SDK and Cloudflare Sandbox, taught you to build a streaming chat agent with sessions, guardrails, and a sandboxed deployment. By the end of that course you had an agent that worked. By the end of this course you'll have a Worker: same OpenAI Agents SDK foundation, same Cloudflare Sandbox runtime, but with three additions that change what the agent is: portable Skills it loads on demand, a Neon Postgres system of record it reads from and writes to, and Model Context Protocol connections that wire the two together.
The single insight that makes everything else click: turning an agent into a Worker is two architectural decisions, where its capabilities live, and where its truth lives. Skills answer the first. MCP-connected systems of record answer the second. Everything in this crash course is one of those two answers or the wiring between them.
In plain English: a chat agent that lives in a single Python process has its capabilities baked into code and its memory baked into RAM. Restart the process and both disappear. A Worker (the architectural term our thesis uses for an agent that operates inside a company) has its capabilities packaged as Skills (portable folders the agent discovers on demand) and its memory packaged in a system of record (a durable database the agent reads from and writes to through the Model Context Protocol). The previous course built the engine. This course builds what the engine runs against.

Start here: the architectural placement and the 15-concept cheat sheet
Where this course sits in the architecture. The Agent Factory thesis describes Seven Invariants that any production agent system must satisfy. Course #3 covered the engine layer (Invariant 4): the OpenAI Agents SDK as the harness, Cloudflare Sandbox as the compute. This course covers two more invariants and one of the production engine's three pillars:
- Invariant 5: Every Worker runs against a system of record. "Engine is what each Worker runs on; system of record is what each Worker runs against. Every AI Worker reads from and writes to an authoritative store of state." Neon Postgres + pgvector is one concrete realization of this invariant. The thesis is clear: any durable, addressable, governed store the workforce can read and write satisfies the requirement. We use Neon because it's free to start, scales to zero, and ships an official MCP server. The architecture is invariant; the product is replaceable.
- The Skills capability layer. The thesis names "capability packaged as portable skills" as one of the architecture's stable invariants. Skills are the open standard (originally Anthropic, now ecosystem-wide at agentskills.io) that lets a Worker's capabilities live outside its code, in folders an agent discovers, loads, and executes on demand.
- MCP as the connector. The thesis frames MCP as "how the workforce reaches" its systems of record. Course #3 used MCP lightly; this course makes MCP the load-bearing pattern between the agent and Neon.
The thesis sentence, expanded. Two architectural decisions turn an agent into a Worker:
- Where do its capabilities live? In code, as @function_tool decorators → fine for a demo. In Skill folders the agent discovers at startup and loads on demand → fine for a Worker that ships, evolves, gets added to without redeploying, and is sharable across Workers in the same company.
- Where does its truth live? In RAM and a local SQLite file → fine for a chat demo. In a durable Postgres database the agent reads from and writes to under audit, with vector search over what it has learned → fine for a Worker that has to answer "what did you tell the customer six weeks ago?" and "have we seen this question before?"
The connector that joins the two is the Model Context Protocol: an open standard for how an agent reaches external state and external capability. By the end of this course, you will have replaced Course #3's stub tools and SQLite session with: three Skills the agent loads on demand, a Neon Postgres + pgvector database holding the agent's authoritative state, and an MCP layer that wires them to the agent without coupling the agent to either.
The 15-concept cheat sheet. A failure in production almost always traces to one of three root causes: a missing or wrong skill, a system of record that isn't actually the source of truth, or an MCP wiring that loses data. This table is the diagnostic.
| # | Concept | Layer | What question it answers |
|---|---|---|---|
| 1 | What an Agent Skill is | Skills | Where does reusable capability live? In a folder, with SKILL.md plus optional scripts/references. |
| 2 | Progressive disclosure | Skills | Why are skills cheap to keep on hand? Discovery → activation → execution loads only what's needed when it's needed. |
| 3 | Writing a SKILL.md | Skills | What does a skill file actually contain? Metadata, trigger description, operational instructions. |
| 4 | Skill packaging conventions | Skills | How do skills travel between tools? Same folder works in Claude Code, OpenCode, and any compliant client. |
| 5 | Composing skills | Skills | When to chain small skills via filesystem handoff vs. write one big skill. |
| 6 | Why managed Postgres | System of record | What store earns "system of record"? One with persistence, branching, governance, and the vector primitives an agent needs. |
| 7 | The Worker's schema | System of record | What tables does an agent actually need? Conversations, messages, documents, embeddings, audit log, capability invocations. |
| 8 | pgvector basics | System of record | How does semantic search work in Postgres? Embedding column, distance operators, index types. |
| 9 | The embedding pipeline | System of record | How does text become a queryable vector? Chunking, the embedding model, when to re-embed. |
| 10 | Audit trail as discipline | System of record | What does "reads and writes" mean for a Worker? Every action a Worker takes leaves a trace the company can replay. |
| 11 | What MCP is and isn't | MCP | A protocol for tools, resources, and prompts: not a framework, not a service. |
| 12 | The Neon MCP server | MCP | The agent's interface to its database: what it exposes, how it authenticates. |
| 13 | Connecting MCP to the Agents SDK | MCP | The SDK's MCP integration: how to register a server, what the model sees, where the trust boundary lives. |
| 14 | Custom MCP servers | MCP | When to write your own server vs. just use @function_tool. The decision tree. |
| 15 | MCP under load | MCP | Transport choices, connection pooling, when to queue. |
Once you have this mapping, the rest of the document is mostly mechanics. A failure in production traces to one of: a Skill that doesn't get discovered (description too vague), a system of record that two Workers disagree about (schema race), or an MCP wiring that drops events (transport not chosen for the workload). The diagnostic tells you which.
Audience and prerequisites in 30 seconds. Intermediate course. Assumes you completed Build AI Agents Crash Course (Course #3), have the Agentic Coding Crash Course discipline (Plan mode, rules files), have done at least one PRIMM-AI+ cycle, and have a working Postgres mental model. Full audience, prereqs list, and glossary are in Appendix A.1 and A.5 at the end of the page; flip there now if you're missing one of those before starting the Quick Win.
Current as of May 14, 2026. Verified against openai-agents==0.17.2, mcp==1.27.1, Neon's MCP server documentation, and pgvector 0.8+ (run pip index versions openai-agents for the current pin when you build). The state-and-trust architecture this course teaches does not change when the products do; the products are this year's best fit. The Cloudflare Sandbox tutorial and Neon docs are the canonical sources for vendor-specific details. See Part 5's Swap guide for the per-product alternatives at each layer (Supabase, Pinecone, Cohere embeddings, LangGraph, and others).
The OpenAI Agents SDK and Cloudflare Sandbox stack from the previous course is the foundation of this course, not a stepping stone we move past. Your agent in Part 4 still uses Agent, Runner, function_tool, sessions, streaming events, guardrails, and SandboxAgent with Shell() and Filesystem() capabilities. What changes: those primitives now sit on top of a Skills library and a Neon system of record connected via MCP. The previous course taught you the engine. This course teaches you what the engine runs against.
Cloudflare Sandbox remains the trust boundary for anything the agent executes, including Skill scripts that run shell commands. Neon is the system of record. MCP is the wiring. The architecture combines an open SDK (OpenAI Agents SDK, MIT-licensed), an open connector protocol (MCP), and replaceable managed infrastructure (Cloudflare Sandbox for execution, Neon for storage), each piece swappable without changing the others. Everything below assumes the same Python-first stack as Course #3 with uv for package management.
This is a Python-first course, like its predecessor. The Skills format is language-agnostic (a SKILL.md is a Markdown file, whether your agent is Python or TypeScript), and MCP is transport-agnostic, but the agent we extend in Part 4 is the same Python chat-agent from Course #3.
The dual-tool pattern from Course #3 continues here. Sections that diverge between Claude Code and OpenCode have a switcher; pick one and the page syncs across visits.
There's a complete worked example in Part 4: the chat agent from Course #3, evolved into a customer-support Worker with three skills, a Neon system of record, and an MCP wiring layer. Eight build decisions, same shape as the previous course's worked example. If you learn better from doing than reading definitions, skim Parts 1–3 and jump to Part 4.
Architecture in one line. Engine = OpenAI Agents SDK + Cloudflare Sandbox (from Course #3). Capability = Skills (Part 1). Truth = Neon Postgres + pgvector (Part 2). Connector = MCP (Part 3). The eight build decisions in Part 4 wire all four together. If only one sentence of this whole document sticks, that's the one.
The fifteen-minute quick win: succeed once, then study why it worked
Before you read the 15 concepts that explain why this architecture works, build the smallest possible version of it that actually works. By the end of this section you'll have:
- one Postgres table holding the agent's first durable write
- one audit row recording what the agent did, in the same transaction
- one MCP boundary the agent crossed to do it
- one Skill folder, ready to activate later when you run the same agent from inside Claude Code
- a working answer to the question "did the system of record + MCP boundary actually do anything for me?"
This is not the Part 4 worked example; that's the full Worker, eight Decisions, hundreds of lines. This is one screen. If you only have one sitting, do this, then come back for the concepts when you want to know why each piece was shaped the way it was.
Before you start: this continues your Course #3 project. The Quick Win extends the chat-agent/ project from Course #3: a Skill folder, an MCP server, and ~30 lines added to its cli.py. With that project in hand it is two uv commands (uv add, then uv run). Without it, do Course #3 first, or scaffold a fresh uv project (uv init plus a minimal agent-loop cli.py) and run three commands instead of two.
One precondition. Skill discovery (scanning .claude/skills/, reading frontmatter, presenting descriptions to the model) is a client capability in Claude Code, OpenCode, and Codex. A bare Agent(...) from the OpenAI Agents SDK does not read .claude/skills/SKILL.md on its own. The Quick Win builds a standalone Python chat agent (the same cli.py shape as Course #3) and routes "Remember this: …" via the MCP tool's own docstring. The Skill folder from Step 1 activates later when you run the same agent from inside Claude Code; it does not fire in the standalone Python path of the Quick Win, and that is the point: the MCP boundary alone is enough to prove the architecture works.
Step 1. In your Course #3 chat-agent/ project, create a Skill folder with one file:
mkdir -p .claude/skills/log-a-note
cat > .claude/skills/log-a-note/SKILL.md << 'EOF'
---
name: log-a-note
description: Saves a short text note to the durable notes table in Postgres, along with a timestamp. Use when the user says "remember this", "save a note", "log that", or otherwise asks for something to be persisted between sessions. Returns the row_id of the saved note.
---
# Log a note
When this skill activates:
1. Extract the note text from the user's message — everything after "remember:" or "save this:" or similar trigger phrases, or the whole message if no trigger phrase is used.
2. Call the `save_note` tool with the extracted text.
3. Reply with one short sentence confirming the save and citing the row_id.
EOF
Step 2. Create one Postgres table and a tiny audit table on a fresh database or Neon branch, not one that already has tables. (Use Neon's web console, or CREATE DATABASE on any Postgres you control; you don't need pgvector yet.) The names notes and audit_log are generic enough to collide with an existing schema, and the CREATE TABLE below assumes an empty target.
CREATE TABLE notes (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
note_text TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE audit_log (
id BIGSERIAL PRIMARY KEY,
action TEXT NOT NULL,
payload JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Step 3. Write a 30-line MCP server exposing one tool. The full pattern shows up again in Concept 14 and Decision 6; this is the smallest possible version:
# notes_mcp.py
import json
import os
import asyncpg
from mcp.server.fastmcp import FastMCP
mcp = FastMCP("notes")
_pool = None
async def pool():
global _pool
if _pool is None:
_pool = await asyncpg.create_pool(os.environ["DATABASE_URL"])
return _pool
@mcp.tool()
async def save_note(text: str) -> dict[str, str]:
"""Save a note to the durable notes table. Returns the row_id."""
p = await pool()
async with p.acquire() as c:
async with c.transaction():
row_id = await c.fetchval(
"INSERT INTO notes (note_text) VALUES ($1) RETURNING id::text", text)
await c.execute(
"INSERT INTO audit_log (action, payload) VALUES ($1, $2::jsonb)",
"note_saved",
json.dumps({"row_id": row_id, "len": len(text)}),
)
return {"row_id": row_id, "status": "saved"}
if __name__ == "__main__":
mcp.run(transport="stdio")
The audit payload goes through json.dumps(), not an f-string. The Quick Win values here are injection-free by accident (UUID and int), but the worked example will accept user-influenced strings; the safe pattern is the same in both places. The row_id key is load-bearing: the verification query in Step 5 joins on payload->>'row_id', so that key has to be in the payload you write.
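The difference between the two is easy to demonstrate in isolation. A minimal sketch, with a hypothetical user-influenced value standing in for the note text:

```python
import json

# Hypothetical user-influenced value, like the note text in the worked example.
note_text = 'user said: "drop everything" and\nleft'

# Safe: json.dumps escapes quotes, backslashes, and newlines for you.
safe_payload = json.dumps({"row_id": "abc-123", "len": len(note_text), "note": note_text})
assert json.loads(safe_payload)["note"] == note_text  # round-trips intact

# Unsafe: hand-assembling JSON in an f-string breaks on the embedded quotes.
unsafe_payload = f'{{"note": "{note_text}"}}'
try:
    json.loads(unsafe_payload)
except json.JSONDecodeError:
    print("hand-built payload is malformed JSON")
```

The unsafe version fails on the first embedded quote; with Postgres the malformed string would be rejected by the `::jsonb` cast, turning a user's punctuation into a failed write.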
Step 4. Build the agent. First, install the SDK, DB driver, and MCP package (skip what you already have from Course #3):
uv add openai-agents asyncpg mcp
Then in your Course #3 chat-agent/cli.py, the imports and main() wiring look like this (the rest of cli.py is unchanged from Course #3):
# cli.py — Quick Win shape
import asyncio
import os
from agents import Agent, OpenAIChatCompletionsModel, Runner, set_tracing_disabled
from agents.mcp import MCPServerStdio
from openai import AsyncOpenAI
# Only if your backend is not OpenAI (DeepSeek, OpenRouter, local proxy):
set_tracing_disabled(True)
async def main():
# The OpenAI Agents SDK accepts any OpenAI-API-compatible backend.
# Set OPENAI_BASE_URL only when pointing at a non-OpenAI host.
client = AsyncOpenAI(
api_key=os.environ["OPENAI_API_KEY"],
base_url=os.environ.get("OPENAI_BASE_URL"),
)
model = OpenAIChatCompletionsModel(
model=os.environ["MODEL_NAME"],
openai_client=client,
)
async with MCPServerStdio(
name="notes",
params={
"command": "python",
"args": ["notes_mcp.py"],
"env": {**os.environ}, # inherit PATH + DATABASE_URL into the subprocess
},
) as notes:
agent = Agent(
name="ChatAgent",
model=model,
mcp_servers=[notes],
instructions="You are a helpful assistant. Use the tools available to you.",
)
result = await Runner.run(
agent,
"Remember this: the production deploy needs a new env var before Friday.",
)
print(result.final_output)
if __name__ == "__main__":
asyncio.run(main())
Two details worth naming. env={**os.environ} is the practical fix for the most common Quick Win failure: pass only DATABASE_URL and the spawned subprocess loses PATH, so python notes_mcp.py is not even findable on some shells. Inherit the parent env and the subprocess starts cleanly. set_tracing_disabled(True) silences the SDK's trace-upload attempt against platform.openai.com when your backend is something else. Drop the call if you're using OpenAI for real.
Step 5. Run it.
export DATABASE_URL='postgresql://…/your_db?sslmode=require'
export OPENAI_API_KEY='sk-…' # OpenAI, DeepSeek, OpenRouter, whatever you have
# Pick any model your provider serves. Common choices:
# 'gpt-4o' or 'gpt-5.5' (OpenAI), 'deepseek-chat' (DeepSeek),
# 'meta-llama-3.1-…' (OpenRouter), etc.
export MODEL_NAME='gpt-4o'
# UNCOMMENT THIS UNLESS your OPENAI_API_KEY is from OpenAI itself.
# DeepSeek, OpenRouter, and other OpenAI-API-compatible providers need
# the base URL set, otherwise the SDK will hit api.openai.com with your
# non-OpenAI key and you'll see a 401.
# export OPENAI_BASE_URL='https://api.deepseek.com/v1'
# Smoke-test the key before the full run. This should print the start of a
# JSON model list; a 401 here means the key or OPENAI_BASE_URL is wrong, and
# you learn that in one second instead of four files deep.
curl -sS "${OPENAI_BASE_URL:-https://api.openai.com/v1}/models" \
-H "Authorization: Bearer $OPENAI_API_KEY" | head -c 200; echo
uv run python cli.py
The MCP tool's docstring is the only routing signal: the model matches "Remember this: …" to save_note, calls it, and the server writes both rows in one transaction. One MCP boundary, one trigger word, one transaction (the figure below traces the full path).

The MCP tool's docstring is doing all the routing work. One transaction writes both rows. One JOIN verifies the commit. The .claude/skills/log-a-note/ folder sits on disk waiting; it joins the picture in Part 4 when the same agent runs inside Claude Code.
The verification, in one SQL query:
SELECT n.id, n.note_text, n.created_at, al.action, al.payload
FROM notes n
JOIN audit_log al ON al.payload->>'row_id' = n.id::text
ORDER BY n.created_at DESC LIMIT 1;
If you see your note text in one row and "action": "note_saved" with a matching row_id in the joined audit row, the architecture is working. A system of record that holds the truth. An MCP boundary the agent crossed to write it. An audit trail you can replay. (The Skill folder is on disk waiting; it joins the picture in Part 4 when the same agent runs from inside Claude Code.)
This is one Skill, one tool, two tables, and ~50 lines of new code. The Part 4 worked example scales the same pattern to three Skills, ten tables, three MCP tools, and an embedding pipeline, but the shape is identical. If this quick win worked, the rest of the course is explaining why each piece is shaped the way it is and what changes when you scale it up.
If something didn't work, scan the "When something feels wrong" table near the end of the document; it maps each common failure to the concept that explains it. Then come back here.
Part 1: Skills, capability as portable folders
The previous course's agent had its capabilities baked into Python. Every tool was a @function_tool-decorated function in tools.py; every behavior was an instruction string in agents.py. That worked for one demo. It does not work for a workforce. Skills extract capability out of code and into folders the agent discovers, loads, and executes on demand: version-controlled, sharable across agents, addable without redeploying.
Concept 1: What an Agent Skill is
An Agent Skill is a folder containing a SKILL.md file plus optional scripts, references, and assets. The folder is the skill; the SKILL.md is its entry point. The format was originally released by Anthropic, is now an open standard published at agentskills.io with a growing ecosystem of clients (Claude Code, OpenCode, Codex, Cursor, Gemini CLI, Junie, and more).
The minimum viable skill:
hello-skill/
└── SKILL.md
With contents:
---
name: hello-skill
description: A skill that responds with a friendly greeting tailored to the user's name and the time of day. Use when the user asks to be greeted, says hello, or starts a conversation.
---
# Hello skill
When the user greets you or asks to be greeted:
1. Determine the current local time of day.
2. Compose a friendly, time-appropriate greeting that uses the user's name if available.
3. Keep the greeting under 25 words.
Examples:
- Morning + name known: "Good morning, Sara — hope your week is starting well."
- Evening + no name: "Good evening — happy to be of help."
That's a complete, valid skill. The agent will see it at startup (the name and description), load the body when a task matches the description, and act on the instructions. No code. No deployment. No SDK call. A folder with a markdown file.
This is the load-bearing observation. Skills are not Python objects you import. They are not API endpoints you call. They are not framework primitives you instantiate. They are files on disk the agent discovers and executes. That property (the file-on-disk property) is what makes them portable across tools, shareable across teams, and version-controllable like any other text artifact.
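A sketch of what "discover" means mechanically. This assumes the simple `key: value` frontmatter of the hello-skill example; a real client would use a YAML parser and its own search paths:

```python
from pathlib import Path

def read_skill_metadata(skill_dir: str) -> dict[str, str]:
    """Read only the YAML frontmatter of a SKILL.md; the body stays on disk."""
    text = Path(skill_dir, "SKILL.md").read_text(encoding="utf-8")
    # Split on the two `---` markers: [before, frontmatter, body].
    _, frontmatter, _body = text.split("---", 2)  # _body is never loaded into context
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta
```

Calling `read_skill_metadata(".claude/skills/hello-skill")` on the folder above would return just `name` and `description`, which is all the model sees at startup.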
PRIMM, Predict. Given the hello-skill folder above, what does the agent load into context at startup, before any user message arrives? Three options: (a) the entire SKILL.md file, including the instructions; (b) just the frontmatter name and description, nothing else; (c) nothing, since skills load only when invoked. Confidence 1–5.
The answer is (b). At startup, the agent reads only the metadata for every skill in its skill library. The full body (the instructions, the examples, the references) loads on demand. This is progressive disclosure, and it's the next concept.
Try with AI
I want to write my first Agent Skill. Give me three skill ideas
suitable for a customer-support agent. For each, write the frontmatter
(name + description) only. The description must be specific enough
that the agent knows EXACTLY when to invoke it — vague descriptions
like "Helps with tickets" don't count. After I review your three,
ask me which I'd refine first.
Concept 2: Progressive disclosure, the three-stage skill loading model
A chat agent with one skill has trivial context cost. A Worker with fifty skills doesn't, unless the loading model is clever. Progressive disclosure is the clever part. Skills load in three stages, each more expensive than the last, and the agent loads a stage only when the previous one indicates the skill is relevant.
Stage 1, Discovery. At startup, the agent loads the name and description of every available skill. The spec recommends keeping this footprint tiny: roughly 100 tokens per skill. A Worker with fifty skills pays roughly 5,000 tokens for discovery, every turn, every conversation. That's the cost of knowing what's in the library.
Stage 2, Activation. When the model decides a task matches a skill's description, it loads the full SKILL.md body into context. The spec recommends keeping this under 5,000 tokens per skill. Most well-written skills sit at 500–2,000 tokens. The activation cost is paid only on turns that actually use the skill.
Stage 3, Execution. If the skill's instructions reference other files (a script under scripts/, a reference under references/, a template under assets/), those load only when the agent explicitly reaches for them. Some skills never load any references; deep technical skills might load three or four.

A short walkthrough of what this looks like in practice. Imagine your customer-support Worker has 30 Skills in .claude/skills/, each with a SKILL.md. The agent starts; the runtime reads each SKILL.md and extracts the YAML frontmatter only (name and description) for all 30 files. Roughly 3,000 tokens land in the model's context as a table of "available skills." The body of each SKILL.md is not loaded; the references under references/ are not loaded; the scripts under scripts/ are not even read. The model now knows what tools it has, but not how to use any of them.
The user sends a message: "Bring me up to speed on ticket TKT-1042." The model scans its 30-description list, matches the summarize-ticket description (which mentions "ticket ID," "bring me up to speed"), and decides this skill applies. Now the runtime loads the full summarize-ticket/SKILL.md body, about 1,500 tokens, into context. The model reads the operational instructions: "extract the ticket ID, call lookup_customer, call find_similar_resolved_tickets, compose a five-section summary." It follows the instructions.
Step 3 of those instructions references references/style.md, the formatting guide for summaries. The model reaches for that file; the runtime loads it on demand. The other 29 skills' bodies, the other reference files, the scripts under scripts/: none of it ever entered the model's context. Only the metadata for all 30, plus the body and one reference for the one skill that fired, was paid for. The Predict below puts numbers on that gap.
PRIMM, Predict. A Worker has 30 skills in its library. The average SKILL.md frontmatter is 80 tokens; the average body is 1,500 tokens; each skill has on average two reference files totalling 4,000 tokens of references. On a typical turn that activates one skill and reads one reference, what is the rough context cost of the skill system? Three options: (a) ~2,400 tokens (startup discovery metadata only); (b) ~5,900 tokens; (c) ~165,000 tokens (everything loaded all the time). Confidence 1–5.
The answer is (b), roughly 5,900 tokens: 30 × 80 tokens for discovery (2,400), plus 1,500 tokens for the one activated skill, plus ~2,000 tokens for one referenced file. The cost scales linearly with library size for discovery, and is constant per turn for activation and execution. This is what makes a skill library affordable. Without progressive disclosure, 30 skills × (1,500 tokens of body + 4,000 tokens of references) = 165,000 tokens per turn just to know what the agent can do. Nobody runs that.
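The same arithmetic as a sketch, with the three stages as separate terms:

```python
def skill_context_cost(n_skills: int, meta_tokens: int, body_tokens: int,
                       refs_loaded_tokens: int, activated: int = 1) -> int:
    """Per-turn context cost of a skill library under progressive disclosure."""
    discovery = n_skills * meta_tokens          # Stage 1: every skill, every turn
    activation = activated * body_tokens        # Stage 2: only the matched skills
    return discovery + activation + refs_loaded_tokens  # Stage 3: only files reached for

# The Predict's numbers: 30 skills, 80-token frontmatter, 1,500-token body,
# one ~2,000-token reference actually read this turn.
print(skill_context_cost(30, 80, 1500, 2000))   # 5900

# Without progressive disclosure: every body and every reference, every turn.
print(30 * (1500 + 4000))                        # 165000
```

The first term is the only one that grows with library size, which is why the spec's per-skill metadata budget matters more than the per-skill body budget.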
Two consequences for skill design:
- Descriptions are the load-bearing field. They're what fires in Stage 1 and determines whether Stage 2 happens at all. A vague description ("helps with PDFs") fires too often and burns activation tokens for nothing; a sharp description ("extracts text and tables from PDFs, fills PDF forms, merges PDFs; use when working with PDF files") fires when it should and stays quiet when it shouldn't. Most skill failures are description failures.
- Long SKILL.md bodies are expensive. Keep SKILL.md under ~500 lines, ideally 100–300. Move depth into reference files. The 500-line limit isn't arbitrary; it's the threshold where activation cost starts crowding out actual reasoning context.
Try with AI
I have a SKILL.md that's grown to 1,200 lines because I kept adding
edge cases. Walk me through how to split it. The skill is a
"customer-refund-issuance" skill: it has the core refund process,
five edge cases (international, partial, post-30-days, with-restocking-fee,
fraud-flagged), three reference policies (US, EU, UK), and a Python script
that calls the payment gateway. Help me decide what stays in SKILL.md,
what moves to references/, what moves to assets/, and what stays as scripts/.
Concept 3: Writing a SKILL.md, the contract a skill makes with the model
The SKILL.md file has two parts: YAML frontmatter (the contract) and the Markdown body (the operational instructions). The frontmatter is the agent-facing API; the body is the doing.
The frontmatter, by field.
| Field | Required | Constraint | Purpose |
|---|---|---|---|
| name | Yes | 1–64 chars, lowercase alphanumeric + hyphens, no leading/trailing/consecutive hyphens, must match folder name | The skill's identifier. |
| description | Yes | 1–1024 chars, non-empty | The trigger surface. What the agent reads at discovery to decide whether to invoke this skill. |
| license | No | License name or path to bundled license file | What terms the skill ships under. |
| compatibility | No | ≤500 chars | Environment requirements (intended product, system packages, network access). Most skills don't need this. |
| metadata | No | Arbitrary key-value mapping | Client-specific extensions (author, version, etc.). |
| allowed-tools | No | Space-separated tool list | Pre-approved tools the skill may use. Experimental; support varies. |
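The `name` constraints in the table are mechanical enough to check before a skill ever ships. A minimal sketch of such a validator (the function name is ours, not part of any spec tooling):

```python
import re

# Lowercase alphanumeric segments joined by single hyphens:
# no leading, trailing, or consecutive hyphens.
NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def validate_skill_name(name: str, folder_name: str) -> list[str]:
    """Return a list of constraint violations for a skill's `name` field."""
    errors = []
    if not 1 <= len(name) <= 64:
        errors.append("name must be 1-64 characters")
    if not NAME_RE.fullmatch(name):
        errors.append("lowercase alphanumeric + hyphens; "
                      "no leading/trailing/consecutive hyphens")
    if name != folder_name:
        errors.append("name must match the skill folder name")
    return errors

print(validate_skill_name("log-a-note", "log-a-note"))   # []
print(validate_skill_name("Log--Note", "log-a-note"))    # two violations
```

Running a check like this in CI for your skills folder catches the silent failure mode where a skill never fires because its name doesn't match its folder.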
The minimum viable frontmatter is two fields:
---
name: my-skill
description: One sentence explaining what this skill does and when to use it.
---
A more production-shaped frontmatter:
---
name: extract-meeting-notes
description: Extracts structured action items, decisions, and follow-ups from raw meeting transcripts. Use when the user provides a transcript, meeting recording text, or asks to summarize a meeting. Produces a markdown summary with explicit sections for Decisions, Action Items (with owner and deadline), and Follow-ups.
license: MIT
compatibility: Requires Python 3.12+ and access to the OpenAI API.
metadata:
author: panaversity
version: "1.2"
---
The description is the load-bearing field. Reread the spec's good vs. poor examples and internalize the difference:
Good: "Extracts text and tables from PDF files, fills PDF forms, and merges multiple PDFs. Use when working with PDF documents or when the user mentions PDFs, forms, or document extraction."
Poor: "Helps with PDFs."
The good description says what ("extracts text and tables, fills forms, merges"), when ("when working with PDF documents or when the user mentions PDFs, forms, or document extraction"), and surfaces specific keywords the model can match against ("PDFs," "forms," "extraction"). The poor one says none of those. Description quality is the single biggest determinant of whether your skill fires when it should and stays quiet when it shouldn't.
The body, by convention. No format requirements; the spec says "Write whatever helps agents perform the task effectively." But three sections show up in nearly every well-written skill:
- Step-by-step instructions. What the agent does, in numbered steps, in operational language. Not narrative ("This skill is designed to help with refunds") but imperative ("Look up the order. Check the policy. Issue the refund."). Imperative skills run better than narrative ones.
- Examples of inputs and outputs. One or two short examples that show the expected shape of input and the expected shape of output. Examples are worth roughly five times more than descriptions for steering model behavior. Don't skip them.
- Common edge cases. Two or three cases that aren't obvious from the happy path, typically the cases that have actually broken in production. Edge cases earn their place by being from real failures, not imagined ones.
PRIMM, Predict. Two skills have the same name field (summarize-document) but live in different folders: one in ~/.claude/skills/ (user-level) and one in .claude/skills/ (project-level). The current task matches both descriptions. What happens? Three options: (a) the agent picks one at random; (b) the project-level skill wins because of folder precedence; (c) the agent loads both and lets the model choose. Confidence 1–5.
The answer depends on the client, but the prevailing convention across Claude Code and OpenCode is (b), project-level wins over user-level. Same precedence pattern as CLAUDE.md resolution, the agentic-coding crash course's rules-file behavior, and most tool configuration. The principle: more specific context overrides more general context. Two skills with the same name from the same folder is a different problem; that one is usually an error, since folder names must match name values.
Try with AI
You have two skills in the same project:
- name: lookup-customer
description: Looks up a customer record.
- name: get-customer-profile
description: Retrieves customer profile information.
A user asks: "What's John Smith's email?"
Walk through three steps with the AI:
1. For each skill, predict whether it WILL fire on this prompt.
Confidence 1-5.
2. If both fire, the model has to pick one — predict which it picks
and why.
3. Rewrite BOTH descriptions so exactly one fires on this prompt,
and the other fires only on a clearly different prompt. State the
prompt that SHOULD fire the other skill.
Concept 4: Skill packaging, where skills live and how they travel
Skills are filesystem artifacts. That's what makes them portable; it's also what makes their packaging conventions matter. Get the folder structure right and your skill works in every compliant client; get it wrong and it works in none.
Where each tool looks for skills.
| Tool | Project-level | User-level (global) |
|---|---|---|
| Claude Code | .claude/skills/<skill-name>/SKILL.md | ~/.claude/skills/<skill-name>/SKILL.md |
| OpenCode | .opencode/skills/<skill-name>/SKILL.md | ~/.config/opencode/skills/<skill-name>/SKILL.md |
| OpenCode (fallback) | .claude/skills/<skill-name>/SKILL.md | ~/.claude/skills/<skill-name>/SKILL.md |
The OpenCode fallback is the load-bearing piece for portability. OpenCode reads from its own .opencode/skills/ first, but falls back to .claude/skills/ if a skill isn't found there. The practical consequence: write your skill once in .claude/skills/, and it works in both tools without modification. This is the most concrete payoff of the open Agent Skills format. The same SKILL.md you ship to a Claude Code user works for an OpenCode user, byte-for-byte.
The full folder structure, by directory.
my-skill/
├── SKILL.md # Required: frontmatter + body. The entry point.
├── scripts/ # Optional: executable code the skill can run.
│ ├── extract.py
│ └── normalize.sh
├── references/ # Optional: deep documentation, loaded on demand.
│ ├── REFERENCE.md
│ └── policies/
│ └── us-refund-policy.md
└── assets/ # Optional: templates, schemas, lookup tables.
├── report-template.md
└── status-codes.json
Each directory has a specific job:
- `scripts/` holds executable code the agent can run. Python, Bash, JavaScript: language support depends on the client and the sandbox. The agent invokes scripts by relative path (e.g., `scripts/extract.py`). Scripts should be self-contained, document their dependencies in a comment header, and handle edge cases gracefully, because the agent will not.
- `references/` holds deep documentation the agent loads on demand. The convention: keep each reference file focused on one topic (`finance.md`, `legal.md`, `error-codes.md`) so the agent loads only what the current task needs. A 5,000-token reference file the agent reaches once per session is cheap; a 50,000-token reference file is not. Keep references one level deep from SKILL.md. Avoid `references/category/subcategory/file.md`-style nesting; the spec is explicit on this.
- `assets/` holds static resources: document templates, configuration templates, lookup tables, schemas, images. These are consumed by the agent (or by scripts the agent runs) but not read as instructions.
The file reference syntax inside SKILL.md. Use relative paths from the skill root:
For policy details, see [the US refund policy](references/policies/us-refund-policy.md).
To extract the structured data, run `scripts/extract.py`.
The output should follow the template in `assets/report-template.md`.
The agent reads the SKILL.md body, sees the path, and loads the referenced file on demand. No special syntax, no link resolution gymnastics, just relative paths.
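The resolution rule ("relative to the skill root, never the working directory") fits in a few lines. `resolve_reference` and the paths below are illustrative, not a client API:

```python
from pathlib import Path

def resolve_reference(skill_root: Path, relative_ref: str) -> Path:
    """Resolve a SKILL.md file reference against the skill's own
    directory; the agent's working directory never enters into it."""
    return skill_root / relative_ref

# hypothetical skill installed in a project
skill_root = Path("/home/user/projects/app/.claude/skills/refund-issuance")
policy = resolve_reference(skill_root, "references/policies/us.md")
# policy points inside the skill folder regardless of os.getcwd()
```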
Quick check. You have a skill at `.claude/skills/refund-issuance/SKILL.md` that references `references/policies/us.md`. The skill is being invoked while the agent's working directory is `/home/user/projects/customer-support`. Where does the agent look for the policy file? Answer: `/home/user/projects/customer-support/.claude/skills/refund-issuance/references/policies/us.md`; paths are relative to the skill's directory, not the agent's working directory. Get this wrong and your skill breaks in subtle ways under deployment.
Try with AI
I'm porting a skill from Claude Code to OpenCode and want to verify
my folder layout is correct. The skill is at .claude/skills/extract-meeting-notes/
and contains SKILL.md, scripts/parse.py, and references/glossary.md.
For each of these claims, tell me TRUE or FALSE and explain why:
1. Without any changes, this skill will be discovered by OpenCode.
2. If I move the folder to .opencode/skills/extract-meeting-notes/,
Claude Code will stop discovering it.
3. The script reference inside SKILL.md should use the absolute path
~/.claude/skills/extract-meeting-notes/scripts/parse.py.
4. If two skills have the same name and one is in .claude/skills/
and the other in ~/.claude/skills/, the user-level one wins.
Concept 5: Composing skills, when to chain small skills vs. write one big one
Skills compose. A "weekly customer-health report" capability could be one giant skill that does research, drafting, formatting, and review in one shot. Or it could be four skills (research-customer-health, draft-customer-health-report, format-customer-health-report, review-customer-health-report) that hand off to each other through the filesystem.
Both work. They have very different properties.
One big skill. Easier to discover (one description, one match). Lower discovery cost (one entry in the discovery stage). Tighter coupling: when the activation triggers, all the steps run, in order, in one context. Hard to test in isolation. Hard to reuse a piece elsewhere. When something fails halfway through, the model has to recover with all of the now-irrelevant earlier work still in context.
Many small skills. Higher discovery cost (four entries instead of one). Higher activation orchestration cost (something has to chain them). Looser coupling: each step can be tested, replaced, or reused independently. When one fails, the failure is localized; the prior steps' artifacts are already on disk. Each step gets a fresh activation, meaning no accumulated context pollution.

The decision rule that works in practice:
Write one skill if the steps are tightly coupled and rarely reused in isolation. Write many skills if any step might be invoked on its own from a different workflow, or if context-isolation between steps is more valuable than orchestration simplicity.
For a Worker that does customer support, summarize-ticket is probably its own skill (used by humans, used by the escalation flow, used by the weekly digest, used by the audit replay). escalate-to-tier-2 is probably its own skill (the escalation criteria and tone requirements are reused). The decomposition is value-of-isolation versus cost-of-orchestration; isolation usually wins past two or three composed steps.
The filesystem handoff pattern. When skills compose, the cleanest handoff is the filesystem, not the conversation. Skill A writes its output to tmp/research-customer-{id}.md. Skill B is invoked, reads from tmp/research-customer-{id}.md, writes to tmp/draft-customer-{id}.md. Skill C reads draft, writes report. The conversation only sees the final report; intermediate artifacts live on disk where the agent can re-read them, the human can inspect them, and the audit trail can find them later. This is the same pattern Course #3 used for subagent output isolation: same insight, applied at skill granularity.
customer-health-pipeline/
├── tmp/ # ephemeral handoff dir
│ ├── research-customer-{id}.md # skill A output
│ ├── draft-customer-{id}.md # skill B output
│ └── reviewed-draft-customer-{id}.md # skill C output
└── output/
└── customer-health-{id}.pdf # skill D final
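A runnable sketch of the handoff, with the skill steps collapsed into plain functions inside one process. In a real pipeline each step would be its own skill activation; the filenames follow the tree above, and the report content is invented:

```python
from pathlib import Path

def run_pipeline(workdir: Path, customer_id: str) -> Path:
    """Chain three hypothetical skill steps through files in tmp/."""
    tmp = workdir / "tmp"
    tmp.mkdir(parents=True, exist_ok=True)

    # Skill A: research writes its findings to disk, not the conversation
    research = tmp / f"research-customer-{customer_id}.md"
    research.write_text("# Findings\nUsage down 40% month-over-month.\n")

    # Skill B: draft reads A's artifact and writes its own
    draft = tmp / f"draft-customer-{customer_id}.md"
    draft.write_text("# Draft report\n" + research.read_text())

    # Skill C: review produces the final artifact; intermediates stay
    # on disk for human inspection and the audit trail
    out = workdir / "output"
    out.mkdir(exist_ok=True)
    final = out / f"customer-health-{customer_id}.md"
    final.write_text(draft.read_text() + "\nReviewed: OK\n")
    return final
```

The property worth noticing: after the run, `tmp/` still contains every intermediate, so any step can be re-run or inspected without replaying the whole conversation.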
This is also where the next two parts of the course start to matter. Some skill handoffs don't belong on the filesystem: they belong in the system of record. The "customer health" snapshot from Skill A doesn't just feed Skill B; it's also a record the company needs to keep, to query later, to find similar cases against, to audit if the customer complains six weeks later. A skill that writes to a temp file is a draft. A skill that writes to the system of record is an action. That distinction is what Part 2 builds.
Try with AI
Compare two design approaches for a customer-refund-issuance workflow:
Design A: One big skill called "issue-refund" that handles eligibility check,
policy lookup, amount calculation, gateway call, ticket update, and customer
notification.
Design B: Five small skills (check-refund-eligibility, lookup-refund-policy,
calculate-refund-amount, call-payment-gateway, update-support-ticket,
notify-customer) chained via filesystem handoffs in tmp/.
For each design, name (1) one situation where it's the right choice and
(2) one specific failure mode it's vulnerable to. Then tell me which
design you'd ship and why.
Concepts 1-5 describe how skills should work when the model is a strong instruction-follower (Claude Sonnet/Opus, GPT-5-class, Gemini 2.5 Pro). With a smaller or cheaper model (deepseek-chat, Haiku-class, Llama-70B, Mistral, most local models), three things drift:
- Multi-skill sequencing. SKILL.md imperatives like "ALWAYS run this BEFORE drafting" or "call X, then Y, then Z" land reliably on the strong models and unreliably on the weaker ones. The fix is to add a short GENERAL-FLOW preamble in the agent's system prompt spelling out the order. SKILL bodies stay declarative; the system prompt provides the orchestration scaffold.
- Format drift. A weaker model will silently add emojis, markdown tables, "Action Taken" headers, or paraphrase your inputs even when the SKILL body says "output as five paragraphs, no tables, preserve the user's text verbatim." Be more explicit on a weaker model: list what NOT to do, not just what to do.
- Trigger blindness. Descriptions that fire on "summarize ticket TKT-1042" may miss "what's the story on #1042" on a smaller model. Write descriptions with more concrete trigger phrasings (Concept 3's discipline matters more, not less, when the model is weaker).
Rule of thumb: budget the strong model's effort into the SKILL.md, budget the weak model's effort into the system prompt. The fact that the architecture survives this drift with extra scaffolding is itself a vote for the architecture: the SKILL/MCP/SoR boundaries are stable; only the prompting around them changes.
Part 2: Neon Postgres + pgvector as system of record
Part 1 gave the agent capabilities it can load on demand. Those capabilities are useless without state to work against: the customer history, the prior conversations, the policy library, the audit log. The previous course's agent kept state in a local SQLite file (the session DB) and stub data in Python lists. That works for a demo. It doesn't work for a Worker that has to answer "what did we tell this customer six weeks ago?" or "have we seen this question before, and what did the resolution look like?"
This Part replaces both with a system of record: the architectural term from the thesis for "the authoritative store of state the workforce reads from and writes to." We use Neon Postgres with the pgvector extension. The architecture is the invariant; Neon is the product. Any durable, addressable, governed Postgres satisfies the requirement.
First-pass compression note. Part 2 is the densest stretch of this course: six tables, SQL, embedding pipelines, vector index math. If your Postgres is rusty, read Concepts 6 and 7 for the shape, then skim 8–10 and come back when you build. The minimum-viable Worker doesn't actually need all six tables on day one: `messages` + `embeddings` is enough to feel the architecture work. Add `documents` when you have a real library to embed. Add `audit_log` and `capability_invocations` the first time you have to answer "what did the agent do at 3am last Tuesday?" Add `conversations` when one user starts having more than one of them. The six-table schema below is where you end up, not where you start.
Concept 6: Why managed Postgres, and why Neon specifically
The thesis is product-agnostic about systems of record: "the AI-Native Company's existing databases, workflows, and operational platforms (CRMs, ERPs, ticketing systems, data warehouses, ledgers) serve as the system of record." For an agent you're building from scratch, though, you need to pick something. The question is not "Postgres vs. MongoDB vs. a vector DB"; it's "which Postgres."
Why Postgres, not a dedicated vector database. Three reasons that hold even in 2026.
-
One database, one transaction, one auth boundary. A vector DB plus a relational DB means two stores to keep consistent, two auth systems to scope, two backup pipelines to maintain. The pgvector extension puts vectors next to the user records, ticket records, and audit logs they're related to: a JOIN is a JOIN, not a network hop between two eventually-consistent services. Every major managed Postgres provider (AWS RDS, Google Cloud SQL, Azure, Supabase, Neon) ships pgvector by default; Neon alone reports 30,000+ databases running it. For most workloads, the empirical answer to "is pgvector enough?" is yes.
-
Postgres already does the hard parts. Transactions, indexes, foreign keys, row-level security, point-in-time recovery, query planning. A dedicated vector DB has to invent these from scratch and usually does some of them worse. The default boring choice has compounding advantages.
-
MCP servers exist for Postgres at every layer. Neon ships one (for management). General Postgres MCP servers exist (for SQL execution). You can write your own (for scoped runtime access). The MCP ecosystem around Postgres is the most mature.
Why Neon specifically: three differentiators.
-
Serverless with scale-to-zero. Free tier costs nothing when idle. A Worker that handles 50 conversations per day spends most of its time costing $0, not $50/month for a provisioned instance. Critical for the economics of a multi-Worker AI-Native Company where many Workers are bursty.
-
Branching. A Neon database branches in seconds: a full copy-on-write clone of your production data, ready to query. The intended use is dev/test environments and migration safety; the agent-relevant use is letting the agent experiment on a branch without touching production. A migration that goes wrong on a branch is reverted by deleting the branch. The same operation on a non-branchable database is a restore from backup.
-
First-class MCP integration. Neon ships an official MCP server (remote at
https://mcp.neon.tech/mcp, OAuth-authed) that exposes project management, branch lifecycle, SQL execution, and migration tooling to any MCP client. We'll use this for development. Critically (and we'll return to this), Neon explicitly does not recommend the Neon MCP server for production runtime use. It's a powerful natural-language interface; production agents talk to Postgres through narrower, scoped MCP servers you write yourself, or through direct connections. The distinction matters; Part 3 makes it operational.
Quick check. Three claims, mark each True or False before reading the next paragraph: (a) The Neon MCP server is intended to be the runtime database connection for production AI Workers. (b) A Neon database branched from production starts with all the production data already in it. (c) A Neon database that hasn't received a query in an hour is still costing you money under the free tier.
Answers: (a) False: Neon's own documentation says the MCP server is for development/testing only. (b) True: branches are copy-on-write, so the branch starts as a logical clone of production with no data movement. (c) False: scale-to-zero means idle databases cost nothing on the free tier. These three answers are the operational shape of why Neon fits as the SoR.
Try with AI
Your customer-support Worker handles EU customers. Your legal team
says: "All customer data must stay in Frankfurt." Neon's free tier
offers US-East regions only; their paid tier offers EU regions.
For each of these three approaches, decide GO or NO-GO and give ONE
sentence of justification:
A) Use Neon's paid tier in an EU region (pay-as-you-go, typically $25–75/month for a single Worker per [Neon's pricing](https://neon.com/pricing)) and ship.
B) Use Postgres on a Frankfurt VPS you manage yourself (no pgvector
MCP server, no branching, no scale-to-zero) and ship.
C) Use Neon's free US tier; embed everything client-side before
sending; tell legal it's "encrypted at rest."
Then say which one you'd actually pick and what you'd ask legal
before committing.
Concept 7: The Worker's schema, what tables an agent actually needs
A Worker's system of record is not just "store conversations somewhere." It's a structured schema that supports the four things the thesis says every Worker does: read truth, write outcomes, leave traces, find similar prior work. Six tables cover the 90% case. You'll add more for domain specifics; these are the load-bearing six.
-- 1. CONVERSATIONS: the top-level unit of work
CREATE TABLE conversations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id TEXT NOT NULL,
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
ended_at TIMESTAMPTZ,
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
-- searchable summary; populated by the agent at conversation end
summary TEXT
);
CREATE INDEX idx_conversations_user ON conversations(user_id, started_at DESC);
-- 2. MESSAGES: the turns of a conversation
CREATE TABLE messages (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
conversation_id UUID NOT NULL REFERENCES conversations(id) ON DELETE CASCADE,
role TEXT NOT NULL CHECK (role IN ('user', 'assistant', 'tool', 'system')),
content TEXT NOT NULL,
-- token counts let you reason about cost without re-counting
input_tokens INT,
output_tokens INT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_messages_conv ON messages(conversation_id, created_at);
-- 3. DOCUMENTS: the agent's reference library
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
source TEXT NOT NULL, -- 'policy_library', 'kb_article', 'past_case', etc.
title TEXT NOT NULL,
body TEXT NOT NULL,
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_documents_source ON documents(source);
-- 4. EMBEDDINGS: vector representations of documents AND past conversations
CREATE TABLE embeddings (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- one of these is populated; the other is NULL
document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
conversation_id UUID REFERENCES conversations(id) ON DELETE CASCADE,
chunk_text TEXT NOT NULL,
chunk_index INT NOT NULL,
embedding VECTOR(1536) NOT NULL,
model TEXT NOT NULL, -- 'text-embedding-3-small', etc.
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CHECK (
(document_id IS NOT NULL)::int + (conversation_id IS NOT NULL)::int = 1
)
);
-- the load-bearing index for semantic search; see Concept 8
CREATE INDEX idx_embeddings_hnsw
ON embeddings USING hnsw (embedding vector_cosine_ops);
-- 5. AUDIT_LOG: every action the Worker takes, replayable forever
CREATE TABLE audit_log (
id BIGSERIAL PRIMARY KEY,
conversation_id UUID REFERENCES conversations(id) ON DELETE SET NULL,
actor TEXT NOT NULL, -- 'worker:customer-support', 'system', etc.
action TEXT NOT NULL, -- see "canonical action vocabulary" below
target TEXT, -- table name, skill name, etc.
payload JSONB NOT NULL, -- the data of the action
result JSONB, -- what happened
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_audit_conv ON audit_log(conversation_id, created_at);
CREATE INDEX idx_audit_action ON audit_log(action, created_at);
-- 6. CAPABILITY_INVOCATIONS: every skill or tool call, for replay and metrics
CREATE TABLE capability_invocations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
conversation_id UUID NOT NULL REFERENCES conversations(id) ON DELETE CASCADE,
capability TEXT NOT NULL, -- 'skill:summarize-ticket', 'tool:search_docs', etc.
arguments JSONB NOT NULL,
result JSONB,
status TEXT NOT NULL CHECK (status IN ('ok', 'error', 'timeout')),
latency_ms INT,
cost_cents INT, -- approximate cost in cents
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_cap_conv ON capability_invocations(conversation_id, created_at);

A few notes on the design choices, since each is load-bearing:
- One embeddings table for both documents and conversations. A `CHECK` constraint ensures exactly one of `document_id` or `conversation_id` is set. This lets you do semantic search across "policies AND past conversations" in one query; the question "have we answered this before?" gets one index, not two.
- `audit_log` uses `BIGSERIAL`, not `UUID`. Audit rows are written constantly; the simpler integer key keeps inserts fast and ordering trivial. The other tables use `UUID` because rows leave the database (in API responses, in URLs) and UUIDs avoid leaking row counts.
- `capability_invocations` separates skills from tools. A skill invocation and a `@function_tool` invocation are conceptually similar but operationally different (different code paths, different cost profiles, different failure modes). Storing them in one table with a `capability` discriminator lets you query both together when asking "what did the agent do?" while still splitting them when asking "what did skills cost vs. tools?"
- `metadata JSONB` columns are escape hatches. The schema can't predict every domain-specific field you'll need; JSONB lets you add fields without migrations. Use sparingly: fields that appear in many queries should be promoted to columns.
This is the minimum operational schema for a Worker: three tables for the work (conversations, messages, documents), one for the embeddings, two for the trace (audit_log, capability_invocations). You'll add more for domain specifics: a customers table, a tickets table, a policies table with versioning. Those are the same shape: relational data the agent reads and writes via MCP.
PRIMM, Predict. A Worker handles 200 conversations/day, each averaging 10 messages, with 30% triggering one skill invocation and 50% writing two audit rows beyond the skill row. After one month (30 days), roughly how many rows are in each of the six tables? Three options: (a) similar volumes across all six (~6,000–60,000 each); (b) audit_log dwarfs the others by 10×–50×; (c) embeddings dwarfs everything because each message gets embedded. Confidence 1–5.
The answer is (b): audit_log will be the largest by a wide margin because every interaction produces multiple audit rows: message sent, skill invoked, db write, db write again, conversation closed, etc. A rough estimate: conversations ~6,000; messages ~60,000; capability_invocations ~1,800; audit_log probably 150,000–300,000. Plan your retention and indexing accordingly; audit_log is the table you'll partition first as you grow.
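The estimate is just arithmetic on the stated rates. The one real assumption is the audit multiplier: here, 3 to 5 audit rows per message, covering message logging, db writes, skill events, and conversation lifecycle rows:

```python
DAYS = 30
CONVS_PER_DAY = 200
MSGS_PER_CONV = 10
SKILL_RATE = 0.30          # share of conversations invoking one skill

conversations = CONVS_PER_DAY * DAYS
messages = conversations * MSGS_PER_CONV
capability_invocations = int(conversations * SKILL_RATE)

# ASSUMPTION: roughly 3-5 audit rows per message once skill rows,
# db writes, and lifecycle events are all counted
audit_low = messages * 3
audit_high = messages * 5
```

Rerun it with your own Worker's traffic before picking a partitioning strategy; the multiplier is the number worth measuring, not guessing.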
Try with AI
I want to extend this schema for a customer-support Worker that
handles software bug reports specifically. What three additional
tables would you add, and what columns would they have? For each,
say what the agent will use it for (read access? write access?
both?) and what foreign keys connect it to the six core tables.
Concept 8: pgvector basics, types, distance operators, indexes
The embeddings table above is what makes the agent's memory semantically searchable. The technology that powers it is pgvector: a Postgres extension that adds vector data types and similarity-search operators. It ships by default on every major managed Postgres provider, and for most workloads it makes the answer to "should I use a separate vector DB?" a simple no.
The vector type. VECTOR(n) is a fixed-length floating-point column. n is the dimensionality of your embeddings: 1536 for OpenAI's text-embedding-3-small, 3072 for text-embedding-3-large, varies for other models. The dimension must match the model that produced the embedding, and every vector in a column must come from the same model. Mixing models is the most common pgvector bug: insert 1536-dim vectors from model A, query with a 1536-dim vector from model B, and the results are nonsense even though every query "works." A dimension mismatch at least fails loudly; a model mismatch doesn't.
For dimensions above 2000, the HALFVEC type uses half-precision floats and roughly halves the storage at minor recall cost. For our 1536-dim case, plain VECTOR(1536) is fine.
The three distance operators. pgvector exposes three ways to measure how similar two vectors are. Pick one for your use case and stick with it; mixing is a recipe for confusion.
| Operator | Name | What it measures | When to use |
|---|---|---|---|
| `<->` | L2 distance (Euclidean) | Straight-line distance in n-dimensional space | Image embeddings, geometric similarity |
| `<#>` | Negative inner product | Dot product (negated) | When your embedding model produces un-normalized vectors and you care about magnitude |
| `<=>` | Cosine distance | Angle between vectors, regardless of magnitude | Text embeddings: the default for our case |
OpenAI's text-embedding-3-small and text-embedding-3-large produce normalized vectors, which means cosine distance and Euclidean distance give equivalent rankings. Use cosine distance (<=>) by convention for text embeddings; it's the most common, the most documented, and the index ops are named vector_cosine_ops for it.
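The equivalence is worth seeing concretely: for unit vectors u and v, ||u − v||² = 2 − 2(u·v) while cosine distance is 1 − (u·v), so both order results by the same dot product. A toy check with 3-dimensional stand-ins for embeddings (the vectors are invented):

```python
import math

def normalize(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_distance(u: list[float], v: list[float]) -> float:
    # pgvector's <=>; for unit vectors, dot product IS the cosine
    return 1 - sum(a * b for a, b in zip(u, v))

def l2_distance(u: list[float], v: list[float]) -> float:
    # pgvector's <->
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

query = normalize([0.2, 0.5, 0.1])
docs = [normalize(v) for v in ([0.9, 0.1, 0.3], [0.2, 0.6, 0.1], [0.1, 0.2, 0.9])]

by_cosine = sorted(range(3), key=lambda i: cosine_distance(query, docs[i]))
by_l2 = sorted(range(3), key=lambda i: l2_distance(query, docs[i]))
# by_cosine and by_l2 rank the documents identically
```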
A semantic-search query, in full:
-- Find the 5 documents most similar to the user's question
SELECT d.id, d.title, d.body,
e.embedding <=> $1 AS distance
FROM embeddings e
JOIN documents d ON d.id = e.document_id
WHERE e.document_id IS NOT NULL -- excludes conversation embeddings
ORDER BY e.embedding <=> $1 -- smaller distance = more similar
LIMIT 5;
$1 here is the embedding of the user's question, generated at query time. Without an index, `ORDER BY ... <=> $1` is a sequential scan over every vector; the HNSW index from the schema is what turns it into an approximate nearest-neighbor lookup at sub-linear cost.
Two index types: HNSW and IVFFlat. pgvector has two practical index choices. As of 2026, the recommendation has converged:
- Start with HNSW unless you have a reason not to. Graph-based, builds slowly but queries fast, has the broadest operator support, predictable performance. The default for new projects.
- Use IVFFlat if build time matters more than query time. Partition-based, builds 5–6× faster than HNSW, queries slower. Useful when you re-build the index frequently or have an insert-heavy workload.
- DiskANN exists (via the pgvectorscale extension from Timescale) for very large indexes that don't fit in RAM. You almost certainly don't need it.
The HNSW index from the schema above:
CREATE INDEX idx_embeddings_hnsw
ON embeddings USING hnsw (embedding vector_cosine_ops);
Two tuning knobs you'll set if you tune at all: m (the number of connections per node, default 16) and ef_construction (the build-time effort, default 64). The defaults are sensible for most workloads; only touch them when you've benchmarked.
Quick check. Three claims to mark True or False. (a) You can have multiple HNSW indexes on the same `embedding` column, one per distance operator. (b) Inserting a vector into a table with an HNSW index incurs more cost than inserting into a table with no vector index. (c) An HNSW index can be created before any data is loaded into the table. Answers: all three are True. pgvector lets you index for multiple distance operators (rare but possible), the index maintenance cost is real (which is why some workloads prefer batch-insert-then-index), and HNSW (unlike IVFFlat) doesn't need a training step.
Try with AI
Two scenarios. For each, pick HNSW or IVFFlat and justify with one
specific property of the index:
Scenario A: A research index of 10M scientific papers. Built once,
queried millions of times. Build time is "whatever it takes —
overnight is fine." Query latency directly affects user experience.
Scenario B: A live index of customer support tickets that's
re-indexed every 4 hours because thousands of new tickets stream in.
Query patterns are simple (top-5 nearest neighbors). The current
HNSW build takes 20 minutes — a third of the re-index cycle.
After you answer: name ONE thing that would change your answer for
each scenario. Be specific about what you'd need to see in
production metrics before switching.
Concept 9: The embedding pipeline, text in, queryable vector out
Embeddings don't appear by magic; you generate them by calling an embedding model. The pipeline is straightforward, but each step has a decision point that bites if you get it wrong.
The four-step pipeline.
1. Chunk the document into pieces small enough to embed coherently.
2. Embed each chunk by calling the embedding model.
3. Store the chunk text, the embedding, and metadata in the `embeddings` table.
4. Query by embedding the user's question and finding nearest neighbors.

Chunking. Embedding models have token limits: OpenAI's text-embedding-3-small accepts up to 8,191 tokens per chunk, far more than you typically want. The question isn't "what's the maximum" but "what's the chunk size that preserves the unit of meaning the user will search for." Three rules that work in practice:
- Chunk at semantic boundaries when you can. Section headings, paragraph breaks, structural markers in your source. A chunk that ends mid-sentence retrieves badly.
- Target 200–500 tokens per chunk for most text. Long enough to carry meaning, short enough to be specific. Documents with longer chunks tend to "match everything weakly" rather than "match the right thing strongly."
- Include 10–20% overlap between chunks for chunks that fall across natural boundaries. The overlap costs storage but improves recall when the user's question lies near a chunk boundary.
- Don't chunk what's already short. Documents already under ~300 tokens (single resolved tickets, short policy paragraphs, FAQ entries) don't need chunking; embed them as one piece. The chunker below is shaped for longer source documents (policy PDFs, knowledge-base articles, conversation transcripts). For the worked example's seed corpus of short resolved tickets, you can skip the chunker and embed each ticket as a single chunk (Decision 5 stores each ticket as a `documents` row whose `body` is `summary || '\n' || resolution`, and embeds that).
A typed chunking function for the worked example:
# src/chat_agent/embedding/chunker.py
from dataclasses import dataclass
@dataclass(frozen=True)
class Chunk:
text: str
index: int
source_offset: int # character offset in the original document
def chunk_text(
text: str,
target_tokens: int = 400,
overlap_tokens: int = 60,
) -> list[Chunk]:
"""Split text into overlapping chunks at paragraph boundaries.
    Approximates one token as ~0.75 words, so a token budget maps to a
    smaller word budget. Fine for chunking; use tiktoken when you need
    precise counts.
    """
    paragraphs: list[str] = [p.strip() for p in text.split("\n\n") if p.strip()]
    target_words: int = int(target_tokens * 0.75)
    overlap_words: int = int(overlap_tokens * 0.75)
chunks: list[Chunk] = []
current: list[str] = []
current_word_count: int = 0
source_offset: int = 0
for para in paragraphs:
para_words: int = len(para.split())
if current_word_count + para_words > target_words and current:
chunks.append(Chunk(
text="\n\n".join(current),
index=len(chunks),
source_offset=source_offset,
))
            # carry overlap forward
            overlap_chunk: list[str] = []
            overlap_count: int = 0
            for prev in reversed(current):
                if overlap_count + len(prev.split()) > overlap_words:
                    break
                overlap_chunk.insert(0, prev)
                overlap_count += len(prev.split())
            # advance the offset past the paragraphs NOT carried forward
            # (+2 per paragraph approximates the "\n\n" separator)
            dropped = current[: len(current) - len(overlap_chunk)]
            source_offset += sum(len(p) + 2 for p in dropped)
            current = overlap_chunk
            current_word_count = overlap_count
current.append(para)
current_word_count += para_words
if current:
chunks.append(Chunk(
text="\n\n".join(current),
index=len(chunks),
source_offset=source_offset,
))
return chunks
Embedding. Call the model with batched chunks (the OpenAI API accepts arrays of inputs in one request, much more efficient than one call per chunk):

```python
# src/chat_agent/embedding/embedder.py
from openai import AsyncOpenAI

EMBEDDING_MODEL: str = "text-embedding-3-small"
EMBEDDING_DIM: int = 1536  # the table column must match


async def embed_chunks(
    client: AsyncOpenAI, chunks: list[str],
) -> list[list[float]]:
    """Embed a batch of chunks. Returns one vector per chunk, in order."""
    response = await client.embeddings.create(
        model=EMBEDDING_MODEL, input=chunks,
    )
    return [item.embedding for item in response.data]
```
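One practical wrinkle: the embeddings endpoint caps how many inputs a single request may carry (2,048 inputs per request in OpenAI's documentation at the time of writing; verify the current limit). A sketch of a batching wrapper under that assumption; the `batched` helper and the cap constant are illustrative. The ingestion loop then calls `embed_chunks` once per batch and concatenates results in order:

```python
from collections.abc import Iterator

# OpenAI's documented per-request input cap at the time of writing (verify).
MAX_INPUTS_PER_REQUEST: int = 2048


def batched(chunks: list[str], size: int = MAX_INPUTS_PER_REQUEST) -> Iterator[list[str]]:
    """Yield successive slices of at most `size` chunks, preserving order."""
    for i in range(0, len(chunks), size):
        yield chunks[i : i + size]
```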
Where do embeddings come from when your inference provider isn't OpenAI? This is a real architectural fork the chapter has so far papered over. OpenAI is the only major inference provider that also ships a first-class embeddings API; if you're routing inference through DeepSeek, Anthropic, Gemini, or a local model, you have four options. Pick one before you touch the VECTOR(n) column, because the column dimension must match the model.
| Model | Dimensions | Cost (input, per 1M tokens) | Where it lives | When to use |
|---|---|---|---|---|
| `text-embedding-3-small` | 1536 | $0.02 | OpenAI (and OpenAI-compatible aggregators like OpenRouter) | The default if you have an OpenAI key. Cheap, fast, good for most retrieval. |
| `text-embedding-3-large` | 3072 | $0.13 (model card) / $0.065 (pricing page) | OpenAI | When you've measured `-small` underperforming. |
| `embed-english-v3` / `embed-multilingual-v3` | 1024 | $0.10 | Cohere | When you're already on Cohere for inference, or when multilingual recall matters. |
| `voyage-3` / `voyage-3-lite` | 1024 | $0.06 / $0.02 | Voyage | Embeddings-as-a-service; integrates cleanly with Anthropic-shaped stacks. |
| `all-MiniLM-L6-v2` / `bge-small-en-v1.5` (local) | 384 / 384 | "free" (your compute) | `sentence-transformers` package; runs CPU-only with no API call | When your inference provider has no embeddings API (DeepSeek, most local LLMs), or when data residency forbids sending text to a third party. |
The headline cost number: embedding 50,000 chunks at ~300 tokens each = 15M tokens × $0.02/M = $0.30 on OpenAI. The same with -large is $1.95. With a local 384-dim model, $0 plus 30 seconds of CPU. Embeddings are the cheapest line on the bill; the choice rarely moves the dollar dial, but it does move the architecture.
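The same arithmetic generalizes into a two-line estimator worth keeping next to the model table. The `embedding_cost_usd` helper is illustrative, not part of the worked example:

```python
def embedding_cost_usd(n_chunks: int, avg_tokens: int, price_per_1m: float) -> float:
    """Estimated embedding cost: total tokens / 1M, times the per-1M-token price."""
    return n_chunks * avg_tokens / 1_000_000 * price_per_1m


# The chapter's headline numbers: 50,000 chunks at ~300 tokens each.
small = embedding_cost_usd(50_000, 300, 0.02)   # text-embedding-3-small
large = embedding_cost_usd(50_000, 300, 0.13)   # text-embedding-3-large
```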
The dimension is the contract. VECTOR(1536) only accepts 1536-dim vectors. If you switch from text-embedding-3-small (1536) to all-MiniLM-L6-v2 (384), you re-create the column as VECTOR(384) and re-embed every row. There is no "fits if close enough" with pgvector; the dimensions are absolute.
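Because pgvector rejects a mismatched vector only at insert time, a cheap guard at the application boundary turns a confusing database error into an immediate, named failure. An illustrative sketch (the `check_dimension` helper is hypothetical; `EMBEDDING_DIM` mirrors the embedder module's constant):

```python
EMBEDDING_DIM: int = 1536  # must equal the VECTOR(n) column dimension


def check_dimension(vector: list[float], expected: int = EMBEDDING_DIM) -> list[float]:
    """Fail fast if a vector cannot go into the VECTOR(expected) column."""
    if len(vector) != expected:
        raise ValueError(
            f"embedding has {len(vector)} dims; column is VECTOR({expected}). "
            "Did the embedding model change without a re-embed?"
        )
    return vector
```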
Re-embedding. When do you re-embed? Three triggers:
- The source document changed. Delete and re-insert all embeddings whose `document_id` matches.
- The embedding model changed. Migration of a lifetime: if you switch from `-small` to `-large`, every existing embedding is incompatible with new ones. Re-embed everything, or run two embedding columns during a transition.
- The chunking strategy changed. If you decide 400 tokens was wrong and 250 is right, re-chunk and re-embed. Versioning your chunks (storing `chunk_strategy_version` in the `metadata` JSONB) lets you do this safely.
PRIMM, Predict. You've embedded 100,000 chunks with `text-embedding-3-small`. You then decide to also embed all messages (not just documents) so the agent can do "have we discussed this before?" lookups. You write the message embeddings into the same `embeddings` table with the same column. A semantic search query (find the 5 nearest neighbors to a user question, no filter) comes back with mixed document and message results. Is this what you wanted? What's the right query shape? Confidence 1–5.
The answer: almost certainly not what you wanted. Mixing documents and messages in retrieval results without distinguishing them produces incoherent answers: the model sees a top result that's a customer's prior message and treats it as authoritative documentation. The right pattern is to filter by source type in the query: either by joining and filtering on WHERE document_id IS NOT NULL for document search, or by running two separate searches and ranking them differently. The schema's CHECK constraint that exactly one of document_id/conversation_id is set makes this filter cheap and the intent explicit.
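The two query shapes from that answer, written out as parametrized SQL held in Python constants. Illustrative: the column names follow the chapter's schema, and the constants are hypothetical names, not part of the worked example:

```python
# Document-only search: excludes message embeddings via the CHECK-backed column.
DOC_SEARCH: str = """
SELECT chunk_text, embedding <=> $1 AS distance
FROM embeddings
WHERE document_id IS NOT NULL
ORDER BY distance
LIMIT $2
"""

# Prior-conversation lookup: the message-side pool, ranked as a separate query
# so the caller can weight it differently from documentation hits.
MESSAGE_SEARCH: str = """
SELECT chunk_text, embedding <=> $1 AS distance
FROM embeddings
WHERE conversation_id IS NOT NULL
ORDER BY distance
LIMIT $2
"""
```

Running the two searches separately and merging with explicit weights keeps the "customer message treated as documentation" failure mode impossible by construction.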
Debugging poor retrieval. When top-k results don't match expectations, four checks in order, each ruling out one common cause:
| Symptom | Likely cause | Fix |
|---|---|---|
| All top-5 distances are > 0.7 (cosine distance, so anything > 0.5 is "far") | Query embedding model differs from corpus embedding model | Confirm both went through the same EMBEDDING_MODEL. Mixing text-embedding-3-small with -large, or local with OpenAI, produces nonsense ranks across both pools. |
| Top-5 results are all from one source type when you expected variety | Source-type filter missing | Add WHERE document_id IS NOT NULL (or the right side of the CHECK constraint) to the query, or split into two ranked queries and merge them with explicit weights. |
| Top-k changes wildly between near-identical queries | Chunk size too small (each chunk lacks enough context) | Re-chunk with larger chunks (200, then 400, then 600 tokens) and re-embed. Re-running the same query should now produce stable top-k. |
| Query returns nothing | HNSW index missing, or vector dimension mismatch | \d+ embeddings in psql to confirm the column is VECTOR(1536) and the HNSW index exists with vector_cosine_ops. If the column is the wrong dimensionality, re-create it. |
Three diagnostic queries worth keeping in a scratch.sql:

```sql
-- 1. Distance band on the top-5 for a known-good query
SELECT chunk_text, embedding <=> :query_vec AS distance
FROM embeddings ORDER BY distance LIMIT 5;

-- 2. Source-type breakdown of top-20 (catch the "wrong source dominates" bug)
SELECT
  CASE WHEN document_id IS NOT NULL THEN 'document' ELSE 'message' END AS src,
  COUNT(*)
FROM (SELECT * FROM embeddings ORDER BY embedding <=> :query_vec LIMIT 20) t
GROUP BY src;

-- 3. Confirm the index is actually being used
EXPLAIN (FORMAT TEXT) SELECT id FROM embeddings ORDER BY embedding <=> :query_vec LIMIT 10;
```
Retrieval quality is the silent killer of Worker accuracy. The final answer can sound perfectly reasonable while citing the wrong evidence; the only way to catch this is to evaluate retrieval before the final answer.
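Evaluating retrieval before the final answer can be as simple as a hand-labeled query set and recall@k over it. A minimal sketch; the function name and label shape are illustrative:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the known-relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids)
    return hits / len(relevant_ids)
```

Run it over 20 to 50 labeled queries after every chunking or model change; a drop shows up here long before users notice answers citing the wrong evidence.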
Try with AI
I'm chunking a corpus of legal contracts (each averaging 8,000 words)
for semantic search. The user will query things like "what's the
termination clause in this contract" — phrases that map cleanly to
specific sections. Walk me through three chunking strategies:
A) Fixed 400-token chunks with 60-token overlap (the default)
B) Chunk at section headings only, with no overlap
C) A two-level approach: store both 400-token chunks AND
whole-section chunks, search both, combine results
For each, name (1) when it wins and (2) when it loses.
Concept 10: Audit trail as discipline, what "reads and writes" means for a Worker
The thesis is unusually direct about this: "every action a Worker takes leaves a trace in a store that survives the agent's session and can be inspected, replayed, and trusted. The system of record is what separates execution from plausible-sounding fiction. It is also what makes the workforce legible after the fact."
Translated to operational terms: every meaningful action the agent takes writes a row to audit_log, plus a more structured row to capability_invocations if it's a skill or tool call. The data being acted on lives in its appropriate table (a message goes in messages, a document update goes in documents); the fact that the action happened lives in audit_log. The two are joined by foreign key.
What to log.
- Every model call: input tokens, output tokens, model name, cost estimate
- Every skill invocation: skill name, arguments, result summary, latency, success/error/timeout
- Every database write: which table, what changed (the JSONB payload), under which conversation context
- Every external tool call: tool name, input, output summary, latency
- Every guardrail event: which guardrail tripped, what the input was, what the action was (blocked/allowed/modified)
What not to log.
- The full conversation history on every turn; that's already in `messages`, you'd be storing it twice.
- Sensitive payload data verbatim if the row is queryable to humans; store a hash or summary, keep the full data in a restricted table.
- Internal model reasoning that the user shouldn't see; that's a context-management decision, not an audit decision.
The replay test. A good audit trail passes one specific test: given a conversation_id and a timestamp, you can reconstruct what the agent did and why, without re-running the model. If your audit log doesn't pass this test, it's logging, not auditing.
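The replay test translates into one time-ordered query over `audit_log`, joined out to `messages` and `capability_invocations` for detail. A sketch, held as a Python constant; illustrative only, and the `created_at` timestamp column name is an assumption against the chapter's schema:

```python
# Illustrative replay query. Assumes audit_log carries a created_at timestamp;
# adjust the column name to the actual schema.
REPLAY_QUERY: str = """
SELECT created_at, actor, action, target, payload, result
FROM audit_log
WHERE conversation_id = $1
  AND created_at <= $2
ORDER BY created_at
"""
```

If this query, parametrized with a conversation_id and a timestamp, does not tell the whole story of what the agent did and why, the audit trail fails the replay test.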
A concrete write-on-action helper, used everywhere a capability runs:

```python
# src/chat_agent/audit.py
import json
import time
from typing import Any
from uuid import UUID

import asyncpg


async def log_capability(
    pool: asyncpg.Pool,
    conversation_id: UUID,
    capability: str,
    arguments: dict[str, Any],
    result: Any,
    status: str,
    started_at: float,
    cost_cents: int | None = None,
) -> None:
    """Write a capability_invocations row plus a denormalized audit_log row.

    Both writes happen in one transaction so they're either both present
    or both absent — the audit trail never gets partial.
    """
    latency_ms: int = int((time.monotonic() - started_at) * 1000)
    async with pool.acquire() as conn:
        async with conn.transaction():
            await conn.execute(
                """INSERT INTO capability_invocations
                       (conversation_id, capability, arguments, result, status,
                        latency_ms, cost_cents)
                   VALUES ($1, $2, $3::jsonb, $4::jsonb, $5, $6, $7)""",
                conversation_id, capability,
                json.dumps(arguments), json.dumps(result),
                status, latency_ms, cost_cents,
            )
            await conn.execute(
                """INSERT INTO audit_log
                       (conversation_id, actor, action, target, payload, result)
                   VALUES ($1, $2, $3, $4, $5::jsonb, $6::jsonb)""",
                conversation_id, "worker:customer-support",
                "capability_invoked", capability,
                json.dumps({"arguments": arguments, "latency_ms": latency_ms}),
                json.dumps({"status": status, "result": result}),
            )
```
Putting the two writes in one transaction is the load-bearing detail. Half-written audit trails are worse than missing audit trails: they suggest completeness without delivering it. Either both rows go in or neither does.
Canonical action vocabulary. Every site that writes to audit_log picks an action value from a small, agreed list. Drift here (the same event gets called three different names across the codebase) is what makes replay queries fragile six months later. The six values the worked example uses, and which code path writes each:
| action | target | Where written |
|---|---|---|
| `message_received` | (null) | `cli.py`, when the user sends a message and a row lands in `messages` |
| `message_sent` | (null) | `cli.py`, when the agent's reply lands in `messages` |
| `skill_activated` | skill name | `cli.py`, on the `RunItemStreamEvent` for a skill load |
| `capability_invoked` | capability id | The `log_capability` helper above, on every skill or MCP-tool call |
| `mcp_called` | tool name | `cli.py`, after each MCP tool returns (lower-level than `capability_invoked`; use one or the other consistently) |
| `refund_issued` / `refund_blocked` | order id | The customer-data MCP server's `issue_refund` tool, on success and policy denial |
Pick one of capability_invoked and mcp_called and stick with it; writing both per call double-counts. The worked example uses capability_invoked for the agent-side helper (Decision 7) and named domain actions like refund_issued for state-changing writes inside MCP tools (Decision 6). Domain-state writes always get their own action name so the audit row is the receipt for the business event, not just for the tool call that triggered it.
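One way to keep the vocabulary from drifting is to make it a type that call sites import rather than a string each one retypes. A sketch, assuming the worked example's choice of `capability_invoked` over `mcp_called` (so `mcp_called` is deliberately absent); the `AuditAction` class name is illustrative:

```python
from enum import Enum


class AuditAction(str, Enum):
    """Canonical audit_log action vocabulary. Every write site imports this.

    (str, Enum) works on older Pythons; enum.StrEnum is equivalent on 3.11+.
    """
    MESSAGE_RECEIVED = "message_received"
    MESSAGE_SENT = "message_sent"
    SKILL_ACTIVATED = "skill_activated"
    CAPABILITY_INVOKED = "capability_invoked"
    REFUND_ISSUED = "refund_issued"
    REFUND_BLOCKED = "refund_blocked"
```

A replay query six months later can then rely on `action IN (...)` matching exactly these six strings, because nothing else ever got written.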
Why this isn't just logging. Three properties separate audit data from log data:
- Replayable. The schema lets you reconstruct the agent's reasoning trace from `audit_log` joined with `messages` joined with `capability_invocations`. A log line in a JSONL file doesn't.
- Queryable. "What did the agent tell customer X last month, and which policy did it cite?" is a SQL query. "Which skill triggered the most timeouts in the last 7 days?" is a SQL query. Logs require grep and luck.
- Trustworthy. Audit rows are in the same database as the business data. They're backed up together, point-in-time-recoverable together, access-controlled together. They survive the agent process, the deployment region, and the model version.
This is what the thesis means when it says "Workers only become governable as a workforce when a ledger makes them legible, as units of capability, cost, latency, and outcome." Your audit_log and capability_invocations tables are that ledger. Treat them like one.
Try with AI
Here's a customer support scenario: a customer claims the Worker told
them they would receive a $50 refund, but the actual refund issued was
$30. The Worker handled the conversation 19 days ago.
Walk me through the audit-trail query path to resolve this:
1. Find the conversation. (Which columns of which tables?)
2. Find the message where the refund amount was promised. (How do you
distinguish "discussed" from "promised"?)
3. Find the capability invocation that issued the refund.
4. Find the database write that recorded the $30 amount.
For each step, name the table you'd query and the WHERE clauses.
Then say what's MISSING from the six-table schema that would make
this query easier.
Part 3: MCP, wiring the agent to the system of record
Part 1 gave the agent a library of Skills. Part 2 gave it a Postgres system of record. Part 3 wires the two together with the Model Context Protocol: the open standard for how agents reach external state and external capability. The thesis is direct about MCP's place: "MCP is how the workforce reaches [its systems of record]: every authoritative store becomes addressable to any Worker through an MCP server, under policy." This Part makes that operational.
Concept 11: What MCP is and isn't
The Model Context Protocol (modelcontextprotocol.io) is an open client/server protocol (originally from Anthropic, now governed as an open standard) for how an AI agent connects to external tools, data, and prompts. The framing that gets repeated is "USB-C for AI tools": one protocol, many implementations, swap any side without breaking the other. The framing is accurate; like all metaphors, it has limits worth naming.
What MCP is. A protocol. A specification. Three primitives the server can expose to the client.
- Tools: functions the model can invoke. The client lists them, the model picks one, the server executes it. Conceptually similar to a `@function_tool` decorator from Course #3, but the implementation lives in the MCP server process, not the agent's process. This is the most-used primitive by far.
- Resources: read-only data the agent can fetch. Files, database query results, API responses. Think of them as the GET-only side of MCP. Less common than tools in practice, but useful for "let the agent read this document on demand."
- Prompts: reusable prompt templates the server provides. A team can publish standardised prompts ("summarize-incident-report") that any agent connecting to the server can invoke. Rarely used compared to tools and resources.
Three transports, with current recommendations as of 2026:
| Transport | When to use | Status |
|---|---|---|
| stdio | Local subprocess; agent and server on the same machine | Mature. Default for local tools. |
| streamable HTTP | Remote server; production deployments | Recommended for new remote work. Bidirectional, resumable, single endpoint. |
| SSE | Remote server; older deployments | Legacy. Many servers still expose it; new ones increasingly default to streamable HTTP. |
What MCP is not.
- Not a framework. It's a protocol. Your agent doesn't "use MCP" the way it uses the Agents SDK; your agent's MCP client speaks MCP to an MCP server. The Agents SDK includes an MCP client; that's the integration point.
- Not a service. There is no "MCP cloud." MCP servers are programs you run (or that vendors run for you). The Neon MCP server is hosted at `mcp.neon.tech`; the filesystem MCP server runs as a local subprocess; a custom MCP server you write runs wherever you deploy it.
- Not a security boundary. MCP defines transport and protocol; what tools an MCP server exposes and what they can do is the server's responsibility. A malicious MCP server can do anything its server-side code does. The trust boundary is still the agent loop deciding which tools to call, and the sandbox the tools execute in.
- Not a replacement for `@function_tool`. Both still have a place. The decision tree is Concept 14.
Quick check. True or false: (a) An MCP client talks to exactly one MCP server at a time. (b) The same `@function_tool`-style function, if you wanted, could be exposed as an MCP tool or kept as a function tool, and the model wouldn't know the difference. (c) MCP servers and the OpenAI Agents SDK are tightly coupled, so to use MCP you must use the SDK. Answers: (a) False: an agent can connect to multiple MCP servers and see the union of their tools. (b) True: to the model, both look like callable tools with schemas. The difference is where the implementation lives. (c) False: MCP is model-agnostic. Claude, Gemini, and others have their own MCP clients. The OpenAI Agents SDK is one client among many.
Try with AI
You're building three capabilities for an internal "weekly report"
Worker. For each, decide: @function_tool, custom MCP server, or
vendor MCP server (use the existing one). Justify with ONE specific
property from Concept 11:
A) generate_summary(text: str) -> str — wraps OpenAI's completion
API to summarize the input. Used only by this Worker.
B) read_jira_issues(project: str, since: date) -> list[Issue] — your
company already runs Atlassian's official Jira MCP server in
production.
C) post_to_slack(channel: str, message: str) -> None — used by this
Worker today. The Engineering team's incident-response Worker
needs the same capability next quarter.
Then answer: which of the three is the EASIEST design decision, and
why? Which is the HARDEST, and why?
Concept 12: The Neon MCP server, development plane, not runtime
The specifics in this concept will age. The pattern won't. Neon's MCP server tooling, auth flow, and exact tool surface change every few months. What stays true: a managed-database vendor exposes its management API through MCP for natural-language operations, while runtime production traffic uses direct connections or scoped custom servers. Verify against Neon's docs before pinning specifics.
The Neon MCP server is the most polished example of a vendor-shipped MCP server. It exposes Neon's management API (projects, branches, databases, migrations, ad-hoc SQL) as MCP tools your agent can call in natural language. This is profoundly useful during development. It is explicitly not for production runtime. Neon's own warning is unambiguous: "We recommend MCP for development and testing only, not production environments." And the stronger version, also from Neon's docs: "Never connect MCP agents to production databases."
Why the distinction matters. The Neon MCP server's run_sql tool can execute any SQL the model produces. A development workflow where you say "show me users who signed up last week and haven't logged in" → model generates SELECT * FROM users WHERE ... → server runs it → results come back is delightful. The same path in production would be a security nightmare: an attacker who could prompt-inject your customer-support Worker could exfiltrate your entire database by asking the right question.
What the Neon MCP server is for. Three roles where it's the right tool:
- Schema design and migrations during development. "Add a `priority` column to the tickets table." Branch-based migrations let the model test the migration on a branch before committing; this is Neon's own MCP design pattern, and the right way to use it.
- Exploratory data work in your agent-building loop. "How many embeddings are in the database, broken down by source?" Working through the model is often faster than typing SQL by hand, especially for one-off questions.
- Connection-string retrieval and project introspection. The MCP server can hand you a connection string for the right branch without your having to log into the Neon console.
Quick setup for development use. The Neon MCP server is available three ways. Use the OAuth-based remote server for the simplest path:

```bash
# In your project, run:
npx -y mcp-remote@latest https://mcp.neon.tech/mcp
```

Or add to `.claude/settings.json`:

```json
{
  "mcpServers": {
    "neon": {
      "command": "npx",
      "args": ["-y", "mcp-remote@latest", "https://mcp.neon.tech/mcp"]
    }
  }
}
```
Claude Code prompts for OAuth on first use.
A development session with the Neon MCP server, no SQL typed. (Approximate transcript; tool names are accurate as of Neon's current MCP documentation; exact phrasing and output format will vary. The shape of the interaction is what matters.)
You: Set up the chat-agent SoR. Create a project called "chat-agent",
enable pgvector, run the schema from migrations/001_initial.sql,
and tell me the connection string for the main branch.
Agent: [calls create_project] Created project chat-agent (ID: abc123).
[calls run_sql] Ran CREATE EXTENSION vector; — succeeded.
[calls prepare_database_migration with migrations/001_initial.sql]
Created temporary branch migration-test/abc123.
Migration ran on the branch; 6 tables created, 7 indexes built.
Want me to commit to main?
You: Yes, commit.
Agent: [calls complete_database_migration] Committed. Main branch
updated. Connection string:
postgresql://...@ep-cool-name.neon.tech/chat-agent?sslmode=require
That entire interaction is a Claude Code or OpenCode session, no SQL typed by hand, the migration tested on a branch before going to main. This is what MCP is for during development. It is not what the agent will use to write a single audit row at runtime.
PRIMM, Predict. You're writing a customer-support Worker. The Worker needs to: (a) look up the customer's order history; (b) check the refund policy for the customer's tier; (c) issue a refund; (d) write an audit row recording what was issued and why. Should it talk to Neon through the official Neon MCP server, or some other way? Confidence 1–5.
The answer: some other way. For all four operations. The Neon MCP server's run_sql is too broad an interface for a production agent: it grants the agent permission to run arbitrary SQL, which is far more authority than the four narrow operations above require. The production patterns are direct Postgres connections (a connection_pool from asyncpg) wrapping scoped business operations, or a custom MCP server that exposes only those four operations. Concept 14 covers the latter; Part 4's worked example uses both: a custom customer-data MCP server for business operations (lookups, vector search, refunds), and direct asyncpg only for the audit subsystem (Decision 7 explains the reason: audit is the meta-layer and must not share the MCP boundary it's auditing).
What the thesis means by Invariant 5 (that the workforce reads from and writes to governed stores) depends on exactly this distinction. A broad `run_sql` MCP tool is not governance; it is the absence of governance with a friendly interface.
The Neon MCP tools your Worker will touch. The Neon MCP server exposes roughly twenty tools across project management, branching, schema, SQL execution, and query tuning. Decision 3 of the worked example uses a small subset; the rest are useful for follow-on work (extending the schema later, sandboxing a what-if migration on a branch). For reference:
| Tool | What it does | Where used in Part 4 |
|---|---|---|
| `list_projects` | List Neon projects on your account | Decision 3, to confirm the project does not already exist |
| `create_project` | Provision a new Neon project (Postgres database + default main branch) | Decision 3, to create the chat-agent project |
| `describe_project` | Get project metadata (default branch, region, compute settings) | Decision 3, sanity-check after provisioning |
| `get_connection_string` | Return the Postgres connection string for a named branch | Decision 3, to populate `NEON_DATABASE_URL` for asyncpg |
| `prepare_database_migration` | Start a migration: opens a temporary branch and lets the agent run DDL there | Decision 3, to test the six-table schema before merging to main |
| `run_sql` | Execute SQL on a named branch (within an MCP-mediated session) | Decision 3, to run the schema creation on the temporary branch |
| `complete_database_migration` | Commit a migration: merges the temporary branch back to main | Decision 3, after the schema verifies clean on the temporary branch |
| `describe_table_schema` | Dump full DDL of a named table (useful for the agent to ground future plan-mode work) | Optional, when you ask Plan Mode to extend the schema later |
| `create_branch` / `delete_branch` | Manage isolated copy-on-write branches | Optional, when you want a sandbox branch for a what-if migration |
Production runtime Workers do not touch this server. They reach Postgres through the custom MCP server you write in Decision 6, or through direct asyncpg for the audit subsystem (Decision 7).
Try with AI
Read Neon's MCP server documentation page and answer three questions:
1. List THREE management operations the Neon MCP server exposes that
would be useful during development of a customer-support Worker.
2. List THREE operations a runtime Worker would NEED that the Neon MCP
server should NOT be used for, and why.
3. For each of the three in (2), propose what the Worker should use
instead (direct Postgres connection? custom MCP server? function_tool?).
Concept 13: Connecting MCP to the OpenAI Agents SDK
The Agents SDK ships with a first-class MCP client. Three classes do the work, one per transport:
```python
from agents.mcp import (
    MCPServerStdio,           # local subprocess
    MCPServerStreamableHttp,  # remote, modern transport
    MCPServerSse,             # remote, legacy transport (avoid for new work)
)
```
All three are async context managers (async with server as ...:). Once connected, you pass them to the Agent via the mcp_servers argument. The agent automatically discovers each server's tools at startup, presents them to the model alongside any @function_tools you've defined, and routes calls back to the right server based on which tool the model picks.
MCPServerStdio's default client_session_timeout_seconds=5 is enough for a pure-Python server that only imports mcp and asyncpg. It is not enough when the server imports torch, sentence-transformers, or anything that loads a model into memory at import time, which can take 10–60 seconds on first launch. The symptom is a confusing init failure ("MCP server did not respond in time"); the fix is one parameter:
```python
MCPServerStdio(
    name="customer-data",
    params={...},
    client_session_timeout_seconds=60,
)
```
Set it once for any server whose process does real work at startup, and forget it.
A minimal example, using the Neon MCP server during development:
```python
# tools/scratch_with_neon.py — development utility, not production
import asyncio

from agents import Agent, Runner
from agents.mcp import MCPServerStreamableHttp


async def main() -> None:
    async with MCPServerStreamableHttp(
        name="Neon (development)",
        params={"url": "https://mcp.neon.tech/mcp"},
        cache_tools_list=True,
    ) as neon:
        agent: Agent = Agent(
            name="DBAssistant",
            instructions=(
                "You help with Neon database operations during development. "
                "Always use prepare_database_migration / complete_database_migration "
                "for schema changes — never run DDL directly on main."
            ),
            mcp_servers=[neon],
            model="gpt-5.5",
        )
        result = await Runner.run(
            agent,
            "List all projects, then show me the schema of the largest one.",
        )
        print(result.final_output)


if __name__ == "__main__":
    asyncio.run(main())
```
Three details worth naming:
- `async with` is not optional. The MCP connection holds an open transport (a subprocess for stdio, an HTTPS session for streamable HTTP). Without the context manager, the connection leaks and the server-side state goes stale. Always use `async with`.
- `cache_tools_list=True` is a substantial speedup for production. By default the SDK calls `list_tools()` on every `Runner.run`, which is a network round-trip. Caching makes it once-per-process. Cache invalidation is manual: call `server.invalidate_tools_cache()` when you've added or removed tools. For development with a server whose tools change, leave it `False`.
- Multiple MCP servers stack naturally. Pass `mcp_servers=[neon, custom_server, hosted_server]`; the model sees the union of all tools. Use `MCPServerManager` if you need lifecycle management across many; it's a thin helper around the same pattern.
The require_approval parameter is the production switch worth knowing about. By default, MCP tool calls run without confirmation. For sensitive servers, you can require human approval per-tool:
```python
async with MCPServerStreamableHttp(
    name="Neon (development, guarded)",
    params={"url": "https://mcp.neon.tech/mcp"},
    require_approval={
        "always": {"tool_names": ["delete_project", "delete_branch", "run_sql"]},
        "never": {"tool_names": ["list_projects", "describe_project"]},
    },
) as neon:
    ...
```
The destructive operations get a human-in-the-loop check; the read-only ones run silently. This is the practical knob for the development-vs-production gap from Concept 12: even when you're using the Neon MCP server for hands-on work, gating its destructive tools through approval is a real safety improvement.
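The shape of that policy is easy to pin down as a pure function, which is useful for unit-testing your own gating logic outside the SDK. Illustrative only: this is not how the SDK implements require_approval internally, and the default of requiring approval for unlisted tools is a deliberately conservative assumption of this sketch:

```python
def needs_approval(tool_name: str, policy: dict[str, dict[str, list[str]]]) -> bool:
    """Decide whether a tool call must pause for human approval.

    The policy dict mirrors the require_approval shape above:
    {"always": {"tool_names": [...]}, "never": {"tool_names": [...]}}.
    """
    if tool_name in policy.get("always", {}).get("tool_names", []):
        return True
    if tool_name in policy.get("never", {}).get("tool_names", []):
        return False
    return True  # unlisted tools: require approval (conservative default)
```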
A look at how the model sees MCP tools. When the Agent runs, the model gets a tool list that looks structurally similar to a @function_tool list:
```jsonc
[
  {"type": "function", "function": {"name": "search_docs", "description": "...", "parameters": {...}}},
  {"type": "function", "function": {"name": "get_billing_invoice", "description": "...", "parameters": {...}}},
  // These three came from the Neon MCP server:
  {"type": "function", "function": {"name": "list_projects", "description": "Lists the first 10 Neon projects ...", "parameters": {...}}},
  {"type": "function", "function": {"name": "run_sql", "description": "Executes a single SQL query ...", "parameters": {...}}},
  {"type": "function", "function": {"name": "prepare_database_migration", "description": "Initiates a database migration by creating a temporary branch ...", "parameters": {...}}}
]
```
The model can't tell the difference between an MCP tool and a @function_tool, and shouldn't need to. The SDK's routing layer dispatches each call to the right backend.
Try with AI
I want to connect my OpenAI Agents SDK agent to TWO MCP servers
simultaneously: (a) the Neon MCP server for database operations, and
(b) a local filesystem MCP server (npx @modelcontextprotocol/server-filesystem)
for reading project files.
Write the async setup code that connects to both servers and uses
them in a single Agent. Include:
1. The right imports.
2. Both connections as async context managers (one streamable HTTP,
one stdio).
3. An Agent that uses both, with require_approval set to require
approval for any tool that writes to the filesystem.
4. A sample Runner.run that exercises tools from both servers in
one turn.
Concept 14: Custom MCP servers, when to write your own vs. when not to
The Neon MCP server is generic: it can do anything Neon's API can do. That's its strength for development and its weakness for runtime. A custom MCP server inverts the trade-off: narrow surface, no general-purpose run_sql, only the specific operations your Worker actually needs.
The decision tree, in order of priority.

The same logic in a quick-scan table:
| You want to expose... | Use this | Why |
|---|---|---|
| One function with one input, used by one agent | @function_tool | No need for protocol overhead. Local function call is fine. |
| Several functions tightly coupled to your agent's code | @function_tool | If they share state with the agent and live in the same repo, they're part of the agent. |
| A capability that multiple agents (or multiple deployments) will use | Custom MCP server | The protocol is what makes it reusable. |
| A capability that needs to outlive the agent's process | Custom MCP server | Long-running connections, background jobs, queue consumers. |
| Vendor-provided functionality (Neon, GitHub, Linear) | Vendor's MCP server | Don't rebuild what they ship. |
| Sensitive operations that need narrow scope | Custom MCP server | Define exactly the tools you need; nothing else. |
The minimum-viable custom MCP server in Python, using the official MCP SDK:
# servers/customer_data_mcp/server.py
# Run with: python -m customer_data_mcp.server
import os
from typing import Annotated

import asyncpg
from mcp.server.fastmcp import FastMCP
from pydantic import Field

mcp: FastMCP = FastMCP("customer-data")

# Connection pool: created once, reused across tool calls.
_pool: asyncpg.Pool | None = None

async def get_pool() -> asyncpg.Pool:
    global _pool
    if _pool is None:
        _pool = await asyncpg.create_pool(
            os.environ["DATABASE_URL"],
            min_size=1, max_size=10,
        )
    return _pool

@mcp.tool()
async def lookup_customer(
    customer_id: Annotated[
        str, Field(description="The customer's UUID, as provided by the user."),
    ],
) -> dict[str, str | int]:
    """Look up a customer by ID. Returns id, email, tier, and active_tickets count.

    Use when the user provides a customer ID and you need their basic profile.
    Does NOT return billing details — use lookup_billing for that.
    """
    pool = await get_pool()
    async with pool.acquire() as conn:
        row = await conn.fetchrow(
            """SELECT id::text, email, tier,
                      (SELECT COUNT(*) FROM tickets t
                       WHERE t.customer_id = c.id AND t.status='open') AS active_tickets
               FROM customers c WHERE id = $1::uuid""",
            customer_id,
        )
    if row is None:
        return {"error": f"customer {customer_id} not found"}
    return dict(row)

@mcp.tool()
async def find_similar_resolved_tickets(
    description: Annotated[
        str, Field(description="The user's question or description of the issue."),
    ],
    limit: Annotated[int, Field(description="Max results.", ge=1, le=10)] = 5,
) -> list[dict[str, str | float]]:
    """Find resolved tickets similar to a description, via pgvector.

    Use when the user describes an issue and you want to check if a similar
    issue has been resolved before. Returns ticket_id, summary, resolution,
    and similarity score (lower is more similar).
    """
    pool = await get_pool()
    # In production, embed the description here; for brevity, assume it's
    # already an embedding column being matched. See Part 4 for the full pipeline.
    #
    # Note the JOIN path: embeddings.document_id references documents(id), not
    # tickets(id). See Concept 7's schema. A resolved ticket is embedded by
    # storing it as a `documents` row (source='past_case') whose metadata holds
    # the ticket id, so the join goes embeddings -> documents -> tickets.
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            """SELECT t.id::text AS ticket_id, t.summary, t.resolution,
                      (e.embedding <=> $1::vector) AS distance
               FROM embeddings e
               JOIN documents d ON d.id = e.document_id
               JOIN tickets t ON t.id = (d.metadata->>'ticket_id')::uuid
               WHERE d.source = 'past_case' AND t.status = 'resolved'
               ORDER BY e.embedding <=> $1::vector
               LIMIT $2""",
            description, limit,  # description here is illustrative; real version embeds first
        )
    return [dict(r) for r in rows]

if __name__ == "__main__":
    mcp.run(transport="stdio")
Then in your agent:
import os

from agents import Agent
from agents.mcp import MCPServerStdio

async with MCPServerStdio(
    name="customer-data",
    params={
        "command": "python",
        "args": ["-m", "customer_data_mcp.server"],
        "env": {"DATABASE_URL": os.environ["DATABASE_URL"]},
    },
    cache_tools_list=True,
) as customer_data:
    agent: Agent = Agent(
        name="CustomerSupport",
        mcp_servers=[customer_data],
        ...,
    )
Three things this server gives you that @function_tool doesn't.
- Process isolation. The MCP server runs in its own process (subprocess for stdio, separate service for streamable HTTP). A crash in the server doesn't crash the agent; a memory leak in the server doesn't leak in the agent.
- Scope. The server has exactly two tools. No run_sql. No "execute arbitrary code." The model can't escape this scope because the protocol doesn't expose anything else. This is real defense in depth: even if the model decided to do something stupid, the surface area to do it through is two functions.
- Reusability across agents. A second agent (a Sales Worker, a Reporting Worker) can talk to the same customer-data MCP server. Same scope, same protocol, same trust boundary. The capability becomes a shared piece of infrastructure rather than a copy-paste between agents.
The trade-off is real. Custom MCP servers add operational complexity: another process to deploy, another set of logs, another network hop (if remote), another version to manage. Don't write one for a single function used by a single agent. Write one when the capability is going to be reused, when scoping matters, or when isolation buys you safety.
PRIMM, Predict. You're designing the customer-support Worker. You need: (1) semantic search over past resolved tickets; (2) writing a refund audit row; (3) reading the current weather (used in one greeting skill that says "good morning from sunny Karachi"); (4) calling the payment gateway to issue a refund. For each, predict:
@function_tool, custom MCP server, or vendor MCP server (e.g., Stripe's, if such exists)?
The answers tease out the framework:
1. Custom MCP server (customer-data). Reused across agents; sensitive data; scoped tools beat a broad run_sql.
2. Custom MCP server (customer-data) or @function_tool. Either works; if the Worker is the only writer, a function tool is fine. If multiple Workers will write audit rows, MCP server.
3. @function_tool. One agent, one tiny function, no security surface to defend. Don't build a server for it.
4. Vendor MCP server (Stripe MCP) if it exists, else @function_tool calling Stripe's API. Don't wrap third-party APIs in your own MCP server unless you need to add policy on top.
The framework is clear once you trace it: the value of MCP rises with the value of the boundary it creates. A boundary you don't need is overhead.
Try with AI
You're designing the architecture for a research-Worker that does these
five things:
A. Reads PDFs from a Google Drive folder (uses Google Drive's API)
B. Queries a private Snowflake warehouse the company owns
C. Calls a third-party data-enrichment API (Clearbit-like)
D. Logs every action to the company's central audit log database
E. Generates a Markdown report by string-formatting the results
For each, decide: @function_tool, custom MCP server, or vendor MCP
server (if a credible vendor MCP server exists). For each choice,
justify with ONE of the three properties of MCP from Concept 14
(isolation, scope, reusability) OR justify why the boundary isn't
worth building.
Concept 15: MCP under load: transports, pooling, and what happens at scale
A demo with one MCP server and one agent on a laptop works fine. The questions get harder when the Worker is in production handling 10 conversations a minute. This Concept is the operational checklist.
Transport choice, at scale.
- stdio is for local development and single-process deployments. The MCP server runs as a subprocess of the agent. Restart the agent → server restarts → fresh state. Cheap. Works. Doesn't scale across machines.
- Streamable HTTP is for production remote servers. Multiple agents can connect to one MCP server. The server can run on different hardware than the agent. Auto-reconnect, resumable streams, single endpoint: designed for production from the ground up.
- SSE exists, mostly for legacy reasons. New work should default to streamable HTTP. Existing SSE-only servers are still common (the SDK supports them); don't rewrite a server just to switch transport.
Connection management. MCP connections are not free. Each one is a session: an authentication handshake, a list_tools round-trip, a persistent connection. Three patterns that matter:
- Cache the tools list. cache_tools_list=True saves a round-trip per Runner.run. This is the single biggest latency win for the typical agent loop. Invalidate the cache when you deploy a new version of the server.
- Reuse server instances across runs. Don't create a new MCPServerStreamableHttp(...) per request. Open once at agent startup, hold the async with open for the lifetime of the agent process, close at shutdown. The connection cost is amortized across thousands of agent runs.
- Connection pooling lives inside the MCP server, not the client. If your MCP server talks to Postgres (as the customer-data example does), the server should have an asyncpg.create_pool(min_size=1, max_size=10) and share the pool across tool calls. The client side talks to one MCP server connection; the server side fans out to a pool of database connections.
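The "reuse server instances" pattern can be sketched with a stand-in server class. FakeMCPServer and AgentProcess below are illustrative names, not SDK types; the point is the lifecycle: connect once at startup, serve many runs, close once at shutdown.

```python
import asyncio
from contextlib import AsyncExitStack

# Sketch of the "open once, reuse for the process lifetime" pattern.
# FakeMCPServer stands in for a real MCP server connection; only the
# lifecycle matters here, not the transport.

class FakeMCPServer:
    def __init__(self) -> None:
        self.connects = 0
    async def __aenter__(self):
        self.connects += 1          # handshake + list_tools would happen here
        return self
    async def __aexit__(self, *exc):
        return False
    async def call_tool(self, name: str) -> str:
        return f"{name}: ok"

class AgentProcess:
    def __init__(self, server: FakeMCPServer) -> None:
        self._stack = AsyncExitStack()
        self._server = server
    async def startup(self) -> None:
        # Enter the server's context once and keep it open.
        await self._stack.enter_async_context(self._server)
    async def handle_request(self, tool: str) -> str:
        return await self._server.call_tool(tool)   # no reconnect per request
    async def shutdown(self) -> None:
        await self._stack.aclose()

async def main() -> int:
    server = FakeMCPServer()
    app = AgentProcess(server)
    await app.startup()
    for _ in range(1000):                    # a thousand runs, one connection
        await app.handle_request("lookup_customer")
    await app.shutdown()
    return server.connects

print(asyncio.run(main()))   # prints 1: one connect, amortized over 1000 runs
```

With a real MCPServerStreamableHttp you'd hold the async with open in your application's lifespan hook instead of a hand-rolled AsyncExitStack, but the shape is the same.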
Concurrency limits. Three places to set them, in order from cheapest to most expensive to get wrong:
1. max_turns on the agent. Already familiar from Course #3. Caps loop length regardless of tools.
2. max_retry_attempts on the MCP server connection. Caps retry storms when an MCP server is flaky. Default 0 (no retries); set to 2–3 for production resilience.
3. Server-side rate limits. Inside your custom MCP server, count concurrent calls per agent and reject above some threshold. The protocol allows tool calls to return errors; clients see them as tool failures and the model can decide to retry, give up, or pick a different tool.
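A minimal sketch of the server-side limit (the cap, the tool, and the sleep standing in for database work are all illustrative): reject above a concurrency threshold and return an error payload instead of queueing.

```python
import asyncio

# Sketch of a server-side concurrency cap. Above MAX_CONCURRENT in-flight
# calls, the tool returns an error payload instead of queueing; the MCP
# client surfaces it as a tool failure and the model decides what to do.

MAX_CONCURRENT = 3
_in_flight = 0

async def guarded_lookup(customer_id: str) -> dict[str, str]:
    global _in_flight
    if _in_flight >= MAX_CONCURRENT:
        return {"error": "server busy, try again"}   # reject, don't queue
    _in_flight += 1
    try:
        await asyncio.sleep(0.05)                    # stand-in for real DB work
        return {"id": customer_id, "tier": "pro"}
    finally:
        _in_flight -= 1

async def main() -> int:
    # Ten simultaneous calls against a cap of three.
    results = await asyncio.gather(*(guarded_lookup(f"c{i}") for i in range(10)))
    return sum(1 for r in results if "error" in r)

print(asyncio.run(main()))   # prints 7: seven of ten simultaneous calls rejected
```

A production server would key the counter per agent or per tenant rather than globally, but the reject-don't-queue decision is the part that prevents latency pile-ups.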
Tracing across the MCP boundary. Course #3's tracing setup carries through, but with one important detail: MCP tool calls appear in the agent's trace as ordinary tool calls: same span shape, same span name (the MCP tool's name), same timing. What's not visible from the agent side is what happened inside the MCP server's process. If your custom MCP server makes its own database calls or downstream API calls, those need their own tracing inside the server.
A practical pattern: propagate trace context across the MCP boundary using _meta. The SDK supports a tool_meta_resolver that adds metadata to each tool call (tenant IDs, trace IDs, request IDs); your server can extract and use them. This is how a single agent run's tracing chain stays unbroken from agent → MCP client → MCP server → Postgres.
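A schematic of that propagation, with illustrative names (resolve_meta and server_handle are ours, not the SDK's API): the client attaches trace context under the call's _meta field, and the server extracts it to tag its own downstream spans.

```python
import uuid

# Schematic of trace propagation across the MCP boundary. "_meta" is MCP's
# carrier for out-of-band metadata; resolve_meta plays the role the text
# assigns to a tool_meta_resolver. Names and payload shapes are illustrative.

def resolve_meta(run_trace_id: str) -> dict[str, str]:
    # Client side: attach the current run's trace context to every tool call.
    return {"trace_id": run_trace_id, "request_id": str(uuid.uuid4())}

def build_tool_call(name: str, args: dict, meta: dict) -> dict:
    return {"name": name, "arguments": args, "_meta": meta}

def server_handle(call: dict) -> str:
    # Server side: extract the trace id and tag downstream spans with it, so
    # agent -> MCP client -> MCP server -> Postgres shares one trace chain.
    trace_id = call.get("_meta", {}).get("trace_id", "untraced")
    return f"span[db.query] trace={trace_id}"

call = build_tool_call("lookup_customer", {"customer_id": "abc"},
                       resolve_meta("trace-4429"))
print(server_handle(call))   # span[db.query] trace=trace-4429
```

The same extraction point is where a server would also pull tenant IDs for row-level scoping, which is why _meta is worth wiring up even before you care about tracing.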
Quick check. True or false: (a) Switching from MCPServerSse to MCPServerStreamableHttp requires changes to the MCP server itself. (b) cache_tools_list=True is safe in production as long as you call invalidate_tools_cache() after deploying a new server version. (c) An MCP server with five tools always uses more agent context budget than a @function_tool with five functions. Answers: (a) Depends: the server must support the streamable HTTP transport; most modern servers do, older ones may only speak SSE. (b) True: that's the intended pattern. (c) False: at the schema level they're equivalent. Five tool definitions cost about the same in either form.
Try with AI
My customer-support Worker is in production. It runs 80 conversations/minute
at peak. Each conversation makes 2-4 MCP tool calls on average. I'm seeing
intermittent latency spikes — most calls return in 200ms, but a small
percentage take 5-15 seconds.
Walk me through five places I'd investigate, in order of priority:
1. The agent-side MCP client connection management.
2. The transport choice between agent and MCP server.
3. The MCP server's internal connection pool to Postgres.
4. Postgres-side query performance (slow queries blocking the pool).
5. Network or DNS issues between agent and MCP server.
For each, name the specific signal I'd look for and the rough fix.
Part 4: The worked example, customer-support Worker
One realistic evolution, every concept above, both tools. We take the chat-agent project from Course #3 and turn it into a customer-support Worker by adding three Skills, a Neon system of record, and an MCP wiring layer. Eight build decisions, same shape as Course #3's Part 5.

Before you start: setup you need that isn't in the prereqs. Three things this Part assumes are already done. (1) You have completed Course #3's worked example and have a working chat-agent/ project with cli.py, agents.py, tools.py, models.py, guardrails.py. We modify these files; we do not replace them. (2) You have a free Neon account and have run npx neonctl@latest init once to authenticate. (3) You have either Claude Code or OpenCode installed and authenticated. If any of these is missing, fix it before Decision 1.
The brief
Evolve the chat-agent from Course #3 into a customer-support Worker that:
- Loads three Skills on demand: summarize-ticket, find-similar-cases, and escalate-with-context.
- Reads from and writes to a Neon Postgres system of record with the six tables from Concept 7.
- Uses pgvector for semantic search over a small library of past resolved cases.
- Talks to Postgres at runtime through a scoped, custom MCP server (customer-data), not through the Neon MCP server, and not through direct asyncpg calls in agent code.
- Writes an audit row for every meaningful action: every skill invoked, every database write, every refund considered.
The "verification at the end" test: a customer messages "I haven't received my refund from order #4429, it's been two weeks." The Worker looks up the order via MCP, finds three similar past cases via vector search, drafts a response that cites the resolution from the most similar case, writes an audit row recording what it did, and (in a real deploy) escalates if the customer is a Pro tier user.
A note on the prompts that follow. Each Decision shows a structured ask as a block-quoted prompt. The pattern that works best in practice is to precede each ask with one orient move ("Read CLAUDE.md and the relevant files, tell me what you see, and ask 1-2 questions before we start") and then send the structured ask once the agent has loaded context and clarified ambiguities. The structured asks below are the destination, not the first move. Pasting them cold works; pasting them after orientation works better, especially as the project grows.
Decision 1: Update the rules file with the new architecture
What you do (Claude Code). Open Claude Code in your existing chat-agent/ project. Brief the agent on the architectural changes Course #4 is adding, and ask it to update CLAUDE.md accordingly:
We're extending this project with three additions on top of the Course
#3 stack: Skills loaded from .claude/skills/, a Neon Postgres system of
record (six-table schema in migrations/001_initial.sql), and MCP wiring
(Neon MCP server for development; a custom customer-data MCP server at
servers/customer_data_mcp/ for runtime).
Update CLAUDE.md to add:
1. New stack lines (Neon Postgres + pgvector, MCP).
2. A new "Architecture (NEW)" section describing where each piece lives.
3. Critical rules preventing three failure modes: business data access
must go through the customer-data MCP server (agent decision logic
never bypasses MCP for business reads or writes); the only permitted
direct asyncpg usage is in the audit subsystem via a separate audit
pool; never use the Neon MCP server's run_sql in production paths
(it's a dev tool); every skill invocation, refund, or database write
produces an audit_log row; skills live in .claude/skills/ (OpenCode
reads as fallback) — do not duplicate in .opencode/skills/.
4. New uv-run commands for the MCP server and migrations.
Keep the file under 2,500 tokens. Show me the diff before writing.
Claude Code drafts the update. Read the diff carefully. The "Critical rules" section is the load-bearing part; anything weak there fails to prevent the failure modes it's supposed to prevent. The result should look roughly like this:
# chat-agent
## Stack
Python 3.12+, uv, openai-agents >=0.14.0 (latest at time of writing: 0.17.2),
Cloudflare Sandbox, Neon Postgres + pgvector, MCP (custom server for runtime).
## Architecture (NEW)
- Capabilities: three Skills in .claude/skills/
(summarize-ticket, find-similar-cases, escalate-with-context).
See [agentskills.io](https://agentskills.io/specification) for the format.
- System of record: Neon Postgres, schema in migrations/001_initial.sql.
Six tables: conversations, messages, documents, embeddings, audit_log,
capability_invocations.
- MCP: Neon MCP server for development ONLY (schema, migrations).
Runtime traffic uses a custom MCP server at servers/customer_data_mcp/,
exposing scoped tools (lookup_customer, find_similar_resolved_tickets,
issue_refund). No run_sql in production paths.
- Embeddings: text-embedding-3-small (1536 dims), cosine distance,
HNSW index.
## Critical rules (NEW)
- Business data access must go through the `customer-data` MCP server.
Agent decision logic must never bypass MCP to query or mutate business
data. The only permitted direct asyncpg usage is inside the audit
subsystem, through a separate audit connection pool (Decision 7
explains why audit can't share the MCP boundary it's auditing).
- Never use the Neon MCP server's run_sql tool in production paths.
It's a development tool.
- Every skill invocation, every refund consideration, every database
write produces an audit_log row. No silent actions.
- Skills are in .claude/skills/ (OpenCode reads this as a fallback);
do not duplicate in .opencode/skills/.
## Commands
- `uv run python -m chat_agent.cli`: local dev (uses customer-data MCP via stdio)
- `uv run python -m chat_agent.sandboxed`: sandbox deployment
- `uv run python -m customer_data_mcp.server`: run MCP server standalone
- `uv run alembic upgrade head`: apply pending migrations
The file grows from ~30 lines to ~50 lines, still well under the 2,500-token budget from Course #3. If the diff Claude Code shows you misses any of the critical rules or hallucinates a constraint that isn't in the brief, push back and re-prompt. The rules file is one place where small inaccuracies have outsized downstream effects.
Why. Course #3's rules file pinned the engine and the sandbox. This update pins the architecture layered on top: Skills, system of record, MCP boundaries. The "business data through MCP, audit through scoped direct asyncpg" rule is the load-bearing one: it prevents the agent from short-circuiting the MCP boundary later when somebody is in a hurry, while keeping the audit subsystem independent of the boundary it audits. Decision 7 explains why that independence matters.
What changes in OpenCode. Same flow: brief the agent, review the diff. Rename the file to AGENTS.md (if you haven't already from Course #3) or leave as CLAUDE.md (OpenCode reads it as a fallback). Same content.
Decision 2: Plan the schema and the Skill set
What you do (Claude Code). Press Shift+Tab twice to enter Plan Mode. The model can read your existing project but cannot edit anything. Brief it:
Plan the customer-support Worker evolution of this project. The
foundation (OpenAI Agents SDK, Cloudflare Sandbox, sessions, streaming,
guardrails) stays. We're adding:
1. Three Skills: summarize-ticket, find-similar-cases, escalate-with-context.
For each, propose: the description, the operational shape (script-driven
or instruction-driven), and what reference files it needs.
2. The six-table schema from Part 2 Concept 7, plus any tables specific
to a customer-support domain (probably: customers, orders, tickets, refunds).
3. The custom MCP server (customer-data), with exactly the runtime tools
our agent will need. Propose the tool list and signatures. No run_sql.
4. The audit-logging plan — what writes an audit row, what doesn't.
Output the plan as a markdown file at plans/customer-support-worker-plan.md.
Do not write code yet.
The model produces a plan. Read it carefully. Two places where the first draft is usually wrong:
- The Skills' descriptions will be vague ("Summarizes tickets"). Push back: descriptions must be specific enough to fire correctly (Concept 3 of this course was about exactly this).
- The MCP tools' input schemas will be broader than needed ("query: string"). Push back: each tool should have the minimum input it needs. A lookup_customer tool needs a customer_id, not a query you build SQL out of.
Why. Two failure modes Plan Mode catches that cost hours later: a Skill with a vague description never fires, and an MCP tool with a broad input schema is just run_sql with extra steps. Both are easier to fix in a markdown plan than after they're built.
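The narrow-versus-broad point can be made concrete with two hypothetical input validators (not from the course repo): a scoped tool can reject bad input before any SQL exists, while a "query: string" tool has nothing to validate.

```python
import uuid

# Illustrative contrast, not code from the course repo. A narrow schema gives
# the server something to validate; a broad one is run_sql with extra steps.

def lookup_customer_input(customer_id: str) -> str:
    # Narrow: the only legal input is a UUID. Anything else fails loudly,
    # before any SQL string is ever assembled.
    return str(uuid.UUID(customer_id))

def run_query_input(query: str) -> str:
    # Broad: every string is "valid", including DROP TABLE customers.
    return query

print(lookup_customer_input("a1b2c3d4-0000-0000-0000-000000000000"))
try:
    lookup_customer_input("1 OR 1=1")
except ValueError:
    print("rejected before any SQL was built")
```

In the real server, this validation is what the pydantic Field annotations on each tool parameter buy you for free.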
What changes in OpenCode. Press Tab to switch to the Plan agent. Same prompt, same plan output. Save the plan file with the same name.
Decision 3: Provision Neon and run the schema migration
Neon's free tier covers a single Worker at the volume Part 5 assumes (~200 conversations/day). Plan on $0/month here. The free plan limits are 0.5 GB storage and 100 compute-hours per project (Neon pricing); above that, the Launch tier is pay-as-you-go (roughly $0.11/CU-hour + $0.35/GB-month), and a worked-example Worker typically stays under $25/month. See Part 5's cost shape table for the full breakdown.
What you do (Claude Code). Press Shift+Tab to exit Plan Mode. Make sure the Neon MCP server is connected (from Concept 12). Then:
Using the Neon MCP server, provision a project called "chat-agent",
enable the vector extension, and run the schema from
migrations/001_initial.sql (which you'll generate from the plan).
Use prepare_database_migration to test on a branch before committing
to main. Then add NEON_DATABASE_URL to my .env (use the main-branch
connection string).
The agent uses the Neon MCP server's tools: create_project, run_sql (for CREATE EXTENSION vector), prepare_database_migration (to test on a branch), and complete_database_migration (to apply to main). The Worker-side schema lands. At the end of this Decision, you have a working Neon database with the six tables, pgvector enabled, indexes built, and the connection string in .env.
This is exactly the development-use case from Concept 12. Schema management via natural language on a branch, with explicit human approval (you said "go ahead") before main is touched. The Neon MCP server has earned its place here.
What changes in OpenCode. Same flow. OpenCode's MCP integration uses the Neon server identically; the visible UI differs (the approval prompt format) but the operations are the same.
What migrations/001_initial.sql looks like
The full migration is the six-table schema from Concept 7, plus four domain-specific tables for the customer-support example. Two pieces worth showing here:
-- migrations/001_initial.sql — abbreviated; full file in the repo
CREATE EXTENSION IF NOT EXISTS vector;
-- [Six tables from Concept 7 omitted: conversations, messages, documents,
-- embeddings, audit_log, capability_invocations. See Concept 7.]
-- Domain-specific tables for customer-support
CREATE TABLE customers (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email TEXT NOT NULL UNIQUE,
tier TEXT NOT NULL CHECK (tier IN ('free', 'pro', 'enterprise')),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID NOT NULL REFERENCES customers(id) ON DELETE CASCADE,
public_id TEXT NOT NULL UNIQUE, -- the #4429-style display ID
placed_at TIMESTAMPTZ NOT NULL,
amount_cents INT NOT NULL,
status TEXT NOT NULL CHECK (status IN ('placed','shipped','delivered','refunded','cancelled'))
);
CREATE TABLE tickets (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID NOT NULL REFERENCES customers(id) ON DELETE CASCADE,
order_id UUID REFERENCES orders(id) ON DELETE SET NULL,
summary TEXT NOT NULL,
resolution TEXT,
status TEXT NOT NULL CHECK (status IN ('open','in_progress','resolved','escalated')),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
resolved_at TIMESTAMPTZ
);
CREATE TABLE refunds (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
order_id UUID NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
ticket_id UUID REFERENCES tickets(id) ON DELETE SET NULL,
amount_cents INT NOT NULL,
reason TEXT NOT NULL,
issued_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Verify Decision 3 worked. Open psql with your NEON_DATABASE_URL and run:
SELECT extname FROM pg_extension WHERE extname = 'vector'; -- one row: 'vector'
SELECT COUNT(*) AS table_count FROM information_schema.tables
WHERE table_schema='public'; -- expect 10
SELECT indexname FROM pg_indexes
WHERE indexname = 'idx_embeddings_hnsw'; -- one row
Three rows back, no errors: Decision 3 is done. If vector is missing, the CREATE EXTENSION step on the temporary branch didn't merge to main; re-run complete_database_migration. If the table count is off, the migration didn't apply cleanly.
Decision 4: Write the first Skill, summarize-ticket
What you do (Claude Code), and the one deliberate exception.
The most important sentence on this page. Decision 4 is the only place in the whole worked example where you write something by hand instead of briefing the agent. Everywhere else (CLAUDE.md, the migration, the MCP server, the audit wiring) the human briefs and the agent drafts. Decision 4 is different in exactly one piece: the
descriptionfield ofSKILL.md. This split (human owns the description, agent owns everything else) is the load-bearing pedagogy of the entire chapter.
The reason is Concept 3. The description is what determines whether the skill ever fires; getting it wrong means the skill either silently never activates or fires on every prompt (worse). Description quality is a judgment call about which user phrasings should trigger this skill, and that judgment compounds over the life of the skill. Ask the agent to write it and you'll get a description that satisfies the spec but is generic; write it yourself and you'll get one that captures the specific triggers you care about. The body, the examples, the edge cases: all agent-written under human review, same as every other Decision. The frontmatter description alone is the part the human owns.
First, create the skill folder structure:
mkdir -p .claude/skills/summarize-ticket
The frontmatter, hand-written:
---
name: summarize-ticket
description: Summarizes a customer support ticket into a structured format with sections for Customer Context, Issue Description, Resolution Steps Taken, Current Status, and Recommended Next Action. Use when the user provides a ticket ID, asks for a ticket summary, asks "what's the status of ticket X", or asks to be brought up to speed on a ticket. Produces a concise summary suitable for handoff to another agent.
---
That description took five drafts to write. The discipline from Concept 3: name what ("summarizes a customer support ticket into a structured format"), name when ("when the user provides a ticket ID, asks for a ticket summary, asks 'what's the status of ticket X', or asks to be brought up to speed on a ticket"), surface specific keywords ("ticket ID," "ticket summary," "status of ticket," "brought up to speed").
Then ask Claude Code:
Open .claude/skills/summarize-ticket/SKILL.md. The frontmatter is
done. Write the body — step-by-step operational instructions, two
examples (one short ticket, one complex), and three edge cases
(escalated ticket, ticket with no resolution yet, ticket where the
customer is irate). Keep the body under 200 lines.
The model produces the body. Read it. Push back on anything narrative. Skills run better with imperative instructions ("Look up the ticket. Extract X. Format as Y.") than narrative descriptions ("This skill is designed to help with..."). The Concept 3 discipline applies in full.
Why. This is the first time the agent gets capability that wasn't in tools.py. The next conversation that mentions a ticket ID won't go through the search_docs stub; it'll activate this skill and follow its instructions. The model decides the skill applies based on the description; the body controls what happens next.
What changes in OpenCode. The skill in .claude/skills/summarize-ticket/SKILL.md is auto-discovered by OpenCode via its fallback path (Concept 4). No duplication needed.
What .claude/skills/summarize-ticket/SKILL.md looks like
---
name: summarize-ticket
description: Summarizes a customer support ticket into a structured format with sections for Customer Context, Issue Description, Resolution Steps Taken, Current Status, and Recommended Next Action. Use when the user provides a ticket ID, asks for a ticket summary, asks "what's the status of ticket X", or asks to be brought up to speed on a ticket. Produces a concise summary suitable for handoff to another agent.
---
# Summarize ticket
When this skill activates:
1. Extract the ticket ID from the user's message. Format: a UUID, or a
public ID like "TKT-1234". If both forms are present, prefer the UUID.
2. Fetch the ticket itself by querying for ticket_id = <id>. If the ticket
has order_id set, fetch the order too.
3. Call the `lookup_customer` tool with the ticket's customer_id (from step 2).
4. Call `find_similar_resolved_tickets` with the ticket's summary, to
surface 2-3 prior similar resolutions.
5. Compose a five-section summary:
**Customer Context** — name, tier, account age, any recent prior tickets
**Issue Description** — the ticket summary, the affected order if any
**Resolution Steps Taken** — actions logged in audit_log scoped to this ticket
**Current Status** — status field, who is assigned, time since last update
**Recommended Next Action** — based on similar resolved tickets, propose ONE
specific action: respond with template, escalate, request more info, close
6. Keep the summary under 200 words total.
## Example: short ticket
Ticket TKT-1042 from a free-tier customer who reported a missing order. Status:
open, 18 minutes old, no actions yet. Three similar resolved tickets show
this is usually a delivery-tracking issue resolved in 2 hours.
Output:
**Customer Context**: Sara K. (free tier, 7-month account, 0 prior tickets)
**Issue Description**: Order #4912 not received, 5 days post-shipping date
**Resolution Steps Taken**: None — ticket 18 minutes old
**Current Status**: Open, unassigned
**Recommended Next Action**: Reply with delivery-tracking template; auto-close
in 24h if no follow-up.
## Example: complex ticket
[A second worked example with an escalated ticket — omitted for brevity.]
## Edge cases
**Escalated ticket** — include the escalation reason and the assignee. Mark the
"Recommended Next Action" as "Continue with assignee; do not duplicate."
**No resolution yet** — explicit "no actions logged" in Resolution Steps Taken.
Don't infer.
**Irate customer (from message sentiment)** — flag in Customer Context.
Recommend tier-2 review even if the issue is small.
The other two Skills, in brief. The same shape applies. Two more frontmatter blocks, each forty minutes of careful description-writing, and the Worker has the three Skills it needs.
# .claude/skills/find-similar-cases/SKILL.md (frontmatter only)
---
name: find-similar-cases
description: Searches the resolved-tickets library for tickets semantically similar to a customer's described issue, returning the top 3-5 with their resolutions and a similarity score. Use when the user describes a problem, complaint, or symptom and you need to check whether the team has handled something similar before. Calls the find_similar_resolved_tickets MCP tool. Always run this BEFORE drafting a response, so the response can reference proven prior resolutions rather than inventing a new approach.
---
The body walks through three steps: extract the issue description from context, call find_similar_resolved_tickets with limit=5, present the top three with similarity scores in a markdown table, and explicitly flag low-confidence matches (similarity > 0.3) as "no strong prior precedent found." The instruction "always run this BEFORE drafting" is doing real work; without it, the model sometimes drafts a reply from priors and never looks at the library.
# .claude/skills/escalate-with-context/SKILL.md (frontmatter only)
---
name: escalate-with-context
description: Packages a customer conversation for handoff to a tier-2 support agent. Produces a structured escalation note with customer profile, issue summary, what was already tried, why escalation is recommended, and the suggested specialist team. Use when (a) the customer is on the Pro or Enterprise tier AND the issue is unresolved after one round of investigation, (b) the customer's sentiment is clearly negative, (c) the issue involves billing >$500 or a refund decision, or (d) the user explicitly asks for a human.
---
The body invokes summarize-ticket first to get the structured context, then writes a six-section escalation note (customer context, issue, attempted resolutions, sentiment signals, recommended team, suggested SLA). The four explicit trigger conditions in the description are what stop this skill from over-firing; a Worker with vague escalation logic escalates everything, which defeats the purpose.
By Decision 7 all three Skills will be live in .claude/skills/. The structure transfers; only the descriptions and the operational specifics change.
Decision 5: Build the embedding pipeline and seed the document library
A seed corpus of a few dozen resolved tickets at ~300 tokens each embeds for a fraction of a cent at text-embedding-3-small's $0.02 per 1M input tokens. Ongoing embedding of new tickets and messages typically stays under $3/month at the worked-example volume. Embeddings are the cheapest line on the bill; the cost lever is the inference budget, not the embedding budget.
What you do. Two pieces here: the embedding pipeline code (from Concept 9) and a small corpus to embed. The Worker's "library" for this example is a curated set of past resolved tickets: small enough to run fast, big enough that semantic search shows its value. A dozen rows is enough to feel the architecture work; a few dozen makes the retrieval results more interesting. The course doesn't ship the corpus, because writing it is part of the exercise: give your agent the column shape and a few example rows, and have it generate the rest.
mkdir -p data/seed
# data/seed/resolved-tickets.csv (columns: id, customer_email, summary, resolution)
# A few starter rows; ask your agent to expand this to a dozen or more,
# varied enough that semantic search has something to discriminate between.
id,customer_email,summary,resolution
1,sara@example.com,Refund not showing up two weeks after return approved,Refund was stuck in pending due to a payment-gateway batch delay; manually re-triggered and funds posted within 24h.
2,raj@example.com,Cannot log in after email change,Account email was updated but the session cache held the old address; cleared the session and the customer logged in normally.
3,mei@example.com,Duplicate charge on a single order,Gateway retry created a second authorization; voided the duplicate and confirmed only one capture settled.
The customer_email column lets the seeder find-or-create a customers row before inserting the ticket (the tickets.customer_id foreign key is NOT NULL, so the seed cannot skip this step). Then ask the agent:
Implement src/chat_agent/embedding/{chunker,embedder,seeder}.py based on
the patterns from Concepts 9 and 10 of the course. Add a CLI command
`python -m chat_agent.embedding.seeder data/seed/resolved-tickets.csv`
that:
1. Reads the CSV.
2. For each row:
a. Find or create a customers row by email (default tier='free').
b. Insert a tickets row with status='resolved' and customer_id set.
c. Insert a documents row with source='past_case', title=the ticket
summary, body=summary+resolution combined, and metadata containing
the ticket id ({"ticket_id": "<uuid>"}). This is the row the
embedding will link to: embeddings.document_id references
documents(id), not tickets(id). See Concept 7's schema and the
Concept 14 query.
d. Chunk the documents row's body.
e. Embed each chunk with text-embedding-3-small.
f. Insert each chunk into embeddings with document_id set to the
documents row from step (c) (not conversation_id).
3. Write a single audit_log row recording "seed_run" with the row count.
Use asyncpg via a connection pool. Do NOT use the Neon MCP server for
this; direct connection only. We're populating data, not managing
the database.
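Step 2(d) is the only genuinely algorithmic piece of the brief. A minimal sketch of the chunker, assuming the ~400-tokens-with-overlap pattern from Concept 9 (word counts stand in for tokens here; a real pipeline would count with the embedding model's tokenizer, e.g. via tiktoken):

```python
def chunk_text(body: str, max_words: int = 400, overlap: int = 50) -> list[str]:
    """Split a document body into overlapping word-window chunks.

    Short bodies (like the seed tickets) come back as a single chunk,
    which is why Decision 5 expects one embedding per seed row.
    """
    words = body.split()
    if len(words) <= max_words:
        return [body]
    chunks: list[str] = []
    step = max_words - overlap  # consecutive chunks share `overlap` words
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):  # last window reached the end
            break
    return chunks
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from both sides; without it, boundary sentences are the ones semantic search silently misses.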
The model produces the seeder. Run it. Watch the embedding count grow. Confirm the index works:
-- A sanity-check query you should run after seeding
SELECT COUNT(*) FROM embeddings WHERE document_id IS NOT NULL;
SELECT model, COUNT(*) FROM embeddings GROUP BY model;
EXPLAIN ANALYZE
SELECT chunk_text, embedding <=> (SELECT embedding FROM embeddings LIMIT 1) AS d
FROM embeddings ORDER BY d LIMIT 5;
The EXPLAIN ANALYZE should show "Index Scan using idx_embeddings_hnsw"; that's the HNSW index doing its job. If it shows "Seq Scan," the index isn't being used (usually a result of the table being too small; pgvector's planner falls back to sequential scans below a certain row count, which is fine for a small seed corpus but worth knowing).
Why this is direct asyncpg, not MCP. Seed scripts are infrastructure. They run once, by hand, by an authorized operator. They are not a Worker's runtime path. The MCP boundary is for things the agent does autonomously; the seed script is for things you do. Don't put unnecessary boundaries between yourself and your own database.
What changes in OpenCode. Same approach. Use OpenCode's /build slash command equivalent (or just direct prompting) to generate the same files.
Verify Decision 5 worked. The sanity-check block above is your verification surface. Specifically:
- `SELECT COUNT(*) FROM documents WHERE source = 'past_case'` should equal the row count in your CSV. Each resolved ticket becomes one `documents` row; this is the row the embedding links to.
- `SELECT COUNT(*) FROM embeddings WHERE document_id IS NOT NULL` should also equal the CSV row count (one embedding per short ticket, since the seed corpus is below the chunking threshold).
- `SELECT model, COUNT(*) FROM embeddings GROUP BY model` should show one row, naming the embedding model you actually used. Two rows means you mixed models mid-seed; re-run from a clean state.
- The `EXPLAIN ANALYZE` line should show `Index Scan using idx_embeddings_hnsw` for corpora over a few hundred rows. With a small seed corpus pgvector's planner falls back to `Seq Scan`, and that's fine.
If the counts are zero, your seeder swallowed an exception silently; check the audit_log row your seeder wrote for the seed run.
Decision 6: Write the custom MCP server for runtime access
The custom MCP server runs in your sandbox process, so there is no separate hosting cost. Where the bill shows up is in inference: every lookup_customer or find_similar_resolved_tickets call adds a round-trip's worth of tokens to the next model turn. Concept 15 covers the latency and pool-size side of MCP-under-load.
What you do. Create the custom MCP server that the agent will use at runtime. Three tools to start: lookup_customer, find_similar_resolved_tickets, issue_refund. No run_sql. Each tool is narrowly scoped.
mkdir -p src/customer_data_mcp
touch src/customer_data_mcp/__init__.py
touch src/customer_data_mcp/server.py
uv add mcp asyncpg
The server is built on the mcp.server.fastmcp.FastMCP framework from the Python MCP SDK. Most of the structure is shown in Concept 14's example; the worked example here extends it.
Implement src/customer_data_mcp/server.py with three @mcp.tool() functions:
1. lookup_customer(customer_id) → returns id, email, tier, active_tickets.
2. find_similar_resolved_tickets(description, limit=5) → embeds description
via text-embedding-3-small, runs the vector query joining embeddings to
documents (embeddings.document_id → documents.id) and on to tickets via
documents.metadata->>'ticket_id', filtered to source='past_case' and
status='resolved'. Returns ticket_id, summary, resolution, similarity_score.
3. issue_refund(order_id, amount_cents, reason) → inserts a row into
refunds, marks the order status='refunded', writes an audit_log row
marking the action.
Use asyncpg with a connection pool. Type every function. Set up the server
to run via stdio when invoked as __main__.
The model produces the server. Read every tool's docstring. Each docstring is the model-facing description of when to use that tool: same load-bearing role as a SKILL.md description. Vague docstrings cause tools to fire at the wrong times.
Why this is a custom MCP server, not just asyncpg calls in agent code. Concept 14's three reasons, in the order they matter for this Worker: scope (the agent can do exactly three things to the database, not anything SQL allows), isolation (the MCP server runs in its own process, with its own connection pool that the agent can't accidentally exhaust), and reusability (when we build the second Worker that also needs lookup_customer, it talks to the same server).
What changes in OpenCode. The MCP server itself is tool-agnostic. The agent's MCP server registration moves to opencode.json:
{
"mcp": {
"customer-data": {
"type": "local",
"command": ["python", "-m", "customer_data_mcp.server"],
"env": { "DATABASE_URL": "${env:DATABASE_URL}" }
}
}
}
Skeleton of customer_data_mcp/server.py
# src/customer_data_mcp/server.py
import json
import os
from typing import Annotated
import asyncpg
from mcp.server.fastmcp import FastMCP
from openai import AsyncOpenAI
from pydantic import Field
mcp: FastMCP = FastMCP("customer-data")
_pool: asyncpg.Pool | None = None
_oai: AsyncOpenAI | None = None
async def get_pool() -> asyncpg.Pool:
global _pool
if _pool is None:
_pool = await asyncpg.create_pool(
os.environ["DATABASE_URL"], min_size=1, max_size=10,
)
return _pool
# The SQL in this server uses unqualified table names (embeddings, tickets, ...),
# which assumes the tables live in the `public` schema. That holds if you
# followed the course's migration as-is. If you put the schema somewhere else
# (a Neon branch with a non-default schema), schema-qualify the table names
# rather than relying on search_path:
# Neon's pooled endpoint silently drops `server_settings` startup params and
# resets `SET search_path` on connection release, so search_path isolation does
# not survive a pooled connection.
def get_oai() -> AsyncOpenAI:
global _oai
if _oai is None:
_oai = AsyncOpenAI()
return _oai
@mcp.tool()
async def lookup_customer(
customer_id: Annotated[str, Field(description="Customer UUID.")],
) -> dict[str, str | int]:
"""Look up customer by UUID. Returns id, email, tier, active_tickets count.
Use when the user provides a customer ID and you need their profile.
"""
pool = await get_pool()
async with pool.acquire() as conn:
row = await conn.fetchrow(
"""SELECT id::text, email, tier,
(SELECT COUNT(*) FROM tickets t
WHERE t.customer_id = c.id AND t.status='open') AS active_tickets
FROM customers c WHERE id = $1::uuid""",
customer_id,
)
return dict(row) if row else {"error": "customer not found"}
@mcp.tool()
async def find_similar_resolved_tickets(
description: Annotated[str, Field(description="Customer's issue description.")],
limit: Annotated[int, Field(ge=1, le=10)] = 5,
) -> list[dict[str, str | float]]:
"""Find resolved tickets similar to a description, via pgvector.
Use when the user describes an issue and you want to check prior resolutions.
"""
oai = get_oai()
emb = await oai.embeddings.create(
model="text-embedding-3-small", input=[description],
)
query_vec = emb.data[0].embedding
pool = await get_pool()
async with pool.acquire() as conn:
# embeddings.document_id references documents(id), not tickets(id).
# Decision 5's seeder stores each resolved ticket as a documents row
# (source='past_case') with the ticket id in metadata, so the join
# path is embeddings -> documents -> tickets.
rows = await conn.fetch(
"""SELECT t.id::text AS ticket_id, t.summary, t.resolution,
(e.embedding <=> $1::vector) AS distance
FROM embeddings e
JOIN documents d ON d.id = e.document_id
JOIN tickets t ON t.id = (d.metadata->>'ticket_id')::uuid
WHERE d.source = 'past_case' AND t.status = 'resolved'
ORDER BY e.embedding <=> $1::vector
LIMIT $2""",
query_vec, limit,
)
return [dict(r) for r in rows]
@mcp.tool()
async def issue_refund(
order_id: Annotated[str, Field(description="Order UUID.")],
amount_cents: Annotated[int, Field(ge=1, description="Refund amount in cents.")],
reason: Annotated[str, Field(description="Reason recorded with the refund.")],
) -> dict[str, str]:
"""Issue a refund for an order. Writes refunds row, updates order.status,
and writes an audit_log row. Use only when refund is authorized.
"""
pool = await get_pool()
async with pool.acquire() as conn:
async with conn.transaction():
refund_id: str = await conn.fetchval(
"""INSERT INTO refunds (order_id, amount_cents, reason)
VALUES ($1::uuid, $2, $3) RETURNING id::text""",
order_id, amount_cents, reason,
)
await conn.execute(
"UPDATE orders SET status='refunded' WHERE id = $1::uuid",
order_id,
)
await conn.execute(
"""INSERT INTO audit_log
(actor, action, target, payload, result)
VALUES ($1, $2, $3, $4::jsonb, $5::jsonb)""",
"worker:customer-support", "refund_issued", order_id,
json.dumps({"amount_cents": amount_cents, "reason": reason}),
json.dumps({"refund_id": refund_id}),
)
return {"refund_id": refund_id, "order_id": order_id, "status": "ok"}
if __name__ == "__main__":
mcp.run(transport="stdio")
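One detail worth noticing: the brief asked find_similar_resolved_tickets to return a similarity_score, but the skeleton returns raw cosine distance from `<=>`. If you want the friendlier number, the conversion is trivial for unit-normalized embeddings (OpenAI's text-embedding models are normalized), where cosine distance = 1 − cosine similarity. A sketch:

```python
def to_similarity(cosine_distance: float) -> float:
    """Convert pgvector's <=> cosine distance to a similarity score.

    Valid for unit-normalized embeddings: distance 0.0 means identical
    (similarity 1.0); distance near 1.0 means unrelated. Apply this to
    each row's `distance` before returning results to the model.
    """
    return 1.0 - cosine_distance
```

Keeping the raw distance is equally fine; just make the skill description and the tool docstring agree on which convention "similarity score" means, since any numeric threshold in the skill body flips direction depending on the choice.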
Verify Decision 6 worked. Start the server standalone and confirm it lists its three custom tools:
# Optional smoke test: launch the server directly to confirm it imports and
# starts without crashing. A stdio server just waits silently for a client,
# so Ctrl-C it after a moment; a crash here means a broken import or env var.
uv run python -m customer_data_mcp.server
# The real check goes through the agent, which spawns the server itself:
uv run python -m chat_agent.cli "test: list your available tools"
The agent's first response should name the three custom tools: lookup_customer, find_similar_resolved_tickets, issue_refund. (Some models also surface their own meta-tools, like OpenAI's multi_tool_use.parallel, a parallel-call wrapper; that is normal and not from your MCP server.) If you see a generic database tool (run_sql or similar), the agent is also connected to the Neon MCP server at runtime; remove that from your runtime mcp_servers= list (it's development-only per Concept 12). If the three custom tools are missing or you get an init-timeout error, the MCP server crashed at startup; check the subprocess stderr for the actual exception. The Concept 13 timeout admonition covers the ML-import case.
`Event loop is closed` traceback on shutdown is harmless. This server holds an AsyncOpenAI client (for embedding inside find_similar_resolved_tickets). When the MCP subprocess shuts down, the HTTP library's connection cleanup can race the stdio transport's event-loop teardown and print a multi-line `Fatal error on SSL transport` / `RuntimeError: Event loop is closed` traceback to stderr. It is teardown noise: the run already completed. Judge success by the exit code and the audit trail, not by a clean-looking shutdown. (If the noise bothers you, the deeper fix is to move the embedding call into the agent process and pass the vector to the tool as an argument; that also narrows the server's scope, which is the whole point of Concept 14.)
Decision 7: Wire audit logging everywhere
What you do. Two pieces of wiring: the agent's own audit writes (skill invocations, model calls), and the audit writes inside the MCP server (already shown in issue_refund above). Both happen alongside actions, never as an afterthought.
The agent-side helper from Concept 10 (log_capability) gets used at three sites in cli.py and sandboxed.py: at the start and end of each Skill invocation, after each MCP tool call, and around guardrail trips. The key discipline: the audit write and the action it logs happen in the same code path, ideally in the same transaction.
Ask the agent:
Modify src/chat_agent/cli.py to log_capability around each
Runner.run_streamed call. Capture: which skills activated (visible
from RunItemStreamEvent.tool_called for skill events), which MCP
tools were called, total tokens, latency, status. Write to audit_log
via a direct asyncpg connection (NOT through the customer-data MCP
server — audit is the meta-layer; it shouldn't share the same MCP
boundary as the data it audits).
The audit connection pool is separate from the customer-data MCP server's pool. Two reasons: audit must succeed even if the data pool is saturated, and audit writes shouldn't compete with business writes for the same connections. An audit subsystem that can be starved by the system it's auditing is not an audit subsystem.
Why the audit infrastructure looks complicated and isn't. The schema from Concept 7 already has audit_log and capability_invocations. The helper from Concept 10 already writes both in one transaction. What this Decision adds is calling that helper at every meaningful boundary. The mechanics are simple; the discipline is in remembering to do it consistently.
What changes in OpenCode. Identical. The audit code is plain Python; no tool-specific differences.
Verify Decision 7 worked. Run one throwaway conversation against the agent, then ask the database whether the audit fired:
-- Pick the most recent conversation
SELECT id FROM conversations ORDER BY started_at DESC LIMIT 1;
-- ... then for that conversation_id, every meaningful event should have a row
SELECT action, target, created_at FROM audit_log
WHERE conversation_id = '<id-from-above>'
ORDER BY created_at;
Running from the standalone Python cli.py, you should see at minimum: message_received, at least one capability_invoked, and message_sent. (No skill_activated row from the standalone path; Decision 8 explains why and shows how to see one inside Claude Code.) If you see only the MCP-side rows (capability_invoked + domain actions like refund_issued), the agent-side log_capability helper is wired but not firing; check that you invoked it from the Runner.run_streamed event loop, not just at run start. If you see zero rows: the audit connection pool isn't connecting; check audit.py's asyncpg.create_pool(...) against your NEON_DATABASE_URL.
Decision 8: Verify end-to-end with the test scenario
What you do. Run the customer scenario from the brief:
You: I haven't received my refund from order #4429 — it's been two weeks.
Trace what happens. Running from the standalone Python cli.py, four things you should see in the audit log within a few seconds:
- `action=message_received`: the user's message arrives, conversation row created.
- `action=capability_invoked, target=mcp:lookup_customer`: the agent finds the customer behind order #4429.
- `action=capability_invoked, target=mcp:find_similar_resolved_tickets`: vector search finds similar past cases.
- `action=message_sent`: the agent's draft reply, recorded.
If a refund were authorized: an `action=capability_invoked, target=mcp:issue_refund` row, plus an `action=refund_issued` row written by the MCP tool itself (Decision 6) carrying the amount and reason in payload. If a refund is not authorized (the policy check fails), an `action=refund_blocked` row with the reason: the no-action case is just as important as the action case for replay. All of these action names match the canonical vocabulary table in Concept 10.
What you will not see from the standalone Python path is a skill_activated row. Skill discovery is a Claude Code / OpenCode client capability (the "One precondition" callout near the Quick Win); a bare Agent(...) does not scan .claude/skills/. The three Skills from Decision 4 are on disk, valid, and ready, but the standalone agent reaches lookup_customer and find_similar_resolved_tickets through the MCP boundary directly, not through the summarize-ticket skill. To see the Skills layer fire, run the same agent inside Claude Code (next callout).
Rows 1 and 4 (message_received, message_sent) come from Decision 7's agent-side audit wiring. Rows 2, 3, and the refund rows come from the MCP server itself (Decision 6). If you implemented Decisions 1-6 + 8 but skipped Decision 7 (a defensible slice for a learning pass; the MCP-side audit is already enough to prove the architecture), you will see only the MCP-side rows. That is also a success state: the business writes have a receipt; the agent-side trace is what you add when you need to replay reasoning, not just outcomes.
Everything above runs through the standalone Python cli.py, which cannot load .claude/skills/. To exercise the Skills layer the brief promised, open this same project in Claude Code (or OpenCode) and give it the scenario prompt directly. Now skill discovery is active: the client scans .claude/skills/, the model matches the summarize-ticket description, and your agent-side log_capability wiring records a skill_activated row alongside the capability_invoked rows. Same audit schema, same canonical vocabulary, one more action value: the row the standalone path structurally cannot produce. This is the moment the three layers, Skills + system of record + MCP, are all visible in one trace.
Then run the replay query, the one that proves your audit log is replayable:
SELECT al.created_at, al.action, al.target, al.payload, al.result
FROM audit_log al
WHERE conversation_id = $1
ORDER BY al.created_at;
Read the result. You should be able to reconstruct, line by line, what the agent did and why, without re-running the model. If you can't (if a step happened that isn't in the log, or if a row is in the log but the action it claims happened isn't reflected in the business tables) there's a wiring bug. Fix it now.
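The "can I reconstruct it" test can itself be automated: the standalone path's minimum trace is an ordered subsequence of the conversation's audit actions. A hypothetical checker (the helper name and the minimum sequence are illustrative; the canonical action vocabulary is Concept 10's table):

```python
MINIMUM_TRACE = ["message_received", "capability_invoked", "message_sent"]

def trace_is_replayable(actions: list[str]) -> bool:
    """True if `actions` (audit_log.action values ordered by created_at)
    contains MINIMUM_TRACE in order; other rows may interleave freely."""
    it = iter(actions)
    # `needed in it` consumes the iterator, so ordering is enforced
    return all(needed in it for needed in MINIMUM_TRACE)
```

Feed it the `action` column from the replay query; a False is the "wiring bug, fix it now" signal without eyeballing rows.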
Why this scenario. It exercises the architectural pieces this course adds: MCP-backed tools (lookup_customer, find_similar_resolved_tickets) run against the Neon system of record, and an audit trail records the path, all replayable in SQL. None of that existed in Course #3's chat agent. The Skills layer is the third piece: built and ready in .claude/skills/, and it joins the trace the moment you run the same agent inside Claude Code (the callout above), where skill discovery is active and a skill_activated row appears alongside the rest.
What changes in OpenCode. The verification is client-agnostic: run inside Claude Code or OpenCode and you get the same trace, skill_activated row included, as long as the agent code, skills, and MCP server are the same. The standalone Python cli.py produces the same trace minus that one row, for the reason Decision 8 explains.
What just happened
Eight Decisions, and the chat agent from Course #3 is now a Worker. Look back at what changed:
- Capability moved out of code. Three Skills sit in `.claude/skills/`, version-controlled, sharable across agents.
- State moved into a system of record. Six tables in Neon, with semantic search via pgvector, and a real domain schema for customers/orders/tickets/refunds.
- Runtime database access is mediated. The agent doesn't have `asyncpg` in its imports; it talks to Postgres through a scoped MCP server that exposes exactly three tools.
- Every action leaves a trace. The audit log can replay the full reasoning trace of any conversation, weeks or months after the fact, in SQL.
The OpenAI Agents SDK is still there. The Cloudflare Sandbox is still there. The streaming, guardrails, tracing from Course #3 are all still there. What's different is the architecture layered on top: Skills as capability, Neon as truth, MCP as the wiring between them.
That is what the thesis calls the difference between an agent and a Worker.
Part 5: Where this course leaves off
Cost shape of a Worker: what does it actually cost to operate?
Course #3 had the "every turn re-bills the world" thread; this course has so far been quiet about dollars. Closing the gap, for the Part 4 worked example specifically: a customer-support Worker doing 200 conversations per day, ~10 messages per conversation, average context 8K tokens per turn, three Skills, three MCP tools.
Four cost lines, in order of size.
| Line | Driver | Rough monthly cost | When it bites |
|---|---|---|---|
| Model inference | input tokens × turns × $/M | $60–$200 | Volume × prompt size. Cache hits on stable prefixes (CLAUDE.md, Skills metadata, system prompt) typically recover 60–80% of input cost. |
| Neon Postgres | storage + active compute | $0–$25 | Free tier covers a single Worker doing this volume. Scale-to-zero means idle hours cost nothing. Paid tier kicks in once a project crosses the 0.5 GB / 100 CU-hours free-plan limits (see Neon pricing for current numbers). |
| Embeddings | chunks × $0.02/M tokens | $0.30–$3 | One-time cost for seed data plus incremental embedding of new tickets/messages. Negligible unless you're embedding entire conversation histories continuously. |
| Cloudflare Sandbox | container minutes | $0–$15 | Depends on session length and concurrency. Idle reaping helps; long-running sessions don't. |
Total monthly for the Part 4 Worker: roughly $60–$240. The model is the largest line, by an order of magnitude over everything else. The system of record is essentially free at this volume. Skills discipline (progressive disclosure, Concept 2) is also cost discipline: a Worker with bloated SKILL.md bodies pays for every turn that activates them; a Worker with descriptions-only-at-discovery and tight bodies-on-activation pays for what it uses.
Three knobs that move the dial most. (1) Cache hit rate: keep your CLAUDE.md, system prompt, and Skills metadata stable; cache misses cost 5–10× more than hits. (2) Model tier: the same Worker on a cheaper model (DeepSeek V4 Flash, Claude Haiku) often does 80% of the job at 10% of the cost; route hard decisions to the frontier model only when needed. (3) Audit table growth: audit_log is the largest table by row count (Concept 7's PRIMM Predict); partition or archive past 90 days if you don't actually query historical audit data.
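The inference line is just arithmetic, and putting it in a function makes knob (1) concrete. A back-of-envelope sketch: the per-million price and the cache discount below are placeholder assumptions, not quotes; check your provider's current pricing:

```python
def monthly_input_cost(
    conversations_per_day: int = 200,
    turns_per_conversation: int = 10,
    input_tokens_per_turn: int = 8_000,
    usd_per_m_input: float = 1.0,   # placeholder price, not a quote
    cache_hit_rate: float = 0.7,    # fraction of input tokens served from cache
    cache_discount: float = 0.9,    # cached tokens cost 10% of full price
    days: int = 30,
) -> float:
    """Input-token cost only; output tokens add a second, smaller line."""
    tokens = (conversations_per_day * turns_per_conversation
              * input_tokens_per_turn * days)
    effective_tokens = tokens * (1 - cache_hit_rate * cache_discount)
    return effective_tokens / 1_000_000 * usd_per_m_input
```

At the worked-example defaults the raw volume is 480M input tokens a month, so the cache term dominates: moving the hit rate from 0.0 to 0.7 cuts the line by roughly 60%, which is why prefix stability is knob number one.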
The honest scaling number. A workforce of 50 Workers at this shape (what your next course covers) is roughly $3,000–$12,000/month for inference, $50–$200 for Neon, single-digit dollars for embeddings, $100–$500 for sandbox compute. The infrastructure layer stays cheap; the model bill is what scales. This is why every Course #3 cost-discipline habit (cache hits, model tiering, context hygiene) compounds when you go from one Worker to many.
Swap guide: the architecture is invariant, the products are not
This course names specific vendors at every layer (OpenAI Agents SDK, Cloudflare Sandbox, Neon, OpenAI embeddings, MCP Python SDK). That's because a teaching example needs concrete answers, not "use any LLM runtime you like." But the architecture works with any compliant alternative. Five swaps the course's design explicitly anticipates:
- Postgres host: Neon → Supabase, AWS RDS, self-hosted. Anything with `pgvector` works. You lose branching and scale-to-zero (those are Neon-specific), but the six-table schema, the embedding pipeline, the audit-trail discipline, and the custom MCP server pattern all transfer byte-for-byte. The only change is the connection string and possibly the SSL config.
- Vector storage: pgvector → Pinecone, Weaviate, Qdrant. If you reject the "one database for both relational and vector data" argument from Concept 6, swap the `embeddings` table for a vector-DB client. The cost: two stores to keep consistent (Concept 6 argues this is rarely worth it). The benefit: better recall at very large scales (10M+ vectors), and managed-service operational simplicity.
- Embedding model: OpenAI → Cohere, Voyage, BGE-small (local). Change one constant (`EMBEDDING_MODEL`) and one column dimension (`VECTOR(n)`). Run a one-shot re-embed of existing data. Concept 9's pipeline doesn't change.
- Sandbox: Cloudflare Sandbox → E2B, Modal, Daytona, your own Docker. Anything with isolated process boundaries and a clean restart works. Skills' `scripts/` execute the same way. The trust-boundary diagram from Course #3 still applies.
- Agent runtime: OpenAI Agents SDK → LangGraph, CrewAI, Pydantic AI, your own loop. The MCP boundary is what survives; every modern agent framework has an MCP client. Skills work in any agent that can load `SKILL.md` files (Claude Code, OpenCode, Goose, and increasingly Cursor/Windsurf). The audit-trail discipline is framework-agnostic Python.
What doesn't swap easily. The MCP protocol itself, the Skills format spec, and the audit-trail discipline. These are the invariants: the things you'd port between products, not the products themselves. Same architectural shape, replaceable implementations underneath.
What this course doesn't cover (yet)
You now have a Worker that satisfies two of the Seven Invariants the thesis sets out. Specifically: it runs on an engine (Invariant 4, from Course #3), and it runs against a system of record (Invariant 5, from this course). The other five Invariants are what production AI-Native Companies require, and what subsequent courses cover. Each is one bullet here, not a section.
- Invariant 1: The human is the principal. Authored specs, approval gates, budget declarations. The architecture for setting intent and owning outcomes, covered in Part 6 of the book.
- Invariant 2: Every human needs a delegate. A personal agent at the edge that holds your context, represents your judgment, and brokers work to the workforce. The thesis names OpenClaw as the current realization.
- Invariant 3: The workforce needs a manager. An orchestrator that assigns work, enforces budgets, audits execution, exposes hiring as a callable capability. The thesis names Paperclip.
- Invariant 6: The workforce is expandable under policy. A meta-layer where an authorized agent generates a prompt, provisions a runtime, and registers a new Worker, without waking a human. Claude Managed Agents is one realization.
- Invariant 7: The workforce runs on a nervous system. Triggers (schedules, webhooks, inbound API calls) wake the agent under the authority envelope. Inngest (durable functions and background jobs) is one realization for general workforce events; Claude Code Routines is the coding-agent-specific path.
A single Worker reading from a Neon system of record is the smallest unit of the architecture this course teaches. The next course extends that Worker into a workforce: multiple Workers coordinated by a manager, expandable on demand, woken by triggers. Same OpenAI Agents SDK foundation, same Cloudflare Sandbox runtime, same Skills format, same Neon system of record. The architecture is invariant.
How to actually get good at this
Reading this crash course does not make you good at building Workers. Using it does. The path looks the same as for the previous course: you start manual, feel the friction, and let each piece of friction teach you which Concept it belongs to.
The mapping for this course:
- "Why isn't my skill firing when it should?" → description quality (Concept 3). Rewrite. Test by inventing five different ways a user might phrase the trigger.
- "Why is the agent inventing data the database doesn't have?" → the agent isn't actually calling the MCP server. Check the trace; check the `mcp_servers=[...]` registration.
- "Why is my audit log incomplete?" → the audit write isn't in the same code path as the action (Concept 10). Move it next to the action, in the same transaction.
- "Why are my pgvector results irrelevant?" → either chunking is wrong (Concept 9), or the embedding model at insert-time doesn't match the embedding model at query-time. Re-embed.
- "Why is my MCP server slow under load?" → connection pool inside the server is too small, or the tools list isn't cached on the client. Concept 15.
- "Why does the Neon MCP server feel scary in production?" → because Neon's own docs say it's not for production. Write a custom MCP server (Concept 14). The first one takes 30 minutes; the second takes 10.
Build the architecture one piece at a time. Don't try to add Skills, system of record, and MCP all in one weekend. Take the chat agent from Course #3. Add a system of record first (Decisions 3–5). Watch what changes about your debugging experience. Add a Skill (Decision 4). Watch how the model decides to use it. Add the MCP boundary last (Decision 6). Each step is its own learning. Combined into one big rewrite, they're a wall.
The portability dividend extends. Skills you write here work in any Agent Skills-compatible client. Schemas you write here work in any Postgres. MCP servers you write here work with any MCP client: Claude, GPT, Gemini, local models. The architecture is invariant; the products are just what 2026 happens to offer. When the products change, your code mostly doesn't.
What "the human owns the description, the agent owns everything else" actually feels like
After Decision 4 sinks in, your work shifts shape. The thing you used to do (write code) becomes the thing you brief; the thing you used to skim (the description field of a config file) becomes the thing you draft, redraft, and live with. A description that took five minutes to write and another twenty to refine is doing more architectural work than the 200 lines of MCP server code that the agent generated underneath it, because the description is the routing surface the model reads every turn.
The internal experience is a small inversion. You stop thinking "how do I implement this?" and start thinking "what are the five different ways a real user might phrase the trigger for this?" You stop asking "what should this function return?" and start asking "what description tells the model not to fire this when the user means X instead?" The code is downstream; the description is upstream. If the description is wrong, the agent never reaches the code, and the code's quality is irrelevant.
The other shift: review replaces authorship as the load-bearing skill. The agent drafts; you decide whether the draft does the right thing in the trigger cases you wrote the description for. The hardest part is resisting the urge to rewrite the agent's draft when you could solve it yourself in three minutes. That urge is structurally similar to the urge to bypass the MCP boundary that this chapter warns about. Same discipline, different layer.
You'll know you've internalized it when you instinctively spend forty minutes on a description and three minutes briefing the agent on the body, instead of the other way around.
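To make the inversion concrete, here is a minimal hypothetical `SKILL.md` whose description does the routing work described above. The skill name, trigger phrasings, and body steps are all invented for illustration; only the frontmatter shape (name + description, then a markdown body) comes from the course.

```markdown
---
name: refund-lookup
description: >
  Use when the user asks about a refund, a reversed charge, or where their
  money went after a return. Do NOT use for new purchases, exchanges, or
  subscription changes.
---

1. Look up the order and its refund row in the system of record.
2. Report the refund status verbatim; never promise a date the database
   does not show.
```

Note where the effort went: the description enumerates several trigger phrasings and one explicit exclusion, because the description is the only part the model reads every turn.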
Quick reference
The 15 concepts in one line each
- An Agent Skill is a folder. SKILL.md plus optional scripts/references/assets.
- Progressive disclosure. Metadata at startup → full body on activation → references on demand.
- A SKILL.md is frontmatter + body. Name, description, optional metadata, then operational instructions.
- Skills travel as files. Same SKILL.md works in Claude Code and OpenCode without modification.
- Compose small skills via filesystem handoff when isolation matters more than orchestration simplicity.
- Postgres + pgvector beats a separate vector DB for almost all agent workloads. Neon adds branching, scale-to-zero, and an MCP server.
- Six tables are the minimum operational schema: conversations, messages, documents, embeddings, audit_log, capability_invocations.
- pgvector basics: `VECTOR(1536)` + `<=>` cosine distance + HNSW index. Use the same embedding model on both ends.
- The embedding pipeline: chunk at semantic boundaries (~400 tokens with overlap), batch-embed, store with model metadata.
- Audit is not logging. Every meaningful action writes a row in the same transaction as the action it records.
- MCP is a protocol, not a service. Three primitives (tools, resources, prompts), three transports (stdio, streamable HTTP, legacy SSE).
- The Neon MCP server is for development. Schema design, branch-based migrations. Not for production runtime.
- The OpenAI Agents SDK has a built-in MCP client. `from agents.mcp import MCPServerStdio, MCPServerStreamableHttp`. Use `async with`. Cache `list_tools` in production.
- Custom MCP servers earn their keep via scope, isolation, and reusability. Don't write one for a single function used by one agent.
- MCP under load: streamable HTTP for remote, cache tools, reuse connections, pool inside the server, propagate trace context via `_meta`.
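The "chunk with overlap" half of the embedding pipeline can be sketched in a few lines. This is a deliberately naive version that approximates tokens with whitespace-split words; a real pipeline would use a tokenizer and respect semantic boundaries, and every name here is hypothetical rather than the course's actual code.

```python
# Naive overlapping-window chunker. "Tokens" are approximated by words;
# the course's real pipeline chunks at semantic boundaries with a tokenizer.

def chunk_words(text: str, max_tokens: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-window chunks."""
    words = text.split()
    if not words:
        return []
    step = max_tokens - overlap  # each window starts `overlap` words before the previous one ended
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this window already reached the end of the text
    return chunks

# 1000 words with 400-word windows stepping by 350 → windows at 0, 350, 700
chunks = chunk_words("word " * 1000, max_tokens=400, overlap=50)
print(len(chunks))  # → 3
```

The overlap exists so a sentence straddling a chunk boundary appears whole in at least one chunk; each chunk would then be batch-embedded and stored alongside the embedding model's name, per Concept 9.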
Decision tree: @function_tool vs. custom MCP server vs. vendor MCP server
Capability used by one agent, one process, one function?
→ @function_tool
Capability that multiple agents (or multiple deployments) will reuse?
→ Custom MCP server
Vendor provides one and it does what you need?
→ Vendor MCP server (don't rebuild)
Sensitive operations, needing narrow scope?
→ Custom MCP server (NOT a broad `run_sql` interface)
Long-running, background, or process-isolated work?
→ Custom MCP server (process isolation buys safety)
File location quick-ref
| What | Path |
|---|---|
| Project skills | .claude/skills/<name>/SKILL.md (OpenCode reads as fallback) |
| Personal skills | ~/.claude/skills/<name>/SKILL.md (OpenCode reads as fallback) |
| Schema migrations | migrations/NNN_*.sql |
| Embedding code | src/chat_agent/embedding/{chunker,embedder,seeder}.py |
| Custom MCP server | src/customer_data_mcp/server.py |
| MCP server registration (Claude Code) | .claude/settings.json mcpServers |
| MCP server registration (OpenCode) | opencode.json mcp block |
| Audit helper | src/chat_agent/audit.py |
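For orientation, the two registration entries from the table might look roughly like the sketches below. The server name and command are invented (they assume the `src/customer_data_mcp/server.py` layout above is importable as a module), and exact field names can vary between tool versions, so treat these as shapes to check against each tool's current docs, not copy-paste config.

A hypothetical `.claude/settings.json` entry (Claude Code):

```json
{
  "mcpServers": {
    "customer-data": {
      "command": "python",
      "args": ["-m", "customer_data_mcp.server"]
    }
  }
}
```

And the corresponding hypothetical `opencode.json` block (OpenCode):

```json
{
  "mcp": {
    "customer-data": {
      "type": "local",
      "command": ["python", "-m", "customer_data_mcp.server"],
      "enabled": true
    }
  }
}
```

The server process being launched is identical in both; only the registration syntax differs, which is the dual-tool point the table is making.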
When something feels wrong
Skill not firing when it should
→ Description too vague. Rewrite with "Use when..." and specific keywords (Concept 3).
Skill firing when it shouldn't
→ Description too broad. Add explicit constraints in the description.
pgvector returning irrelevant results
→ Embedding model mismatch (insert vs. query). Verify the model column in
the embeddings table. Re-embed if needed.
MCP tool not appearing in agent
→ Server not registered, or list_tools cache stale. Check mcp_servers=[...]
and try cache_tools_list=False temporarily.
Audit log has gaps
→ Action and audit write are in different code paths. Move them next to
each other, ideally same transaction.
Agent timing out on Postgres operations under load
→ MCP server's connection pool too small. Check asyncpg.create_pool(max_size=...).
MCP server hangs on startup with torch / sentence-transformers / large imports
→ Default client_session_timeout_seconds=5 is too short for servers that
load ML models at import. Bump to 60. See Concept 13's callout.
CREATE TABLE fails: relation "notes" already exists
→ You're pointing at a database that already has tables. Use a fresh
database or Neon branch for the Quick Win (Step 2).
An OLD asyncpg refuses the Neon DSN with "unsupported startup parameter"
→ Only older asyncpg versions choke on channel_binding=require. The
version this course pins accepts it; if you are pinned to an old
asyncpg, either upgrade or strip that one parameter from the DSN.
TLS posture is unchanged either way.
Neon -pooler endpoint: schema-qualified SQL needed, search_path ignored
→ The pooled endpoint silently drops server_settings startup params
and resets SET search_path on connection release. If your tables
are not in the public schema, schema-qualify them (see the
Decision 6 MCP-server skeleton's get_pool comment). A reader who
built in public is unaffected.
Non-OpenAI key getting 401 against api.openai.com
→ Set OPENAI_BASE_URL to your provider's OpenAI-compatible endpoint
(e.g., https://api.deepseek.com/v1) before running the agent.
Agent fails partway with a 401 / auth / BadRequestError
→ Wrong key, wrong provider, or expired key. Run the curl key smoke-test
from Step 5 before the full run; it fails in one second instead of
four files deep.
Neon MCP server returning errors in production agent code
→ You're using it wrong. Neon's docs are explicit: development only.
Write a custom MCP server instead (Concept 14, ~30 minutes).
Appendix: Audience, prerequisites, glossary
Audience, prerequisites, a Course #3 recap, a Postgres primer, a first-pass reading guide, and a glossary. Use it as a reference: flip back when a term in the main flow is unfamiliar, or read A.1 before you start if you are not sure you have the prerequisites.
A.1: Audience & prerequisites
Audience. This is an intermediate-to-advanced course, denser than its predecessors. You need to have completed Course #3 (or be comfortable with everything it taught), because we extend its chat agent rather than re-build it from scratch. The OpenAI Agents SDK, the agent loop, sessions, streaming, function tools, sandboxing: all assumed.
Prerequisites. This page assumes four things.
- You have completed Build AI Agents with the OpenAI Agents SDK and Cloudflare Sandbox. This is non-negotiable. We pick up where its worked example left off: same project layout, same `agents.py`, same `cli.py`, same `tools.py`. If you have not, do that course first; this one will read as friction without it.
- You have the Agentic Coding Crash Course discipline. Plan mode, rules files (`CLAUDE.md`/`AGENTS.md`), slash commands, context discipline. This course's worked example uses Skills as slash commands at one point, so the rules-file discipline is load-bearing.
- You have done at least one PRIMM-AI+ cycle from Chapter 42. The Predict prompts in this course assume you know to predict, run, investigate, modify, make.
- You have a working Postgres mental model. Tables, indexes, transactions, foreign keys. You don't need to be a DBA. You should know what `SELECT ... WHERE` does, what an index is for, and roughly what `JOIN` does. If you have written one CRUD app in any language, you're calibrated.
A.2: What Course #3 taught you that this course assumes
Full course: Build AI Agents with the OpenAI Agents SDK and Cloudflare Sandbox. The eight things from that course this one builds directly on:
- The state-and-trust frame. Every agent primitive is the SDK's answer to a state question or a trust question. This course extends both axes: state into a system of record, trust into a Skills library.
- The agent loop. Model decides → is_final? → run_tool (trust boundary) → history grows → next turn. This course adds MCP tool calls and Skills invocations to that loop, but the loop shape doesn't change.
- `@function_tool`. A typed Python function exposed to the model. This course's Concept 14 contrasts it with MCP-exposed tools; you need to know what `@function_tool` is to understand when not to reach for MCP.
- Sessions. `SQLiteSession` from Course #3 still works. This course supplements it with a Postgres-backed audit trail, not a replacement.
- Streaming events. `Runner.run_streamed` and `RunItemStreamEvent`. We log skill activations and MCP tool calls from these events (Decision 7).
- Guardrails. Input and output guardrails. This course doesn't add new guardrail concepts; you carry over what you have.
- Cloudflare Sandbox. `SandboxAgent` with `Shell()` and `Filesystem()`. This course's Worker still deploys to a sandbox; the system of record lives outside it (in Neon).
- The dual-tool pattern (Claude Code + OpenCode). Skills you write in `.claude/skills/` work in both. MCP server registration differs by tool config; the server itself is identical.
Stop signal. If "the agent loop is model → tool → history → loop, with max_turns capping it" reads as review, continue. If it feels like new material, stop and do Course #3 first. This course's worked example evolves Course #3's chat agent; reading without that foundation is friction.
A.3: Postgres essentials this course uses
If you have written one CRUD application in any language, you're calibrated. The things you'll see:
- Tables, primary keys, foreign keys. Six tables in our schema, each with a UUID primary key and explicit foreign keys to its parent.
- Indexes. Regular B-tree indexes on the lookup columns; HNSW indexes on vector columns. Indexes accelerate queries; they cost on inserts.
- Transactions. `BEGIN ... COMMIT` (or `async with conn.transaction():` in asyncpg) groups multiple writes so they all happen or none do. The audit write and the action it logs go in one transaction, per Concept 10.
- JSONB columns. Postgres's native JSON type. Stores arbitrary key-value data, queryable with JSON operators. Used in `audit_log.payload` and `audit_log.result`.
- Extensions. `CREATE EXTENSION vector` enables pgvector. Other extensions exist (pg_trgm for text search, postgis for spatial); we only need pgvector.
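The vector-specific pieces above can be sketched as SQL. Table and column names here are illustrative, not the course's actual migration; the operators and index type are standard pgvector.

```sql
-- Enable pgvector, then give embeddings a typed vector column.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE embeddings (
    id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    chunk     TEXT NOT NULL,
    model     TEXT NOT NULL,             -- which embedding model wrote this row
    embedding VECTOR(1536) NOT NULL
);

-- HNSW index for the cosine-distance operator <=>
CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops);

-- Top-5 nearest chunks to a query vector $1, nearest first
SELECT chunk, embedding <=> $1 AS distance
FROM embeddings
ORDER BY embedding <=> $1
LIMIT 5;
```

The `model` column is what makes the "re-embed on mismatch" troubleshooting step possible: you can only detect an insert-time/query-time model mismatch if every row records which model produced it.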
Stop signal. If "JOIN tickets ON tickets.customer_id = customers.id" reads as obvious, continue. If JOIN syntax is unfamiliar, the worked example will read as friction. Pick up a 90-minute Postgres tutorial first.
A.4: How to read this page on first pass
Same rule as the previous course in the series:
- Expand on first read: anything labeled "What you'll see," "Sample transcript," "Expected output," "Verify." These contain the runnable behavior to check predictions against.
- Skip on first read: anything labeled "What `SKILL.md` looks like in full," "The complete migration SQL," and other full-file listings in Part 4's worked example. The narrative above each block tells you what changed; you only need the file contents when you actually build.
- Optional throughout: the "Try with AI" blocks at the end of each concept. Extension prompts for Claude Code or OpenCode; skip without guilt if your tool isn't set up.
The goal of first pass is to internalize the three-layer model (Skills are the capability layer, the Neon system of record is the state layer, MCP is the connector) and the way they sit on top of the OpenAI Agents SDK + Cloudflare Sandbox stack you already know. Second pass with your hands on the keyboard is where you build.
A.5: Glossary
These are the terms most likely to trip a reader on first encounter. Each is explained again in context as it appears.
- Skill: A folder with a `SKILL.md` file and optional scripts, references, and assets. The folder is the skill; the file inside it is the entry point. An agent loads skills via progressive disclosure: name+description at startup, full instructions when triggered, referenced files on demand.
- SKILL.md: The skill's entry file. YAML frontmatter with `name` and `description` (and optional metadata), then a markdown body with the instructions the agent follows when the skill activates.
- Progressive disclosure: The three-stage skill-loading model. Discovery: agent reads names and descriptions of all available skills at startup. Activation: agent reads the full `SKILL.md` of the matching skill when a task triggers it. Execution: agent loads referenced files (or runs bundled scripts) on demand during execution.
- System of record (SoR): Authoritative store of state the Worker reads from and writes to. The thesis term for "the database that holds the truth." For this course: a Neon Postgres database.
- Neon: A managed Postgres service with serverless branching, scale-to-zero, and a free tier. Its differentiator versus other managed Postgres is branching (copy-on-write database copies in seconds) and its first-class MCP server.
- pgvector: A Postgres extension that adds a `vector` column type plus distance operators for similarity search. Lets one database hold both relational data and embedding-based semantic search.
- Embedding: A fixed-length numerical vector representing a piece of text (or other data) such that semantic similarity maps to vector distance. Generated by an embedding model (`text-embedding-3-small` is the OpenAI default).
- MCP (Model Context Protocol): An open standard for how AI agents connect to external tools, resources, and prompts. Defines a client/server architecture, three primitives (tools, resources, prompts), and three transports (stdio, SSE, streamable HTTP).
- MCP server: A program that exposes capabilities (tools/resources/prompts) to MCP clients. The Neon MCP server is one example; you can write your own in Python or TypeScript.
- MCP client: The agent-side counterpart that connects to MCP servers, lists their capabilities, and surfaces them to the model. The OpenAI Agents SDK has an MCP client built in.
- Tool (MCP): One of three MCP primitives. A function the model can invoke. From the model's perspective, an MCP tool looks similar to a `@function_tool`, but the implementation lives in the MCP server, not the agent's process.
- Resource (MCP): One of three MCP primitives. A read-only data source the agent can fetch. Files, database query results, API responses. Read but not write.
- Prompt (MCP): One of three MCP primitives. A reusable prompt template the server provides for the model to invoke. Less common than tools and resources; useful for standardized templates across teams.
- Audit log: A database table that records every meaningful action a Worker takes (every tool call, every database write, every capability invocation) in a form the company can replay, query, and reason about after the fact.
A.6: What this appendix does NOT replace
The full curriculum includes courses on the manager layer (Paperclip), the meta-layer (hiring AI Workers at runtime), and the trigger gateway (Inngest). None of those are summarized here, because they extend rather than precede this course. Read them when you reach them; this course's Worker doesn't need them yet.