Eval-Driven Development for AI Employees: A Multi-Track Crash Course
*15 Concepts • Four learning tracks. Reader track: 3-4 hours pure conceptual reading (no setup, no lab — for leaders, strategists, and non-engineer readers who want to understand the discipline). Beginner / Intermediate / Advanced tracks: 1-3 days each (conceptual reading plus increasing lab depth, building real eval suites against the four-tool stack — OpenAI Agent Evals with trace grading, DeepEval, Ragas, Phoenix). Total honest estimate: 3-4 hours for the Reader track; 2-3 days for a team to ship the full discipline. Pick your track before Decision 1 — see the "Four learning tracks" section below.*
🔤 Three terms to know before you read any further (if you've done Courses 3-8, you already know these — skip to the plain-English version below).
The whole course rests on three concepts. Beginners benefit from seeing them defined plainly before they appear elsewhere:
- Agent. A piece of software that, given a natural-language task, can decide what to do — call functions, look things up, send messages, hand work to other agents, eventually respond. Not a chatbot (which just talks). An agent acts. The customer support assistant that reads your ticket, looks up your account, issues a refund, and sends you a confirmation is an agent. Course Three of the Agent Factory track teaches how to build them.
- Tool. A specific function or capability the agent can use — like customer_lookup(email) or refund_issue(account_id, amount) or send_email(to, subject, body). The agent decides which tool to call and with what arguments; the developer writes the tool's actual code. Evaluating an agent partly means evaluating whether it picks the right tools with the right arguments.
- Trace. A complete record of one agent run — every model call, every tool call, every handoff to another agent, every guardrail check, in order. Think of it as the agent's audit log for one task. "Trace grading" — which appears in the stats line above and many times below — means using an AI grader to read these audit logs and judge whether the agent did the right thing. You don't need to understand the technical implementation yet; you just need to know a trace is the agent's execution history that an eval can grade.
Two more terms used heavily that the glossary defines fully: eval (a test that measures behavior — was the response correct, the tool right, the reasoning sound) and rubric (a scoring guide that defines what "correct" means for a given task, used by graders to produce consistent scores). The full glossary appears two sections below.
Plain-English version — start here if you want the human version first. (Technical readers can skip down to "Course Nine teaches eval-driven development..." below.)
Across the last six courses we built AI agents that work — they hold conversations, use tools, draft documents, route customer issues, hire other agents, and act on the owner's behalf. The honest question we haven't answered yet is: how do we know they're working correctly? Not "did the code run" — we already test that. Not "did the agent reply" — we already log that. The question is whether the agent did the right thing the right way: picked the correct tool, called it with the correct arguments, respected its envelope, grounded its answer in the right source material, escalated when it should have. That question is not answered by the unit tests, the integration tests, or the human eyeballing a demo. It's answered by evals — a new kind of test that measures behavior instead of code. Course Nine teaches you to design evals, run them, wire them into your development workflow, and use them to improve your agents — the same way TDD taught a previous generation of software engineers to ship code with confidence.
🧭 Before you keep reading — is this course right for you? This course wraps a cross-cutting discipline around everything Courses Three through Eight built. Three things will make it hard if you haven't done those courses:
- The worked example is Maya's customer-support company from Courses Five-Eight (Tier-1 Support, Tier-2 Specialist, Manager-Agent, Legal Specialist, plus Claudia the Owner Identic AI). The eval suites we build measure those specific agents. If you don't have them, the Simulated track (using sample traces and mock agent outputs) is the right path; the Full-Implementation track will be difficult.
- The lab uses four eval frameworks — OpenAI Agent Evals (with trace grading), DeepEval, Ragas, and Phoenix — installed and wired together. If you're new to Python testing frameworks generally, Module 4's DeepEval setup is the friendlier on-ramp; the trace grading section (Decision 3) assumes you've used the OpenAI Agents SDK.
- Course Nine evaluates what was built, not how to build it. If you haven't internalized why each Course 3-8 invariant exists, you won't know what the evals are protecting.
What you can still get from reading anyway, even cold: the eval-driven development thesis (Concepts 1-3 make the case that evals are to agentic AI what TDD was to SaaS); the 9-layer evaluation pyramid (Concept 4 — a vocabulary for talking about agent reliability that transfers to any agent stack); the honest frontiers (Part 5 — where the discipline is solid, where it's still emerging, where it breaks). If you're an engineering leader, ML platform owner, or strategist trying to understand what production-grade agentic AI actually requires, the first half of Course Nine is genuinely accessible.
If you want the prereq path: Course Three → Course Four → Course Five → Course Six → Course Seven → Course Eight. Plan ~3-5 days end-to-end.
Course Nine teaches eval-driven development (EDD). EDD is the discipline of measuring agent behavior with the rigor that test-driven development (TDD) gave software teams for measuring code. Courses Three through Eight built the architecture of an AI-native company — the agent loop, the system of record, the operational envelope, the management layer, the hiring API, the Owner Identic AI. Those six courses left one question unanswered: is each piece of the architecture actually working correctly in production? Course Nine adds the measurement layer that answers it. Without it, the architecture is buildable but not trustworthy. Trustworthy is the bar production agents have to meet.
Course Nine — what this closes for the track. Course Nine is not a tenth architectural invariant; it is the cross-cutting discipline that turns the eight thesis invariants from built into measurably trustworthy. Every Worker built in Courses 3-7, every hire authorized in Course 7, every delegated decision Claudia makes in Course 8 gets an eval suite that proves the architecture is doing what it promises. The analogy is exact: SaaS engineering became reliable when teams adopted TDD as a discipline, not because TDD was a new invariant in the SaaS architecture. Eval-driven development is the same shape — a discipline that wraps the architecture, not a layer in it. After Course Nine, the Agent Factory curriculum is structurally complete.
The architect's thesis sentence — the lead and the closer. "In the age of agentic AI, evals are as important as test-driven development was in the age of SaaS. If test-driven development gave SaaS teams confidence in code, eval-driven development gives agentic AI teams confidence in behavior. The two phrases together — confidence in code, confidence in behavior — are the whole shift. Code is deterministic; behavior is probabilistic. Tests verify the former; evals verify the latter. A serious agent team practices both."
Known rough edges I'd rather you see than not.
- The four-tool eval stack (OpenAI Agent Evals with trace grading, DeepEval, Ragas, Phoenix) is moving fast as of May 2026. The course teaches the stable architectural surfaces of each — the concepts of trace evaluation, repo-level eval discipline, RAG-specific metrics, and production observability — not the specific API shapes, which will drift between versions.
- Eval datasets are the load-bearing artifact and the most undervalued. Course Nine spends real time on dataset construction (Concept 11 + Decision 1) because a beautiful eval framework on a bad dataset is worse than no eval at all — it measures the wrong thing with rigor.
- TDD analogies break in specific places. The course is honest about where TDD's discipline carries over to EDD (the loop shape, the regression discipline, the CI/CD integration) and where it fundamentally fails (deterministic vs probabilistic outputs, drift across model versions, context-dependent correctness). Concept 2 names this directly.
- Production evals are easier to talk about than to ship. Phoenix gives you observability; turning observed traces into production evals that actually improve the agent is an operational discipline most teams underestimate. Concept 13 names where teams fail.
- The "what evals can't measure" frontier is real and worth naming. Pattern-matching behavior is evaluable; alignment with user values at the edge cases is not, fully. Concept 14 is honest about this rather than pretending evals close every gap.
TL;DR — the four claims of Course Nine.
- Traditional tests are necessary but insufficient for agentic AI. Unit tests verify code; integration tests verify wiring; neither verifies behavior. Agents are probabilistic, multi-step, tool-using, and context-sensitive. The behaviors they produce can't be tested with assert statements on return values.
- The architectural answer is a 9-layer evaluation pyramid that extends traditional testing rather than replacing it: unit → integration → output evals → tool-use evals → trace evals → RAG evals → safety evals → regression evals → production evals. Each layer catches failure modes the others miss.
- The recommended stack is OpenAI Agent Evals with trace grading for agent behavior, DeepEval for repo-level evals (pytest-for-LLM-behavior), Ragas for the knowledge layer, Phoenix for production observability. Each tool plays a specific role; together they are the eval-driven development toolkit.
- The discipline matters more than the tooling. No prompt change ships without an eval run. No tool change ships without an eval run. No model upgrade ships without an eval run. The eval suite is the regression net that makes agentic AI development feel like engineering rather than guesswork.
If the four claims above lost you, scroll back up to the plain-English version at the top of the page — that's the same content for non-technical readers.

Are you ready?
- You completed Courses Three through Eight, or have built the equivalent: an Inngest-wrapped Worker (Course Five), a Paperclip management layer with the approval primitive (Course Six), a hiring API (Course Seven), and Maya's Owner Identic AI on OpenClaw (Course Eight). The worked example throughout Course Nine is Maya's company; if it doesn't exist, the Simulated track is the right path.
- You're comfortable with Python testing frameworks — pytest specifically, or at least the concept of test cases, assertions, fixtures, and CI runs. DeepEval (the repo-level eval framework) is structured like pytest; if pytest is unfamiliar, complete a one-hour pytest tutorial before Decision 2.
- You're comfortable reading and writing JSON schemas. The golden dataset (Decision 1), the trace-grading rubric definitions (Decision 3), and Phoenix's trace inspection (Decision 7) all use JSON. No advanced schema work required, just fluency.
- You have either a Claude Managed Agents setup or an OpenAI Agents SDK account. Courses 3-7 taught both runtimes — Course Nine evaluates both. The lab's primary worked example (Maya's agents) runs on Claude Managed Agents and uses Phoenix's evaluator framework for trace evals (the tightest-fit eval surface for Claude-runtime agents, since the Claude Agent SDK's tracing is OpenTelemetry-native); the equally-supported alternative path uses OpenAI Agent Evals with Trace Grading for readers whose agents are on the OpenAI Agents SDK. Concept 8 covers both paths in detail. You don't need to migrate runtimes to do Course Nine. Claude users: you'll use Phoenix as your trace-eval layer (Decision 7's setup serves double duty). OpenAI users: check platform.openai.com/docs/guides/agents. Simulated track readers get pre-recorded trace samples for both runtimes — the GitHub repository has them.
- You have Python 3.11+, Node.js 20+, Docker, and basic familiarity with CI/CD. Phoenix (the observability layer) runs as a containerized service; DeepEval and Ragas are Python packages; the trace-grading client is JS/Python.
New here? Course Nine is the ninth of nine — here's the on-ramp. Course Nine wraps a discipline around what Courses 3-8 built; without that foundation, several Concepts in Part 1 will reference architecture you haven't seen. Work backwards if the prereqs above are unfamiliar: Course Eight is the immediate prerequisite (Maya's Owner Identic AI is the worked example for trace evals); Course Seven is the hiring API; Course Six is the management layer with approval primitive; Course Five is the Inngest envelope; Course Three is the agent loop. You can also read Course Nine cold for the discipline and skip the lab — the conceptual content is independently valuable.
Four learning tracks — pick yours
Course Nine works for four different depths. Pick your track explicitly before Decision 1; the conceptual content is designed to work for all four, and the lab is designed for tracks 2-4.
| Track | Time commitment | What you complete | Who it's for |
|---|---|---|---|
| Reader (pure conceptual) | ~3-4 hours, no lab | Concepts 1-4 + Concept 14 (what evals can't measure) + Part 6 closing. No Python setup, no framework installs, no labs. The discipline lands; the implementation is deferred. | Engineering leaders, ML platform owners, strategists, product managers, and curious-but-non-engineer readers who want to understand what EDD is and why it matters without building it. Also the right entry point for someone deciding whether to commit time to the Beginner track later. |
| Beginner | ~1 day total (conceptual + light lab) | Reader track content + Decision 1 (golden dataset) + Decision 2 (DeepEval output evals) + one tool-use eval. Stop there. | Software engineers new to agentic-AI evaluation; the goal is to internalize the discipline and ship a minimal eval suite. Requires Python 3.11+ familiarity. |
| Intermediate | ~2 days (1-day sprint after conceptual reading) | Beginner track + Decisions 3 (trace grading) + 5 (Ragas RAG evals) + the full Part 2 conceptual content. | Engineering teams who want the four-layer pyramid covered conceptually and three frameworks wired up. |
| Advanced | ~3 days (2-day workshop after conceptual reading) | Intermediate track + Decisions 4 (safety evals on Claudia), 6 (CI/CD wiring), 7 (Phoenix + production observability) + Part 5 (honest frontiers). The complete EDD discipline. | Production teams shipping the discipline; the full recommended implementation sequence. |

Track-fork guidance. Curious-but-non-engineer readers and leaders making decisions about EDD investment should start with the Reader track — 3-4 hours, no setup, and at the end you'll know whether your team should commit to the Beginner or higher track. Beginners should not feel pressure to complete the Advanced track on first pass. The discipline is iterative; teams typically graduate Reader → Beginner over a sprint, Beginner → Intermediate over weeks, and Intermediate → Advanced over months as production usage matures. Standalone readers (not from the Agent Factory curriculum) should default to the Reader track first, then assess whether the Beginner track's Simulated mode (see Part 4) is the right next step. Agent Factory students with Courses 3-8 already shipped should follow the Advanced track in Full-Implementation mode.
What you'll have at the end (concrete deliverables)
Reader track produces understanding, not artifacts. By the end of the Reader track, you can: explain why agentic AI needs behavior measurement beyond unit tests; describe the 9-layer evaluation pyramid in your own words; name the four-tool stack and what each tool covers; articulate where EDD is solid and where it's honestly limited. That's enough to decide whether your team should invest in the Beginner-or-higher track.
Beginner, Intermediate, and Advanced tracks produce concrete artifacts. By the end of the lab, depending on which track you picked, you will have built:
- A 20-50 case golden dataset (Decision 1 — Beginner and up) — categorized by task type, stratified by difficulty, version-controlled, with documented conventions. (A sample case is sketched just below.)
- Output evals running in DeepEval (Decision 2 — Beginner and up) — answer relevancy, faithfulness, hallucination, and task-completion metrics covering the Tier-1 Support agent's most common task categories.
- At least one tool-use eval (Decision 2 with extension, or Decision 3 for the trace-aware version — Beginner and up) — verifying the agent called the right tool with the right arguments.
- One trace-based eval (Decision 3 — Intermediate track and up) — running through OpenAI Agent Evals with trace grading on captured agent traces.
- One RAG eval (Decision 5 — Intermediate track and up) — Ragas's five-metric framework on TutorClaw, the knowledge agent introduced for this layer.
- One CI gate (Decision 6 — Advanced track) — a GitHub Actions or equivalent workflow that blocks PRs when critical metrics regress.
- One Phoenix dashboard or simulated trace replay (Decision 7 — Advanced track) — production observability over real or replayed traces, with the trace-to-eval promotion pipeline wired.
The Beginner track stops at the first three deliverables; the Intermediate track adds the next two; the Advanced track adds the final two. Each track is internally complete — there is no Beginner-track deliverable that depends on a deliverable from a higher track.
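To make the first deliverable concrete, here is a minimal sketch of what one golden-dataset case can look like, written as the Python dict you would load from a version-controlled JSON file. The field names (category, difficulty, expected_behavior, args_must_include) are illustrative conventions, not a schema any of the four frameworks requires; Decision 1 walks through choosing your own.

```python
# One golden-dataset case. Illustrative field names, not a required schema.
golden_case = {
    "id": "billing-refund-007",
    "category": "billing",      # task type, used to stratify the dataset
    "difficulty": "medium",     # easy / medium / hard
    "input": "I was charged twice for my November subscription. Please refund the duplicate.",
    "expected_behavior": {
        "tools": [
            {"name": "customer_lookup", "args_must_include": {"email": "jsmith@example.com"}},
            {"name": "refund_issue", "args_must_include": {"amount": 89.00}},
        ],
        "must_escalate": False,
        "acceptable_output": "Confirms one refund for the duplicate charge and gives a timeline.",
        "unacceptable_output": "Issues a refund without verifying which account holds the charge.",
    },
    "notes": "Added after the wrong-customer incident discussed in Concepts 1 and 3.",
}
```

The point is not the exact fields but that every case states, in advance, what correct behavior looks like; the output, tool-use, and safety evals all grade against these stated expectations.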
Vocabulary you'll meet in this course
Course Nine uses vocabulary from across the Agent Factory track plus several new terms specific to eval-driven development. Terms grouped by what they describe.
Glossary
Eval-driven discipline:
- Eval-driven development (EDD) — the discipline of measuring agent behavior with the same rigor TDD gave SaaS teams for measuring code. Every prompt, tool, or workflow change ships only after the eval suite confirms it didn't regress.
- Golden dataset — a curated set of representative tasks with expected behavior, acceptable/unacceptable outputs, and required tool usage. The load-bearing artifact of EDD; eval quality is bounded by dataset quality.
- Eval — a test that measures behavior (was the agent correct, helpful, safe, well-grounded) rather than code (did the function return the expected value). May produce a graded score (0-5), a pass/fail, or a categorical judgment.
- Rubric — a scoring guide that defines what "correct" means for a given task. Used by graders to produce consistent eval scores.
- Grader — the mechanism that produces the eval score: a human (slow, expensive, accurate), an LLM-as-judge (fast, cheap, sometimes biased), or a deterministic rule (fast, free, only works for some metrics).
The evaluation pyramid: the seven agent-specific layers (output, tool-use, trace, RAG, safety, regression, production) sit on top of the SaaS-foundation layers (unit, integration). Each layer catches failures invisible to the layers below it. The full nine-layer taxonomy with definitions is in Concept 4 — this Glossary won't restate it.
The four-tool stack:
- OpenAI Evals — OpenAI's hosted eval platform. Dataset management, output evals at scale, model-vs-model comparison, experiment tracking, hosted dashboards. The output-and-dataset half of OpenAI's eval offering.
- OpenAI Agent Evals (with trace grading) — OpenAI's hosted agent-evaluation platform. "Agent Evals" is the broader product (datasets, eval runs, model-vs-model comparison, hosted dashboards); "trace grading" is the trace-aware capability within it (reads agent traces from the OpenAI Agents SDK ecosystem directly and runs trace-level assertions on tool calls, handoffs, guardrails). Together they are the primary agent eval framework for OpenAI Agents SDK-based agents.
- DeepEval — open-source, pytest-style eval framework. Runs in the project repository, fits into CI/CD, feels familiar to developers who know pytest.
- Ragas — open-source RAG-specific eval framework. Provides retrieval-quality, faithfulness, context-relevance, and answer-correctness metrics for knowledge-layer agents.
- Phoenix — open-source observability and evaluation platform. Production traces, dashboards, experiment comparison, sampling for eval datasets.
- Braintrust — the commercial alternative to Phoenix; introduced as the upgrade path in Concept 10 and Decision 7 for teams that want a polished collaborative product with hosted infrastructure.
- LLM-as-judge — using an LLM (typically a larger model than the one being evaluated) to grade the output of a smaller agent. Standard in all four products for behavior metrics that aren't deterministic.
Cross-course concepts:
- Worker / Digital FTE — a role-based AI agent the company hired (Courses 4-7). The unit Course Nine evaluates.
- Owner Identic AI — the human owner's personal AI delegate, runs on OpenClaw (Course 8). Course Nine evaluates its delegated-governance decisions specifically.
- Authority envelope — the bounds on what a Worker is allowed to do (Course 6). Safety evals verify Workers respect their envelopes.
- Activity log / Governance ledger — the audit trails from Courses 6 and 8. Production evals sample from these to construct future eval datasets.
- MCP — the open Model Context Protocol that agents use to read and write the system of record (Course 4). RAG evals measure the quality of the MCP-served knowledge.
Operational vocabulary:
- Test fixture / eval example — one entry in the golden dataset (one task, one expected behavior).
- Pass threshold — the minimum score on a given metric that constitutes a passing eval. Set per metric, per agent role, often per task category.
- Drift — the phenomenon of agent behavior changing over time without the code changing, typically because the underlying model has been updated or retrained. Regression evals catch drift; production evals quantify it.
- Eval-of-evals — measuring whether your evals are themselves measuring what you think they measure. The honest-frontier problem of EDD (Concept 14).
What you bring forward from Courses Three through Eight
If you've just finished Course Eight, skim and move on. If you're picking this up cold or it's been a while, the five bullets below are the load-bearing pieces of context the rest of Course Nine depends on — read them carefully.
- From Course Three (the agent loop): Workers built on the OpenAI Agents SDK have traces — structured records of every model call, tool call, handoff, and guardrail check inside a run. Trace grading (Decision 3) reads these. If your Workers were built on a different SDK, Concept 8 covers the substrate-portability story.
- From Course Four (the system of record): Workers read and write authoritative data through MCP servers. Course Four's worked example uses a knowledge-base MCP for product documentation. Decision 5 evaluates that knowledge layer with Ragas.
- From Course Six (the management layer): Paperclip's activity_log and cost_events tables capture every Worker action. Production evals (Decision 7 + Concept 13) sample from these to build future eval datasets.
- From Course Seven (hiring API + talent ledger): Every hire produces an eval-pack run before approval. Course Nine teaches what those eval packs actually measure; Course Seven introduced the interface, Course Nine teaches the implementation.
- From Course Eight (Owner Identic AI + governance ledger): Maya's Identic AI Claudia signs and resolves delegated approvals. The governance ledger records every Claudia decision with confidence, reasoning summary, and layer source. Course Nine's Decision 4 (safety + envelope evals) uses these records to verify Claudia stayed within her delegated envelope.
Full recap: where Courses Three through Eight left things
From Course Three: Workers are agent loops built on the OpenAI Agents SDK (or Claude Agent SDK; the patterns transfer). Each run produces a trace: a structured tree of model calls, tool calls, handoffs, and guardrail checks. The SDK's tracing UI lets you inspect any run's full execution path.
From Course Four: Workers read and write through MCP servers. The system-of-record pattern keeps authoritative data outside the agent's context window — the agent fetches what it needs at the right granularity. Knowledge-layer MCPs (product docs, internal wikis, customer history) are where retrieval quality genuinely matters.
From Course Five: Workers run inside Inngest's durable-execution wrapper. Every step is logged. step.wait_for_event is the durable pause used for approval flows. If a Worker crashes mid-run, Inngest replays from the last successful step. This durability is what makes long-running evals feasible.
From Course Six: Paperclip is the management layer. The activity_log records every Worker action. The cost_events table records every model and tool call's cost. Approval gates use the wait_for_event primitive. The authority envelope cascade (company → role → issue → approval-level) is what bounds Worker behavior.
From Course Seven: Hiring is a callable capability. The Manager-Agent detects capability gaps and proposes new hires. Each hire goes through an eval-pack runner that scores candidates on four dimensions before the board approves. The talent ledger records every hire, eval, retirement. The eval-pack runner is the prototype of Course Nine's discipline; Course Nine generalizes it to all agent-quality measurement.
From Course Eight: Maya has an Owner Identic AI (Claudia) running on OpenClaw. Claudia signs delegated approvals with ed25519; Paperclip verifies signature + envelope before resolving. The governance ledger records every Claudia decision with principal, confidence, layer_source, reasoning_summary. The two-envelope intersection (Maya's authority ∩ Claudia's delegated subset) is the boundary safety evals enforce.
What's left after Course Eight: the architecture is buildable end-to-end. What's missing is a way to prove it works correctly in production. That's Course Nine.
Cross-course evaluation map
Course Nine evaluates everything Courses 3-8 built. This table maps each prior course to the eval layer that primarily measures it. This is the architectural commitment of Course Nine — not just "evals matter" but "this eval covers that course's primitive."
| Course | What it built | Eval layers that measure it | Course Nine touchpoint |
|---|---|---|---|
| Three | The agent loop (model + tools + handoffs) | Output evals (the agent's final response), Tool-use evals (right tool, right args), Trace evals (the full execution path) | Concepts 5-6, Decisions 2-3 |
| Four | System of record via MCP, Skills | RAG evals (retrieval, grounding, faithfulness) | Concept 7, Decision 5 |
| Five | Operational envelope (Inngest durability) | Regression evals (does the agent behave consistently across runs?), Production evals (what real runs look like) | Concepts 12-13, Decisions 6-7 |
| Six | Management layer (Paperclip + approval primitive) | Safety/policy evals (envelope respect, approval-gate triggering), Production evals (sampling from activity_log) | Decisions 4, 7 |
| Seven | Hiring API + talent ledger | Eval packs (the four-dimension scoring at hire time) — Course Nine generalizes this primitive | Concept 4 (the eval pack pattern), Decision 1 |
| Eight | Owner Identic AI + governance ledger | Trace evals (Claudia's reasoning chain), Safety evals (delegated-envelope respect), Regression evals (drift in Claudia's judgment) | Decisions 3, 4, 6 |
The thesis-aligned framing: the eight invariants describe what an AI-native company is built from. Course Nine teaches how to measure whether each invariant is actually working. The discipline is the bridge from architecture to trustworthy production.
Cheat sheet — the 15 Concepts
| # | Concept | Part | One-line summary |
|---|---|---|---|
| 1 | Why traditional tests aren't enough for agents | 1 | Probabilistic, multi-step, tool-using systems need behavior measurement, not code measurement. |
| 2 | The TDD analogy and its limits | 1 | TDD's red-green-refactor loop carries to EDD; TDD's determinism assumption breaks. Honest about both. |
| 3 | What "behavior" means for agents | 1 | Final answer ≠ trace ≠ path. Evaluating only the final answer misses the most consequential failures. |
| 4 | The 9-layer evaluation pyramid | 2 | Unit → integration → output → tool-use → trace → RAG → safety → regression → production. Each layer catches what the others miss. |
| 5 | Output evals | 2 | The accessible starting point. What they catch: correctness, format, hallucination. What they miss: process failures. |
| 6 | Tool-use and trace evals | 2 | For tool-using agents, the path matters as much as the result. Trace evals are the agentic equivalent of integration tests with internal assertions. |
| 7 | RAG evals | 2 | Knowledge-layer agents have three failure modes (retrieval, grounding, citation). Each needs its own metric. |
| 8 | The trace-eval layer per runtime | 3 | Phoenix evaluators for Claude-runtime agents (Maya's primary); OpenAI Agent Evals + Trace Grading for OpenAI-runtime agents — same discipline, two platform UIs. |
| 9 | DeepEval for repo-level discipline | 3 | Pytest-for-agent-behavior. Brings evals into the developer workflow rather than the research notebook. |
| 10 | Ragas + Phoenix | 3 | Ragas evaluates the knowledge layer; Phoenix observes production. The two together complete the stack. |
| 11 | Golden dataset construction | 5 | The most undervalued artifact. Eval quality is bounded by dataset quality; bad datasets measure confusion. |
| 12 | The eval-improvement loop | 5 | Define task → run agent → capture trace → grade → identify failure mode → improve prompt/tool → rerun. Ship only when behavior improves. |
| 13 | Production observability and the trace-to-eval pipeline | 5 | Phoenix gives you traces; turning traces into eval examples is an operational discipline most teams underestimate. |
| 14 | What evals can't measure | 5 | Pattern behavior is evaluable; novel-edge alignment isn't, fully. Honest about the gap rather than pretending evals close every hole. |
| 15 | Eval-driven development as foundational discipline | 6 | EDD takes its place alongside TDD as one of the foundational reliability disciplines of software engineering — and what comes next. |
Part 1: The Discipline
The thesis of Courses 3-8 was that an AI-native company is buildable end-to-end — engines, system of record, durability, management layer, hiring, delegate. The thesis Course Nine adds is that buildable is not trustworthy. Anyone who has shipped a Worker into production and watched it occasionally fail in a confusing way knows this. The Worker passes its unit tests. The integration tests are green. The agent demo went well. And yet — in production — it sometimes picks the wrong tool, sometimes ignores a constraint it acknowledged in training, sometimes confabulates an answer when it should have escalated. Why? Because none of those tests measured the thing that's actually failing: the agent's behavior under conditions the tests didn't anticipate.
Part 1 makes that case concrete, then introduces the architectural response: a discipline of measuring behavior that extends — not replaces — the testing disciplines you already know. Three Concepts.
Concept 1: Why traditional tests aren't enough for agents
A unit test for a function asks: given this input, does the function return this output? The discipline is decades old, the tooling is mature, the developer ergonomics are excellent. A failure is unambiguous — the assertion either passes or fails, the reproduction case is the test itself, the fix is local. Software engineering became reliable when teams adopted this discipline; the production systems we trust today (banks, hospitals, flight control) are built on rigorous unit and integration testing.
Now consider what changes when the "function" is an AI agent.
The input is not a concrete value — it's a natural-language task, often ambiguous, sometimes context-dependent. The output is not a return value — it's a sequence of model calls, tool invocations, intermediate decisions, handoffs to other agents, retries, eventual response. The "function" is not deterministic — the same input can produce different outputs across runs, across models, across time. None of the assumptions a unit test rests on hold for an agent.
Specifically, an agent is:
- Probabilistic. The same model with the same prompt can produce different outputs on different runs. Sometimes the variation is acceptable — different phrasings of the same correct answer. Sometimes it's catastrophic — one run picks the right tool, another picks the wrong one. A test that runs once and passes proves nothing about the next run. Reliable evaluation requires running the agent many times against the same input and grading the distribution of behavior.
- Multi-step. A useful agent rarely produces one model call and stops. It plans, calls tools, observes results, plans again, calls more tools, hands off to other agents, eventually responds. Each step can succeed or fail. A test that checks only the final response can pass on a run where every intermediate step did the wrong thing. The agent "got lucky" and stumbled into a correct answer despite a broken process. (Same reason an engineer doesn't ship code based on "it compiled and ran" — compilation success is necessary but vastly insufficient for correctness.)
- Tool-using. Modern agents read databases, call APIs, search documentation, invoke other agents. Tool use is where agents stop being chatbots and start being workers. Did the agent use the right tool? With the right arguments? In the right order? Did it interpret the result correctly? Each question is its own evaluation problem — distinct from whether the final response was correct.
- Context-sensitive. Agents behave differently depending on what's in their context — which documents they retrieved, which prior messages are in the conversation, which Skills are installed, which model is running them. A test that works in isolation can fail when the agent runs with realistic production context. And vice versa. Evaluating an agent requires evaluating it in representative contexts, not just minimal ones.
- Connected to external systems. Agents read from databases, write to ticket systems, send messages, update calendars, execute code. Their behavior has side effects. A traditional unit test mocks out the external world. An agent eval has two harder paths: (a) run against staging-equivalent infrastructure, accepting the latency and cost, or (b) build careful mocks that reproduce the agent-relevant behavior of those systems. Neither is as easy as the unit-test happy path.
The implication is not that traditional tests are obsolete. They aren't. Course Nine's first phase of the lab (Decision 1) starts by ensuring traditional tests still exist — unit tests on tools, integration tests on the durability layer, API tests on the Paperclip surface. These remain essential. What's new is the layer of evaluation that sits above them and measures the agent itself.
Course Nine names this layer behavior evaluation, or evals for short. A test verifies code; an eval verifies behavior. The two are complementary, not substitutes. A serious agent team practices both.
Here's how the distinction maps to a concrete failure mode from the Course 5-8 worked example. Suppose Maya's Tier-1 Support agent receives a customer ticket about a billing error. The traditional tests on the agent's code all pass: the Inngest wrapper starts correctly, the agent's tools (the customer-lookup API, the refund-issuance API) are integration-tested and working, the response-generation function returns a string. But in production, on this particular ticket, the agent looks up the wrong customer (similar email, different account), confirms the refund applies to that customer's purchase history, and issues an $89 refund to the wrong person. No traditional test catches this failure, because every component worked correctly — the failure is in the agent's reasoning about which customer to look up. Only a behavior eval (tool-use eval, in this case — "was the right argument passed to the customer-lookup tool?") catches it.
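As a sketch of what that behavior eval looks like in practice, assuming only that your harness can capture the agent's tool calls as a list of dicts (every framework in this course exposes some form of this), and with the record shape and helper name invented for illustration rather than taken from any specific framework:

```python
# A tool-use eval for the wrong-customer failure above. The tool-call record
# shape here is an illustrative assumption, not a specific framework's API.
def customer_lookup_used_right_email(tool_calls: list[dict], ticket_email: str) -> bool:
    """Pass only if every customer_lookup call used the email from the ticket."""
    lookups = [c for c in tool_calls if c["name"] == "customer_lookup"]
    if not lookups:
        return False  # the agent never verified the customer at all
    return all(c["args"].get("email") == ticket_email for c in lookups)

# The failing run from the example: similar email, wrong account.
captured_calls = [
    {"name": "customer_lookup", "args": {"email": "jsmith.biz@example.com"}},
    {"name": "refund_issue", "args": {"account_id": "acct_4412", "amount": 89.00}},
]
assert customer_lookup_used_right_email(captured_calls, "jsmith@example.com") is False
```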
The same pattern shows up across the Course 3-8 architecture. The Course Seven hiring API can pass all its tests while the Manager-Agent recommends a hire that doesn't match the gap. The Course Eight governance ledger can record a valid signature on an envelope-respecting decision that nonetheless contradicts how Maya herself would have decided. The interesting failures of agentic systems live above the layer of traditional testing. Evals are how we get to them.
PRIMM — Predict before reading on. Maya's Tier-1 Support agent (Course 5-6) handles 200 customer tickets per day. Maya has installed unit tests on every tool the agent uses, integration tests on the Paperclip approval primitive, and a synthetic end-to-end test that runs ten realistic customer scenarios nightly. All tests are green. The agent has been in production for six weeks.
Predict before reading on: what fraction of agent failures in production would you expect this test suite to catch? Specifically, of the failures Maya would consider "agent did the wrong thing," what fraction would the green test suite have flagged in advance?
- 80-100% — strong test coverage like this should catch almost everything
- 40-60% — catches the easy ones, misses the subtle ones
- 10-30% — catches code bugs, misses agent-reasoning bugs
- Less than 10% — tests verify code; almost all agent failures are behavior failures
Pick one before reading on. The answer, with reasoning, lands at the end of Concept 3.
Bottom line: traditional tests verify code; agentic AI requires verifying behavior. Five properties of agents — probabilistic, multi-step, tool-using, context-sensitive, side-effecting — make unit-test discipline necessary but vastly insufficient. The architectural response is not to discard traditional testing but to add a complementary layer (evals) above it that measures the agent's behavior the same way tests measure the code's correctness. Concept 1 makes the case for that layer's necessity; the rest of Course Nine builds it.
Concept 2: The TDD analogy and its limits
The most useful frame for understanding eval-driven development is by analogy to test-driven development. TDD was the discipline that made SaaS engineering reliable. Before TDD, code shipped when it ran in development; after TDD, code shipped when it passed its tests. The shift was not in the tooling (test frameworks existed before TDD became disciplined practice) but in the workflow: tests were written before the code, every code change ran the test suite, regressions were caught at change-time rather than at incident-time. CI/CD made the discipline automatic. Production reliability improved by an order of magnitude.
EDD is the same shape. Before EDD, agents shipped when they demoed well; after EDD, agents ship when their eval suite passes. The shift is in the workflow: evals are written before the agent change (or at least concurrently with it), every prompt/tool/model change runs the eval suite, regressions are caught at change-time rather than in production. CI/CD makes the discipline automatic. Production reliability of agents improves by the same kind of margin.
This analogy is useful and load-bearing for the rest of Course Nine. We will return to it repeatedly: when introducing DeepEval (Concept 9 — "pytest-for-agent-behavior"); when introducing regression evals (Concept 12 — "the eval suite is the regression net that lets you ship"); when introducing the eval-improvement loop (Concept 12 — "red, green, refactor"). The shape of TDD as a discipline carries over to EDD.
But the analogy also breaks in specific places that matter. Honest pedagogy requires naming where.
Where TDD carries over to EDD:
- The loop shape. Red-green-refactor in TDD becomes "failing eval, passing eval, refactor prompt/tool/workflow" in EDD. Both disciplines write the failure case first, get to passing, then improve.
- The regression net. TDD's regression suite keeps yesterday's correctness from being broken by today's change. EDD's eval suite does the same for behavior. Both make change safe.
- The CI/CD integration. TDD's tests run on every commit; mature shops won't merge code that fails the suite. EDD's evals run on every prompt/tool/model change; mature shops won't ship an agent change that regresses the eval suite.
- The dataset as artifact. TDD's test fixtures (sample inputs, expected outputs) are version-controlled, reviewed, and treated as part of the codebase. EDD's golden dataset is the same — version-controlled, reviewed, evolved over time.
- The team discipline. TDD took ten years of advocacy before becoming mainstream practice in SaaS engineering. EDD is at the equivalent of TDD's early-2000s adoption curve. The shape of the transition — from "we should test" to "we won't ship without tests" — is the same shape EDD is going through now.
Where TDD's assumptions break for EDD:
- Determinism. A TDD test on a pure function is deterministic — given the same input, the function produces the same output. The assertion either passes or fails. An eval on an agent is probabilistic. The same input can produce different outputs across runs. The eval has to grade a distribution of behavior, not a single point. This changes the math of "passing." Instead of result == expected, an eval looks like pass_rate >= threshold across N runs (see the sketch after this list). The discipline is the same; the underlying statistical model is different.
- Drift. A TDD test on a pure function gives the same result on Tuesday as it did on Monday. An eval on an agent can give different results on Tuesday, because the underlying model has been retrained, fine-tuned, or upgraded between then and now. Drift is the EDD-specific failure mode TDD has no analog for. Regression evals (Concept 12) and production evals (Concept 13) are the discipline responses. Both are EDD-native rather than borrowed from TDD.
- Context-dependent correctness. A TDD test on a pure function tests one input. An agent's "correct behavior" depends on the entire context window — conversation history, installed Skills, which model is running. EDD requires testing the agent in representative contexts, not isolated inputs. This is much harder to scope. The golden dataset has to be constructed with care (Concept 11).
- Cost. A TDD test costs a millisecond of compute. An eval on an agent costs model-call API fees (sometimes substantial) plus the time of every tool the agent invokes. Running the eval suite has a non-trivial budget. Teams optimize which evals run on every commit, which run nightly, which run weekly. EDD has an economic dimension TDD does not.
- Grader subjectivity. A TDD assertion is unambiguous — result == expected returns true or false. An eval's grader has to judge whether a natural-language response is "correct, helpful, well-grounded, safe." That judgment is itself an AI problem when the grader is an LLM, and itself an expense when the grader is a human. The grader is not an oracle. It has its own failure modes — LLM-as-judge bias, human grader inconsistency. Concept 14 returns to this honestly.
- The "passing" target moves. In TDD, "the test passes" is binary. Once you write the assertion, it either holds or it doesn't, and you fix the code until it holds. In EDD, "the eval passes" is a graded measurement on a moving target. What counts as "good enough" depends on the agent's role, the task category, the deployment context. Setting eval thresholds is a judgment call TDD never asked of you.
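A minimal sketch of that statistical shift, assuming nothing more than a harness function that runs the agent once and a grader that returns pass/fail for a single run (both are stand-in names, not a framework API):

```python
# The EDD version of "the test passes": a pass rate over N runs, not one assertion.
def pass_rate(case: dict, run_agent, grade, n: int = 10) -> float:
    """Fraction of n independent runs the grader marks as passing."""
    passes = 0
    for _ in range(n):
        output = run_agent(case["input"])   # probabilistic: may differ on every run
        passes += int(grade(output, case))  # grader returns True/False for this run
    return passes / n

THRESHOLD = 0.9  # a judgment call, set per metric, per agent role, per task category
# ship only if: pass_rate(case, run_agent, grade) >= THRESHOLD
```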
The synthesis Course Nine teaches: treat the TDD analogy as a guide to the discipline shape but not as a complete specification of how EDD works. The loop, the regression-net mindset, the CI/CD integration, the dataset-as-artifact — these all transfer. The determinism, the cost economics, the grader problem, the threshold-setting — these are EDD-native and require new thinking.
Bottom line: EDD is best understood through the TDD analogy, but only critically — the analogy carries on workflow, loop, regression discipline, and CI/CD integration; it breaks on determinism, drift, context-dependence, cost, grader subjectivity, and threshold-setting. Course Nine teaches the discipline at its strongest where the analogy carries, and names the EDD-native challenges where the analogy doesn't. Pretending the analogy is complete would mislead teams trying to implement EDD; pretending the analogy fails entirely would discard the most useful framing available.
Concept 3: What "behavior" means for agents — final answer vs trace vs path
What exactly are we evaluating when we evaluate an agent? The answer determines what the eval suite can catch and, more importantly, what it can miss.
The naive answer is "the agent's response." If the agent answered the customer's question correctly, the agent behaved correctly. This is the easiest eval to write and the most popular starting point — and it is profoundly insufficient.
Consider Maya's Tier-1 Support agent again. A customer asks for help with a billing dispute. The agent produces a response: "I've processed an $89 refund for the duplicate charge on November 12. The refund will appear on your statement within 3-5 business days." The response is correct in form, polite in tone, action-completing. An output eval would pass this.
Now look at what the agent actually did:
- Read the customer's message — correctly identifying it as a refund request.
- Called the customer-lookup tool — passing the customer's email as the lookup key.
- The lookup returned three matches (the email belongs to two different accounts, one a personal account and one a small-business account; the third is a flagged duplicate).
- The agent picked the first result without checking which account matched the disputed charge.
- Looked up recent charges on that account — found an $89 charge from November 12 that coincidentally also looked refundable.
- Issued the refund.
- Composed the response above.
The output is correct. The behavior is incorrect. The agent refunded the wrong customer a charge that happened to match the dispute amount. The real customer didn't get their refund. The wrong customer got a free $89. Three months later, the auditor catches it. By then, dozens of similar mismatches have happened. The reason: the agent's reasoning about disambiguating between accounts is broken. Nothing in the output eval caught it, because the response always looks correct.
This is the core insight of Concept 3: the agent's "behavior" is its full execution path, not just its final response. Evaluating only the final response is like grading a student exam by reading only the last paragraph. You'll catch students who explicitly conclude wrongly. You'll miss the ones who reasoned wrongly and arrived at the right conclusion by accident. (In production, both kinds of failure happen.)

The three levels of agent behavior, each requiring its own eval layer:
Level 1: The final output. What the agent ultimately said or did. This is what users see. Output evals (Concept 5) grade this layer. What output evals catch: factual errors, format violations, hallucinations, refusals that shouldn't have been refusals, unsafe content. What output evals miss: every failure where the output happens to look correct despite a broken process.
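Here is what a Level 1 output eval can look like in DeepEval's pytest style (the framework Concept 9 and Decision 2 cover). The metric and call pattern follow DeepEval's documented shape at the time of writing, but exact APIs drift between versions, so treat this as the shape rather than the letter; run_tier1_agent is a stand-in for your own agent harness.

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def run_tier1_agent(question: str) -> str:
    """Stand-in for your real harness; replace with a call to the Tier-1 Support agent."""
    return "I've processed an $89 refund for the duplicate charge on November 12."

def test_billing_refund_output():
    question = "I was charged twice for my November subscription. Please refund the duplicate."
    response = run_tier1_agent(question)
    test_case = LLMTestCase(input=question, actual_output=response)
    # LLM-as-judge metric: did the response actually address the refund request?
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```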
Level 2: The tool-use record. What tools the agent called, with what arguments, in what order, and how it interpreted the results. Tool-use evals (Concept 6) grade this layer. What tool-use evals catch: wrong tool selection, wrong arguments passed, incorrect interpretation of tool results, unnecessary tool calls (cost and latency), missed tool calls (the agent should have looked something up but didn't). What tool-use evals miss: failures in the reasoning between tool calls. The agent picks the right tool with the right arguments, but does so based on a flawed plan that wasn't visible in the tool calls themselves.
Level 3: The full trace. The complete execution path: model calls, tool calls, handoffs, guardrail checks, intermediate reasoning, retries, error handling. Trace evals (Concept 6 and Concept 8) grade this layer. What trace evals catch: the reasoning failures that produce correct tool calls; the handoff failures where the agent escalated to the wrong specialist; the guardrail bypasses; the retry storms that indicate the agent is stuck; the path-of-least-resistance failures (the agent picked an easy answer when a harder one was correct). What trace evals don't fully solve: they require structured traces (the Course 3 OpenAI Agents SDK provides them; other SDKs do too), and they require graders that can read traces — usually LLM-as-judge configurations that have their own evaluation problems.
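For a feel of how a trace grader works under the hood, here is the LLM-as-judge pattern in its simplest form: serialize the trace, hand it to a grader model with a rubric, parse a score. This illustrates the pattern only; it is not the hosted OpenAI Agent Evals trace-grading product or Phoenix's evaluator API, and the rubric text, model name, and trace shape are placeholders.

```python
import json
from openai import OpenAI

RUBRIC = """Score this agent trace from 0 to 5.
5 = sound reasoning, right tools, right arguments, escalated when required.
0 = the path was wrong even if the final answer looked correct.
Return JSON: {"score": <int>, "reason": "<one sentence>"}"""

def grade_trace(trace: dict, model: str = "gpt-4o") -> dict:
    """LLM-as-judge over a serialized trace; returns {"score": int, "reason": str}."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},  # ask the grader for parseable JSON
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": json.dumps(trace)},  # the captured execution path
        ],
    )
    return json.loads(resp.choices[0].message.content)
```

The grader here is itself a model with its own failure modes, which is exactly the grader-subjectivity problem Concept 2 named and Concept 14 returns to.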
The three levels are not alternatives. They are a stack. Output evals are easier to write and cheaper to run, so they should run frequently. Trace evals are more expensive but catch failures the output evals can't see, so they should run on every meaningful change. Tool-use evals sit between the two and are essential for any tool-using agent. A serious EDD discipline uses all three.
Why this stratification matters for Course Nine specifically. Each layer of the architecture you built in Courses 3-8 fails in a way that maps to one of the three levels. The Tier-1 Support agent's wrong-customer failure is a tool-use failure (Level 2). Claudia's hypothetical "approved a refund Maya wouldn't have approved" is a trace failure (Level 3) — Claudia's reasoning produced a signed action that passed the envelope check but contradicted Maya's actual judgment patterns. The Manager-Agent recommending a hire that doesn't fit the gap is a path failure (Level 3) — the recommendation looks correct but the reasoning that produced it skipped a step the human would have taken.
The behavior the eval suite measures determines the failures the eval suite catches. Output-only evals would let all three of these failures through. The full stack — output + tool-use + trace — catches each one at the level where it actually breaks.
The answer to the Concept 1 PRIMM Predict. The honest answer is the third or fourth option: a test suite as described catches roughly 10-30% of agent failures in production, sometimes less. Unit tests catch tool bugs (the customer-lookup API returned malformed data) and integration bugs (the Paperclip approval primitive didn't fire). They do not catch agent-reasoning failures (wrong customer disambiguation, wrong tool selection, hallucinated facts, broken handoff logic), which constitute the majority of production failures for any serious agent. This is exactly why output evals + tool-use evals + trace evals are necessary in addition to the traditional test stack — not in place of it.
*Bottom line: agent behavior has three levels — the final output, the tool-use record, and the full trace. Each level has its own failure modes; each requires its own eval layer. Output-only evaluation, the easiest starting point, misses the majority of consequential agent failures. The discipline Course Nine teaches uses all three layers as a stack: output evals for fast feedback, tool-use evals for the workhorse correctness check, trace evals for the failures invisible at the output layer. The agent's behavior is the path, not just the destination.*
Part 2: The Evaluation Pyramid
Part 2 expands the output → tool-use → trace stratification from Concept 3 into a full nine-layer pyramid — the architectural taxonomy of agent evaluation. The pyramid is the most important conceptual artifact of Course Nine; every eval suite you'll build maps to one or more layers, and the layers are not interchangeable. Four Concepts.
Concept 4: The 9-layer evaluation pyramid
A reliable agentic AI application needs evaluation at multiple layers, the same way a reliable SaaS application needs testing at multiple layers (unit → integration → end-to-end → manual QA → monitoring). Agentic AI's layers extend the SaaS testing pyramid rather than replacing it. The full nine layers:

The nine layers fall into three groups (a more precise grouping than a naive "carryover from SaaS" framing). Foundation (layers 1-2) — unit tests and integration tests — carries over directly from SaaS testing tradition and remains necessary in agentic AI. LLM/Agent evaluation (layers 3-6) — output evals, tool-use evals, trace evals, RAG evals — is the agentic-AI native discipline this course teaches; output evals belong here, not in the foundation group, because grading natural-language responses is fundamentally an LLM-evaluation problem rather than a code-correctness problem (this is where DeepEval, Agent Evals' output-grading runs, and Ragas all operate). Operational reliability (layers 7-9) — safety evals, regression evals, production evals — is the discipline that turns a working eval suite into a production-grade reliability practice, regardless of which framework you used to build it.
Three observations about the pyramid before drilling into each layer.
Observation 1: each layer catches failures invisible to the layers below. A unit test passes. An integration test passes. An output eval passes. A tool-use eval fails — the agent picked the wrong tool. The tool-use eval has caught a failure that the three layers below it cannot see. The pyramid isn't redundant; it's layered defense, the way a serious software-quality discipline uses unit + integration + e2e + monitoring not because they overlap but because they catch different things.
Observation 2: cost and frequency trade off as you go up. Unit tests are nearly free and run on every commit. Integration tests cost more (real infrastructure) and run on most commits. Output evals cost model-call API fees and run on every meaningful agent change. Trace evals cost more (longer runs, deeper inspection) and run on every prompt/tool/model change. Production evals operate on sampled traces from real usage and run continuously but in the background. The discipline budgets where each layer runs in the CI/CD pipeline based on cost and the failure modes it catches.
Observation 3: dataset overlap, eval-suite distinctness. A single example in the golden dataset (Concept 11) can be graded by multiple eval layers — the same customer-refund task is graded by an output eval ("was the refund correct?"), a tool-use eval ("did the agent call refund-issuance with the right amount?"), a trace eval ("did the agent verify the customer's account before issuing?"), and a safety eval ("did the agent stay within the auto-approval threshold from Course Six's Concept 9?"). One dataset, four evals, four different scores. The dataset is the substrate; the eval suites are the lenses.
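A small sketch of what "one dataset, four evals" means operationally: the same case and the same captured run scored through four lenses. The case and run field names are illustrative, the per-lens checks are deliberately simplified stand-ins for real graders, and auto_approval_limit borrows Course Six's threshold idea for the safety lens.

```python
# One case, one run, four lenses: four independent scores.
def four_lenses(case: dict, run: dict) -> dict:
    refunds = [c for c in run["tool_calls"] if c["name"] == "refund_issue"]
    return {
        # Layer 3 (output eval): does the reply address the refund at all?
        "output": "refund" in run["final_output"].lower(),
        # Layer 4 (tool-use eval): right tool, right amount.
        "tool_use": any(c["args"].get("amount") == case["expected_amount"] for c in refunds),
        # Layer 5 (trace eval, simplified): the path verified the customer before refunding.
        "trace": any(c["name"] == "customer_lookup" for c in run["tool_calls"]),
        # Layer 7 (safety eval): every refund stays under the auto-approval threshold.
        "safety": all(c["args"].get("amount", 0) <= case["auto_approval_limit"] for c in refunds),
    }
```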
Walking through each of the nine, with what it catches and the Course-3-8 architecture it primarily measures:
Layer 1 — Unit tests. Verify deterministic code: tool functions, utility modules, data transformations, schema validation, API helpers, database access. These remain essential. Architecture they cover: the tool implementations in Course Three's agent loop, the MCP server code in Course Four, the Inngest step functions in Course Five, the Paperclip API endpoints in Course Six. A failing unit test means the code under the agent is broken, which fails the agent for reasons that aren't its fault.
Layer 2 — Integration tests. Verify that components work together: API contracts, database transactions, queue behavior, authentication, external service integration. Especially important for agentic systems because tool failures often look like model failures from the outside. When an agent appears to fail, the first diagnostic is often whether the integration tests on the tools are still green — if a downstream API has changed shape, the agent will appear to behave wrongly when the actual failure is integration-level. Architecture they cover: the same components as unit tests but at the inter-component level. Especially the Paperclip approval primitive (Course Six) and the durability layer (Course Five) — both have integration tests that have to stay green for higher-layer evals to mean anything.
Layer 3 — Output evals. Grade the agent's final response or final artifact. Did the agent answer correctly? Did it follow the requested format? Did it avoid hallucination? Did it satisfy the user's goal? The easiest layer to understand and the most popular starting point. Concept 5 takes this up in detail. Architecture they cover: every agent's response — the Tier-1 Support agent's customer reply, the Manager-Agent's hire proposal, Claudia's escalation summary to Maya. Necessary for fast feedback, insufficient on its own.
Layer 4 — Tool-use evals. Check whether the agent selected the right tool, passed the correct arguments, handled the response properly, and avoided unnecessary tool calls. Concept 6 takes this up in detail. Architecture they cover: the tool-using behavior of every Worker in Courses 3-8. The first eval layer where the eval is genuinely agent-specific — output evals can be adapted from traditional QA; tool-use evals are new.
Layer 5 — Trace evals. Evaluate the internal execution path: model calls, tool calls, handoffs, guardrails, retries, intermediate reasoning. Trace evals are the agentic equivalent of replaying the game tape after the match — the final score matters, but the coach wants to know how the team played. Concept 6 covers the conceptual structure; Concept 8 covers the OpenAI Agent Evals implementation (with trace grading). Architecture they cover: the multi-step reasoning of every Worker. Especially Claudia's signed-delegation decisions in Course Eight — the trace shows what evidence she consulted, which standing instruction she matched on, what confidence she assigned.
Layer 6 — RAG and knowledge evals. Evaluate retrieval quality, source relevance, grounding, faithfulness, and answer correctness relative to the retrieved context. Required for any agent that depends on a knowledge base, vector database, MCP-served knowledge layer, or documentation. Concept 7 takes this up in detail. Architecture they cover: Course Four's MCP-served knowledge bases, any agent that does retrieval before answering. The most common production failure mode for agents is retrieval failure — the agent has the right reasoning but the wrong source material — and traditional output evals frequently misdiagnose this as agent failure.
Layer 7 — Safety and policy evals. Check whether the agent follows constraints, avoids unsafe actions, protects sensitive data, respects permissions, and escalates to a human when needed. Critical for agents that can send emails, change calendars, update databases, execute code, or interact with customer systems. Architecture they cover: the authority envelope from Course Six (does the Worker stay within its bounds?), the auto-approval policy from Course Seven (does the Manager-Agent correctly identify which hires should bypass the human?), the delegated envelope from Course Eight (does Claudia respect the bounds Maya set?). The most consequential failures of agentic AI are safety failures, and these evals are not optional.
Layer 8 — Regression evals. Compare current behavior against previous behavior. Did the latest change make the agent better or worse? Every prompt change, model change, tool change, memory change, or workflow change should be measured against a stable eval dataset. Concept 12 covers this as part of the eval-improvement loop. Architecture they cover: every change to every agent across Courses 3-8. Regression evals are what makes shipping agent changes feel like engineering rather than guesswork.
Layer 9 — Production evals. Use real traces, user feedback, sampled conversations, and operational metrics to evaluate the system after deployment. Production evals turn real behavior into better development datasets, creating a continuous improvement loop. Concept 13 covers the operational discipline. Architecture they cover: the activity_log and governance_ledger from Courses Six and Eight, which are the raw material for production evals. The hardest layer to operationalize and the one most teams underestimate — Concept 13 is honest about why.
The pyramid is not a checklist where every layer needs equal attention. A pragmatic team starts at the bottom and works up, adding layers as the agent's complexity and the deployment stakes increase. Concept 12's eval-improvement loop describes the iteration; Decision 1 in the lab walks through the practical first phase.
Bottom line: agent evaluation has nine distinct layers, grouped as Foundation (1-2: unit and integration tests, carried over from SaaS), LLM/Agent Eval (3-6: output, tool-use, trace, and RAG evals — the discipline's native contribution to agentic AI), and Operational Reliability (7-9: safety, regression, and production evals — the operational practice). Each layer catches failures invisible to the layers below it. A serious EDD discipline doesn't use all nine equally — it adds layers based on the agent's complexity and stakes. The pyramid is the vocabulary teams need to talk about agent reliability concretely rather than vaguely.
See an eval before you study the discipline
Before Concepts 5-7 deep-dive into the eval layers, here is what one eval actually looks like — one row of the golden dataset, one rubric, one grading output. Beginners benefit from seeing the object before studying the discipline; this is that object.
One golden-dataset row (JSON, illustrative — the dataset's schema is documented in Decision 1):
{
"task_id": "refund_T1-S014",
"category": "refund_request",
"input": "I see a duplicate charge of $89 on my November 12 statement. Can you refund the duplicate?",
"customer_context": {
"customer_id": "C-3421",
"account_age_days": 1247,
"prior_refunds": 0
},
"expected_behavior": "Verify the customer's account, confirm the duplicate charge exists, and issue a single refund of $89.",
"expected_tools": ["customer_lookup", "charge_history", "refund_issue"],
"expected_response_traits": [
"Acknowledges the dispute",
"Confirms the duplicate was found",
"States the refund amount and timeline"
],
"unacceptable_patterns": [
"Issues refund without verifying the charge exists",
"Refunds a different amount than the disputed charge",
"Promises a timeline shorter than 3-5 business days"
],
"difficulty": "easy"
}
A 10-row sample dataset (the Simulated track's seed — paste these into datasets/golden-sample.json and you can run Decision 2 immediately, without first building out Maya's company). Categories follow the full schema; difficulties span easy/medium/hard:
[
{
"task_id": "refund_T1-S001",
"category": "refund_request",
"input": "Charged twice for the $49 monthly plan in October. Please refund the duplicate.",
"customer_context": {
"customer_id": "C-2001",
"account_age_days": 412,
"prior_refunds": 0
},
"expected_behavior": "Verify account, confirm duplicate, issue single $49 refund.",
"expected_tools": ["customer_lookup", "charge_history", "refund_issue"],
"difficulty": "easy"
},
{
"task_id": "refund_T1-S002",
"category": "refund_request",
"input": "I cancelled last month but got charged again. I want a full refund and my account closed.",
"customer_context": {
"customer_id": "C-2002",
"account_age_days": 89,
"prior_refunds": 0
},
"expected_behavior": "Verify cancellation status; if cancellation valid, refund; close account; confirm both actions.",
"expected_tools": [
"customer_lookup",
"cancellation_status",
"refund_issue",
"account_close"
],
"difficulty": "medium"
},
{
"task_id": "account_T1-S003",
"category": "account_inquiry",
"input": "What's my current plan and when does it renew?",
"customer_context": {
"customer_id": "C-2003",
"account_age_days": 1847,
"prior_refunds": 2
},
"expected_behavior": "Look up plan and next-renewal date; respond with both.",
"expected_tools": ["customer_lookup", "plan_details"],
"difficulty": "easy"
},
{
"task_id": "technical_T1-S004",
"category": "technical_issue",
"input": "Sync mode says 'real-time' but my changes don't appear until I refresh manually. Is real-time sync broken?",
"customer_context": {
"customer_id": "C-2004",
"account_age_days": 234,
"prior_refunds": 0
},
"expected_behavior": "Acknowledge that the product offers batch sync only (not real-time); clarify the documentation; suggest enabling auto-refresh as the closest available option.",
"expected_tools": ["product_capabilities_lookup"],
"unacceptable_patterns": [
"Claims real-time sync is available when it is not"
],
"difficulty": "medium"
},
{
"task_id": "escalation_T1-S005",
"category": "escalation_request",
"input": "This is the third time I've contacted support about the same billing issue. I want to speak to a manager.",
"customer_context": {
"customer_id": "C-2005",
"account_age_days": 678,
"prior_refunds": 1,
"open_tickets": 2
},
"expected_behavior": "Acknowledge the frustration; check ticket history; escalate to Tier-2 with full context; provide expected response time.",
"expected_tools": [
"customer_lookup",
"ticket_history",
"escalation_create"
],
"difficulty": "medium"
},
{
"task_id": "refund_T1-S006",
"category": "refund_request",
"input": "I want a refund for the past 6 months because I wasn't using the service. The charges are $49/month.",
"customer_context": {
"customer_id": "C-2006",
"account_age_days": 1100,
"prior_refunds": 0,
"usage_last_6mo": "low"
},
"expected_behavior": "Empathize; explain that retroactive refunds for unused-but-not-cancelled accounts fall outside the standard refund window (30 days); offer a one-time goodwill credit if policy permits; if customer insists, escalate to Tier-2 for policy exception review.",
"expected_tools": ["customer_lookup", "policy_lookup", "escalation_create"],
"unacceptable_patterns": [
"Refunds 6 months without any approval",
"Refuses without explaining policy or offering escalation"
],
"difficulty": "hard"
},
{
"task_id": "policy_T1-S007",
"category": "policy_question",
"input": "What's your data retention policy if I cancel my account?",
"customer_context": {
"customer_id": "C-2007",
"account_age_days": 412,
"prior_refunds": 0
},
"expected_behavior": "Look up data-retention policy; respond with the specific retention windows for each data category (account metadata, content, billing records).",
"expected_tools": ["policy_lookup"],
"difficulty": "easy"
},
{
"task_id": "refund_T1-S008",
"category": "refund_request",
"input": "sarah@example.com — I see a duplicate $89 charge from Nov 12. Refund please.",
"customer_context": {
"lookup_email": "sarah@example.com",
"matching_accounts": 3,
"note": "email matches multiple accounts"
},
"expected_behavior": "Disambiguate the customer — three accounts match this email; ask the customer for a confirmation detail (last 4 of card, account ID, or other) before issuing any refund. Do NOT pick the first match.",
"expected_tools": ["customer_lookup", "account_disambiguation"],
"unacceptable_patterns": [
"Picks the first matching account without disambiguating",
"Issues a refund to any account before confirming which one is correct"
],
"difficulty": "hard"
},
{
"task_id": "technical_T1-S009",
"category": "technical_issue",
"input": "API returns 401 even though my key is correct. What's wrong?",
"customer_context": {
"customer_id": "C-2009",
"account_age_days": 156,
"prior_refunds": 0,
"plan": "free_tier"
},
"expected_behavior": "Check if the API endpoint requires a paid plan; if so, explain the limitation and the upgrade path; if not, walk through standard 401 debugging (key format, header name, expired token).",
"expected_tools": [
"customer_lookup",
"plan_details",
"api_endpoint_lookup"
],
"difficulty": "medium"
},
{
"task_id": "escalation_T1-S010",
"category": "escalation_request",
"input": "I'm a journalist working on a story about your company's data practices. Can someone respond to my media inquiry?",
"customer_context": {
"customer_id": "C-2010",
"account_age_days": 12,
"prior_refunds": 0,
"flags": ["media_inquiry"]
},
"expected_behavior": "Recognize this as a media inquiry, not a standard support request; do NOT answer substantively; route to the legal/PR team via the appropriate escalation channel; provide expected response timeframe.",
"expected_tools": ["escalation_create"],
"unacceptable_patterns": [
"Provides substantive answers about data practices without legal/PR review"
],
"difficulty": "hard"
}
]
Notice the dataset's shape: 3 refunds (one easy, one medium, one hard), 2 account-or-policy lookups (both easy), 2 technical issues (both medium), 2 escalations (one medium, one hard), 1 hard refund that's actually a disambiguation test (S008 — the wrong-customer-refund failure from Concept 3 distilled into one example). The distribution mirrors what Concept 11 calls a "stratified" dataset: roughly representative of production category mix, with explicit difficulty stratification, including the edge cases the agent is most likely to fail on. A complete production dataset would be 30-50 such rows (Decision 1); this 10-row sample is what the Simulated track readers paste in to get started.
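If you want to sanity-check that stratification as you grow the sample toward the 30-50-row dataset Decision 1 calls for, a few lines of Python can report the category and difficulty mix. A minimal sketch, assuming the 10-row sample above was saved to datasets/golden-sample.json as described; the script is illustrative, not part of the lab deliverables:
# Illustrative check of the golden dataset's category/difficulty mix.
# Assumes the 10-row sample above is saved at datasets/golden-sample.json.
import json
from collections import Counter

with open("datasets/golden-sample.json") as f:
    rows = json.load(f)

categories = Counter(row["category"] for row in rows)
difficulties = Counter(row["difficulty"] for row in rows)

print("categories:  ", dict(categories))
print("difficulties:", dict(difficulties))
# Expected for the sample above:
# categories:   {'refund_request': 4, 'account_inquiry': 1, 'technical_issue': 2,
#                'escalation_request': 2, 'policy_question': 1}
# difficulties: {'easy': 3, 'medium': 4, 'hard': 3}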
One rubric (markdown, illustrative — a Decision 2 output-eval rubric for answer_correctness):
# Rubric: answer_correctness
Given the customer's task and the agent's response, grade how correct the
response is on a 1-5 scale.
5 — Fully correct. Agent addresses the refund request, confirms the
duplicate charge with specific details, states the refund amount,
and gives the standard 3-5 business day timeline.
4 — Mostly correct. Minor omission (e.g., timeline phrased vaguely) but
the action and amount are right.
3 — Partially correct. The action is right but a key detail is wrong or
missing (e.g., wrong amount mentioned, no confirmation of which
charge was duplicated).
2 — Largely incorrect. The agent acknowledged the request but issued
the wrong action (refund denied when it should have been approved,
or refund issued without verification).
1 — Fundamentally wrong. The agent gave a confidently-stated response
that contradicts the expected behavior (e.g., claimed no duplicate
exists when one is on the statement).
Output: a single integer 1-5 followed by a one-sentence rationale
identifying which trait or unacceptable pattern drove the score.
One grading output (what the eval framework returns when run on this row):
example: refund_T1-S014
metric: answer_correctness
score: 4
rationale: "The agent confirmed the duplicate, issued the refund, and gave
a timeline — but the timeline was phrased as 'soon' rather than
the standard 3-5 business days, which is a minor omission."
threshold: 3 (configured per metric in Decision 2)
result: PASS
This is what one eval is. The discipline of Course Nine is building dozens to hundreds of these — across categories, across layers of the pyramid, across all the Course 3-8 invariants — and wiring them into CI/CD so regressions on critical metrics block merges. The full discipline is what Concepts 5-15 and Decisions 1-7 walk through. But every eval is fundamentally this shape: a dataset row, a rubric, a grader, a score. Start there.
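To make that shape concrete as code, here is a minimal sketch of the loop: load a row, run the agent, hand the rubric and the response to a grader, compare the score against the threshold. The helpers run_tier1_support_agent and grade_with_rubric are hypothetical placeholders (the first reappears in Concept 9's DeepEval example), and the rubric path is an assumed location for the rubric shown above; none of this is a specific framework's API:
# Illustrative end-to-end shape: one dataset row, one rubric, one grader call, one thresholded score.
# run_tier1_support_agent() and grade_with_rubric() are hypothetical helpers, not a framework API.
import json

THRESHOLD = 3  # per-metric pass threshold, as in the grading output above

def evaluate_row(row: dict, rubric: str) -> dict:
    # Run the agent on the row's input (hypothetical helper)
    response = run_tier1_support_agent(
        task=row["input"],
        customer_id=row.get("customer_context", {}).get("customer_id"),
    )
    # Grade the response against the rubric (hypothetical helper wrapping an LLM-as-judge call)
    score, rationale = grade_with_rubric(
        rubric=rubric,
        task=row["input"],
        expected=row["expected_behavior"],
        response=response,
    )
    return {
        "example": row["task_id"],
        "metric": "answer_correctness",
        "score": score,
        "rationale": rationale,
        "threshold": THRESHOLD,
        "result": "PASS" if score >= THRESHOLD else "FAIL",
    }

rows = json.load(open("datasets/golden-sample.json"))
rubric = open("rubrics/answer_correctness.md").read()  # assumed location for the rubric shown above
report = [evaluate_row(row, rubric) for row in rows]
print(f"{sum(r['result'] == 'PASS' for r in report)}/{len(report)} passed")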
Concept 5: Output evals — the accessible starting point and its limits
Output evals are the easiest eval layer to write and the most common starting point. This is good — accessibility matters, and a team that ships output evals quickly is better off than a team that overthinks the eval architecture and ships nothing. It is also a trap — teams that stop at output evals miss the failure modes that hurt the most in production.
Concept 5 takes up both sides: what output evals catch (and how to write them well), what they miss (and how to recognize when you've outgrown them).
What an output eval looks like. The agent receives a task. The agent produces a response. The eval grades the response on one or more metrics. Pseudo-code shape:
def eval_customer_refund_response(task, agent_response):
    # Metric 1: Did the agent answer the customer's question?
    answered = grade_with_llm(
        rubric="Did the response address the customer's billing dispute? Yes/No.",
        task=task,
        response=agent_response,
    )
    # Metric 2: Did the agent specify a concrete next step?
    actionable = grade_with_llm(
        rubric="Does the response specify what was done (e.g., refund issued, escalation filed)? Yes/No.",
        task=task,
        response=agent_response,
    )
    # Metric 3: Was the tone appropriate?
    tone = grade_with_llm(
        rubric="Is the tone professional and empathetic? Score 1-5.",
        task=task,
        response=agent_response,
    )
    return {"answered": answered, "actionable": actionable, "tone": tone}
Three metrics, three graders, three scores. The grader is typically an LLM — usually a larger or more capable model than the one running the agent, configured with a clear rubric. (Human grading is also valid for the highest-stakes evals; see the dataset-construction discussion in Concept 11.)
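A minimal sketch of what such a grade_with_llm helper could look like, assuming the OpenAI Python client as the judge; the helper name, prompt wording, and model choice are illustrative, and any judge model and client can stand in:
# Illustrative LLM-as-judge helper matching the pseudo-code above.
# Assumes the OpenAI Python client; the model choice and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()

def grade_with_llm(rubric: str, task: str, response: str, judge_model: str = "gpt-4o") -> str:
    """Ask a (stronger) judge model to grade an agent response against a rubric."""
    judge_prompt = (
        "You are grading an AI support agent's response.\n"
        f"Rubric: {rubric}\n\n"
        f"Customer task:\n{task}\n\n"
        f"Agent response:\n{response}\n\n"
        "Answer with the grade the rubric asks for, followed by a one-sentence rationale."
    )
    result = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,  # grading should be as deterministic as the judge allows
    )
    return result.choices[0].message.content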
What output evals catch well.
- Format violations. The agent was supposed to respond in JSON; it responded in prose. The eval rubric says "is the response valid JSON?" and grades fail.
- Refusals that shouldn't have been refusals. The agent refused a legitimate customer question, citing a safety concern that doesn't apply. An output eval with "did the agent answer the question?" catches the refusal.
- Obvious factual errors. The agent said "your account was opened on January 17, 2026" when the customer's account was opened in 2023. If the dataset includes the correct fact in the task metadata, the eval can compare against it.
- Hallucinations on grounded tasks. The agent invented a policy or feature that doesn't exist. An output eval comparing the response against the known-correct policy catches the invention.
- Tone and clarity. The agent's response was technically correct but rude or confusing. LLM-as-judge graders with clear rubrics catch this consistently enough to be useful.
What output evals miss systematically.
- Process failures with correct outputs. As Concept 3 showed with the wrong-customer-refund example, the response can look correct while the agent did the wrong thing. Output evals are blind to this.
- Unnecessary tool calls. The agent answered correctly but burned five extra tool calls (and several seconds and a dollar of compute) on the way. The output is fine; the process is wasteful. Tool-use evals catch this; output evals don't.
- Lucky correctness. The agent's reasoning was flawed but the response happened to be right anyway. Over enough runs, the flawed reasoning will produce wrong responses too; the output eval will start failing then, but by that point the agent has been in production making decisions on flawed logic. Trace evals catch the underlying problem earlier.
- Reasoning failures hidden by post-hoc rationalization. The agent's response includes a confident-sounding explanation that doesn't match what the agent actually did. Output evals grade the final explanation; they don't compare it against the trace. The agent can lie to itself (and to the eval) about what it did. Trace evals are the corrective.
The right role for output evals. They are the fast, cheap, frequent layer of the eval pyramid — the eval that runs on every commit. They catch the failures that are obvious enough to be visible at the response level. They are not the whole story, and a team that ships only output evals will believe their agent is more reliable than it actually is. This isn't a hypothetical; it's the modal pattern in 2025-2026 production agentic AI. The output eval scores look great; the production failures keep happening; the team concludes "evals don't work for agents." The honest diagnosis: their evals covered only one layer of the pyramid.
PRIMM — Predict before reading on. Maya is running an output-eval suite on her Tier-1 Support agent. The suite has 50 golden examples covering common customer scenarios, graded by GPT-4-class LLM-as-judge on four metrics (correctness, helpfulness, tone, format compliance). The suite passes 96% — only 2 examples fail. Maya considers herself done with eval setup.
Predict: what's the most likely pattern Maya is missing? Pick one before reading on:
- (1) The 2 failing examples are the actual problem — fix those, achieve 100%, you're done
- (2) The 96% pass rate is hiding tool-use failures that produce correct-looking outputs
- (3) The grader (GPT-4-class) is the same model running the agent, and is biased toward its own outputs
- (4) The 50-example dataset isn't representative of production traffic; failures concentrate in the long tail
The answer, with discussion, lands at the end of Concept 6.
Bottom line: output evals are the right starting point for any eval-driven discipline — accessible, cheap, fast. They catch format violations, obvious factual errors, hallucinations on grounded tasks, refusals that shouldn't have been, and tone problems. They miss the failures Course Nine spends its real teaching time on: process failures, unnecessary tool calls, lucky correctness, and post-hoc rationalization. Use output evals as the entry point and the fast-feedback layer; do not stop there.
Concept 6: Tool-use and trace evals — where the path matters as much as the result
For tool-using agents (which is to say, almost all production-grade agents from Course Three onward), the path the agent took matters as much as the result. Tool-use evals and trace evals are the two layers that grade the path. They are the workhorse layers of agentic AI evaluation, and the ones output-only teams most underestimate.
Tool-use evals: the question they answer.
Did the agent select the right tool? Pass the right arguments? Handle the response properly? Avoid unnecessary tool calls? These four questions correspond to four failure modes, each its own metric:
- Tool-selection metric. Given the task, was the chosen tool the correct one? An agent asked to look up a customer should call the customer-lookup tool, not the order-lookup tool. A grader compares the chosen tool against the expected tool (from the dataset's metadata) or against an LLM-as-judge rubric ("for this task, what tool should have been called?").
- Argument-correctness metric. Given the chosen tool, were the arguments correct? Wrong customer email, wrong order ID, wrong date range — all manifest as argument failures. A grader compares the arguments passed against the expected arguments, often with looser matching for natural-language fields and stricter matching for structured IDs.
- Response-interpretation metric. Given the tool's response, did the agent interpret it correctly? The customer-lookup tool returned three candidate accounts; did the agent disambiguate correctly, or pick the first? This is the metric the wrong-customer refund example in Concept 3 fails on.
- Efficiency metric. Did the agent make unnecessary tool calls? An agent that calls the same lookup three times "to be sure" is burning cost and latency; an agent that called five tools when one was sufficient is over-elaborate. A grader counts tool calls and compares against the dataset's expected minimum, flagging substantial overshoots.
Tool-use evals require structured trace data. Specifically, they require a record of every tool call with its arguments and response. The OpenAI Agents SDK and the Claude Agent SDK both produce this by default, as do most modern agent SDKs. If your agent runs through an SDK that doesn't produce structured tool-call records, tool-use evals are dramatically harder to write — you'd be parsing logs or relying on the agent to self-report, both unreliable. This is one of the substrate considerations Concept 8 takes up.
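A minimal sketch of how three of the four metrics above can be checked deterministically once structured tool-call records exist. The trace format here is a simplified list of dicts with illustrative field names, not any SDK's actual schema; the response-interpretation metric usually needs an LLM-as-judge rather than a rule, which is why it is omitted:
# Deterministic tool-use checks against a simplified, illustrative trace format.
# Each tool call is assumed to be a dict: {"tool": str, "arguments": dict, "response": ...}.

def check_tool_selection(tool_calls: list[dict], expected_tools: list[str]) -> bool:
    """Tool-selection metric: every expected tool was called, in any order."""
    called = [c["tool"] for c in tool_calls]
    return all(tool in called for tool in expected_tools)

def check_argument_correctness(tool_calls: list[dict], expected_args: dict[str, dict]) -> bool:
    """Argument-correctness metric: structured fields (IDs, amounts) must match exactly."""
    for call in tool_calls:
        expected = expected_args.get(call["tool"], {})
        for field, value in expected.items():
            if call["arguments"].get(field) != value:
                return False
    return True

def check_efficiency(tool_calls: list[dict], expected_max_calls: int) -> bool:
    """Efficiency metric: flag a substantial overshoot past the expected call budget."""
    return len(tool_calls) <= expected_max_calls

# Example: the duplicate-charge refund row (refund_T1-S001) from the sample dataset.
trace = [
    {"tool": "customer_lookup", "arguments": {"customer_id": "C-2001"}, "response": "..."},
    {"tool": "charge_history", "arguments": {"customer_id": "C-2001"}, "response": "..."},
    {"tool": "refund_issue", "arguments": {"customer_id": "C-2001", "amount": 49.00}, "response": "..."},
]
assert check_tool_selection(trace, ["customer_lookup", "charge_history", "refund_issue"])
assert check_argument_correctness(trace, {"refund_issue": {"amount": 49.00}})
assert check_efficiency(trace, expected_max_calls=4)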
Trace evals: the question they answer.
Did the agent's full execution path — model calls, tool calls, handoffs, guardrails, intermediate reasoning, retries, error handling — accomplish the task correctly, efficiently, and safely? Trace evals are the agentic AI equivalent of integration tests with internal assertions; they don't just check what happened at the boundaries (the inputs and outputs), they check what happened inside the run.
What a trace eval can catch that output and tool-use evals can't:
- Reasoning failures between correct tool calls. The agent called the right tool with the right arguments, but its plan for why to call it was wrong. A trace shows the model's reasoning between tool calls; a trace grader can assess whether the reasoning was sound.
- Handoff failures. In multi-agent systems, when Agent A hands off to Agent B, was the handoff appropriate? A trace shows the handoff decision and the context passed; a trace grader catches handoffs to the wrong specialist or premature handoffs that lose context.
- Guardrail bypasses. If the agent has guardrails (safety filters, policy checks), did they fire when they should have? Did the agent route around them? A trace shows guardrail invocations; a trace grader catches both false negatives (guardrail should have fired) and false positives (guardrail fired and unnecessarily blocked the agent).
- Retry storms. The agent encountered an error and retried. Once is normal; ten times in a loop is a stuck-loop pathology. A trace shows retry counts; a trace grader catches the pathology before it shows up in cost reports.
- Path-of-least-resistance failures. The agent had multiple ways to accomplish the task and picked the cheap-but-shallow one when a more careful approach was correct. A trace shows the path taken; a trace grader (or a comparison against a reference path in the dataset) catches the shortcut.
The challenge of trace evals: they require a grader that can read traces. Sometimes this is an LLM-as-judge with the trace embedded in its prompt; sometimes this is a deterministic rule (count the retries, check the handoff target); often it's a combination. OpenAI's trace grading capability (Concept 8) is built specifically for this — it has primitives for assertions on tool calls, handoffs, guardrails, and intermediate reasoning. DeepEval (Concept 9) has trace-aware metrics that work for OpenAI-Agents-SDK and other compatible runtimes.
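A minimal sketch of the deterministic side, the kind of rule a trace grader runs before or alongside an LLM-as-judge pass. The span structure here is a simplified assumption, not any platform's trace schema:
# Illustrative deterministic trace rules: retry storms and handoff targets.
# A "span" here is a simplified dict; real traces (OpenTelemetry, SDK-native) carry more structure.

def retry_counts(spans: list[dict]) -> dict[str, int]:
    """Count retried tool calls per tool (spans marked as retries of an earlier call)."""
    counts: dict[str, int] = {}
    for s in spans:
        if s["kind"] == "tool_call" and s.get("retry_of") is not None:
            counts[s["tool"]] = counts.get(s["tool"], 0) + 1
    return counts

def grade_trace(spans: list[dict], expected_handoff: str | None = None, max_retries: int = 2) -> dict:
    retries = retry_counts(spans)
    handoffs = [s["target_agent"] for s in spans if s["kind"] == "handoff"]
    findings = {
        "retry_storm": any(count > max_retries for count in retries.values()),
        "wrong_handoff": expected_handoff is not None and expected_handoff not in handoffs,
    }
    findings["result"] = "FAIL" if any(findings.values()) else "PASS"
    return findings
The judgments a rule can't express, such as whether the intermediate reasoning between tool calls was sound, are what the LLM-as-judge half of the grader handles.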
A concrete example tying tool-use and trace evals together: Claudia's signed-delegation behavior. When Claudia (the Owner Identic AI from Course Eight) decides to auto-approve a refund or escalate it to Maya, the decision goes through multiple steps: she polls Paperclip for pending approvals (tool call 1), she retrieves Maya's standing instructions for that decision class (tool call 2), she compares the request against the delegated envelope (internal reasoning), she signs the decision if approving (tool call 3), she posts the decision to Paperclip (tool call 4).
The output eval grades the final decision: was the refund correctly approved or correctly escalated? Important but insufficient.
The tool-use eval grades each step: did Claudia poll the right endpoint, retrieve the right instruction set, sign with the right key, post with the right principal ID? Catches important failures the output eval would miss.
The trace eval grades the reasoning: in the comparison step, did Claudia correctly map the request against the standing instructions? Did her confidence assignment match the historical pattern? Did she explain her decision in a way consistent with Maya's stated reasoning style? Catches the most important failure: Claudia produced a technically correct signed decision that contradicts how Maya herself would have decided.
Three layers, three different lenses on the same decision. No single layer would catch all three failure modes. This is why the pyramid exists.
The answer to Concept 5's PRIMM Predict. All four options are real risks, but the most common pattern in 2025-2026 production agents is (2) — the 96% pass rate on output evals is hiding tool-use failures producing correct-looking outputs. The output eval grader sees a polite, correct-sounding response and grades it pass; the wrong-customer refund happens silently; weeks pass before the auditor catches it. (1) is the answer Maya is tempted to believe and is almost always wrong. (3) is real (the LLM-as-judge bias toward its own outputs is documented) and is partly addressed by using a different model family for grading than for the agent. (4) is real (the 50-example dataset's representativeness is a Concept 11 problem) and Course Nine takes up dataset construction seriously. But the most important pattern to internalize is (2): output-eval scores systematically overstate agent reliability for tool-using agents. This is why tool-use and trace evals are not optional for production agentic AI.
Bottom line: tool-use evals grade the path (right tool, right arguments, right interpretation, no waste); trace evals grade the full execution including the reasoning that produced the tool calls. For tool-using agents, these layers are not optional — output-only evaluation systematically misses the most consequential failures. Tool-use evals are accessible and run on every change; trace evals are more expensive and run on every meaningful prompt/model/workflow change. Together with output evals (Concept 5), they form the core of the agentic AI eval discipline.
Concept 7: RAG evals — separating retrieval failures from reasoning failures
Concepts 5 and 6 covered the eval layers that apply to any tool-using agent. Concept 7 takes up the layer specific to knowledge-layer agents — agents that retrieve information from a knowledge base, documentation, vector database, or MCP-served system of record before answering. This is most production agents at scale; few useful agents work from pure model knowledge alone.
The architectural pattern from Course Four: the agent doesn't carry the company's entire knowledge in its context. Instead, when the agent needs information, it calls a retrieval tool (typically an MCP server backed by a vector database or document store), gets back the relevant passages, and reasons over them. This is retrieval-augmented generation — RAG, for short.
Why RAG agents need their own eval layer. A RAG agent has three failure modes that other agents don't:
- Retrieval failure. The agent asks the retrieval tool for "billing policy on duplicate charges" and the tool returns documents about shipping policy on duplicates. The retrieval is wrong; the agent's subsequent reasoning, however sound, produces a wrong answer because it was based on wrong source material. Output evals misdiagnose this as an agent reasoning failure.
- Grounding failure. The retrieval returned the right documents, but the agent's response includes claims that aren't supported by those documents — either invented or drawn from the model's pre-training. The agent appears confident; the customer-facing response sounds authoritative; the cited source doesn't actually support the claim. Output evals on the surface text miss this. Specialized grounding metrics catch it by checking whether each factual claim in the response is supported by the retrieved context.
- Citation failure. The retrieval was right, the answer was correctly grounded, but the agent failed to cite its source (or cited the wrong source). For knowledge-base agents in regulated industries — legal, medical, financial — citation failure is its own compliance problem. Output evals can grade for citation presence but not for citation correctness.
The Ragas framework (Concept 10's runtime) ships with specific metrics for each of these:
- Context relevance — given the user's question, was the retrieved context actually relevant? Catches retrieval failures at the top of the funnel.
- Faithfulness — given the retrieved context, do all claims in the answer follow from it? Catches grounding failures. The standard metric: each factual claim in the answer is checked against the retrieved context by an LLM-as-judge; the answer's faithfulness score is the fraction of claims that are supported. (A small sketch of this computation follows the list.)
- Answer correctness — given the user's question and the ground-truth answer (from the golden dataset), is the answer correct? Functions as a higher-level eval that combines grounding and accuracy.
- Context recall — given the ground-truth answer, what fraction of the supporting facts were actually retrieved? Catches retrieval failures from the other direction (the retrieval got some right context but missed key facts).
- Context precision — of the chunks retrieved, what fraction were genuinely relevant? Catches retrieval that returns too much noise alongside the signal.
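A small sketch of the faithfulness computation described above; claim_is_supported stands in for an LLM-as-judge call and is hypothetical:
# Illustrative faithfulness computation: each claim in the answer is judged against the
# retrieved context; the score is the fraction of supported claims.
# claim_is_supported() is a hypothetical helper wrapping an LLM-as-judge call.

def faithfulness_score(claims: list[str], retrieved_context: str) -> float:
    supported = sum(1 for claim in claims if claim_is_supported(claim, retrieved_context))
    return supported / len(claims) if claims else 1.0

# Example: 3 factual claims in the answer, 2 supported by the retrieved passages -> 0.67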
The diagnostic value of separated RAG metrics. Imagine a knowledge agent fails on a particular task. The output eval scores correctness at 2/5. Without RAG metrics, the team doesn't know whether to:
- Improve the agent's reasoning prompt (it might be reasoning poorly over correct context),
- Improve the retrieval logic (it might be reasoning correctly over wrong context),
- Improve the knowledge base itself (the right answer might not be in there at all), or
- Improve the chunking/embedding strategy (the right context exists but isn't being retrieved together).
Each of these failure modes has a different fix. Output evals alone don't tell you which fix is needed. RAG-specific evals decompose the failure into its components: was retrieval right? Was grounding right? Was citation right? Each metric points at a different layer of the knowledge stack and a different intervention.
This is why the worked example introduces TutorClaw in Decision 5 specifically. Maya's customer-support agents in Courses 5-8 do some retrieval (looking up customer history, fetching policy snippets) but aren't primarily RAG agents — their work is dominated by tool use and reasoning. TutorClaw, by contrast, is a teaching agent that retrieves from the Agent Factory book before answering — a much richer RAG surface, with retrieval over hundreds of passages, faithfulness questions about whether the teaching answer is supported by the book, and citation requirements (TutorClaw should cite which chapter/section it drew from). The Ragas evaluation pattern lands better when applied to an agent it was designed for. The same Ragas patterns transfer to any knowledge-heavy agent in Maya's company that needs them; TutorClaw is the teaching example.
The Course Four cross-reference: Course Four built the knowledge-layer architecture using MCP. Course Nine's RAG evals are what tell you whether that knowledge layer is doing its job. If retrieval accuracy is below threshold on your eval set, the fix is not in the agent's prompt — it's in Course Four's territory: the chunking strategy, the embedding model, the retrieval algorithm, the chunk-overlap policy. RAG evals are the diagnostic that tells you where to look.
Bottom line: knowledge-layer agents have three failure modes specific to retrieval: retrieval failure (wrong sources), grounding failure (claims not supported by sources), citation failure (sources missing or wrong). Each requires its own metric: context relevance, faithfulness, citation correctness, plus context recall and precision for retrieval diagnostics. Ragas (the framework in Decision 5) ships these metrics ready-to-use. Separating retrieval from reasoning lets the team diagnose where a knowledge-agent failure originated and which layer of the stack to fix. For any agent that does retrieval before answering, RAG evals are not optional.
Part 3: The Stack
Part 3 takes up the tooling: the specific frameworks that operationalize each pyramid layer, why each was chosen, and how they fit together. The discipline matters more than the tools, but tools that fit the discipline make it teachable. Three Concepts, one per tool category.

Concept 8: The trace-eval layer — Phoenix evaluators (Claude runtime) and OpenAI Agent Evals + Trace Grading (OpenAI runtime)
The trace-eval layer is where the agent's runtime matters most. For Maya's worked example agents — which all run on the Claude substrate — Phoenix's evaluator framework is the natural fit: Phoenix consumes the Claude Agent SDK's OpenTelemetry traces directly, runs trace-level rubrics with LLM-as-judge graders, and the same Phoenix instance doubles as the production-observability layer in Decision 7. For agents on the OpenAI Agents SDK, OpenAI's Agent Evals platform plus its trace-grading capability is the tightest fit: the platform, the trace-aware grader, and the agent's traces all live in the same ecosystem — no export, no re-serialization, no schema mismatch. Both paths grade traces against rubrics; the only difference is which platform's UI you click into. This Concept walks the OpenAI pair (Agent Evals + Trace Grading) first because the two-products-in-one-ecosystem story is the cleaner architectural example; the same shape applies to Phoenix's evaluators for the Claude path.
One platform, two complementary capabilities. OpenAI documents these as related-but-distinct guides — Agent Evals covers the broader platform; Trace Grading covers the trace-aware capability within it. A serious agent team uses both, in the same way a SaaS team uses unit testing infrastructure and integration testing infrastructure as complementary capabilities of one CI/CD platform.
- Agent Evals (the platform) handles datasets, eval runs, grading workflows, experiment tracking, model-comparison reports. The dataset you build in Decision 1 lives here. The model-vs-model comparisons (does GPT-5 outperform GPT-4o on your eval suite?) run here. The output-level evaluation discipline — does the final response match the expected behavior on this curated set of tasks — is what Agent Evals operationalizes at scale, with hosted infrastructure for running thousands of eval examples in parallel and dashboards for tracking score distributions over time.
- Trace grading (the capability) is the trace-aware extension specifically for agent traces. Where Agent Evals can grade outputs, trace grading reads the full execution path — every model call, every tool call, every handoff, every guardrail check inside an agent run — and runs assertions against it. Trace grading is what makes Layer 5 of the pyramid (Concept 4) operational in the OpenAI ecosystem.
Why both capabilities, not just one. Agent Evals without trace grading covers the bottom of the pyramid well — output evals, dataset management, regression tracking across models — but is blind to the trace layer where most agentic-AI failures actually live (Concept 6). Trace grading without the broader Agent Evals platform can grade individual traces but lacks the dataset infrastructure to do it at scale, run experiments across model variants, or track regressions over time. The two together cover the agent-evaluation surface in a way neither does alone, which is why Course Nine pairs them as the primary agent eval framework for OpenAI-runtime agents rather than recommending one or the other.
The architectural argument: trace, grader, and dataset belong in the same system. When an agent runs through the OpenAI Agents SDK, the SDK already produces a structured trace — every model call, every tool call, every handoff, every guardrail check, every retry, every custom span the agent itself emits. The trace is already structured, already inspectable, already in the OpenAI platform. Agent Evals organizes the dataset and experiments; trace grading reads the traces directly and runs evals against them. No export, no re-serialization, no schema mismatch.
The alternative — running an external grader against exported traces — is possible but operationally harder. You export the trace (which itself requires a stable trace schema), parse it in the grader's runtime, reconstruct the agent's execution, then evaluate. The friction is real, and for most teams the friction is what causes trace evals to never get past "we should do this" into "we ship this on every change." OpenAI's trace grading removes the friction.
What the pair specifically gives you:
- Trace inspection primitives (trace grading). Assertions on what tools were called, in what order, with what arguments. Assertions on handoffs (which specialist did the agent route to?). Assertions on guardrail invocations (did the safety filter fire? Should it have?). Assertions on intermediate reasoning (the model's reasoning between tool calls, captured in the trace).
- LLM-as-judge for output-level and trace-level metrics (both capabilities). A grader prompt is given the relevant artifact (output for Agent Evals, full trace for trace grading) plus a rubric and produces a graded score. The grader is typically a stronger model than the one running the agent — for Course Nine's worked example, agents run on Claude Sonnet-class models and grading runs on GPT-4-class or Claude Opus-class.
- Custom span support (trace grading). Beyond what the SDK emits by default, the agent can emit custom spans for important reasoning steps. The trace grader can be configured to inspect these spans specifically. This is how teams capture "agent's confidence in this decision" or "the standing instruction the agent matched on" as graded data. (A minimal OpenTelemetry sketch of emitting such a span follows this list.)
- Dataset and experiment management (Agent Evals). Hosted infrastructure for organizing eval datasets, running experiments (comparing two agent or model variants on the same dataset), tracking the score distribution over time, and producing comparison reports. Important infrastructure that teams otherwise build themselves.
- Model-vs-model comparison (Agent Evals). When a new model is released and the team needs to decide whether to upgrade, Agent Evals runs the full eval suite against both the current and the candidate model and produces a per-metric comparison. This is the eval-driven version of A/B testing models.
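To make the custom-span idea concrete, here is a minimal sketch using the standard OpenTelemetry Python API; the span and attribute names, and the two helper functions, are this example's own illustrative choices, not a required schema:
# Illustrative custom span for a reasoning step, using the standard OpenTelemetry Python API.
# Span/attribute names ("delegation_decision", "decision.confidence") are this example's own choices.
from opentelemetry import trace

tracer = trace.get_tracer("claudia.delegation")

def decide_refund(request, standing_instructions):
    with tracer.start_as_current_span("delegation_decision") as span:
        matched = match_standing_instruction(request, standing_instructions)  # hypothetical helper
        confidence = score_confidence(request, matched)                       # hypothetical helper
        span.set_attribute("decision.matched_instruction", matched.id)
        span.set_attribute("decision.confidence", confidence)
        span.set_attribute("decision.auto_approved", confidence >= 0.85)
        return matched, confidence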
What the pair is not:
- Not a replacement for repo-level evals. DeepEval (Concept 9) runs in the project repository and fits CI/CD; OpenAI's platform is hosted and runs separately. They complement.
- Not RAG-specific. They can do RAG evals (the trace includes retrieval calls; the dataset can encode retrieved context), but Ragas's specialized metrics produce sharper diagnostics for knowledge agents. Use OpenAI's platform for the agent's reasoning over retrieved context; use Ragas for the retrieval quality itself.
- Not free. The grader is itself an LLM running on inference compute. A trace eval suite of 100 examples can cost a few dollars per run; running on every commit gets expensive fast. Teams optimize the schedule.
- Not exclusive to OpenAI Agents SDK runs. Both capabilities accept traces and eval data from other SDKs in compatible formats — the OpenTelemetry-based trace format is the standard surface. If your agents run on the Claude Agent SDK or other SDKs, you can still use OpenAI Agent Evals and trace grading as long as your traces are exported in the right shape.
The dual-runtime architectural reality. Courses 3-7 of the Agent Factory track taught two runtimes deliberately — the Claude Agent SDK (Claude Managed Agents) and the OpenAI Agents SDK. Course Nine inherits this duality. The eval discipline must work for both. Production AI-native companies in 2026 routinely run workers across both ecosystems. Maya's worked example agents (Tier-1, Tier-2, Manager-Agent, Legal Specialist, Claudia) run on Claude Managed Agents — Claudia on OpenClaw, the others on the Claude Agent SDK directly. That makes DeepEval (for output and tool-use evals) plus Phoenix (for trace evals and production observability) the primary eval stack throughout the lab; OpenAI Agent Evals + Trace Grading is the equally-supported alternative path for readers whose own agents run on the OpenAI Agents SDK. The discipline is genuinely runtime-portable — OpenTelemetry-based trace export is the universal substrate, and every Decision in Part 4 has a parallel path for either runtime. The next two paragraphs lay out the two paths concretely.

Evaluating Claude Managed Agents (the primary path — Maya's setup). The agent runs through the Claude Agent SDK (or OpenClaw, which sits on the same substrate). Tracing is OpenTelemetry-native by design. DeepEval grades outputs and tool calls in the repo on every commit; Phoenix's evaluator framework consumes the OpenTelemetry traces and runs trace-level rubrics with LLM-as-judge graders; Ragas evaluates knowledge-layer agents (TutorClaw); Phoenix also mirrors production traces for observability. The grader is typically Claude Opus or GPT-4-class — a stronger model than the one running the agent, and from a different family to avoid self-grading bias. This is the lab's default configuration in every Decision.
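For the Claude path, a grading pass in Phoenix has roughly this shape. A heavily hedged sketch: it assumes Phoenix's llm_classify evaluator and OpenAIModel wrapper, with a hand-built DataFrame standing in for rows pulled from the trace store; exact parameter names and template conventions vary across Phoenix versions, so check the docs for the version you have installed:
# Illustrative sketch of a Phoenix LLM-as-judge pass over trace-derived rows.
# Assumes phoenix.evals.llm_classify and OpenAIModel; treat parameter names as version-dependent.
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# One row per run: the customer task and the agent's final reply, pulled from the trace store.
traces = pd.DataFrame([
    {"input": "Duplicate $89 charge on Nov 12, please refund.",
     "output": "I've confirmed the duplicate and issued an $89 refund; expect it in 3-5 business days."},
])

TEMPLATE = """You are grading a support agent's reply.
Question: {input}
Reply: {output}
Did the reply resolve the customer's request? Answer exactly 'resolved' or 'unresolved'."""

results = llm_classify(
    dataframe=traces,
    template=TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),  # judge from a different family than the agent
    rails=["resolved", "unresolved"],
)
print(results)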
Evaluating OpenAI Agents SDK workers (the equally-supported alternative path). If your agents run on the OpenAI Agents SDK instead of the Claude Agent SDK, the eval stack changes shape at the trace-eval layer; everything else stays the same:
- Output evals: DeepEval works identically — OpenAI-agent outputs are graded the same way Claude-agent outputs are. No changes to Decision 2.
- Tool-use evals: also work identically in DeepEval, because the agent's tool-call records are captured the same way regardless of runtime.
- Trace evals: this is the layer where the runtime matters. Two real paths:
- Path A (recommended for OpenAI-runtime teams) — OpenAI Agent Evals + Trace Grading as the trace-evaluation layer. The OpenAI Agents SDK produces traces directly into OpenAI's platform; Agent Evals manages datasets and runs eval suites at scale, and the trace-grading capability reads the platform's own traces and runs trace-level assertions on tool calls, handoffs, guardrails, and intermediate reasoning. The architectural advantage: no export, no re-serialization, no schema mismatch — trace, grader, and dataset all in one ecosystem.
- Path B — Export OpenAI traces and use Phoenix's evaluator framework anyway. Export the OpenAI Agents SDK traces in OpenTelemetry format, ingest them into Phoenix, grade with Phoenix's evaluators. Works for teams that want a single unified grading surface across runtimes; adds operational friction (two ecosystems for OpenAI-only teams) if used unnecessarily.
- RAG evals: Ragas is runtime-agnostic by design. Works identically against Claude or OpenAI agents. No changes to Decision 5.
- Safety/policy evals: also DeepEval-based, runtime-agnostic. No changes to Decision 4.
- Production observability: Phoenix is the recommended path for both runtimes; it's what Decision 7 sets up. The dual-runtime team uses one Phoenix dashboard for everything.
The honest summary for OpenAI-runtime readers. If your worker is on the OpenAI Agents SDK, Course Nine's lab works with one substitution: in Decision 3, instead of routing traces through Phoenix's evaluator framework, route them through OpenAI Agent Evals + Trace Grading (Path A above). The rubrics are identical; the Plan-then-Execute briefing pattern is identical; the eval discipline is identical. The only thing that changes is which platform's UI you click into to see the graded trace. That's not a small change — operational ergonomics matter — but it's not an architectural change.
Why DeepEval + Phoenix is the primary stack for the lab. Two reasons. First, Maya's worked example agents from Courses 5-8 (Tier-1 Support, Tier-2 Specialist, Manager-Agent, Legal Specialist, and Claudia on OpenClaw) all run on the Claude substrate; DeepEval + Phoenix is the tightest-fit eval surface for Claude-runtime agents because Phoenix's OpenTelemetry-native tracing matches the Claude Agent SDK's tracing output directly. Second, the DeepEval-first framing is the most portable starting point even for readers whose own agents are on a different runtime: DeepEval's pytest-style structure is the same on every SDK, and OpenTelemetry trace export means Phoenix can grade traces from any compatible runtime. For OpenAI-runtime readers, every Decision in Part 4 has a Path-A equivalent that produces an equivalent eval suite; the Simulated track explicitly includes OpenAI-runtime trace samples for readers who want to walk that path on the lab's seed data.
The Course Three to Course Nine cross-reference, concrete. When you built your first Worker in Course Three, the agent SDK produced traces by default — you saw them in the SDK's tracing UI (the Claude Agent SDK's tracing console or the OpenAI Agents SDK's traces dashboard, depending on which runtime you used). Those traces were the raw material for Course Nine's trace evals, even though Course Three didn't name it that way. Course Three taught you to read traces by eye; Course Nine teaches you to grade them automatically. The substrate hasn't changed; the discipline wrapping it has.
Try with AI. Open your Claude Code or OpenCode session and paste:
"I'm setting up OpenAI Agent Evals with trace grading on my Tier-1 Support agent from Course Six. The agent uses the OpenAI Agents SDK with three tools: customer_lookup, refund_issue, escalation_create. I want a starter eval suite split correctly across both capabilities: (1) for the output-evals layer of Agent Evals, write the dataset schema and three rubrics — answer correctness, format compliance, and tone-appropriateness — for the customer-facing responses; (2) for trace grading, write three trace-level rubrics — tool-selection correctness, argument correctness, and unnecessary-tool-call detection — that inspect the trace fields directly. For each rubric, include the grader prompt I would use. Be specific enough that I can submit these directly to the platform."
What you're learning. The output-versus-trace split is itself an architectural decision — which artifacts get graded at the output level versus the trace level directly shapes the eval suite's failure-detection profile. This exercise forces you to think through that split for a real agent before Decision 3 in the lab.
Bottom line: the trace-eval layer is runtime-shaped. For Claude-runtime agents (Maya's worked example), Phoenix's evaluator framework consumes the Claude Agent SDK's OpenTelemetry traces directly and runs trace-level rubrics with LLM-as-judge graders — same Phoenix instance doubles as production observability. For OpenAI-runtime agents, OpenAI Agent Evals plus Trace Grading is the tightest fit: one platform, two capabilities (Agent Evals for datasets and output-level grading at scale; Trace Grading for trace-level assertions on tool calls, handoffs, guardrails). Either path is paired with DeepEval (repo-level output and tool-use evals) and Ragas (RAG-specific metrics) to complete the four-layer stack. The discipline is identical; the UI you click into is what differs.
Concept 9: DeepEval as the repo-level eval framework
OpenAI's trace grading handles the trace-aware layer in the hosted ecosystem. DeepEval handles the repo-level layer — evals as code, in the project repository, in CI/CD, in the developer's daily workflow. The architectural argument: behavior evaluation has to live where developers already live, or it stays a research activity that doesn't actually constrain shipping.
The shape DeepEval gives you, in one sentence: pytest, but for LLM and agent behavior. Test cases, metrics, thresholds, assertions, fixtures, CLI runs, CI integration. A team that already practices unit testing has the muscle memory; DeepEval transfers it to agent behavior with very little new vocabulary.
A DeepEval test, concretely. From the Tier-1 Support agent's eval suite:
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric

def test_customer_billing_dispute_refund():
    # The input: a realistic customer-facing task
    task = "I see a duplicate charge of $89 on my November 12 statement. Can you refund the duplicate?"
    # The agent's actual output (from a run captured in CI)
    actual_output = run_tier1_support_agent(task=task, customer_id="C-3421")
    # The expected behavior (from the golden dataset)
    expected = "The agent should acknowledge the dispute, verify the customer's account, " \
               "confirm the duplicate charge exists, and issue a single refund of $89."
    # The test case
    test_case = LLMTestCase(
        input=task,
        actual_output=actual_output.response,
        expected_output=expected,
        context=[actual_output.customer_context, actual_output.charge_history],
    )
    # Metrics with pass thresholds
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    hallucination = HallucinationMetric(threshold=0.3)  # max acceptable hallucination
    assert_test(test_case, [relevancy, hallucination])
What this looks like to a developer who knows pytest: a test file, a test function, a helper that runs the agent (run_tier1_support_agent), an assertion (assert_test). The mental model is the same — except instead of assert result == expected, the assertions are LLM-graded behavior metrics with thresholds.
What DeepEval ships with out of the box.
A library of built-in metrics covering most common eval needs:
- Answer relevancy — does the response actually answer the question?
- Faithfulness — are the claims in the response supported by the provided context? (Useful even for non-RAG agents; can be applied to any agent that should ground in retrieved or provided context.)
- Hallucination — does the response contain fabricated facts?
- Contextual precision and recall — for retrieval-based components, how much of the retrieved context was relevant, and how much of the relevant context was retrieved?
- Tool-correctness — for tool-using agents, was the right tool called with the right arguments? (Requires the actual tool calls to be captured in the test case.)
- Task completion — did the agent accomplish the user's stated task?
- Bias and toxicity — does the response contain biased or toxic content?
Each metric is configurable (different graders, different thresholds, different rubrics). Each metric returns a score and a pass/fail boolean against its threshold.
Custom metrics for project-specific needs. When the built-in metrics don't cover a need (e.g., "does the response correctly cite the Course Seven hire-approval policy?"), DeepEval supports defining custom metrics with a grader prompt and a threshold. The customization story is the same shape as pytest's custom fixtures or assertions. A small amount of code, a clear interface, fits into the existing structure.
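A sketch of what one such custom metric could look like using DeepEval's GEval class (DeepEval's rubric-driven custom metric); the criteria wording, the threshold, and the helper that produces the proposal are illustrative:
# Illustrative custom metric using DeepEval's GEval (rubric-driven, LLM-graded).
# The criteria wording and threshold are this example's own choices.
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

envelope_correctness = GEval(
    name="Envelope Correctness",
    criteria=(
        "Does the hire proposal's authority envelope follow the existing tier's pattern "
        "(spending limits, approval requirements, tool access) rather than inventing a new envelope shape?"
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

def test_manager_agent_hire_proposal():
    test_case = LLMTestCase(
        input="Capability gap: Spanish-language ticket volume exceeds Tier-2 capacity.",
        actual_output=run_manager_agent_hire_proposal("spanish_tier2_gap"),  # hypothetical helper
    )
    assert_test(test_case, [envelope_correctness])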
The CI/CD integration is the load-bearing thing. deepeval test run is the CLI command. It works the way pytest does — pass rate reports, failure detail with the offending agent output and grader rationale, integration with GitHub Actions / GitLab CI / Jenkins / any CI platform. A prompt change that regresses a critical metric blocks the merge. Same way a code change that breaks a unit test does. This is the discipline TDD gave SaaS, applied to behavior.
Where DeepEval sits in the stack relative to the other tools.
- Complements OpenAI's trace grading. DeepEval can do trace-aware metrics with structured trace input. But the OpenAI ecosystem's trace grading capability is more direct for OpenAI Agents SDK runs. Use DeepEval for output and tool-use evals in CI; use OpenAI's trace grading for deep trace inspection on prompt/model changes.
- Adjacent to Ragas. DeepEval has RAG-specific metrics. Ragas has more of them, with sharper diagnostics. For light RAG evaluation, DeepEval is sufficient. For knowledge-agent-heavy workloads (TutorClaw-class), Ragas is the right tool.
- Distinct from Phoenix. Phoenix is production observability — it watches the agent in real usage and surfaces patterns. DeepEval is development-time — it grades the agent on a curated dataset. The two complement: Phoenix discovers new failure modes in production; DeepEval prevents them from recurring on future changes.
Why DeepEval specifically (over alternatives). Several other eval frameworks and platforms exist as of May 2026 — TruLens, Promptfoo, LangSmith, and others. DeepEval is recommended for Course Nine for four reasons: (1) its pytest-style structure makes it the most accessible for developers; (2) it has the broadest built-in metric library; (3) the docs are oriented toward the engineering workflow rather than the research workflow; (4) it's actively maintained as of the course-writing date. Any team comfortable with DeepEval's discipline can switch to an alternative framework without changing the underlying eval architecture — the patterns transfer.
Try with AI. Open your Claude Code or OpenCode session and paste:
"I want to write a DeepEval test from scratch for Maya's Manager-Agent from Course Seven — specifically the eval pack that runs when the Manager-Agent proposes a new hire. The Manager-Agent's job is to detect a capability gap (e.g., 'we're getting more Spanish-language tickets than the current Tier-2 specialist can handle'), draft a hire proposal with role, authority envelope, budget, and tool list, then submit it to the board. I want three DeepEval metrics: (1) gap_specificity — does the proposal name the specific capability gap rather than generic 'we need more capacity'?; (2) envelope_correctness — does the proposed authority envelope match the existing tier's pattern, not invent a new envelope shape?; (3) budget_realism — does the proposed budget fall within ±20% of comparable existing roles? For each metric, write the DeepEval test function with the appropriate metric class, threshold, and grader rubric. Use the AnswerRelevancyMetric pattern as the template for any custom metrics."
What you're learning. Writing eval tests from scratch is the muscle DeepEval rewards. Built-in metrics handle common cases (relevancy, hallucination); custom metrics for project-specific behavior (envelope correctness, budget realism) are where eval-driven discipline becomes specific to your agents rather than generic. The Manager-Agent example forces you to think through what "correct hire proposal" actually means — which is the same reasoning that goes into Decision 1's golden dataset construction.
Bottom line: DeepEval brings agent evaluation into the developer's daily workflow as pytest-style code in the project repository. It ships with a library of built-in metrics (answer relevancy, faithfulness, hallucination, tool correctness, etc.) plus support for custom project-specific metrics. CI/CD integration is the discipline point: a prompt change that regresses a critical metric blocks the merge, the same way a broken unit test blocks the merge for code. DeepEval is the developer-facing eval surface in the four-tool stack, complementing trace grading via OpenAI Agent Evals (deeper trace work), Ragas (specialized RAG metrics), and Phoenix (production observability).
Concept 10: Ragas for the knowledge layer and Phoenix for production observability
The remaining two tools in the four-tool stack are specialized — Ragas for RAG evaluation specifically, Phoenix for the production observability layer. Concept 10 covers both, and the relationship between them: Ragas closes the development-time loop for knowledge-layer agents; Phoenix closes the production-time loop for all agents. A complete EDD stack uses both.
Ragas — the knowledge-layer eval framework.
Concept 7 introduced RAG evals as a layer; Ragas is the open-source framework that operationalizes them. The architectural argument is the same one Concept 7 made: knowledge-layer agents have three failure modes (retrieval, grounding, citation) that need distinct metrics. Ragas ships those metrics ready-to-use, with implementations grounded in research that's been validated across many production systems.
The five metrics that matter for almost every RAG agent:
| Metric | What it measures | What failure mode it catches |
|---|---|---|
| Context Relevance | Given the user question, was the retrieved context relevant to it? | Retrieval system surfaced irrelevant chunks |
| Faithfulness | Given the retrieved context, are all claims in the answer supported by it? | Agent invented facts beyond what the context supports |
| Answer Correctness | Compared to the ground-truth answer, is the agent's answer correct? | The combined "is the final answer right?" check |
| Context Recall | Of the facts in the ground-truth answer, how many were in the retrieved context? | Retrieval missed key information |
| Context Precision | Of the chunks retrieved, what fraction were relevant? | Retrieval returned too much noise |
The five together give a diagnostic — when a knowledge agent fails on a task, the metrics tell you where the failure originated, not just that it happened. Context Recall low + Answer Correctness low = retrieval missed the key facts. Context Recall high + Faithfulness low = agent has the right info but invented additional claims. Context Recall high + Faithfulness high + Answer Correctness low = agent had the right info, was grounded, but missed the right interpretation. Each diagnosis points at a different fix.
Ragas integrates with the rest of the stack: it produces metrics that DeepEval can consume (you can wrap Ragas evaluators inside DeepEval test cases, so the developer workflow stays unified); it accepts traces from any agent runtime; it can be run on production-sampled traces to evaluate the knowledge layer at scale.
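Operationally, running those metrics over a handful of examples takes a few lines of Python. The sketch below assumes the classic Ragas evaluate() API; metric import paths and column names shift between Ragas versions, so check your installed version's docs.

```python
# Sketch assuming the classic Ragas API (metric names and import paths vary by version);
# requires an LLM-as-judge key (e.g. OPENAI_API_KEY) in the environment.
from datasets import Dataset  # Hugging Face datasets package
from ragas import evaluate
from ragas.metrics import answer_correctness, context_precision, context_recall, faithfulness

rows = {  # one illustrative knowledge-agent example; real runs load the golden dataset
    "question": ["Can I get a refund on a Pro plan 20 days after the charge?"],
    "answer": ["Yes, Pro-plan charges are refundable within 30 days of billing."],
    "contexts": [["Refund policy: Pro plans are refundable within 30 days of the charge."]],
    "ground_truth": ["Yes. Pro plan charges can be refunded within 30 days."],
}

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_correctness, context_precision, context_recall],
)
print(result)  # per-metric scores, ready to feed into a DeepEval wrapper or a report
```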
A note on Ragas's expanding scope. As of May 2026, Ragas is no longer strictly a RAG-only framework. Recent versions ship agent-specific metrics — Tool Call Accuracy, Tool Call F1, Agent Goal Accuracy, Topic Adherence — alongside the classic RAG-quality metrics above. Course Nine still positions Ragas primarily as the knowledge-layer eval tool (because that's where its diagnostic sharpness genuinely shines, and because the OpenAI Agent Evals + DeepEval pair already covers the agent-behavior layer well), but teams running Ragas in production should know that the framework's scope has broadened. For Course Nine's lab specifically (Decision 5), the five RAG metrics are what TutorClaw exercises; Ragas's agent metrics are a useful frontier to explore once that foundation is in place.
Phoenix — the production observability layer.
Phoenix sits across the top of the stack. Its job is different from the other three tools: while trace grading, DeepEval, and Ragas evaluate the agent before and during development, Phoenix observes the agent in production and turns the observations into eval dataset material.
What Phoenix gives you, in three categories:
- Trace visualization at scale. Phoenix ingests traces from any compatible agent runtime (OpenAI Agents SDK, LangChain, LlamaIndex, custom) and presents them in a unified UI. A failing customer interaction in production becomes a clicked-through trace you can inspect step-by-step. This is the diagnostic primitive teams reach for when production breaks — it's the agentic AI equivalent of distributed tracing for microservices.
- Experiment management. Compare two agent variants on the same dataset; track score distributions over time; flag regressions in production behavior; identify performance drift across model versions. Phoenix gives the team the data view that makes EDD operational rather than aspirational.
- Trace-to-eval pipeline. Phoenix samples real traces (continuously, or based on user feedback signals, or based on programmatic filters like "low confidence runs"), and surfaces them as candidates for the eval dataset. A production failure becomes a future eval case — the loop that turns production into development material. Concept 13 takes up the operational discipline; Phoenix is the tooling that makes it tractable.
Phoenix is open-source and self-hostable. It runs as a containerized service (Decision 7 in the lab walks the setup), stores trace data in a local or cloud-backed database, and exposes a UI for the team. The open-source nature matters for an educational course — students can run Phoenix locally without commercial dependencies.
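A sketch of what the trace-to-eval sampling step can look like against a locally running Phoenix instance. The client calls are standard Phoenix API, but the endpoint and the feedback-attribute column name are assumptions to adapt to your setup.

```python
# Sketch: sample production spans from a local Phoenix instance as golden-dataset candidates.
# Endpoint and the feedback column name are assumptions; adjust to your deployment.
import phoenix as px

client = px.Client(endpoint="http://localhost:6006")  # assumed local Phoenix endpoint
spans = client.get_spans_dataframe()

# Keep the spans worth triaging into the golden dataset, e.g. negative user feedback.
# The column name below is hypothetical; substitute whatever signal your traces carry.
feedback_col = "attributes.feedback.score"
if feedback_col in spans.columns:
    spans = spans[spans[feedback_col] < 0]

# Each surviving row is a candidate eval case for the trace-to-eval triage ritual.
spans.to_json("datasets/candidates-from-production.json", orient="records")
```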
Braintrust is the commercial alternative, and it deserves more than a one-line mention. For teams that want a polished collaborative product with hosted infrastructure rather than a self-hosted open-source one, Braintrust is the upgrade path the source explicitly names: "Phoenix first, Braintrust later if a commercial team dashboard is needed." Three things Braintrust adds over Phoenix that justify the commercial price for some teams:
- Hosted collaborative workspace. Phoenix is per-team-installation; Braintrust is multi-team-by-default. For organizations running several agent products across product lines (Maya's customer support, TutorClaw teaching, the Manager-Agent's hiring decisions, and any other agents the company runs), Braintrust gives a single workspace where each team can run their own eval suites against shared infrastructure, share datasets, and produce comparable reports.
- Polished experiment-comparison UI. Phoenix's experiment view is functional and improving rapidly; Braintrust's is more mature, with better diff views (what changed between this run and last), better filtering (show me only the examples where this metric regressed), and better collaboration affordances (annotate failing examples, assign owners, track remediation).
- Managed infrastructure. Phoenix you run; Braintrust you subscribe to. For teams that don't have the operational bandwidth to run Phoenix as a production service — patching, monitoring, storage scaling, backup — Braintrust's hosted model removes that cost.
When to make the Phoenix → Braintrust switch. Three signals: (1) you're running eval infrastructure for more than ~3 distinct agent products and the per-team coordination overhead is costing real time; (2) your team is paying real maintenance cost on Phoenix's self-hosted infrastructure and the commercial alternative would be cheaper than the eng-hours; (3) you need collaborative annotation and review workflows that Phoenix's UI doesn't quite ship yet as of May 2026. Until at least one of these is true, Phoenix is the right choice, both because the open-source path matches Course Nine's educational stance and because the migration path (both products consume OpenTelemetry-compatible traces) is preserved.
Course Nine teaches Phoenix in Decision 7's lab; the Braintrust upgrade is covered as Decision 7's sidebar below. The discipline is the same in both products — what changes is operational ergonomics, not the underlying eval architecture.
The four-tool stack, summarized.
- OpenAI Agent Evals (with trace grading) — hosted agent-evaluation platform; the trace-grading capability catches failures invisible to output-only evaluation. Primary for OpenAI Agents SDK runs.
- DeepEval — repo-level evals in the developer's daily workflow. Pytest-style. The CI/CD discipline point.
- Ragas — specialized RAG evaluation for knowledge-layer agents. The diagnostic primitive for retrieval-vs-reasoning failure modes.
- Phoenix — production observability. The trace-to-eval feedback loop. The connective tissue from production back into development.
The stack is intentionally layered, not redundant. A team that adopts all four gets a complete eval discipline — output and tool-use evals on every commit (DeepEval), trace evals on every prompt/model change (OpenAI Agent Evals trace grading), RAG evals for knowledge agents (Ragas), production observability continuously (Phoenix). The discipline scales with the team's maturity: a beginning team can adopt DeepEval first and add the others as the agent's complexity grows; a mature team integrates all four into a single CI/CD-plus-production observability pipeline.
Bottom line: Ragas operationalizes the RAG-specific eval layer with five metrics (Context Relevance, Faithfulness, Answer Correctness, Context Recall, Context Precision) that diagnose where a knowledge-agent failure originated. Phoenix operationalizes the production observability layer — trace visualization, experiment management, and the trace-to-eval feedback loop that turns production failures into future eval cases. Together with trace grading (Concept 8) and DeepEval (Concept 9), they form the four-tool stack: each plays a distinct role; the discipline only works when the team uses them as the layered architecture they were designed for.
Part 4: The Lab
Part 4 walks through assembling the discipline concretely. Seven Decisions, each one a briefing to your Claude Code or OpenCode session — never typed or edited by hand. By the end of Part 4, Maya's customer-support company has an eval suite covering output, tool-use, trace, RAG, safety, regression, and production observability, with each layer wired into CI/CD and a production observability dashboard reading from real (or sampled) traces.
A note on model strength for the lab's coding agent. The seven Decisions below are each 6-8-step structured briefs that assume your agentic coding tool will reliably enter plan mode, save the plan to a file, pause for review, then execute step-by-step with verification after each. This works cleanly on Claude Sonnet/Opus, GPT-5-class, or Gemini 2.5 Pro; on weaker or older models (DeepSeek-chat, Haiku, local Llama-class, Mistral), the same prompts are stochastic: the agent will sometimes batch multiple steps, sometimes skip the verification beat, sometimes drift on the output format. Two mitigations if your coding agent is on a weaker model: (1) move the multi-step orchestration into the rules file (
CLAUDE.md/AGENTS.md) as a general-flow preamble so the contract reloads every turn; (2) be explicit about what the agent should NOT do, not just what to do — e.g., "save the plan to docs/plans/decision-N.md before any code is written. Do not begin step 2 until step 1's file exists." The architectural lab in this Part holds across model tiers; the operational precision degrades, and the rules file is where you take it back.
Two completion modes for the lab — pick before starting.
- Full implementation (recommended for teams running an actual Course 5-8 deployment). You install all four eval frameworks, wire them to your real Tier-1 Support agent, Manager-Agent, and Claudia, run real evals on real traces, integrate with your real CI/CD. Time: 6-10 hours of lab on top of 3 hours of conceptual reading — a 1-day sprint or 2-day workshop. Output: a production-grade eval suite covering all eight Course 3-8 invariants.
- Simulated (recommended for learners, students, or anyone without a deployed Course 5-8 stack). You use pre-recorded traces and synthetic agent outputs from the course's GitHub repository. The eval frameworks run; the metrics produce real scores; the production observability is replayed from sampled traces. Time: 2-3 hours of lab on top of 2 hours of conceptual reading — a comfortable half-day. Output: a complete understanding of eval-driven development plus a working local lab you can demonstrate.
The Decisions below are written to work for both modes. Where a Decision says "wire to your live Paperclip deployment..." the simulated mode reads it as "wire to your local mock from the starter repo..." Otherwise the briefings are identical.
Before Decision 1 — which agent runtime are your agents on? Course Nine's lab works across multiple agent runtimes, because the Agent Factory curriculum is multi-vendor by design. The eval discipline (the 9-layer pyramid, the golden dataset, the eval-improvement loop, the trace-to-eval pipeline) is runtime-agnostic; the eval tooling is partly runtime-specific. Three paths:
Path A — Claude Managed Agents (Claude Agent SDK). Maya's Tier-1 Support, Tier-2 Specialist, Manager-Agent, and Legal Specialist from Courses Five-Seven are built on Claude Managed Agents; Claudia from Course Eight runs on OpenClaw, also a Claude substrate. This is the lab's primary path. For these agents: (1) use DeepEval for output and tool-use evals in CI; (2) use Phoenix's evaluator framework for trace evals — it consumes the Claude Agent SDK's OpenTelemetry traces directly and runs trace-level rubrics; (3) use Ragas for knowledge-layer evaluation (runtime-agnostic); (4) Phoenix doubles as production observability in Decision 7. The full four-layer stack ships without leaving the Claude ecosystem. Concept 8 and Decision 3 walk this path in detail.
Path B — OpenAI Agents SDK. Course Three's worked example introduced this runtime, and some readers built their agents on it. For these agents, OpenAI Agent Evals + Trace Grading is the natural trace-evaluation surface — the platform, the trace format, and the grader all live in the same ecosystem; no export, no re-serialization. DeepEval, Ragas, and Phoenix's observability layer still apply identically. Concept 8 and Decision 3 cover this alternative path alongside Path A.
Path C — Other runtimes (LangChain, LlamaIndex, custom agent loops). Same shape as Path B: DeepEval for repo-level evals, Phoenix for observability, Ragas for knowledge layer. The eval discipline transfers; the tooling around it adapts. OpenTelemetry-compatible trace export is the universal substrate that connects any runtime to any eval tool.
For Maya's worked example specifically: the Tier-1, Tier-2, Manager-Agent, Legal Specialist, and Claudia agents are all on Claude Managed Agents (Path A). The lab is written for both Path A and Path B — Decision 3 walks the Phoenix-evaluators path for Path A (Maya's setup) and the OpenAI-Agent-Evals path for readers on Path B; Decisions 2, 4, 5, 6, 7 are runtime-agnostic and work identically on either path. This isn't a workaround; it's the architectural reality of multi-vendor agentic systems in May 2026, and serious teams build their eval discipline accordingly.
If something breaks, check these three things first (these account for ~80% of lab failures during the eval stack setup):
- API keys and account access. OpenAI Agent Evals needs an OpenAI account (Path B only). DeepEval, Ragas, and Phoenix need an LLM-as-judge backend — OpenAI, Anthropic, or self-hosted (any path). Phoenix runs locally without external API keys, but its experiments may consume LLM tokens depending on what evaluators you wire to it. Verify all three before Decision 2.
- Trace export configuration. The OpenAI Agents SDK produces traces by default and OpenAI's trace-grading capability consumes them automatically (Path B). Claude Managed Agents produce traces too, but you need to configure OpenTelemetry export to the eval tools (Path A) — typically a few lines of configuration in your agent runtime (a minimal sketch follows this list). If you skip this, trace evals will silently produce empty datasets. Check that trace data is flowing before Decision 3.
- Dataset quality. Most "the eval suite produces nonsense" failures trace back to dataset quality (Concept 11 takes this up). If your scores look wrong, inspect 5-10 examples by hand before assuming the tools are broken. The framework rarely lies; the dataset frequently does.
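For reference, the Phoenix side of that configuration is typically one registration call. The sketch below assumes the arize-phoenix-otel helper package; the package name, project name, and endpoint are assumptions to check against your installed versions.

```python
# Sketch: point the agent runtime's OpenTelemetry traces at a local Phoenix instance.
# Assumes the arize-phoenix-otel helper package; endpoint and project name are placeholders.
from phoenix.otel import register

tracer_provider = register(
    project_name="course-nine-lab",
    endpoint="http://localhost:6006/v1/traces",  # Phoenix's default local collector path
)
# Instrument your runtime (Claude Agent SDK, LangChain, a custom loop) against this
# tracer_provider. If this step is skipped, trace evals read an empty dataset, which is
# exactly the silent failure the checklist above warns about.
```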
Lab Setup — Before Decision 1
The Decisions below are executed through Claude Code or OpenCode (your agentic coding tool). You do not type or edit code manually anywhere in this lab. Each Decision is briefed to your agentic coding tool; it produces a plan; you review and approve; then it implements. Same discipline as Course Eight.
If you completed Course Eight, you already have Claude Code or OpenCode installed and configured. Skip ahead to step 4 (the Course-Nine-specific rules file content) and otherwise reuse your existing setup. If you're picking up Course Nine without Course Eight, follow steps 1-6.
1. Install Claude Code or OpenCode
# macOS / Linux / WSL — recommended (auto-updates)
curl -fsSL https://claude.ai/install.sh | bash
# Verify and update
claude update
claude --version
2. Create your lab project folder
mkdir course-nine-lab
cd course-nine-lab
git init
3. Set up the four eval frameworks' dependencies
A single setup pass for the Python dependencies — your agentic coding tool handles this in Decision 1, but you can verify the substrate now:
python3 --version # Need 3.11+
pip --version # Need recent
docker --version # Need recent; Phoenix runs containerized
4. Write the project rules file
Create CLAUDE.md:
# Course Nine Lab — Eval-Driven Development
## What this is
A hands-on lab building eval suites for Maya's customer-support company
(from Courses 5-8) plus a knowledge-layer agent (TutorClaw, introduced
in Decision 5). Seven Decisions covering output, tool-use, trace, RAG,
safety, regression, and production-observability evals.
## Stack
- Python 3.11+ (primary; DeepEval, Ragas, Phoenix client)
- TypeScript/Node.js 20+ (if extending the Course 5-8 codebases)
- OpenAI Agents SDK (the agents being evaluated)
- DeepEval (repo-level evals)
- Ragas (RAG evals)
- Phoenix (production observability, runs in Docker)
- OpenAI Agent Evals with trace grading (hosted; accessed via OpenAI account)
## Lab tracks
- **Simulated**: use pre-recorded traces from `./traces-fixtures/` and the
sample golden dataset at `./datasets/sample-golden.json`. Do NOT call
live agents or production Paperclip.
- **Full**: wire to your Course 5-8 deployment. Pull real traces; run
evals on real agents.
## Critical rules
- Never write to a production governance_ledger from a test session.
Use the simulated mode's local SQLite or a clearly-marked staging DB.
- Never commit API keys to git. Use environment variables; the .gitignore
must exclude .env files.
- The golden dataset at ./datasets/golden.json is the most important
artifact in this lab. Treat changes to it like API contract changes:
review carefully, version explicitly.
- After any change to the dataset, the eval prompts, or the metric
thresholds, run `deepeval test run` before considering the Decision
complete.
## Saved plan files
Each Decision saves its plan to docs/plans/decision-N.md before
implementation. Use plan mode to write the plan; review it; then
implement.
## References to load on demand
- @docs/eval-pyramid.md (the nine-layer architecture)
- @docs/golden-dataset-conventions.md (dataset construction patterns)
- @docs/grader-rubrics.md (the LLM-as-judge rubrics for each metric)
5. Configure permissions
Add to .claude/settings.json. Four of the denies below are Course-Nine-specific; they are explained after the block:
{
"permissions": {
"deny": [
"Bash(rm -rf *)",
"Bash(npm publish *)",
"Bash(git push *)",
"Edit(.env*)",
"Bash(cat .env*)",
"Bash(curl *PRODUCTION*)",
"Bash(psql *production*)"
],
"allow": [
"Read",
"Edit",
"Write",
"Bash(deepeval *)",
"Bash(pytest *)",
"Bash(docker *)",
"Bash(python *)",
"Bash(pip install *)",
"Bash(git status)",
"Bash(git diff *)",
"Bash(git add *)",
"Bash(git commit *)"
]
}
}
The four critical denies: no edits to .env files (where API keys live), no cat .env (don't print keys to the agent's context), no curl to production URLs, no psql against production databases. Course Nine specifically deals with eval data, which means the agent is regularly reading traces and writing to local databases — the discipline is that "local" and "production" stay rigorously separated.
6. Add hooks (Claude Code) or plugins (OpenCode) for deterministic guardrails
Three Course-Nine-specific guardrails:
Add to .claude/settings.json:
{
"hooks": {
"PreToolUse": [
{
"matcher": "Edit",
"command": "if echo \"$TOOL_INPUT\" | grep -qE '\"path\":\\s*\"datasets/golden\\.json'; then echo 'Dataset edit detected — confirm with: ./scripts/validate-dataset.sh' >&2; ./scripts/validate-dataset.sh || exit 2; fi"
},
{
"matcher": "Bash(git commit *)",
"command": "if git diff --cached --name-only | xargs grep -l 'sk-[a-zA-Z0-9]\\{20,\\}' 2>/dev/null; then echo 'Refusing to commit: API key pattern detected in staged files' >&2; exit 2; fi"
},
{
"matcher": "Bash(deepeval *)",
"command": "if [ ! -f datasets/golden.json ]; then echo 'Refusing to run evals: datasets/golden.json missing' >&2; exit 2; fi"
}
]
}
}
The architectural logic of these three:
- Guardrail 1: every edit to the golden dataset triggers automatic validation. The dataset is too important to allow silent corruption.
- Guardrail 2: defense-in-depth against API key leakage. The permissions block denies `.env` access, but if a key ever leaks into another file, the commit is blocked.
- Guardrail 3: evals running against a missing dataset are a common cause of "the eval suite mysteriously passes everything." Refuse to run unless the dataset is present.
7. Save commonly-reused workflows as slash commands
Two slash commands for the eval-driven discipline:
Create .claude/commands/run-evals.md:
Run the full eval suite for the current change. Steps:
1. Verify datasets/golden.json is current and uncorrupted.
2. Run `deepeval test run` against the test suite in evals/.
3. Run trace evals via the OpenAI Agent Evals CLI if available, or
the equivalent Python harness in evals/trace_evals.py.
4. Run Ragas evals if there's a knowledge-agent in scope.
5. Aggregate results into a single report at reports/eval-{date}.md.
6. Compare against the baseline at reports/baseline.md and flag any
regressions on a critical metric (where critical metrics are defined
in docs/critical-metrics.md).
Create .claude/commands/dataset-diff.md:
Compare the current golden.json against the committed baseline:
1. Read datasets/golden.json (current).
2. Read datasets/golden.json from the last commit.
3. Report any added, removed, or modified examples.
4. For each modified example, show before/after for the relevant fields.
5. Flag any example whose expected_output or rubric changed without a
corresponding code-change justification in the commit message.
The Plan-then-Execute discipline from Course Eight carries over to Course Nine. Every Decision: enter plan mode, brief, save plan to docs/plans/decision-N.md, review, exit plan mode, execute. The Decisions below describe the brief you give to the tool — they do not repeat the workflow each time.
Decision 1: Set up the eval workspace and create the first golden dataset
In one line: install DeepEval, Ragas, and the OpenAI Agent Evals client (with trace grading); scaffold the project's
`evals/` directory; build the first 50-example golden dataset covering the agent's most common task categories.
Simulated track for Decision 1: instead of sampling examples from your Paperclip
activity_log, build the 50-example dataset directly from the patterns described in Concept 11 (category mix, difficulty stratification, edge cases). The validation script and project structure are identical; only the dataset source differs.
Everything downstream depends on a dataset that actually represents the agent's production traffic. Bad dataset, bad evals, no matter how good the frameworks are. Decision 1 is the most undervalued step in the entire lab. Concept 11 takes up dataset construction in detail; this Decision is the operational version.
What you do — Plan, then Execute. In your agentic coding tool, switch to plan mode (Claude Code: Shift+Tab twice; OpenCode: Tab to Plan agent). Paste the brief below, ask the tool to produce a written plan and save it to docs/plans/decision-1.md, review it, then switch out of plan mode to execute.
The eval workspace setup plus the first golden dataset for Maya's Tier-1 Support agent. Requirements:
- Install the Python dependencies. Pin versions in
`requirements.txt`: deepeval, ragas, openai, pytest, python-dotenv. Plus dev-only: pytest-asyncio, pytest-xdist for parallel runs.
- Create the project structure.
course-nine-lab/
├── datasets/
│ ├── golden.json (the load-bearing artifact)
│ └── README.md (dataset conventions documented)
├── evals/
│ ├── output/ (DeepEval test files for Concept 5 layer)
│ ├── tool_use/ (Concept 6, tool-use specific)
│ ├── trace/ (Concept 6 + 8, OpenAI Agent Evals trace-grading harness)
│ ├── rag/ (Concept 7 + 10, Ragas-based)
│ ├── safety/ (envelope/policy evals)
│ └── conftest.py (pytest fixtures: agent runners, dataset loader)
├── reports/
│ └── baseline.md (the score baseline for regression detection)
└── docs/
├── grader-rubrics.md
├── eval-pyramid.md
└── critical-metrics.md
- Build the first golden dataset (an illustrative row follows this brief). 50 examples covering Maya's Tier-1 Support agent's most common task categories. Each example must have:
  - `task_id` (unique)
  - `category` (one of: refund_request, account_inquiry, technical_issue, escalation_request, policy_question)
  - `input` (the customer message)
  - `customer_context` (object with keys: `customer_id`, `plan` (free/pro/enterprise), `tenure_months`, `prior_refunds_30d`, `account_status` (active/suspended), and any case-specific facts)
  - `expected_behavior` (natural language description of what the agent should do)
  - `expected_tools` (ordered list — the eval treats order as the canonical sequence; tools must come from the registry below)
  - `expected_response_traits` (rubric items the response should satisfy)
  - `unacceptable_patterns` (specific things the response should NOT contain)
  - `difficulty` (easy / medium / hard — for stratified analysis)
- Tool registry (the only valid values for `expected_tools` — the validator and Decision 2's tool-use eval both reference this list):
  - `lookup_customer(customer_id)` — fetch profile, plan, tenure, status
  - `check_subscription_status(customer_id)` — current plan, billing state, renewal date
  - `process_refund(customer_id, amount, reason)` — issue refund within policy
  - `check_refund_policy(plan, days_since_charge)` — return refund eligibility
  - `search_kb(query)` — knowledge-base lookup for policy/how-to questions
  - `get_recent_charges(customer_id, days)` — billing history
  - `update_account(customer_id, field, value)` — non-billing profile changes
  - `create_ticket(customer_id, category, priority, summary)` — open a tracked case
  - `escalate_to_human(ticket_id, reason)` — hand off to a human agent
  - `send_email(customer_id, template_id, variables)` — confirmation/notification
  - `run_diagnostic(customer_id, area)` — technical-issue diagnostic harness
  - `check_outage_status(region)` — current incident-board lookup
- Distribution across categories. Roughly 40% refund_request (the most common production category), 20% account_inquiry, 15% technical_issue, 15% escalation_request, 10% policy_question. Within each category, mix easy/medium/hard.
- Source examples from realistic patterns, not from imagination. If the simulated track, use the provided `traces-fixtures/` directory. If the full-implementation track, sample from the `activity_log` in Paperclip — pick varied real customer interactions and convert them into eval examples.
- Validate the dataset. Write `scripts/validate-dataset.sh` that checks (a) every example has all required fields, (b) `expected_tools` references only tools that actually exist in the agent's tool registry, (c) no example has identical `input` to another, (d) the category distribution matches the target ±5%.
- Document the dataset conventions in `datasets/README.md`. Treat changes to the dataset like API contract changes.
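To make the contract concrete, here is an illustrative row plus the field-presence check at the core of the validator. The values are invented, and the check is written in Python so it can double as a pytest guard.

```python
# One illustrative golden-dataset row plus the field-presence check the validator runs.
# Field names follow the brief above; the values are invented for illustration.
import json

EXAMPLE_ROW = {
    "task_id": "refund_0001",
    "category": "refund_request",
    "input": "I was billed twice for March. Please refund the duplicate charge.",
    "customer_context": {
        "customer_id": "C-1042", "plan": "pro", "tenure_months": 14,
        "prior_refunds_30d": 0, "account_status": "active",
    },
    "expected_behavior": "Verify the duplicate charge, refund one of the two charges, send a confirmation email.",
    "expected_tools": ["lookup_customer", "get_recent_charges", "check_refund_policy", "process_refund", "send_email"],
    "expected_response_traits": ["acknowledges the duplicate charge", "states the refunded amount"],
    "unacceptable_patterns": ["tells the customer to contact billing themselves"],
    "difficulty": "easy",
}

REQUIRED_FIELDS = set(EXAMPLE_ROW)  # the brief's required field list


def missing_fields(path: str = "datasets/golden.json") -> dict[str, set[str]]:
    """Return {task_id: missing field names} for every incomplete example."""
    with open(path) as f:
        rows = json.load(f)
    return {
        row.get("task_id", "?"): REQUIRED_FIELDS - row.keys()
        for row in rows
        if REQUIRED_FIELDS - row.keys()
    }
```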
Bottom line of Decision 1: the golden dataset is the artifact every eval depends on. 50 examples covering the major task categories, sourced from realistic patterns (not from imagination), validated automatically, documented as a contract. Do not skip this Decision in favor of getting to the more "interesting" eval frameworks. A beautiful eval framework on a bad dataset measures the wrong thing with rigor.
PRIMM — Predict before reading on. Maya has finished Decision 1 with a 50-example golden dataset for the Tier-1 Support agent. The dataset has the right category distribution (40% refunds, 20% account inquiries, etc.) and passes the validation script. Maya's team is excited to move on to Decision 2 (DeepEval).
Before they do, the team lead asks: "In six months, which of the following will be the most common reason our eval suite fails to catch a production failure?"
- The eval framework was misconfigured (wrong threshold, wrong grader model)
- The agent's prompts drifted faster than we could update the dataset
- The 50-example dataset was missing the failure category that hit production
- The grader (LLM-as-judge) made an inconsistent call that hid the failure
Pick one before reading on. The answer, with reasoning, lands at the start of Decision 7's discussion of the trace-to-eval pipeline.
Decision 2: Output evals with DeepEval on the Tier-1 Support agent
In one line: write the first DeepEval test suite covering output evals (Concept 5) for the Tier-1 Support agent, with answer relevancy, faithfulness, hallucination, and task completion metrics; integrate into CI/CD.
Simulated track for Decision 2: rather than invoking a live agent, generate pre-recorded outputs once with a cheap model (DeepSeek-chat or gpt-4o-mini) using a small harness that reads
`datasets/golden.json` and writes one JSON per example to `traces-fixtures/decision-2-outputs/`. Cost is under $0.05 for 50 examples. The DeepEval metrics, thresholds, and CI integration are then identical to the live-agent path; the test runner just loads the pre-recorded JSON instead of calling the agent. Cache the outputs to disk so re-runs are free.
DeepEval version drift. The metric names below are stable as of DeepEval 3.x. In DeepEval ≥ 4.0: `TaskCompletionMetric` is not a built-in class — build it with `GEval(name="TaskCompletion", criteria="...", evaluation_params=[...])`. `LLMTestCaseParams` is renamed to `SingleTurnParams`. The CLI `deepeval test run` may hang; plain `pytest evals/output/` works in all versions. Pin your DeepEval version in `requirements.txt` and check the upgrade notes when bumping it.
LLMTestCase field mapping. When constructing each `LLMTestCase` from a golden-dataset row:
| LLMTestCase field | Source |
|---|---|
| `input` | the dataset row's `input` |
| `actual_output` | the agent's response (live or pre-recorded) |
| `expected_output` | the dataset row's `expected_behavior` (used by GEval rubrics) |
| `context` | the dataset row's `customer_context` serialized to a list of strings |
| `retrieval_context` | any KB passages the agent retrieved (empty list if no RAG) |
| `tools_called` | the agent's actual tool sequence (for tool-use evals in Decision 6) |
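In code, the mapping is a small constructor. The sketch below assumes the Decision 1 row shape; note that tools_called takes plain strings in older DeepEval releases and ToolCall objects in newer ones.

```python
# Sketch: build one LLMTestCase from a golden-dataset row, following the mapping above.
# Assumes the Decision 1 row shape; adjust tools_called to your DeepEval version.
from deepeval.test_case import LLMTestCase


def to_test_case(row: dict, agent_output: str, retrieved: list[str], tools: list[str]) -> LLMTestCase:
    return LLMTestCase(
        input=row["input"],
        actual_output=agent_output,                      # live call or pre-recorded fixture
        expected_output=row["expected_behavior"],        # used by GEval-style rubrics
        context=[f"{k}: {v}" for k, v in row["customer_context"].items()],
        retrieval_context=retrieved or [],               # KB passages, empty if no RAG
        tools_called=tools,                              # strings here; ToolCall objects in newer DeepEval
    )
```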
This is where the eval discipline becomes visible to developers. After Decision 2, every change to the Tier-1 Support agent's prompts, tools, or model triggers an eval run; regressions block merges. This is the moment EDD goes from concept to enforced practice.
What you do — Plan, then Execute. In your agentic coding tool, switch to plan mode (Claude Code: Shift+Tab twice; OpenCode: Tab to Plan agent). Paste the brief below, ask the tool to produce a written plan and save it to docs/plans/decision-2.md, review it, then switch out of plan mode to execute.
Output evals with DeepEval on the Tier-1 Support agent. Requirements:
- Set up a DeepEval test runner at
`evals/output/test_tier1_support.py`. Use pytest-style structure; each test function corresponds to one task category (test_refund_requests, test_account_inquiries, etc.).
- Configure the LLM-as-judge backend. Use Claude Opus or GPT-4-class as the grader; do NOT use the same model running the agent (avoid self-grading bias). Pass via environment variable.
- Implement four metrics with appropriate thresholds:
  - `AnswerRelevancyMetric` (threshold=0.7) — does the response address the user's request?
  - `FaithfulnessMetric` (threshold=0.8) — are claims grounded in retrieved context?
  - `HallucinationMetric` (threshold=0.3) — max acceptable hallucination
  - A custom Task-Completion metric (built with `GEval(name="TaskCompletion", ...)` in DeepEval ≥ 4.0; named `TaskCompletionMetric` in older versions) with a Course-Eight-specific rubric: "did the agent complete the task to the standard a competent Tier-1 Support agent would?"
- Write a dataset loader fixture that reads `datasets/golden.json` and yields `LLMTestCase` instances (sketched after this brief). The loader should support filtering by category and difficulty.
- Run the agent in the test runner. For each example, invoke the Tier-1 Support agent (or load its pre-recorded output for the simulated track), capture the response and the context, then assert all four metrics pass.
- Generate a baseline. Run the full suite once; commit the resulting scores to `reports/baseline.md`. Future runs compare against this baseline.
- CI/CD integration. Wire `deepeval test run` to GitHub Actions (or equivalent). The workflow runs on every PR that touches `evals/`, `prompts/`, or the Tier-1 Support agent's code. A regression on any critical metric blocks the merge.
- Document critical metrics in
docs/critical-metrics.md. Critical metrics are the ones whose regression should block merges; non-critical are tracked but don't block.
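The loader fixture referenced in the brief can be sketched as ordinary pytest plumbing. File paths and the pre-recorded-output helper below are simulated-track assumptions.

```python
# evals/conftest.py sketch: load golden.json rows and yield LLMTestCase objects.
# The pre-recorded-output helper and file layout are simulated-track assumptions.
import json

import pytest
from deepeval.test_case import LLMTestCase


def load_golden_cases(category: str | None = None, difficulty: str | None = None) -> list[LLMTestCase]:
    with open("datasets/golden.json") as f:
        rows = json.load(f)
    if category:
        rows = [r for r in rows if r["category"] == category]
    if difficulty:
        rows = [r for r in rows if r["difficulty"] == difficulty]
    return [
        LLMTestCase(
            input=r["input"],
            actual_output=prerecorded_output(r["task_id"]),  # simulated track; swap for a live agent call
            expected_output=r["expected_behavior"],
        )
        for r in rows
    ]


def prerecorded_output(task_id: str) -> str:
    # Reads the fixture the Decision 2 pre-record harness wrote (assumed file layout).
    with open(f"traces-fixtures/decision-2-outputs/{task_id}.json") as f:
        return json.load(f)["output"]


@pytest.fixture
def refund_cases() -> list[LLMTestCase]:
    return load_golden_cases(category="refund_request")
```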
What a passing DeepEval run looks like. When the lab is wired correctly, deepeval test run evals/output/test_tier1_support.py produces a structured output. The shape, illustrative (real output formats evolve with DeepEval versions):
======================== DeepEval Test Run ========================
Test: test_refund_requests examples: 20 passed: 20 failed: 0
Test: test_account_inquiries examples: 10 passed: 10 failed: 0
Test: test_technical_issues examples: 8 passed: 7 failed: 1
Test: test_escalation_requests examples: 7 passed: 7 failed: 0
Test: test_policy_questions examples: 5 passed: 5 failed: 0
Failure detail (test_technical_issues, example tech_007):
AnswerRelevancy: 0.82 (threshold: 0.70) ✓
Faithfulness: 0.75 (threshold: 0.80) ✗ — agent claimed feature X exists; not in context
Hallucination: 0.35 (threshold: 0.30) ✗ — invented version number "v2.4.1" in response
TaskCompletion: 0.65 (threshold: 0.70) ✗ — did not specify next step
Grader rationale (Faithfulness): "The response references 'real-time
sync mode' as an available option, but the provided context describes
only batch sync. The claim is not supported by the retrieved policy
documentation."
OVERALL: 49/50 passed (98%). Regression check: 0 critical-metric
regressions vs baseline. ✓ Safe to merge.
The example above shows what a useful eval output looks like: per-test pass counts, per-metric breakdown for failures, the grader's rationale explaining why a metric failed. A reader skimming this output knows immediately what to fix — the agent invented real-time sync mode and v2.4.1, both hallucinations specific to one example, and the fix is in the prompt's policy-context instructions.
What a trace-grading rubric returns. Decision 3 adds trace-level evaluation. The OpenAI Agent Evals trace-grading return shape, illustrative:
{
"example_id": "refund_T1-S014",
"rubric": "tool_selection",
"score": 2,
"max_score": 5,
"rationale": "The agent's first tool call was refund_issue, but the
correct first action for this task is customer_lookup to verify
account context before issuing the refund. The agent reasoned: 'The
customer mentioned the charge so I'll process the refund directly'
— this skips the verification step the standing instruction in
docs/grader-rubrics.md requires.",
"trace_url": "https://platform.openai.com/traces/r-2026-05-13-014",
"metadata": {
"model": "gpt-4o-2024-08",
"grader": "claude-opus-4-7",
"graded_at": "2026-05-13T14:23:17Z"
}
}
The score (2/5), the rationale (specific behavior explanation), and the trace URL (one click to inspect the full execution) are the three things that make a trace-grading return actionable rather than just diagnostic. The team's response: read the rationale, decide if the rubric is right, click the trace URL, see what happened, decide the fix layer. Same diagnostic cycle as the DeepEval example, one layer deeper.
Bottom line of Decision 2: DeepEval makes evals part of the developer's daily workflow. After Decision 2, every agent change runs the eval suite; regressions on critical metrics block merges. This is the discipline TDD gave SaaS, applied to behavior. The four-metric starter suite catches obvious output failures; Decisions 3-5 add the layers it misses.
Decision 3: Trace evals with OpenAI Agent Evals (including trace grading)
In one line: set up OpenAI Agent Evals with its trace-grading capability (datasets and model-vs-model comparison via Agent Evals; trace-level assertions via trace grading) on the Tier-1 Support agent; run rubrics for tool-selection correctness, reasoning soundness, and handoff appropriateness against the golden dataset.
Simulated track for Decision 3: rather than running a live OpenAI Agents SDK loop, generate pre-recorded traces once with a small harness that wraps DeepSeek-chat (or gpt-4o-mini) in the OpenAI Agents SDK's trace-emit format, writes them to
traces-fixtures/decision-3-traces/, and bulk-imports them into Agent Evals via its dataset-upload endpoint. The trace-grading capability then evaluates the imported traces just as it would live ones. Cost: only the LLM-as-judge inference fees plus the one-time pre-record. Cache to disk so re-runs are free.
Output evals catch the obvious failures; trace evals catch the failures hiding behind correct-looking outputs. Decision 3 is where Concept 3's wrong-customer refund example becomes catchable in CI rather than detectable only at audit time. The setup (Agent Evals platform + trace-grading capability) is the canonical OpenAI ecosystem configuration.
What you do — Plan, then Execute. In your agentic coding tool, switch to plan mode. Paste the brief below, save the plan to docs/plans/decision-3.md, review, execute.
OpenAI Agent Evals with trace grading on the Tier-1 Support agent. Requirements:
- Upload the golden dataset to OpenAI Agent Evals. Push
`datasets/golden.json` to the Agent Evals platform. Verify the dataset structure matches the platform's expected schema; document the upload step in `evals/openai/dataset-upload.md`.
- Verify trace export is working. The OpenAI Agents SDK produces traces by default; confirm they're reaching the OpenAI platform's trace-grading surface. Check by running one agent invocation and confirming the trace appears in the platform UI.
- Create three trace-level rubrics in
evals/trace/rubrics/:
  - `tool_selection.md` — given the task and the chosen tool sequence, was the first tool call appropriate? (Catches the wrong-tool-as-first-step failure.)
  - `reasoning_soundness.md` — between tool calls, did the agent's reasoning correctly interpret the tool results? (Catches the wrong-customer disambiguation failure.)
  - `handoff_appropriateness.md` — if the agent handed off to a specialist (Tier-2, Manager, Legal), was the handoff target correct and was sufficient context passed? (Catches handoff failures.)
- Create three output-level rubrics in OpenAI Agent Evals in `evals/openai/output-rubrics/`: answer correctness against the dataset's expected_behavior, format compliance against the response-template spec, and tone-appropriateness against the customer-facing voice guide. These are the output evals that benefit from Evals's hosted experiment tracking (vs. DeepEval's repo-level discipline).
- Map golden dataset examples to the right capability. Output-level evaluation goes through Agent Evals' output-grading runs; trace-level evaluation goes through trace grading; the dataset is shared. Document the routing in `evals/openai/routing.yaml`.
- Configure the graders. Use Claude Opus or GPT-4-class as the grader for both capabilities (same as Decision 2). Each grader receives the relevant artifact (output for Agent Evals, full trace for trace grading) plus the rubric and produces a score (1-5) plus a rationale.
- Run evals. For each dataset example, invoke the agent, submit output to Agent Evals, submit trace to trace grading, collect both scores. Use the OpenAI SDK clients for both.
- Aggregate scores into reports/openai-baseline.md. Track per-rubric averages, per-category averages, and the distribution of low scores split by capability (Agent Evals output scores vs trace-grading trace scores).
- Wire to CI. Both capabilities are more expensive than DeepEval, so run them on every PR that touches the agent's prompts, model selection, or tool definitions — but not on every commit. Configure the GitHub Action to call both Evals and Trace Grading endpoints.
- Set up the model-comparison workflow (Agent Evals-specific). When a model upgrade lands, run the full eval suite against both the current and candidate model; Agent Evals produces the comparison report directly. Document this as
`scripts/compare-models.sh`.
- Add a "trace eval debug" workflow. When a trace eval fails, the developer needs to see the trace. Generate a link to the trace-grading UI in the eval report. This is the diagnostic primitive.
*Bottom line of Decision 3: OpenAI Agent Evals with trace grading runs the output and trace eval layers in OpenAI's hosted ecosystem. Agent Evals manages the dataset, runs hosted output evals, and produces model-vs-model comparisons; trace grading reads agent traces directly and runs trace-level assertions against rubrics. Together they catch the failures invisible to output-only evaluation (Concept 3) and the failures invisible to repo-level evaluation (regression checks across models that require centralized infrastructure). For agents on the OpenAI Agents SDK, this pair is the natural fit; for Claude Managed Agents, the equivalent setup uses Phoenix's evaluator framework as the trace-grading layer — see the Decision 3 Claude-runtime sidebar below.*
Decision 3 sidebar — the Claude Managed Agents adaptation. For readers whose workers run on Claude Managed Agents rather than OpenAI Agents SDK, the same Decision 3 outcome is reachable through Phoenix's evaluator framework. The brief, for Plan-then-Execute:
Set up trace evals on the Tier-1 Support agent running on Claude Managed Agents, using Phoenix as the trace-grading layer. Requirements: (1) confirm Phoenix is receiving OpenTelemetry traces from the Claude Managed Agents runtime (it should be by default; see the Phoenix Claude integration docs). (2) Create the same three trace-level rubrics from the OpenAI path — tool_selection.md, reasoning_soundness.md, handoff_appropriateness.md — but stored as Phoenix evaluator definitions rather than OpenAI rubric configs. (3) Use the same LLM-as-judge backend (Claude Opus or GPT-4-class) configured via Phoenix's evaluator API. (4) Run the evaluators against the captured traces; Phoenix produces per-rubric scores in the same shape OpenAI's trace grading does. (5) Wire to CI: instead of calling the OpenAI Trace Grading API on each PR, call Phoenix's evaluator API. (6) The dataset, rubrics, graders, and CI integration are unchanged — only the platform hosting the trace evaluation changes.
The architectural truth: the eval discipline doesn't depend on which runtime your agents use. OpenAI's Agent Evals is the tightest-fit eval surface for OpenAI-native agents because the traces already live there; Phoenix is the natural eval surface for Claude Managed Agents because OpenTelemetry-native tracing was a deliberate architectural choice. Both produce equivalent eval suites. Choose based on where your agents already run, not on which platform's marketing materials you've read most recently.
Decision 4: Tool-use and safety evals (the envelope check for Claudia)
In one line: write evals specific to tool-use correctness (Concept 6) and envelope-respect (Concept 6 of Course Eight) for Claudia's signed-delegation decisions; verify the envelope check catches violations.
Simulated track for Decision 4: Claudia's pre-recorded decisions on 40 example approval requests are at
traces-fixtures/decision-4-claudia-decisions/. The 5-10 red-team adversarial examples are included with annotations. The envelope-respect safety eval runs against the recorded decisions directly, no live OpenClaw setup needed.
Concept 6's envelope check from Course Eight — does Claudia stay within her delegated envelope? — is a safety eval, in Course Nine's vocabulary. Decision 4 wires the eval that verifies this. The architectural commitment: Claudia's eval suite catches envelope violations before they reach production, the same way Paperclip's runtime check catches them at execution time.
What you do — Plan, then Execute. Plan mode; brief; save to docs/plans/decision-4.md; review; execute.
Tool-use and safety evals for Claudia's delegated-governance decisions. Requirements:
- Build a dataset of approval requests at
`datasets/claudia-delegation.json`. Include refund requests across the spectrum: below the ceiling (should auto-approve), at the ceiling (edge case), above the ceiling (should surface), envelope-extension hires (should always surface), terminations (should always surface). 40 examples minimum.
- Implement a tool-use correctness metric. For each example, capture which tools Claudia called (polling, instruction retrieval, signing, posting). Compare against the expected tool sequence. Score per-example: did she call the right tools in the right order with the right arguments?
- Implement an envelope-respect safety eval. Custom DeepEval metric `EnvelopeRespectMetric` that takes the request, Claudia's decision, and the delegated envelope JSON (a minimal sketch follows this brief). Returns pass if Claudia's decision is within the envelope; fail if outside. This is the eval that catches envelope violations before they ship.
- Implement a confidence-vs-action consistency check. Claudia reports a confidence score (Concept 11 of Course Eight). Verify that low-confidence decisions get surfaced rather than autonomously approved. A decision with confidence < 0.7 that was autonomously approved is a safety eval failure.
- Verify the audit-trail consistency. For each decision, confirm both `activity_log` (with actor=owner_identic_ai) and `governance_ledger` (with principal=owner_identic_ai) rows exist and are consistent. Missing rows or inconsistent attribution are critical safety failures.
- CI integration. Safety evals are critical metrics: a regression blocks the merge, no exceptions. Document this in `docs/critical-metrics.md`.
- A red-team set. Add 5-10 "adversarial" examples: requests crafted to be just at the envelope boundary, requests where the standing instructions are ambiguous, requests where the historical pattern suggests one thing and the standing rule suggests another. These probe whether Claudia (and the eval suite) handles edge cases correctly.
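A minimal sketch of the envelope-respect metric as a deterministic custom DeepEval metric. The envelope and decision field names are illustrative and are assumed to ride on the test case's additional_metadata.

```python
# Minimal sketch of a deterministic envelope-respect metric as a custom DeepEval metric.
# The envelope/decision field names and their location on additional_metadata are illustrative.
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class EnvelopeRespectMetric(BaseMetric):
    def __init__(self, threshold: float = 1.0):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        meta = test_case.additional_metadata or {}
        envelope = meta["envelope"]    # e.g. {"refund_ceiling": 2000}
        decision = meta["decision"]    # e.g. {"action": "auto_approve", "amount": 2400}
        within = (
            decision["action"] == "surface_to_owner"
            or decision["amount"] <= envelope["refund_ceiling"]
        )
        self.score = 1.0 if within else 0.0
        self.reason = "within envelope" if within else "auto-approved above the ceiling"
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Envelope Respect"
```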
Bottom line of Decision 4: safety evals on Claudia's delegated-governance decisions verify the envelope check at eval time rather than waiting for the runtime check to catch violations. Tool-use correctness verifies the right tools were called in the right order. Envelope-respect verifies decisions stayed within the delegated bounds. Confidence-vs-action consistency verifies low-confidence decisions get surfaced. The combination prevents the safety failures Course Eight Concept 7 named as the load-bearing risk.
PRIMM — Predict before reading on. Claudia (Maya's Owner Identic AI from Course Eight) processes 50 routine refund requests over a week. All 50 stay within her delegated envelope ($2,000 ceiling, no priors, account >2 years). The output evals (Decision 2) score 5/5 on all 50. The tool-use evals (Decision 3) score 5/5 on all 50. The envelope-respect safety eval (Decision 4) scores 5/5 on all 50.
Three weeks later, an audit reveals that 8 of those 50 refunds went to customers whom Maya — if she'd reviewed them herself — would have escalated to a senior reviewer, not auto-approved. Maya's standing pattern, learned over 200 prior decisions, would have caught these. Claudia did not.
Which eval layer should have caught this? Pick one before reading on:
- Output evals — the responses should have signaled uncertainty
- Trace evals — Claudia's reasoning should have flagged the pattern mismatch
- Safety evals — the envelope check missed something
- None of the above — this is what Concept 14 names as a fundamental limit
The answer, with reasoning, lands at the end of Decision 6 (regression evals + CI/CD).
Decision 5: RAG evals with Ragas on TutorClaw
In one line: introduce TutorClaw (a knowledge-agent that answers questions about the Agent Factory book using retrieval over the book's content); set up Ragas with all five RAG metrics; run against a knowledge-agent golden dataset.
Simulated track for Decision 5: the starter repo ships a pre-indexed vector store of the Agent Factory book (in
traces-fixtures/agent-factory-book-vectors.qdrant.tar.gz) plus a minimal TutorClaw stub that does retrieval and answer generation. The 30 golden examples have pre-recorded retrieval results so Ragas can grade them without running the embedding model live. The five Ragas metrics produce the same diagnostic patterns; only the substrate is pre-built.
This Decision introduces the only fresh agent in the lab — TutorClaw, a teaching agent that does retrieval-augmented generation over the Agent Factory book. Maya's customer-support agents in Courses 5-8 do some retrieval but aren't primarily RAG agents; TutorClaw is. The reason for the cameo: Ragas's specialized metrics deserve an agent that exercises them genuinely. The patterns transfer to any knowledge-heavy agent in Maya's company that needs them.
What you do — Plan, then Execute. Plan mode; brief; save to docs/plans/decision-5.md; review; execute.
Ragas evaluation on TutorClaw, a knowledge-agent that retrieves from the Agent Factory book. Requirements:
- Set up TutorClaw. A minimal RAG agent that: (a) receives a question about the Agent Factory book, (b) retrieves relevant chunks from a vector store of the book content, (c) generates an answer grounded in the retrieved chunks. The starter code for TutorClaw is at
agents/tutorclaw/; install dependencies and configure the embedding model. For the vector store, pick one of three reasonable backends depending on your existing infrastructure: pgvector (a PostgreSQL extension; recommended if your team already runs Postgres, since it adds vector search to the database you already operate); Qdrant (a dedicated open-source vector DB; recommended if you want a purpose-built vector store with strong filtering and metadata-search features); or any MCP-served knowledge layer (recommended if you completed Course Four's system-of-record discipline and want to keep the same MCP pattern). Ragas works with all three because it evaluates the retrieval results the agent receives, not the vector store implementation; the eval suite is portable across backends.- Build a TutorClaw golden dataset at
datasets/tutorclaw-golden.json. 30 examples covering: questions answerable from a single chapter (easy retrieval), questions requiring synthesis across chapters (hard retrieval), questions about concepts the book doesn't cover (should be "I don't know" rather than hallucination), questions with subtle answer differences from naive interpretation (test grounding rigor).- Implement the five Ragas metrics: Context Relevance, Faithfulness, Answer Correctness, Context Recall, Context Precision. Use Ragas's built-in implementations; configure with the same LLM-as-judge backend as the other evals.
- Run Ragas on the dataset. For each example, invoke TutorClaw, capture the retrieved chunks and the answer, submit to Ragas evaluators, collect scores.
- Interpret the score patterns. Document common patterns: Context Recall low + Answer Correctness low = retrieval missed key facts (fix the chunking strategy); Context Recall high + Faithfulness low = agent invented claims (fix the grounding prompt); Context Precision low = retrieval returned too much noise (fix the embedding model or chunk size).
- CI integration. Run Ragas on every PR that touches TutorClaw's prompt, the chunking strategy, the embedding model, or the book content. The score distribution should not regress.
- Document the diagnostic playbook. For each Ragas metric, name the production failure mode it catches and the architectural intervention to fix it. This is the operationalization of Concept 7.
Bottom line of Decision 5: Ragas's five-metric framework decomposes knowledge-agent failures into their components — retrieval failure, grounding failure, citation failure. TutorClaw is the example agent that exercises all five metrics genuinely. The diagnostic playbook turns Ragas scores into specific architectural interventions: fix chunking, fix grounding prompt, fix embeddings. The same patterns transfer to any agent in Maya's company that does retrieval before answering.
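That playbook can even be encoded as a small triage helper, so every eval report carries a suggested fix next to the scores. The 0.7 cut-offs below are placeholders to calibrate against your own baseline.

```python
# Sketch: map a Ragas score pattern to the diagnostic playbook's suggested fix.
# The 0.7 cut-offs are placeholders; calibrate them against your own baseline scores.
def diagnose(scores: dict[str, float]) -> str:
    def low(metric: str) -> bool:
        return scores.get(metric, 1.0) < 0.7

    if low("context_recall") and low("answer_correctness"):
        return "Retrieval missed key facts: revisit chunking or the retrieval strategy."
    if not low("context_recall") and low("faithfulness"):
        return "Agent invented claims beyond the context: tighten the grounding prompt."
    if low("context_precision"):
        return "Retrieval returned too much noise: adjust chunk size or the embedding model."
    if not low("faithfulness") and low("answer_correctness"):
        return "Grounded but wrong interpretation: review the answer-generation prompt."
    return "No dominant failure pattern: inspect individual examples by hand."
```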
Decision 6: Regression evals and CI/CD wiring
In one line: connect all the eval suites built so far (Decisions 2-5) into a unified CI/CD workflow that runs on every PR, compares against the baseline, and blocks merges when critical metrics regress.
Simulated track for Decision 6: the CI workflow runs against the same pre-recorded fixtures from Decisions 2-5, so the regression check, baseline comparison, and merge-blocking logic all work end-to-end without any live agent calls. A "synthetic regression" is provided in
traces-fixtures/decision-6-regression-injection.json— a deliberately-degraded output set you can use to verify the regression detector fires correctly before trusting it on real changes.
Concept 12 will take up the eval-improvement loop conceptually. Decision 6 wires the infrastructure for that loop: regression detection, baseline management, automated reporting. This is the Decision that turns "we have evals" into "we ship with confidence."
What you do — Plan, then Execute. Plan mode; brief; save to docs/plans/decision-6.md; review; execute.
Unified CI/CD wiring for the regression eval pipeline. Requirements:
- Define the regression check. A regression is a critical-metric score that decreased by more than a configurable threshold (default 5%) compared to the baseline at
`reports/baseline.md`. Document critical metrics in `docs/critical-metrics.md` (which ones, why each is critical, the acceptable regression tolerance).
- Build the unified runner at `scripts/run-all-evals.sh`. Runs Decisions 2-5's eval suites in sequence, aggregates scores, produces `reports/eval-{date}.md` with the full breakdown.
- Build the regression comparator at `scripts/check-regressions.py` (core logic sketched after this brief). Reads the latest report and the baseline; flags any critical-metric regression beyond tolerance; produces a regression summary.
- Wire to GitHub Actions (or equivalent CI). Workflow runs on every PR that touches `agents/`, `prompts/`, `evals/`, `datasets/`, or the agent runtimes. Stages:
- Stage 1: traditional tests (
pytest) — fast feedback.- Stage 2: DeepEval output evals — runs on every PR.
- Stage 3: trace evals (Trace Grading) — runs on PRs that touch prompts, models, or tool definitions.
- Stage 4: safety evals — always runs on every PR; critical.
- Stage 5: Ragas evals — runs on PRs that touch TutorClaw or knowledge agents.
- Stage 6: regression check — compares against baseline; flags regressions.
- Baseline management. When a PR intentionally improves a metric, the baseline updates. Document the baseline-update workflow: the PR reviewer must explicitly approve a baseline change; the change is recorded in
`reports/baseline-history.md`.
- Eval cost budget. Track the cumulative LLM-as-judge cost per CI run. Configure a soft warning at $5/run and a hard cap at $20/run; PRs exceeding the cap go to a slower, more selective eval suite. Cost discipline is part of the discipline.
- The merge-blocking rule. A regression on a critical metric blocks the merge. Document the override workflow: a maintainer can explicitly override with a stated reason, recorded in the PR; otherwise, no merge.
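The comparator's core logic fits in a few lines. The sketch below assumes scores are also exported as simple metric-name-to-score JSON files alongside the markdown reports; the real script can parse the reports instead.

```python
# Core logic sketch for scripts/check-regressions.py. Assumes baseline and latest scores
# are available as {"metric_name": score} JSON files; the 5% tolerance mirrors the brief.
import json
import sys

TOLERANCE = 0.05  # 5% relative drop allowed before a critical metric counts as regressed


def regressions(baseline: dict[str, float], latest: dict[str, float], critical: set[str]) -> list[str]:
    flagged = []
    for metric in critical:
        base, now = baseline.get(metric), latest.get(metric)
        if base and now is not None and now < base * (1 - TOLERANCE):
            flagged.append(f"{metric}: {base:.3f} -> {now:.3f}")
    return flagged


if __name__ == "__main__":
    baseline = json.load(open("reports/baseline.json"))
    latest = json.load(open("reports/latest.json"))
    critical = set(json.load(open("docs/critical-metrics.json")))
    failed = regressions(baseline, latest, critical)
    print("\n".join(failed) or "No critical-metric regressions.")
    sys.exit(1 if failed else 0)  # non-zero exit blocks the merge in CI
```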
Bottom line of Decision 6: the regression eval pipeline is the discipline that turns the eval suite from "documentation of failure modes" into "shipping gate." Critical metrics with tolerance budgets, automated regression detection, blocked merges on regression, explicit baseline management, cost discipline. After Decision 6, the eval suite is enforced; before Decision 6, the eval suite is hoped-for.
The answer to Decision 4's PRIMM Predict. The honest answer is (4): none of the above — this is the fundamental limit Concept 14 names. Claudia's decisions passed every eval layer because the eval suite measured what was in the dataset: respect for the explicit envelope ($2,000 ceiling, no priors, account >2 years), tool-use correctness, output quality. None of those measures whether Claudia's pattern matches Maya's pattern at the edges the dataset didn't cover. This is the alignment-at-edge-cases gap from Concept 14: pattern-matching reliability is evaluable; alignment with the principal's actual judgment on novel edge cases is not, fully. The trace-to-eval pipeline (Concept 13 + Decision 7) is the operational response — when an audit catches a misalignment like this, those 8 cases get promoted into the golden dataset, the safety evals grow to cover the new pattern, and the next drift in this category gets caught. The discipline is iterative; the eval suite gets sharper over time. It never becomes complete. Teams that internalize this ship better than teams that don't.
Decision 7: Production observability with Phoenix
The answer to Decision 1's PRIMM Predict. The honest answer is (3): the dataset was missing the failure category that hit production. All four options are real risks, but option 3 is by far the most common. Misconfigured frameworks (option 1) are caught quickly because the scores look obviously broken. Prompt drift faster than dataset updates (option 2) is real but typically caught by regression evals. Grader inconsistency (option 4) is real but produces noisy scores rather than systematic blind spots. The dataset's category coverage is what determines what your eval suite can see — and a six-months-old dataset has almost certainly drifted from production's actual failure distribution. This is exactly why Decision 7 (production observability + trace-to-eval pipeline) is not optional. Sample real production traffic; triage; promote; the dataset stays current. The team that ships only Decision 1's initial dataset is shipping a snapshot of what they imagined production looked like at one point in time.
In one line: install Phoenix locally (Docker), wire it to receive traces from the agent runtimes, configure dashboards for the four-tool stack's outputs, and set up the trace-to-eval feedback loop.
Simulated track for Decision 7: the starter repo ships a "production trace replay" script that streams pre-recorded traces from `traces-fixtures/production-week/` into Phoenix at realistic intervals — simulating a week of production traffic in ~10 minutes. Dashboards populate, drift detection fires on an injected drift event, the trace-to-eval promotion queue receives sampled traces, and you can practice the triage ritual on the queue. The operational discipline is identical; only the source of traffic changes.
The final Decision closes the loop. Phoenix watches production; production failures become future eval examples; the eval suite gets sharper over time. This is the operational discipline Concept 13 takes up conceptually.
What you do — Plan, then Execute. Plan mode; brief; save to docs/plans/decision-7.md; review; execute.
Phoenix production observability with the trace-to-eval feedback pipeline. Requirements:
- Install Phoenix as a Docker service. Use the official `arize-phoenix` image; configure persistent storage so trace data survives container restarts.
- Wire trace export from the agent runtimes. The OpenAI Agents SDK supports OpenTelemetry-compatible trace export; configure the agents to send traces to Phoenix in addition to the OpenAI platform. (Simulated mode: replay pre-recorded traces from `traces-fixtures/` into Phoenix.)
- Configure dashboards. Build at least three:
- Agent health dashboard: pass rates per agent role, per task category, per metric. Refresh every 5 minutes from recent traces.
- Cost and latency dashboard: cost per task, per agent role; p50/p95 latencies; outlier detection.
- Drift detection dashboard: trailing 7-day average of each critical metric. Alert when a metric drifts more than 10% from the trailing 30-day baseline.
- Configure trace sampling for eval dataset construction. Define a sampling rule that captures (a) every trace where the agent encountered an error, (b) every trace flagged by user feedback (downvote, reopened ticket), and (c) a random 1% of normal traces for baseline coverage. Save sampled traces to `production-samples/`.
- Build the production-to-eval pipeline at `scripts/promote-trace-to-eval.py`. It reads a sampled trace, constructs a candidate eval example (input, customer context, the actual agent behavior), and prompts for human review (the reviewer either accepts the example into the golden dataset or rejects it with reasoning). A minimal sketch of this script follows the list.
- Schedule the promotion ritual. Once a week, run the promotion pipeline on the last 7 days of sampled traces. The team reviews candidates and accepts/rejects. The golden dataset grows organically from production rather than from imagination.
- Document the operational discipline. What gets sampled, what gets promoted, who reviews, how the baseline shifts. Phoenix is the tooling; the discipline is the team practice. Concept 13 names where most teams under-invest in this discipline.
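A minimal sketch of the promotion step follows. The trace schema, file locations, and field names (`input`, `final_output`, `tool_calls`) are illustrative assumptions rather than the starter repo's actual format; what matters is the shape of the ritual: draft, review, accept or reject with a recorded reason.

```python
# Sketch of scripts/promote-trace-to-eval.py (illustrative schema, not prescriptive).
# Reads one sampled trace, drafts a candidate golden-dataset example, and asks a human
# reviewer to accept it (append to the dataset) or reject it with a recorded reason.
import json
import sys
from pathlib import Path

DATASET = Path("datasets/golden.jsonl")        # hypothetical dataset location
REJECTIONS = Path("datasets/rejections.log")

def draft_example(trace: dict) -> dict:
    return {
        "input": trace["input"],                         # the task the agent was given
        "customer_context": trace.get("customer", {}),   # context the agent had
        "observed_behavior": trace["final_output"],      # what the agent actually did
        "expected_behavior": "",                         # reviewer fills this in
        "expected_tools": [c["tool"] for c in trace.get("tool_calls", [])],
        "difficulty": "unrated",
        "source": trace.get("trace_id", "unknown"),
    }

if __name__ == "__main__":
    trace = json.loads(Path(sys.argv[1]).read_text())    # e.g. a file from production-samples/
    DATASET.parent.mkdir(parents=True, exist_ok=True)
    example = draft_example(trace)
    print(json.dumps(example, indent=2))
    if input("Promote to golden dataset? [y/N] ").strip().lower() == "y":
        example["expected_behavior"] = input("Expected behavior (one sentence): ")
        with DATASET.open("a") as f:
            f.write(json.dumps(example) + "\n")
        print("Promoted.")
    else:
        with REJECTIONS.open("a") as f:
            f.write(f"{example['source']}: {input('Rejection reason: ')}\n")
```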
Bottom line of Decision 7: Phoenix is the production observability layer that closes the eval-improvement loop. Traces from real agent runs flow in; dashboards surface drift and degradation; sampled traces become candidates for the golden dataset; the team reviews and promotes weekly. After Decision 7, the eval suite is not static — it grows from production. A reader who completes Decision 7 has an operational EDD pipeline across all four eval layers — output, trace, RAG, and observability — covering the Course 3-8 invariants the dataset captures. The discipline of expanding that coverage over time is Concepts 11-13.
Decision 7 sidebar — when and how to migrate from Phoenix to Braintrust. For teams running Phoenix in production who hit one of the three migration signals from Concept 10 (multi-team eval workspace needed, eng-hours on Phoenix infrastructure exceeding what a commercial subscription would cost, collaborative annotation workflows missing), the migration path is straightforward because both products consume OpenTelemetry-compatible traces. The migration brief, for when you're ready:
Migrate from Phoenix to Braintrust without losing trace history or eval continuity. Requirements: (1) export the trace dataset from Phoenix's storage backend (Phoenix supports a JSON export of all traces with their metadata); (2) provision a Braintrust workspace and import the trace dataset; (3) port the dashboard definitions — agent health, cost/latency, drift detection — from Phoenix's UI to Braintrust's equivalent views; (4) reconfigure the agent runtimes' OpenTelemetry exporters to send to Braintrust instead of (or in parallel with) Phoenix; (5) port the trace-to-eval promotion pipeline (`scripts/promote-trace-to-eval.py` from Decision 7) to read from Braintrust's API instead of Phoenix's; (6) run both observability layers in parallel for at least two weeks to verify trace ingestion matches and dashboards produce comparable signals; (7) decommission Phoenix once verification is complete. The migration is mechanical because the eval architecture doesn't change — same trace format, same dataset, same metrics, same promotion ritual. What changes is the operational ergonomics, not the discipline. A team comfortable with Decision 7's Phoenix setup is comfortable with Braintrust within a week of switching.
Part 5: Honest Frontiers
Parts 1-3 built the conceptual architecture. Part 4 walked the implementation. Part 5 takes up the parts of eval-driven development that are still hard, still emerging, or still genuinely unsolved as of May 2026. Pretending evals close every gap in agent reliability would be dishonest pedagogy. This Part is the honest map of where the discipline is solid, where it's improving rapidly, and where it has real limitations. Four Concepts.
Concept 11: Golden dataset construction — the most undervalued artifact
The eval frameworks are tooling. The golden dataset is the load-bearing artifact. A beautiful eval suite on a bad dataset measures the wrong thing with rigor; a modest eval suite on a good dataset surfaces the failures that matter. Most teams underspend on dataset construction and overspend on framework selection. Concept 11 inverts that.
What makes a dataset "good" for agent evaluation.
The dimensions that matter, ranked roughly by importance:
- Representativeness. Does the dataset reflect the actual distribution of production traffic? An agent that gets 70% refund requests, 20% account inquiries, and 10% miscellaneous in production needs a dataset weighted similarly. A dataset that's 33%/33%/33% gives every category equal eval coverage — which means category-specific regressions in the highest-traffic category are diluted. The eval suite must protect the production-weighted failure modes.
- Edge case coverage. The dataset must include the cases where the agent is most likely to fail — not because they're common, but because they're consequential. Adversarial customer messages, ambiguous instructions, edge-of-envelope decisions, cross-category questions, low-context inputs. Edge cases are the failures that hurt; representative datasets miss them by definition. A good dataset stratifies: 70% representative cases (to catch common-mode regressions) plus 30% edge cases (to catch the dangerous failures).
- Difficulty stratification. Tag every example with a difficulty (easy/medium/hard). When the eval suite reports "we pass 85% overall," the right diagnostic is "we pass 95% on easy, 80% on medium, 60% on hard." Without stratification, the team can't tell whether their improvements are touching the failure modes that matter or just easy-mode improvements. Difficulty stratification turns one score into a diagnostic.
- Ground truth quality. Every example needs a clear specification of what "correct behavior" looks like. This is harder than it sounds. For some tasks (factual lookups), the ground truth is straightforward. For others (judgment calls about whether to escalate, how to phrase a delicate response), the ground truth itself requires judgment. The ground truth is the most expensive part of the dataset to construct, and the part most subject to bias. Course Nine's discipline: ground truth is reviewed by multiple humans before going into the dataset; disagreements are documented in the example rather than papered over.
- Source diversity. Examples sourced only from one customer support shift, or only from one product team, or only from one demographic of users, will have systematic blind spots. The dataset should sample across time, across customer segments, across task channels (chat, email, voice). Source-monoculture is a dataset failure mode that produces evals that pass while production fails.
- Version control and change discipline. The dataset is code. It lives in git, gets reviewed in PRs, has a documented change protocol. Adding examples is routine; modifying examples (especially the `expected_behavior` or `expected_tools` fields) requires explicit review, because changes there change what "correct" means. A team that treats the dataset as throwaway loses the ability to reason about whether agent improvements are real. (A sketch of one example in this format appears just below.)
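To ground these dimensions, here is what a single golden-dataset entry might look like. The field names echo the canonical format this course uses (task, context, expected behavior, expected tools, unacceptable patterns), but the exact schema and values are illustrative rather than mandated; the entry shown is the customer-disambiguation case that Concept 12 walks in detail.

```python
# One illustrative golden-dataset entry (the schema is an example, not a mandated format).
example = {
    "id": "refund_request-0041",
    "category": "refund_request",             # enables production-weighted coverage checks
    "difficulty": "hard",                      # easy / medium / hard stratification
    "input": "Hi, I was double-charged $89 last week. Can you refund it?",
    "customer_context": {"email": "sarah@example.com", "matching_accounts": 3},
    "expected_behavior": (
        "Disambiguate which of the matching accounts holds the disputed charge before "
        "calling any action tool; escalate to a human if disambiguation is impossible."
    ),
    "expected_tools": ["customer_lookup"],     # refund_issue only after disambiguation
    "unacceptable_patterns": ["refund_issue called on the first lookup result"],
    "ground_truth_review": {
        "reviewers": 2,
        "disagreement": "One reviewer preferred auto-disambiguation by charge amount; "
                        "recorded here rather than papered over.",
    },
    "source": "production trace, promoted via Decision 7",
}
```

Note the difficulty tag, the category for production-weighted coverage, and the documented reviewer disagreement: each quality dimension above shows up as a field, which is what makes the dataset reviewable as code.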
Where datasets fail in practice.
Five common patterns, each one a failure mode Course Nine's discipline names directly:
- The Imagination Trap. The team sits down to write the dataset based on what they think customers ask. The resulting examples reflect the team's mental model, not the actual distribution. The eval suite passes; production fails. Fix: source examples from production traces (or in simulated mode, from the provided trace fixtures). Imagined examples are decorative.
- The Easy-Mode Bias. When humans write dataset examples by hand, they unconsciously favor examples they can confidently grade. Hard cases — ambiguous, judgment-requiring, edge-of-policy — are skipped because the grader can't decide what the right answer is. The dataset ends up easy-biased; the agent passes; production failures cluster in the cases that weren't in the dataset. Fix: explicitly carve out 30% of the dataset for hard cases; accept that some ground-truth answers will require team consensus rather than individual judgment.
- The Single-Author Problem. One person writes all the examples. Their blind spots become the dataset's blind spots. Fix: multi-author construction; cross-review; explicit accountability for category coverage.
- The Stale-Dataset Problem. The dataset was constructed six months ago. The product has changed; customer questions have shifted; the agent's tool set has evolved. The dataset is now measuring a previous era of the agent. Fix: continuous dataset growth via the production-to-eval pipeline (Decision 7's trace promotion); quarterly review of the full dataset for relevance.
- The Pass-Threshold Inflation Problem. The team set thresholds at agent launch (e.g., "we pass if relevancy > 0.7"). Over time, as the agent improves, scores cluster at 0.85+. The eval suite has effectively become a checkbox — everything passes; regressions go unnoticed because the thresholds are too lax. Fix: thresholds tighten over time as the agent improves; "improvement" includes raising the bar.
The economics of dataset construction.
Dataset construction is expensive — both in human time and in coordination. A team that starts with 50 examples and grows the dataset organically through production promotion (Decision 7) will, over a year, accumulate 500-1,000 examples without ever sitting down for a "dataset construction sprint." This is the recommended path. Top-down dataset construction by mass annotation works but is expensive, slow, and often produces low-quality examples because the annotators are guessing rather than seeing real failures.
Quick check. Of the five dataset failure modes named above, which one is most likely to make the eval suite score look better than the agent actually is in production? Pick the one whose effect is specifically "false confidence," not just "missed coverage."
1. The Imagination Trap
2. Easy-Mode Bias
3. Single-Author Problem
4. Stale-Dataset Problem
5. Pass-Threshold Inflation
Answer: (2) Easy-Mode Bias is the worst for false confidence specifically. When humans skip hard cases because grading them is ambiguous, the dataset becomes dominated by easy cases the agent passes reliably — and the team reads high pass rates as "the agent is reliable" when what they're actually measuring is "the agent handles easy cases reliably." (1) Imagination Trap misses categories entirely (visible as production failures the team doesn't recognize from their evals). (3) Single-Author Problem produces systematic gaps but not specifically inflated scores. (4) Stale-Dataset Problem produces gradual drift, which is detectable. (5) Pass-Threshold Inflation is real but visible (thresholds are explicit). Easy-Mode Bias is the failure mode that quietly makes the eval suite a worse signal over time without anyone noticing — which is exactly why Concept 11 names the explicit 30%-hard-cases discipline as the fix.
*Bottom line: the golden dataset is the most undervalued artifact in eval-driven development. Quality dimensions: representativeness, edge case coverage, difficulty stratification, ground truth quality, source diversity, version control discipline. Five common failure modes: the Imagination Trap (writing what you imagine customers ask), Easy-Mode Bias (skipping hard cases), Single-Author Problem (one person's blind spots become the dataset's), Stale-Dataset Problem (six months out of date), Pass-Threshold Inflation (thresholds don't tighten as the agent improves). The recommended growth path is organic via production promotion (Decision 7), not top-down annotation sprints. Spend more on dataset construction than on framework selection; the dataset is what your evals are actually measuring.*
Concept 12: The eval-improvement loop
The TDD analogy from Concept 2 has a workflow: red, green, refactor. The EDD analog is: define task, run agent, capture trace, grade behavior, identify failure mode, improve prompt/tool/workflow, rerun evals, compare results, ship only when behavior improves. Concept 12 walks the loop, identifies where teams short-circuit it, and names what makes a healthy iteration cycle.

The healthy loop, in detail.
Step 1 — Define task. Pick the failure case to work on. Two sources: (a) an example from the golden dataset the agent is currently failing; (b) a new task category that the dataset doesn't cover yet (build the new example first, then address the failure).
Step 2 — Run agent. Invoke the agent on the task. In the simulated mode, this is loading a recorded trace. In the live mode, this is actually running the agent in a staging environment.
Step 3 — Capture trace. The full execution path. Model calls, tool calls, handoffs, intermediate reasoning. The OpenAI Agents SDK does this by default; other SDKs need configuration. If you can't capture a structured trace, you can't iterate the loop.
Step 4 — Grade behavior. Run the eval suite. Don't grade just the failure case — grade the full suite, because the change you're about to make might fix this case while breaking others. The grading produces a score per metric per example.
Step 5 — Identify failure mode. This is the diagnostic step most teams skip. Where exactly did the agent fail? Output level (wrong final answer)? Tool-use level (wrong tool, wrong arguments)? Trace level (correct tools, wrong reasoning between them)? RAG level (wrong retrieval, wrong grounding)? Safety level (envelope violation)? The failure mode determines the fix. A retrieval failure is fixed in the knowledge layer; a reasoning failure is fixed in the prompt; a tool-use failure is fixed in the tool definition or the agent's tool-selection logic. Skipping this step is why teams change prompts repeatedly without improvement — they're applying prompt fixes to non-prompt failures.
Step 6 — Improve prompt/tool/workflow. Make the targeted change at the right layer. Targeted is the operative word. Sweeping prompt rewrites that "should fix the issue" usually fix one thing while breaking three others. Targeted changes — one prompt instruction added, one tool's description tightened, one chunking parameter adjusted — are easier to attribute to specific score changes.
Step 7 — Rerun evals. The full suite, not just the failing case. Compare against the previous run's scores. The diagnostic question: did the change fix the failure case AND not regress any other case? If yes, ship. If no, iterate. The discipline is that "fixed the case" without "no regressions" is not a fix; it's a trade.
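The Step 7 comparison is small enough to sketch. Assuming, for illustration, that each eval run is stored as a mapping from example id to per-metric scores, the diff below surfaces both what the change fixed and what it regressed; ship only when the first list is non-empty and the second is empty.

```python
# Sketch of the Step 7 diff: did the change fix the target case without regressing others?
# Assumes each run is a dict of {example_id: {metric_name: score}} (illustrative format).
def diff_runs(previous: dict, current: dict, tolerance: float = 0.0) -> dict:
    fixed, regressed = [], []
    for example_id, prev_scores in previous.items():
        curr_scores = current.get(example_id, {})
        for metric, prev in prev_scores.items():
            curr = curr_scores.get(metric, prev)
            if curr > prev + tolerance:
                fixed.append((example_id, metric, prev, curr))
            elif curr < prev - tolerance:
                regressed.append((example_id, metric, prev, curr))
    return {"fixed": fixed, "regressed": regressed}

# Made-up scores: the target case improved, but another example quietly got worse.
result = diff_runs(
    previous={"refund-0041": {"reasoning_soundness": 0.4}, "refund-0007": {"reasoning_soundness": 1.0}},
    current={"refund-0041": {"reasoning_soundness": 0.9}, "refund-0007": {"reasoning_soundness": 0.6}},
)
print("ship" if result["fixed"] and not result["regressed"] else "iterate")  # iterate
```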
Where teams short-circuit the loop.
- Skip Step 4 (grade behavior). The team observes a production failure, decides they understand it, changes the prompt, ships. Half the time the change "fixes" the case without solving the underlying mode; half the time it introduces regressions in other cases. Fix: never ship a prompt change without running the eval suite.
- Skip Step 5 (identify failure mode). The team grades the behavior, sees a failing score, and immediately starts changing the prompt — without diagnosing whether the failure was actually prompt-mediated. Most production agent failures are not prompt failures; they're tool, retrieval, or workflow failures. Fix: explicitly write down which failure mode you've identified before making the change.
- Skip Step 7 (rerun the full suite). The team makes the change, reruns only the failing example, confirms it passes, ships. The change quietly regresses three other examples. Fix: the full suite always runs before merge.
Frequency and cost discipline.
The full eval-improvement loop is expensive — each iteration costs LLM-as-judge fees and developer time. A pragmatic discipline:
- Daily: developer-driven iterations on specific failing cases. Each iteration runs the focused subset of the eval suite covering the affected agent.
- Per PR: full eval suite runs in CI. Regressions block merge.
- Weekly: review of trends — which agents are improving, which are stagnating, which are regressing slowly across many small changes.
- Quarterly: review of the golden dataset itself — is it still representative? Are the thresholds still appropriate? Should categories be added or split?
This is what TDD's "red-green-refactor" becomes when applied to agentic AI. Same shape, more layers, higher cost per iteration, requires more discipline. And it's the difference between a team that ships agent changes confidently and a team that hopes the prompt change works.
Walking the loop concretely: the wrong-customer refund example from Concept 3. The discussion above stays abstract. Let me walk the seven steps on the specific failure that opened Concept 3 — the Tier-1 Support agent that refunded the wrong customer because it didn't disambiguate between accounts with the same email. This is what the loop actually feels like in practice.
Step 1 — Define task. The team noticed in the weekly trace-to-eval triage that two production traces had the same shape: customer asks about a billing dispute, agent looks up customer by email, email matches multiple accounts, agent picks the first match without disambiguating. One of the two traces went to the wrong customer. They promote both to the golden dataset as new examples in the refund_request category, tagged difficulty=hard and failure_mode=customer_disambiguation.
Step 2 — Run agent. They invoke the Tier-1 Support agent on each new example (in a staging environment, so no real refunds get issued). Both runs produce responses that look correct — "I've processed your refund" — and confidently issue the action.
Step 3 — Capture trace. The OpenAI Agents SDK produces the trace by default. They inspect: model call → customer_lookup(email="sarah@example.com") tool call → three results returned → model picks result[0] → refund_issue(account_id=result[0].id, amount=$89) → response generated. The wrong-customer pick is visible in the trace — the model never reasoned about which of the three accounts matched.
Step 4 — Grade behavior. They run the full eval suite. Output evals: 5/5 on both examples (the response looks correct). Tool-use evals: `customer_lookup` was called with the right argument (the email); `refund_issue` was called with valid arguments; but the argument-correctness metric fails because `account_id` matched the customer's first account, not the disputed account. Trace evals: the reasoning-soundness metric fails — the trace shows no disambiguation step between the lookup and the refund. The eval suite catches the failure at the tool-use and trace layers. Output evals would have missed it (and did, for several weeks in production).
Step 5 — Identify failure mode. This is the step the team is disciplined about. Where exactly did the agent fail? It's not an output failure (the response was fine). It's not a tool-selection failure (`customer_lookup` was the right tool). It's not a retrieval failure (no RAG involved). **It is a *reasoning* failure: the agent didn't reason about the lookup result before acting on it.** The fix layer is the prompt — specifically the part that tells the agent how to interpret tool results — not the tool itself, not the workflow, not the model.
Step 6 — Improve (targeted). They edit the Tier-1 Support agent's prompt. One specific addition: "When customer_lookup returns multiple results, do not proceed with action tools until you've identified which account matches the customer's specific dispute. Use the disputed charge amount and date to disambiguate; if disambiguation is impossible, escalate to a human." Not a sweeping prompt rewrite — one paragraph addressing one failure mode.
Step 7 — Rerun evals. They run the full eval suite, not just the two new examples. The two new examples now pass — the agent escalates to a human in both cases (correct behavior given an ambiguous match). They scan for regressions: do the other 48 dataset examples still pass at the same scores? Forty-seven do; one regresses from 5/5 to 3/5 — an example where the agent used to respond immediately to a clear single-match customer and now adds an unnecessary "let me confirm which account" question. The team has to decide: is the extra confirmation step correct (more careful) or a regression (worse UX for the common case)? They tighten the prompt addition: "...do not proceed if there are multiple results; for a single match, proceed normally." Rerun. All 50 pass. Ship.
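To keep the fix from regressing silently later, the team could also encode the disambiguation check as a repo-level DeepEval test so it runs on every PR. This is a hedged sketch assuming DeepEval's `LLMTestCase`, `GEval`, and `assert_test` interfaces as documented at the time of writing (verify against your installed version); the hard-coded agent response stands in for an actual staging-agent invocation.

```python
# Sketch of a DeepEval-style test for the disambiguation behavior. API names assume
# DeepEval's documented GEval / LLMTestCase / assert_test interfaces; check your version.
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

disambiguation_metric = GEval(
    name="Customer disambiguation",
    criteria=(
        "When the customer's email matches multiple accounts, the response must either ask "
        "which account holds the disputed charge or escalate to a human. It must not confirm "
        "a refund without identifying the correct account."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

def test_refund_disambiguation():
    task = "Hi, I was double-charged $89 last week. Can you refund it?"
    # In the real test this response would come from invoking the staging agent;
    # it is hard-coded here so the sketch stays self-contained.
    agent_response = (
        "I found three accounts under sarah@example.com. Could you confirm the date and "
        "amount of the disputed charge so I can refund the correct account?"
    )
    assert_test(LLMTestCase(input=task, actual_output=agent_response), [disambiguation_metric])
```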
The whole loop took roughly an hour of engineering time across the seven steps — fast because the discipline was already wired. A team without trace evals catches this failure when an angry customer complains months later. A team with output evals alone catches it just as late, because the output never looked wrong. A team with the full pyramid catches it the week the pattern first appears in production traces. That is the operational difference EDD makes.
Bottom line: the eval-improvement loop is the operational discipline of EDD — define task, run agent, capture trace, grade behavior, identify failure mode, improve, rerun, compare. The most common short-circuit is skipping the failure-mode-identification step and jumping straight from observation to prompt change; the result is repeated prompt rewrites that don't improve behavior. A healthy team runs daily iteration on specific cases, full-suite eval on every PR, trend review weekly, dataset review quarterly. The loop is more expensive than TDD's red-green-refactor; the discipline is also higher-stakes.
Concept 13: Production observability and the trace-to-eval pipeline
Decision 7 wired Phoenix. Concept 13 takes up the operational discipline that makes Phoenix actually useful — because installing observability is easy; using observability to drive eval improvement is the part most teams underestimate.
The basic claim: production traces are the highest-quality source of eval examples. They are real (not imagined), they cover the actual distribution (not the team's assumptions about it), they include the failure modes that actually happen (not the ones the team anticipated). The trace-to-eval pipeline turns the agent's real usage into the eval suite's future material.

The pipeline, in operational detail:
Phase 1 — Sample. Phoenix continuously ingests traces from production. Not every trace becomes an eval example — that would be too much data. Sampling rules (a minimal filter is sketched in code after this list):
- Errored traces: every trace where the agent encountered an exception or returned an error. Hands-down the highest-signal source.
- User-feedback-flagged traces: every trace where a user downvoted, reopened a ticket, or asked for human escalation after the agent's response. These are known failures from the user's perspective.
- Low-confidence traces: every trace where the agent (or Claudia, for Course Eight's Identic AI) reported confidence below a threshold. Low-confidence decisions are often correct but always worth examining.
- Edge-of-envelope traces: for safety-relevant agents (Claudia, Manager-Agent), every trace where the decision was near the envelope boundary. Even when the decision was correct, examining the boundary cases sharpens the eval suite.
- Random sample: 1% of normal traces (those not flagged by the above). Provides baseline coverage and surfaces failures the other filters miss.
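Phase 1 is simple enough to sketch as a pure function. The trace fields used here (`error`, `user_feedback`, `confidence`, `envelope_margin`) are illustrative names for this course's scenario, not Phoenix's actual schema.

```python
import random

# Sketch of the Phase 1 sampling decision (field names are illustrative, not Phoenix's schema).
def sampling_reason(trace: dict, random_rate: float = 0.01) -> str | None:
    """Return why this trace should be sampled, or None if it should not be."""
    if trace.get("error"):
        return "errored"                        # highest-signal source
    if trace.get("user_feedback") in {"downvote", "reopened", "requested_human"}:
        return "user_flagged"
    if trace.get("confidence", 1.0) < 0.6:
        return "low_confidence"
    if trace.get("envelope_margin", 1.0) < 0.1:
        return "edge_of_envelope"               # decision landed close to the envelope boundary
    if random.random() < random_rate:
        return "random_baseline"                # ~1% of otherwise-normal traffic
    return None
```

Carrying the sampling reason forward with the trace makes the next phase faster, because the reviewer sees why each trace was flagged.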
Phase 2 — Triage. The sampled traces flow into a triage queue. Someone (a developer, the team's eval owner) reviews each one and decides: is this an eval-worthy example? Most "errored traces" become eval examples; many "low-confidence" don't. The triage discipline is: would adding this case to the eval suite prevent recurrence of the failure?
Phase 3 — Promote. Triaged examples that pass review get promoted to the golden dataset. The promotion step writes the example in the dataset's canonical format: task description, customer context, expected behavior, expected tools, unacceptable patterns. This is where the production failure becomes a permanent eval check.
Phase 4 — Threshold review. Periodically (Course Nine recommends weekly), the team reviews whether the eval thresholds need to tighten or loosen. If a new category of examples is consistently passing at high scores, the threshold for that category goes up. If a new category is consistently failing, the team either fixes the agent or accepts the lower threshold for that category temporarily.
Where teams under-invest.
The triage step (Phase 2) is the bottleneck — and the step teams systematically skip. A trace goes from production to "we should add this to the dataset" but never makes it into the actual dataset because nobody owned the triage work. This is the failure mode that turns production observability into production decoration. Phoenix shows you all the traces; without the triage discipline, the traces stay in Phoenix and the eval suite stays static.
The fix is organizational, not technical: someone (named individual, not "the team") owns the weekly triage. The promotion has a regular ritual — Course Nine recommends a 30-minute weekly meeting where the eval owner walks recent sampled traces, decides promotions, and updates the dataset. 30 minutes per week is the operational cost; the payoff is a dataset that stays current with production.
The relationship to drift.
Concept 2 named drift as the EDD-specific failure mode TDD has no analog for. Production observability is how teams detect drift; the trace-to-eval pipeline is how teams respond to it.
When a model upgrade rolls out (the underlying LLM is retrained, fine-tuned, or replaced), agents' behavior changes — sometimes for the better, sometimes for the worse. Phoenix's drift detection dashboard surfaces the change; the eval suite's regression check confirms whether the change is a regression on the existing examples. If the regression is consistent across many examples, the eval suite catches it; if the regression is concentrated in a category the dataset under-covers, the eval suite misses it. The trace-to-eval pipeline is what closes that gap: examples from the regressed category get promoted, the dataset evolves, the next drift event is better caught.
This is the operational answer to "evals against a static dataset eventually go stale." They don't, if the dataset is continuously refreshed from production. The Phoenix → triage → promotion ritual is the refresh mechanism.
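The drift rule from Decision 7's dashboard is also easy to state in code: a trailing 7-day average compared against the trailing 30-day baseline, alerting on a 10% relative shift. The one-score-per-day input format is an assumption for illustration.

```python
# Sketch of the drift rule: alert when the 7-day trailing average of a critical metric
# moves more than 10% away from its 30-day trailing baseline.
def drift_alert(daily_scores: list[float], threshold: float = 0.10) -> bool:
    """daily_scores holds one aggregate score per day, oldest first; needs >= 30 days."""
    if len(daily_scores) < 30:
        return False                                # not enough history to judge drift
    baseline = sum(daily_scores[-30:]) / 30
    recent = sum(daily_scores[-7:]) / 7
    return baseline > 0 and abs(recent - baseline) / baseline > threshold

# Made-up history: a metric that held near 0.90 for weeks, then sagged this week.
history = [0.90] * 23 + [0.85, 0.80, 0.76, 0.73, 0.71, 0.70, 0.69]
print(drift_alert(history))  # True; the dashboard should fire
```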
Quick check. A team installs Phoenix correctly and configures the trace-to-eval pipeline (sampling rules, queue, promotion script). Six months later, the golden dataset has grown by exactly zero examples from production. The dashboards are running. Phoenix is happy. What's the most likely root cause?
1. The sampling rules are too restrictive — nothing's being captured
2. The promotion script has a bug
3. The triage step has no named owner and gets perpetually deferred
4. The team is shipping perfect agents that don't need new eval examples
Answer: (3) — by a wide margin. (1) and (2) are real but produce obvious symptoms; the team would notice. (4) is essentially never true in production. (3) is the modal failure mode and the reason Concept 13 emphasizes the triage owner over the triage tooling. Phoenix produces a queue of candidate examples; without someone whose Tuesday-morning calendar shows "30 minutes: trace-to-eval triage," the queue grows, then gets ignored, then becomes invisible. Phoenix without an owner is decoration. This is the organizational discipline gap that distinguishes teams whose eval suites genuinely improve over time from teams whose eval suites slowly become snapshots of an old reality.
*Bottom line: production observability is the substrate; the trace-to-eval pipeline is the operational discipline that makes the observability productive. Sample traces continuously (errors, user feedback, low confidence, edge-of-envelope, random); triage them on a weekly cadence (who owns this matters more than which tool); promote the eval-worthy ones into the golden dataset; review thresholds periodically. The triage step is the bottleneck most teams underestimate. Phoenix without a triage owner is decoration; Phoenix with a 30-minute weekly triage ritual is the loop that turns production into improved evals over time.*
Concept 14: What evals can't measure
Course Nine's discipline is strong on many failure modes and honestly limited on others. Pretending the discipline closes every gap in agent reliability would mislead teams; pretending evals are useless because they don't close every gap would discard the most useful reliability practice the field has. Concept 14 maps the discipline's frontier honestly.
What evals catch well.
Pattern-matching behavior. If the agent should do X when conditions A, B, C are present, and the dataset has examples of A+B+C → X, the eval suite catches when the agent doesn't do X. This is the bulk of agent reliability — repeating known-correct patterns reliably. Evals are excellent at this.
Drift on known patterns. When a model upgrade changes behavior on examples already in the dataset, the regression check fires. Evals reliably detect drift on the patterns they cover.
Safety violations within named bounds. If the envelope is "refunds ≤ $2,000," the eval can verify the agent stayed within that limit on every graded run. Bounded safety rules are evaluable, often with a plain deterministic check (sketched below), and the eval suite is excellent at policing them.
Tool-use correctness. Did the agent call the right tool? Pass the right arguments? Interpret the result correctly? These are mechanical questions with mechanical answers; evals catch failures here with high reliability.
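The bounded-envelope case above is the easiest of the four to see in code: a deterministic scan of a trace's tool calls, with no LLM grader involved. The trace structure here is illustrative, not a specific SDK's format.

```python
# Deterministic safety eval for a bounded envelope rule ("refunds must not exceed $2,000").
REFUND_CEILING = 2000.00

def envelope_violations(trace: list[dict]) -> list[dict]:
    """Return every refund_issue call whose amount breaches the ceiling."""
    return [
        call for call in trace
        if call.get("tool") == "refund_issue"
        and call.get("args", {}).get("amount", 0) > REFUND_CEILING
    ]

trace = [
    {"tool": "customer_lookup", "args": {"email": "sarah@example.com"}},
    {"tool": "refund_issue", "args": {"account_id": "acct_42", "amount": 2450.00}},
]
print(envelope_violations(trace))  # the $2,450 refund breaches the ceiling, so the eval fails
```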
Where evals are honestly limited.
Novel situations the dataset doesn't cover. The agent encounters a customer issue unlike anything in the dataset. The eval suite says nothing about this — it can't, because it doesn't have ground truth for the novel case. The agent's behavior on novel cases is what really tests its judgment, and evals can't directly evaluate it. The mitigation is the production-to-eval pipeline (Concept 13): novel cases that appear in production get triaged and promoted. Over time, the dataset's coverage of the novel-case distribution expands. But there will always be a frontier of "haven't seen this yet" that evals can't speak to.
Value alignment at edge cases. The agent has to choose between two responses, both of which are technically correct but reflect different underlying values. Maya might want "fast resolution even if slightly more lenient on policy"; another company might want "strict policy enforcement even when slower." The eval can grade against one of these as the ground truth, but it can't grade whether the agent is aligned with the user's values — only whether it's aligned with the values the dataset encodes. When values shift (Maya decides she wants stricter policy after a regulatory inquiry), the dataset has to shift with them; evals don't surface the value question on their own.
Subjective judgment about quality. Some agent outputs are technically correct but somehow off. The tone is wrong; the response is verbose; the framing irritates the customer despite answering the question. LLM-as-judge graders catch some of this, but their scoring is correlated with what other LLMs would prefer, which isn't the same as what humans prefer. Human grading catches more, but it's expensive and inconsistent across graders. There's a real gap here, and the field's current best practice is to grade subjective dimensions with multiple graders and accept the noise.
Long-tail edge cases. The 1% of customer interactions that don't fit the categories in the dataset. By definition, the eval suite doesn't cover them. Production observability surfaces them; the eval suite doesn't prevent the failures on them.
Emergent behavior over long interactions. The eval suite typically grades single-turn or short-multi-turn interactions. Emergent failures over long conversations — drift in the agent's behavior across 30 turns, contradictions with earlier statements, gradual concession of constraints — are hard to evaluate. The dataset structure doesn't naturally support 30-turn examples; the graders struggle to evaluate them; the resulting evals are sparse. This is a real frontier for the discipline.
Adversarial behavior. If a sophisticated user is trying to manipulate the agent (prompt injection, jailbreak attempts, social engineering), the eval suite can grade against specific known attack patterns — but novel attacks, by definition, aren't in the dataset. Red-teaming is the discipline that addresses this; it's complementary to EDD rather than subsumed by it.
What this means for the discipline.
Three implications:
- Evals are necessary but not sufficient for agent reliability. A team that ships only with evals will catch most failures and miss some. Red-teaming, human review of edge cases, careful production monitoring, and rollback-readiness are all additional practices that complement EDD. The friend's pithy version: EDD is a major reliability discipline, not the only one.
- Eval coverage is a moving target. As production evolves, novel situations appear that the dataset doesn't cover. The trace-to-eval pipeline is how coverage extends; weekly triage is how it stays current. A team that treats the dataset as static accepts that their eval coverage shrinks over time.
- Honest reporting of eval scores includes honest scope. When a team reports "we pass 92% on our eval suite," the honest reading is "we pass 92% of the failure modes we've thought to test for." This is genuine information but it's not a guarantee that production failures stay below 8%. Teams that internalize this distinction make better decisions; teams that don't get surprised.
Quick check. Which of these is fundamentally outside what eval-driven development can catch, even with a perfect golden dataset and the full four-tool stack? Pick the one that's fundamentally unsolvable, not just hard.
1. The agent gives a correct answer through wrong reasoning
2. The agent fails on novel customer questions the dataset never covered
3. The agent's tone is technically correct but irritates customers
4. Prompt injection by a sophisticated user
Answer: (2) is the only fundamentally unsolvable one — by definition, evals can't grade what isn't in the dataset. (1) is what trace evals catch (Concept 6). (3) is hard but tractable with multi-grader and human-in-the-loop evaluation. (4) is what red-teaming catches as a complementary discipline. The novel-case frontier is the honest limit of EDD; the discipline minimizes it through production-to-eval promotion but never closes it entirely.
*Bottom line: EDD is excellent at pattern-matching behavior, drift detection, bounded safety rules, and tool-use correctness. It is honestly limited on novel situations, value alignment at edge cases, subjective quality judgments, long-tail rare events, emergent behavior over long interactions, and adversarial attacks. Three implications: evals are necessary-but-not-sufficient; coverage is a moving target maintained by the production-to-eval pipeline; honest reporting includes honest scope. A team that internalizes the limits ships agents that work better than a team that overclaims for evals.*
Five things not to do — anti-patterns that defeat the discipline
A teaching course about a discipline is only honest if it names what not to do. The five anti-patterns below are the ones most teams discover the hard way; the discipline of EDD is partly defined by avoiding them.
1. Do not ship output-only evals and call the agent "safe." This is the most common failure mode in 2025-2026 production agentic AI. The output eval scores look great; the production failures keep happening; the team concludes "evals don't work for agents." The honest diagnosis: output-only evaluation systematically misses the trace-layer failures Concept 3 named. Ship the full pyramid — output + tool-use + trace + safety — or accept that your eval suite is measuring less than you think.
2. Do not use LLM-as-judge without calibration. When an LLM grader returns "answer correctness: 0.85," the team treats it as data — but the grader could be biased, inconsistent, or systematically wrong on certain failure categories. Concept 15 names this as the eval-of-evals frontier. Before trusting any LLM-as-judge metric in production: spot-check 10-20 graded examples against human judgment, document the grader's calibration error, and report eval scores with the grader's reliability noted. "Faithfulness 0.85 (grader spot-checked at 90% human agreement)" is honest; "Faithfulness 0.85" on its own treats grader output as ground truth.
3. Do not build a huge eval dataset before understanding your failure categories. Decision 1 specifies a 30-50 example starting dataset deliberately — small enough to construct carefully, large enough to cover the major task categories. Teams that ship a 500-example dataset on day one usually have a long-tail-biased dataset (the team imagined hundreds of cases but didn't ground them in production patterns) and end up rebuilding it after Decision 7's production-to-eval pipeline reveals what production traffic actually looks like. Start with 30-50 representative cases; grow the dataset organically through the trace-to-eval promotion ritual; resist the urge to "comprehensively cover" the agent's behavior on day one.
4. Do not treat observability dashboards as evals. Phoenix's dashboards show what's happening in production — pass rates, cost trends, latency distributions, drift signals — but the dashboard itself is not an eval. An eval grades a specific run against a specific rubric and produces a score that goes into the regression check. A dashboard surfaces patterns that may or may not be eval-worthy. The trace-to-eval pipeline (Concept 13) is the bridge that turns observability into evaluation. Teams that confuse the two end up with beautiful dashboards and a static eval suite; teams that understand the distinction do the weekly triage ritual that keeps the eval suite alive.
5. Do not run evals only once before launch. The most expensive way to use eval-driven development is as a pre-launch gate that's never run again. Models drift. Prompts get edited. Tools get added. Production traffic shifts. A static eval suite, however good at launch, becomes a snapshot of a previous era within months. Wire evals into CI/CD (Decision 6) so they run on every meaningful change; wire production observability (Decision 7) so the dataset grows from real usage; review thresholds quarterly (Concept 11). EDD is a continuous discipline, not a milestone.
These five anti-patterns are the negative space of the discipline. A team that avoids all five is doing EDD well, regardless of which specific frameworks they use. A team that commits any one of them is shipping less than they think — and the production failures will, eventually, prove it.
Part 6: Closing
Parts 1-5 built the discipline. Part 6 closes it. One Concept, then the quick-reference, then the closing line. This is the closing course of the Agent Factory track.
Concept 15: Eval-driven development as a foundational discipline — and what comes after
The architectural arc Courses 3-9 traced is now complete. Two courses (3-4) built the engines of an agent. Three courses (5-7) built the infrastructure that turns an agent into a workforce. One course (8) built the delegate that lets the workforce scale past the owner's attention. One course (9) built the discipline that makes the whole architecture measurably trustworthy in production. Eight architectural invariants plus one cross-cutting discipline — the Agent Factory track is structurally complete.
This isn't a small claim, so let it land for a paragraph. The eight invariants describe what an AI-native company is made of: an agent loop, a system of record, an operational envelope, a management layer, a hiring API, a delegate, a nervous system, and skills as a portable substrate. The ninth discipline describes how you know any of it is working — measure behavior, not just code; trace the path, not just the destination; sample production, not just imagined tasks; ship only when the eval suite confirms the change actually improved things. Together, the nine pieces describe a complete production-grade AI-native company. A founder with the discipline of this curriculum can build one. An engineer with the discipline can evaluate one. A manager with the discipline can govern one. The curriculum has taught what it set out to teach.
Eval-driven development takes its place alongside test-driven development as a foundational software-engineering discipline. This is the analogous claim Concept 2 set up; Concept 15 lands it as the closing argument — to the extent the current state of EDD can land it, with the open frontiers below honestly named. TDD became foundational because deterministic software systems became too complex for humans to verify by inspection. An automated, regression-protected verification discipline became necessary, then standard. EDD becomes foundational for the same reason in agentic AI. Probabilistic, multi-step, tool-using behavior is too complex and too high-stakes to verify by demo or eyeballing. An automated, regression-protected behavior-evaluation discipline becomes necessary, then standard. A decade from now, shipping an agent without an eval suite will look the way shipping SaaS without unit tests looks today — possible, occasionally done, but professionally indefensible.
What comes after Course Nine in the field of eval-driven development. Five frontiers, as of May 2026, where the discipline is actively expanding. Each one is a real research direction, not just an aspiration:
Frontier 1 — Auto-eval generation. Today, dataset construction is the load-bearing manual cost of EDD. The Decision 1 work — sourcing 30-50 examples, writing expected behaviors, defining acceptable patterns — doesn't scale linearly with the agent's complexity. Research is moving toward agents that read a deployed agent's traces and generate candidate eval examples. Not just promote them through the trace-to-eval pipeline (Decision 7's discipline) — synthesize new examples that probe weaknesses the existing dataset doesn't cover. The 2025-2026 literature has working prototypes that use a stronger model to read traces, identify under-tested behavior categories, and propose new examples with expected behaviors and rubrics. The hard part is quality control. Auto-generated examples often look reasonable but encode subtle errors that ship into the dataset undetected. Early versions exist; the quality bar is real and not yet met for production use. Watch this space; it could transform the economics of EDD within 2-3 years.
Frontier 2 — Eval-of-evals. When evals themselves are produced by LLM-as-judge graders, the question of whether the grader is itself accurate becomes load-bearing. Are we measuring what we think we're measuring? If a grader rates "answer correctness" at 0.8 for a response, we treat that as data. But the grader could be wrong, biased toward certain phrasings, or systematically miss certain failure modes. The research direction: graders calibrated against human judgment on benchmark datasets, then deployed with known calibration error bars. The discipline shift implied: reporting eval scores with confidence intervals reflecting grader reliability, not just point estimates. "Faithfulness 0.85 ± 0.07 (grader confidence)" instead of "Faithfulness 0.85." This is a real shift in how teams interpret eval scores. It's the next thing the discipline has to ship for the foundation to be trustworthy at scale.
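The calibration practice behind a score like "Faithfulness 0.85 ± 0.07" can be sketched without any framework: spot-check the grader against human labels on a small sample and report agreement plus an error spread alongside the metric. All the numbers below are made up for illustration.

```python
# Sketch of a grader calibration spot-check: compare LLM-as-judge scores with human labels
# on a small sample, then report agreement and a simple error spread. Numbers are made up.
import statistics

human  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]                          # human pass/fail on 10 examples
grader = [0.9, 0.55, 0.4, 0.7, 0.6, 0.9, 0.85, 0.3, 0.75, 0.95]  # LLM grader scores, 0 to 1

predicted = [1 if s >= 0.7 else 0 for s in grader]   # grader's pass/fail at the 0.7 threshold
agreement = sum(p == h for p, h in zip(predicted, human)) / len(human)

errors = [abs(s - h) for s, h in zip(grader, human)]
mean_err, spread = statistics.mean(errors), statistics.stdev(errors)

# Honest reporting: the metric plus the grader's measured reliability, not a bare point estimate.
print(f"grader/human agreement at threshold 0.7: {agreement:.0%}")
print(f"mean absolute error vs human label: {mean_err:.2f} +/- {spread:.2f}")
```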
Frontier 3 — Alignment metrics beyond pattern-matching. Concept 14 named the limit — evals catch pattern-matching reliability but can't catch alignment with user values at edge cases. The research frontier is whether new metrics, derived from inverse reinforcement learning, constitutional AI techniques, or multi-stakeholder value elicitation, can produce eval-grade scores for value alignment specifically. The honest assessment, as of May 2026: this is genuinely hard. The discipline of eval-driven development doesn't currently close this gap. The metrics that exist (alignment-via-preference-comparison, RLHF-derived reward models, constitutional rubrics) are useful for some narrow alignment dimensions but don't generalize. A team operating in a high-stakes domain — medical, legal, financial, governance-sensitive — cannot rely on EDD alone to certify alignment. They need red-teaming, human review of edge cases, and rollback-readiness as complementary disciplines. The frontier is whether eval-grade alignment metrics will eventually exist. The honest answer is maybe, not yet.
Frontier 4 — Multi-agent eval. Course Six introduced the Manager-Agent; Course Seven introduced the hiring API across multiple agents; Course Eight introduced Claudia coordinating with the workforce. The eval discipline for multi-agent systems is younger than the single-agent discipline. When Agent A hands off to Agent B who consults Agent C, the failure modes multiply: handoff context lost in translation, redundant work across agents, decisions that subtly contradict each other across handoffs, emergent behaviors where the system as a whole behaves differently than any individual agent. Trace evals can grade this at the technical level (was the handoff appropriate? was sufficient context passed?). The systemic eval — does the multi-agent system behave coherently across many interactions, optimizing for the right outcomes at the right granularity — is still emerging. The research direction: simulation-based multi-agent evaluation, where the eval harness simulates many cross-agent interactions and grades the aggregate behavior. Course Nine's lab doesn't yet ship this; a future course or extension would.
Frontier 5 — Eval portability across runtimes. As of May 2026, eval suites are typically tied to the agent's SDK. OpenAI Agents SDK evals don't trivially transfer to Claude Agent SDK or LangChain agents. The substrate-portability research direction is to abstract eval interfaces from runtime specifics, allowing the same eval suite to grade agents on any compatible runtime. OpenTelemetry's trace standardization is a step toward this. Both Phoenix and Braintrust now consume OpenTelemetry-compatible traces from any runtime, which means observability is portable even if eval frameworks aren't yet. The next step: DeepEval, Ragas, and the trace-grading layer standardize their inputs around OpenTelemetry as well. Then a single eval suite can grade agents across the OpenAI / Anthropic / open-source ecosystems. Some early work is in flight; full portability is still future work. For now, plan to maintain a thin adapter layer between your evals and your runtime if you may switch runtimes.
These five frontiers are not gaps in Course Nine's curriculum — they are open problems the field is working on. A reader who has completed Courses 3-9 is well-positioned to follow the research (the venues to watch as of May 2026: NeurIPS, ACL, ICML eval workshops; the OpenAI, Anthropic, Arize, Confident AI engineering blogs; the EDD community on the relevant Discord servers), to contribute to the open-source frameworks (DeepEval, Ragas, Phoenix all welcome contributions and are actively developed), or to extend the discipline to their own production agents in ways the current state of the field doesn't yet ship.
The architect's closing thesis sentence — the lead and the closer of the entire track. Course Nine opened by claiming that if test-driven development gave SaaS teams confidence in code, eval-driven development gives agentic AI teams confidence in behavior. The track's full thesis is wider than that. Building an AI-native company requires eight architectural invariants for the structure plus one cross-cutting discipline for the behavior. The discipline is what separates building agents from building production-grade AI workforces. A team with the eight invariants but no discipline ships agents that occasionally fail in confusing ways and never reach the reliability bar real businesses need. A team with the discipline but missing invariants can't build the company in the first place. Both are necessary; both are now taught; the Agent Factory curriculum is complete.
Bottom line: eval-driven development is the cross-cutting discipline that turns the eight architectural invariants of Courses 3-8 from built to measurably trustworthy. It takes its place alongside test-driven development as a foundational software-engineering discipline; a decade from now, shipping an agent without evals will look the way shipping SaaS without unit tests looks today. Five open frontiers — auto-eval generation, eval-of-evals, alignment metrics beyond pattern-matching, multi-agent eval, and eval portability across runtimes — are where the field is actively expanding. The Agent Factory track is now structurally complete: eight invariants plus one discipline equals a buildable, measurable, production-grade AI-native company.
Quick reference — the 15 Concepts in one table
| # | Concept | Key claim | Where in the architecture |
|---|---|---|---|
| 1 | Why traditional tests aren't enough | Probabilistic, multi-step, tool-using systems need behavior measurement, not code measurement | Above all of Courses 3-8 |
| 2 | The TDD analogy and its limits | Carries on loop + regression discipline; breaks on determinism, drift, cost, threshold-setting | Foundational framing |
| 3 | What "behavior" means | Output ≠ trace ≠ path; evaluating only the output misses most consequential failures | Diagnostic primitive |
| 4 | The 9-layer evaluation pyramid | Unit → integration → output → tool-use → trace → RAG → safety → regression → production | Architectural taxonomy |
| 5 | Output evals | Accessible starting point; catches format/factual errors; misses process failures | Layer 3 |
| 6 | Tool-use and trace evals | The workhorse layers for agentic AI; catch path failures invisible to output evals | Layers 4-5 |
| 7 | RAG evals | Separate retrieval, grounding, and citation failure modes | Layer 6 |
| 8 | OpenAI Agent Evals with trace grading | Two products in one ecosystem; Agent Evals for datasets and output-level grading at scale; trace grading for trace-level assertions | Tool #1 (pair) |
| 9 | DeepEval for repo-level | Pytest-for-agent-behavior; CI/CD integration; the discipline point | Tool #2 |
| 10 | Ragas + Phoenix | Specialized RAG metrics + production observability + trace-to-eval feedback | Tools #3-4 |
| 11 | Golden dataset construction | The most undervalued artifact; quality determines eval value | Dataset substrate |
| 12 | The eval-improvement loop | Define → run → trace → grade → identify failure mode → improve → rerun → ship | Operational rhythm |
| 13 | Production observability | Phoenix is the substrate; the trace-to-eval triage ritual is the discipline | Production-to-development loop |
| 14 | What evals can't measure | Novel situations, value alignment, subjective quality, adversarial attacks — honest scope | Discipline frontier |
| 15 | EDD as foundational discipline | Takes its place alongside TDD; five open frontiers in the field | Closing |
Cross-course summary — what gets evaluated where
| Course | Primitive built | Course Nine eval coverage |
|---|---|---|
| 3 | Agent loop | Output evals (Decision 2), trace evals (Decision 3) |
| 4 | System of record + MCP | RAG evals (Decision 5), grounding faithfulness checks |
| 5 | Operational envelope (Inngest) | Regression evals (Decision 6) — agent behavior consistent across durability events |
| 6 | Management layer + approval primitive | Safety evals (Decision 4), tool-use evals on approval-flow |
| 7 | Hiring API + talent ledger | Eval packs at hire time (Course Seven's primitive); Course Nine generalizes |
| 8 | Owner Identic AI + governance ledger | Trace evals on Claudia's reasoning (Decision 3), envelope-respect safety evals (Decision 4) |
What's next for the reader
If you've completed Courses 3-9, you have:
- The architectural model of an AI-native company (eight invariants).
- The cross-cutting discipline that makes the architecture trustworthy (eval-driven development).
- A working lab covering all four eval frameworks and seven Decisions of operational practice.
- An honest map of where the discipline closes the reliability gap and where it doesn't.
Three paths forward:
- Operate. Run an AI-native company using the curriculum. The frameworks and disciplines you've built are the minimum viable production stack. Real customer traffic, real evals, real iteration. The discipline gets sharper from production, not from theory; the team that ships the eval suite into one real agent learns more in three months than a team that studies eval theory for a year.
- Extend. Take the discipline into use cases the curriculum didn't cover. Multi-agent eval (the Concept 15 frontier — when Agent A hands off to Agent B, which hands off to Agent C, the eval surface multiplies). Domain-specific RAG evaluation (legal needs citation provenance; medical needs differential-diagnosis grounding; financial needs regulatory-policy adherence). Alignment metrics for high-stakes deployments (where pattern-matching reliability isn't enough). Each extension is a research direction in itself; pick one that matches your domain.
- Contribute. The open-source frameworks (DeepEval, Ragas, Phoenix) are actively developed. New metrics, runtime adapters, eval-of-evals tooling, and operational practice patterns come from practitioners shipping the discipline in production. The field is at TDD's early-2000s adoption point; the work of making EDD as standard as TDD is in front of us. Frameworks need maintainers; the discipline needs documenters; the community needs people who've shipped real evals against real production traffic and can show what worked.
One last Try-with-AI — the closing exercise. Open your Claude Code or OpenCode session and paste:
"I've completed the Agent Factory track including Course Nine. I want to apply eval-driven development to one of my own production agents — not Maya's customer-support example, but an agent I'm actually shipping. Walk me through the first three Decisions adapted to my agent: (1) define what behavior I care about and draft 10 golden-dataset examples from my actual production traffic (I'll paste them); (2) pick which two layers of the 9-layer pyramid are most important for my specific failure modes; (3) write the first DeepEval test for the most critical metric. Ask me what my agent does, what it uses tools for, and what its highest-stakes failure would look like — then we build from there. Treat this as a real engineering pairing session, not a curriculum exercise."
What you're learning. The discipline only matters when applied to your agent, your dataset, your failure modes. Course Nine taught the patterns; this exercise lands them on a real production target. A reader who completes this exercise and ships the resulting eval suite into their CI/CD pipeline has done more for their agent's reliability than a reader who re-read Concepts 1-15 ten times. The discipline transfers through use, not study.
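To preview what step (3) of that prompt typically produces, here is a minimal sketch of a first DeepEval test. Everything specific in it is a placeholder: run_support_agent is a hypothetical wrapper around your own agent, the refund scenario stands in for whatever your highest-stakes failure is, and the 0.7 threshold is a starting point to tune against your golden dataset, not a recommendation. The GEval criteria string is exactly the kind of rubric you would draft with the model during the pairing session; current metric signatures are in the DeepEval metric reference cited in the References section.

```python
# A first DeepEval test for one high-stakes behavior, runnable with pytest.
# Assumptions: a hypothetical run_support_agent() wrapper around your agent,
# and an LLM judge configured for DeepEval (an OpenAI key by default).
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

from my_agent import run_support_agent  # hypothetical: returns the agent's final reply

refund_correctness = GEval(
    name="Refund handling correctness",
    criteria=(
        "The response must state the correct refund eligibility for the "
        "customer's plan and must not promise a refund amount or timeline "
        "that the policy does not support."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,  # placeholder; tune against your golden dataset, not by feel
)

def test_refund_request_is_handled_correctly():
    prompt = (
        "I bought the annual plan 12 days ago and it isn't working for us. "
        "Can I get my money back?"
    )
    test_case = LLMTestCase(
        input=prompt,
        actual_output=run_support_agent(prompt),
        expected_output=(
            "Confirms full refund eligibility within the 30-day window "
            "and offers to process it."
        ),
    )
    assert_test(test_case, [refund_correctness])
```

Running this under pytest in CI is the Decision 6 move: the build fails when the agent's refund behavior regresses, not when a human happens to notice.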
References
Organized by topic. URLs current as of May 2026; verify before citing in your own work.
For leaders and researchers who want the research background, the "Foundational research the discipline rests on" subsection below cites the academic and engineering papers Course Nine implicitly draws on: Kent Beck's TDD foundation, the LLM-as-judge calibration research (Zheng et al.), the canonical RAG paper (Lewis et al.), and the MLOps lineage (Sculley et al.). These are the papers to read if you want to ground EDD in the broader software-engineering and ML literature, not just adopt the tool stack.
The Agent Factory track:
- The Agent Factory thesis — the eight-invariant architectural model behind every course in this track. Available at /docs/thesis.
- Courses Three through Eight — the courses where the curriculum's eight architectural invariants are built. See the cross-course summary table earlier in this document.
The four-tool stack — primary documentation:
- OpenAI Agent Evals — OpenAI's agent-evaluation platform. The "Evaluate agent workflows" guide: https://developers.openai.com/api/docs/guides/agent-evals. The broader OpenAI Evals documentation (datasets, eval runs, graders): https://platform.openai.com/docs/guides/evals. The open-source eval-framework precursor: https://github.com/openai/evals
- OpenAI Trace Grading — the trace-grading capability within Agent Evals, documented as a distinct guide: https://developers.openai.com/api/docs/guides/trace-grading. Reads traces from the OpenAI Agents SDK and runs trace-level assertions.
- DeepEval — open-source pytest-style eval framework. Repo: https://github.com/confident-ai/deepeval; docs: https://deepeval.com/docs/; metric reference (the canonical metric catalog): https://deepeval.com/docs/metrics-introduction. Also includes an OpenAI Agents SDK integration for agent traces.
- Ragas — open-source RAG-specific eval framework, now expanding into agent-evaluation metrics as well. Docs: https://docs.ragas.io; available metrics list (includes Tool Call Accuracy, Tool Call F1, Agent Goal Accuracy, and Topic Adherence alongside the classic RAG metrics): https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/; the foundational paper introducing the framework's metric set: Es et al., "Ragas: Automated Evaluation of Retrieval Augmented Generation" (EACL 2024).
- Phoenix (Arize) — open-source production observability with OpenAI Agents SDK tracing integration. Repo: https://github.com/Arize-ai/phoenix; docs: https://docs.arize.com/phoenix; OpenAI Agents SDK tracing integration specifically: https://arize.com/docs/phoenix/integrations/llm-providers/openai/openai-agents-sdk-tracing; the OpenInference standard for trace export (which Phoenix uses): https://github.com/Arize-ai/openinference
- Braintrust — commercial alternative to Phoenix. Product: https://www.braintrust.dev; docs: https://www.braintrust.dev/docs
Foundational research the discipline rests on:
- Test-Driven Development. Kent Beck, Test-Driven Development: By Example (Addison-Wesley, 2002) — the canonical reference. The EDD-as-TDD-for-behavior framing originates from the 2025-2026 agentic AI community; Beck's book remains the foundation.
- LLM-as-judge calibration. Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (NeurIPS 2023) — the foundational study of LLM grader reliability that informs Concept 14's honest discussion of grader limits.
- Grounding and faithfulness in RAG. The Ragas paper above plus Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020) — the canonical RAG reference Course Four's MCP knowledge layer descends from.
- Trace-based agent evaluation. The OpenAI Agents SDK documentation cited above; plus the broader OpenTelemetry observability literature, which both Phoenix and Trace Grading draw on.
Current discourse (where the discipline is being shaped in 2025-2026):
- The OpenAI engineering blog, particularly posts tagged "evaluation" and "agents": https://openai.com/blog
- The Anthropic engineering blog, particularly posts on Claude Agent SDK and constitutional AI evaluation: https://www.anthropic.com/research
- The Arize blog (Phoenix's maintainers), which publishes practical evaluation case studies: https://arize.com/blog
- The Confident AI blog (DeepEval's maintainers), with practical eval-driven development case studies: https://www.confident-ai.com/blog
- NeurIPS, ACL, and ICML eval workshops (2024-2026) — the academic venues where the discipline's frontier is being researched
Adjacent disciplines worth understanding:
- Red-teaming for LLM systems. Complementary to EDD; catches the adversarial-attack failure modes Concept 14 names. Anthropic's responsible-scaling-policy documentation is a useful entry point.
- MLOps for traditional machine learning. The model-monitoring discipline EDD inherits from. Sculley et al., "Hidden Technical Debt in Machine Learning Systems" (NeurIPS 2015) is the classic.
- Continuous integration / continuous deployment. The CI/CD substrate Decision 6 plugs into. Humble & Farley, Continuous Delivery (Addison-Wesley, 2010) remains the canonical reference.
Course Nine closes the Agent Factory track. Build agents that work. Verify they work. Ship with the discipline that lets you trust what you built. That is the shift from demo to production AI workforce — and it is the engineering practice that turns the architectural promise of Courses 3-8 into something a real business can rely on.