Chapter 47: Evals - Measuring Agent Performance
Part: 6 (AI Native Software Development)
Phase: Quality Assurance
Proficiency Level: B1-B2 (Intermediate to Upper Intermediate)
Duration: ~4.5 hours
Chapter Overview
This chapter teaches the thinking and methodology behind agent evaluations (evals): the systematic approach to measuring AI agent reasoning quality. Unlike TDD (Chapter 46), which tests code correctness with deterministic PASS/FAIL outcomes, evals measure probabilistic reasoning quality with scores.
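A minimal, self-contained sketch of that distinction; the `route` function and the cases are hypothetical stand-ins, not the chapter's Task API agent:

```python
# Hypothetical stand-in for the Task API agent's router, used only to
# contrast the two kinds of checks.
def route(message: str) -> str:
    return "create" if "add" in message.lower() else "query"

# TDD: a deterministic assertion -- same input, same PASS/FAIL verdict every run.
assert route("Please add 'buy milk'") == "create"

# Eval: score a dataset of cases and report a rate, not a single verdict.
cases = [("Please add 'buy milk'", "create"), ("What is due today?", "query")]
score = sum(route(msg) == want for msg, want in cases) / len(cases)
print(f"routing accuracy: {score:.0%}")
```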
Core Thesis (Andrew Ng): "One of the biggest predictors for whether someone is able to build agentic workflows really well is whether or not they're able to drive a really disciplined evaluation process."
Prerequisites
- Ch34-36 (SDK chapters): Understanding of agent architecture
- Ch40 (FastAPI for Agents): Task API running example
- Ch46 (TDD for Agents): Understanding of test-driven development
Student Skill
Students build the agent-evals skill throughout this chapter, starting in L00 and finalizing it in L10.
Lesson Index
| # | Lesson | Duration | Focus |
|---|---|---|---|
| L00 | Build Your Evals Skill | 25 min | Skill-First: Create agent-evals skill |
| L01 | Evals Are Exams for Reasoning | 20 min | TDD vs Evals distinction |
| L02 | The Two Evaluation Axes | 20 min | Four-quadrant classification |
| L03 | Designing Eval Datasets | 25 min | Quality over quantity (10-20 cases) |
| L04 | Building Graders with Binary Criteria | 30 min | Binary yes/no pattern |
| L05 | LLM-as-Judge Graders | 30 min | LLM graders and their limitations |
| L06 | Systematic Error Analysis | 30 min | Spreadsheet-based counting |
| L07 | Component vs End-to-End Evals | 25 min | Decision framework |
| L08 | Regression Protection | 25 min | Eval-on-every-change |
| L09 | The Complete Quality Loop | 30 min | Build-Evaluate-Analyze-Improve |
| L10 | Finalize Your Evals Skill | 20 min | Skill validation |
Key Concepts
- Evals as Exams: Testing reasoning quality, not code correctness
- Two Axes: Objective/Subjective × Ground Truth availability
- Binary Criteria: five yes/no checks summed into a 0-5 score, more reliable than 1-5 rating scales (see the grader sketch after this list)
- Error Analysis: Systematic counting replaces gut feeling
- The Quality Loop: Build → Evaluate → Analyze → Improve → Repeat
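A hedged sketch of the binary-criteria pattern; the five criteria and the field names (`tool`, `expected_tool`, `format`, and so on) are illustrative assumptions, not the chapter's grader:

```python
# Binary-criteria grader: five yes/no checks summed into a 0-5 score.
# All field names here are hypothetical.
def grade(response: dict) -> int:
    criteria = [
        response.get("tool") == response.get("expected_tool"),  # correct tool chosen?
        response.get("format") == "json",                       # output format compliant?
        bool(response.get("task_id")),                          # references a concrete task?
        "error" not in response.get("status", ""),              # no unhandled error?
        len(response.get("message", "")) <= 280,                # stays concise?
    ]
    return sum(criteria)  # each True adds one point -> 0-5
```

Binary checks avoid the calibration drift of 1-5 scales: a grader either finds the correct tool call or it does not.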
Running Example
Task API agent from Ch40, evaluated for (a sample dataset sketch follows the list):
- Routing decisions (create vs update vs query)
- Tool selection correctness
- Output format compliance
- Error handling quality
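An illustrative slice of such a dataset, covering the routing dimension; field names and cases are assumptions, not the chapter's schema, and L03's guidance of 10-20 cases applies to the full set:

```python
# Eval dataset slice mixing typical, edge, and error cases for routing.
# Field names here are hypothetical.
EVAL_CASES = [
    # typical cases: one per route
    {"input": "Add 'buy milk' to my list", "expected_route": "create"},
    {"input": "Mark task 3 as done",       "expected_route": "update"},
    {"input": "What is due this week?",    "expected_route": "query"},
    # edge case: terse, ambiguous phrasing
    {"input": "buy milk",                  "expected_route": "create"},
    # error case: target task does not exist; expect a graceful failure
    {"input": "Mark task 999 as done",     "expected_route": "update",
     "expect_graceful_error": True},
]
```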
Learning Outcomes
By chapter end, students can:
- Distinguish evals from TDD based on determinism and outcomes
- Design eval datasets with typical, edge, and error cases
- Create graders using binary criteria
- Perform systematic error analysis
- Build regression protection workflows (sketched after this list)
- Apply the complete quality loop
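A sketch of the eval-on-every-change idea from L08, with hypothetical helper names and a made-up 0.80 baseline: rerun the suite after each prompt or code change, and reject the change if the score regresses.

```python
# Regression gate: compare this change's eval score to the last accepted
# baseline. Names and the threshold are illustrative.
BASELINE = 0.80

def suite_score(grade, cases) -> float:
    """Average per-case score across the eval dataset."""
    return sum(grade(case) for case in cases) / len(cases)

def gate(score: float, baseline: float = BASELINE) -> None:
    if score < baseline:
        raise SystemExit(f"regression: {score:.2f} < baseline {baseline:.2f}")
    print(f"ok: {score:.2f} >= baseline {baseline:.2f}")
```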
Framework-Agnostic
These concepts apply to any SDK (OpenAI, Claude, Google ADK, LangChain). The thinking is portable.