
Chapter 47: Evals - Measuring Agent Performance

  • Part: 6 (AI Native Software Development)
  • Phase: Quality Assurance
  • Proficiency Level: B1-B2 (Intermediate to Upper Intermediate)
  • Duration: ~4.5 hours

Chapter Overview

This chapter teaches the thinking and methodology behind agent evaluations (evals): the systematic approach to measuring AI agent reasoning quality. Unlike TDD (Chapter 46), which tests code correctness with deterministic PASS/FAIL outcomes, evals measure probabilistic reasoning quality with scores.
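As a rough sketch of that difference (the endpoint, the `run_agent` and `grader` callables, and the case format below are placeholders for illustration, not code from the chapter):

```python
def test_create_task_returns_201(client):
    # TDD: deterministic -- the same request always yields the same PASS/FAIL
    response = client.post("/tasks", json={"title": "Buy milk"})
    assert response.status_code == 201

def eval_score(run_agent, grader, cases) -> float:
    # Evals: probabilistic -- reasoning is graded across many cases and the
    # result is a score (e.g. 0.85), not a single PASS/FAIL verdict
    scores = [grader(run_agent(case["input"]), case) for case in cases]
    return sum(scores) / len(scores)
```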

Core Thesis (Andrew Ng): "One of the biggest predictors for whether someone is able to build agentic workflows really well is whether or not they're able to drive a really disciplined evaluation process."

Prerequisites

  • Ch34-36 (SDK chapters): Understanding of agent architecture
  • Ch40 (FastAPI for Agents): Task API running example
  • Ch46 (TDD for Agents): Understanding of test-driven development

Student Skill

Students build the agent-evals skill throughout this chapter, starting it in L00 and finalizing it in L10.

Lesson Index

#    Lesson                                 Duration  Focus
L00  Build Your Evals Skill                 25 min    Skill-First: Create agent-evals skill
L01  Evals Are Exams for Reasoning          20 min    TDD vs Evals distinction
L02  The Two Evaluation Axes                20 min    Four-quadrant classification
L03  Designing Eval Datasets                25 min    Quality over quantity (10-20 cases)
L04  Building Graders with Binary Criteria  30 min    Binary yes/no pattern
L05  LLM-as-Judge Graders                   30 min    LLM evaluation with limitations
L06  Systematic Error Analysis              30 min    Spreadsheet-based counting
L07  Component vs End-to-End Evals          25 min    Decision framework
L08  Regression Protection                  25 min    Eval-on-every-change
L09  The Complete Quality Loop              30 min    Build-Evaluate-Analyze-Improve
L10  Finalize Your Evals Skill              20 min    Skill validation

Key Concepts

  • Evals as Exams: Testing reasoning quality, not code correctness
  • Two Axes: Objective/Subjective × Ground Truth availability
  • Binary Criteria: 5 yes/no checks → 0-5 score (more reliable than 1-5 scales); see the grader sketch after this list
  • Error Analysis: Systematic counting replaces gut feeling
  • The Quality Loop: Build → Evaluate → Analyze → Improve → Repeat
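To make the binary-criteria pattern concrete, here is a minimal grader sketch for a "create task" output. The specific criteria, field names, and output format are assumptions for illustration, not the chapter's own code:

```python
import json

def grade_create_task_output(output: str) -> int:
    """Binary-criteria grader: five yes/no checks summed into a 0-5 score."""
    try:
        data = json.loads(output)
        is_object = isinstance(data, dict)
    except (json.JSONDecodeError, TypeError):
        data, is_object = {}, False
    args = data.get("arguments", {}) if is_object else {}

    checks = [
        is_object,                                               # 1. parses as a JSON object
        data.get("tool") == "create_task",                       # 2. correct tool selected
        isinstance(args, dict) and bool(args.get("title")),      # 3. non-empty title present
        is_object and set(data) <= {"tool", "arguments"},        # 4. no unexpected top-level keys
        isinstance(args, dict) and set(args) <= {"title", "due_date"},  # 5. only allowed argument fields
    ]
    return sum(checks)  # each criterion is an unambiguous yes/no

# A well-formed output scores 5; degraded outputs score lower, never "3.5-ish".
print(grade_create_task_output('{"tool": "create_task", "arguments": {"title": "Buy milk"}}'))  # -> 5
```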

Running Example

The Task API agent from Ch40 is evaluated on (a sample dataset is sketched after this list):

  • Routing decisions (create vs update vs query)
  • Tool selection correctness
  • Output format compliance
  • Error handling quality

Learning Outcomes

By chapter end, students can:

  1. Distinguish evals from TDD based on determinism and outcomes
  2. Design eval datasets with typical, edge, and error cases
  3. Create graders using binary criteria
  4. Perform systematic error analysis
  5. Build regression protection workflows (see the CI sketch after this list)
  6. Apply the complete quality loop
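Outcome 5, regression protection, can be pictured as a small CI gate that runs the eval suite on every change; the baseline file, tolerance, and hard-coded scores below are assumptions used only to keep the sketch self-contained:

```python
import json
from pathlib import Path

BASELINE_FILE = Path("evals/baseline.json")   # e.g. {"routing_accuracy": 0.85}
TOLERANCE = 0.02                              # tolerate small run-to-run noise

def check_regression(current_scores: dict[str, float],
                     baseline_file: Path = BASELINE_FILE) -> list[str]:
    """Return the metrics whose score dropped below the stored baseline."""
    baseline = json.loads(baseline_file.read_text())
    return [
        name for name, prev in baseline.items()
        if current_scores.get(name, 0.0) < prev - TOLERANCE
    ]

def test_no_eval_regression():
    # In CI these scores would come from running the eval suite against the
    # agent; they are hard-coded here purely to keep the sketch self-contained.
    current = {"routing_accuracy": 0.86, "format_compliance": 0.90}
    regressed = check_regression(current)
    assert not regressed, f"Eval scores regressed for: {regressed}"
```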

Framework-Agnostic

These concepts apply to any SDK (OpenAI, Claude, Google ADK, LangChain). The thinking is portable.