Chapter 69: Multi-Agent Reliability — Errors, Escalation, Provenance & Quality
A multi-agent system is only as reliable as its weakest error path. This chapter teaches the five disciplines of multi-agent reliability: structured error propagation, escalation calibration, context management at scale, information provenance, and human review integration. These are the patterns that separate prototype-grade systems from production-grade systems.
What You'll Learn
By the end of this chapter, you'll be able to:
- Design structured error responses with the MCP isError pattern and error category taxonomy
- Implement subagent error recovery with partial results propagation and coordinator decision-making
- Calibrate escalation triggers using explicit criteria and few-shot examples (not sentiment or confidence scores)
- Preserve critical information across long sessions using the "case facts" pattern
- Build claim-source mapping pipelines that maintain provenance through multi-agent synthesis
- Design human review workflows with field-level confidence calibration and stratified sampling
- Orchestrate coordinator-subagent architectures with parallel execution and iterative refinement
Chapter Structure
- Structured Error Propagation — The MCP isError Pattern — error categories (transient, validation, business, permission), structured metadata, access-failure vs valid-empty-result
- Subagent Error Recovery and Coordinator Decision-Making — local recovery first, partial results, coverage annotations in synthesis
- Escalation Calibration — When to Escalate vs Resolve — what works (explicit criteria + few-shot), what doesn't (confidence scores, sentiment, "be conservative")
- Context Management at Scale — progressive summarization risks, the "case facts" pattern, "lost in the middle" effect, scratchpad files
- Information Provenance in Multi-Source Synthesis — claim-source mappings, conflicting sources, temporal data, scoped verification tools
- Human Review Workflows & Confidence Calibration — aggregate accuracy trap, field-level scores, threshold calibration with labeled sets, stratified sampling
- Coordinator-Subagent Orchestration Patterns — hub-and-spoke, context isolation, Task tool, parallel execution, iterative refinement (Capstone)
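As a preview of the first topic in the list above, the difference between an access failure and a valid empty result might be sketched as follows. The `content` list and `isError` flag follow the general shape of an MCP tool result. The `errorCategory` and `isRetryable` fields, the helper names, and the choice to encode metadata as JSON text are illustrative assumptions, not part of the MCP specification or of this chapter's reference code.

```python
import json
from enum import Enum


class ErrorCategory(str, Enum):
    # The four categories named in the chapter outline.
    TRANSIENT = "transient"
    VALIDATION = "validation"
    BUSINESS = "business"
    PERMISSION = "permission"


def error_result(category: ErrorCategory, message: str) -> dict:
    """Build a failed tool result with structured metadata (hypothetical helper)."""
    payload = {
        "errorCategory": category.value,  # assumed field name
        # Assumption: only transient failures are worth retrying automatically.
        "isRetryable": category is ErrorCategory.TRANSIENT,  # assumed field name
        "message": message,
    }
    return {
        "content": [{"type": "text", "text": json.dumps(payload)}],
        "isError": True,
    }


def empty_result(query: str) -> dict:
    """Build a successful result for a query that matched nothing (hypothetical helper)."""
    payload = {"query": query, "results": []}
    return {
        "content": [{"type": "text", "text": json.dumps(payload)}],
        "isError": False,
    }


# An access failure: the tool could not reach its backend.
timeout = error_result(ErrorCategory.TRANSIENT, "search backend timed out")

# A valid empty result: the tool ran fine and found nothing.
no_hits = empty_result("order 12345")
```

Keeping the two cases structurally distinct lets a coordinator decide to retry the first while treating the second as a real answer, rather than guessing from free-text error messages.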
Running Project
Students build a multi-agent research system (certification exam Scenario 3) with a coordinator agent, web search subagent, document analysis subagent, and synthesis subagent — progressively adding reliability engineering to each layer.
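The overall shape of that project can be sketched in a few lines, assuming plain `asyncio` coroutines as stand-ins for the real subagents. This sketch does not use the Claude Agent SDK; every function name, the simulated failure, and the `coverage_gaps` field are hypothetical. It shows only the hub-and-spoke idea: the coordinator runs subagents in parallel, keeps partial results when one fails, and annotates what is missing.

```python
import asyncio


async def web_search(topic: str) -> dict:
    # Stub subagent: pretend the search succeeded.
    return {"findings": [f"web result about {topic}"]}


async def document_analysis(topic: str) -> dict:
    # Stub subagent: simulate a failure the subagent could not recover from locally.
    raise TimeoutError("document store unavailable")


async def coordinate(topic: str) -> dict:
    names = ["web_search", "document_analysis"]
    # Run both subagents concurrently; exceptions are returned instead of raised,
    # so one failure does not discard the other subagent's work.
    outcomes = await asyncio.gather(
        web_search(topic),
        document_analysis(topic),
        return_exceptions=True,
    )

    findings, gaps = [], []
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            gaps.append({"subagent": name, "error": str(outcome)})
        else:
            findings.extend(outcome["findings"])

    # Stand-in for the synthesis subagent: pass along findings plus coverage notes.
    return {"findings": findings, "coverage_gaps": gaps}


report = asyncio.run(coordinate("battery recycling"))
```

Here `report` carries the web search findings together with a note that document analysis did not contribute, so a downstream reader is not misled into treating the partial report as complete.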
Prerequisites
- Chapter 65: Anthropic Claude Agent SDK
- Chapter 64: The Claude API — Agentic Loops
- Chapters 66-67: MCP Fundamentals and Custom MCP Servers
Certification Exam Coverage
This chapter covers Claude Certified Architect — Foundations exam domains:
- Domain 2 (18%): Task Statement 2.2 — Structured error responses
- Domain 5 (15%): Task Statements 5.1-5.6 — Context management, escalation, error propagation, provenance, human review, confidence calibration
- Directly covers Sample Questions 3, 7, 8, 9
- Covers exam Scenario 3 (Multi-Agent Research System) end-to-end