Deep Dive Chapter 69: Multi-Agent Reliability — Errors, Escalation, Provenance & Quality

A multi-agent system is only as reliable as its weakest error path. This chapter teaches the five disciplines of multi-agent reliability: structured error propagation, escalation calibration, context management at scale, information provenance, and human review integration. These are the patterns that separate prototype-grade systems from production-grade systems.

What You'll Learn

By the end of this chapter, you'll be able to:

Design structured error responses with the MCP isError pattern and error category taxonomy
Implement subagent error recovery with partial results propagation and coordinator decision-making
Calibrate escalation triggers using explicit criteria and few-shot examples (not sentiment or confidence scores)
Preserve critical information across long sessions using the "case facts" pattern
Build claim-source mapping pipelines that maintain provenance through multi-agent synthesis
Design human review workflows with field-level confidence calibration and stratified sampling
Orchestrate coordinator-subagent architectures with parallel execution and iterative refinement

Chapter Structure

Structured Error Propagation — The MCP isError Pattern — error categories (transient, validation, business, permission), structured metadata, access-failure vs valid-empty-result
Subagent Error Recovery and Coordinator Decision-Making — local recovery first, partial results, coverage annotations in synthesis
Escalation Calibration — When to Escalate vs Resolve — what works (explicit criteria + few-shot), what doesn't (confidence scores, sentiment, "be conservative")
Context Management at Scale — progressive summarization risks, the "case facts" pattern, "lost in the middle" effect, scratchpad files
Information Provenance in Multi-Source Synthesis — claim-source mappings, conflicting sources, temporal data, scoped verification tools
Human Review Workflows & Confidence Calibration — aggregate accuracy trap, field-level scores, threshold calibration with labeled sets, stratified sampling
Coordinator-Subagent Orchestration Patterns — hub-and-spoke, context isolation, Task tool, parallel execution, iterative refinement (Capstone)
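
As a preview of the first item above, here is a minimal sketch of a structured tool result that separates an access failure from a valid empty result. It assumes the MCP tool-result shape (an `isError` flag plus a `content` list of text items). The metadata field names (`errorCategory`, `isRetryable`, `count`) and the helpers `tool_error`, `tool_success`, and `classify` are hypothetical, chosen for illustration rather than taken from any SDK.

```python
import json
from enum import Enum
from typing import Any


class ErrorCategory(str, Enum):
    # The four categories named in the chapter structure.
    TRANSIENT = "transient"
    VALIDATION = "validation"
    BUSINESS = "business"
    PERMISSION = "permission"


# Assumption for this sketch: only transient failures are retried unchanged.
RETRYABLE = {ErrorCategory.TRANSIENT}


def tool_error(category: ErrorCategory, message: str, **details: Any) -> dict:
    """Build an MCP-style tool result that flags a failure (hypothetical helper)."""
    payload = {
        "errorCategory": category.value,
        "isRetryable": category in RETRYABLE,
        "message": message,
        **details,
    }
    return {"isError": True, "content": [{"type": "text", "text": json.dumps(payload)}]}


def tool_success(results: list) -> dict:
    """Build a successful result; an empty list is still a valid answer."""
    payload = {"results": results, "count": len(results)}
    return {"isError": False, "content": [{"type": "text", "text": json.dumps(payload)}]}


def classify(result: dict) -> str:
    """Coordinator-side view: tell 'could not look' apart from 'looked, found nothing'."""
    body = json.loads(result["content"][0]["text"])
    if result.get("isError"):
        return "retry" if body["isRetryable"] else "report_failure"
    return "empty_but_valid" if body["count"] == 0 else "ok"
```

The point of the design is that a search that timed out and a search that legitimately found nothing never look the same to the coordinator, so a gap in coverage is not silently reported as "no results".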

Running Project

You'll build a multi-agent research system (certification exam Scenario 3) with a coordinator agent, web search subagent, document analysis subagent, and synthesis subagent, progressively adding reliability engineering to each layer.

Prerequisites

Chapter 65: Anthropic Claude Agent SDK
Chapter 64: The Claude API — Agentic Loops
Chapters 66-67: MCP Fundamentals and Custom MCP Servers

Certification Exam Coverage

This chapter covers Claude Certified Architect — Foundations exam domains:

Domain 2 (18%): Task Statement 2.2 — Structured error responses
Domain 5 (15%): Task Statements 5.1-5.6 — Context management, escalation, error propagation, provenance, human review, confidence calibration
Directly covers Sample Questions 3, 7, 8, 9
Covers exam Scenario 3 (Multi-Agent Research System) end-to-end
