
The Complete Quality Loop

You've learned the pieces. Datasets that capture real user behavior. Graders that define "good" with binary precision. Error analysis that reveals where failures cluster. Component evals that isolate problems. Regression protection that prevents backsliding.

But pieces alone don't ship agents. What matters is how they work together as a system.

Andrew Ng, who has observed hundreds of teams building AI agents, distills successful agent development into a single insight:

"When I'm building these workflows, there are two major activities. One is building the workflow, you know, prompting the agent or agents. And the second is then looking at the outputs and error analysis and adjusting based on learnings."

Two activities. Building and analysis. The teams that master this loop ship reliable agents. The teams that don't keep wondering why their agents "work in testing but fail in production."

This lesson brings everything together into the complete quality loop.

The Two Activities Framework

Andrew Ng's observation is deceptively simple. Agent development alternates between two modes:

Building Mode

  • Writing prompts
  • Configuring tools
  • Setting up guardrails
  • Designing agent architecture
  • Making changes to improve quality

Analysis Mode

  • Running evals
  • Reading outputs
  • Finding patterns in failures
  • Identifying which component is broken
  • Measuring improvement from changes

Most developers spend 90% of their time in building mode and 10% in analysis mode. Successful agent builders flip this ratio. They spend more time understanding WHY their agent fails than they spend writing code to fix it.

The counterintuitive truth: The path to shipping faster is slowing down to analyze.

When you guess at fixes, you might get lucky. But usually you fix the wrong thing, introduce new problems, or make marginal improvements that don't address root causes. When you analyze systematically, you find high-leverage fixes that improve multiple test cases at once.

The complete loop formalizes this insight into a repeatable process.

The Complete 10-Step Loop

Here's the full quality loop, synthesizing everything from this chapter:

          THE COMPLETE QUALITY LOOP
======================================

 1. BUILD quick-and-dirty agent v1
         |
         v
 2. CREATE eval dataset (10-20 cases)
         |
         v
 3. RUN evals --> Find initial score (often 60-70%)
         |
         v
 4. ERROR ANALYSIS --> "45% errors from routing"
         |
         v
 5. FIX routing --> Re-run evals --> Score improves
         |
         v
 6. ERROR ANALYSIS --> "30% errors from output format"
         |
         v
 7. FIX format --> Re-run evals --> Score improves
         |
         v
 8. DECISION: Ship or iterate?
         |
         +-----> If < threshold: Return to step 4
         |
         v
 9. DEPLOY with regression protection
         |
         v
10. MONITOR production --> Add failed cases to dataset
         |
         +-----> Return to step 3 (continuous improvement)

Each step builds on skills from earlier lessons:

| Step | Chapter Lesson | Skill Applied |
|------|----------------|---------------|
| 1 | L00 | Agent-evals skill foundation |
| 2 | L03 | Dataset design (typical, edge, error cases) |
| 3 | L04-L05 | Graders (binary criteria, LLM judges) |
| 4 | L06 | Error analysis (trace reading, pattern counting) |
| 5-7 | L07 | Component vs end-to-end focus |
| 8 | This lesson | Ship decision framework |
| 9 | L08 | Regression protection |
| 10 | L03 | Growing datasets from production |

This isn't a linear process you complete once. It's a cycle you run continuously. Even after shipping, production failures feed back into your eval dataset, triggering new iterations.
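
To make the cycle concrete, here is a minimal sketch of the loop as a driver function. Everything in it is illustrative: run_evals, analyze_errors, and apply_fix are placeholders for the eval runner, error-analysis step, and prompt/tool edits you built in earlier lessons, not a prescribed API.

from typing import Callable

def quality_loop(
    run_evals: Callable[[], float],         # returns overall pass rate (0.0-1.0)
    analyze_errors: Callable[[], str],      # returns the component where failures cluster
    apply_fix: Callable[[str], None],       # makes one focused change to that component
    ship_threshold: float = 0.90,
    max_iterations: int = 5,
) -> float:
    """Run eval -> analyze -> fix until the threshold or iteration budget is hit."""
    pass_rate = run_evals()                 # step 3: establish the baseline
    for _ in range(max_iterations):
        if pass_rate >= ship_threshold:     # step 8: ship decision
            break
        worst = analyze_errors()            # steps 4/6: find where failures cluster
        apply_fix(worst)                    # steps 5/7: one focused change
        pass_rate = run_evals()             # re-measure after the change
    return pass_rate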

The Iteration Pattern: 70% to 85% to 92%

Let's trace a realistic improvement journey for the Task API agent:

Iteration 1: Initial Baseline

Pass rate: 70%

Error breakdown:
- routing_correct: 65%
- tool_selection: 80%
- output_format: 75%
- helpful_response: 60%

Error analysis: 45% of failures trace to routing.
User says "what's on my plate today?" and agent
creates a new task instead of listing existing ones.

Action: Improve routing prompt with explicit examples

Iteration 2: After Routing Fix

Pass rate: 85%

Error breakdown:
- routing_correct: 92% (improved!)
- tool_selection: 82%
- output_format: 78%
- helpful_response: 72%

Error analysis: 30% of remaining failures are output format.
Task lists display as JSON instead of friendly text.

Action: Add output formatting instructions

Iteration 3: After Format Fix

Pass rate: 92%

Error breakdown:
- routing_correct: 92%
- tool_selection: 88%
- output_format: 95% (improved!)
- helpful_response: 87%

Error analysis: Remaining failures are edge cases -
ambiguous requests where multiple interpretations are valid.

Decision: 92% meets threshold. Ship.

Notice the pattern: each iteration targets the lowest-scoring component. This is the highest-leverage improvement. Fixing a 60% component to 90% yields bigger gains than polishing a 90% component to 95%.

The math is simple: Improving helpful_response from 60% to 87% adds 27 percentage points. Improving routing_correct from 92% to 98% adds only 6 points. Focus on the worst component first.
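
Picking the target can be mechanical. Here is a tiny helper, assuming your eval runner reports per-criterion scores as a plain dict (an assumption about your setup, not a required format):

def lowest_component(scores: dict[str, float]) -> tuple[str, float]:
    """Return (criterion, score) for the weakest component -- the next fix target."""
    name = min(scores, key=scores.get)
    return name, scores[name]

scores = {"routing_correct": 0.65, "tool_selection": 0.80,
          "output_format": 0.75, "helpful_response": 0.60}
print(lowest_component(scores))  # ('helpful_response', 0.6)

Score alone isn't the whole story: as the iterations above show, error analysis may point at an upstream component (routing) whose failures drag the downstream scores down with it.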

When to Ship: The Decision Framework

Not every agent needs 99% quality. The threshold depends on stakes, alternatives, and improvement trajectory.

| Pass Rate | Decision | Appropriate When |
|-----------|----------|------------------|
| ≥95% | Ship with confidence | High-stakes agents, customer-facing, replacing humans |
| 90-95% | Ship with light monitoring | Normal agents, users have fallback options |
| 80-90% | Ship if low-stakes, with active monitoring | Internal tools, experimental features, clear "beta" labeling |
| 70-80% | Ship only for demos/prototypes | Not production-ready; iterate before real users |
| ≤70% | Keep iterating | Too unreliable for any production use |

Beyond the number: Pass rate alone doesn't determine ship readiness. Also consider:

Improvement trajectory: Are you still making gains? If you jumped from 70% to 88% in two iterations, you're likely to reach 95% with one more push. If you've been stuck at 88% for five iterations, you're hitting diminishing returns.

Failure modes: A 90% agent that fails gracefully ("I'm not sure, let me get a human") is more shippable than a 92% agent that fails confidently with wrong answers.

User expectations: Internal tools tolerate more failures than customer-facing products. "Beta" labels buy forgiveness. Mission-critical workflows demand near-perfection.

Cost of delay: If users are currently doing this task manually with 75% accuracy, an 85% agent is an improvement. Don't let perfect be the enemy of deployed.
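
The sketch below pulls these factors into one decision helper. Treat it as a minimal illustration: the ShipDecisionContext fields, the thresholds, and the returned dict shape are assumptions to adapt, not a standard API.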

from dataclasses import dataclass

@dataclass
class ShipDecisionContext:
    """Context for making ship/iterate decisions."""
    pass_rate: float
    iterations_at_this_level: int
    failure_mode: str                    # "graceful" or "confident_wrong"
    user_stakes: str                     # "high", "medium", "low"
    current_alternative_quality: float   # What's the baseline without the agent?


def should_ship(context: ShipDecisionContext) -> dict:
    """
    Decide whether to ship based on quality and context.

    Returns:
        Decision with reasoning
    """
    # High-stakes requires high quality
    if context.user_stakes == "high" and context.pass_rate < 0.95:
        return {
            "decision": "ITERATE",
            "reason": f"High-stakes use case requires 95%+; currently at {context.pass_rate:.0%}"
        }

    # Diminishing returns detection
    if context.iterations_at_this_level >= 3:
        if context.pass_rate >= 0.85:
            return {
                "decision": "SHIP",
                "reason": f"Diminishing returns after {context.iterations_at_this_level} iterations; {context.pass_rate:.0%} is acceptable"
            }

    # Better than current alternative
    if context.pass_rate > context.current_alternative_quality + 0.10:
        if context.failure_mode == "graceful":
            return {
                "decision": "SHIP",
                "reason": f"10%+ improvement over current ({context.current_alternative_quality:.0%}), graceful failures"
            }

    # Standard thresholds
    if context.pass_rate >= 0.90:
        return {"decision": "SHIP", "reason": f"{context.pass_rate:.0%} meets standard threshold"}

    if context.pass_rate >= 0.80 and context.user_stakes == "low":
        return {"decision": "SHIP_WITH_MONITORING", "reason": "80%+ acceptable for low-stakes with monitoring"}

    return {"decision": "ITERATE", "reason": f"{context.pass_rate:.0%} below threshold for {context.user_stakes}-stakes use"}

Example usage and output:

context = ShipDecisionContext(
    pass_rate=0.88,
    iterations_at_this_level=4,
    failure_mode="graceful",
    user_stakes="medium",
    current_alternative_quality=0.70
)

result = should_ship(context)
print(f"Decision: {result['decision']}")
print(f"Reason: {result['reason']}")
Decision: SHIP
Reason: Diminishing returns after 4 iterations; 88% is acceptable

Cost and Latency: Only AFTER Quality

A common mistake: optimizing for cost and speed before the agent works well.

The correct order:

  1. Quality first: Get pass rate above threshold
  2. Then latency: Reduce response time for user experience
  3. Then cost: Optimize inference spending

Why this order matters: A fast, cheap agent that gives wrong answers is worthless. Users don't care about 200ms response time if the response is unhelpful. Saving $0.02 per request means nothing if users abandon the product.

Only after quality meets your threshold should you ask: "Can we maintain this quality while reducing latency/cost?"

Cost and latency optimization techniques (apply AFTER quality threshold met):

| Technique | Quality Impact | Cost Reduction | Latency Reduction |
|-----------|----------------|----------------|-------------------|
| Use smaller model | Risk of regression | High | High |
| Cache common responses | None if done correctly | High | Very high |
| Reduce prompt length | Risk of regression | Medium | Medium |
| Batch similar requests | None | Medium | Negative (increases latency) |
| Use tool caching | None if stateless | Medium | High |

Critical: Run your full eval suite after ANY optimization. Cost savings that cause quality regression are not savings—they're damage.
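
The check below is a minimal sketch of that discipline: compare an optimized run against the baseline and accept the change only if quality holds and the savings are meaningful. The minimal EvalRun dataclass, the 2% tolerance, and the 15% savings bar are illustrative assumptions, not fixed rules.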

from dataclasses import dataclass

@dataclass
class EvalRun:
    """Minimal eval result; a real runner would track more than pass_rate."""
    pass_rate: float


def optimize_after_quality(
    baseline: EvalRun,
    optimized: EvalRun,
    cost_before: float,
    cost_after: float
) -> dict:
    """
    Evaluate whether optimization is worthwhile.

    Only accept if quality maintained and meaningful savings achieved.
    """
    # Round to avoid float noise right at the tolerance boundary
    quality_delta = round(optimized.pass_rate - baseline.pass_rate, 4)

    # Reject if quality dropped beyond tolerance
    if quality_delta < -0.02:  # 2% tolerance
        return {
            "accept": False,
            "reason": f"Quality dropped {abs(quality_delta):.1%}. Reject optimization."
        }

    cost_savings = (cost_before - cost_after) / cost_before

    # Accept if quality maintained and savings meaningful
    if quality_delta >= -0.02 and cost_savings >= 0.15:
        return {
            "accept": True,
            "reason": f"Quality maintained ({quality_delta:+.1%}), cost reduced {cost_savings:.0%}"
        }

    return {
        "accept": False,
        "reason": f"Savings too small ({cost_savings:.0%}) to justify optimization effort"
    }

Example usage and output:

baseline = EvalRun(pass_rate=0.92)
optimized = EvalRun(pass_rate=0.90)  # Small quality drop

result = optimize_after_quality(baseline, optimized, cost_before=0.05, cost_after=0.02)
print(f"Accept: {result['accept']}")
print(f"Reason: {result['reason']}")
Accept: True
Reason: Quality maintained (-2.0%), cost reduced 60%

Example: Complete Cycle with Task API Agent

Let's walk through the complete loop for the Task API agent:

Step 1: Build v1

task_agent = TaskAPIAgent(
    model="gpt-4",
    tools=["create_task", "list_tasks", "update_task", "delete_task"],
    system_prompt="You help users manage their tasks..."
)

Step 2: Create eval dataset

eval_cases = [
    # Typical cases (10)
    {"input": "Add buy groceries to my list", "expected": "creates_task"},
    {"input": "What's on my plate today?", "expected": "lists_tasks"},
    {"input": "Mark the grocery task done", "expected": "updates_task"},
    # ... 7 more typical cases

    # Edge cases (5)
    {"input": "rm * ", "expected": "clarifies_intent"},  # Could be delete, could be gibberish
    {"input": "", "expected": "asks_for_input"},         # Empty input
    # ... 3 more edge cases

    # Error cases (5)
    {"input": "Delete task ID 99999", "expected": "graceful_not_found"},
    # ... 4 more error cases
]

Step 3: Run initial evals

Pass rate: 68%

routing_correct: 60%
tool_selection: 75%
output_format: 72%
helpful_response: 65%
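
If your grader returns one boolean per criterion per case, a breakdown like the one above is just a column-wise average. A minimal sketch, assuming that result shape (the list-of-dicts format is an assumption about your grader's output):

def summarize(results: list[dict[str, bool]]) -> dict[str, float]:
    """Compute per-criterion pass rates plus an all-criteria-pass overall rate."""
    criteria = results[0].keys()
    summary = {c: sum(r[c] for r in results) / len(results) for c in criteria}
    summary["overall"] = sum(all(r.values()) for r in results) / len(results)
    return summary

results = [
    {"routing_correct": True, "tool_selection": True, "output_format": False, "helpful_response": True},
    {"routing_correct": False, "tool_selection": True, "output_format": True, "helpful_response": False},
]
print(summarize(results))
# {'routing_correct': 0.5, 'tool_selection': 1.0, 'output_format': 0.5, 'helpful_response': 0.5, 'overall': 0.0}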

Step 4: Error analysis

Examining the 32% of runs that failed (16 failures):

- 7 cases: Routing errors (user asked to list, agent created)
- 4 cases: Output format (JSON instead of friendly text)
- 3 cases: Unhelpful responses (too brief, missed context)
- 2 cases: Tool selection (used update when should delete)

Highest concentration: Routing (7/16 = 44% of errors)
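
Counting failure categories is mechanical once each failing trace has a label (Lesson L06). A minimal sketch with collections.Counter; the labels below simply mirror this walkthrough:

from collections import Counter

# One label per failing case, assigned while reading traces
failure_labels = (
    ["routing"] * 7 + ["output_format"] * 4
    + ["unhelpful_response"] * 3 + ["tool_selection"] * 2
)

counts = Counter(failure_labels)
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n}/{total} = {n / total:.0%}")
# routing: 7/16 = 44%  <-- highest concentration, fix this first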

Step 5: Fix routing

# Enhanced system prompt with explicit routing examples
system_prompt = """
You help users manage their tasks. Route carefully:

LISTING words: "what's on", "show me", "list", "what do I have"
-> Call list_tasks

CREATING words: "add", "create", "remind me to", "new task"
-> Call create_task

UPDATING words: "mark done", "complete", "change", "update"
-> Call update_task

DELETING words: "remove", "delete", "cancel"
-> Call delete_task
"""

Iteration 2 results:

Pass rate: 82% (up from 68%)

routing_correct: 88% (was 60%)
tool_selection: 80%
output_format: 75%
helpful_response: 72%

Steps 6-7: Fix output format

# Add formatting instructions
output_instructions = """
Always format responses as friendly text, not JSON.

GOOD: "Here are your 3 tasks for today:
- Buy groceries (due today)
- Call mom (due tomorrow)
- Prepare presentation (due Friday)"

BAD: {"tasks": [{"title": "Buy groceries", ...}]}
"""

Iteration 3 results:

Pass rate: 91%

routing_correct: 88%
tool_selection: 85%
output_format: 94%
helpful_response: 88%

Step 8: Ship decision

91% pass rate with:
- Medium-stakes use case (personal task management)
- Graceful failure mode (agent asks clarifying questions)
- 3 iterations completed
- Current alternative: Manual task management (~60% completion rate)

Decision: SHIP
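
Running the should_ship helper from earlier on this context reaches the same conclusion. The field values are read off the summary above; iterations_at_this_level is 1 because quality was still improving each round (an interpretation, since the helper tracks iterations stuck at the same score):

context = ShipDecisionContext(
    pass_rate=0.91,
    iterations_at_this_level=1,
    failure_mode="graceful",
    user_stakes="medium",
    current_alternative_quality=0.60
)
print(should_ship(context))
# {'decision': 'SHIP', 'reason': '10%+ improvement over current (60%), graceful failures'}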

Step 9: Deploy with regression protection

# Baseline locked at 91%
regression_config = RegressionConfig(
    overall_threshold=0.05,    # Alert if drops more than 5%
    criterion_threshold=0.10,
    block_on_regression=True
)
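
A minimal sketch of the gate this config implies; the real check lives in your L08 regression setup, so treat the function below as an assumption rather than that library's API:

def within_regression_threshold(baseline_rate: float, new_rate: float,
                                overall_threshold: float = 0.05) -> bool:
    """Return True if the new run stays within the allowed drop from baseline."""
    return (baseline_rate - new_rate) <= overall_threshold

print(within_regression_threshold(0.91, 0.88))  # True  -> safe to deploy
print(within_regression_threshold(0.91, 0.84))  # False -> block the deploy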

Step 10: Monitor production

Week 1 production:
- 2 new failure modes discovered
- Added to eval dataset (now 25 cases)
- Re-ran evals: 89% (slight drop from production variety)
- New iteration: Improved edge case handling
- Current: 93%
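
Closing the loop means production failures become next iteration's eval cases. A minimal sketch; the record fields and the JSONL destination are assumptions about your logging setup:

import json
from datetime import datetime, timezone

def capture_failure(user_input: str, agent_output: str, expected_behavior: str,
                    dataset_path: str = "eval_dataset.jsonl") -> None:
    """Append a labeled production failure to the eval dataset for the next iteration."""
    record = {
        "input": user_input,
        "expected": expected_behavior,   # labeled during triage
        "observed": agent_output,        # what the agent actually did
        "source": "production",
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(dataset_path, "a") as f:
        f.write(json.dumps(record) + "\n")

capture_failure(
    user_input="what's left for this week?",
    agent_output="Created task: what's left for this week?",
    expected_behavior="lists_tasks",
)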

Exercise: Document Your Improvement Journey

Take an agent you're building (or plan to build) and document your complete quality loop:

Phase 1: Planning

  1. What are your 5 key evaluation criteria?
  2. What's your target pass rate for shipping?
  3. What's your regression protection threshold?

Phase 2: Initial Cycle

  4. Create a 15-case eval dataset (8 typical, 4 edge, 3 error)
  5. Run initial evals and record baseline
  6. Perform error analysis: Where do failures cluster?

Phase 3: Iteration

  7. Make one focused fix targeting the highest-error component
  8. Re-run evals and record improvement
  9. Repeat until you hit threshold or diminishing returns

Phase 4: Ship Decision

  10. Document your decision: Ship, ship with monitoring, or iterate?
  11. What factors beyond pass rate influenced your decision?

Phase 5: Production

  12. How will you capture production failures for dataset growth?
  13. What's your schedule for regression checks?

Reflect on Your Skill

This lesson synthesizes the entire chapter. Here's what to add to your agent-evals skill:

The Complete Loop Pattern

Build --> Dataset (10-20 cases) --> Evals --> Error Analysis
                                      ^              |
                                      |              v
                                      +--- Fix Lowest Component
                                                     |
                                                     v
                               Ship Decision (≥90% or diminishing returns)
                                                     |
                                                     v
                                    Deploy + Regression Protection
                                                     |
                                                     v
                               Production Monitoring --> Grow Dataset

Ship Decision Heuristics

≥95%: Ship with confidence (high-stakes OK)
90-95%: Ship with light monitoring
80-90%: Ship if low-stakes, active monitoring
70-80%: Prototypes only
≤70%: Keep iterating

Also consider:
- Improvement trajectory (still gaining vs stuck)
- Failure mode (graceful vs confident-wrong)
- User expectations (internal vs customer-facing)
- Current alternative quality (is agent better than status quo?)

Optimization Order

ALWAYS: Quality first, then latency, then cost

Only optimize cost/latency AFTER quality threshold met.
Run full eval suite after ANY optimization.
Reject optimizations that regress quality by more than 2%.

The Two Activities Insight

Agent development = Building + Analysis

Most developers: 90% building, 10% analysis
Successful builders: Flip the ratio

More time understanding failures = faster path to shipping
Don't guess at fixes - measure where errors cluster

The key insight for your skill: The complete quality loop isn't just a process - it's a mindset. Every change is a hypothesis. Every eval run is an experiment. Every error analysis is learning. The teams that internalize this loop ship agents that work. The teams that skip steps ship agents that "work in testing but fail in production."

Try With AI

Prompt 1: Analyze Your Current Loop

I'm building an agent that [describe your agent]. Here's my current
development process:

[Describe how you currently test/improve your agent]

Help me identify:
1. Which of the 10 loop steps am I already doing?
2. Which steps am I skipping or doing poorly?
3. What's the highest-leverage step I should improve first?
4. How would adding that step change my development velocity?

What you're learning: Self-assessment of your current process. Most developers have an informal loop; making it explicit reveals gaps.

Prompt 2: Design Your Ship Decision Framework

My agent is at 85% pass rate. Here's the context:
- Use case: [describe]
- User stakes: [high/medium/low]
- Current alternative: [what do users do without the agent?]
- Iterations so far: [number]
- Improvement trend: [improving/flat/declining]

Help me decide:
1. Should I ship, ship with monitoring, or iterate?
2. What specific criteria would change my decision?
3. If I iterate, what's my target and why?
4. If I ship, what monitoring should I set up?

What you're learning: Ship decisions require judgment, not just thresholds. AI helps you think through the factors systematically.

Prompt 3: Plan Your Optimization Phase

My agent is at 92% quality - above my ship threshold. Now I want to
optimize for cost without regressing quality.

Current setup:
- Model: [model name]
- Average tokens per request: [number]
- Cost per request: [amount]
- Latency: [time]

Help me plan:
1. What optimizations should I try, in what order?
2. What's the minimum eval coverage I need before each change?
3. How do I know if an optimization is worth the risk?
4. What's my rollback plan if quality drops?

What you're learning: Optimization is a disciplined process, not random tweaking. The order matters. The eval coverage matters. The rollback plan matters.

Safety Note

The complete quality loop gives you confidence, but not certainty. Evals measure what you define as quality - they can't catch unknown unknowns. Production will always surface failure modes your dataset missed. Build the loop, trust the loop, but stay humble. The best agents are built by teams who assume they're missing something and keep looking for it.