Systematic Error Analysis
Your agent fails 30% of the time. You have graders that detect failures. But which failures matter most? Which component is causing them? And what should you fix first?
Andrew Ng identified what separates effective AI builders from the rest: "Less experienced teams spend a lot of time building and probably much less time analyzing." When your agent fails, the instinct is to start fixing immediately. You see an error, you have a theory about the cause, and you start coding. But that theory is often wrong.
Consider this scenario: Your web research agent produces poor results. You think "the LLM prompt must be unclear," so you spend two days rewriting prompts. Performance improves slightly. Then you actually count the errors and discover: 45% of failures came from the web search component returning low-quality sources. The prompt was fine. You fixed the wrong component.
Analysis time is not overhead. It is an investment that prevents wasted effort. The developers who spend 30 minutes counting errors before fixing anything outperform those who spend 30 hours fixing the wrong component.
The Build-Analyze Loop
Effective agent development follows a cycle:
Build agent v1
|
v
Run evaluations
|
v
Analyze errors <-- The step most developers skip
|
v
Identify which component failed most
|
v
Fix that specific component
|
v
Run evaluations again
|
v
(Repeat until quality is acceptable)
The trap is moving directly from "run evaluations" to "fix something." Without analysis, you're guessing which component to fix. Even smart guesses are often wrong because agent failures have multiple causes, and the one that comes to mind first is not necessarily the one causing most failures.
Traces and Spans: The Vocabulary of Analysis
Before analyzing errors, you need language to describe what you're examining.
Trace: All intermediate outputs from a single agent run. When your agent processes a query, it might call an LLM, search the web, select sources, and generate a response. The trace captures everything: every input, every intermediate output, every decision point.
Span: The output of a single step within a trace. If the trace is the complete journey, each span is one leg of that journey. The web search span contains query terms and results. The source selection span contains which sources were chosen and why. The final output span contains the response shown to the user.
Think of debugging a flight delay. The trace is the entire trip: airport, plane, connections, destination. Each span is one segment: "check-in took 45 minutes," "boarding delayed 30 minutes," "connection missed due to late arrival." To fix delays, you examine spans to find which segment caused the problem.
The relationship in practice:
| Term | What It Contains | When to Examine |
|---|---|---|
| Trace | Complete agent run from input to output | Understanding overall failure pattern |
| Span | Single step's input, processing, and output | Identifying which component failed |
When your agent produces a poor response, you examine the trace to see what happened. You look at each span to identify where the problem originated. Maybe the web search span returned only blog posts when you needed academic sources. Maybe the source selection span picked the wrong articles from good search results. Maybe the output generation span ignored the best sources.
This vocabulary matters because error analysis requires precision. "The agent failed" doesn't tell you what to fix. "The source selection span selected low-quality blogs despite high-quality academic sources being available in the search results" tells you exactly what to fix.
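If your framework doesn't already record traces for you, a trace can start as something as simple as a dictionary with one entry per span. The sketch below shows one hypothetical shape for a web research trace; the span names and fields are illustrative placeholders, not any particular SDK's format.

```python
# A minimal, hypothetical trace: one dict per agent run, one entry per span.
# Real tracing tools store richer metadata, but this is enough for error analysis.
trace = {
    "input_query": "Black holes",
    "spans": {
        "search_terms": {
            "output": ["black holes overview", "black hole physics"],
        },
        "web_search": {
            "input": ["black holes overview", "black hole physics"],
            "output": ["pop-sci-blog-1.html", "pop-sci-blog-2.html"],  # low-quality sources
        },
        "source_selection": {
            "input": ["pop-sci-blog-1.html", "pop-sci-blog-2.html"],
            "output": ["pop-sci-blog-1.html"],  # inherited the upstream problem
        },
        "final_output": {
            "output": "Black holes are regions of spacetime where...",
        },
    },
}

# Error analysis asks one question of this structure:
# which span FIRST produced a bad output? Here it is web_search;
# source_selection and final_output only inherited its mistake.
```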
The Spreadsheet Method
The most effective error analysis tool is embarrassingly simple: a spreadsheet.
When your agent fails, don't just note "it failed." Break down the trace into spans and mark which spans produced problematic outputs. After analyzing 20-30 failures, patterns emerge from the data rather than from your intuition.
The structure:
| Case | Input | Search Terms | Search Results | Source Selection | Final Output | Error Location |
|---|---|---|---|---|---|---|
| Q1 | "Black holes" | OK | Too many blogs | Based on poor input | Missing key points | Search Results |
| Q2 | "Seattle rent" | OK | OK | Missed relevant blog | OK | Source Selection |
| Q3 | "Robot farming" | Too generic | Poor results | Based on poor input | Missing company | Search Terms |
| Q4 | "Climate 2024" | OK | OK | OK | OK | None |
| Q5 | "AI agents" | OK | Outdated sources | Based on poor input | Stale information | Search Results |
After completing the table, count the Error Location column:
| Error Location | Count | Percentage |
|---|---|---|
| Search Results | 2 | 40% |
| Source Selection | 1 | 20% |
| Search Terms | 1 | 20% |
| None (success) | 1 | 20% |
| Total | 5 | 100% |
Now you have data instead of intuition. If you had guessed, you might have focused on the LLM prompt generating final outputs. The data shows the real problem: the web search component returns low-quality results 40% of the time. Fix that first.
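Once the Error Location column exists, the tally itself is a few lines of Python. Here is a minimal sketch using collections.Counter on the five-case table above (the full analysis pipeline appears later in this section):

```python
from collections import Counter

# Error Location column from the five-case table above
error_locations = [
    "Search Results",    # Q1
    "Source Selection",  # Q2
    "Search Terms",      # Q3
    "None",              # Q4 (success)
    "Search Results",    # Q5
]

total = len(error_locations)
for location, count in Counter(error_locations).most_common():
    print(f"{location}: {count} ({count / total:.0%})")
# Search Results: 2 (40%)
# Source Selection: 1 (20%)
# ...
```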
Why Counting Beats Intuition
Your intuition about error causes is biased. You remember the dramatic failures, not the common ones. You remember the failures you understand, not the ones that confuse you. You remember recent failures more than older ones.
Andrew Ng describes teams that "spend a lot of time building and probably much less time analyzing." These teams chase the last error they saw. They fix the failure that frustrated them most. They work on the component they understand best, regardless of whether it's the component that fails most.
Counting corrects for these biases:
| Bias | How It Misleads | How Counting Corrects |
|---|---|---|
| Availability bias | Recent or dramatic errors feel more common | All errors counted equally |
| Confirmation bias | You notice errors matching your theory | Data shows all patterns |
| Expertise bias | You focus on components you understand | Data reveals unfamiliar problems |
| Anchoring bias | First error you saw dominates thinking | Percentages show true distribution |
When someone says "I think the routing is the problem," ask them: "What percentage of errors come from routing?" If they can't answer with data, they're guessing. Maybe they're right, maybe not. The spreadsheet tells you for certain.
The Prioritization Formula
Knowing which component fails most doesn't automatically tell you what to fix first. A component might fail frequently but be nearly impossible to fix. Another might fail rarely but be trivially fixable.
The prioritization formula balances both factors:
Priority = Frequency x Feasibility
| Factor | What It Measures | Scale |
|---|---|---|
| Frequency | How often this error type occurs | 0-100% of failures |
| Feasibility | How easily you can fix it | 0 (impossible) to 1 (trivial) |
Example prioritization:
| Error Type | Frequency | Feasibility | Priority Score |
|---|---|---|---|
| Search returns blogs | 45% | 0.8 (add filters) | 36 |
| Routing misclassifies | 25% | 0.4 (needs new training data) | 10 |
| Output format wrong | 15% | 0.9 (fix template) | 13.5 |
| Source timeout | 10% | 0.3 (infrastructure change) | 3 |
| Unknown errors | 5% | 0.2 (need investigation) | 1 |
By priority score, you should fix in this order:
- Search returns blogs (score: 36)
- Output format wrong (score: 13.5)
- Routing misclassifies (score: 10)
- Source timeout (score: 3)
- Unknown errors (score: 1)
The routing errors occur more often than format errors (25% vs 15%), but format errors are much easier to fix (0.9 vs 0.4). Fix the format first, then tackle routing.
How to estimate feasibility:
| Feasibility | Description | Example |
|---|---|---|
| 0.9 - 1.0 | Trivial fix, minutes to implement | Change a config value, fix a regex |
| 0.7 - 0.8 | Clear solution, hours to implement | Add search filters, update prompt |
| 0.5 - 0.6 | Known approach, days to implement | Retrain classifier, add new component |
| 0.3 - 0.4 | Uncertain solution, research needed | New architecture, external dependency |
| 0.0 - 0.2 | Unknown cause, investigation required | Intermittent failures, vendor issues |
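The formula is easy to apply by hand, but a short script keeps the arithmetic honest once you track more than a few error types. A minimal sketch using the example numbers from the tables above (the feasibility values are the same rough estimates, not measurements):

```python
# Priority = Frequency x Feasibility, using the example numbers above.
# Frequency is the share of failures (percent); feasibility is a 0-1 estimate.
error_types = [
    ("Search returns blogs",  45, 0.8),
    ("Routing misclassifies", 25, 0.4),
    ("Output format wrong",   15, 0.9),
    ("Source timeout",        10, 0.3),
    ("Unknown errors",         5, 0.2),
]

prioritized = sorted(
    ((name, freq * feas) for name, freq, feas in error_types),
    key=lambda item: item[1],
    reverse=True,
)

for name, score in prioritized:
    print(f"{name}: priority {score:g}")
# Search returns blogs: priority 36
# Output format wrong: priority 13.5
# Routing misclassifies: priority 10
# Source timeout: priority 3
# Unknown errors: priority 1
```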
Generating Error Analysis Data
While spreadsheets work for small-scale analysis, Python code systematizes the process for larger evaluations:
import csv
from dataclasses import dataclass
from collections import Counter


@dataclass
class AnalyzedCase:
    """A single analyzed test case with error attribution."""
    case_id: str
    input_query: str
    search_terms_ok: bool
    search_results_ok: bool
    source_selection_ok: bool
    output_ok: bool
    error_location: str  # Which span failed, or "None"


def analyze_trace(case_id: str, trace: dict) -> AnalyzedCase:
    """
    Analyze a single trace and determine which span caused failure.

    Args:
        case_id: Identifier for this test case
        trace: Dictionary containing span outputs from agent run

    Returns:
        AnalyzedCase with error attribution
    """
    # Extract span quality from trace
    # Your actual logic depends on trace structure
    search_terms_ok = trace.get("search_terms_quality", "OK") == "OK"
    search_results_ok = trace.get("search_results_quality", "OK") == "OK"
    source_selection_ok = trace.get("source_selection_quality", "OK") == "OK"
    output_ok = trace.get("output_quality", "OK") == "OK"

    # Attribute error to first failing span
    if not search_terms_ok:
        error_location = "Search Terms"
    elif not search_results_ok:
        error_location = "Search Results"
    elif not source_selection_ok:
        error_location = "Source Selection"
    elif not output_ok:
        error_location = "Output"
    else:
        error_location = "None"

    return AnalyzedCase(
        case_id=case_id,
        input_query=trace.get("input_query", ""),
        search_terms_ok=search_terms_ok,
        search_results_ok=search_results_ok,
        source_selection_ok=source_selection_ok,
        output_ok=output_ok,
        error_location=error_location,
    )


def generate_error_report(cases: list[AnalyzedCase]) -> dict:
    """
    Generate error analysis report from analyzed cases.

    Returns:
        Dictionary with error counts, percentages, and recommendations
    """
    error_counts = Counter(case.error_location for case in cases)
    total = len(cases)

    # Calculate percentages
    error_percentages = {
        location: (count / total) * 100
        for location, count in error_counts.items()
    }

    # Sort by frequency
    sorted_errors = sorted(
        error_percentages.items(),
        key=lambda x: x[1],
        reverse=True
    )

    # Identify highest-frequency error (excluding "None", which means success)
    non_success = [(loc, pct) for loc, pct in sorted_errors if loc != "None"]
    recommendation = non_success[0][0] if non_success else "No errors detected"

    return {
        "total_cases": total,
        "error_counts": dict(error_counts),
        "error_percentages": error_percentages,
        "sorted_by_frequency": sorted_errors,
        "recommendation": f"Focus on: {recommendation} ({error_percentages.get(recommendation, 0):.1f}% of failures)"
    }


def export_to_csv(cases: list[AnalyzedCase], filename: str) -> None:
    """Export analyzed cases to CSV for spreadsheet analysis."""
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([
            "Case ID", "Input Query", "Search Terms OK",
            "Search Results OK", "Source Selection OK",
            "Output OK", "Error Location"
        ])
        for case in cases:
            writer.writerow([
                case.case_id,
                case.input_query,
                "OK" if case.search_terms_ok else "ERROR",
                "OK" if case.search_results_ok else "ERROR",
                "OK" if case.source_selection_ok else "ERROR",
                "OK" if case.output_ok else "ERROR",
                case.error_location,
            ])
Example usage:
# Example usage with sample data
traces = [
    {"case_id": "Q1", "input_query": "Black holes",
     "search_results_quality": "ERROR", "output_quality": "ERROR"},
    {"case_id": "Q2", "input_query": "Seattle rent",
     "source_selection_quality": "ERROR"},
    {"case_id": "Q3", "input_query": "Robot farming",
     "search_terms_quality": "ERROR", "search_results_quality": "ERROR"},
    {"case_id": "Q4", "input_query": "Climate 2024"},  # All OK
    {"case_id": "Q5", "input_query": "AI agents",
     "search_results_quality": "ERROR"},
]

# Analyze all traces
cases = [analyze_trace(t["case_id"], t) for t in traces]

# Generate report
report = generate_error_report(cases)
print(f"Total cases analyzed: {report['total_cases']}")
print("\nError distribution:")
for location, percentage in report['sorted_by_frequency']:
    print(f"  {location}: {percentage:.1f}%")
print(f"\n{report['recommendation']}")

# Export for spreadsheet review
export_to_csv(cases, "error_analysis.csv")
print("\nExported to error_analysis.csv")
Output:

Total cases analyzed: 5

Error distribution:
  Search Results: 40.0%
  Source Selection: 20.0%
  Search Terms: 20.0%
  None: 20.0%

Focus on: Search Results (40.0% of failures)

Exported to error_analysis.csv
Exercise: Analyze Task API Agent Failures
Your Task API agent helps users manage their tasks. You've run 20 test cases and collected traces. Here are the summarized results:
| Case | Input | Intent Recognition | Database Query | Task Matching | Response Generation | Overall |
|---|---|---|---|---|---|---|
| 1 | "What's next?" | OK | OK | OK | OK | PASS |
| 2 | "Add dentist call" | OK | OK | N/A | OK | PASS |
| 3 | "Overdue tasks" | OK | ERROR (timeout) | N/A | ERROR | FAIL |
| 4 | "Mark groceries done" | ERROR (ambiguous) | N/A | N/A | ERROR | FAIL |
| 5 | "High priority only" | OK | OK | OK | OK | PASS |
| 6 | "Delete old tasks" | OK | ERROR (timeout) | N/A | ERROR | FAIL |
| 7 | "What did I finish?" | OK | OK | OK | OK | PASS |
| 8 | "Tasks for Monday" | OK | OK | OK | OK | PASS |
| 9 | "Add buy milk" | ERROR (truncated) | N/A | N/A | ERROR | FAIL |
| 10 | "Show everything" | OK | OK | OK | OK | PASS |
| 11 | "Complete the report" | ERROR (ambiguous) | N/A | N/A | ERROR | FAIL |
| 12 | "Overdue items" | OK | OK | OK | OK | PASS |
| 13 | "What's urgent?" | OK | OK | OK | OK | PASS |
| 14 | "Delete meeting prep" | OK | OK | ERROR (matched wrong task) | ERROR | FAIL |
| 15 | "Add task call mom" | OK | OK | N/A | OK | PASS |
| 16 | "Show today's tasks" | OK | ERROR (timeout) | N/A | ERROR | FAIL |
| 17 | "Mark done: email" | OK | OK | ERROR (matched wrong task) | ERROR | FAIL |
| 18 | "What needs attention?" | OK | OK | OK | OK | PASS |
| 19 | "Clear completed" | OK | OK | OK | OK | PASS |
| 20 | "Add reminder walk dog" | ERROR (truncated) | N/A | N/A | ERROR | FAIL |
Your task:
- Count errors by component
- Calculate percentages
- Apply the prioritization formula (estimate feasibility yourself)
- Determine which component to fix first
Work through this before reading the solution below.
Solution:
Step 1: Count errors by component
| Component | Error Count |
|---|---|
| Intent Recognition | 4 (Cases 4, 9, 11, 20) |
| Database Query | 3 (Cases 3, 6, 16) |
| Task Matching | 2 (Cases 14, 17) |
| Response Generation | 0 (errors are downstream from other failures) |
| None (success) | 11 |
Step 2: Calculate percentages (of 20 total, 9 failures)
| Component | Percentage of Failures |
|---|---|
| Intent Recognition | 4/9 = 44% |
| Database Query | 3/9 = 33% |
| Task Matching | 2/9 = 22% |
Step 3: Apply prioritization formula
| Error Type | Frequency | Feasibility | Priority Score |
|---|---|---|---|
| Intent Recognition (ambiguous/truncated) | 44% | 0.6 (prompt engineering, add examples) | 26.4 |
| Database Query (timeout) | 33% | 0.4 (infrastructure, caching, indices) | 13.2 |
| Task Matching (wrong task) | 22% | 0.7 (improve matching algorithm) | 15.4 |
Step 4: Recommended fix order
- Intent Recognition (score: 26.4) - Fix the prompt to handle ambiguous and truncated inputs
- Task Matching (score: 15.4) - Improve the matching logic for similar task names
- Database Query (score: 13.2) - Address timeout issues (lower priority due to infrastructure complexity)
Even though database timeouts are frustrating, they're harder to fix than prompt improvements. Start with intent recognition.
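As a sanity check, the same formula can be scripted with the rounded percentages and feasibility estimates from the solution above (the feasibility numbers are judgment calls; yours may reasonably differ):

```python
# Task API exercise: percentage of failures and feasibility per component.
components = [
    ("Intent Recognition", 44, 0.6),
    ("Database Query",     33, 0.4),
    ("Task Matching",      22, 0.7),
]

# Priority = Frequency x Feasibility, highest first
for name, freq, feas in sorted(components, key=lambda c: c[1] * c[2], reverse=True):
    print(f"{name}: priority {freq * feas:g}")
# Intent Recognition: priority 26.4
# Task Matching: priority 15.4
# Database Query: priority 13.2
```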
Reflect on Your Skill
After practicing systematic error analysis, add these patterns to your agent-evals skill:
Pattern: The Spreadsheet Method
When analyzing agent failures:
1. Create a table with columns for each component/span
2. Mark each span as OK or ERROR
3. Identify which span FIRST introduced the error
4. Count error locations across all failures
5. Percentages reveal where to focus
Pattern: Prioritization Formula
Priority = Frequency x Feasibility
Frequency: Percentage of failures from this component
Feasibility: How easily you can fix it (0 to 1)
Fix high-priority items first, even when lower-priority
errors are more frustrating to debug.
Pattern: Trace and Span Vocabulary
Trace: Complete record of one agent run
- Contains all intermediate outputs
- Shows the full journey from input to output
Span: Output of single step within trace
- Web search span: query + results
- Selection span: which sources chosen
- Output span: final response
Error attribution: Which span first produced bad output?
Downstream spans inherit upstream errors.
Key insight to encode: Don't rely on gut feel. Count errors systematically. The time spent analyzing is an investment that prevents wasted effort fixing the wrong component. Andrew Ng's observation that "less experienced teams spend a lot of time building and probably much less time analyzing" describes the difference between methodical improvement and thrashing.
Try With AI
Prompt 1: Design Error Categories for Your Agent
I'm building error analysis for my [describe agent type] agent.
The agent has these components:
- [Component 1, e.g., "intent classification"]
- [Component 2, e.g., "data retrieval"]
- [Component 3, e.g., "response generation"]
Help me design a spreadsheet structure for error analysis:
1. What columns should I track for each test case?
2. What error categories make sense for each component?
3. How should I attribute errors when multiple components fail?
Give me a template I can use with 20 test cases.
What you're learning: Error categories must match your specific agent architecture. Generic categories like "LLM error" don't help you fix anything. AI helps you design categories specific to your component structure.
Prompt 2: Estimate Feasibility for Your Error Types
I've identified these error types in my agent (with frequency):
1. [Error type 1] - [X]% of failures
2. [Error type 2] - [Y]% of failures
3. [Error type 3] - [Z]% of failures
For each error type, help me estimate feasibility:
- What would fixing it involve?
- Is it a code change, prompt change, or infrastructure change?
- What unknowns would require investigation?
Use the 0-1 feasibility scale where 0.9-1.0 is trivial and 0.0-0.2 needs investigation.
Then calculate priority scores and recommend my fix order.
What you're learning: Feasibility estimation requires understanding the fix. You might think "improve intent recognition" is easy, but it could require new training data (hard) or just adding examples to the prompt (easy). AI helps you think through what the fix actually involves.
Prompt 3: Generate Error Analysis Code for Your Framework
I'm using [OpenAI Agents SDK / Claude SDK / Google ADK / custom framework]
for my agent. My agent has these spans:
- [Span 1]
- [Span 2]
- [Span 3]
Write Python code that:
1. Extracts span outputs from my framework's trace format
2. Classifies each span as OK or ERROR based on [describe your criteria]
3. Produces error analysis CSV with component attribution
4. Calculates error percentages and recommends focus area
Include example output showing what the analysis would look like
for 5 sample failures.
What you're learning: Error analysis automation is framework-specific. The trace format differs between SDKs. AI helps you bridge from generic patterns to your specific implementation.
Safety Note
Error analysis reveals patterns in your agent's failures, but patterns are not always causes. A component might fail frequently because it receives bad input from an earlier component, not because it's broken. Always trace errors back to their root cause by examining full traces, not just counting which span flagged the error. The goal is systematic improvement, not blame attribution.