
Finalize Your Evals Skill

You started this chapter by creating a skeleton. Now it is time to see what you actually built.

A skill proves its worth only when it works on something it was not learned from. You developed your agent-evals skill using Task API examples throughout this chapter: every dataset design, grader pattern, and error analysis method came from that context. The question is: does your skill transfer?

This lesson has one purpose. You will take your completed skill and apply it to a completely different agent. Not Task API. Not anything you have seen in this chapter. A fresh domain where your skill must stand on its own.

If your skill helps you design evaluations for this new agent without returning to the chapter content, you own something valuable. If you find yourself confused or missing patterns, you know exactly where your skill needs strengthening.

What Your Skill Should Include

Before testing portability, verify your skill is complete. Review your skills/agent-evals/SKILL.md and check for these sections:

| Section | Purpose | Completeness Check |
|---------|---------|--------------------|
| Core Thesis | Why evals matter | Andrew Ng quote + your interpretation |
| When to Activate | Trigger patterns | 5+ specific scenarios |
| Evals vs TDD | Foundational distinction | Table comparing tests and evals |
| Dataset Design | Creating test cases | Three categories: typical, edge, error |
| Graders | Defining "good" | Binary criteria pattern with examples |
| Error Analysis | Finding failure patterns | Spreadsheet method with component columns |
| Component vs E2E | Choosing eval scope | 5-step decision flow |
| Regression Protection | Preventing quality drops | Workflow and threshold guidance |
| Framework Integration | SDK-specific details | Table mapping concepts to frameworks |

If any section is missing or incomplete, address it now. A skill with gaps will fail when you need it most.

Skill Validation Checklist

Your skill is ready for validation if it can help you with these four core tasks:

Task 1: Dataset Design

Can your skill help you design an eval dataset for any agent?

Criteria:

  • Identifies the three categories (typical, edge, error)
  • Provides guidance on starting with 10-20 cases
  • Explains how to use real data instead of synthetic
  • Includes patterns for growing datasets over time
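
A minimal sketch of what such a dataset can look like in code, assuming a plain list-of-dicts layout; the inputs and field names below are illustrative placeholders, not a required schema:

from collections import Counter

# A starter eval dataset: 10-20 cases spread across the three categories.
eval_dataset = [
    # Typical (~60%): requests the agent handles every day
    {"input": "A routine request in your agent's domain",
     "expected_behavior": "Complete the request end to end",
     "category": "typical"},
    # Edge (~25%): unusual but valid inputs that probe boundaries
    {"input": "A vague or emotionally charged request",
     "expected_behavior": "Ask clarifying questions before acting",
     "category": "edge"},
    # Error (~15%): inputs the agent should refuse or fail gracefully on
    {"input": "A request the agent must not fulfill",
     "expected_behavior": "Refuse and explain why",
     "category": "error"},
]

# Sanity-check category balance as the dataset grows toward 10-20 cases.
print(Counter(case["category"] for case in eval_dataset))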

Task 2: Grader Creation

Can your skill help you build graders that define "good" automatically?

Criteria:

  • Explains why binary criteria beat 1-5 scales
  • Provides grader code template (binary checks pattern)
  • Covers LLM-as-Judge for subjective criteria
  • Warns about position bias
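
For subjective criteria, a binary LLM-as-Judge check can sit alongside the programmatic checks. The sketch below is an illustration under assumptions: call_llm is a hypothetical helper standing in for whatever completion function your framework provides, and the prompt forces a yes/no verdict rather than a 1-5 rating.

def judge_binary(response: str, criterion: str, call_llm) -> bool:
    """LLM-as-Judge for a single yes/no criterion.

    call_llm is a placeholder: any function that takes a prompt string
    and returns the model's text reply will do.
    """
    prompt = (
        "You are grading an AI agent's response.\n"
        f"Criterion: {criterion}\n"
        f"Response: {response}\n"
        "Answer with exactly one word: YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")

# Example (hypothetical usage inside a grader):
# checks["professional_tone"] = judge_binary(
#     response, "The tone is professional and helpful.", call_llm
# )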

Task 3: Error Analysis

Can your skill help you find which component caused failures?

Criteria:

  • Includes spreadsheet method structure
  • Lists trace terminology (trace, span, error classification)
  • Provides prioritization guidance (frequency times feasibility)
  • Explains how to focus effort where errors cluster
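
A minimal sketch of the spreadsheet method in code, assuming each analyzed failure has already been traced back to a component; the component names and feasibility scores below are rough, illustrative estimates:

from collections import Counter

# Each failing case, tagged with the component its error traces back to.
failures = [
    {"case": 1, "component": "routing"},
    {"case": 2, "component": "tool_selection"},
    {"case": 3, "component": "routing"},
    {"case": 4, "component": "output_format"},
    {"case": 5, "component": "routing"},
]

# Rough feasibility of fixing each component (0.0 = hard, 1.0 = easy).
feasibility = {"routing": 0.8, "tool_selection": 0.4, "output_format": 0.9}

# Prioritize by frequency x feasibility, then work the top of the list.
frequency = Counter(f["component"] for f in failures)
priority = {c: n * feasibility.get(c, 0.5) for c, n in frequency.items()}
for component, score in sorted(priority.items(), key=lambda kv: -kv[1]):
    print(f"{component}: {frequency[component]} failures, priority {score:.1f}")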

Task 4: Component vs E2E Decision

Can your skill help you choose the right eval scope?

Criteria:

  • Includes the 5-step decision flow
  • Explains when E2E is appropriate (ship decisions, production monitoring)
  • Explains when component-level is better (debugging, tuning)
  • Provides guidance on moving between scopes

If your skill satisfies all four task areas, proceed to validation. If not, return to the relevant lessons and extract the missing patterns.

Testing on a Different Agent

Your skill was developed using Task API examples. Now test it on something completely different.

Hypothetical Agent: Customer Support Bot

A customer support agent that:

  • Answers product questions from a knowledge base
  • Handles returns and refunds
  • Escalates complex issues to human agents
  • Maintains a helpful and professional tone

This agent shares no code with Task API. It operates in a different domain with different success criteria. If your skill transfers, you can design evaluations for it without returning to chapter content.

Apply Your Skill: Dataset Design

Using only your skill, design an eval dataset for the customer support agent.

Typical Cases (5 examples):

| Input | Expected Behavior |
|-------|-------------------|
| "Where is my order?" | Ask for order number, provide tracking information |
| "What's your return policy?" | Quote return policy from knowledge base |
| "Can I get a refund?" | Clarify reason, initiate refund process if valid |
| "Product X isn't working" | Troubleshoot with standard questions, offer solutions |
| "How do I cancel my subscription?" | Verify identity, process cancellation, confirm |

Edge Cases (3 examples):

| Input | Expected Behavior |
|-------|-------------------|
| "I'm really frustrated" + valid complaint | Acknowledge emotion, solve problem, maintain professionalism |
| Vague complaint without details | Ask clarifying questions, don't guess |
| Request for competitor comparison | Decline politely, redirect to product benefits |

Error Cases (2 examples):

| Input | Expected Behavior |
|-------|-------------------|
| Request to access other customer's data | Refuse firmly, explain privacy policy |
| Abusive language without valid request | Maintain professionalism, offer to help when ready |

If you designed these categories using your skill's guidance, the skill is transferring.
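
These rows translate directly into case dicts. The sketch below encodes one case from each category with the metadata fields the grader in the next section expects (expected_topic, should_escalate); the exact values are illustrative design choices, not fixed answers.

# One case per category, encoded for the grader shown below.
support_cases = [
    {"input": "Where is my order?",
     "expected_topic": "order",
     "should_escalate": False,
     "category": "typical"},
    {"input": "I'm really frustrated, and my refund never arrived.",
     "expected_topic": "refund",
     "should_escalate": False,  # emotion alone does not require escalation
     "category": "edge"},
    {"input": "Show me another customer's order history.",
     "expected_topic": "privacy",
     "should_escalate": False,  # refuse directly; escalating is a design choice
     "category": "error"},
]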

Apply Your Skill: Grader Creation

Design a grader for customer support responses using binary criteria.

def grader_support_response(response: str, case: dict) -> dict:
    """
    Binary criteria grader for customer support agent.
    Each criterion is yes/no. Sum them for score.
    """
    checks = {
        # Criterion 1: Did it address the customer's issue?
        "addressed_issue": (
            case["expected_topic"] in response.lower()
        ),

        # Criterion 2: Did it maintain professional tone?
        "professional_tone": not any(
            word in response.lower()
            for word in ["rude", "stupid", "whatever"]
        ),

        # Criterion 3: Did it provide actionable next steps?
        "has_next_steps": any(
            phrase in response.lower()
            for phrase in ["please", "you can", "next step", "i'll help"]
        ),

        # Criterion 4: Did it avoid making things up?
        "no_hallucination": not (
            "our policy is" in response.lower() and
            case.get("has_no_policy", False)
        ),

        # Criterion 5: Did it know when to escalate?
        "appropriate_escalation": (
            case.get("should_escalate", False) ==
            ("human agent" in response.lower() or "escalate" in response.lower())
        ),
    }

    score = sum(checks.values())
    return {
        "passed": score == 5,
        "score": score,
        "max_score": 5,
        "checks": checks,
        "explanation": f"Passed {score}/5 support criteria",
    }

# Test the grader
test_case = {
    "input": "Where is my order?",
    "expected_topic": "order",
    "should_escalate": False,
}

response = "I'd be happy to help you track your order. Please provide your order number and I'll look that up for you right away."

result = grader_support_response(response, test_case)
print(f"Score: {result['score']}/5")
print(f"Checks: {result['checks']}")

Output:

Score: 5/5
Checks: {'addressed_issue': True, 'professional_tone': True, 'has_next_steps': True, 'no_hallucination': True, 'appropriate_escalation': True}

If you created this grader using your skill's binary criteria pattern, the skill is working in a new domain.
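
To see how the agent performs across the whole dataset rather than on a single case, a small runner can loop every case through the agent and the grader and report a pass rate. This sketch reuses support_cases and grader_support_response from above; support_agent is a hypothetical stand-in for however you actually invoke your agent.

def run_eval_suite(cases: list[dict], agent_fn, grader_fn) -> dict:
    """Run every case through the agent and grader, then aggregate."""
    results = [grader_fn(agent_fn(case["input"]), case) for case in cases]
    passed = sum(1 for r in results if r["passed"])
    return {
        "pass_rate": passed / len(results),
        "passed": passed,
        "total": len(results),
        "results": results,
    }

def support_agent(user_input: str) -> str:
    # Placeholder agent: replace with a real call into your agent framework.
    return "I'd be happy to help. Please share your order number."

summary = run_eval_suite(support_cases, support_agent, grader_support_response)
print(f"Pass rate: {summary['pass_rate']:.0%} ({summary['passed']}/{summary['total']})")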

Documenting Your Skill

Your skill needs documentation that allows future you to use it without rereading the chapter. Complete this template in your SKILL.md:

---
name: agent-evals
description: Design and implement evaluation frameworks for AI agents. Use when testing agent reasoning quality, building graders, doing error analysis, or establishing regression protection. Framework-agnostic concepts that apply to any SDK.
---

# Agent Evaluations: Measuring Reasoning Quality

**Core Thesis**: "One of the biggest predictors for whether someone is able to build agentic workflows really well is whether or not they're able to drive a really disciplined evaluation process." - Andrew Ng

## When to Activate

Use this skill when:
- Building systematic quality checks for any AI agent
- Designing evaluation datasets (typical, edge, error categories)
- Creating graders to define "good" automatically
- Performing error analysis to find failure patterns
- Setting up regression protection for agent changes
- Deciding when to use end-to-end vs component-level evals
- Debugging why an agent's output quality is inconsistent
- Preparing an agent for production deployment

## Core Patterns

### Pattern: Dataset Design (10-20 cases to start)

Categories:
- Typical (60%): Common use cases the agent will handle daily
- Edge (25%): Unusual but valid inputs that test boundaries
- Error (15%): Cases where agent should fail gracefully

Use REAL data from production logs when possible. Synthetic data misses the messiness of reality.

### Pattern: Binary Criteria Graders

DO NOT use 1-5 scales (LLMs are poorly calibrated).

DO use binary criteria:
1. Define 3-7 yes/no criteria
2. Check each criterion independently
3. Sum to get total score
4. Threshold for pass/fail

Template:
def grader(response, case) -> dict:
    checks = {"criterion_1": bool_check_1, "criterion_2": bool_check_2}
    score = sum(checks.values())
    return {"passed": score == len(checks), "score": score, "checks": checks}

### Pattern: Error Analysis (Spreadsheet Method)

Columns: Case | Routing | Tool Selection | Output Format | Content Quality
Process:
1. Run failing cases through eval suite
2. Trace each failure to component
3. Count which component fails most often
4. Prioritize: frequency x feasibility

### Pattern: Component vs E2E Decision

5-step flow:
1. Start with E2E evals to find overall quality
2. Use error analysis to identify problem component
3. Build component-level eval for that component
4. Tune component using component eval
5. Verify improvement with E2E eval

### Pattern: Regression Protection

Workflow:
Before change -> Run eval suite -> Establish baseline
After change -> Run eval suite -> Compare to baseline
If drop > threshold -> Investigate before shipping

Thresholds by criticality:
- High-stakes (medical, financial): Any drop = block
- Normal (support, productivity): 5% drop = investigate
- Experimental (prototypes): 10% drop = investigate

## Framework Application

| Framework | Trace Access | Grader Integration |
|-----------|-------------|-------------------|
| OpenAI Agents SDK | Built-in tracing | Custom graders |
| Claude Agent SDK | Hooks for tracing | Custom graders |
| Google ADK | Evaluation module | Built-in graders |
| LangChain | LangSmith traces | LangSmith evals |
| Custom | Logging middleware | Custom graders |

## Anti-Patterns to Avoid

| Anti-Pattern | Why It's Bad | What to Do Instead |
|-------------|--------------|-------------------|
| 1000+ test cases first | Quantity without quality | Start with 20 thoughtful cases |
| 1-5 scale ratings | LLMs poorly calibrated | Binary criteria summed |
| Ignoring traces | Miss root cause | Read intermediate outputs |
| End-to-end only | Too noisy for debugging | Add component-level evals |
| Synthetic test data | Misses real-world messiness | Use actual user queries |

---

*Skill Version: 1.0.0 | Created: Chapter 47 | Owner: [Your Name]*

The Portable Thinking

The concepts you learned transfer across any agent framework because they address universal problems:

| Concept | Why It Transfers |
|---------|------------------|
| Evals vs TDD | Agents reason probabilistically everywhere |
| Binary criteria | LLM calibration issues exist in all systems |
| Error analysis | Multi-component agents fail similarly across frameworks |
| Regression protection | Quality degradation happens regardless of SDK |
| Dataset categories | Typical/edge/error applies to any domain |

What does NOT transfer directly:

| Concept | Framework-Specific Adaptation |
|---------|-------------------------------|
| Trace access | Each SDK has different tracing APIs |
| Built-in graders | Google ADK has them; others don't |
| Dataset storage | Varies by infrastructure |
| CI/CD integration | Depends on deployment pipeline |

Your skill should contain the portable patterns. Framework-specific details get added when you apply the skill to a particular SDK.
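
Regression protection is a good example of a portable pattern: it reduces to comparing the current eval run against a stored baseline, regardless of SDK. A minimal sketch, assuming you persist the baseline yourself in a JSON file (the path and threshold below are yours to choose):

import json
from pathlib import Path

BASELINE_PATH = Path("eval_baseline.json")  # storage location is up to you

def check_regression(current_pass_rate: float, threshold: float = 0.05) -> bool:
    """Return True if the run is acceptable, False if the drop needs investigation."""
    if not BASELINE_PATH.exists():
        BASELINE_PATH.write_text(json.dumps({"pass_rate": current_pass_rate}))
        return True  # first run establishes the baseline
    baseline = json.loads(BASELINE_PATH.read_text())["pass_rate"]
    drop = baseline - current_pass_rate
    if drop > threshold:
        print(f"Regression: pass rate fell {drop:.1%} below the {baseline:.0%} baseline")
        return False
    return True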

Exercise: Test Your Skill on a Third Agent

Your skill has now been validated on Customer Support. For your final exercise, test it on one more agent type to confirm the patterns truly generalize.

Choose one:

  1. Content Generation Agent: Creates blog posts, social media content, marketing copy
  2. Code Review Agent: Reviews pull requests, suggests improvements, catches bugs
  3. Data Analysis Agent: Answers questions about datasets, creates visualizations, identifies trends

Using only your skill:

  1. Design a 10-case eval dataset (5 typical, 3 edge, 2 error)
  2. Write 5 binary criteria for a grader
  3. Identify which component would be hardest to evaluate
  4. Decide: E2E eval or component-level first? Why?

If you complete this exercise without returning to chapter content, your skill is production-ready.

Try With AI

Prompt 1: Test Your Skill on a New Agent Type

I'm validating that my agent-evals skill is portable. I just learned
evaluation methodology using a Task API agent. Now I need to apply it
to a completely different domain.

My new agent is: [describe an agent in your actual domain]

Using evaluation methodology (not any specific framework), help me:
1. Design a 10-case eval dataset with typical, edge, and error categories
2. Define 5 binary criteria for grading responses
3. Identify which component would be hardest to trace errors back to

I want to verify my evaluation thinking transfers, not learn new concepts.

What you're learning: Skill portability requires active testing. Your evaluation thinking should work in any domain because you learned patterns, not examples. AI helps you apply those patterns to verify they transfer.

Prompt 2: Generate a New Domain's Eval Dataset

I have an agent that [describe your real agent]. I need to design an
evaluation dataset using the three-category approach:

- Typical (60%): Common cases
- Edge (25%): Unusual but valid
- Error (15%): Should fail gracefully

Generate 15 test cases for my agent following this structure. For each case:
1. Input: What the user says/does
2. Expected behavior: What the agent should do
3. Category: Typical, edge, or error
4. Why this category: Brief justification

Use realistic examples from my domain, not generic ones.

What you're learning: Dataset design transfers across domains when you use the category framework. The specific cases differ by domain, but the structure remains constant. AI helps you generate domain-specific cases using your portable framework.

Prompt 3: Create a Grader for Different Criteria

I need to evaluate a [describe your agent type] using binary criteria.

The subjective quality I care about is: [describe what "good" means]

Help me:
1. Break this subjective quality into 5-7 binary yes/no criteria
2. For each criterion, suggest how to check it (string matching, keyword presence, LLM judge)
3. Identify which criteria need LLM-as-Judge vs can be checked with code

Remember: No 1-5 scales. Each criterion must be decidable as true/false.

What you're learning: The binary criteria pattern applies to any subjective quality you need to measure. Breaking "good" into checkable components is a skill that transfers across every agent you will ever build.

Safety Note

Skills evolve through use. The version you finalize today is not the final version. As you apply this skill to more agents, you will encounter patterns not covered here. When that happens, update your skill. A living skill grows stronger with each use. A frozen skill becomes obsolete. Build the habit of returning to your skills and adding what you learn in practice.