Designing Eval Datasets
You have decided to build systematic evaluations for your agent. Now comes the question that stops most teams: How many test cases do you need?
Your instinct says more is better. A thousand test cases must be more thorough than twenty. Enterprise software has thousands of unit tests. Why would agent evaluation be different?
Here is why that instinct is wrong, and what to do instead.
The Paralysis Problem
Teams building agent evaluations often fall into one of two traps:
Trap 1: Perfection paralysis. "We need a comprehensive dataset before we can start evaluating." The team spends weeks designing a taxonomy, debating coverage, and building elaborate labeling systems. Meanwhile, the agent ships without any evaluation at all because the "comprehensive" dataset is never finished.
Trap 2: Volume delusion. "We generated 5,000 synthetic test cases, so our coverage must be good." The team runs evaluations, sees 94% pass rate, ships confidently. Then production failures emerge in patterns the synthetic data never captured.
Both traps share the same root cause: confusing quantity with quality.
Quality Over Quantity: The Andrew Ng Insight
Andrew Ng, who has advised hundreds of teams building AI systems, offers a counterintuitive directive:
"Quick and dirty evals is fine to get started. 10-20 examples... You don't have to have a thousand examples to get started."
Why does this work? Because the bottleneck in agent improvement is not data volume. It is understanding WHY things fail.
Ten thoughtful cases tell you:
- Which specific scenarios break your agent
- What patterns of failure repeat
- Where to focus improvement efforts
One thousand random cases tell you:
- A pass rate number
- Very little about WHY failures occur
- Almost nothing actionable about how to improve
The purpose of evaluation is not to generate a score. It is to drive improvement. A small dataset that reveals failure patterns is infinitely more valuable than a large dataset that hides them in noise.
The Three-Category Framework
Every eval dataset should cover three categories with different purposes:
| Category | Count | Purpose | What it tests |
|---|---|---|---|
| Typical | 10 | Common use cases | Does the agent handle normal requests well? |
| Edge | 5 | Unusual but valid | Does the agent handle uncommon situations gracefully? |
| Error | 5 | Should fail gracefully | Does the agent recognize when it cannot help and respond appropriately? |
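If you keep your cases in a single file, a few lines of Python can confirm the dataset still matches this 10/5/5 plan. This is a minimal sketch: the `cases.json` file name and the `category` field are assumptions about how you store cases (the case structure itself is defined later in this lesson).

```python
import json
from collections import Counter

# Target distribution from the three-category framework above.
TARGET = {"typical": 10, "edge": 5, "error": 5}

def check_balance(path: str = "cases.json") -> None:
    """Warn when the dataset drifts from the intended category mix."""
    with open(path) as f:
        cases = json.load(f)  # expects a list of case objects with a "category" field
    counts = Counter(case["category"] for case in cases)
    for category, target in TARGET.items():
        actual = counts.get(category, 0)
        if actual != target:
            print(f"{category}: have {actual} cases, target is {target}")

if __name__ == "__main__":
    check_balance()
```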
Typical Cases (10)
These are the bread-and-butter scenarios your agent was designed for. If it fails these, nothing else matters.
Characteristics:
- Represent the most common 80% of real usage
- Clear user intent, reasonable complexity
- Success criteria are well-defined
- These should have HIGH pass rates (90%+)
Examples for a Task API agent:
- "Create a task called 'Buy groceries' for tomorrow"
- "Show me my tasks for this week"
- "Mark the dentist appointment as done"
- "What's my highest priority task?"
Why 10 cases? The number is not arbitrary: ten typical cases cover the core user journeys without redundancy. If you find yourself writing similar cases over and over, you have enough.
Edge Cases (5)
These are unusual but valid requests. Users do weird things. Your agent should handle them.
Characteristics:
- Uncommon but legitimate use cases
- May require the agent to make judgment calls
- Success might mean "reasonable response" not "perfect response"
- Acceptable pass rate is lower (70-80%)
Examples for a Task API agent:
- "Create a task with emoji title: '🎉 Birthday party planning'"
- "Schedule something for 'next week sometime'"
- "Add a task that repeats every other Thursday except holidays"
- "Create 15 tasks at once from this list: ..."
Why 5 cases? Edge cases have diminishing returns. The first few reveal whether your agent handles ambiguity; beyond five, you tend to find the same classes of issues repeatedly.
Error Cases (5)
These are requests where your agent SHOULD fail gracefully. Not every request is valid. A good agent recognizes its limitations.
Characteristics:
- Requests outside the agent's capabilities
- Malformed or nonsensical inputs
- Should NOT succeed (success here is actually failure)
- Tests graceful degradation
Examples for a Task API agent:
- "Delete all my tasks" (if the agent lacks delete capability)
- "What's the weather tomorrow?" (outside task domain)
- "asdfghjkl" (nonsense input)
- "Create task for yesterday" (impossible request)
- "Hack into the mainframe" (inappropriate request)
Why 5 cases? Error handling follows clear patterns. A handful of cases verifies that the agent recognizes the main failure modes: out of scope, malformed, impossible, and inappropriate.
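Note that grading is inverted for this category: the agent passes an error case by declining gracefully, not by attempting the action. A minimal Python sketch, assuming each case carries the `should_succeed` flag defined later in this lesson and the agent's result exposes hypothetical `completed` and `refused` flags:

```python
def grade(case: dict, agent_result: dict) -> bool:
    """Return True when the agent behaved as the case expects."""
    should_succeed = case["expected_behavior"]["should_succeed"]
    if should_succeed:
        # Typical and edge cases: the agent must actually complete the request.
        return agent_result["completed"]
    # Error cases: passing means the agent declined instead of attempting the action.
    return agent_result["refused"] and not agent_result["completed"]
```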
Using REAL Data
Here is where most evaluation efforts go wrong: synthetic data.
Synthetic data consists of test cases you imagine users might submit. It is clean, well-formed, and represents how you THINK users behave.
Real data consists of actual user queries drawn from production logs, support tickets, and customer feedback. It is messy, unexpected, and represents how users ACTUALLY behave.
The gap between synthetic and real data is enormous:
| Synthetic Data | Real Data |
|---|---|
| "Create a task for tomorrow" | "can u remind me 2 call mom tmrw" |
| "What tasks are due this week?" | "whats due" |
| "Mark task as complete" | "done with that thing I added yesterday" |
| Clean grammar, clear intent | Typos, abbreviations, assumed context |
Where to Find Real Data
Your production system already contains gold for evaluation:
1. Application logs
- Actual user inputs and agent responses
- Filter for sessions with negative feedback or repeated attempts
- Look for queries that led to support tickets
2. Support tickets
- User complaints describe exactly what failed
- "I asked for X but got Y" is a perfect eval case
- Support volume indicates which failures matter most
3. User feedback
- Thumbs down, negative ratings, explicit complaints
- Each piece of negative feedback is a potential eval case
- Positive feedback can validate typical cases
4. Session recordings
- Multi-turn conversations where users gave up
- Indicates where agent behavior frustrated users
- Reveals context-dependent failures
Mining Real Data: A Process
1. Export the last 30 days of user queries.
2. Filter for signals:
   - Negative feedback
   - Multiple attempts (retry pattern)
   - Support ticket creation
   - Abandoned sessions
3. Sample 50-100 candidates.
4. Classify into categories:
   - Typical (should work)
   - Edge (unusual but valid)
   - Error (should fail gracefully)
5. Select 20 diverse cases.
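Steps 1-3 can be scripted. The sketch below is illustrative only: it assumes the export is JSON Lines, and the field names (`feedback`, `retry_count`, `ticket_id`, `abandoned`, `user_message`) stand in for whatever your logging schema actually records.

```python
import json
import random

def mine_candidates(log_path: str, sample_size: int = 100) -> list[dict]:
    """Steps 1-3: filter exported queries for failure signals, then sample candidates."""
    candidates = []
    with open(log_path) as f:
        for line in f:  # one JSON object per exported query
            entry = json.loads(line)
            has_signal = (
                entry.get("feedback") == "negative"
                or entry.get("retry_count", 0) > 1      # retry pattern
                or entry.get("ticket_id") is not None   # led to a support ticket
                or entry.get("abandoned", False)         # user gave up mid-session
            )
            if has_signal:
                candidates.append(entry)
    random.seed(0)  # reproducible sampling
    return random.sample(candidates, min(sample_size, len(candidates)))

# Steps 4-5 (classifying and selecting 20 diverse cases) stay manual.
for candidate in mine_candidates("queries_last_30_days.jsonl"):
    print(candidate.get("user_message"))
```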
Eval Case Structure
Every eval case needs three components:
{
  "id": "unique-identifier",
  "category": "typical | edge | error",
  "input": {
    "user_message": "The actual user input",
    "context": {
      "prior_messages": [],
      "user_state": {}
    }
  },
  "expected_behavior": {
    "should_succeed": true,
    "criteria": [
      "criterion_1",
      "criterion_2"
    ],
    "expected_output_pattern": "optional regex or template"
  },
  "rationale": "Why this case matters and what it tests"
}
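A small loader can enforce this structure before any eval run. The sketch below uses Python dataclasses; the file layout and validation rules are assumptions for illustration, not a prescribed format.

```python
import json
from dataclasses import dataclass

VALID_CATEGORIES = {"typical", "edge", "error"}

@dataclass
class EvalCase:
    id: str
    category: str
    input: dict              # user_message plus optional context
    expected_behavior: dict  # should_succeed, criteria, expected_output_pattern
    rationale: str

    def __post_init__(self) -> None:
        # Fail fast on malformed cases so problems surface before an eval run.
        if self.category not in VALID_CATEGORIES:
            raise ValueError(f"{self.id}: unknown category {self.category!r}")
        if not self.expected_behavior.get("criteria"):
            raise ValueError(f"{self.id}: at least one success criterion is required")
        if not self.rationale.strip():
            raise ValueError(f"{self.id}: rationale must not be empty")

def load_cases(path: str = "cases.json") -> list[EvalCase]:
    """Load a JSON array of cases shaped like the structure above."""
    with open(path) as f:
        return [EvalCase(**raw) for raw in json.load(f)]
```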
Example: Complete Eval Case
{
  "id": "typical-003",
  "category": "typical",
  "input": {
    "user_message": "remind me to call mom tomorrow at 3pm",
    "context": {
      "prior_messages": [],
      "user_state": {
        "timezone": "America/New_York"
      }
    }
  },
  "expected_behavior": {
    "should_succeed": true,
    "criteria": [
      "Creates a task (not a calendar event)",
      "Title contains 'call mom' or equivalent",
      "Due date is tomorrow",
      "Time is 3pm in user's timezone",
      "Reminder is set (not just due date)"
    ],
    "expected_output_pattern": null
  },
  "rationale": "Tests natural language time parsing and task vs reminder distinction - common user pattern from production logs"
}
The Rationale Field
The rationale field is often skipped but essential. It captures:
- Why this case was included
- What specific behavior it tests
- Where it came from (production data reference)
Without a rationale, six months later you cannot remember why a case exists or whether it still matters. With one, your eval dataset is self-documenting.
When to Grow Your Dataset
Start with 20 cases. But when do you need more?
The signal that you need more cases is a production failure, which kicks off this loop:
1. You observe a production failure that NO eval case covers.
2. You add eval cases that capture this failure pattern.
3. You verify the new cases fail (confirming the bug).
4. You fix the agent.
5. The new cases now pass (confirming the fix).
This is the eval-driven development loop. Your dataset grows organically as you discover gaps.
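In practice, each new case behaves like a regression test: it fails before the fix (step 3) and passes after (step 5). A minimal pytest sketch, where `my_agent.run_agent` and `grading.grade` are hypothetical stand-ins for your own agent entry point and the grading helper sketched in the error-case section:

```python
import json
import pytest

from my_agent import run_agent  # hypothetical entry point for your agent
from grading import grade       # hypothetical module holding the grade() helper

def load_cases(path: str = "cases.json") -> list[dict]:
    with open(path) as f:
        return json.load(f)

@pytest.mark.parametrize("case", load_cases(), ids=lambda c: c["id"])
def test_eval_case(case: dict) -> None:
    # A case added for a production failure should fail here first (confirming
    # the bug) and pass after the agent is fixed (confirming the fix).
    result = run_agent(case["input"]["user_message"], context=case["input"]["context"])
    assert grade(case, result), f"{case['id']}: {case['rationale']}"
```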
Do NOT add cases because:
- "We should have more coverage" (vague, not actionable)
- "Competitive products have 10,000 cases" (irrelevant to YOUR agent)
- "We haven't updated the dataset in a while" (age is not a problem)
DO add cases when:
- Production failure reveals an uncovered pattern
- New feature requires new test scenarios
- Error analysis shows a category is under-represented
- User feedback identifies a failure mode
Exercise: Design a 20-Case Dataset for Task API
Your Task API agent helps users manage tasks through natural language. Design a complete eval dataset.
Step 1: List typical cases (10)
Think about the core user journeys:
- Creating tasks
- Viewing tasks
- Updating tasks
- Querying tasks
For each, write a realistic user input (not synthetic "Create a task" commands).
Step 2: List edge cases (5)
Think about unusual but valid scenarios:
- Ambiguous time references
- Special characters in titles
- Multiple actions in one request
- Missing information that requires inference
Step 3: List error cases (5)
Think about what should fail gracefully:
- Requests outside task management domain
- Impossible requests (past dates, contradictory requirements)
- Missing required information with no reasonable default
- Potentially harmful requests
Step 4: Add structure
For each case, fill in:
- Input (user message + any context)
- Expected behavior (criteria for success/graceful failure)
- Rationale (why this case matters)
Sample answer structure:
| ID | Category | Input | Key Criteria | Rationale |
|---|---|---|---|---|
| T-01 | Typical | "remind me to pick up dry cleaning tomorrow" | Creates task, tomorrow due date, reminder set | Common natural language pattern |
| T-02 | Typical | "what do I have due this week" | Lists tasks, filters by date range | Basic query pattern |
| E-01 | Edge | "add groceries and also schedule dentist for next tuesday" | Creates TWO tasks correctly | Multi-action request |
| E-02 | Edge | "task for 'meeting with @john re: $$$ deal'" | Handles special chars in title | Real user data often has special chars |
| R-01 | Error | "what's the weather" | Politely declines, stays in domain | Out of scope, should not attempt |
| R-02 | Error | "delete all my tasks forever" | Does not delete (no capability) | Tests capability boundaries |
Common Mistakes to Avoid
Mistake 1: All synthetic data
Your imagination of user behavior differs from reality. Mine production data.
Mistake 2: Only happy path
A dataset of only typical cases tells you nothing about robustness. Include edge and error cases.
Mistake 3: No rationale
Cases without rationale become mysterious artifacts. Document WHY each case exists.
Mistake 4: Waiting for completeness
You will never have "complete" coverage. Start with 20 cases. Add more when production failures reveal gaps.
Mistake 5: Ignoring error cases
How your agent fails matters as much as how it succeeds. Graceful failure is a feature.
Try With AI
Prompt 1: Extract Cases from User Feedback
I'm building eval cases for my [describe agent] from real user feedback.
Here are some recent complaints from users:
[Paste 5-10 real user complaints or negative feedback]
Help me turn each complaint into a structured eval case with:
- Input (what the user likely said)
- Expected behavior (what should have happened)
- Category (typical, edge, or error)
- Rationale (what this case tests)
What you're learning: Transforming unstructured user feedback into actionable evaluation cases. Real complaints reveal the gaps between your mental model and user reality.
Prompt 2: Generate Edge Cases from Typical Cases
I have these typical eval cases for my Task API agent:
[List your 10 typical cases]
For each typical case, suggest ONE edge case variant that tests the same
core functionality but with added complexity or ambiguity. Explain what
additional behavior the edge case tests.
What you're learning: Systematic edge case generation. Starting from solid typical cases and adding complexity is more effective than imagining edge cases from scratch.
Prompt 3: Audit Your Dataset Balance
Here's my current eval dataset:
[Paste your 20 cases]
Audit this dataset for:
1. Category balance: Do I have appropriate typical/edge/error distribution?
2. Feature coverage: What agent capabilities are over/under-represented?
3. Diversity: Are cases too similar to each other?
4. Real vs synthetic: Which cases seem synthetic vs derived from real data?
Suggest 3 specific improvements to make this dataset more useful.
What you're learning: Critical evaluation of your own evaluation system. The meta-skill of auditing eval quality prevents the common trap of low-quality evals producing misleading confidence.
Safety Note
When extracting eval cases from production data, ensure you handle user privacy appropriately. Remove personally identifiable information (PII) from user queries before including them in eval datasets. Never include actual user names, emails, or sensitive content in test cases. Anonymize and generalize while preserving the linguistic patterns that make real data valuable.
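A first-pass redaction script can scrub obvious identifiers before a human review. Regex patterns alone are not a privacy guarantee, so treat this as an illustrative sketch of the idea, not a complete anonymization pipeline:

```python
import re

# First-pass patterns; anonymization still needs a human review afterwards.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
]

def redact(text: str) -> str:
    """Replace obvious PII with placeholders while keeping the phrasing intact."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text

print(redact("can u remind me 2 email jane.doe@example.com tmrw"))
# -> "can u remind me 2 email <EMAIL> tmrw"
```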
Reflect on Your Skill
After completing this lesson, update your agent-evals skill with dataset design patterns:
Add to your skill's Dataset Design section:
### Dataset Design
**Quality Over Quantity**: Start with 10-20 cases, not 1000.
**Three-Category Framework**:
| Category | Count | Purpose |
|----------|-------|---------|
| Typical | 10 | Common use cases - should have 90%+ pass rate |
| Edge | 5 | Unusual but valid - acceptable 70-80% pass rate |
| Error | 5 | Should fail gracefully - tests boundaries |
**Real Data Over Synthetic**: Mine production logs, support tickets,
user feedback for authentic test cases.
**Growth Trigger**: Add cases when production failures reveal gaps,
not based on arbitrary coverage targets.
**Case Structure**: Every case needs input, expected_behavior, rationale.
Key insight to encode: The purpose of evaluation is improvement, not scoring. Twenty cases that reveal WHY things fail enable faster improvement than a thousand cases that only report WHAT percentage passes.