Regression Protection
You've spent three hours improving your Task API agent's routing logic. The agent now correctly routes "What should I work on?" to the priority view instead of the task list. You run a quick test, it works, and you push to production.
Two days later, users complain: the agent's output formatting is broken. Tasks appear as raw JSON instead of friendly messages. While you were improving routing, you accidentally changed how results are presented.
This is the regression trap. Improvements in one area often cause regressions in another. Without systematic checking, you only discover problems when users report them.
Regression protection solves this by running your full eval suite on every change. Before any code ships, you know whether it improved things, broke things, or left quality unchanged. The eval suite becomes your safety net—not perfect, but far better than hope.
The Regression Problem
Agent development has a hidden cost: improvements often cause regressions elsewhere.
Why this happens:
| Change Type | Common Regression |
|---|---|
| Improve routing | Break output formatting |
| Add new tool | Interfere with existing tools |
| Update prompt | Change tone or verbosity |
| Fix edge case | Break common case |
| Add guardrails | Reduce helpfulness |
Traditional code testing catches some of these—the cases where a function returns the wrong type or throws an exception. But agent regressions are often behavioral: the output is valid JSON but no longer helpful. The response is formatted correctly but addresses the wrong question. Tests pass, users complain.
The core insight: You need quality measurement that runs on every change, not just correctness checks. That's what regression protection provides.
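To make that concrete, here is a sketch of one behavioral grader for the Task API agent, written in the shape the eval code below expects: a callable that takes the agent's response and the test case and returns a dict with a "passed" flag. The specific check (rejecting replies that are just a raw JSON payload) is an illustrative assumption, not a required criterion; the name mirrors the output_format_valid criterion shown in the example output later.

import json

def output_format_valid(response: str, case: dict) -> dict:
    """Behavioral grader: fail if the agent dumps a raw JSON payload instead of a friendly message."""
    try:
        parsed = json.loads(response)
        looks_like_raw_json = isinstance(parsed, (dict, list))  # a bare payload rather than prose
    except (ValueError, TypeError):
        looks_like_raw_json = False  # not parseable as JSON, treat as a normal natural-language reply
    return {"passed": bool(response.strip()) and not looks_like_raw_json}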
Eval-on-Every-Change Workflow
The workflow is straightforward: run your eval suite before and after every change, then compare scores.
Change code --> Run eval suite --> Compare to baseline
    |
    +--> pass rate drops            --> investigate before shipping
    |
    +--> pass rate stable/improved  --> safe to deploy
The workflow in practice:
from dataclasses import dataclass
from typing import Any

@dataclass
class EvalRun:
    """Results from running eval suite."""
    passed: int
    total: int
    pass_rate: float
    by_criterion: dict[str, float]

def run_eval_suite(
    agent: Any,
    test_cases: list[dict],
    graders: list
) -> EvalRun:
    """
    Run full eval suite and return results.

    Args:
        agent: Your agent instance
        test_cases: List of test inputs with expected behaviors
        graders: List of grading functions/LLM judges

    Returns:
        EvalRun with pass counts and per-criterion breakdown
    """
    passed = 0
    total = 0
    criterion_scores = {}

    for case in test_cases:
        # Get agent response
        response = agent.run(case["input"])

        # Apply each grader
        for grader in graders:
            result = grader(response, case)
            total += 1
            if result["passed"]:
                passed += 1

            # Track per-criterion
            criterion = grader.__name__
            if criterion not in criterion_scores:
                criterion_scores[criterion] = {"passed": 0, "total": 0}
            criterion_scores[criterion]["total"] += 1
            if result["passed"]:
                criterion_scores[criterion]["passed"] += 1

    # Calculate rates
    pass_rate = passed / total if total > 0 else 0
    by_criterion = {
        k: v["passed"] / v["total"]
        for k, v in criterion_scores.items()
    }

    return EvalRun(
        passed=passed,
        total=total,
        pass_rate=pass_rate,
        by_criterion=by_criterion
    )
Output:
# Example eval run output
eval_result = run_eval_suite(my_agent, test_cases, graders)
print(f"Pass rate: {eval_result.pass_rate:.1%}")
print(f"By criterion: {eval_result.by_criterion}")
Pass rate: 85.0%
By criterion: {'routing_correct': 0.9, 'output_format_valid': 0.95, 'helpful_response': 0.7}
Before/After Comparison
Every change needs a baseline for comparison. The baseline is your eval score before making changes.
def compare_versions(
    baseline: EvalRun,
    current: EvalRun
) -> dict:
    """
    Compare current eval results against baseline.

    Returns:
        Comparison with overall delta and per-criterion changes
    """
    overall_delta = current.pass_rate - baseline.pass_rate

    criterion_deltas = {}
    all_criteria = set(baseline.by_criterion) | set(current.by_criterion)
    for criterion in all_criteria:
        before = baseline.by_criterion.get(criterion, 0)
        after = current.by_criterion.get(criterion, 0)
        criterion_deltas[criterion] = {
            "before": before,
            "after": after,
            "delta": after - before
        }

    return {
        "overall": {
            "before": baseline.pass_rate,
            "after": current.pass_rate,
            "delta": overall_delta,
            "improved": overall_delta > 0,
            "regressed": overall_delta < 0
        },
        "by_criterion": criterion_deltas,
        "regressions": [
            k for k, v in criterion_deltas.items()
            if v["delta"] < 0
        ],
        "improvements": [
            k for k, v in criterion_deltas.items()
            if v["delta"] > 0
        ]
    }
Output:
baseline = EvalRun(passed=17, total=20, pass_rate=0.85,
                   by_criterion={"routing": 0.9, "format": 0.8, "helpful": 0.85})
current = EvalRun(passed=18, total=20, pass_rate=0.90,
                  by_criterion={"routing": 0.95, "format": 0.75, "helpful": 1.0})
comparison = compare_versions(baseline, current)
print(f"Overall: {comparison['overall']['before']:.0%} --> {comparison['overall']['after']:.0%}")
print(f"Regressions: {comparison['regressions']}")
print(f"Improvements: {comparison['improvements']}")
Overall: 85% --> 90%
Regressions: ['format']
Improvements: ['routing', 'helpful']
Notice the hidden regression: overall pass rate improved (85% to 90%), but format actually got worse. The improvements in routing and helpful masked the format regression. Without per-criterion tracking, you'd ship this thinking it was a pure improvement.
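A small helper can make this failure mode explicit in code. find_masked_regressions below is our own sketch layered on compare_versions, not part of any library:

def find_masked_regressions(comparison: dict) -> list[str]:
    """Return criteria that regressed even though the overall pass rate went up."""
    if not comparison["overall"]["improved"]:
        return []  # a drop in the headline number is visible on its own; nothing is masked
    return comparison["regressions"]

# With the comparison above, this prints ['format'] despite the 85% -> 90% headline.
print(find_masked_regressions(comparison))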
Setting Pass Rate Thresholds
Not all score drops are equal. A 2% drop might be noise; a 10% drop is probably real. The threshold depends on your agent's criticality.
| Threshold | When to Use | Example Agent |
|---|---|---|
| Any drop = investigate | High-stakes agents where errors have significant consequences | Medical advice, financial decisions, safety-critical |
| 5% drop = investigate | Normal agents with moderate error tolerance | Customer support, task management, content generation |
| 10% drop = investigate | Experimental agents during rapid development | Prototypes, internal tools, non-customer-facing |
Implementing threshold checks:
@dataclass
class RegressionConfig:
    """Configuration for regression detection."""
    overall_threshold: float    # e.g., 0.05 for 5%
    criterion_threshold: float  # e.g., 0.10 for 10%
    block_on_regression: bool   # Whether to prevent deployment

def check_regression(
    comparison: dict,
    config: RegressionConfig
) -> dict:
    """
    Check if changes pass regression thresholds.

    Returns:
        Dict with passed/blocked status and reasons
    """
    issues = []

    # Check overall regression
    if comparison["overall"]["delta"] < -config.overall_threshold:
        issues.append(
            f"Overall pass rate dropped {abs(comparison['overall']['delta']):.1%} "
            f"(threshold: {config.overall_threshold:.1%})"
        )

    # Check per-criterion regressions
    for criterion, data in comparison["by_criterion"].items():
        if data["delta"] < -config.criterion_threshold:
            issues.append(
                f"'{criterion}' dropped {abs(data['delta']):.1%} "
                f"(threshold: {config.criterion_threshold:.1%})"
            )

    passed = len(issues) == 0
    should_block = not passed and config.block_on_regression

    return {
        "passed": passed,
        "should_block": should_block,
        "issues": issues,
        "recommendation": "BLOCK" if should_block else "WARN" if issues else "SHIP"
    }
Output:
# High-stakes configuration
config = RegressionConfig(
    overall_threshold=0.0,     # Any overall drop
    criterion_threshold=0.05,  # 5% on any criterion
    block_on_regression=True
)

result = check_regression(comparison, config)
print(f"Status: {result['recommendation']}")
print(f"Issues: {result['issues']}")
Status: BLOCK
Issues: ["'format' dropped 5.0% (threshold: 5.0%)"]
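For block_on_regression to actually block anything, the check has to run somewhere that can stop a deploy. One minimal way, sketched below under the assumption that your eval script runs as a CI step before deployment, is to exit non-zero when the check says BLOCK:

import sys

def regression_gate(comparison: dict, config: RegressionConfig) -> None:
    """Fail the process (and therefore the CI step) when check_regression says to block."""
    result = check_regression(comparison, config)
    for issue in result["issues"]:
        print(f"regression check: {issue}")
    print(f"regression check: {result['recommendation']}")
    if result["should_block"]:
        sys.exit(1)  # non-zero exit fails the pipeline step, so the deploy never runs

# called at the end of the eval script that the CI job runs:
# regression_gate(comparison, config)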
The Eval-Driven Iteration Loop
Regression protection isn't just about catching problems—it's about systematic improvement. The loop goes: eval, identify weakest area, fix, re-eval, verify.
prompt v1 --> eval 70% --> error analysis --> fix routing --> eval 85%
    --> error analysis --> fix output format --> eval 92%
    --> ship
Implementing the loop:
def identify_improvement_target(eval_result: EvalRun) -> str:
    """Find the criterion with lowest pass rate."""
    if not eval_result.by_criterion:
        return "overall"
    return min(
        eval_result.by_criterion.items(),
        key=lambda x: x[1]
    )[0]

def iteration_step(
    agent: Any,
    test_cases: list[dict],
    graders: list,
    baseline: EvalRun
) -> dict:
    """
    Run one iteration of the eval-driven improvement loop.

    Returns:
        Current status, improvement target, and comparison to baseline
    """
    current = run_eval_suite(agent, test_cases, graders)
    comparison = compare_versions(baseline, current)
    target = identify_improvement_target(current)

    return {
        "current_pass_rate": current.pass_rate,
        "improvement_target": target,
        "target_score": current.by_criterion.get(target, current.pass_rate),
        "comparison": comparison,
        "recommendation": (
            "SHIP" if current.pass_rate >= 0.90
            else f"FIX '{target}' ({current.by_criterion.get(target, current.pass_rate):.0%})"
        )
    }
Output:
# Running iteration loop
status = iteration_step(agent, test_cases, graders, baseline)
print(f"Pass rate: {status['current_pass_rate']:.0%}")
print(f"Recommendation: {status['recommendation']}")
Pass rate: 70%
Recommendation: FIX 'helpful_response' (60%)
The iteration log tracks your progress (a minimal logging sketch follows the table):
| Iteration | Pass Rate | Action Taken | Target Fixed |
|---|---|---|---|
| 1 | 70% | Initial eval | - |
| 2 | 78% | Fixed routing prompt | routing_correct: 65% -> 90% |
| 3 | 85% | Fixed output formatting | output_format: 70% -> 95% |
| 4 | 92% | Added context to helpful check | helpful: 80% -> 91% |
| 5 | 92% | Ready to ship | - |
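One minimal way to keep such a log, assuming a local JSON file is acceptable (the file name and fields here are our choices, not a required format):

import json
from pathlib import Path

def log_iteration(path: str, eval_result: EvalRun, action: str) -> None:
    """Append one row (pass rate, per-criterion scores, what you changed) to a JSON log."""
    log_file = Path(path)
    history = json.loads(log_file.read_text()) if log_file.exists() else []
    history.append({
        "iteration": len(history) + 1,
        "pass_rate": eval_result.pass_rate,
        "by_criterion": eval_result.by_criterion,
        "action": action,
    })
    log_file.write_text(json.dumps(history, indent=2))

# e.g. after the second round:
# log_iteration("eval_log.json", current, "Fixed routing prompt")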
Exercise: Implement Regression Protection Workflow
Your Task API agent currently scores 80% on your eval suite. You're about to add a new feature: priority filtering. Implement regression protection to ensure this change doesn't break existing functionality.
Step 1: Establish baseline
# Record current state before any changes
baseline = run_eval_suite(task_agent, test_cases, graders)
print(f"Baseline: {baseline.pass_rate:.0%}")
print(f"By criterion: {baseline.by_criterion}")
Step 2: Make your change and re-evaluate
# After adding priority filtering feature
task_agent.add_feature("priority_filtering")
current = run_eval_suite(task_agent, test_cases, graders)
Step 3: Compare and decide
comparison = compare_versions(baseline, current)

# Using 5% threshold for this normal-criticality agent
config = RegressionConfig(
    overall_threshold=0.05,
    criterion_threshold=0.10,
    block_on_regression=True
)

result = check_regression(comparison, config)
print(f"Decision: {result['recommendation']}")

if result['issues']:
    print("Issues to fix:")
    for issue in result['issues']:
        print(f"  - {issue}")
Step 4: Iterate if needed
If regression detected:
- Identify which criterion regressed
- Analyze why the change caused this
- Fix while preserving the improvement
- Re-eval until all criteria pass thresholds (a minimal loop sketch follows this list)
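Here is one way to wire those four steps into a loop, assuming the helpers defined earlier in this section; the analyze-and-fix step stays manual, so the sketch simply pauses for it:

def regression_fix_loop(agent, test_cases, graders, baseline, config, max_rounds: int = 5):
    """Re-run the suite after each manual fix until no regression remains against baseline."""
    for round_num in range(1, max_rounds + 1):
        current = run_eval_suite(agent, test_cases, graders)
        comparison = compare_versions(baseline, current)
        result = check_regression(comparison, config)
        if result["passed"]:
            print(f"Round {round_num}: clean against baseline, safe to ship")
            return current
        target = identify_improvement_target(current)
        print(f"Round {round_num}: {result['issues']} (lowest criterion: '{target}')")
        input("Apply your fix, then press Enter to re-evaluate... ")  # manual analysis and fix happen here
    raise RuntimeError("Still regressing after several fixes; reconsider the change itself")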
Your task: Choose appropriate thresholds for three scenarios:
- Medical symptom checker: Users rely on it for health decisions
- Internal code review assistant: Helps developers but not customer-facing
- Experimental content generator: Rapid prototyping, not deployed
What thresholds would you set for each? Why?
Reflect on Your Skill
After implementing regression protection, consider adding these patterns to your agent-evals skill:
Pattern: Baseline before change
Before modifying any agent code:
1. Run full eval suite
2. Save results as baseline
3. Record per-criterion scores
4. Make changes only after the baseline is captured (a persistence sketch follows)
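A minimal way to capture that baseline durably, assuming a JSON file next to your test cases is acceptable (baseline.json is an arbitrary name):

import json
from dataclasses import asdict

def save_baseline(eval_result: EvalRun, path: str = "baseline.json") -> None:
    """Persist pre-change eval results so later runs compare against the same numbers."""
    with open(path, "w") as f:
        json.dump(asdict(eval_result), f, indent=2)

def load_baseline(path: str = "baseline.json") -> EvalRun:
    """Reload a saved baseline as an EvalRun for compare_versions."""
    with open(path) as f:
        return EvalRun(**json.load(f))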
Pattern: Hidden regression detection
When overall score improves:
1. Still check per-criterion deltas
2. Improvements can mask regressions
3. A +5% overall with -10% on one criterion is still concerning
4. Use per-criterion thresholds, not just overall
Pattern: Threshold selection
Choose threshold based on agent criticality:
- High-stakes (medical, financial, safety): Any drop = block
- Normal (support, productivity): 5% drop = investigate
- Experimental (prototypes, internal): 10% drop = investigate
Set block_on_regression=True only for production-critical agents (a preset sketch follows)
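One way to encode these presets, reusing RegressionConfig from earlier; the exact numbers for the experimental tier are our assumption, so adjust them to your tolerance:

def config_for_criticality(level: str) -> RegressionConfig:
    """Map agent criticality to the regression thresholds suggested above."""
    presets = {
        "high-stakes": RegressionConfig(overall_threshold=0.0, criterion_threshold=0.05,
                                        block_on_regression=True),
        "normal": RegressionConfig(overall_threshold=0.05, criterion_threshold=0.10,
                                   block_on_regression=True),
        "experimental": RegressionConfig(overall_threshold=0.10, criterion_threshold=0.15,
                                         block_on_regression=False),
    }
    return presets[level]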
Pattern: Iteration loop discipline
When pass rate is below target:
1. Identify lowest-scoring criterion
2. Focus effort there (highest leverage)
3. Fix one thing at a time
4. Re-eval after each fix
5. Verify no regressions before next fix
Key insight to encode: Regression protection isn't about preventing all changes—it's about knowing what your changes do. A 3% regression might be acceptable for a major feature. A 3% regression for a minor tweak is suspicious. The eval suite gives you information; you make the decision.
Try With AI
Prompt 1: Design Regression Protection for Your Agent
I have an agent that [describe your agent]. I want to set up regression
protection before deploying changes.
Help me design:
1. What criteria should I track? (List 5-7 specific binary criteria)
2. What thresholds make sense for this agent's criticality?
3. What's my baseline workflow before making changes?
4. How should I structure my iteration loop?
Consider: [describe the stakes - who uses this, what happens if it fails]
What you're learning: Regression protection is domain-specific. The criteria and thresholds for a medical agent differ from those for a task manager. AI helps you translate your context into a concrete workflow.
Prompt 2: Debug a Masked Regression
My overall eval score improved from 78% to 82%, so I shipped. Now users report
issues with [describe the problem].
Looking at my per-criterion breakdown:
Before: {routing: 85%, format: 80%, helpful: 70%}
After: {routing: 95%, format: 65%, helpful: 85%}
Help me understand:
1. Why did the overall score hide this regression?
2. What threshold configuration would have caught this?
3. How should I fix this now that it's in production?
4. What should I add to my eval suite to prevent this pattern?
What you're learning: Masked regressions are common. Understanding how they happen helps you design better detection. AI helps you analyze the math and design preventive measures.
Prompt 3: Build Iteration Loop for Target Pass Rate
My agent currently scores 72% on my eval suite. I need to reach 90% before
launch in 2 weeks.
Per-criterion scores:
- routing_correct: 85%
- output_format: 70%
- helpful_response: 60%
- tone_appropriate: 75%
Help me:
1. Prioritize which criteria to fix first (highest leverage)
2. Estimate effort for each improvement
3. Design a 2-week iteration plan
4. Set checkpoints to ensure I'm on track
What's realistic? Where should I focus first?
What you're learning: The iteration loop requires prioritization. Not all criteria are equally easy to improve, and some improvements yield bigger gains. AI helps you plan the path from current state to target.
Safety Note
Regression protection catches changes you make, not changes in the world. If user behavior shifts, external APIs change, or new edge cases emerge, your baseline becomes stale. Periodically update your test cases with fresh examples from production. The goal is a living eval suite, not a static gate.
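One lightweight way to keep the suite living, assuming you can export recent production inputs as dicts with at least an "input" field (how you sample and label them depends on your logging setup):

import random

def refresh_test_cases(test_cases: list[dict],
                       production_examples: list[dict],
                       sample_size: int = 10) -> list[dict]:
    """Fold a small random sample of new production inputs into the eval suite."""
    existing_inputs = {case["input"] for case in test_cases}
    fresh = [ex for ex in production_examples if ex["input"] not in existing_inputs]
    sampled = random.sample(fresh, min(sample_size, len(fresh)))
    # new cases still need expected behaviors / grading labels before they are useful
    return test_cases + sampled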