Regression Testing
You fine-tune a new model version with additional training data. Accuracy goes from 87% to 89%. Is that a real improvement, or just noise? Conversely, if accuracy drops from 87% to 85%, is that a true regression or just measurement variance?
Regression testing provides the statistical framework to answer these questions confidently. This lesson teaches you to build robust model comparison pipelines that catch real degradation while avoiding false alarms.
The Regression Testing Problem
When you update a model, three outcomes are possible:
| Outcome | What Happened | Decision |
|---|---|---|
| Improvement | New model genuinely better | Deploy new model |
| Equivalent | Difference is noise | Keep old model (simplicity) |
| Regression | New model genuinely worse | Block deployment |
The challenge: How do you distinguish real changes from random variance?
Why Raw Scores Mislead
Consider this scenario:
Model A accuracy: 87.2% (on 500 test examples)
Model B accuracy: 85.8% (on 500 test examples)
Is Model B a regression? Not necessarily. With 500 examples, a 1.4% difference could easily be noise.
import numpy as np
from scipy import stats
def simulate_variance(true_accuracy=0.87, n_samples=500, n_trials=1000):
"""Simulate measurement variance from sampling."""
measured_accuracies = []
for _ in range(n_trials):
# Simulate n_samples coin flips with true_accuracy probability
correct = np.random.binomial(n_samples, true_accuracy)
measured = correct / n_samples
measured_accuracies.append(measured)
return {
'mean': np.mean(measured_accuracies),
'std': np.std(measured_accuracies),
'95_range': (np.percentile(measured_accuracies, 2.5),
np.percentile(measured_accuracies, 97.5))
}
variance = simulate_variance()
Output:
{
'mean': 0.8701,
'std': 0.015,
'95_range': (0.842, 0.898)
}
With 500 samples, a model with 87% true accuracy could measure anywhere from 84% to 90% due to sampling variance alone.
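You can reach the same conclusion analytically, without simulation. A minimal sketch using the normal approximation to the binomial (the helper name is illustrative):
import numpy as np
from scipy.stats import norm
def sampling_range(accuracy: float, n_samples: int, confidence: float = 0.95) -> tuple:
    """Approximate sampling range for a measured accuracy at the given confidence level."""
    z = norm.ppf(0.5 + confidence / 2)                          # 1.96 for 95% confidence
    std_error = np.sqrt(accuracy * (1 - accuracy) / n_samples)  # binomial standard error
    return (accuracy - z * std_error, accuracy + z * std_error)
print(sampling_range(0.87, 500))   # roughly (0.841, 0.899), matching the simulation above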
Statistical Significance Testing
The Chi-Square Test for Accuracy Comparison
When comparing two models on the same test set:
from scipy.stats import chi2_contingency
def compare_models_statistical(model_a_correct: int, model_a_total: int,
model_b_correct: int, model_b_total: int) -> dict:
"""Compare two models with statistical significance testing."""
acc_a = model_a_correct / model_a_total
acc_b = model_b_correct / model_b_total
# Contingency table for chi-square test
# [[A correct, A incorrect], [B correct, B incorrect]]
table = [
[model_a_correct, model_a_total - model_a_correct],
[model_b_correct, model_b_total - model_b_correct]
]
chi2, p_value, dof, expected = chi2_contingency(table)
return {
'accuracy_a': acc_a,
'accuracy_b': acc_b,
'difference': acc_b - acc_a,
'p_value': p_value,
'significant_at_05': p_value < 0.05,
'interpretation': interpret_difference(acc_a, acc_b, p_value)
}
def interpret_difference(acc_a, acc_b, p_value, threshold=0.05):
"""Interpret comparison result."""
diff = acc_b - acc_a
if p_value >= threshold:
return "NO_SIGNIFICANT_DIFFERENCE"
elif diff > 0:
return "SIGNIFICANT_IMPROVEMENT"
else:
return "SIGNIFICANT_REGRESSION"
# Example comparison
result = compare_models_statistical(
model_a_correct=436, model_a_total=500, # 87.2%
model_b_correct=429, model_b_total=500 # 85.8%
)
Output:
{
"accuracy_a": 0.872,
"accuracy_b": 0.858,
"difference": -0.014,
"p_value": 0.527,
"significant_at_05": false,
"interpretation": "NO_SIGNIFICANT_DIFFERENCE"
}
The 1.4% difference is not statistically significant. Model B might not be worse; the difference could be noise.
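A natural follow-up question: how large would the test set need to be before a 1.4% gap could reach significance? A rough sketch of the standard two-proportion sample-size formula, assuming a 5% significance level and 80% power (the function name is illustrative):
import numpy as np
from scipy.stats import norm
def required_sample_size(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Samples per model needed to detect the difference p1 - p2 with a two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * np.sqrt(2 * p_bar * (1 - p_bar)) +
                 z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return int(np.ceil(numerator / (p1 - p2) ** 2))
print(required_sample_size(0.872, 0.858))   # on the order of 10,000 examples per model
Confirming small real differences takes far larger test sets than most teams maintain, which is why the threshold design later in this lesson matters as much as the test itself.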
McNemar's Test for Paired Comparisons
When both models evaluate the same examples, use McNemar's test for more statistical power:
from scipy.stats import binom
def mcnemar_test(model_a_predictions: list, model_b_predictions: list,
ground_truth: list) -> dict:
"""McNemar's test for paired model comparison."""
# Count discordant pairs
b = 0 # A correct, B incorrect
c = 0 # A incorrect, B correct
for a_pred, b_pred, truth in zip(model_a_predictions, model_b_predictions, ground_truth):
a_correct = (a_pred == truth)
b_correct = (b_pred == truth)
if a_correct and not b_correct:
b += 1
elif not a_correct and b_correct:
c += 1
# McNemar's test (exact binomial)
n = b + c
if n == 0:
return {'p_value': 1.0, 'interpretation': 'IDENTICAL_PREDICTIONS'}
    # Two-tailed exact test: double the smaller tail probability, capped at 1.0
    p_value = min(1.0, 2 * binom.cdf(min(b, c), n, 0.5))
return {
'a_wins': b,
'b_wins': c,
'tied': len(ground_truth) - n,
'p_value': p_value,
'significant_at_05': p_value < 0.05,
'winner': 'A' if b > c else 'B' if c > b else 'TIE'
}
# Example (model_a_preds, model_b_preds, and ground_truth are label lists
# collected by running both models on the same benchmark)
result = mcnemar_test(model_a_preds, model_b_preds, ground_truth)
Output:
{
"a_wins": 45,
"b_wins": 32,
"tied": 423,
"p_value": 0.142,
"significant_at_05": false,
"winner": "A"
}
Even though Model A "wins" on more examples, the difference is not statistically significant.
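If statsmodels is available, you can cross-check the hand-rolled version against its McNemar implementation. A sketch using the counts above; the concordant cells are placeholders, since only the discordant cells affect the result:
from statsmodels.stats.contingency_tables import mcnemar
# Paired 2x2 table: rows = Model A correct/incorrect, columns = Model B correct/incorrect.
# Only the discordant cells (45 and 32) drive the test; the concordant counts below are
# placeholders standing in for the 423 tied examples.
table = [[400, 45],
         [32, 23]]
result = mcnemar(table, exact=True)
print(result.pvalue)   # matches the exact binomial p-value computed above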
Designing Regression Thresholds
The Threshold Trade-off
| Strict Threshold | Lenient Threshold |
|---|---|
| Catches almost every real regression | Misses some real regressions |
| More false alarms (blocks good updates) | Fewer false alarms |
| Slows iteration | Lets some bad updates through |
Recommended Thresholds
REGRESSION_THRESHOLDS = {
"safety_critical": {
"min_improvement": 0.0, # Any regression is bad
"significance_level": 0.01, # Very strict p-value
"sample_size_min": 1000,
"description": "Healthcare, finance, autonomous systems"
},
"production_standard": {
"min_improvement": -0.02, # Allow 2% regression
"significance_level": 0.05,
"sample_size_min": 500,
"description": "Most production applications"
},
"rapid_iteration": {
"min_improvement": -0.05, # Allow 5% regression
"significance_level": 0.10,
"sample_size_min": 200,
"description": "Early development, experimentation"
}
}
Implementing Regression Gates
class RegressionGate:
"""Gate that blocks deployment on significant regression."""
def __init__(self, threshold_profile: str = "production_standard"):
self.config = REGRESSION_THRESHOLDS[threshold_profile]
def evaluate(self, baseline_results: dict, new_results: dict) -> dict:
"""Evaluate if new model passes regression gate."""
# Statistical comparison
comparison = compare_models_statistical(
baseline_results['correct'], baseline_results['total'],
new_results['correct'], new_results['total']
)
# Apply thresholds
passes_significance = (
comparison['p_value'] >= self.config['significance_level'] or
comparison['difference'] >= 0
)
passes_minimum = comparison['difference'] >= self.config['min_improvement']
passes_sample_size = (
baseline_results['total'] >= self.config['sample_size_min'] and
new_results['total'] >= self.config['sample_size_min']
)
gate_passed = passes_significance and passes_minimum and passes_sample_size
return {
'gate_passed': gate_passed,
'comparison': comparison,
'checks': {
'significance': passes_significance,
'minimum_improvement': passes_minimum,
'sample_size': passes_sample_size
},
'recommendation': self._recommend(gate_passed, comparison)
}
def _recommend(self, passed: bool, comparison: dict) -> str:
if passed:
if comparison['difference'] > 0.02:
return "DEPLOY: Significant improvement"
else:
return "DEPLOY: No regression detected"
else:
            if comparison['p_value'] < self.config['significance_level']:
return "BLOCK: Statistically significant regression"
else:
return "INVESTIGATE: Possible regression, increase sample size"
# Usage (baseline_results and new_model_results are dicts of the form {'correct': int, 'total': int})
gate = RegressionGate("production_standard")
result = gate.evaluate(baseline_results, new_model_results)
Output:
{
"gate_passed": true,
"comparison": {
"accuracy_a": 0.872,
"accuracy_b": 0.858,
"difference": -0.014,
"p_value": 0.527
},
"checks": {
"significance": true,
"minimum_improvement": true,
"sample_size": true
},
"recommendation": "DEPLOY: No regression detected"
}
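To see the blocking path, hand the gate a clearly larger drop. A sketch with illustrative counts:
# An 88.0% -> 80.0% drop on 500 examples fails both the significance and
# minimum-improvement checks under the production_standard profile.
gate = RegressionGate("production_standard")
result = gate.evaluate(
    {'correct': 440, 'total': 500},   # baseline: 88.0%
    {'correct': 400, 'total': 500}    # new model: 80.0%
)
print(result['gate_passed'])       # False
print(result['recommendation'])    # BLOCK: Statistically significant regression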
Building Automated Regression Pipelines
The Regression Testing Workflow
┌─────────────────┐
│ New Model Push │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Load Baseline │
│ Model & Results │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Run Benchmark │
│ on New Model │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Statistical │
│ Comparison │
└────────┬────────┘
│
┌────┴────┐
│ │
▼ ▼
┌───────┐ ┌────────┐
│ PASS │ │ FAIL │
│Deploy │ │Block │
└───────┘ └────────┘
CI/CD Integration
# regression_test.py - Run as part of CI/CD pipeline
import json
import sys
from pathlib import Path
def run_regression_test(new_model_path: str, benchmark_path: str,
baseline_results_path: str) -> int:
"""Run regression test, return exit code."""
# Load baseline
baseline = json.loads(Path(baseline_results_path).read_text())
# Run benchmark on new model
new_results = run_benchmark(new_model_path, benchmark_path)
# Evaluate regression gate
gate = RegressionGate("production_standard")
result = gate.evaluate(baseline, new_results)
# Output results
print(f"Baseline: {baseline['correct']}/{baseline['total']} = {baseline['correct']/baseline['total']:.1%}")
print(f"New: {new_results['correct']}/{new_results['total']} = {new_results['correct']/new_results['total']:.1%}")
print(f"Difference: {result['comparison']['difference']:+.1%}")
print(f"P-value: {result['comparison']['p_value']:.4f}")
print(f"Gate: {'PASSED' if result['gate_passed'] else 'FAILED'}")
print(f"Recommendation: {result['recommendation']}")
# Save results for next baseline (if passed)
if result['gate_passed']:
Path('new_baseline.json').write_text(json.dumps(new_results))
return 0 if result['gate_passed'] else 1
if __name__ == "__main__":
exit_code = run_regression_test(
sys.argv[1], # new model path
sys.argv[2], # benchmark path
sys.argv[3] # baseline results path
)
sys.exit(exit_code)
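The script assumes a run_benchmark helper that evaluates the new model and returns counts in the same shape as the baseline file. A minimal placeholder sketch; call_model and the 'input'/'expected' field names are stand-ins for your own evaluation harness:
def run_benchmark(model_path: str, benchmark_path: str) -> dict:
    """Placeholder: evaluate the model on each benchmark case and count correct answers."""
    cases = json.loads(Path(benchmark_path).read_text())
    correct = 0
    for case in cases:
        prediction = call_model(model_path, case['input'])   # your inference call (hypothetical)
        if prediction == case['expected']:
            correct += 1
    return {'correct': correct, 'total': len(cases)}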
CI/CD pipeline configuration:
# .github/workflows/model-regression.yml
name: Model Regression Test
on:
push:
paths:
- 'models/**'
jobs:
regression-test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run regression test
run: |
python regression_test.py \
models/new_model \
benchmarks/task_api.json \
baselines/current.json
      - name: Update baseline on success
        if: success()
        run: |
          mv new_baseline.json baselines/current.json
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add baselines/
          git commit -m "Update baseline: ${{ github.sha }}"
          git push
Managing Baselines
Baseline Versioning Strategy
import json
from datetime import datetime
from pathlib import Path

class BaselineManager:
"""Manage model baselines for regression testing."""
def __init__(self, storage_path: str):
self.storage_path = Path(storage_path)
self.storage_path.mkdir(exist_ok=True)
def save_baseline(self, model_version: str, results: dict):
"""Save benchmark results as new baseline."""
filename = f"baseline_{model_version}_{datetime.now().isoformat()}.json"
filepath = self.storage_path / filename
baseline_data = {
'model_version': model_version,
'timestamp': datetime.now().isoformat(),
'results': results,
'benchmark_version': results.get('benchmark_version', '1.0')
}
filepath.write_text(json.dumps(baseline_data, indent=2))
# Update symlink to current baseline
current_link = self.storage_path / 'current.json'
if current_link.exists():
current_link.unlink()
current_link.symlink_to(filename)
def get_current_baseline(self) -> dict:
"""Get current baseline for comparison."""
current_link = self.storage_path / 'current.json'
if not current_link.exists():
raise ValueError("No baseline exists. Run initial evaluation first.")
return json.loads(current_link.read_text())
def get_history(self, n: int = 10) -> list[dict]:
"""Get recent baseline history."""
files = sorted(self.storage_path.glob('baseline_*.json'), reverse=True)
return [json.loads(f.read_text()) for f in files[:n]]
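A usage sketch wiring the manager into the regression gate; the path, version string, and new_results are illustrative, and the stored results dict is assumed to carry the same 'correct'/'total' counts the gate expects:
manager = BaselineManager("baselines/")
gate = RegressionGate("production_standard")
baseline = manager.get_current_baseline()
outcome = gate.evaluate(baseline['results'], new_results)
# Promote the new results to the current baseline only after the gate passes
if outcome['gate_passed']:
    manager.save_baseline(model_version="task-api-v2.4", results=new_results)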
Update Your Skill
Add to llmops-evaluator/SKILL.md:
## Regression Testing
### Statistical Significance Required
Never deploy based on raw accuracy differences alone:
- 500 samples: roughly ±3% sampling error (95% range)
- 1,000 samples: roughly ±2%
- 5,000 samples: roughly ±1%
### Threshold Profiles
| Profile | Allowed Regression | P-value | Use Case |
|---------|-------------------|---------|----------|
| Safety Critical | 0% | 0.01 | Healthcare, finance |
| Production | 2% | 0.05 | Most applications |
| Rapid Iteration | 5% | 0.10 | Early development |
### Regression Gate Checklist
Before deploying updated model:
- [ ] Run on same benchmark as baseline
- [ ] Compare with statistical test (not raw difference)
- [ ] Verify sample size is adequate (500+ for production)
- [ ] Document comparison results
- [ ] Update baseline if passed
### When to Reset Baseline
- Major architecture change (new base model)
- Benchmark update (new test cases added)
- Production issue revealed benchmark gap
Try With AI
Prompt 1: Interpret Ambiguous Results
My regression test shows:
Model A (baseline): 87.2% accuracy (500 samples)
Model B (new): 85.8% accuracy (500 samples)
P-value: 0.32
My threshold allows 2% regression.
Help me interpret:
1. Is this a real regression or noise?
2. Should I block deployment?
3. What would give me more confidence either way?
What you are learning: Statistical interpretation. Numbers do not make decisions; humans do with statistical guidance. Your AI partner helps you reason through ambiguous results.
Prompt 2: Design a Testing Strategy
I update my Task API model weekly with new training data.
Current approach: Compare each new version to the original baseline from 3 months ago.
Problems I'm seeing:
- Gradual drift accumulates undetected
- Good improvements mask small regressions
- Baseline is becoming stale
Design a better regression testing strategy that:
1. Catches gradual degradation
2. Maintains a reasonable baseline update policy
3. Tracks long-term quality trends
What you are learning: Testing strategy design. Weekly model updates require different approaches than one-time deployments. Your AI partner helps design sustainable testing practices.
Prompt 3: Debug a False Positive
My regression gate blocked a deployment with:
- Accuracy: 86.1% → 85.9%
- P-value: 0.04
- Gate result: BLOCK (significant regression)
But when I manually tested, the new model seems better at handling edge cases.
1. What might explain this discrepancy?
2. How should I investigate further?
3. Should I override the gate?
What you are learning: Debugging evaluation systems. Automated gates can be wrong. Your AI partner helps investigate whether the gate is catching a real problem or needs adjustment.
Safety Note
Statistical tests protect against random noise, not systematic failures. A model could pass regression testing while developing new failure modes not covered by your benchmark. Always combine automated regression testing with ongoing monitoring and periodic human review.