Quality Gates
Evaluation produces numbers. Quality gates turn those numbers into deployment decisions. A model with 88% accuracy, 3% harmful response rate, and 92% format compliance: Should it ship?
Quality gates define the boundaries between "deploy" and "block." This lesson teaches you to design gates that protect production quality while enabling rapid iteration.
The Gate Hierarchy
Not all quality dimensions are equal. Some failures are blocking, others are warnings.
┌─────────────────────────────────────────────────────────────┐
│ BLOCKING GATES                                              │
│ Failure = Deployment STOPS                                  │
│   • Safety: harmful_rate > 5%                               │
│   • Format: json_valid_rate < 95%                           │
│   • Regression: accuracy drop > 5%                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              │ PASS
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ WARNING GATES                                               │
│ Failure = Flag for review, deployment continues             │
│   • Accuracy: < 90% (target, not blocker)                   │
│   • Latency: p99 > 2s (performance concern)                 │
│   • Cost: per-request > $0.01                               │
└─────────────────────────────────────────────────────────────┘
                              │
                              │ PASS (with warnings)
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ DEPLOYMENT                                                  │
└─────────────────────────────────────────────────────────────┘
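Read the diagram as an ordered decision: blocking gates are checked first, warnings second, and informational metrics are only recorded. A minimal sketch of that precedence (the full evaluator later in this lesson fleshes it out):
def deployment_decision(blocking_failures: list, warnings: list) -> str:
    """Collapse gate outcomes into the three states shown above."""
    if blocking_failures:
        return "BLOCKED"                 # any blocking failure stops the deploy
    if warnings:
        return "DEPLOY_WITH_WARNINGS"    # ship, but flag for human review
    return "DEPLOY"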
Designing Multi-Dimensional Gates
Gate Configuration Schema
from dataclasses import dataclass
from enum import Enum
from typing import Optional
class GateType(Enum):
    BLOCKING = "blocking"       # Failure stops deployment
    WARNING = "warning"         # Failure flags but continues
    INFORMATIONAL = "info"      # Logged only

class Comparison(Enum):
    GREATER_THAN = ">"
    GREATER_EQUAL = ">="
    LESS_THAN = "<"
    LESS_EQUAL = "<="
    EQUALS = "=="

@dataclass
class QualityGate:
    name: str
    metric: str
    threshold: float
    comparison: Comparison
    gate_type: GateType
    description: str
    weight: Optional[float] = None  # For composite scoring

# Task API Quality Gates
TASK_API_GATES = [
    # Blocking gates - any failure stops deployment
    QualityGate(
        name="safety_gate",
        metric="harmful_rate",
        threshold=0.05,
        comparison=Comparison.LESS_THAN,
        gate_type=GateType.BLOCKING,
        description="Harmful responses must be below 5%"
    ),
    QualityGate(
        name="format_gate",
        metric="json_valid_rate",
        threshold=0.95,
        comparison=Comparison.GREATER_EQUAL,
        gate_type=GateType.BLOCKING,
        description="JSON validity must be at least 95%"
    ),
    QualityGate(
        name="regression_gate",
        metric="accuracy_vs_baseline",
        threshold=-0.05,
        comparison=Comparison.GREATER_EQUAL,
        gate_type=GateType.BLOCKING,
        description="Accuracy cannot regress more than 5%"
    ),
    # Warning gates - flag issues but allow deployment
    QualityGate(
        name="accuracy_target",
        metric="accuracy",
        threshold=0.90,
        comparison=Comparison.GREATER_EQUAL,
        gate_type=GateType.WARNING,
        description="Target accuracy is 90%+"
    ),
    QualityGate(
        name="schema_compliance",
        metric="schema_valid_rate",
        threshold=0.98,
        comparison=Comparison.GREATER_EQUAL,
        gate_type=GateType.WARNING,
        description="Schema compliance should be 98%+"
    ),
    QualityGate(
        name="latency_target",
        metric="p99_latency_ms",
        threshold=2000,
        comparison=Comparison.LESS_THAN,
        gate_type=GateType.WARNING,
        description="99th percentile latency should be under 2s"
    ),
    # Informational - logged for tracking
    QualityGate(
        name="average_latency",
        metric="mean_latency_ms",
        threshold=500,
        comparison=Comparison.LESS_THAN,
        gate_type=GateType.INFORMATIONAL,
        description="Average latency tracking"
    )
]
Gate Evaluation Engine
from typing import NamedTuple, Optional

class GateResult(NamedTuple):
    gate_name: str
    passed: bool
    actual_value: Optional[float]
    threshold: float
    gate_type: GateType
    message: str

class QualityGateEvaluator:
    """Evaluate model metrics against quality gates."""

    def __init__(self, gates: list[QualityGate]):
        self.gates = gates

    def evaluate(self, metrics: dict) -> dict:
        """Evaluate all gates and return pass/fail decision."""
        results = []
        blocking_failures = []
        warnings = []

        for gate in self.gates:
            # Get metric value
            if gate.metric not in metrics:
                result = GateResult(
                    gate_name=gate.name,
                    passed=False,
                    actual_value=None,
                    threshold=gate.threshold,
                    gate_type=gate.gate_type,
                    message=f"Metric '{gate.metric}' not found in evaluation results"
                )
            else:
                value = metrics[gate.metric]
                passed = self._compare(value, gate.threshold, gate.comparison)
                result = GateResult(
                    gate_name=gate.name,
                    passed=passed,
                    actual_value=value,
                    threshold=gate.threshold,
                    gate_type=gate.gate_type,
                    message=f"{gate.metric}: {value:.3f} (required {gate.comparison.value} {gate.threshold})"
                )

            results.append(result)
            if not result.passed:
                if gate.gate_type == GateType.BLOCKING:
                    blocking_failures.append(result)
                elif gate.gate_type == GateType.WARNING:
                    warnings.append(result)

        return {
            'deploy_allowed': len(blocking_failures) == 0,
            'blocking_failures': blocking_failures,
            'warnings': warnings,
            'all_results': results,
            'summary': self._generate_summary(blocking_failures, warnings)
        }

    def _compare(self, value: float, threshold: float, comparison: Comparison) -> bool:
        if comparison == Comparison.GREATER_THAN:
            return value > threshold
        elif comparison == Comparison.GREATER_EQUAL:
            return value >= threshold
        elif comparison == Comparison.LESS_THAN:
            return value < threshold
        elif comparison == Comparison.LESS_EQUAL:
            return value <= threshold
        elif comparison == Comparison.EQUALS:
            return value == threshold
        raise ValueError(f"Unsupported comparison: {comparison}")

    def _generate_summary(self, blocking: list, warnings: list) -> str:
        if blocking:
            return f"BLOCKED: {len(blocking)} blocking failure(s)"
        elif warnings:
            return f"PASSED with {len(warnings)} warning(s)"
        else:
            return "PASSED: All gates passed"

# Usage
evaluator = QualityGateEvaluator(TASK_API_GATES)
metrics = {
    'harmful_rate': 0.02,
    'json_valid_rate': 0.97,
    'accuracy_vs_baseline': -0.01,
    'accuracy': 0.88,
    'schema_valid_rate': 0.95,
    'p99_latency_ms': 1800,
    'mean_latency_ms': 450
}
result = evaluator.evaluate(metrics)
Output:
{
  "deploy_allowed": true,
  "blocking_failures": [],
  "warnings": [
    {
      "gate_name": "accuracy_target",
      "passed": false,
      "actual_value": 0.88,
      "threshold": 0.90,
      "gate_type": "warning",
      "message": "accuracy: 0.880 (required >= 0.9)"
    },
    {
      "gate_name": "schema_compliance",
      "passed": false,
      "actual_value": 0.95,
      "threshold": 0.98,
      "gate_type": "warning",
      "message": "schema_valid_rate: 0.950 (required >= 0.98)"
    }
  ],
  "summary": "PASSED with 2 warning(s)"
}
Setting Threshold Values
The Threshold Calibration Process
Setting thresholds requires balancing business requirements against achievable quality:
import numpy as np

def calibrate_thresholds(historical_data: list[dict], target_pass_rate: float = 0.80) -> dict:
    """Calibrate thresholds based on historical model performance."""
    calibrated = {}
    for metric in ['accuracy', 'json_valid_rate', 'harmful_rate']:
        values = [d[metric] for d in historical_data if metric in d]
        if metric == 'harmful_rate':
            # Lower is better: set the threshold so target_pass_rate of
            # historical runs fall at or below it
            threshold = np.percentile(values, target_pass_rate * 100)
            would_pass = sum(1 for v in values if v <= threshold) / len(values)
        else:
            # Higher is better: set the threshold at the (1 - target_pass_rate)
            # percentile so target_pass_rate of historical runs clear it
            threshold = np.percentile(values, (1 - target_pass_rate) * 100)
            would_pass = sum(1 for v in values if v >= threshold) / len(values)
        calibrated[metric] = {
            'suggested_threshold': round(float(threshold), 3),
            'historical_mean': round(float(np.mean(values)), 3),
            'historical_std': round(float(np.std(values)), 3),
            'would_pass_rate': would_pass
        }
    return calibrated

# Example
historical = [
    {'accuracy': 0.85, 'json_valid_rate': 0.96, 'harmful_rate': 0.03},
    {'accuracy': 0.88, 'json_valid_rate': 0.97, 'harmful_rate': 0.02},
    {'accuracy': 0.87, 'json_valid_rate': 0.98, 'harmful_rate': 0.04},
    # ... more historical evaluations
]
calibration = calibrate_thresholds(historical, target_pass_rate=0.80)
Output:
{
  "accuracy": {
    "suggested_threshold": 0.852,
    "historical_mean": 0.867,
    "historical_std": 0.015,
    "would_pass_rate": 0.80
  },
  "json_valid_rate": {
    "suggested_threshold": 0.961,
    "historical_mean": 0.970,
    "historical_std": 0.010,
    "would_pass_rate": 0.80
  }
}
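Calibration output is most useful when it feeds back into the gate configuration. The helper below is a hypothetical sketch (not part of the evaluator above) that flags gates whose configured thresholds have drifted away from what recent models actually achieve:
def review_thresholds(gates: list[QualityGate], calibration: dict, tolerance: float = 0.02) -> list[str]:
    """Flag gates whose thresholds look out of line with recent history."""
    notes = []
    for gate in gates:
        if gate.metric not in calibration:
            continue
        suggested = calibration[gate.metric]['suggested_threshold']
        if abs(gate.threshold - suggested) > tolerance:  # tolerance is an arbitrary example value
            notes.append(
                f"{gate.name}: configured {gate.threshold}, calibration suggests ~{suggested}"
            )
    return notes

# Example: review_thresholds(TASK_API_GATES, calibration)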
Stakeholder-Driven Thresholds
Different stakeholders care about different metrics:
| Stakeholder | Primary Concern | Threshold Focus |
|---|---|---|
| Product Manager | User experience | Helpfulness, accuracy |
| Security Team | Safety | Harmful rate, refusal rate |
| Engineering | Reliability | Format compliance, latency |
| Finance | Cost | Per-request cost, throughput |
| Legal/Compliance | Risk | Bias, safety, audit trail |
STAKEHOLDER_REQUIREMENTS = {
    "product": {
        "priority_metric": "user_satisfaction_score",
        "threshold": 4.0,
        "rationale": "Users rate helpfulness 1-5, target 4.0+"
    },
    "security": {
        "priority_metric": "harmful_rate",
        "threshold": 0.01,
        "rationale": "Zero tolerance, 1% accounts for measurement error"
    },
    "engineering": {
        "priority_metric": "p99_latency_ms",
        "threshold": 3000,
        "rationale": "SLA requires 3s max response time"
    },
    "finance": {
        "priority_metric": "cost_per_request",
        "threshold": 0.005,
        "rationale": "Budget allows $0.005/request at projected volume"
    }
}

def merge_stakeholder_gates(requirements: dict) -> list[QualityGate]:
    """Convert stakeholder requirements to quality gates."""
    gates = []
    # Simple heuristic: metrics named like rates, costs, or latencies are
    # "lower is better"; everything else is "higher is better"
    lower_is_better = ('rate', 'cost', 'latency')
    for stakeholder, req in requirements.items():
        metric = req['priority_metric']
        if any(token in metric for token in lower_is_better):
            comparison = Comparison.LESS_THAN
        else:
            comparison = Comparison.GREATER_EQUAL
        # Security requirements are always blocking
        gate_type = GateType.BLOCKING if stakeholder == "security" else GateType.WARNING
        gates.append(QualityGate(
            name=f"{stakeholder}_{metric}",
            metric=metric,
            threshold=req['threshold'],
            comparison=comparison,
            gate_type=gate_type,
            description=req['rationale']
        ))
    return gates
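In practice you would combine stakeholder-derived gates with the base configuration before evaluating. A short sketch, assuming the classes and TASK_API_GATES defined earlier; if a stakeholder gate targets the same metric as an existing gate (harmful_rate here), reconcile them and keep the stricter threshold rather than carrying duplicates:
# Merge base gates with stakeholder gates and evaluate as usual
all_gates = TASK_API_GATES + merge_stakeholder_gates(STAKEHOLDER_REQUIREMENTS)
evaluator = QualityGateEvaluator(all_gates)
decision = evaluator.evaluate(metrics)
print(decision['summary'])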
Composite Quality Scores
Sometimes you need a single number for dashboards or quick decisions:
def compute_composite_score(metrics: dict, weights: dict) -> dict:
    """Compute weighted composite quality score."""
    # Normalize metrics to 0-1 scale (only metrics that carry a weight)
    normalized = {}
    for metric, value in metrics.items():
        if metric not in weights:
            continue
        if metric in ['harmful_rate', 'error_rate']:
            # Invert bad metrics: lower is better
            normalized[metric] = max(0, 1 - value)
        elif metric in ['p99_latency_ms', 'mean_latency_ms']:
            # Normalize latency: 0ms = 1.0, 5000ms = 0.0
            normalized[metric] = max(0, 1 - value / 5000)
        else:
            # Good metrics: higher is better, already 0-1
            normalized[metric] = value

    # Compute weighted average
    weighted_sum = 0
    weight_total = 0
    for metric, weight in weights.items():
        if metric in normalized:
            weighted_sum += normalized[metric] * weight
            weight_total += weight

    composite = weighted_sum / weight_total if weight_total > 0 else 0
    return {
        'composite_score': round(composite, 3),
        'normalized_metrics': normalized,
        'grade': _score_to_grade(composite)
    }

def _score_to_grade(score: float) -> str:
    if score >= 0.95:
        return 'A+'
    elif score >= 0.90:
        return 'A'
    elif score >= 0.85:
        return 'B'
    elif score >= 0.80:
        return 'C'
    elif score >= 0.70:
        return 'D'
    else:
        return 'F'

# Example
weights = {
    'accuracy': 0.30,
    'json_valid_rate': 0.20,
    'harmful_rate': 0.25,
    'schema_valid_rate': 0.15,
    'p99_latency_ms': 0.10
}
composite = compute_composite_score(metrics, weights)
Output:
{
  "composite_score": 0.909,
  "normalized_metrics": {
    "accuracy": 0.88,
    "json_valid_rate": 0.97,
    "harmful_rate": 0.98,
    "schema_valid_rate": 0.95,
    "p99_latency_ms": 0.64
  },
  "grade": "A"
}
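A composite score is convenient for dashboards, but it can hide a failing dimension: strong accuracy can offset a weak safety score. One way to guard against that, sketched under the assumption that you run the blocking gates first, is to refuse to report a composite at all when any blocking gate fails:
def gated_composite(metrics: dict, weights: dict, evaluator: QualityGateEvaluator) -> dict:
    """Report a composite score only when every blocking gate passes."""
    decision = evaluator.evaluate(metrics)
    if not decision['deploy_allowed']:
        # A single headline number should never mask a blocking failure
        return {'composite_score': None, 'reason': decision['summary']}
    return compute_composite_score(metrics, weights)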
Gate Reporting
Clear reporting helps teams understand gate decisions:
from datetime import datetime

def generate_gate_report(evaluation_result: dict, model_info: dict) -> str:
    """Generate human-readable gate report."""
    report = []
    report.append("=" * 60)
    report.append("QUALITY GATE EVALUATION REPORT")
    report.append("=" * 60)
    report.append(f"\nModel: {model_info['name']}")
    report.append(f"Version: {model_info['version']}")
    report.append(f"Evaluated: {datetime.now().isoformat()}")
    report.append(f"\n{'='*60}")
    report.append(f"RESULT: {evaluation_result['summary']}")
    report.append("=" * 60)

    # Blocking failures
    if evaluation_result['blocking_failures']:
        report.append("\n⛔ BLOCKING FAILURES:")
        for failure in evaluation_result['blocking_failures']:
            report.append(f"  • {failure.gate_name}: {failure.message}")

    # Warnings
    if evaluation_result['warnings']:
        report.append("\n⚠️ WARNINGS:")
        for warning in evaluation_result['warnings']:
            report.append(f"  • {warning.gate_name}: {warning.message}")

    # All gates
    report.append("\n📊 ALL GATES:")
    for result in evaluation_result['all_results']:
        status = "✅" if result.passed else "❌"
        report.append(f"  {status} {result.gate_name}: {result.message}")

    report.append("\n" + "=" * 60)
    return "\n".join(report)

report = generate_gate_report(result, {'name': 'task-api-v2', 'version': '2.0.1'})
print(report)
Output:
============================================================
QUALITY GATE EVALUATION REPORT
============================================================

Model: task-api-v2
Version: 2.0.1
Evaluated: 2025-01-02T14:32:15

============================================================
RESULT: PASSED with 2 warning(s)
============================================================

⚠️ WARNINGS:
  • accuracy_target: accuracy: 0.880 (required >= 0.9)
  • schema_compliance: schema_valid_rate: 0.950 (required >= 0.98)

📊 ALL GATES:
  ✅ safety_gate: harmful_rate: 0.020 (required < 0.05)
  ✅ format_gate: json_valid_rate: 0.970 (required >= 0.95)
  ✅ regression_gate: accuracy_vs_baseline: -0.010 (required >= -0.05)
  ❌ accuracy_target: accuracy: 0.880 (required >= 0.9)
  ❌ schema_compliance: schema_valid_rate: 0.950 (required >= 0.98)
  ✅ latency_target: p99_latency_ms: 1800.000 (required < 2000)
  ✅ average_latency: mean_latency_ms: 450.000 (required < 500)

============================================================
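In a CI/CD pipeline the report usually becomes a build artifact and the gate decision becomes the exit code. A minimal sketch of that wiring; the file names and the surrounding pipeline are assumptions, not something the evaluator prescribes:
import json
import sys

def run_gate_check(metrics: dict, model_info: dict) -> None:
    """Evaluate gates, persist the report, and fail the build on blocking failures."""
    evaluator = QualityGateEvaluator(TASK_API_GATES)
    decision = evaluator.evaluate(metrics)

    # Persist artifacts for reviewers and audit trails (example file names)
    with open("gate_report.txt", "w") as f:
        f.write(generate_gate_report(decision, model_info))
    with open("gate_decision.json", "w") as f:
        json.dump({"deploy_allowed": decision["deploy_allowed"],
                   "summary": decision["summary"]}, f, indent=2)

    # A non-zero exit stops the deployment job; warnings pass but stay visible
    if not decision["deploy_allowed"]:
        sys.exit(1)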
Update Your Skill
Add to llmops-evaluator/SKILL.md:
## Quality Gates
### Gate Hierarchy
1. **Blocking Gates** (must pass to deploy):
   - Safety: harmful_rate < 5%
   - Format: json_valid_rate >= 95%
   - Regression: accuracy >= baseline - 5%
2. **Warning Gates** (flag but allow deploy):
   - Accuracy: >= 90% target
   - Schema: >= 98% compliance
   - Latency: p99 < 2000ms
3. **Informational** (tracking only):
   - Cost per request
   - Average latency
   - Token usage
### Threshold Setting Process
1. Gather historical performance data
2. Identify stakeholder requirements
3. Set blocking thresholds at minimum acceptable
4. Set warning thresholds at target quality
5. Validate with calibration (80% historical pass rate)
6. Review quarterly for adjustment
### Gate Report Template
GATE: [name]
METRIC: [what is measured]
THRESHOLD: [value]
ACTUAL: [model value]
RESULT: [PASS/FAIL]
TYPE: [BLOCKING/WARNING/INFO]
ACTION: [what happens on failure]
Try With AI
Prompt 1: Design Gates for Your Use Case
I'm deploying a customer support chatbot for an e-commerce company.
Key requirements:
- Must not provide incorrect return policy information
- Should maintain friendly, helpful tone
- Cannot leak customer data
- Should resolve issues without escalation when possible
Design quality gates with:
1. 3 blocking gates (deployment stops)
2. 3 warning gates (flags but continues)
3. Thresholds and rationale for each
What you are learning: Domain-specific gate design. Generic gates miss industry-specific requirements. Your AI partner helps translate business requirements into technical thresholds.
Prompt 2: Resolve Stakeholder Conflicts
I'm setting thresholds and have conflicting requirements:
Product Team: "Accuracy must be 95%+ or users will complain"
Engineering: "We can realistically achieve 88% with current data"
Security: "Any harmful response is unacceptable"
Finance: "We can't afford more than $0.003 per request"
Help me:
1. Identify which requirements can coexist
2. Propose compromises where needed
3. Design a gate configuration that balances all stakeholders
4. Draft talking points for stakeholder discussions
What you are learning: Stakeholder negotiation. Quality gates require cross-functional alignment. Your AI partner helps find solutions that balance competing priorities.
Prompt 3: Debug Gate Failures
My model keeps failing the quality gate with these results:
- safety_gate: PASS (harmful_rate: 0.02)
- format_gate: FAIL (json_valid_rate: 0.93)
- accuracy_gate: PASS (accuracy: 0.89)
The format gate threshold is 95%.
Analyze:
1. What might cause 7% JSON failures?
2. How can I identify which inputs cause failures?
3. Should I adjust the threshold or fix the model?
4. What data would help me investigate?
What you are learning: Gate debugging. When gates fail, you need to understand why. Your AI partner helps develop investigation strategies.
Safety Note
Quality gates provide confidence, not guarantees. A model passing all gates can still fail in production on inputs your evaluation did not cover. Always implement monitoring, logging, and human review processes alongside quality gates.
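One way to put that into practice is to re-run the same gate definitions against metrics aggregated from sampled production traffic, so drift shows up in the same terms as the pre-deployment check. A rough sketch, assuming you already compute a window of production metrics elsewhere and route log records to your alerting system:
import logging

logger = logging.getLogger("quality_gates.monitor")

def monitor_production(window_metrics: dict) -> None:
    """Re-run deployment gates against metrics from a window of live traffic."""
    evaluator = QualityGateEvaluator(TASK_API_GATES)
    decision = evaluator.evaluate(window_metrics)

    if not decision['deploy_allowed']:
        # In production, a "blocking" failure means alert and consider rollback,
        # not a blocked deploy; route this to paging rather than CI
        logger.error("Production gate breach: %s", decision['summary'])
    for warning in decision['warnings']:
        logger.warning("Gate warning: %s - %s", warning.gate_name, warning.message)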