Build Your Evaluation Skill

Before learning about model evaluation, you will create the skill that will capture that knowledge. This skill-first approach means every concept you learn is encoded into a reusable asset that becomes part of your Digital FTE toolkit.

When you fine-tune a model, how do you know it actually improved? A model might generate fluent text that completely misses the point. Evaluation frameworks provide systematic methods to measure what matters: accuracy, format compliance, reasoning quality, and safety. By the end of this chapter, you will have a skill that guides evaluation decisions for any fine-tuned model.

Step 1: Clone Skills Lab Fresh

Every chapter starts with a clean environment. This prevents state pollution from previous work and ensures reproducible results.

```bash
# Navigate to your workspace
cd ~/workspace

# Clone fresh skills-lab (or reset if exists)
if [ -d "skills-lab-llmops" ]; then
  rm -rf skills-lab-llmops
fi

git clone https://github.com/panaversity/skills-lab.git skills-lab-llmops
cd skills-lab-llmops

# Create chapter directory
mkdir -p llmops-evaluation
cd llmops-evaluation
```

Output:

```text
Cloning into 'skills-lab-llmops'...
remote: Enumerating objects: 156, done.
remote: Counting objects: 100% (156/156), done.
Receiving objects: 100% (156/156), 45.23 KiB | 2.26 MiB/s, done.
```

Step 2: Write Your LEARNING-SPEC.md

Before fetching documentation, articulate what you want to learn. This specification drives focused learning.

Create LEARNING-SPEC.md:

# Learning Specification: LLM Evaluation & Quality Gates

## Intent

Learn to systematically evaluate fine-tuned models to ensure they meet quality standards before deployment.

## What I Want to Learn

1. **Evaluation Taxonomy**: What metrics matter for different use cases?
2. **LLM-as-Judge**: How to use GPT-4 as an evaluator for subjective quality
3. **Benchmark Design**: How to create task-specific benchmarks for the Task API
4. **Regression Testing**: How to detect when model quality degrades
5. **Quality Gates**: How to define pass/fail thresholds for deployment

## Success Criteria

- [ ] I can select appropriate evaluation metrics for a given task
- [ ] I can implement LLM-as-Judge with structured rubrics
- [ ] I can create a custom benchmark for JSON output validation
- [ ] I can detect quality regression between model versions
- [ ] I can define quality gates that block bad deployments

## Constraints

- Must work on Colab Free Tier (T4, 15GB VRAM)
- Focus on practical evaluation, not research benchmarks
- Use lm-evaluation-harness as the primary tool
- Integrate with Task API from Chapter 40

## Prior Knowledge

- Chapter 64: SFT fundamentals
- Chapter 65-68: Various fine-tuning approaches
- Chapter 40: Task API structure

## Time Budget

- This lesson: 25 minutes (skill creation)
- Full chapter: ~4 hours (all evaluation concepts)

Step 3: Fetch Official Documentation

Use Context7 to retrieve the authoritative lm-evaluation-harness documentation. This ensures your skill is grounded in official patterns, not hallucinated best practices.

/fetching-library-docs lm-evaluation-harness

Key concepts to extract from documentation:

| Concept | What It Means |
|---------|---------------|
| Task | A specific evaluation benchmark (e.g., "hellaswag", "mmlu") |
| Model | The model being evaluated (supports HuggingFace, OpenAI, local) |
| Metric | What gets measured (accuracy, perplexity, exact match) |
| Few-shot | Number of examples provided in prompt before evaluation |
| Log-likelihood | Probability the model assigns to correct answer |
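
To see how these concepts fit together, here is a minimal sketch using the harness's Python entry point. It assumes a recent lm-evaluation-harness release that exposes `lm_eval.simple_evaluate`; the model path is a placeholder.

```python
import lm_eval

# model/model_args -> the model under evaluation, tasks -> the benchmarks,
# num_fewshot -> how many examples are prepended to each prompt.
results = lm_eval.simple_evaluate(
    model="hf",                                   # HuggingFace backend
    model_args="pretrained=my-fine-tuned-model",  # placeholder model path
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
    batch_size=8,
)

print(results["results"])  # per-task metrics such as accuracy
```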

Step 4: Create Your Initial Skill

Create llmops-evaluator/SKILL.md:

---
name: llmops-evaluator
description: "This skill should be used when evaluating fine-tuned LLM quality. Use when selecting metrics, designing benchmarks, running evaluations, or setting quality gates for model deployment."
---

# LLMOps Evaluator Skill

## When to Use This Skill

Invoke this skill when you need to:
- Evaluate a fine-tuned model before deployment
- Compare model versions for regression
- Design custom benchmarks for your use case
- Set pass/fail thresholds for CI/CD pipelines
- Debug why a model is underperforming

## Evaluation Decision Framework

### Step 1: Identify Evaluation Type

| Use Case | Evaluation Type | Primary Metrics |
|----------|----------------|-----------------|
| Classification | Accuracy-based | Accuracy, F1, Precision, Recall |
| Generation | Quality-based | Perplexity, BLEU, ROUGE |
| Instruction-following | LLM-as-Judge | Rubric scores (1-5) |
| JSON output | Format validation | Schema compliance rate |
| Safety | Red-teaming | Harmful response rate |
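
Format validation (the JSON output row) is usually cheap to automate. A minimal sketch, assuming outputs arrive as raw strings and that `action` and `title` are the required keys; adapt the key set to your own schema:

```python
import json

REQUIRED_KEYS = {"action", "title"}  # assumption: adjust to your own schema

def schema_compliance_rate(outputs: list[str]) -> float:
    """Fraction of outputs that parse as JSON objects containing the required keys."""
    compliant = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
            compliant += 1
    return compliant / len(outputs) if outputs else 0.0
```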

### Step 2: Select Benchmarks

**Standard Benchmarks** (for general capability):
- **MMLU**: General knowledge across domains
- **HellaSwag**: Common-sense reasoning
- **ARC**: Science reasoning
- **TruthfulQA**: Factual accuracy

**Task-Specific Benchmarks** (for your domain):
- Create custom evaluation sets matching your use case
- Minimum: 100 examples for reliable measurement
- Include edge cases and failure modes
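
A custom evaluation set can be as simple as a JSONL file of input/expected pairs, as in this minimal sketch (the file name and examples are illustrative only):

```python
import json

# Illustrative examples for a Task API benchmark; include edge cases on purpose.
examples = [
    {"input": "Add 'buy milk' to my list", "expected_action": "create"},
    {"input": "Mark the dentist appointment as done", "expected_action": "complete"},
    {"input": "What do I still have to do this week?", "expected_action": "list"},
]

with open("task_api_benchmark.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```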

### Step 3: Run Evaluation

```bash
# Basic evaluation with lm-eval-harness
lm_eval --model hf \
    --model_args pretrained=my-fine-tuned-model \
    --tasks hellaswag,arc_easy \
    --batch_size 8 \
    --output_path ./results
```

### Step 4: Define Quality Gates

**Deployment Thresholds**:

- Accuracy: > 85% on task-specific benchmark
- Harmful response rate: < 5%
- Schema compliance: > 95% for JSON output
- Regression: New model >= Previous model - 2%
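
A minimal sketch of turning these thresholds into a pass/fail gate; the metric keys and values mirror the list above and are assumptions to tune for your project:

```python
def passes_quality_gates(current: dict, previous: dict) -> tuple[bool, list[str]]:
    """Apply the deployment thresholds above and return (passed, failure reasons)."""
    failures = []
    if current["accuracy"] < 0.85:
        failures.append(f"accuracy {current['accuracy']:.2%} below 85%")
    if current["harmful_rate"] > 0.05:
        failures.append(f"harmful response rate {current['harmful_rate']:.2%} above 5%")
    if current["schema_compliance"] < 0.95:
        failures.append(f"schema compliance {current['schema_compliance']:.2%} below 95%")
    if current["accuracy"] < previous["accuracy"] - 0.02:
        failures.append("more than 2% regression vs. previous model")
    return (not failures, failures)
```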

## Common Patterns

### Pattern 1: A/B Model Comparison

```python
def compare_models(model_a_results, model_b_results, threshold=0.02):
    """Compare two models and determine if B is a regression from A."""
    delta = model_b_results['accuracy'] - model_a_results['accuracy']
    if delta < -threshold:
        return "REGRESSION", f"Model B is {abs(delta):.2%} worse"
    elif delta > threshold:
        return "IMPROVEMENT", f"Model B is {delta:.2%} better"
    else:
        return "EQUIVALENT", f"Within {threshold:.2%} threshold"
```

### Pattern 2: LLM-as-Judge Template

```python
JUDGE_PROMPT = """
Evaluate the assistant's response on a scale of 1-5:

User Request: {input}
Assistant Response: {output}
Expected Behavior: {expected}

Criteria:
- Accuracy: Does the response correctly address the request?
- Format: Does the response follow the expected format?
- Helpfulness: Is the response useful and complete?

Score (1-5):
Reasoning:
"""
```

## Quality Gate Checklist

Before deploying a fine-tuned model, verify:

- [ ] Task-specific accuracy > threshold
- [ ] No regression from previous version
- [ ] Format compliance verified
- [ ] Safety evaluation passed
- [ ] Cost/latency within budget

Step 5: Verify Skill Works

Test that your skill provides useful guidance:

```bash
# Verify skill file exists and is valid
cat llmops-evaluator/SKILL.md | head -20
```

Output:

```text
---
name: llmops-evaluator
description: "This skill should be used when evaluating fine-tuned LLM quality..."
---

# LLMOps Evaluator Skill
...
```

Your skill now exists as a starting point. As you progress through this chapter, you will add:

- Detailed evaluation taxonomy (L01)
- LLM-as-Judge implementation patterns (L02)
- Task-specific benchmark design (L03)
- Regression testing workflows (L04)
- Quality gate configurations (L05)

Skill Evolution Map

Track how your skill grows through this chapter:

| Lesson | What Gets Added |
|--------|-----------------|
| L00 (now) | Initial framework, basic decision tree |
| L01 | Evaluation taxonomy, metric selection guide |
| L02 | LLM-as-Judge prompts and rubrics |
| L03 | Custom benchmark creation patterns |
| L04 | A/B testing, regression detection |
| L05 | CI/CD gate configurations |
| L06 | Complete pipeline integration |

Try With AI

Prompt 1: Review Your LEARNING-SPEC

I wrote this LEARNING-SPEC.md for learning LLM evaluation:

[paste your LEARNING-SPEC.md]

1. Are my success criteria specific and measurable?
2. What am I missing that would be important for production evaluation?
3. Do my constraints match real-world limitations?

What you are learning: Specification refinement. Your AI partner helps identify gaps in your learning goals before you invest time in the wrong direction.

Prompt 2: Expand the Skill Framework

I'm building an llmops-evaluator skill. Review my initial framework:

[paste your SKILL.md]

Suggest 3 additional decision frameworks I should include for:
1. Choosing between automated metrics vs human evaluation
2. Determining sample size for reliable benchmarks
3. Handling evaluation of creative/open-ended outputs

What you are learning: Skill architecture. Evaluation has many dimensions. Your AI partner helps identify frameworks you might not have considered.

Prompt 3: Connect to Task API

My fine-tuned model outputs JSON for a Task API with this schema:

{
  "action": "create|complete|list|delete",
  "title": "string",
  "priority": "low|medium|high",
  "due_date": "string|null"
}

Design 5 evaluation test cases that would catch common failure modes:
- Invalid JSON
- Missing required fields
- Wrong action selection
- Inappropriate priority assignment
- Format consistency issues

What you are learning: Domain-specific evaluation design. Generic benchmarks miss your specific requirements. Your AI partner helps design tests that match your actual use case.
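
As a reference point for reviewing the AI's suggestions, here is a minimal sketch of how such test cases could be encoded; the prompts and expected values are illustrative, not a definitive test suite:

```python
# Illustrative test cases targeting the failure modes listed above.
test_cases = [
    {"prompt": "Add 'file taxes' as urgent", "expect": {"action": "create", "priority": "high"}},
    {"prompt": "Delete the grocery task", "expect": {"action": "delete"}},
    {"prompt": "Show everything due tomorrow", "expect": {"action": "list"}},
    {"prompt": "Add a task with no deadline", "expect": {"action": "create", "due_date": None}},
    {"prompt": "Mark 'call mom' complete", "expect": {"action": "complete"}},
]
```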

Safety Note

As you build evaluation frameworks, remember that evaluation can give false confidence. A model passing benchmarks does not guarantee safety in deployment. Always include human review for novel situations and maintain logging for post-deployment monitoring.