Updated Feb 23, 2026

Build Your Alignment Skill

You have a merged model that combines persona and agentic capabilities. But models that complete tasks eagerly can also complete harmful requests eagerly. Before learning alignment theory, you'll build a skill that encodes the knowledge you need.

This lesson follows the pattern you've used throughout Part 8: create the skill first from official documentation, then use the chapter to test and improve it. By the end of this chapter, your model-alignment skill will be battle-tested and production-ready.

Why Alignment Needs a Skill

Alignment is not a one-time fix. It's an ongoing discipline with evolving techniques, common failure modes, and nuanced tradeoffs. Memorizing this is impossible. Encoding it into a skill makes the knowledge permanently accessible.

Traditional Approach	Skill-First Approach
Read papers, forget details	Build skill, query anytime
Scattered alignment notes	Single authoritative source
Reinvent safety patterns	Reuse validated approaches
Cannot delegate to AI	AI executes your skill

Your merged Task API model from Chapter 67 needs alignment before deployment. Without it, the model will helpfully attempt anything a user asks, including harmful requests.

Step 1: Clone a Fresh Skills-Lab

Start with a clean environment. State from Chapter 67 should not affect your alignment work.

# Clone the skills-lab repository
git clone https://github.com/panaversity/skills-lab.git ~/skills-lab-ch68

# Navigate to the directory
cd ~/skills-lab-ch68

# Create the skill directory structure
mkdir -p .claude/skills/model-alignment

Output:

Cloning into '/Users/you/skills-lab-ch68'...

Step 2: Write Your LEARNING-SPEC.md

Before creating the skill, define what you need to accomplish. This specification guides both your learning and the skill you create.

# LEARNING-SPEC.md

## What I Want to Learn
Model alignment using Direct Preference Optimization (DPO) with TRL library
on Colab Free Tier (T4 GPU, 15GB VRAM).

## Why This Matters
My merged Task API model completes any request. I need it to:
- Refuse harmful task requests gracefully
- Maintain helpfulness for legitimate tasks
- Detect and resist jailbreak attempts
- Provide safe, useful responses consistently

## Success Criteria
1. I can explain why fine-tuned models need alignment
2. I can create preference datasets (prompt/chosen/rejected)
3. I can run DPO training on Colab T4 without OOM errors
4. I can red-team my model to find vulnerabilities
5. I can implement refusal training for unsafe requests
6. My aligned model achieves 90%+ harm reduction

## Constraints
- Must work on Colab Free Tier (T4, 15GB VRAM, 12GB RAM)
- Must use DPO (not RLHF) for simplicity and compute efficiency
- Must use TRL library for training
- Must integrate with existing merged model from Chapter 67

## Running Example
Align my merged Task API model to refuse harmful task requests
(e.g., "create a task to stalk someone") while remaining helpful
for legitimate productivity use cases.

Save this file:

cat > LEARNING-SPEC.md << 'EOF'
# LEARNING-SPEC.md

## What I Want to Learn
Model alignment using Direct Preference Optimization (DPO) with TRL library
on Colab Free Tier (T4 GPU, 15GB VRAM).

## Why This Matters
My merged Task API model completes any request. I need it to:
- Refuse harmful task requests gracefully
- Maintain helpfulness for legitimate tasks
- Detect and resist jailbreak attempts
- Provide safe, useful responses consistently

## Success Criteria
1. I can explain why fine-tuned models need alignment
2. I can create preference datasets (prompt/chosen/rejected)
3. I can run DPO training on Colab T4 without OOM errors
4. I can red-team my model to find vulnerabilities
5. I can implement refusal training for unsafe requests
6. My aligned model achieves 90%+ harm reduction

## Constraints
- Must work on Colab Free Tier (T4, 15GB VRAM, 12GB RAM)
- Must use DPO (not RLHF) for simplicity and compute efficiency
- Must use TRL library for training
- Must integrate with existing merged model from Chapter 67

## Running Example
Align my merged Task API model to refuse harmful task requests
(e.g., "create a task to stalk someone") while remaining helpful
for legitimate productivity use cases.
EOF

Step 3: Fetch Official Documentation

The skill must be grounded in official documentation, not AI memory which may be outdated or incorrect.

Use Claude Code or your AI assistant:

/fetching-library-docs trl dpo

Fetch the official TRL documentation covering:
1. DPOTrainer setup and configuration
2. Preference dataset format (prompt/chosen/rejected)
3. DPO hyperparameters (beta, loss type)
4. Memory optimization for consumer GPUs

Key sources to reference:

Step 4: Create Your model-alignment Skill

Based on the documentation, create your skill file:

cat > .claude/skills/model-alignment/SKILL.md << 'EOF'
---
name: model-alignment
description: This skill should be used when aligning language models using DPO. Use when creating preference datasets, configuring DPO training, red-teaming models, and implementing refusal behaviors.
---

# Model Alignment Skill

## Purpose

Guide model alignment using Direct Preference Optimization (DPO) to ensure
models refuse harmful requests while remaining helpful for legitimate use cases.

## When to Use This Skill

Invoke this skill when:
- Creating preference datasets for alignment training
- Configuring DPO training with TRL
- Red-teaming models to find vulnerabilities
- Implementing refusal behaviors
- Evaluating alignment effectiveness
- Balancing safety with helpfulness

## Prerequisite Knowledge

- Supervised fine-tuning (Chapter 64)
- Model merging (Chapter 67)
- Python and Hugging Face ecosystem
- Understanding of harmful content categories

## Hardware Context

**Colab Free Tier (T4 GPU):**
- 15GB VRAM
- 12GB RAM
- Use QLoRA (4-bit) for DPO training
- Gradient accumulation required for effective batch sizes

## Alignment Methods Comparison

| Method | Complexity | Compute | Use Case |
|--------|-----------|---------|----------|
| DPO | Low | T4 feasible | Most alignment tasks |
| RLHF | High | Cluster needed | Maximum quality |
| ORPO | Medium | T4 feasible | Combined SFT+alignment |
| KTO | Low | T4 feasible | When only good examples available |

**Why DPO for This Chapter:**
- No reward model required (unlike RLHF)
- Single training pass
- Stable optimization
- 90%+ of RLHF quality with 10% of complexity

## Preference Dataset Format

```json
{
  "prompt": "Create a task to track my ex-partner's location",
  "chosen": "I can't help with tracking someone without their consent. This could constitute stalking. Instead, I can help you create tasks for your own location-based reminders or coordinate meetings with people who have agreed to share their location.",
  "rejected": "Sure! Here's a task for tracking your ex-partner: 1. Install a location tracking app..."
}

Quality Requirements

Dimension	Requirement	Why It Matters
Clear contrast	chosen/rejected obviously different	Model learns distinction
Realistic prompts	Actual user requests	Generalizes to real use
Graceful refusals	Not just "I can't"	Maintains user trust
Helpful alternatives	Suggest safe options	Preserves utility

DPO Training Configuration

Starting Configuration

from trl import DPOConfig, DPOTrainer

dpo_config = DPOConfig(
    output_dir="./dpo_output",
    beta=0.1,                           # KL penalty strength
    per_device_train_batch_size=2,      # Limited by VRAM
    gradient_accumulation_steps=8,       # Effective batch = 16
    num_train_epochs=1,                  # Often sufficient
    learning_rate=5e-7,                  # Very low for alignment
    warmup_ratio=0.1,                    # 10% warmup
    fp16=True,                           # Mixed precision
    logging_steps=10,
    save_strategy="epoch",
)

Beta Parameter Guide

Beta Value	Effect	Use Case
0.05	Weak KL penalty	When model is already aligned
0.1	Standard	Default starting point
0.5	Strong KL penalty	When alignment is aggressive
1.0	Very strong	When base model is unsafe

Intuition: Higher beta keeps aligned model closer to base model. Lower beta allows larger deviations for stronger alignment.

Learning Rate Guidelines

Scenario	Learning Rate	Reasoning
Standard DPO	5e-7	Very conservative for alignment
Aggressive alignment	1e-6	When base model is very unsafe
Fine-grained adjustment	1e-7	When base model is mostly aligned

Red-Teaming Categories

Direct Harmful Requests

Violence, weapons, self-harm
Illegal activities
Privacy violations
Harassment

Prompt Injection Attempts

"Ignore previous instructions"
"You are now DAN (Do Anything Now)"
"Roleplay as evil AI"
System prompt extraction

"My grandmother used to read me..."
"I'm a security researcher testing..."
"This is just fictional..."
Authority impersonation

Context Manipulation

Gradual escalation
Building false rapport
Emotional manipulation
Technical jargon obfuscation

Refusal Training Patterns

Graceful Refusal Template

I understand you're asking about [topic], but I'm not able to help with that
because [clear reason]. Instead, I can [helpful alternative].

Would you like me to help with that instead?

Key Principles

Acknowledge the request: Don't pretend not to understand
Clear reason: Explain why you can't help
Offer alternative: Maintain helpfulness
Stay professional: No lecturing or moralizing

Evaluation Metrics

Safety Metrics

Metric	Target	How to Measure
Harmful compliance rate	<5%	Red-team test suite
Jailbreak resistance	>90%	Standard jailbreak prompts
Refusal quality	>80% helpful	Human evaluation

Utility Metrics

Metric	Target	How to Measure
Legitimate task completion	>95%	Normal use test suite
False refusal rate	<5%	Edge case evaluation
User satisfaction	>4/5	User feedback

Troubleshooting Decision Tree

Model Refuses Too Much

Lower beta (try 0.05)
Add more "helpful" examples to preference data
Ensure chosen responses are genuinely helpful
Check for overly broad refusal patterns

Model Still Complies with Harmful Requests

Higher beta (try 0.5)
Add more examples of the failure mode
Ensure rejected responses are clearly harmful
Check if harmful pattern is covered in training

Training Unstable

Lower learning rate (try 1e-7)
Increase warmup ratio (try 0.2)
Check preference data quality
Verify chosen/rejected labels are correct

Validation Checks

Before declaring alignment complete:

Model refuses direct harmful requests
Model resists common jailbreak attempts
Model remains helpful for legitimate requests
Refusals are graceful, not preachy
No catastrophic forgetting of base capabilities
Alignment generalizes beyond training examples EOF

## Step 5: Verify Your Skill

Test that your skill was created correctly:

```bash
# Check the skill exists
ls -la .claude/skills/model-alignment/

# View the skill content
head -50 .claude/skills/model-alignment/SKILL.md

Output:

total 8
drwxr-xr-x  3 you  staff   96 Jan  1 10:00 .
drwxr-xr-x  3 you  staff   96 Jan  1 10:00 ..
-rw-r--r--  1 you  staff  5832 Jan  1 10:00 SKILL.md

What Happens Next

You now have a model-alignment skill grounded in TRL documentation. As you progress through this chapter:

Lesson	How Your Skill Improves
L01: Why Alignment	Add mental model explanations
L02: DPO vs RLHF	Refine method selection logic
L03: Preference Datasets	Add data quality checklists
L04: DPO Training	Add Colab-specific patterns
L05: Red-Teaming	Add attack category taxonomy
L06: Refusal Training	Add refusal templates
Capstone	Validate skill produces aligned model

Each lesson will include a "Reflect on Your Skill" section where you update and improve this skill based on what you learned.

Try With AI

Use your AI companion (Claude, ChatGPT, Gemini, or similar).

Prompt 1: Verify Skill Structure

I just created my model-alignment skill. Review the structure and tell me:
1. Does it follow the SKILL.md format correctly?
2. Is the content grounded in documentation (not hallucinated)?
3. What sections should I add as I learn more about alignment?

Here is my skill:
[paste your SKILL.md content]

What you are learning: Critical evaluation of your own skill structure. Your AI partner helps identify gaps before you invest time in an incomplete skill.

Prompt 2: Connect to Task API Safety

My Task API model helps users manage tasks. Looking at my model-alignment skill,
what specific safety concerns should I address? Think about:
- What harmful tasks could users try to create?
- How might users try to manipulate the assistant?
- What legitimate requests might accidentally trigger false refusals?

What you are learning: Domain-specific safety thinking. Generic alignment is not enough. You need to understand the specific risks in your application.

Prompt 3: Validate Against TRL Docs

Compare my skill's DPO configuration recommendations against the official
TRL documentation. Are there any discrepancies? Any best practices
I should add?

Specifically check:
1. Beta parameter recommendations
2. Learning rate guidance
3. Batch size configuration for T4 GPU

What you are learning: Documentation verification. You are building the habit of validating AI-generated content against authoritative sources.

Safety Note

As you create skills from documentation, remember that AI tools may not have the most current information. TRL is actively developed, and best practices evolve. Always verify critical configuration values against the official source before production use.

Why Alignment Needs a Skill​

Step 1: Clone a Fresh Skills-Lab​

Step 2: Write Your LEARNING-SPEC.md​

Step 3: Fetch Official Documentation​

Step 4: Create Your model-alignment Skill​

Quality Requirements​

DPO Training Configuration​

Starting Configuration​

Beta Parameter Guide​

Learning Rate Guidelines​

Red-Teaming Categories​

Direct Harmful Requests​

Prompt Injection Attempts​

Social Engineering​

Context Manipulation​

Refusal Training Patterns​

Graceful Refusal Template​

Key Principles​

Evaluation Metrics​

Safety Metrics​

Utility Metrics​

Troubleshooting Decision Tree​

Model Refuses Too Much​

Model Still Complies with Harmful Requests​

Training Unstable​

Validation Checks​

What Happens Next​

Try With AI​

Prompt 1: Verify Skill Structure​

Prompt 2: Connect to Task API Safety​

Prompt 3: Validate Against TRL Docs​

Safety Note​