Structured Output Training

You know what tool-calling patterns look like. Now you'll learn how to train a model to produce them reliably. This lesson covers the techniques that achieve 99%+ JSON validity—from data format requirements to training configuration.

The key insight: structured output isn't about teaching the model new knowledge. It's about teaching the model a new output format that it must follow precisely.

The JSON Validity Challenge

A model trained for natural language generation has no incentive to produce valid JSON. Consider what can go wrong:

| Error Type | Example | Frequency |
| --- | --- | --- |
| Missing quotes | {name: "value"} | Common |
| Wrong quote type | {'name': 'value'} | Common |
| Trailing comma | {"a": 1, "b": 2,} | Very common |
| Unescaped content | "title": "Review "Q4" budget" | Common |
| Incomplete JSON | {"name": "value" | Occasional |
| Mixed content | {"name": "value"} Great! | Common |

Even one of these errors per 100 responses means your agent fails 1% of the time. At 10,000 daily calls, that's 100 failures—unacceptable for production.

Target: 99%+ JSON validity (fewer than 1 in 100 malformed responses)
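
To know where you stand, measure validity directly. A minimal sketch that computes the validity rate over a list of raw model responses (the responses variable is assumed to come from your own evaluation harness):

import json

def json_validity_rate(responses: list[str]) -> float:
    """Fraction of responses that parse as valid JSON."""
    if not responses:
        return 0.0
    valid = 0
    for r in responses:
        try:
            json.loads(r)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(responses)

# Production gate: json_validity_rate(responses) >= 0.99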

Data Format Requirements

Training data for structured outputs must be precisely formatted:

Correct String Escaping

// User input might contain quotes
"user": "Create a task called \"Budget Review\" for Q4"

// The arguments field must escape internal JSON
"arguments": "{\"title\": \"Budget Review\", \"description\": \"Q4 analysis\"}"

Common escaping needs:

  • Double quotes inside strings: \"
  • Backslashes: \\
  • Newlines: \n
  • Tab characters: \t
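
Rather than escaping by hand, build the arguments as a Python dict and serialize it; json.dumps applies all of these escapes for you. A minimal sketch:

import json

# json.dumps handles quotes, backslashes, newlines, and tabs automatically.
args = {"title": 'Review "Q4" budget', "description": "Line one\nLine two"}
arguments_str = json.dumps(args)
print(arguments_str)
# {"title": "Review \"Q4\" budget", "description": "Line one\nLine two"}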

Consistent Null Handling

When the assistant outputs tool calls, content must be null:

{
"role": "assistant",
"content": null, // NOT "", NOT "null", NOT missing
"tool_calls": [...]
}
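
A small helper can enforce this rule during data preparation (the same convention the validation script below checks):

def content_is_valid(msg: dict) -> bool:
    # When tool_calls are present, "content" must exist and be exactly None.
    if "tool_calls" not in msg:
        return True
    return "content" in msg and msg["content"] is None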

Type Consistency

Arguments must match the schema types exactly:

// If schema says priority is a string enum
"priority": "high" // CORRECT
"priority": 1 // WRONG - integer instead of string

// If schema says count is an integer
"count": 5 // CORRECT
"count": "5" // WRONG - string instead of integer

Generating Training Data with AI

You'll generate hundreds of training examples. AI can help—but you must validate.

The Generation Loop

┌─────────────────────────────────────────────┐
│ 1. Define example template with schema      │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│ 2. Generate diverse user messages with AI   │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│ 3. Generate corresponding tool calls        │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│ 4. Validate JSON and schema compliance      │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│ 5. Review edge cases, fix errors            │
└─────────────────────────────────────────────┘
                      ↓
              ┌─────────────────┐
              │  Repeat until   │
              │  500+ examples  │
              └─────────────────┘
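
Wired together, the loop might look like this. A sketch, assuming a hypothetical generate_batch() helper that asks your AI assistant for raw JSONL lines; validate_training_example() is the function defined in the Validation Script section below:

import json

def build_dataset(target: int = 500) -> list[dict]:
    """Accumulate validated examples until the target count is reached."""
    dataset: list[dict] = []
    while len(dataset) < target:
        for line in generate_batch(n=20):  # hypothetical AI-generation helper
            try:
                example = json.loads(line)  # step 4: must parse as JSON
            except json.JSONDecodeError:
                continue
            if not validate_training_example(example):  # step 4: schema check
                dataset.append(example)
            # step 5: in a real pipeline, log failures for manual review
    return dataset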

Collaborative Data Generation

Your request to AI:

Generate 20 diverse user messages for the create_task tool.
Vary:
- Formality (casual to professional)
- Completeness (title only vs all fields)
- Phrasing (command, request, statement)
- Edge cases (special characters, dates, priorities)

For each message, generate the correct tool_calls output.
Format as JSONL with one example per line.

AI's response (partial):

{"messages":[{"role":"system","content":"..."},{"role":"user","content":"Add a task to buy milk"},{"role":"assistant","content":null,"tool_calls":[{"id":"call_001","type":"function","function":{"name":"create_task","arguments":"{\"title\":\"Buy milk\"}"}}]}]}
{"messages":[{"role":"system","content":"..."},{"role":"user","content":"Create high priority task: Review Q4 budget by Friday"},{"role":"assistant","content":null,"tool_calls":[{"id":"call_002","type":"function","function":{"name":"create_task","arguments":"{\"title\":\"Review Q4 budget\",\"priority\":\"high\",\"due_date\":\"2024-01-19\"}"}}]}]}

Your validation feedback:

I found issues in examples 5, 12, and 17:
- Example 5: "priortiy" typo in arguments
- Example 12: date format wrong (01/19/2024 instead of 2024-01-19)
- Example 17: missing closing brace

Please fix these and regenerate. Also, none of the examples
include special characters in titles. Add 5 examples with:
- Quotes in task titles
- Ampersands or other special chars
- Very long titles (50+ chars)

This back-and-forth is Layer 2 collaboration in action: you specify requirements, AI generates, you validate and refine.

Validation Script

Automate validation to catch errors before training:

import json
from typing import List, Dict

def validate_training_example(example: Dict) -> List[str]:
    """Validate a single training example. Returns list of errors."""
    errors = []

    # Check messages structure
    if "messages" not in example:
        errors.append("Missing 'messages' key")
        return errors

    messages = example["messages"]

    # Check for required roles
    roles = [m.get("role") for m in messages]
    if "system" not in roles:
        errors.append("Missing system message")
    if "user" not in roles:
        errors.append("Missing user message")
    if "assistant" not in roles:
        errors.append("Missing assistant message")

    # Validate assistant messages with tool_calls
    for msg in messages:
        if msg.get("role") == "assistant" and "tool_calls" in msg:
            # Content must be null
            if msg.get("content") is not None:
                errors.append("Assistant content must be null when tool_calls present")

            # Validate each tool call
            for tc in msg.get("tool_calls", []):
                if "function" not in tc:
                    errors.append(f"Tool call missing 'function': {tc}")
                    continue

                func = tc["function"]

                # Validate function name
                valid_tools = ["create_task", "update_task", "complete_task", "list_tasks"]
                if func.get("name") not in valid_tools:
                    errors.append(f"Unknown tool: {func.get('name')}")

                # Validate arguments is a valid JSON string
                args_str = func.get("arguments", "")
                try:
                    args = json.loads(args_str)

                    # Validate argument types based on tool
                    if func.get("name") == "create_task":
                        if "title" not in args:
                            errors.append("create_task missing required 'title'")
                        if "priority" in args and args["priority"] not in ["low", "medium", "high"]:
                            errors.append(f"Invalid priority: {args['priority']}")

                except json.JSONDecodeError as e:
                    errors.append(f"Invalid JSON in arguments: {e}")

    return errors


def validate_dataset(filepath: str) -> Dict:
    """Validate an entire dataset file (JSONL format)."""
    results = {"valid": 0, "invalid": 0, "errors": []}

    with open(filepath, "r") as f:
        for line_num, line in enumerate(f, 1):
            try:
                example = json.loads(line)
                errors = validate_training_example(example)

                if errors:
                    results["invalid"] += 1
                    results["errors"].append({
                        "line": line_num,
                        "errors": errors
                    })
                else:
                    results["valid"] += 1

            except json.JSONDecodeError as e:
                results["invalid"] += 1
                results["errors"].append({
                    "line": line_num,
                    "errors": [f"JSON parse error: {e}"]
                })

    return results

Output:

results = validate_dataset("training_data.jsonl")
print(f"Valid: {results['valid']}, Invalid: {results['invalid']}")
for err in results["errors"][:3]:
    print(f"- Line {err['line']}: {'; '.join(err['errors'])}")

# Valid: 485, Invalid: 15
# - Line 23: Invalid priority: HIGH
# - Line 45: create_task missing required 'title'
# - Line 67: Invalid JSON in arguments: Expecting ',' delimiter

Fix all errors before training. A clean dataset is essential.

Training Configuration for Structured Output

Several training settings affect structured output quality:

Temperature and Sampling

Temperature and sampling settings don't apply during training itself, but they matter when you generate synthetic data or evaluate the tuned model:

# For data generation - allow some creativity
temperature = 0.7
top_p = 0.9

# For structured output inference - be deterministic
temperature = 0.0 # or very low like 0.1
top_p = 1.0

Low temperature at inference time reduces JSON errors dramatically.
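
For example, with the OpenAI Python SDK the inference call might look like this (the fine-tuned model ID and the tools list are placeholders for your own values):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:your-org::abc123",  # placeholder fine-tune ID
    messages=[{"role": "user", "content": "Add a task to buy milk"}],
    tools=tools,       # your create_task / update_task / ... tool schemas
    temperature=0.0,   # deterministic decoding for structured output
    top_p=1.0,
)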

Loss Masking Strategy

Standard training applies loss to all tokens. For structured output, you want the model to focus on the structured parts:

# Conceptual loss masking for tool calls
training_example = {
"messages": [
{"role": "system", "content": "..."}, # Don't train on this
{"role": "user", "content": "..."}, # Don't train on this
{"role": "assistant", "tool_calls": [...]} # TRAIN ON THIS
]
}

Most fine-tuning frameworks (OpenAI, Hugging Face TRL) handle this automatically when you format data correctly.
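
In PyTorch-based frameworks the masking is expressed through the label tensor: positions set to -100 are ignored by the cross-entropy loss. A conceptual sketch, where prompt_len is an assumed boundary between prompt and assistant tokens:

import torch

# Token IDs for one example: system + user prompt followed by the tool call.
input_ids = torch.tensor([[101, 102, 103, 104, 105, 106]])
prompt_len = 3  # assumed: first 3 tokens are the system and user messages

labels = input_ids.clone()
labels[:, :prompt_len] = -100  # -100 is the ignore_index for cross-entropy loss
# Loss is now computed only on the assistant's tool-call tokens.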

Special Token Handling

The model needs to learn when to output tool calls vs natural language. Some frameworks use special tokens:

<|tool_call|>
{"name": "create_task", "arguments": {...}}
<|/tool_call|>

When using OpenAI-compatible fine-tuning, this is handled automatically. For custom training, ensure your tokenizer and data format align.
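
If you do manage the format yourself, a small formatter keeps the delimiters consistent with the example above (the token strings are illustrative; match whatever your tokenizer actually defines):

import json

def format_tool_call(name: str, arguments: dict) -> str:
    """Wrap a tool call in the special-token delimiters shown above."""
    payload = json.dumps({"name": name, "arguments": arguments})
    return f"<|tool_call|>\n{payload}\n<|/tool_call|>"

print(format_tool_call("create_task", {"title": "Buy milk"}))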

Data Quality Metrics

Track these metrics for your training dataset:

| Metric | Target | Why It Matters |
| --- | --- | --- |
| JSON validity | 100% | Invalid examples teach bad patterns |
| Schema compliance | 100% | Types and required fields must be correct |
| Tool distribution | Balanced | Don't train 90% on create_task only |
| Argument diversity | High | Variety in values, not just structure |
| Edge case coverage | 10%+ | Special characters, long strings, nulls |

Example Distribution Check

import json
from collections import Counter
from typing import Dict

def analyze_tool_distribution(filepath: str) -> Dict[str, int]:
    """Count tool usage across the dataset."""
    tool_counts = Counter()

    with open(filepath) as f:
        for line in f:
            example = json.loads(line)
            for msg in example.get("messages", []):
                if msg.get("role") == "assistant" and "tool_calls" in msg:
                    for tc in msg["tool_calls"]:
                        tool_name = tc["function"]["name"]
                        tool_counts[tool_name] += 1

    return dict(tool_counts)

# Check distribution
dist = analyze_tool_distribution("training_data.jsonl")
print(dist)
# {'create_task': 200, 'update_task': 150, 'complete_task': 100, 'list_tasks': 50}

# Rebalance if needed - list_tasks is underrepresented

Building Your Dataset

Aim for 500+ examples with this distribution:

| Category | Count | Description |
| --- | --- | --- |
| create_task | 150 | Varied titles, priorities, dates |
| update_task | 100 | Different fields being updated |
| complete_task | 75 | Simple completions |
| list_tasks | 75 | Different filter combinations |
| Multi-turn | 50 | Conversations with 2+ tool calls |
| Edge cases | 50 | Special chars, errors, corrections |

Reflect on Your Skill

Update your agentic-tuning skill:

  1. Add validation requirements: JSON validity 100%, schema compliance 100%
  2. Add data generation workflow: Template -> Generate -> Validate -> Fix -> Repeat
  3. Add distribution targets: Balanced tools, 10%+ edge cases

Try With AI

Prompt 1: Generate Edge Cases

I need edge case training examples for create_task. Generate 10 examples
covering:
- Titles with quotes, apostrophes, and special characters
- Very short titles (1-2 words)
- Very long titles (50+ characters)
- Dates in various natural language formats ("tomorrow", "next Friday")
- Missing optional fields
- Invalid inputs that should be handled gracefully

Show both the user message and correct tool_calls for each.

What you're learning: Edge case anticipation—training data must cover unusual inputs to ensure robustness.

Prompt 2: Validate My Examples

I generated these training examples. Please validate each one:

[paste 5-10 examples]

Check for:
1. Valid JSON (proper escaping, no syntax errors)
2. Schema compliance (correct types, required fields)
3. Logical correctness (arguments match user intent)
4. Format compliance (null content, proper tool_calls structure)

For each error, explain what's wrong and show the fix.

What you're learning: Collaborative validation—using AI as a second pair of eyes to catch errors you might miss.

Prompt 3: Improve Data Diversity

Here's my current training data distribution:
- create_task: 200 examples
- update_task: 50 examples
- complete_task: 30 examples
- list_tasks: 20 examples

The distribution is unbalanced. Help me:
1. Generate 50 more update_task examples with varied update scenarios
2. Generate 45 more complete_task examples
3. Generate 55 more list_tasks examples with different filter combinations

For list_tasks, include combinations like:
- No filters (list all)
- Single filter (just priority=high)
- Multiple filters (status=pending AND due_before=2024-01-20)

What you're learning: Data balancing—ensuring the model learns all tools equally well, not just the most common one.

Safety Note

When generating synthetic training data, review edge cases carefully. Examples with phrases like "delete everything" or "ignore previous instructions" should teach appropriate responses (confirmation requests, refusal) rather than compliance. Your training data shapes model behavior.