
Data Quality Principles

You've built your data engineering skills. Now let's examine the principle that makes every fine-tuning effort succeed or fail: the quality of your training data determines the quality of your fine-tuned model.

This isn't just "garbage in, garbage out." It's worse. Fine-tuning amplifies your data's flaws, encoding mistakes into the model's weights permanently.


The Amplification Problem

When you fine-tune a model, you're teaching it to behave like your training examples. Every pattern in your data becomes a pattern in your model:

| What's in Your Data | What's in Your Model |
| --- | --- |
| Accurate task completions | Accurate task completions |
| Typos and formatting errors | Typos and formatting errors |
| Inconsistent output styles | Inconsistent output styles |
| Missing edge case handling | Missing edge case handling |
| Hallucinated information | Hallucinated information |

The model doesn't know which patterns are intentional and which are mistakes. It learns everything with equal confidence.

Consider this example from a Task API training dataset:

{
  "instruction": "Create a task to call the doctor",
  "output": "Created task: 'Call the doctor' with priority: high, due: tomorrow"
}

If your dataset has 50 examples whose outputs assume "high" priority when none is specified, and 50 whose outputs assume "medium," your fine-tuned model will choose between the two unpredictably. You've trained inconsistency into the weights.
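One way to catch this kind of split before training is to count what the outputs assume whenever the instruction leaves a field unspecified. Here is a minimal sketch, assuming examples shaped like the one above; the regexes and the check_default_priority helper are illustrative, not part of any library:

import re
from collections import Counter

def check_default_priority(dataset):
    """Counts which priority the outputs assume when the input doesn't specify one."""
    assumed = Counter()
    for example in dataset:
        # Skip examples where the user explicitly stated a priority
        if re.search(r"\b(low|medium|high)\b", example["instruction"].lower()):
            continue
        match = re.search(r"priority:\s*(low|medium|high)", example["output"].lower())
        if match:
            assumed[match.group(1)] += 1
    # A result like Counter({'high': 50, 'medium': 50}) signals trained-in inconsistency
    return assumed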


The Four Quality Dimensions

Training data quality has four measurable dimensions. Each affects your fine-tuned model differently:

Dimension 1: Accuracy

Definition: Outputs are factually correct and achieve the stated goal.

Impact of Failures: Hallucinations. The model learns to produce wrong information with high confidence.

Examples in Task API context:

| Accurate | Inaccurate |
| --- | --- |
| "Created task 'Buy groceries' due Monday" (user said Monday) | "Created task 'Buy groceries' due Tuesday" (user said Monday) |
| "Task updated: priority changed to high" (task exists) | "Task updated: priority changed to high" (task doesn't exist) |
| "Error: Cannot create task without a title" (correct validation) | "Created task with empty title" (invalid behavior) |

Detection method: Human review of outputs against inputs. Expensive but essential.

# Accuracy check: Does output match what the input requested?
def check_accuracy(example):
    """Returns True if the output correctly fulfills the input request."""
    # This requires domain knowledge—no automated shortcut
    input_text = example["instruction"]
    output_text = example["output"]

    # For Task API: extract what was requested vs what was done
    # (parse_task_request/parse_task_response are domain-specific helpers)
    requested = parse_task_request(input_text)
    completed = parse_task_response(output_text)

    return requested == completed  # Simplified; real check is nuanced
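Human review doesn't scale to every example, so a common compromise is to review a random sample (the audit checklist later in this lesson suggests at least 10% of the dataset). A small sketch of drawing that sample reproducibly; sample_for_review is an illustrative name, not an existing tool:

import random

def sample_for_review(dataset, fraction=0.10, seed=42):
    """Selects a reproducible random sample for manual accuracy review."""
    sample_size = max(1, int(len(dataset) * fraction))
    rng = random.Random(seed)
    return rng.sample(dataset, sample_size)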

Dimension 2: Consistency

Definition: Same format, style, and patterns across all examples.

Impact of Failures: Unpredictable outputs. Sometimes the model uses one format, sometimes another.

Examples in Task API context:

| Consistent | Inconsistent |
| --- | --- |
| All outputs: "Created task: '[title]'" | Some: "Created task: '[title]'"; some: "I've created a task called [title]"; some: "Done! Task [title] is ready" |
| All priorities: "low", "medium", "high" | Some: "low", "medium", "high"; some: "1", "2", "3"; some: "!", "!!", "!!!" |
| All dates: "due: YYYY-MM-DD" | Some: "due: Monday"; some: "due: 2025-01-15"; some: "due: next week" |

Detection method: Schema validation and pattern matching.

import re
from typing import Literal

from pydantic import BaseModel, validator

class TaskResponse(BaseModel):
    """Enforces consistent output format."""
    action: Literal["created", "updated", "completed", "deleted"]
    title: str
    priority: Literal["low", "medium", "high"] | None = None
    due_date: str | None = None  # YYYY-MM-DD format

    @validator("due_date")
    def validate_date_format(cls, v):
        if v and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", v):
            raise ValueError("Date must be YYYY-MM-DD")
        return v
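To apply the schema across a whole dataset before training, you can flag the examples that fail validation. This is a minimal sketch that assumes each example's output is either a JSON-encoded string or an already-parsed dict matching TaskResponse (free-text outputs would need parsing first); check_consistency is an illustrative helper:

import json

from pydantic import ValidationError

def check_consistency(dataset):
    """Returns indices of examples whose output fails the TaskResponse schema."""
    failures = []
    for i, example in enumerate(dataset):
        output = example["output"]
        try:
            # Accept either a JSON-encoded string or an already-parsed dict
            data = json.loads(output) if isinstance(output, str) else output
            TaskResponse(**data)
        except (ValidationError, json.JSONDecodeError, TypeError):
            failures.append(i)
    return failures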

Dimension 3: Coverage

Definition: Dataset includes all scenarios the model will encounter in production.

Impact of Failures: Gaps. The model doesn't know how to handle situations not in the training data.

Examples in Task API context:

| Complete Coverage | Coverage Gaps |
| --- | --- |
| Create, update, complete, delete operations | Only create and complete (update/delete missing) |
| Low, medium, high priority variations | Only high priority examples |
| Valid requests AND error cases | Only happy-path examples |
| Different phrasing styles | Only formal phrasing |

Detection method: Gap analysis against requirements.

# Coverage check: Do we have examples for all required scenarios?
REQUIRED_SCENARIOS = {
    "operations": ["create", "update", "complete", "delete", "list"],
    "priorities": ["low", "medium", "high", None],  # None = unspecified
    "error_cases": ["missing_title", "invalid_date", "task_not_found"],
    "phrasing": ["formal", "casual", "imperative", "question"],
}

def check_coverage(dataset):
    """Returns the required scenarios missing from the dataset, by category."""
    covered = {category: set() for category in REQUIRED_SCENARIOS}

    for example in dataset:
        # Categorize each example (extract_* are domain-specific helpers)
        covered["operations"].add(extract_operation(example))
        covered["priorities"].add(extract_priority(example))
        # ... same for error_cases and phrasing

    missing = {}
    for category, required in REQUIRED_SCENARIOS.items():
        missing[category] = set(required) - covered[category]

    return missing

Dimension 4: Diversity

Definition: Varied phrasing, styles, and approaches—not just the same example repeated.

Impact of Failures: Brittleness. The model only understands inputs phrased exactly like training examples.

Examples in Task API context:

| Diverse | Repetitive |
| --- | --- |
| "Create a task to buy milk"; "Add 'buy milk' to my list"; "I need to remember to buy milk"; "New task: buy milk" | "Create a task to buy milk"; "Create a task to buy bread"; "Create a task to buy eggs" (same pattern, different nouns) |

Detection method: Embedding clustering to find near-duplicates.

from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

def check_diversity(dataset, threshold=0.85):
    """Finds clusters of too-similar examples."""
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Embed all instructions
    instructions = [ex["instruction"] for ex in dataset]
    embeddings = model.encode(instructions)

    # Cluster similar examples
    clustering = DBSCAN(eps=1 - threshold, min_samples=2, metric="cosine")
    clusters = clustering.fit_predict(embeddings)

    # Find over-represented clusters (-1 means "no cluster")
    cluster_sizes = Counter(clusters)

    # Return clusters with too many similar examples (>10% of the dataset)
    repetitive = {c: size for c, size in cluster_sizes.items()
                  if c != -1 and size > len(dataset) * 0.1}

    return repetitive

The Quality Hierarchy

Not all dimensions are equal. Here's the priority order:

Priority 1: ACCURACY
↓ Inaccurate data teaches hallucination

Priority 2: CONSISTENCY
↓ Inconsistent data teaches unpredictability

Priority 3: COVERAGE
↓ Missing coverage creates gaps

Priority 4: DIVERSITY
↓ Low diversity creates brittleness

Rule: Fix accuracy issues before worrying about diversity. A dataset with 100% accuracy and low diversity outperforms a dataset with high diversity and 80% accuracy.


Quantity vs. Quality: The Real Tradeoff

Conventional wisdom suggests "more data is better." For fine-tuning, this is wrong.

| Dataset Size | Quality | Result |
| --- | --- | --- |
| 100 examples | 100% accurate, consistent | Excellent: model learns correct patterns |
| 1000 examples | 80% accurate, inconsistent | Poor: model learns mistakes confidently |
| 500 examples | 95% accurate, 90% consistent | Good: model mostly correct, minor inconsistencies |

The research backs this up:

  • Stanford Alpaca (52K examples) performed worse than later datasets with fewer but higher-quality examples
  • Hugging Face studies show diminishing returns after ~500-1000 high-quality examples for most tasks
  • OpenAI fine-tuning guidance recommends starting with 50-100 examples and iterating on quality

Bottom line: Start with fewer examples, verify quality obsessively, then expand if needed.


Quality Audit Checklist

Before training, run this checklist:

## Pre-Training Quality Audit

### Accuracy (Priority 1)
- [ ] Sample reviewed by domain expert (min 10% of dataset)
- [ ] Zero known factual errors in reviewed samples
- [ ] Error cases produce appropriate error messages

### Consistency (Priority 2)
- [ ] Output format validated against schema (100% pass)
- [ ] Date/time formats standardized
- [ ] Priority/status values from controlled vocabulary
- [ ] Tone/style consistent across examples

### Coverage (Priority 3)
- [ ] All required operations included (create/update/complete/delete)
- [ ] Error cases represented (min 10% of dataset)
- [ ] Edge cases included (empty inputs, very long inputs)
- [ ] Multiple phrasing styles per operation

### Diversity (Priority 4)
- [ ] Embedding clustering shows no repetitive clusters
- [ ] Input lengths vary (short commands to long descriptions)
- [ ] Different user personas represented
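The automatable parts of this checklist can be wired into a single gate that runs before training. Here is a minimal sketch that reuses the illustrative check_consistency, check_coverage, and check_diversity helpers from earlier in this lesson; the accuracy review stays manual:

def pre_training_audit(dataset):
    """Runs the automatable quality gates; accuracy review remains a human task."""
    report = {
        "schema_failures": check_consistency(dataset),    # Consistency
        "missing_scenarios": check_coverage(dataset),     # Coverage
        "repetitive_clusters": check_diversity(dataset),  # Diversity
    }

    # Block training if any automated gate fails
    report["passed"] = (
        not report["schema_failures"]
        and not any(report["missing_scenarios"].values())
        and not report["repetitive_clusters"]
    )
    return report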

Real-World Case Study: The Alpaca Problem

When Stanford released Alpaca-7B in 2023, it was trained on 52,000 instruction-following examples generated by GPT-3.5. It seemed impressive at first. But quality issues emerged:

| Issue | Impact on Model |
| --- | --- |
| Inconsistent output formats | Model randomly chose between formats |
| Factual errors propagated from GPT-3.5 | Model hallucinated with confidence |
| Limited coverage (mostly English, general topics) | Model struggled with specialized domains |
| Repetitive phrasing patterns | Model was brittle to input variations |

Later datasets like Dolly (15K examples, human-written) and OpenAssistant (10K examples, quality-filtered) performed better with fewer examples because quality was prioritized.

Lesson: 52,000 mediocre examples lose to 10,000 excellent ones.


Applying This to Task API

For your Task API fine-tuning project, here's what quality means:

| Dimension | Task API Standard |
| --- | --- |
| Accuracy | Every task creation matches what was requested; every update modifies the correct fields; every error message is appropriate |
| Consistency | All outputs follow the same JSON structure; dates always YYYY-MM-DD; priorities always low/medium/high |
| Coverage | All CRUD operations; all priorities; error cases (missing title, invalid date, not found); varied input styles |
| Diversity | "Create task", "Add to list", "I need to remember", "New task", "Schedule", etc. for the same operation |

The next lessons teach you how to create data that meets these standards. This lesson ensures you know why each standard matters.


Try With AI

Prompt 1: Audit Sample Data

Here's a sample from a training dataset for a Task API assistant:

{"instruction": "make a task for the meeting", "output": "Created task 'meeting'"}
{"instruction": "new task buy groceries", "output": "I've added 'buy groceries' to your tasks!"}
{"instruction": "add task call mom", "output": "Task created: Call mom, Priority: high"}
{"instruction": "create a task to review the budget by Friday", "output": "Done! Task 'review budget' added for Friday."}

Using the four quality dimensions (accuracy, consistency, coverage, diversity),
identify issues in this dataset. For each issue:
1. What dimension does it violate?
2. How would this affect a fine-tuned model?
3. What would you change?

What you're learning: Applying the quality framework to real (flawed) examples. The issues are intentional—your job is to find them.

Prompt 2: Design Quality Gates

I'm building a data pipeline for fine-tuning. Help me design automated
quality gates that run before training:

1. What can I check automatically (Pydantic schemas, regex, embeddings)?
2. What requires human review (accuracy, domain correctness)?
3. How do I balance automation vs manual review for a 500-example dataset?

Be specific about tools and thresholds.

What you're learning: Practical implementation of quality control. Some checks are automatable; others require human judgment. Understanding the boundary is crucial.

Prompt 3: Predict Model Failures

Imagine a training dataset with these characteristics:
- 80% of examples are task creation, 15% are task completion, 5% are updates
- All examples use formal language ("Please create a task for...")
- No error case examples (all inputs are valid)
- Consistent output format

If I fine-tune on this dataset, what problems will the model have in production?
Be specific about failure modes and how they trace back to data quality issues.

What you're learning: Predicting downstream effects of data quality decisions. This mental model helps you catch problems before they're encoded in weights.

Safety Note

Data quality assessment requires domain expertise. AI can help identify patterns and inconsistencies, but determining whether an output is "correct" requires human judgment about what correct means in your context. Always have a domain expert review accuracy before training.