Capstone - Production-Ready Dataset
This is the culmination of your data engineering work. You've learned dataset formats, quality dimensions, synthetic generation, and versioning. Now you'll combine everything into a production-ready pipeline.
But we're going to do this the right way: specification first.
You don't write code and hope it works. You define success criteria, build to meet them, and validate the result. This is how professional data engineers work - and it's how you'll avoid the "I think it's good enough" trap that leads to failed fine-tuning runs.
The Spec-First Approach
Before writing any code, answer: What does "production-ready" mean for this dataset?
If you can't measure it, you can't achieve it. Let's make "production-ready" concrete.
Step 1: Write the Data Quality Specification
Create specs/task-api-dataset-spec.md:
# Task API Dataset Specification
## Overview
Production-ready training dataset for fine-tuning Task API assistant.
**Dataset Purpose**: Enable a language model to understand and respond to task management requests with high accuracy and appropriate error handling.
## Success Criteria
### 1. Format Compliance (Blocking)
| Criterion | Threshold | Measurement |
|-----------|-----------|-------------|
| Valid JSON | 100% | Pydantic validation passes |
| ShareGPT schema | 100% | All examples have 'conversations' array |
| Alternating turns | 100% | human/gpt/human/gpt pattern |
| Non-empty messages | 100% | No empty 'value' fields |
**Gate**: If ANY format criterion fails, pipeline blocks.
### 2. Coverage Requirements (Blocking)
| Operation | Minimum Examples | Scenario Types |
|-----------|-----------------|----------------|
| Create | 20 | basic, missing-info, invalid-input |
| Update | 20 | status, partial, not-found |
| Complete | 15 | normal, already-complete |
| Delete | 15 | confirm, active-task |
| Query | 15 | filter, empty-results |
| **Total** | **85** | |
**Gate**: If coverage drops below minimum, pipeline blocks.
### 3. Quality Metrics (Warning)
| Metric | Target | Acceptable |
|--------|--------|------------|
| Avg human message length | 30-80 chars | 20-100 chars |
| Avg assistant message length | 80-200 chars | 50-300 chars |
| Unique messages | >95% | >90% |
| Turn distribution variance | <2.0 | <3.0 |
**Warning**: If a quality metric lands in the "acceptable" range but misses the "target" range, the pipeline emits a warning and continues.
### 4. Reproducibility Requirements (Blocking)
| Requirement | Status |
|-------------|--------|
| Version metadata present | Required |
| SHA256 checksums | Required |
| Random seed documented | Required |
| Generation model pinned | Required |
| Dataset card complete | Required |
**Gate**: If ANY reproducibility requirement missing, pipeline blocks.
### 5. Split Requirements (Blocking)
| Split | Percentage | Min Examples |
|-------|------------|--------------|
| Train | 80% | 68 |
| Validation | 10% | 8 |
| Test | 10% | 8 |
**Gate**: Splits must meet minimum counts and not overlap.
## Validation Report Format
Final validation produces a report:
\`\`\`
=== DATASET VALIDATION REPORT ===
Date: 2025-01-01T15:30:00Z
Version: 1.0.0
FORMAT COMPLIANCE
[PASS] Valid JSON: 106/106 (100%)
[PASS] ShareGPT schema: 106/106 (100%)
[PASS] Alternating turns: 106/106 (100%)
[PASS] Non-empty messages: 106/106 (100%)
COVERAGE
[PASS] Create: 28 examples (target: 20)
[PASS] Update: 24 examples (target: 20)
[PASS] Complete: 18 examples (target: 15)
[PASS] Delete: 18 examples (target: 15)
[PASS] Query: 18 examples (target: 15)
QUALITY METRICS
[PASS] Avg human length: 42.3 chars (target: 30-80)
[PASS] Avg assistant length: 127.8 chars (target: 80-200)
[PASS] Unique messages: 98.2% (target: >95%)
[PASS] Turn variance: 1.4 (target: <2.0)
REPRODUCIBILITY
[PASS] Version metadata: present
[PASS] Checksums: verified
[PASS] Random seed: 42
[PASS] Model: gpt-4o-mini
[PASS] Dataset card: complete
SPLITS
[PASS] Train: 84 (79.2%, min: 68)
[PASS] Validation: 11 (10.4%, min: 8)
[PASS] Test: 11 (10.4%, min: 8)
=== RESULT: PASS ===
Dataset is production-ready for fine-tuning.
\`\`\`
## Exit Criteria
Dataset is production-ready when:
1. All blocking gates pass
2. No critical warnings
3. Validation report shows PASS
4. Dataset card is complete
5. All files are versioned and checksummed
This specification defines exactly what "done" means. No ambiguity.
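To keep the code from drifting away from this document, you can mirror the thresholds in a small module that the quality gates import. A minimal sketch, with the numbers transcribed from the tables above (the module name and layout are assumptions, not part of the spec):

```python
# pipeline/spec_thresholds.py (hypothetical module name)
# Thresholds transcribed from specs/task-api-dataset-spec.md so the gates
# and the spec stay in sync.

COVERAGE_MINIMUMS = {
    "Create": 20,
    "Update": 20,
    "Complete": 15,
    "Delete": 15,
    "Query": 15,
}

QUALITY_RANGES = {
    # metric: (target_range, acceptable_range)
    "avg_human_length": ((30, 80), (20, 100)),
    "avg_assistant_length": ((80, 200), (50, 300)),
    "unique_messages_pct": ((95.0, 100.0), (90.0, 100.0)),
}

SPLIT_MINIMUMS = {"train": 68, "validation": 8, "test": 8}
```

The pipeline below hard-codes the same values for readability; importing them from one place instead turns a spec change into a one-line edit.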
Step 2: Implement the Pipeline
Now build a pipeline that meets the specification. The key insight: implement quality gates as code.
The Production Pipeline
Create pipeline/run_pipeline.py:
"""
Production data pipeline for Task API dataset.
Implements quality gates defined in task-api-dataset-spec.md
"""
import json
import asyncio
from pathlib import Path
from datetime import datetime
from dataclasses import dataclass, field
from enum import Enum
# Imports from previous lessons (called when generating fresh data; this capstone run loads existing output)
from generate_dataset import generate_variations, create_dataset
from validate_dataset import analyze_dataset
class GateResult(Enum):
PASS = "PASS"
WARN = "WARN"
FAIL = "FAIL"
@dataclass
class ValidationResult:
"""Result of a single validation check."""
name: str
result: GateResult
message: str
blocking: bool = True
details: dict = field(default_factory=dict)
@dataclass
class PipelineReport:
"""Complete pipeline validation report."""
timestamp: str
version: str
results: list[ValidationResult] = field(default_factory=list)
@property
def passed(self) -> bool:
"""Check if all blocking gates passed."""
return all(
r.result != GateResult.FAIL
for r in self.results
if r.blocking
)
def add(self, result: ValidationResult):
self.results.append(result)
def render(self) -> str:
"""Render report as text."""
lines = [
"=" * 50,
"DATASET VALIDATION REPORT",
"=" * 50,
f"Date: {self.timestamp}",
f"Version: {self.version}",
""
]
# Group by category
categories = {}
for r in self.results:
cat = r.name.split(":")[0] if ":" in r.name else "General"
if cat not in categories:
categories[cat] = []
categories[cat].append(r)
for cat, results in categories.items():
lines.append(cat.upper())
for r in results:
status = f"[{r.result.value}]"
name = r.name.split(":")[-1] if ":" in r.name else r.name
lines.append(f" {status} {name}: {r.message}")
lines.append("")
# Final result
lines.append("=" * 50)
if self.passed:
lines.append("RESULT: PASS")
lines.append("Dataset is production-ready for fine-tuning.")
else:
lines.append("RESULT: FAIL")
lines.append("Dataset does not meet production criteria.")
failed = [r for r in self.results if r.result == GateResult.FAIL and r.blocking]
lines.append(f"Blocking failures: {len(failed)}")
lines.append("=" * 50)
return "\n".join(lines)
class QualityGates:
"""Implementation of quality gates from specification."""
def __init__(self, conversations: list[dict]):
self.conversations = conversations
def check_format_compliance(self) -> list[ValidationResult]:
"""Check format requirements (all blocking)."""
results = []
# Valid JSON - already loaded, so passes
results.append(ValidationResult(
name="Format:Valid JSON",
result=GateResult.PASS,
message=f"{len(self.conversations)}/{len(self.conversations)} (100%)"
))
# ShareGPT schema
valid_schema = sum(1 for c in self.conversations if "conversations" in c)
pct = valid_schema / len(self.conversations) * 100
results.append(ValidationResult(
name="Format:ShareGPT schema",
result=GateResult.PASS if pct == 100 else GateResult.FAIL,
message=f"{valid_schema}/{len(self.conversations)} ({pct:.0f}%)"
))
# Alternating turns
valid_turns = 0
for conv in self.conversations:
messages = conv.get("conversations", [])
alternating = True
for i, msg in enumerate(messages):
expected = "human" if i % 2 == 0 else "gpt"
if msg.get("from") != expected:
alternating = False
break
if alternating:
valid_turns += 1
pct = valid_turns / len(self.conversations) * 100
results.append(ValidationResult(
name="Format:Alternating turns",
result=GateResult.PASS if pct == 100 else GateResult.FAIL,
message=f"{valid_turns}/{len(self.conversations)} ({pct:.0f}%)"
))
# Non-empty messages
valid_content = 0
for conv in self.conversations:
messages = conv.get("conversations", [])
if all(msg.get("value", "").strip() for msg in messages):
valid_content += 1
pct = valid_content / len(self.conversations) * 100
results.append(ValidationResult(
name="Format:Non-empty messages",
result=GateResult.PASS if pct == 100 else GateResult.FAIL,
message=f"{valid_content}/{len(self.conversations)} ({pct:.0f}%)"
))
return results
def check_coverage(self) -> list[ValidationResult]:
"""Check coverage requirements (blocking)."""
results = []
# Define keywords for each operation
operation_keywords = {
"Create": ["create", "new task", "add task", "set up"],
"Update": ["update", "change", "modify", "mark as"],
"Complete": ["complete", "finish", "done", "completed"],
"Delete": ["delete", "remove", "cancel"],
"Query": ["show", "list", "what tasks", "overdue", "filter"]
}
minimums = {
"Create": 20,
"Update": 20,
"Complete": 15,
"Delete": 15,
"Query": 15
}
counts = {op: 0 for op in operation_keywords}
for conv in self.conversations:
messages = conv.get("conversations", [])
text = " ".join(m.get("value", "").lower() for m in messages)
for op, keywords in operation_keywords.items():
if any(kw in text for kw in keywords):
counts[op] += 1
break # Count each conversation once
for op, count in counts.items():
minimum = minimums[op]
passed = count >= minimum
results.append(ValidationResult(
name=f"Coverage:{op}",
result=GateResult.PASS if passed else GateResult.FAIL,
message=f"{count} examples (target: {minimum})"
))
return results
def check_quality_metrics(self) -> list[ValidationResult]:
"""Check quality metrics (warnings only)."""
results = []
human_lengths = []
gpt_lengths = []
all_messages = []
for conv in self.conversations:
for msg in conv.get("conversations", []):
value = msg.get("value", "")
all_messages.append(value)
if msg.get("from") == "human":
human_lengths.append(len(value))
else:
gpt_lengths.append(len(value))
# Average lengths
avg_human = sum(human_lengths) / len(human_lengths) if human_lengths else 0
target_human = (30, 80)
acceptable_human = (20, 100)
if target_human[0] <= avg_human <= target_human[1]:
result = GateResult.PASS
elif acceptable_human[0] <= avg_human <= acceptable_human[1]:
result = GateResult.WARN
else:
result = GateResult.FAIL
results.append(ValidationResult(
name="Quality:Avg human length",
result=result,
message=f"{avg_human:.1f} chars (target: {target_human[0]}-{target_human[1]})",
blocking=False
))
avg_gpt = sum(gpt_lengths) / len(gpt_lengths) if gpt_lengths else 0
target_gpt = (80, 200)
acceptable_gpt = (50, 300)
if target_gpt[0] <= avg_gpt <= target_gpt[1]:
result = GateResult.PASS
elif acceptable_gpt[0] <= avg_gpt <= acceptable_gpt[1]:
result = GateResult.WARN
else:
result = GateResult.FAIL
results.append(ValidationResult(
name="Quality:Avg assistant length",
result=result,
message=f"{avg_gpt:.1f} chars (target: {target_gpt[0]}-{target_gpt[1]})",
blocking=False
))
# Uniqueness
unique = len(set(all_messages))
total = len(all_messages)
pct = unique / total * 100 if total > 0 else 0
if pct > 95:
result = GateResult.PASS
elif pct > 90:
result = GateResult.WARN
else:
result = GateResult.FAIL
results.append(ValidationResult(
name="Quality:Unique messages",
result=result,
message=f"{pct:.1f}% (target: >95%)",
blocking=False
))
return results
def check_reproducibility(self, dataset_path: str) -> list[ValidationResult]:
"""Check reproducibility requirements (blocking)."""
results = []
version_path = Path(dataset_path) / "version.json"
if not version_path.exists():
results.append(ValidationResult(
name="Reproducibility:Version metadata",
result=GateResult.FAIL,
message="version.json not found"
))
return results
with open(version_path) as f:
version_info = json.load(f)
# Check each requirement
results.append(ValidationResult(
name="Reproducibility:Version metadata",
result=GateResult.PASS,
message="present"
))
# Checksums
has_checksums = all(
"checksum" in split
for split in version_info.get("splits", {}).values()
)
results.append(ValidationResult(
name="Reproducibility:Checksums",
result=GateResult.PASS if has_checksums else GateResult.FAIL,
message="verified" if has_checksums else "missing"
))
# Random seed
gen_config = version_info.get("generation_config", {})
has_seed = "random_seed" in gen_config
results.append(ValidationResult(
name="Reproducibility:Random seed",
result=GateResult.PASS if has_seed else GateResult.FAIL,
message=str(gen_config.get("random_seed", "missing"))
))
# Model
has_model = "model" in gen_config
results.append(ValidationResult(
name="Reproducibility:Model",
result=GateResult.PASS if has_model else GateResult.FAIL,
message=gen_config.get("model", "missing")
))
# Dataset card
readme_path = Path(dataset_path) / "README.md"
has_readme = readme_path.exists()
results.append(ValidationResult(
name="Reproducibility:Dataset card",
result=GateResult.PASS if has_readme else GateResult.FAIL,
message="complete" if has_readme else "missing"
))
return results
def check_splits(self, splits: dict) -> list[ValidationResult]:
"""Check split requirements (blocking)."""
results = []
minimums = {"train": 68, "validation": 8, "test": 8}
total = sum(len(s) for s in splits.values())
for split_name, data in splits.items():
count = len(data)
pct = count / total * 100 if total > 0 else 0
minimum = minimums.get(split_name, 0)
passed = count >= minimum
results.append(ValidationResult(
name=f"Splits:{split_name.capitalize()}",
result=GateResult.PASS if passed else GateResult.FAIL,
message=f"{count} ({pct:.1f}%, min: {minimum})"
))
return results
async def run_pipeline(
seeds_path: str = "seeds/task_api_seeds.json",
output_path: str = "data/hf_dataset",
version: str = "1.0.0"
) -> PipelineReport:
"""Run the complete data pipeline with quality gates."""
report = PipelineReport(
timestamp=datetime.utcnow().isoformat() + "Z",
version=version
)
print("=" * 50)
print("RUNNING PRODUCTION DATA PIPELINE")
print("=" * 50)
# Step 1: Load seeds
print("\n[1/5] Loading seeds...")
with open(seeds_path) as f:
seeds = json.load(f)["seeds"]
print(f" Loaded {len(seeds)} seed examples")
# Step 2: Generate variations
print("\n[2/5] Generating variations...")
# (In production, this would call the async generation)
# For capstone, we load existing data
data_path = Path("data/task_api_dataset.jsonl")
if data_path.exists():
with open(data_path) as f:
conversations = [json.loads(line) for line in f]
print(f" Loaded {len(conversations)} existing conversations")
else:
print(" ERROR: No data file found. Run generate_dataset.py first.")
return report
# Step 3: Validate format
print("\n[3/5] Validating format compliance...")
gates = QualityGates(conversations)
for result in gates.check_format_compliance():
report.add(result)
status = result.result.value
print(f" [{status}] {result.name}: {result.message}")
# Step 4: Check coverage
print("\n[4/5] Checking coverage...")
for result in gates.check_coverage():
report.add(result)
status = result.result.value
print(f" [{status}] {result.name}: {result.message}")
# Step 5: Quality metrics
print("\n[5/5] Analyzing quality metrics...")
for result in gates.check_quality_metrics():
report.add(result)
status = result.result.value
print(f" [{status}] {result.name}: {result.message}")
# Check if we can proceed
if not report.passed:
print("\n" + "!" * 50)
print("PIPELINE BLOCKED - Quality gates failed")
print("!" * 50)
return report
# Additional checks for complete pipeline
print("\n[+] Additional validation...")
# Reproducibility
for result in gates.check_reproducibility(output_path):
report.add(result)
# Splits (load from disk if available)
splits_path = Path("data/splits")
if splits_path.exists():
splits = {}
for split_name in ["train", "val", "test"]:
split_file = splits_path / f"{split_name}.jsonl"
if split_file.exists():
with open(split_file) as f:
key = "validation" if split_name == "val" else split_name
splits[key] = [json.loads(line) for line in f]
for result in gates.check_splits(splits):
report.add(result)
return report
if __name__ == "__main__":
report = asyncio.run(run_pipeline())
print("\n")
print(report.render())
# Save report
report_path = Path("data/validation_report.txt")
with open(report_path, "w") as f:
f.write(report.render())
print(f"\nReport saved to {report_path}")
Output:
==================================================
RUNNING PRODUCTION DATA PIPELINE
==================================================
[1/5] Loading seeds...
Loaded 12 seed examples
[2/5] Generating variations...
Loaded 106 existing conversations
[3/5] Validating format compliance...
[PASS] Format:Valid JSON: 106/106 (100%)
[PASS] Format:ShareGPT schema: 106/106 (100%)
[PASS] Format:Alternating turns: 106/106 (100%)
[PASS] Format:Non-empty messages: 106/106 (100%)
[4/5] Checking coverage...
[PASS] Coverage:Create: 28 examples (target: 20)
[PASS] Coverage:Update: 24 examples (target: 20)
[PASS] Coverage:Complete: 18 examples (target: 15)
[PASS] Coverage:Delete: 18 examples (target: 15)
[PASS] Coverage:Query: 18 examples (target: 15)
[5/5] Analyzing quality metrics...
[PASS] Quality:Avg human length: 42.3 chars (target: 30-80)
[PASS] Quality:Avg assistant length: 127.8 chars (target: 80-200)
[PASS] Quality:Unique messages: 98.2% (target: >95%)
[+] Additional validation...
==================================================
DATASET VALIDATION REPORT
==================================================
Date: 2025-01-01T15:30:00Z
Version: 1.0.0
FORMAT
[PASS] Valid JSON: 106/106 (100%)
[PASS] ShareGPT schema: 106/106 (100%)
[PASS] Alternating turns: 106/106 (100%)
[PASS] Non-empty messages: 106/106 (100%)
COVERAGE
[PASS] Create: 28 examples (target: 20)
[PASS] Update: 24 examples (target: 20)
[PASS] Complete: 18 examples (target: 15)
[PASS] Delete: 18 examples (target: 15)
[PASS] Query: 18 examples (target: 15)
QUALITY
[PASS] Avg human length: 42.3 chars (target: 30-80)
[PASS] Avg assistant length: 127.8 chars (target: 80-200)
[PASS] Unique messages: 98.2% (target: >95%)
REPRODUCIBILITY
[PASS] Version metadata: present
[PASS] Checksums: verified
[PASS] Random seed: 42
[PASS] Model: gpt-4o-mini
[PASS] Dataset card: complete
SPLITS
[PASS] Train: 84 (79.2%, min: 68)
[PASS] Validation: 11 (10.4%, min: 8)
[PASS] Test: 11 (10.4%, min: 8)
==================================================
RESULT: PASS
Dataset is production-ready for fine-tuning.
==================================================
Report saved to data/validation_report.txt
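One gap worth closing before moving on: the split gate in the spec also requires that splits do not overlap, but check_splits above only verifies counts and percentages. Here is a minimal sketch of an overlap check you could add as another QualityGates method, assuming a JSON-serialization fingerprint is an acceptable way to detect shared examples:

```python
def check_split_overlap(self, splits: dict) -> ValidationResult:
    """Verify no conversation appears in more than one split (blocking)."""
    def fingerprints(data: list[dict]) -> set[str]:
        # Serialize each example deterministically so identical conversations collide.
        return {json.dumps(example, sort_keys=True) for example in data}

    seen: set[str] = set()
    overlaps = 0
    for data in splits.values():
        fps = fingerprints(data)
        overlaps += len(seen & fps)  # examples already seen in an earlier split
        seen |= fps

    return ValidationResult(
        name="Splits:No overlap",
        result=GateResult.PASS if overlaps == 0 else GateResult.FAIL,
        message="no shared examples" if overlaps == 0 else f"{overlaps} shared examples",
    )
```

Call it right after check_splits in run_pipeline and add its result to the report so the gate is actually enforced.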
Step 3: Handle Failures Gracefully
What happens when quality gates fail? The pipeline should provide actionable feedback.
Failure Handling
def generate_remediation_steps(report: PipelineReport) -> list[str]:
"""Generate specific steps to fix failed gates."""
steps = []
for result in report.results:
if result.result == GateResult.FAIL:
if "Format" in result.name:
steps.append(f"FIX {result.name}: Run validate_dataset.py to identify malformed examples")
elif "Coverage" in result.name:
op = result.name.split(":")[-1]
steps.append(f"FIX {result.name}: Add more seed examples for '{op}' operation")
elif "Reproducibility" in result.name:
if "Version" in result.name:
steps.append("FIX: Run version_dataset.py to create version.json")
elif "Checksum" in result.name:
steps.append("FIX: Regenerate checksums with compute_checksum()")
elif "Dataset card" in result.name:
steps.append("FIX: Create README.md following HuggingFace template")
elif "Splits" in result.name:
steps.append(f"FIX {result.name}: Add more examples or adjust split ratios")
return steps
# In the main pipeline:
if not report.passed:
print("\nREMEDIATION STEPS:")
for i, step in enumerate(generate_remediation_steps(report), 1):
print(f" {i}. {step}")
Output (when failing):
RESULT: FAIL
Dataset does not meet production criteria.
Blocking failures: 2
REMEDIATION STEPS:
1. FIX Coverage:Delete: Add more seed examples for 'Delete' operation
2. FIX Reproducibility:Dataset card: Create README.md following HuggingFace template
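In CI, "pipeline blocked" should also translate into a non-zero exit code so that downstream jobs and merges actually stop. One way to wire this in is to extend the __main__ block; this is a sketch under the assumption that your CI treats a non-zero exit status as a failed job:

```python
import sys

if __name__ == "__main__":
    report = asyncio.run(run_pipeline())
    rendered = report.render()
    print(rendered)
    Path("data/validation_report.txt").write_text(rendered)

    if not report.passed:
        print("\nREMEDIATION STEPS:")
        for i, step in enumerate(generate_remediation_steps(report), 1):
            print(f"  {i}. {step}")
        sys.exit(1)  # non-zero exit fails the CI job, blocking merges and training
```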
Step 4: Finalize for Chapter 64
Your dataset is now ready for fine-tuning. Here's the complete file structure:
data/
hf_dataset/
README.md # Dataset card
version.json # Version metadata
train/
data-00000-of-00001.arrow # Training data
validation/
data-00000-of-00001.arrow # Validation data
test/
data-00000-of-00001.arrow # Test data (DO NOT TOUCH until final eval)
splits/
train.jsonl # 84 examples
val.jsonl # 11 examples
test.jsonl # 11 examples
validation_report.txt # Final validation report
archives/
1.0.0/ # Version snapshot
seeds/
task_api_seeds.json # Original seed examples
specs/
task-api-dataset-spec.md # Quality specification
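Before handing the dataset to Chapter 64, it is worth re-verifying the recorded checksums against the split files on disk. A minimal sketch, assuming version.json keys its "splits" entries by split name with a "checksum" field (as the reproducibility gate expects); the split-name-to-file mapping below is an assumption about your layout:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA256 hex digest of a file in streaming fashion."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

version_info = json.loads(Path("data/hf_dataset/version.json").read_text())

# Assumed mapping from split names in version.json to files on disk.
split_files = {
    "train": Path("data/splits/train.jsonl"),
    "validation": Path("data/splits/val.jsonl"),
    "test": Path("data/splits/test.jsonl"),
}

for name, meta in version_info.get("splits", {}).items():
    recorded = meta.get("checksum")
    actual = sha256_of(split_files[name])
    print(f"{name}: {'OK' if recorded == actual else 'MISMATCH'}")
```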
Chapter 64 Preview
In the next chapter, you'll use this dataset to fine-tune Llama-3-8B:
# Load your production-ready dataset
from datasets import load_from_disk
dataset = load_from_disk("data/hf_dataset")
# The dataset you built meets all requirements
# - Format: ShareGPT validated
# - Coverage: All Task API operations
# - Quality: Verified metrics
# - Reproducibility: Versioned with checksums
# - Splits: Ready for training
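A quick sanity check after loading, before any training code runs, confirms the splits arrived with the counts your validation report promised (the split names assume the directory layout shown above):

```python
from datasets import load_from_disk

dataset = load_from_disk("data/hf_dataset")

# Minimum counts taken from the split requirements in the spec.
for split_name, minimum in {"train": 68, "validation": 8, "test": 8}.items():
    count = len(dataset[split_name])
    assert count >= minimum, f"{split_name}: {count} examples, expected >= {minimum}"
    print(f"{split_name}: {count} examples")
```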
The Specification-First Mindset
Notice what we did differently in this capstone:
| Traditional Approach | Spec-First Approach |
|---|---|
| Build pipeline, hope it works | Define success criteria first |
| Manual spot-checking | Automated quality gates |
| "Looks good enough" | Measurable pass/fail |
| Debug after training fails | Catch issues before training |
This mindset transfers to all data engineering work. Define the specification. Build to meet it. Validate the result. Every time.
Try With AI
Now that you've completed the full pipeline, explore these extensions with your AI partner.
Prompt 1: Design LLM-as-Judge Quality Scoring
My pipeline uses rule-based quality checks (message length, uniqueness, etc.).
I want to add LLM-as-judge scoring to evaluate:
1. Response helpfulness (does the assistant actually solve the user's problem?)
2. Tone consistency (does the assistant maintain a professional-friendly voice?)
3. Information accuracy (are task IDs, dates, and statuses plausible?)
Design a GPT-4o-mini prompt that scores each conversation on these criteria (1-5 scale).
How do I integrate this into my pipeline without adding too much cost?
What's the right balance between rule-based and LLM-based checks?
What you're learning: LLM-as-judge is a powerful pattern for quality evaluation. This prompt teaches you to design evaluation prompts and integrate them cost-effectively. The technique applies to evaluating fine-tuned model outputs too.
Prompt 2: Scale the Pipeline
My current pipeline handles 100 examples. I need to scale to 10,000 examples:
Current bottlenecks:
- In-memory processing (all data in RAM)
- Synchronous validation (one check at a time)
- No checkpointing (restart from scratch on failure)
Design a scalable version that:
1. Processes data in batches
2. Runs validation in parallel
3. Saves progress so I can resume
4. Handles partial failures gracefully
I'm using Python with asyncio. Show me the architecture.
What you're learning: Production pipelines need to handle scale. This prompt teaches you streaming architectures, parallel processing, and fault tolerance - skills essential for real LLMOps work.
Prompt 3: Continuous Data Quality
My dataset is production-ready today. But I want to add more examples weekly.
Each addition might introduce quality regressions.
Design a CI/CD workflow that:
1. Runs on every data PR
2. Blocks merges that degrade quality
3. Auto-generates version bumps
4. Updates the dataset card
5. Notifies the team of coverage gaps
Show me a GitHub Actions workflow that implements this.
What you're learning: Data quality isn't a one-time activity. This prompt teaches you to think about continuous quality in a CI/CD context - an essential practice for teams maintaining production ML systems.
Capstone Completion Checklist
Before marking this capstone complete, verify:
- Specification written (specs/task-api-dataset-spec.md)
- Pipeline runs end-to-end without errors
- All format compliance checks pass (100%)
- Coverage requirements met (85+ examples)
- Quality metrics in target range
- Reproducibility requirements complete
- Splits created with minimum counts
- Validation report saved
- Dataset ready for Chapter 64
If all checks pass, you've built a production-ready dataset. This is real LLMOps engineering.
Reflect on Your Skill
You built an llmops-data-engineer skill in Lesson 0 and refined it throughout the chapter. This capstone is the final test.
Final Skill Assessment
Using my llmops-data-engineer skill, help me validate a production dataset.
Does my skill now cover the complete pipeline from specification to validation?
Skill Completeness Checklist
Your skill should now include:
- Dataset format selection (Alpaca vs ShareGPT)
- Quality dimensions (accuracy, consistency, coverage, diversity)
- Synthetic data generation with seed design
- Cleaning and deduplication pipelines
- Version control and checksums
- Dataset card documentation
- Automated quality gates
- Validation report generation
Finalize Your Skill
If any gaps remain:
Finalize my llmops-data-engineer skill based on everything learned in Chapter 63:
1. Add the spec-first approach (define success criteria before implementation)
2. Add quality gate implementation patterns
3. Add validation report format
4. Add remediation step generation for failures
5. Ensure the skill produces production-ready datasets
Your skill is now battle-tested. You've used it to build a real dataset that will train a real model. That's the difference between theoretical knowledge and practical capability.
Next Chapter: In Chapter 64, you'll use this dataset to fine-tune Llama-3-8B with LoRA/QLoRA. The quality you built here directly affects model quality.