Dataset Versioning and Management
You've created 100+ training examples for the Task API. Now comes a critical question: how do you manage this data over time? What happens when you want to:
- Add more examples next month?
- Compare model performance across dataset versions?
- Share the dataset with your team?
- Reproduce training results from three months ago?
Without proper versioning, datasets descend into chaos: you end up with files like task_api_v2_final_FINAL_updated.jsonl and no idea which one trained your production model.
This lesson teaches dataset versioning using HuggingFace Datasets, the de facto standard for ML data management. You'll learn to version, document, and manage training data like production code.
Why Dataset Versioning Matters
Consider what happens without versioning:
| Scenario | Without Versioning | With Versioning |
|---|---|---|
| Model regression | "Something changed... was it the data?" | "Compare model v3 (dataset v1.2) vs v4 (dataset v1.3)" |
| Team collaboration | "Which file did you train on?" | "Load task-api-dataset/v1.2" |
| Reproducing results | Hope the files haven't changed | Checksums guarantee identical data |
| Auditing | No idea how data was created | Full provenance in dataset card |
Fine-tuned models inherit their training data's properties. If you can't track the data, you can't understand the model.
Step 1: Convert to HuggingFace Format
HuggingFace Datasets provides a standard format that works across the ML ecosystem. Let's convert our JSONL files.
Install the Library
pip install datasets
Create Dataset from JSONL
"""Convert JSONL files to HuggingFace Dataset format."""
from datasets import Dataset, DatasetDict
import json
from pathlib import Path
def load_jsonl(path: str) -> list[dict]:
"""Load JSONL file into list of dicts."""
with open(path) as f:
return [json.loads(line) for line in f]
def create_dataset() -> DatasetDict:
"""Create HuggingFace DatasetDict from our splits."""
# Load each split
train_data = load_jsonl("data/splits/train.jsonl")
val_data = load_jsonl("data/splits/val.jsonl")
test_data = load_jsonl("data/splits/test.jsonl")
# Convert to HuggingFace Dataset
dataset = DatasetDict({
"train": Dataset.from_list(train_data),
"validation": Dataset.from_list(val_data),
"test": Dataset.from_list(test_data)
})
return dataset
if __name__ == "__main__":
dataset = create_dataset()
print(dataset)
# Save locally
dataset.save_to_disk("data/hf_dataset")
print("\nSaved to data/hf_dataset/")
Output:
DatasetDict({
train: Dataset({
features: ['conversations'],
num_rows: 84
})
validation: Dataset({
features: ['conversations'],
num_rows: 11
})
test: Dataset({
features: ['conversations'],
num_rows: 11
})
})
Saved to data/hf_dataset/
Loading the Dataset
Now anyone can load your dataset with a single line:
from datasets import load_from_disk
dataset = load_from_disk("data/hf_dataset")
# Access splits
train = dataset["train"]
print(f"Training examples: {len(train)}")
# Iterate over examples
for example in train.select(range(3)):
print(example["conversations"])
Output:
Training examples: 84
[{'from': 'human', 'value': 'Create a task for the quarterly budget review by Friday'}, ...]
[{'from': 'human', 'value': 'I need help setting up a new task'}, ...]
[{'from': 'human', 'value': 'Mark task #1523 as in progress'}, ...]
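Because the DatasetDict is a standard object, sharing it with your team is one call away via the HuggingFace Hub. A minimal sketch, assuming you've authenticated with huggingface-cli login and that the repo id (a placeholder here) points at a private repo you control:

```python
from datasets import load_from_disk

dataset = load_from_disk("data/hf_dataset")

# Push to a private Hub repo; teammates can then load it by name.
# "your-org/task-api-conversations" is a placeholder repo id.
dataset.push_to_hub("your-org/task-api-conversations", private=True)
```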
Step 2: Add Version Metadata
Datasets need version information. We'll add metadata that tracks:
- Version number (semantic versioning)
- Creation timestamp
- SHA256 checksums (for integrity verification)
- Generation configuration
Create Version Metadata
"""Add version metadata to dataset."""
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
def compute_checksum(path: str) -> str:
"""Compute SHA256 checksum of a file."""
sha256 = hashlib.sha256()
with open(path, "rb") as f:
for chunk in iter(lambda: f.read(8192), b""):
sha256.update(chunk)
return sha256.hexdigest()
def create_version_info(version: str = "1.0.0") -> dict:
"""Create version metadata for the dataset."""
return {
"version": version,
"created_at": datetime.utcnow().isoformat() + "Z",
"splits": {
"train": {
"num_examples": 84,
"checksum": compute_checksum("data/splits/train.jsonl")
},
"validation": {
"num_examples": 11,
"checksum": compute_checksum("data/splits/val.jsonl")
},
"test": {
"num_examples": 11,
"checksum": compute_checksum("data/splits/test.jsonl")
}
},
"generation_config": {
"seed_examples": 12,
"variations_per_seed": 8,
"model": "gpt-4o-mini",
"temperature": 0.9,
"random_seed": 42
},
"format": "sharegpt"
}
if __name__ == "__main__":
version_info = create_version_info("1.0.0")
# Save metadata
with open("data/hf_dataset/version.json", "w") as f:
json.dump(version_info, f, indent=2)
print("Version metadata:")
print(json.dumps(version_info, indent=2))
Output:
{
"version": "1.0.0",
"created_at": "2025-01-01T15:30:00Z",
"splits": {
"train": {
"num_examples": 84,
"checksum": "a1b2c3d4e5f6..."
},
"validation": {
"num_examples": 11,
"checksum": "f6e5d4c3b2a1..."
},
"test": {
"num_examples": 11,
"checksum": "1234567890ab..."
}
},
"generation_config": {
"seed_examples": 12,
"variations_per_seed": 8,
"model": "gpt-4o-mini",
"temperature": 0.9,
"random_seed": 42
},
"format": "sharegpt"
}
Verify Integrity
When loading a dataset, verify its checksums:
def verify_dataset_integrity(dataset_path: str) -> bool:
"""Verify dataset hasn't been modified."""
version_path = Path(dataset_path) / "version.json"
with open(version_path) as f:
version_info = json.load(f)
for split_name, split_info in version_info["splits"].items():
# Load the split and compute checksum
split_path = f"data/splits/{split_name}.jsonl"
if split_name == "validation":
split_path = "data/splits/val.jsonl"
current_checksum = compute_checksum(split_path)
expected_checksum = split_info["checksum"]
if current_checksum != expected_checksum:
print(f"INTEGRITY FAILURE: {split_name} checksum mismatch")
return False
print("Dataset integrity verified")
return True
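A natural place to call this check is at the top of your training script, so a silent data change fails fast. A minimal sketch using the function above:

```python
if __name__ == "__main__":
    if not verify_dataset_integrity("data/hf_dataset"):
        raise SystemExit("Aborting: splits do not match recorded checksums")
```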
Step 3: Create a Dataset Card
A dataset card is documentation that travels with your data. It answers: What is this? Where did it come from? How should it be used?
HuggingFace defines a standard card format. Create data/hf_dataset/README.md:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
- conversational
tags:
- task-management
- fine-tuning
- synthetic
pretty_name: Task API Conversations
size_categories:
- n<1K
---
# Task API Conversations Dataset
## Dataset Description
Training data for fine-tuning language models on Task API interactions. The dataset contains multi-turn conversations between users and an AI assistant that manages tasks.
### Dataset Summary
| Attribute | Value |
|-----------|-------|
| Total examples | 106 |
| Training split | 84 |
| Validation split | 11 |
| Test split | 11 |
| Format | ShareGPT (conversations) |
| Language | English |
| Domain | Task management |
### Supported Tasks
- **Instruction-following**: Model responds to task management requests
- **Multi-turn conversation**: Model maintains context across turns
- **Error handling**: Model gracefully handles invalid inputs
### Languages
English (en)
## Dataset Structure
### Data Format
Each example contains a `conversations` array with alternating human/gpt messages:
\`\`\`json
{
"conversations": [
{"from": "human", "value": "Create a task for the budget review by Friday"},
{"from": "gpt", "value": "I'll create that. What priority - low, medium, or high?"},
{"from": "human", "value": "High priority"},
{"from": "gpt", "value": "Created task #1847 'Budget review' due Friday with high priority."}
]
}
\`\`\`
### Data Splits
| Split | Examples | Purpose |
|-------|----------|---------|
| train | 84 | Model training |
| validation | 11 | Hyperparameter tuning |
| test | 11 | Final evaluation |
### Coverage
The dataset covers these Task API operations:
| Operation | Scenarios |
|-----------|-----------|
| Create | Basic, missing info, invalid input |
| Update | Status change, partial update, not found |
| Complete | Normal completion, already complete |
| Delete | With confirmation, active task handling |
| Query | Filtered list, empty results |
## Dataset Creation
### Generation Method
**Synthetic generation using GPT-4o-mini**
1. 12 seed examples written by domain experts
2. 8 variations generated per seed
3. Pydantic validation for format compliance
4. Manual quality review
### Seed Examples
Seeds were designed to cover:
- Multiple phrasing styles (formal, casual, terse, detailed)
- Happy paths and error cases
- Single-turn and multi-turn interactions
- All CRUD operations
### Quality Assurance
- Schema validation with Pydantic
- Duplicate detection
- Manual review of 10% sample
- Turn distribution analysis
## Considerations for Using the Data
### Intended Use
- Fine-tuning instruction-following models for task management
- Research on conversational AI for productivity applications
- Benchmarking task-oriented dialogue systems
### Limitations
- **Synthetic data**: May not capture all real-world phrasing patterns
- **English only**: Not suitable for multilingual applications
- **Domain-specific**: Trained for task management, not general conversation
- **Scale**: 106 examples is minimal; production use may need more
### Bias and Fairness
The dataset reflects:
- Western business context (business hours, Western holidays)
- Professional communication norms
- Task management workflows common in tech companies
Consider augmenting for other cultural contexts.
## Additional Information
### Version History
| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-01-01 | Initial release |
### Citation
\`\`\`bibtex
@dataset{task_api_conversations,
title={Task API Conversations Dataset},
author={Your Team},
year={2025},
version={1.0.0}
}
\`\`\`
### License
Apache 2.0
### Contact
[Your contact information]
This dataset card provides everything someone needs to understand and use your data responsibly.
Step 4: Implement Version Control Workflow
Now let's establish a workflow for managing dataset versions over time.
Version Bump Script
Create version_dataset.py:
"""Manage dataset versions."""
import json
import shutil
from pathlib import Path
from datetime import datetime
def bump_version(current: str, bump_type: str = "patch") -> str:
"""Increment semantic version."""
major, minor, patch = map(int, current.split("."))
if bump_type == "major":
return f"{major + 1}.0.0"
elif bump_type == "minor":
return f"{major}.{minor + 1}.0"
else: # patch
return f"{major}.{minor}.{patch + 1}"
def archive_version(dataset_path: str, version: str):
"""Archive current version before updating."""
archive_dir = Path("data/archives") / version
archive_dir.mkdir(parents=True, exist_ok=True)
# Copy current dataset
shutil.copytree(dataset_path, archive_dir / "dataset", dirs_exist_ok=True)
# Copy source files
for split in ["train", "val", "test"]:
src = Path("data/splits") / f"{split}.jsonl"
if src.exists():
shutil.copy(src, archive_dir / f"{split}.jsonl")
print(f"Archived version {version} to {archive_dir}")
def create_new_version(
dataset_path: str,
bump_type: str = "patch",
changelog: str = ""
):
"""Create a new version of the dataset."""
# Load current version
version_path = Path(dataset_path) / "version.json"
with open(version_path) as f:
version_info = json.load(f)
current_version = version_info["version"]
# Archive current
archive_version(dataset_path, current_version)
# Bump version
new_version = bump_version(current_version, bump_type)
# Update version info
version_info["version"] = new_version
version_info["created_at"] = datetime.utcnow().isoformat() + "Z"
version_info["changelog"] = changelog
version_info["previous_version"] = current_version
# Recalculate checksums
for split_name in version_info["splits"]:
split_file = f"data/splits/{split_name}.jsonl"
if split_name == "validation":
split_file = "data/splits/val.jsonl"
if Path(split_file).exists():
version_info["splits"][split_name]["checksum"] = compute_checksum(split_file)
# Save new version
with open(version_path, "w") as f:
json.dump(version_info, f, indent=2)
print(f"Created version {new_version}")
return new_version
if __name__ == "__main__":
import sys
bump_type = sys.argv[1] if len(sys.argv) > 1 else "patch"
changelog = sys.argv[2] if len(sys.argv) > 2 else "Minor updates"
create_new_version("data/hf_dataset", bump_type, changelog)
Usage:
python version_dataset.py minor "Added 20 new examples for recurring tasks"
Output:
Archived version 1.0.0 to data/archives/1.0.0
Created version 1.1.0
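Which bump type to pass is a judgment call. One convention that maps well to datasets (our suggestion, not a hard rule): patch for corrections to existing examples, minor for added examples, major for schema or format changes.

```python
print(bump_version("1.0.0"))           # 1.0.1 - patch: fixed typos in examples
print(bump_version("1.0.0", "minor"))  # 1.1.0 - minor: added new examples
print(bump_version("1.0.0", "major"))  # 2.0.0 - major: schema/format change
```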
Step 5: Reproducibility Practices
Reproducibility means someone else (or future you) can recreate exactly the same dataset.
The Reproducibility Checklist
| Requirement | How We Meet It |
|---|---|
| Fixed random seed | random_seed: 42 in version.json |
| Model version pinned | model: gpt-4o-mini with date |
| Checksums | SHA256 for each split |
| Generation code | Saved in repository |
| Seed examples | Included in dataset |
| Parameters documented | Temperature, variations per seed |
Reproducibility Verification Script
"""Verify dataset reproducibility."""
import json
from pathlib import Path
def check_reproducibility(dataset_path: str) -> dict:
"""Check that all reproducibility requirements are met."""
version_path = Path(dataset_path) / "version.json"
readme_path = Path(dataset_path) / "README.md"
results = {
"version_file": version_path.exists(),
"readme_file": readme_path.exists(),
"has_checksums": False,
"has_random_seed": False,
"has_model_info": False,
"has_generation_config": False
}
if results["version_file"]:
with open(version_path) as f:
version_info = json.load(f)
# Check for checksums
splits = version_info.get("splits", {})
results["has_checksums"] = all(
"checksum" in split for split in splits.values()
)
# Check generation config
gen_config = version_info.get("generation_config", {})
results["has_random_seed"] = "random_seed" in gen_config
results["has_model_info"] = "model" in gen_config
results["has_generation_config"] = bool(gen_config)
# Calculate score
    num_passed = sum(results.values())
    total = len(results)
    print(f"Reproducibility check: {num_passed}/{total} requirements met")
    for check, ok in results.items():
        status = "PASS" if ok else "FAIL"
        print(f"  [{status}] {check}")
return results
if __name__ == "__main__":
check_reproducibility("data/hf_dataset")
Output:
Reproducibility check: 6/6 requirements met
[PASS] version_file
[PASS] readme_file
[PASS] has_checksums
[PASS] has_random_seed
[PASS] has_model_info
[PASS] has_generation_config
The Complete Dataset Structure
After implementing versioning, your dataset directory looks like this:
data/
hf_dataset/
README.md # Dataset card
version.json # Version metadata
dataset_dict.json # HuggingFace metadata
train/
data-00000-of-00001.arrow
validation/
data-00000-of-00001.arrow
test/
data-00000-of-00001.arrow
archives/
1.0.0/
dataset/ # Full dataset snapshot
train.jsonl
val.jsonl
test.jsonl
splits/
train.jsonl # Current working files
val.jsonl
test.jsonl
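Because each archive holds a complete HuggingFace snapshot, comparing versions is just a matter of loading two directories. A minimal sketch, assuming the layout above:

```python
from datasets import load_from_disk

# Load an archived snapshot next to the current version
old = load_from_disk("data/archives/1.0.0/dataset")
new = load_from_disk("data/hf_dataset")

for split in ["train", "validation", "test"]:
    print(f"{split}: {len(old[split])} -> {len(new[split])} examples")
```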
Common Mistakes
Mistake 1: Not documenting synthetic generation
# Wrong - no provenance
This dataset contains task management conversations.
# Right - full provenance
This dataset was synthetically generated using GPT-4o-mini.
- 12 seed examples written by domain experts
- 8 variations per seed (temperature=0.9)
- Validated with Pydantic schemas
Anyone using your data needs to know it's synthetic and how it was created.
Mistake 2: Forgetting the test split
# Wrong - only train/val
dataset = DatasetDict({
"train": train,
"validation": val
})
# Right - include test for final evaluation
dataset = DatasetDict({
"train": train,
"validation": val,
"test": test # Never used during training
})
The test split must remain untouched until final evaluation. Using it during development invalidates your results.
Mistake 3: Modifying data without version bump
# Wrong - silently modify
echo '{"conversations": [...]}' >> data/splits/train.jsonl
# Right - version first
python version_dataset.py patch "Added edge case examples"
echo '{"conversations": [...]}' >> data/splits/train.jsonl
Every data change needs a new version. Otherwise, you can't compare model performance across training runs.
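To make those comparisons possible, record the dataset version alongside each training run. A minimal sketch (the runs/ path and field names are illustrative, not part of this lesson's scripts):

```python
import json
from pathlib import Path

# Read the version the training data currently carries
with open("data/hf_dataset/version.json") as f:
    dataset_version = json.load(f)["version"]

# Tag the training run with it so later comparisons are unambiguous
run_dir = Path("runs")
run_dir.mkdir(exist_ok=True)
run_metadata = {"model": "task-api-ft-v4", "dataset_version": dataset_version}
with open(run_dir / "task-api-ft-v4.json", "w") as f:
    json.dump(run_metadata, f, indent=2)
```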
Try With AI
Now that you understand dataset versioning, explore these advanced topics with your AI partner.
Prompt 1: Design Dataset Schema Evolution
My Task API dataset is version 1.0. I want to add a new field 'tool_calls' to support
fine-tuning for function calling:
{
"conversations": [...],
"tool_calls": [{"name": "create_task", "arguments": {...}}]
}
How do I evolve my dataset schema while maintaining backward compatibility?
What versioning strategy (major/minor/patch) is appropriate?
Should I migrate existing examples or create a parallel dataset?
What you're learning: Schema evolution is inevitable as requirements change. This prompt teaches you to think about backward compatibility and migration strategies - skills that transfer to any data management context.
Prompt 2: Compare Versioning Approaches
I'm choosing between three dataset versioning approaches:
1. HuggingFace Hub (push to hub, use revisions)
2. DVC (Data Version Control with Git)
3. Manual versioning (what we built in this lesson)
My constraints:
- Team of 3 people
- Private data (cannot be public)
- Training happens on cloud (need remote access)
- Want to compare models across dataset versions
Which approach fits best? What are the trade-offs?
What you're learning: There's no single "right" versioning system. This prompt helps you evaluate tools based on your actual constraints. The analysis skills transfer to any tooling decision.
Prompt 3: Automate Quality Gates
I want to prevent bad data from entering my dataset. Design a CI/CD pipeline that:
1. Validates format (Pydantic schemas)
2. Checks for duplicates
3. Verifies minimum quality score
4. Updates version automatically
5. Generates dataset card updates
Show me a GitHub Actions workflow that implements these quality gates.
The pipeline should block merges that would degrade dataset quality.
What you're learning: Manual processes don't scale. This prompt teaches you to think about automation and quality gates - essential skills for production ML systems. The CI/CD patterns apply beyond just datasets.
Reflect on Your Skill
You built an llmops-data-engineer skill in Lesson 0. This lesson covered versioning and management - critical for any data engineering role.
Test Your Skill
Using my llmops-data-engineer skill, help me set up dataset versioning.
Does my skill include version control practices, dataset cards, and reproducibility?
Identify Gaps
Consider:
- Did your skill mention HuggingFace Datasets format?
- Did it include dataset card requirements?
- Did it cover checksums and integrity verification?
- Did it address version history and archives?
Improve Your Skill
If you found gaps:
Update my llmops-data-engineer skill to include:
1. HuggingFace Datasets format conversion
2. Dataset card template (following HuggingFace standard)
3. Version metadata with checksums
4. Archive workflow for version history
5. Reproducibility checklist
Your skill should now guide you through the complete lifecycle from raw data to versioned, documented datasets ready for production fine-tuning.