Model Export Formats
Your fine-tuned model exists as a collection of weight tensors. To deploy it, you need to package these weights in a format that inference engines can load efficiently. This decision affects memory usage, loading speed, and which platforms you can target.
This lesson covers the three formats you will encounter: GGUF (the modern standard for local inference), safetensors (the Hugging Face standard for cloud and research), and legacy formats (GGML, PyTorch .bin) that you may need to convert from.
The Format Landscape
Model formats evolved as deployment needs changed. Understanding this history helps you make informed decisions.
| Era | Format | Primary Use Case | Status |
|---|---|---|---|
| 2020-2022 | PyTorch .bin / .pth | Research, training | Still used for training |
| 2022-2023 | GGML | Early local inference | Deprecated |
| 2023 | safetensors | Safe, fast serialization for Hugging Face models | Current for cloud |
| 2023-present | GGUF | Local inference standard | Current for local |
The key insight: different formats serve different deployment targets. There is no single "best" format—only the right format for your deployment scenario.
GGUF: The Local Inference Standard
GGUF (GPT-Generated Unified Format) was designed by Georgi Gerganov for the llama.cpp project. It replaced GGML to address critical limitations that emerged as models grew larger and more diverse.
Why GGUF Exists
GGML had fundamental problems:
- Hardcoded architectures: Adding new model types required code changes
- Limited metadata: No standardized way to store tokenizer, template, or configuration
- Versioning issues: No forward/backward compatibility guarantees
GGUF solves these with a flexible, self-describing format:
┌─────────────────────────────────────────────────────────┐
│ GGUF File Structure │
├─────────────────────────────────────────────────────────┤
│ Header │
│ - Magic number (GGUF) │
│ - Version │
│ - Tensor count │
│ - Metadata KV count │
├─────────────────────────────────────────────────────────┤
│ Metadata (Key-Value pairs) │
│ - general.architecture: "llama" │
│ - general.name: "task-api-model" │
│ - llama.context_length: 4096 │
│ - tokenizer.ggml.model: "llama" │
│ - tokenizer.ggml.tokens: [...] │
│ - (any custom metadata you need) │
├─────────────────────────────────────────────────────────┤
│ Tensor Data │
│ - Weight matrices │
│ - Quantized as specified (Q4_K_M, Q8_0, etc.) │
└─────────────────────────────────────────────────────────┘
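Everything in this header is plain little-endian binary, so you can verify that a file really is GGUF with nothing but the standard library. A minimal sketch, assuming a hypothetical local file task-api-model-q4_k_m.gguf and the GGUF v2/v3 header layout:

import struct

# Read only the fixed-size header fields shown in the diagram above.
with open("task-api-model-q4_k_m.gguf", "rb") as f:
    magic = f.read(4)                                      # b"GGUF"
    (version,) = struct.unpack("<I", f.read(4))            # uint32, little-endian
    (tensor_count,) = struct.unpack("<Q", f.read(8))       # uint64
    (metadata_kv_count,) = struct.unpack("<Q", f.read(8))  # uint64

assert magic == b"GGUF", "not a GGUF file"
print(f"GGUF v{version}: {tensor_count} tensors, {metadata_kv_count} metadata keys")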
GGUF Key Properties
| Property | Value | Implication |
|---|---|---|
| Self-contained | Yes | Single file deployment |
| Metadata extensible | Yes | Store tokenizer, template, config |
| Architecture agnostic | Yes | LLaMA, Mistral, Qwen, etc. |
| Quantization options | Many | Q4_K_M, Q5_K_M, Q8_0, F16 |
| Memory mapping | Yes | Fast loading, shared memory |
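To inspect these properties on a real file, the gguf Python package published alongside llama.cpp (pip install gguf) includes a reader. A sketch under the assumption that its GGUFReader API matches the calls below, again using the hypothetical task-api-model-q4_k_m.gguf:

from gguf import GGUFReader

reader = GGUFReader("task-api-model-q4_k_m.gguf")

# Metadata keys: architecture, context length, tokenizer, chat template, ...
for key in reader.fields:
    print(key)

# Tensors with their shapes and per-tensor quantization types (Q4_K, Q6_K, F32, ...)
for tensor in reader.tensors:
    print(tensor.name, list(tensor.shape), tensor.tensor_type.name)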
When to Use GGUF
Use GGUF when:
- Deploying to Ollama, llama.cpp, or LM Studio
- Targeting consumer hardware (laptop, desktop)
- Serving models without GPU
- Single-file distribution is important
Do not use GGUF when:
- Training or fine-tuning (use safetensors)
- Cloud deployment with GPU (use safetensors + vLLM)
- Need to load into PyTorch/Transformers (use safetensors)
Safetensors: The Cloud and Research Standard
Safetensors was developed by Hugging Face to address security and performance issues with PyTorch's pickle-based serialization.
Why Safetensors Exists
PyTorch's native format uses Python pickle, which:
- Allows arbitrary code execution: Security vulnerability
- Loads slowly: Must deserialize entire file
- Wastes memory: Cannot memory-map efficiently
Safetensors solves these:
┌─────────────────────────────────────────────────────────┐
│ Safetensors File Structure │
├─────────────────────────────────────────────────────────┤
│ Header (JSON, 8 bytes length prefix) │
│ { │
│ "model.layers.0.self_attn.q_proj.weight": { │
│ "dtype": "F16", │
│ "shape": [4096, 4096], │
│ "data_offsets": [0, 33554432] │
│ }, │
│ ... │
│ } │
├─────────────────────────────────────────────────────────┤
│ Tensor Data (raw bytes, memory-mappable) │
│ - Continuous byte array │
│ - Zero-copy loading possible │
└─────────────────────────────────────────────────────────┘
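Because the header is length-prefixed JSON, you can list a checkpoint's tensor names, dtypes, and shapes without loading any weights. A minimal sketch, assuming a local file named model.safetensors:

import json
import struct

with open("model.safetensors", "rb") as f:
    # First 8 bytes: little-endian uint64 giving the JSON header length
    (header_len,) = struct.unpack("<Q", f.read(8))
    header = json.loads(f.read(header_len))

# Each entry maps a tensor name to its dtype, shape, and byte range in the data section
for name, info in header.items():
    if name == "__metadata__":  # optional free-form string metadata
        continue
    print(name, info["dtype"], info["shape"])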
Safetensors Key Properties
| Property | Value | Implication |
|---|---|---|
| Safe | Yes | No code execution |
| Fast loading | Yes | Memory mapping, zero-copy |
| Hugging Face native | Yes | Direct Hub integration |
| Framework agnostic | Partial | PyTorch, TensorFlow, JAX |
| Quantization | Not built-in | Stores any dtype you save; AWQ/GPTQ quantized checkpoints still ship as safetensors |
When to Use Safetensors
Use safetensors when:
- Storing model checkpoints during training
- Uploading to Hugging Face Hub
- Deploying with vLLM, TGI, or cloud inference
- Need to load into PyTorch/Transformers
Do not use safetensors when:
- Deploying to Ollama (convert to GGUF)
- Memory-constrained local inference (use quantized GGUF)
- Single-file distribution is required
Legacy Formats: What You Might Encounter
GGML (Deprecated)
GGML was the original llama.cpp format. You may still encounter GGML files from 2023 (often distributed as ggml-model-*.bin), but:
- No longer supported by current llama.cpp
- Convert to GGUF before use
- Some older tutorials reference GGML patterns
PyTorch .bin / .pth
The original PyTorch serialization format:
- Still used for training checkpoints
- Security risk due to pickle
- Convert to safetensors for storage and sharing
Format Selection Decision Tree
When choosing a format, work through this decision tree:
Start: What is your deployment target?
│
├─► Local inference (Ollama, llama.cpp, LM Studio)
│ └─► Use GGUF with appropriate quantization
│
├─► Cloud inference (vLLM, TGI, SageMaker)
│ └─► Use safetensors with optional quantization
│
├─► Training / Fine-tuning
│ └─► Use safetensors (or PyTorch .bin for legacy)
│
├─► Hugging Face Hub distribution
│ └─► Use safetensors + config.json
│
└─► Multiple targets
└─► Store as safetensors, convert to GGUF for local
Converting Between Formats
Safetensors to GGUF (Most Common)
When you fine-tune with Unsloth or Transformers, you get safetensors. To deploy on Ollama:
from unsloth import FastLanguageModel
# Load your fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="./my_finetuned_model", # safetensors location
max_seq_length=2048,
load_in_4bit=False, # Load full precision for export
)
# Export to GGUF with quantization
model.save_pretrained_gguf(
"task-api-model",
tokenizer,
quantization_method="q4_k_m" # Choose quantization
)
Output:
Saving model to task-api-model/...
Converting to GGUF format...
Applying q4_k_m quantization...
Done! File: task-api-model/task-api-model-q4_k_m.gguf
PyTorch to Safetensors
If you have legacy PyTorch checkpoints:
import torch
from safetensors.torch import save_file
# Load PyTorch model
state_dict = torch.load("model.bin", map_location="cpu")
# Remove optimizer state if present
if "model_state_dict" in state_dict:
state_dict = state_dict["model_state_dict"]
# Save as safetensors
save_file(state_dict, "model.safetensors")
Output:
Saved model.safetensors (6.2 GB)
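Before discarding the original checkpoint, it is worth confirming the conversion round-trips cleanly. A quick verification sketch, reusing the model.bin and model.safetensors paths from the example above:

import torch
from safetensors.torch import load_file

original = torch.load("model.bin", map_location="cpu")
if "model_state_dict" in original:
    original = original["model_state_dict"]
converted = load_file("model.safetensors")

# Same tensor names, same values
assert set(converted) == set(original), "tensor names differ"
for name, tensor in converted.items():
    assert torch.equal(tensor, original[name]), f"mismatch in {name}"
print(f"Verified {len(converted)} tensors")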
GGML to GGUF (Legacy Conversion)
If you have old GGML files:
# Use the conversion script bundled with the llama.cpp repository (run from a llama.cpp checkout)
python convert-llama-ggml-to-gguf.py \
--input old_model.ggml \
--output new_model.gguf
Format Comparison Summary
| Criterion | GGUF | Safetensors | PyTorch .bin |
|---|---|---|---|
| Security | Safe | Safe | Unsafe (pickle) |
| Loading speed | Fast | Fast | Slow |
| Memory efficiency | Excellent | Good | Poor |
| Quantization | Built-in | External | External |
| Single file | Yes | No (needs config) | No |
| Ollama compatible | Yes | No | No |
| vLLM compatible | No | Yes | Yes |
| Training compatible | No | Yes | Yes |
Reflect on Your Skill
Update your model-serving skill with format selection guidance:
## Export Format Selection
### Quick Reference
| Target | Format | Quantization |
|--------|--------|--------------|
| Ollama | GGUF | q4_k_m (default) |
| vLLM | safetensors | awq or gptq |
| Hugging Face | safetensors | none |
| Training | safetensors | none |
### Conversion Commands
Safetensors → GGUF:
model.save_pretrained_gguf("output", tokenizer, quantization_method="q4_k_m")
PyTorch → Safetensors:
save_file(state_dict, "model.safetensors")
Try With AI
Use your AI companion (Claude, ChatGPT, Gemini, or similar).
Prompt 1: Analyze Your Deployment Scenario
I have a fine-tuned Task API model (3B parameters) stored as safetensors.
I want to:
1. Serve it locally on my laptop (16GB RAM, no GPU)
2. Also deploy to a cloud GPU for high-traffic API
Help me create a deployment plan:
- Which format for each target?
- What quantization level for each?
- What conversion steps do I need?
What you are learning: Multi-target deployment planning. Real applications often need to serve the same model in different environments.
Prompt 2: Troubleshoot Format Issues
I'm trying to load a model into Ollama but getting errors. The model was
downloaded from Hugging Face and has these files:
- config.json
- model.safetensors
- tokenizer.json
- tokenizer_config.json
What steps do I need to convert this for Ollama? What could go wrong?
What you are learning: Format conversion troubleshooting. Many deployment problems stem from format mismatches.
Prompt 3: Evaluate Format Tradeoffs
I need to choose between two deployment approaches for my Task API model:
A) Convert to GGUF Q4_K_M (~4GB) and serve with Ollama
B) Keep as safetensors and serve with vLLM on cloud GPU
Help me analyze:
1. What is the quality difference between Q4 and full precision?
2. What is the cost difference (local vs cloud)?
3. What is the latency difference?
4. Which would you recommend for a startup with limited budget?
What you are learning: Strategic format decisions. Format choice affects cost, quality, and scalability.
Safety Note
When converting models, always verify the output works correctly before deploying to production. Run a few test prompts through the converted model to ensure responses match expectations. Conversion bugs can cause subtle quality degradation that is not immediately obvious.