
Model Export Formats

Your fine-tuned model exists as a collection of weight tensors. To deploy it, you need to package these weights in a format that inference engines can load efficiently. This decision affects memory usage, loading speed, and which platforms you can target.

This lesson covers the three format families you will encounter: GGUF (the modern standard for local inference), safetensors (the Hugging Face standard for cloud and research), and the legacy formats (GGML, PyTorch .bin) that you may need to convert from.

The Format Landscape

Model formats evolved as deployment needs changed. Understanding this history helps you make informed decisions.

| Era | Format | Primary Use Case | Status |
|-----|--------|------------------|--------|
| 2020-2022 | PyTorch .bin / .pth | Research, training | Still used for training |
| 2022-2023 | GGML | Early local inference | Deprecated |
| 2023 | safetensors | Safe, fast Hugging Face serialization | Current for cloud |
| 2023-present | GGUF | Local inference standard | Current for local |

The key insight: different formats serve different deployment targets. There is no single "best" format—only the right format for your deployment scenario.

GGUF: The Local Inference Standard

GGUF (GPT-Generated Unified Format) was designed by Georgi Gerganov for the llama.cpp project. It replaced GGML to address critical limitations that emerged as models grew larger and more diverse.

Why GGUF Exists

GGML had fundamental problems:

  1. Hardcoded architectures: Adding new model types required code changes
  2. Limited metadata: No standardized way to store tokenizer, template, or configuration
  3. Versioning issues: No forward/backward compatibility guarantees

GGUF solves these with a flexible, self-describing format:

┌─────────────────────────────────────────────────────────┐
│ GGUF File Structure │
├─────────────────────────────────────────────────────────┤
│ Header │
│ - Magic number (GGUF) │
│ - Version │
│ - Tensor count │
│ - Metadata KV count │
├─────────────────────────────────────────────────────────┤
│ Metadata (Key-Value pairs) │
│ - general.architecture: "llama" │
│ - general.name: "task-api-model" │
│ - llama.context_length: 4096 │
│ - tokenizer.ggml.model: "llama" │
│ - tokenizer.ggml.tokens: [...] │
│ - (any custom metadata you need) │
├─────────────────────────────────────────────────────────┤
│ Tensor Data │
│ - Weight matrices │
│ - Quantized as specified (Q4_K_M, Q8_0, etc.) │
└─────────────────────────────────────────────────────────┘
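
To make the header concrete, the sketch below reads the fixed-size fields shown above with Python's struct module. It assumes a current (v2/v3) GGUF header layout with little-endian encoding, and "model.gguf" is a placeholder path; for anything beyond a quick inspection, the gguf Python package maintained alongside llama.cpp is the safer route.

import struct

# A minimal sketch of reading the fixed-size GGUF header fields diagrammed above.
# Assumes a GGUF v2/v3 header: 4-byte magic, uint32 version, uint64 tensor count,
# uint64 metadata KV count, all little-endian. "model.gguf" is a placeholder path.
with open("model.gguf", "rb") as f:
    magic = f.read(4)                                    # should be b"GGUF"
    version, = struct.unpack("<I", f.read(4))            # format version
    tensor_count, = struct.unpack("<Q", f.read(8))       # number of tensors
    metadata_kv_count, = struct.unpack("<Q", f.read(8))  # number of metadata pairs

print(magic, version, tensor_count, metadata_kv_count)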

GGUF Key Properties

| Property | Value | Implication |
|----------|-------|-------------|
| Self-contained | Yes | Single file deployment |
| Metadata extensible | Yes | Store tokenizer, template, config |
| Architecture agnostic | Yes | LLaMA, Mistral, Qwen, etc. |
| Quantization options | Many | Q4_K_M, Q5_K_M, Q8_0, F16 |
| Memory mapping | Yes | Fast loading, shared memory |

When to Use GGUF

Use GGUF when:

  • Deploying to Ollama, llama.cpp, or LM Studio
  • Targeting consumer hardware (laptop, desktop)
  • Serving models without GPU
  • Single-file distribution is important

Do not use GGUF when:

  • Training or fine-tuning (use safetensors)
  • Cloud deployment with GPU (use safetensors + vLLM)
  • Need to load into PyTorch/Transformers (use safetensors)

Safetensors: The Cloud and Research Standard

Safetensors was developed by Hugging Face to address security and performance issues with PyTorch's pickle-based serialization.

Why Safetensors Exists

PyTorch's native format uses Python pickle, which:

  1. Allows arbitrary code execution: Security vulnerability
  2. Loads slowly: Must deserialize entire file
  3. Wastes memory: Cannot memory-map efficiently

Safetensors solves these:

┌─────────────────────────────────────────────────────────┐
│ Safetensors File Structure │
├─────────────────────────────────────────────────────────┤
│ Header (JSON, 8 bytes length prefix) │
│ { │
│ "model.layers.0.self_attn.q_proj.weight": { │
│ "dtype": "F16", │
│ "shape": [4096, 4096], │
│ "data_offsets": [0, 33554432] │
│ }, │
│ ... │
│ } │
├─────────────────────────────────────────────────────────┤
│ Tensor Data (raw bytes, memory-mappable) │
│ - Continuous byte array │
│ - Zero-copy loading possible │
└─────────────────────────────────────────────────────────┘
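
That layout can be inspected directly: the sketch below reads the 8-byte little-endian length prefix and then parses the JSON header. The file name is a placeholder, and only the header is read, not the tensor data.

import json
import struct

# A minimal sketch of reading the safetensors header described above.
# The first 8 bytes are a little-endian uint64 giving the JSON header length.
with open("model.safetensors", "rb") as f:
    header_len, = struct.unpack("<Q", f.read(8))
    header = json.loads(f.read(header_len))  # tensor name -> dtype, shape, data_offsets

for name, info in header.items():
    if name != "__metadata__":  # optional free-form metadata entry
        print(name, info["dtype"], info["shape"])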

Safetensors Key Properties

| Property | Value | Implication |
|----------|-------|-------------|
| Safe | Yes | No code execution |
| Fast loading | Yes | Memory mapping, zero-copy |
| Hugging Face native | Yes | Direct Hub integration |
| Framework agnostic | Partial | PyTorch, TensorFlow, JAX |
| Built-in quantization | No | Quantize externally (e.g., AWQ, GPTQ) |
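
The fast-loading row is worth seeing in code: because the header records each tensor's offsets, a reader can pull in one tensor at a time instead of deserializing the whole file. Below is a minimal sketch using the safetensors library's safe_open; the path and tensor choice are illustrative.

from safetensors import safe_open

# Lazy loading sketch: only the requested tensor is read from disk.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    names = list(f.keys())            # header only, no tensor data yet
    first = f.get_tensor(names[0])    # loads just this one tensor
    print(names[0], first.shape, first.dtype)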

When to Use Safetensors

Use safetensors when:

  • Storing model checkpoints during training
  • Uploading to Hugging Face Hub
  • Deploying with vLLM, TGI, or cloud inference
  • Need to load into PyTorch/Transformers

Do not use safetensors when:

  • Deploying to Ollama (convert to GGUF)
  • Memory-constrained local inference (use quantized GGUF)
  • Single-file distribution is required

Legacy Formats: What You Might Encounter

GGML (Deprecated)

GGML was the original llama.cpp format. You may encounter .ggml files from 2023, but:

  • No longer supported by current llama.cpp
  • Convert to GGUF before use
  • Some older tutorials reference GGML patterns

PyTorch .bin / .pth

The original PyTorch serialization format:

  • Still used for training checkpoints
  • Security risk due to pickle
  • Convert to safetensors for storage and sharing

Format Selection Decision Tree

When choosing a format, work through this decision tree:

Start: What is your deployment target?

├─► Local inference (Ollama, llama.cpp, LM Studio)
│ └─► Use GGUF with appropriate quantization

├─► Cloud inference (vLLM, TGI, SageMaker)
│ └─► Use safetensors with optional quantization

├─► Training / Fine-tuning
│ └─► Use safetensors (or PyTorch .bin for legacy)

├─► Hugging Face Hub distribution
│ └─► Use safetensors + config.json

└─► Multiple targets
└─► Store as safetensors, convert to GGUF for local
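
If you prefer the same logic as code, a hypothetical helper like the one below captures the branches above; the target names and return strings are illustrative, not part of any library.

# Hypothetical helper encoding the decision tree above (illustrative only).
def choose_format(target: str) -> str:
    local = {"ollama", "llama.cpp", "lm studio"}
    cloud = {"vllm", "tgi", "sagemaker"}
    t = target.lower()
    if t in local:
        return "GGUF (quantized, e.g. q4_k_m)"
    if t in cloud or t in {"hugging face hub", "training"}:
        return "safetensors"
    return "store as safetensors, convert to GGUF for local targets"

print(choose_format("Ollama"))  # GGUF (quantized, e.g. q4_k_m)
print(choose_format("vLLM"))    # safetensors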

Converting Between Formats

Safetensors to GGUF (Most Common)

When you fine-tune with Unsloth or Transformers, you get safetensors. To deploy on Ollama:

from unsloth import FastLanguageModel

# Load your fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./my_finetuned_model",  # safetensors location
    max_seq_length=2048,
    load_in_4bit=False,  # Load full precision for export
)

# Export to GGUF with quantization
model.save_pretrained_gguf(
    "task-api-model",
    tokenizer,
    quantization_method="q4_k_m",  # Choose quantization
)

Output:

Saving model to task-api-model/...
Converting to GGUF format...
Applying q4_k_m quantization...
Done! File: task-api-model/task-api-model-q4_k_m.gguf
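
Before shipping the file, a quick smoke test is worth the minute it takes (see the Safety Note at the end of this lesson). The sketch below assumes the llama-cpp-python bindings are installed and uses an illustrative path and prompt.

from llama_cpp import Llama  # assumes the llama-cpp-python package is installed

# Load the exported GGUF and run one short completion as a sanity check.
llm = Llama(model_path="task-api-model/task-api-model-q4_k_m.gguf", n_ctx=2048)
result = llm("Create a task to review the quarterly report.", max_tokens=64)
print(result["choices"][0]["text"])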

PyTorch to Safetensors

If you have legacy PyTorch checkpoints:

import torch
from safetensors.torch import save_file

# Load PyTorch model
state_dict = torch.load("model.bin", map_location="cpu")

# Remove optimizer state if present
if "model_state_dict" in state_dict:
    state_dict = state_dict["model_state_dict"]

# Save as safetensors
save_file(state_dict, "model.safetensors")

Output:

Saved model.safetensors (6.2 GB)
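
A quick verification step, sketched below, reloads both files and checks that the tensors match before you delete the original checkpoint; the file names follow the example above.

import torch
from safetensors.torch import load_file

# Reload both files and confirm the converted tensors match the originals.
original = torch.load("model.bin", map_location="cpu")
original = original.get("model_state_dict", original)  # unwrap training checkpoint if needed
converted = load_file("model.safetensors")

for name, tensor in converted.items():
    assert torch.equal(tensor, original[name]), f"Mismatch in {name}"
print(f"Verified {len(converted)} tensors")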

GGML to GGUF (Legacy Conversion)

If you have old GGML files:

# Use llama.cpp conversion script
python convert-llama-ggml-to-gguf.py \
    --input old_model.ggml \
    --output new_model.gguf

Format Comparison Summary

| Criterion | GGUF | Safetensors | PyTorch .bin |
|-----------|------|-------------|--------------|
| Security | Safe | Safe | Unsafe (pickle) |
| Loading speed | Fast | Fast | Slow |
| Memory efficiency | Excellent | Good | Poor |
| Quantization | Built-in | External | External |
| Single file | Yes | No (needs config) | No |
| Ollama compatible | Yes | No | No |
| vLLM compatible | No | Yes | Yes |
| Training compatible | No | Yes | Yes |

Reflect on Your Skill

Update your model-serving skill with format selection guidance:

## Export Format Selection

### Quick Reference

| Target | Format | Quantization |
|--------|--------|--------------|
| Ollama | GGUF | q4_k_m (default) |
| vLLM | safetensors | awq or gptq |
| Hugging Face | safetensors | none |
| Training | safetensors | none |

### Conversion Commands

Safetensors → GGUF:
model.save_pretrained_gguf("output", tokenizer, quantization_method="q4_k_m")

PyTorch → Safetensors:
save_file(state_dict, "model.safetensors")

Try With AI

Use your AI companion (Claude, ChatGPT, Gemini, or similar).

Prompt 1: Analyze Your Deployment Scenario

I have a fine-tuned Task API model (3B parameters) stored as safetensors.
I want to:
1. Serve it locally on my laptop (16GB RAM, no GPU)
2. Also deploy to a cloud GPU for high-traffic API

Help me create a deployment plan:
- Which format for each target?
- What quantization level for each?
- What conversion steps do I need?

What you are learning: Multi-target deployment planning. Real applications often need to serve the same model in different environments.

Prompt 2: Troubleshoot Format Issues

I'm trying to load a model into Ollama but getting errors. The model was
downloaded from Hugging Face and has these files:
- config.json
- model.safetensors
- tokenizer.json
- tokenizer_config.json

What steps do I need to convert this for Ollama? What could go wrong?

What you are learning: Format conversion troubleshooting. Many deployment problems stem from format mismatches.

Prompt 3: Evaluate Format Tradeoffs

I need to choose between two deployment approaches for my Task API model:
A) Convert to GGUF Q4_K_M (~4GB) and serve with Ollama
B) Keep as safetensors and serve with vLLM on cloud GPU

Help me analyze:
1. What is the quality difference between Q4 and full precision?
2. What is the cost difference (local vs cloud)?
3. What is the latency difference?
4. Which would you recommend for a startup with limited budget?

What you are learning: Strategic format decisions. Format choice affects cost, quality, and scalability.

Safety Note

When converting models, always verify the output works correctly before deploying to production. Run a few test prompts through the converted model to ensure responses match expectations. Conversion bugs can cause subtle quality degradation that is not immediately obvious.