Quantization for Inference

Your fine-tuned model is 14GB in full precision. Your laptop has 8GB of RAM. Something has to give. Quantization reduces model size by representing weights with fewer bits, trading some quality for a dramatically smaller memory footprint and faster inference.

This lesson helps you navigate the quantization landscape. You will learn when Q4 is good enough, when Q8 is worth the extra memory, and how to measure the quality difference for your specific use case.

The Quantization Tradeoff

Quantization compresses model weights from their original precision (usually 16-bit floating point) to lower bit representations. The math is straightforward: fewer bits means smaller files.

| Precision | Bits per Weight | Relative Size | Relative Quality |
|-----------|-----------------|---------------|------------------|
| F16 (full) | 16 bits | 100% (baseline) | 100% |
| Q8_0 | 8 bits | ~50% | ~99% |
| Q6_K | 6 bits | ~40% | ~98% |
| Q5_K_M | 5 bits | ~35% | ~97% |
| Q4_K_M | 4 bits | ~30% | ~95% |
| Q4_0 | 4 bits | ~25% | ~92% |
| Q2_K | 2 bits | ~15% | ~80% |

The quality column is approximate. Actual quality loss depends on the model architecture, training data, and your specific task. A model fine-tuned for structured outputs may tolerate more aggressive quantization than one designed for creative writing.
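
To get a feel for the size column, you can estimate file size directly from parameter count and bits per weight. The sketch below uses rough effective bits per weight for each format (these values and the helper itself are my own approximations; real GGUF files add metadata and per-block scale overhead), so treat the results as ballpark figures.

# Rough size estimate: parameters x effective bits per weight / 8
# The effective-bit values below are approximations, not exact GGUF figures.
EFFECTIVE_BITS = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q4_0": 4.5,
    "Q2_K": 2.6,
}

def estimate_size_gb(num_params: float, quant: str) -> float:
    """Approximate model file size in GB for a given quantization."""
    return num_params * EFFECTIVE_BITS[quant] / 8 / 1e9

for quant in EFFECTIVE_BITS:
    print(f"7B at {quant}: ~{estimate_size_gb(7e9, quant):.1f} GB")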

Understanding K-Quants

Not all quantization at the same bit level is equal. The "K" in formats like Q4_K_M indicates a k-quant, which uses a more sophisticated approach than simple round-to-nearest quantization.

How K-Quants Work

Standard quantization (Q4_0) applies the same precision reduction uniformly to every layer. K-quants recognize that some layers are more sensitive to precision loss than others and allocate bits accordingly.

Standard Quantization (Q4_0):
┌───────────────────────────────────────────────┐
│ All layers → 4-bit uniform quantization        │
│ Simple, fast, but more quality loss            │
└───────────────────────────────────────────────┘

K-Quant (Q4_K_M):
┌───────────────────────────────────────────────┐
│ Attention layers → Higher precision (5-6 bit)  │
│ FFN layers → Lower precision (4 bit)           │
│ Embeddings → Preserved quality                 │
│ Result: Same average bits, better quality      │
└───────────────────────────────────────────────┘
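
A quick back-of-the-envelope calculation shows how mixed precision keeps the average bit width low while protecting sensitive layers. The layer split and bit widths below are illustrative assumptions, not the exact allocation llama.cpp uses.

# Illustrative mixed-precision average (assumed split, not llama.cpp's exact one)
layer_groups = [
    # (fraction of total weights, bits per weight)
    (0.30, 6.0),  # attention projections kept at higher precision
    (0.65, 4.0),  # feed-forward weights quantized more aggressively
    (0.05, 8.0),  # embeddings / output head preserved
]

avg_bits = sum(frac * bits for frac, bits in layer_groups)
print(f"Average bits per weight: {avg_bits:.2f}")  # 4.80, close to Q4_K_M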

K-Quant Variants

The suffix indicates the aggressiveness of the quantization:

| Variant | Size | Quality | Use Case |
|---------|------|---------|----------|
| Q4_K_S | Smallest | Lower | Extreme memory constraints |
| Q4_K_M | Medium | Better | Default recommendation |
| Q5_K_S | Smaller | Good | Balance size and quality |
| Q5_K_M | Medium | Better | Quality-sensitive applications |
| Q6_K | Larger | Excellent | Near-lossless inference |

Recommendation: Start with Q4_K_M. It provides the best quality-to-size ratio for most use cases. Move to Q5_K_M if you notice quality issues, or Q6_K if you have sufficient memory and need maximum quality.
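
If you want that guidance as something executable, here is a minimal sketch of the decision heuristic; the function name, thresholds, and the assumed sizes for a ~7B model are my own assumptions, so adjust them to your models and hardware.

def pick_quant(memory_budget_gb: float, quality_sensitive: bool = False) -> str:
    """Heuristic sketch of the recommendation above (sizes assume a ~7B model)."""
    if quality_sensitive and memory_budget_gb >= 7.5:
        return "Q8_0"     # ~7 GB, near-lossless
    if memory_budget_gb >= 6:
        return "Q6_K"     # ~5.5 GB
    if quality_sensitive and memory_budget_gb >= 5:
        return "Q5_K_M"   # ~4.8 GB
    return "Q4_K_M"       # ~4.1 GB, the default recommendation

print(pick_quant(5.0))                          # Q4_K_M
print(pick_quant(5.0, quality_sensitive=True))  # Q5_K_M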

Measuring Quality Loss

Quality loss from quantization is not just a theoretical concern. You can measure it using perplexity, a standard metric for language model evaluation.

Perplexity Explained

Perplexity measures how well the model predicts the next token. Lower is better. A perplexity of 10 means the model is, on average, as uncertain as choosing between 10 equally likely options.

# Conceptual perplexity calculation
import math

def calculate_perplexity(log_probs):
    """
    log_probs: list of log probabilities for each token
    Lower perplexity = better prediction
    """
    avg_log_prob = sum(log_probs) / len(log_probs)
    perplexity = math.exp(-avg_log_prob)
    return perplexity

Output:

Full precision model: perplexity = 5.23
Q4_K_M model: perplexity = 5.41
Quality loss: ~3.4%
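
The quality-loss figure is just the relative perplexity increase, which you can verify from the numbers above:

# Relative perplexity increase between full precision and Q4_K_M
ppl_f16, ppl_q4 = 5.23, 5.41
quality_loss = (ppl_q4 - ppl_f16) / ppl_f16 * 100
print(f"Quality loss: ~{quality_loss:.1f}%")  # ~3.4%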

Benchmark Results

Real-world benchmarks from llama.cpp on a 7B parameter model:

| Quantization | File Size | PPL (WikiText) | Load Time | Inference Speed |
|--------------|-----------|----------------|-----------|-----------------|
| F16 | 13.5 GB | 5.23 | 12s | 1.0x (baseline) |
| Q8_0 | 7.2 GB | 5.25 | 6s | 1.3x faster |
| Q6_K | 5.5 GB | 5.28 | 5s | 1.5x faster |
| Q5_K_M | 4.8 GB | 5.31 | 4s | 1.7x faster |
| Q4_K_M | 4.1 GB | 5.41 | 3s | 2.0x faster |
| Q4_0 | 3.8 GB | 5.68 | 3s | 2.2x faster |

The Q4_K_M sweet spot: 70% size reduction, 3% perplexity increase, 2x speed improvement.

Hardware-Based Selection

Your hardware dictates what quantization levels are practical.

RAM-Only Inference (No GPU)

When running on CPU with system RAM:

Available RAM → Maximum Model Size → Quantization Choice

8 GB RAM:
└─► 7B model: Q4_K_M (~4GB) ✓
└─► 7B model: Q8_0 (~7GB) ✓ (tight)
└─► 13B model: Q4_K_M (~8GB) ✗ (no room for OS)

16 GB RAM:
└─► 7B model: Q8_0 (~7GB) ✓
└─► 7B model: F16 (~14GB) ✗ (leaves ~2GB for OS)
└─► 13B model: Q4_K_M (~8GB) ✓
└─► 13B model: Q5_K_M (~9GB) ✓

32 GB RAM:
└─► 7B model: F16 (~14GB) ✓
└─► 13B model: Q8_0 (~14GB) ✓
└─► 34B model: Q4_K_M (~20GB) ✓

Rule of thumb: Leave 4-6 GB free for the operating system and other applications. Model size should be at most (RAM - 6GB).
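
The same rule of thumb is easy to encode. The helper below is a hypothetical sketch (the function name, the 6 GB reserve, and the ~4.8 effective bits per weight for Q4_K_M are assumptions) for checking whether a model fits your RAM budget:

OS_RESERVE_GB = 6  # headroom for the operating system and other applications

def fits_in_ram(ram_gb: float, num_params: float, bits_per_weight: float = 4.8) -> bool:
    """Check an estimated model size against the (RAM - reserve) budget."""
    model_gb = num_params * bits_per_weight / 8 / 1e9
    budget_gb = ram_gb - OS_RESERVE_GB
    print(f"Model ~{model_gb:.1f} GB vs budget {budget_gb:.1f} GB")
    return model_gb <= budget_gb

fits_in_ram(16, 13e9)  # 13B at Q4_K_M: ~7.8 GB vs 10.0 GB budget -> True
fits_in_ram(8, 13e9)   # 13B at Q4_K_M: ~7.8 GB vs  2.0 GB budget -> False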

GPU Inference (VRAM)

With GPU acceleration, you need to fit the model in VRAM for full speed:

VRAM → Model + Context → Quantization

8 GB VRAM (RTX 3070, RTX 4060):
└─► 7B Q4_K_M + 4K context ✓
└─► 7B Q8_0 + 2K context ✓
└─► 13B Q4_K_M ✗ (need ~10GB)

12 GB VRAM (RTX 3080, RTX 4070):
└─► 7B Q8_0 + 8K context ✓
└─► 13B Q4_K_M + 4K context ✓
└─► 13B Q5_K_M + 2K context ✓

24 GB VRAM (RTX 3090, RTX 4090):
└─► 13B Q8_0 + 8K context ✓
└─► 34B Q4_K_M + 4K context ✓
└─► 70B Q4_K_M ✗ (need ~40GB)
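
The "+ context" part of the budget is the KV cache, which grows linearly with context length. The sketch below estimates it from typical architecture dimensions; the defaults (32 layers, 32 KV heads, head dimension 128, 16-bit cache) are assumptions modeled on common Llama-style 7B configurations, and models using grouped-query attention need proportionally less.

# Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x context x bytes
def kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128,
                context_len=4096, bytes_per_elem=2):
    """Estimate KV-cache memory in GB (assumed 7B-style dimensions)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

print(f"4K context: ~{kv_cache_gb():.1f} GB KV cache")                  # ~2.1 GB
print(f"8K context: ~{kv_cache_gb(context_len=8192):.1f} GB KV cache")  # ~4.3 GB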

Apple Silicon

Apple Silicon shares memory between CPU and GPU, enabling larger models:

| Mac | Unified Memory | Recommended Max |
|-----|----------------|-----------------|
| M1/M2 (8GB) | 8 GB | 7B Q4_K_M |
| M1/M2 (16GB) | 16 GB | 13B Q4_K_M or 7B Q8_0 |
| M1/M2 Pro (32GB) | 32 GB | 34B Q4_K_M or 13B Q8_0 |
| M1/M2 Max (64GB) | 64 GB | 70B Q4_K_M |
| M2 Ultra (192GB) | 192 GB | 70B Q8_0 or 70B F16 |

Converting to GGUF with Quantization

Method 1: Unsloth Export

If you fine-tuned with Unsloth, use the built-in export:

from unsloth import FastLanguageModel

# Load your fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./task-api-model",
    max_seq_length=2048,
    load_in_4bit=False,  # Export from full precision
)

# Export with Q4_K_M quantization
model.save_pretrained_gguf(
    "task-api-q4",
    tokenizer,
    quantization_method="q4_k_m"
)

# For higher quality, use Q5_K_M
model.save_pretrained_gguf(
    "task-api-q5",
    tokenizer,
    quantization_method="q5_k_m"
)

Output:

Saving model to task-api-q4/...
Converting to GGUF format...
Applying q4_k_m quantization...
Done! File: task-api-q4/task-api-model-q4_k_m.gguf (4.1 GB)

Saving model to task-api-q5/...
Converting to GGUF format...
Applying q5_k_m quantization...
Done! File: task-api-q5/task-api-model-q5_k_m.gguf (4.8 GB)

Method 2: llama.cpp Direct Conversion

For models not trained with Unsloth:

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Install Python dependencies for the conversion script
pip install -r requirements.txt

# Build the quantization tool (requires CMake and a C/C++ toolchain)
cmake -B build
cmake --build build --config Release

# Convert safetensors to GGUF (F16 first)
python convert-hf-to-gguf.py \
    ../task-api-model \
    --outfile ../task-api-f16.gguf \
    --outtype f16

# Quantize to Q4_K_M
./build/bin/llama-quantize \
    ../task-api-f16.gguf \
    ../task-api-q4_k_m.gguf \
    Q4_K_M

Output:

Loading model from ../task-api-f16.gguf
Quantizing to Q4_K_M...
Writing output to ../task-api-q4_k_m.gguf
Original size: 13.5 GB
Quantized size: 4.1 GB
Compression ratio: 3.29x
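
You can confirm the compression ratio yourself from the files on disk. A small sketch, assuming the paths match the commands above:

import os

# Paths assumed to match the conversion commands above
f16_path = "../task-api-f16.gguf"
q4_path = "../task-api-q4_k_m.gguf"

f16_gb = os.path.getsize(f16_path) / 1e9
q4_gb = os.path.getsize(q4_path) / 1e9
print(f"Original:  {f16_gb:.1f} GB")
print(f"Quantized: {q4_gb:.1f} GB")
print(f"Compression ratio: {f16_gb / q4_gb:.2f}x")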

Method 3: Multiple Quantization Levels

Create multiple versions for different deployment targets:

# Create all common quantization levels
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    ./build/bin/llama-quantize \
        ../task-api-f16.gguf \
        ../task-api-${QUANT}.gguf \
        $QUANT
done

Output:

Created: task-api-Q4_K_M.gguf (4.1 GB)
Created: task-api-Q5_K_M.gguf (4.8 GB)
Created: task-api-Q6_K.gguf (5.5 GB)
Created: task-api-Q8_0.gguf (7.2 GB)

Verifying Quantization Quality

After quantization, verify the model still performs acceptably:

# Test the quantized model
import requests

def test_model_quality(model_name: str, test_prompts: list[str]):
    """Compare outputs from quantized model against expected behavior."""
    for prompt in test_prompts:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model_name,
                "prompt": prompt,
                "stream": False
            }
        )
        result = response.json()
        print(f"Prompt: {prompt[:50]}...")
        print(f"Response: {result['response'][:200]}...")
        print("---")

# Test prompts specific to the Task API model
test_prompts = [
    "Create a task: Review quarterly report by Friday",
    "List all high-priority tasks",
    "Mark task 3 as complete",
]

test_model_quality("task-api:q4_k_m", test_prompts)

Output:

Prompt: Create a task: Review quarterly report by Fri...
Response: {"action": "create_task", "title": "Review quarterly report", "due_date": "Friday", "priority": "normal"}...
---
Prompt: List all high-priority tasks...
Response: {"action": "list_tasks", "filter": {"priority": "high"}}...
---
Prompt: Mark task 3 as complete...
Response: {"action": "update_task", "id": 3, "status": "complete"}...
---
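
For a model that is supposed to emit structured JSON, spot-reading responses is not enough; you can also check that every response parses. A minimal sketch, reusing the test_prompts list and the same Ollama endpoint as above (the helper name is my own):

import json
import requests

def check_json_validity(model_name: str, prompts: list[str]) -> float:
    """Return the fraction of responses that parse as valid JSON."""
    valid = 0
    for prompt in prompts:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model_name, "prompt": prompt, "stream": False},
        )
        text = response.json()["response"]
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            print(f"Invalid JSON for prompt: {prompt[:50]}...")
    return valid / len(prompts)

rate = check_json_validity("task-api:q4_k_m", test_prompts)
print(f"Valid JSON rate: {rate:.0%}")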

Reflect on Your Skill

Update your model-serving skill with quantization selection guidance:

## Quantization Selection

### Quick Decision Guide

| Your Situation | Recommended Quant |
|----------------|-------------------|
| First deployment, unsure | Q4_K_M |
| Quality-sensitive app | Q5_K_M or Q6_K |
| Memory-constrained | Q4_K_S |
| Maximum quality needed | Q8_0 |
| Research/development | F16 |

### Memory Requirements

7B model:
- Q4_K_M: ~4GB
- Q5_K_M: ~5GB
- Q8_0: ~7GB
- F16: ~14GB

### Quality Verification

Always test quantized model with domain-specific prompts.
Perplexity increase of 5-10% is usually acceptable.
If outputs degrade noticeably, move up one quantization level.

Try With AI

Use your AI companion (Claude, ChatGPT, Gemini, or similar).

Prompt 1: Calculate Your Budget

I have a fine-tuned 7B model stored as safetensors (13.5 GB).
My target hardware:
- MacBook Pro M2 with 16GB unified memory
- Need to leave 4GB free for other applications
- Want to support 4096 token context length

Help me:
1. Calculate how much memory the model can use
2. Recommend a quantization level
3. Estimate the final file size and inference speed

What you are learning: Hardware-constrained optimization. You are building the habit of calculating memory budgets before deployment.

Prompt 2: Compare Quality Impact

I am deploying a Task API model that generates structured JSON for
task management. The model was fine-tuned to output specific JSON schemas.

Compare Q4_K_M vs Q5_K_M for this use case:
1. How might quantization affect JSON structure accuracy?
2. Is structured output more or less sensitive to quantization than free-form text?
3. What tests would verify the quantized model still produces valid JSON?

What you are learning: Task-specific quality analysis. Different applications have different tolerance for quantization quality loss.

Prompt 3: Troubleshoot Quality Issues

After converting my model to Q4_K_M, I'm seeing some issues:
- Some JSON outputs are malformed (missing closing braces)
- Occasional nonsense tokens in responses
- Works fine for simple prompts but fails on complex ones

What could cause these issues? Should I:
A) Use a higher quantization level (Q5_K_M)?
B) Check my conversion process for errors?
C) Look for issues in the original model?
D) Something else?

What you are learning: Quantization debugging. When quality degrades, you need to diagnose whether it is a quantization issue or something else.

Safety Note

Quantization is a one-way operation on the model weights. Always keep your original safetensors or full-precision GGUF file. If you discover quality issues later, you will need the original to re-quantize at a different level.