
Quantization Deep Dive

You calculated your VRAM budget in Lesson 4 and discovered a problem: Llama-3-8B needs 16GB in FP16, but your T4 GPU only has 15GB. Even without training overhead, the base model won't fit.

This is the situation most practitioners face: the models we want to run exceed the hardware we can afford. But here's the insight that changed everything: you don't need 16 bits per weight. With the right techniques, you can compress models to 4 bits per weight, a 4x memory reduction with surprisingly small quality loss.

This lesson teaches you to wield quantization as a precision tool, not a blunt hammer. By the end, you'll load Llama-3-8B in under 5GB of VRAM, leaving room for training overhead.

The Precision Hierarchy

Every number in a neural network has a representation. The question is: how many bits do we need?

| Format | Bits per Weight | VRAM for 8B Model | Precision Loss | Use Case |
|--------|-----------------|-------------------|-----------------|----------|
| FP32 | 32 bits | 32 GB | None (baseline) | Research, exact gradients |
| FP16 | 16 bits | 16 GB | Minimal | Standard training |
| INT8 | 8 bits | 8 GB | Low | Production inference |
| INT4 | 4 bits | 4 GB | Moderate | Memory-constrained training |
| NF4 | 4 bits | 4 GB | Low (optimized) | QLoRA fine-tuning |

Why Fewer Bits Work

Neural network weights cluster around zero in a roughly normal distribution. Most weights are small values like 0.0023 or -0.0156. Very few are large values like 5.7 or -12.3.

This distribution matters because:

  • FP16 allocates bits uniformly across all possible values
  • NF4 (Normal Float 4) allocates bits based on the actual distribution, giving more precision to common values near zero

Result: NF4 achieves better quality than naive 4-bit quantization because it's designed for how neural networks actually work.
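To make this concrete, here is a minimal sketch comparing uniform 4-bit levels against distribution-aware levels on synthetic, normally distributed weights. This is not the actual bitsandbytes implementation: the 16 levels are simply placed at quantiles of a standard normal rather than taken from the real NF4 table. On data like this, the quantile-based levels give a noticeably lower reconstruction error, which is the intuition behind NF4.

import torch

torch.manual_seed(0)
weights = torch.randn(100_000) * 0.02   # synthetic weights: roughly normal, clustered near zero

def quantize_to_levels(x, levels):
    """Snap each value to its nearest level after absmax scaling."""
    scale = x.abs().max()
    idx = ((x / scale).unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return levels[idx] * scale

# Naive INT4-style quantization: 16 evenly spaced levels in [-1, 1]
uniform_levels = torch.linspace(-1, 1, 16)

# NF4-like idea: 16 levels placed at quantiles of a standard normal (not the exact NF4 table)
probs = torch.linspace(0.02, 0.98, 16)
quantiles = torch.distributions.Normal(0.0, 1.0).icdf(probs)
normal_levels = quantiles / quantiles.abs().max()

for name, levels in [("uniform 4-bit", uniform_levels), ("normal-quantile 4-bit", normal_levels)]:
    mse = (weights - quantize_to_levels(weights, levels)).pow(2).mean()
    print(f"{name}: reconstruction MSE = {mse:.3e}")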

The Memory Math

Calculating savings is straightforward:

def calculate_model_memory(params_billions: float, bits_per_param: int) -> float:
    """Calculate model memory in GB.

    Args:
        params_billions: Model size in billions of parameters
        bits_per_param: Bits used per parameter (32, 16, 8, or 4)

    Returns:
        Memory requirement in GB
    """
    bytes_per_param = bits_per_param / 8
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / (1024 ** 3)

# Example: Llama-3-8B at different precisions
model_size = 8.0 # 8 billion parameters

print("Llama-3-8B Memory Requirements:")
print(f"FP32 (32-bit): {calculate_model_memory(model_size, 32):.1f} GB")
print(f"FP16 (16-bit): {calculate_model_memory(model_size, 16):.1f} GB")
print(f"INT8 (8-bit): {calculate_model_memory(model_size, 8):.1f} GB")
print(f"INT4 (4-bit): {calculate_model_memory(model_size, 4):.1f} GB")

Output:

Llama-3-8B Memory Requirements:
FP32 (32-bit): 29.8 GB
FP16 (16-bit): 14.9 GB
INT8 (8-bit): 7.5 GB
INT4 (4-bit): 3.7 GB

Notice the progression: each halving of bits halves the memory. Moving from FP16 to INT4 gives you 4x compression.

BitsAndBytes: The Practical Tool

BitsAndBytes is the library that makes quantization accessible. Developed by Tim Dettmers, it provides GPU-accelerated 4-bit and 8-bit operations that integrate directly with Hugging Face Transformers.

Installation

On Colab or your GPU machine:

pip install bitsandbytes transformers accelerate
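A quick sanity check that the install worked and that a GPU is visible (version numbers will differ on your machine):

import torch
import transformers
import bitsandbytes as bnb

print(f"torch: {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"transformers: {transformers.__version__}")
print(f"bitsandbytes: {bnb.__version__}")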

Basic 4-bit Loading

Here's how to load a model in 4-bit precision:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normal Float 4 - optimized for neural networks
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bfloat16 for stability
)

model_name = "meta-llama/Llama-3.2-1B-Instruct" # Start with smaller model for demo

# Load tokenizer (no quantization needed)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",  # Automatically place on GPU
)

print("Model loaded successfully!")
print(f"Model dtype: {model.dtype}")

Output:

Model loaded successfully!
Model dtype: torch.float32

Note: model.dtype reflects the dtype of the model's non-quantized parts (which default to FP32 here); the linear-layer weights themselves are stored in packed 4-bit NF4 form internally.
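If you want to see the 4-bit storage for yourself, check the memory footprint and inspect one of the quantized linear layers. The module path below assumes the standard Llama layer layout in Transformers; adjust it for other architectures.

# Total memory the loaded model occupies (get_memory_footprint returns bytes)
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")

# Inspect one projection layer (path assumes the Llama module layout)
layer = model.model.layers[0].self_attn.q_proj
print(type(layer))          # a bitsandbytes Linear4bit module, not a plain nn.Linear
print(layer.weight.dtype)   # packed 4-bit weights are stored as torch.uint8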

Understanding the Configuration

Each parameter in BitsAndBytesConfig serves a specific purpose:

| Parameter | Value | Purpose |
|-----------|-------|---------|
| load_in_4bit | True | Enable 4-bit quantization |
| bnb_4bit_quant_type | "nf4" | Use Normal Float 4 (better than plain INT4) |
| bnb_4bit_compute_dtype | bfloat16 | Dequantize to bfloat16 for computation |
| bnb_4bit_use_double_quant | True | Quantize the quantization constants too |

Double Quantization: The Extra Compression

Here's a subtlety that matters for memory-constrained environments: quantization itself requires storing scaling factors. For every block of weights (typically 64 or 128), you store one FP32 constant.

Double quantization quantizes these constants too, saving an additional 0.4 bits per parameter:

# Full QLoRA-ready configuration with double quantization
qlora_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Quantize the scaling factors
)

# Memory breakdown for an 8B model (decimal GB, approximate):
# 4-bit weights:                         8B × 4 bits       = 4.0 GB
# FP32 scales (one per 64-weight block): 8B / 64 × 32 bits ≈ 0.5 GB
# Without double quant:                  ≈ 4.5 bits/param  ≈ 4.5 GB
# With double quant:                     ≈ 4.1 bits/param  ≈ 4.1 GB

For an 8B model, double quantization saves roughly 400MB, enough to matter when you're at the edge of VRAM limits.
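To sanity-check these numbers in code, here is a small sketch of the same arithmetic. The 64-weight block size and the 256-block second-level grouping follow the QLoRA paper's description of double quantization; treat the output as an approximation, not an exact VRAM measurement.

def effective_bits_per_param(block_size=64, double_quant=True):
    """Approximate stored bits per parameter for NF4 quantization.

    Without double quantization: 4-bit weights plus one FP32 scale per block.
    With double quantization (per the QLoRA paper): the per-block scales are
    themselves quantized to 8 bits in groups of 256, keeping one FP32 scale per group.
    """
    if double_quant:
        return 4 + 8 / block_size + 32 / (block_size * 256)
    return 4 + 32 / block_size

for dq in (False, True):
    bits = effective_bits_per_param(double_quant=dq)
    gb = 8e9 * bits / 8 / 1e9   # decimal GB for an 8B-parameter model
    print(f"double_quant={dq}: {bits:.2f} bits/param ≈ {gb:.2f} GB for 8B parameters")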

Quantization Quality Analysis

The question everyone asks: "How much quality do I lose?"

Measuring Quality Loss

There are two approaches:

1. Perplexity Measurement (automatic, domain-general)

from datasets import load_dataset
import math

def measure_perplexity(model, tokenizer, texts, max_length=512):
    """Measure model perplexity on sample texts."""
    model.eval()
    total_loss = 0
    total_tokens = 0

    for text in texts[:10]:  # Sample 10 texts
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=max_length)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])

        total_loss += outputs.loss.item() * inputs["input_ids"].size(1)
        total_tokens += inputs["input_ids"].size(1)

    perplexity = math.exp(total_loss / total_tokens)
    return perplexity

# Lower perplexity = better model
# Typical results:
# - FP16: perplexity ~6-8
# - 4-bit NF4: perplexity ~6.5-8.5 (5-10% increase)
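A usage sketch, assuming you just want a quick number on a public corpus (WikiText-2 here); text sampled from your own domain is a more meaningful choice:

# Quick usage example on WikiText-2 (swap in texts from your own domain if you have them)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
texts = [t for t in dataset["text"] if len(t.strip()) > 200]  # skip empty and very short lines

ppl = measure_perplexity(model, tokenizer, texts)
print(f"Perplexity (4-bit NF4): {ppl:.2f}")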

2. Task-Specific Evaluation (manual, domain-specific)

def compare_outputs(model, tokenizer, prompt):
    """Generate response for qualitative comparison."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            temperature=0.7,
            do_sample=True,
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test prompt
prompt = """You are a task management assistant.

User: Create a task for reviewing the quarterly budget by Friday.
Assistant:"""

response = compare_outputs(model, tokenizer, prompt)
print(response)

Output:

You are a task management assistant.

User: Create a task for reviewing the quarterly budget by Friday.
Assistant: I'll create that task for you:

**Task:** Review Q4 Budget
**Due Date:** Friday
**Priority:** High
**Category:** Finance

Would you like me to set a reminder for Thursday to give you time to prepare?

Research Findings on Quality Loss

Published research on QLoRA (Dettmers et al., 2023) found:

| Benchmark | FP16 | 4-bit NF4 | Degradation |
|-----------|------|-----------|-------------|
| MMLU | 67.2% | 66.9% | -0.3% |
| HellaSwag | 85.1% | 84.8% | -0.4% |
| GSM8K | 52.3% | 51.8% | -1.0% |
For most applications, 4-bit quantization loses 1-2% accuracy—a tradeoff worth making when it enables running models at all.

When to Use Each Precision Level

The decision framework:

IF inference only AND quality critical:
→ Use INT8 (best quality/memory balance)

IF training AND memory constrained (Colab, consumer GPU):
→ Use 4-bit NF4 with double quantization (QLoRA)

IF training AND memory available (A100, H100):
→ Use FP16 or BF16 (no quantization needed)

IF research/debugging:
→ Use FP32 (exact gradients, reproducibility)
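If you prefer the framework as code, here is a small helper. The VRAM thresholds are rough cutoffs consistent with the decision table below, not hard rules:

def recommended_precision(training, vram_gb, quality_critical=False):
    """Encode the decision framework above (thresholds follow the table below)."""
    if not training:
        return "INT8" if quality_critical else "INT8 (or 4-bit NF4 if memory is tight)"
    if vram_gb < 24:       # Colab T4, consumer GPUs
        return "4-bit NF4 with double quantization (QLoRA)"
    if vram_gb >= 40:      # A100/H100 class
        return "FP16/BF16, no quantization needed"
    return "4-bit NF4 or INT8, depending on quality needs"

print(recommended_precision(training=True, vram_gb=15))   # the T4 scenario from this lesson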

Practical Decision Table

| Your Situation | Recommended | Memory for 8B | Why |
|----------------|-------------|---------------|-----|
| T4 GPU (15GB) | 4-bit NF4 | ~5GB | Only option that fits with training overhead |
| A10 GPU (24GB) | 4-bit NF4 or INT8 | ~5-8GB | Comfortable margin; choose based on quality needs |
| A100 (40GB) | FP16/BF16 | ~16GB | No need to quantize; maximum quality |
| Inference only | INT8 | ~8GB | Better quality than 4-bit when not training |

Verifying Your Quantized Model

After loading, verify the model works correctly:

def verify_quantized_model(model, tokenizer):
    """Quick sanity check for quantized model."""

    # 1. Check model device
    print(f"Model device: {next(model.parameters()).device}")

    # 2. Check memory usage
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        print(f"VRAM allocated: {allocated:.2f} GB")
        print(f"VRAM reserved: {reserved:.2f} GB")

    # 3. Quick generation test
    test_prompt = "The capital of France is"
    inputs = tokenizer(test_prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=10)

    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Test generation: {result}")

    # 4. Check if output is reasonable
    if "Paris" in result:
        print("Model verification: PASSED")
    else:
        print("Model verification: Check output quality")

verify_quantized_model(model, tokenizer)

Output:

Model device: cuda:0
VRAM allocated: 1.24 GB
VRAM reserved: 1.38 GB
Test generation: The capital of France is Paris, which is also the largest city
Model verification: PASSED

Common Quantization Issues

Issue 1: CUDA Out of Memory on Load

# WRONG: Loading full model first, then quantizing
model = AutoModelForCausalLM.from_pretrained(model_name) # OOM here!
model = quantize(model)

# CORRECT: Quantization during load
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)

Issue 2: Slow Generation

Quantized models can be slower than FP16 because of dequantization overhead:

# Enable optimizations for faster inference
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,               # Faster compute dtype
    attn_implementation="flash_attention_2",  # If supported
)
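If you want to quantify the slowdown on your own hardware, here is a rough timing sketch (single prompt, wall clock, default greedy decoding; not a rigorous benchmark, and it assumes a CUDA device as in the rest of this lesson):

import time

def tokens_per_second(model, tokenizer, prompt, new_tokens=64):
    """Rough wall-clock generation speed for a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=8)  # warm-up so startup cost doesn't skew timing
    torch.cuda.synchronize()
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    generated = outputs.shape[1] - inputs["input_ids"].shape[1]
    return generated / elapsed

print(f"Generation speed: {tokens_per_second(model, tokenizer, 'The capital of France is'):.1f} tokens/sec")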

Issue 3: Quality Issues with Specific Tasks

Some tasks are more sensitive to quantization:

  • Math reasoning: More affected (use INT8 if possible)
  • Creative writing: Less affected (4-bit usually fine)
  • Code generation: Moderate (test on your domain)

When quality matters, always compare against non-quantized baseline on your specific use case.
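One way to do that comparison is to load the same checkpoint a second time without quantization (this needs roughly 4x the VRAM, so use a larger GPU or a smaller model) and run identical domain prompts through both versions, reusing compare_outputs from earlier. The prompts below are placeholders for your own:

# Load an unquantized baseline of the same checkpoint (roughly 4x the VRAM of the 4-bit version)
baseline_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Placeholder prompts: substitute examples from your own domain
domain_prompts = [
    "Classify this support ticket as billing, technical, or account: ...",
    "Solve step by step: a project has 3 phases of 12, 8, and 15 days. How long in total?",
]

for prompt in domain_prompts:
    print("--- 4-bit NF4 ---")
    print(compare_outputs(model, tokenizer, prompt))
    print("--- bf16 baseline ---")
    print(compare_outputs(baseline_model, tokenizer, prompt))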

Try With AI

Use your AI companion (Claude, ChatGPT, or Gemini).

Prompt 1: Calculate Quantization Savings

I have an NVIDIA RTX 3090 with 24GB VRAM. I want to run Mistral-7B
for fine-tuning with QLoRA.

Help me calculate:
1. Memory needed for the base model at FP16 vs 4-bit NF4
2. How much headroom I'll have for training overhead (gradients, activations)
3. What batch size might be feasible

Walk me through the calculations step by step, and tell me if my hardware
choice is reasonable for this task.

What you're learning: Applying the VRAM calculation framework to your specific hardware scenario, with AI helping verify your reasoning.

Prompt 2: Debug a Quantization Problem

I tried loading Llama-3-8B with BitsAndBytes on Colab, but I'm getting
this error:

"RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB"

Here's my code:
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    load_in_4bit=True
)

What's wrong? I thought 4-bit quantization should fit in 15GB easily.
Ask me clarifying questions about my setup if needed, then help me fix it.

What you're learning: Iterative debugging with AI—providing context, receiving diagnostic questions, and converging on a solution.

Prompt 3: Design a Quality Evaluation Plan

I'm fine-tuning a model for customer support ticket classification.
The quantized model needs to maintain at least 95% of the original
accuracy on my test set.

Help me design an evaluation plan:
1. What metrics should I track?
2. How should I compare FP16 vs 4-bit versions?
3. What's my fallback if 4-bit quality is too low?

Challenge me if my 95% threshold is too strict or too lenient
for this use case.

What you're learning: Strategic planning for quantization—when to accept quality tradeoffs and when to find alternatives.

Safety Note

When loading quantized models, verify outputs on your specific use case before deploying. Quantization affects different tasks differently—what works for general chat may not work for specialized domains like medical or legal text.