
VRAM Budget Planning

You've counted the parameters. You know an 8B model has 8 billion weights. Now the critical question: how much GPU memory do you actually need to train it? The answer isn't just "8 billion times bytes per parameter." Training requires memory for weights, gradients, optimizer states, and activations. Getting this calculation wrong means your training crashes with "CUDA out of memory."

This lesson teaches you to plan your VRAM budget before starting any training job. You'll learn the components of GPU memory usage, how to calculate requirements for different scenarios, and what optimization techniques let you fit larger models on smaller GPUs.


Why VRAM Planning Matters

Consider this common scenario:

You: Load 8B model on T4 (15GB VRAM) for training
GPU: CUDA out of memory
You: But 8B at FP16 is only 16GB... wait, 16GB > 15GB
GPU: Also you forgot gradients, optimizer, and activations
You: How much memory do I actually need?

The model weights are just the starting point. Training adds significant memory overhead.

The VRAM Stack

GPU Memory Usage During Training:
┌─────────────────────────────────────────────────────────────────┐
│ Model Weights (base memory) │
├─────────────────────────────────────────────────────────────────┤
│ Gradients (same size as weights for full fine-tuning) │
├─────────────────────────────────────────────────────────────────┤
│ Optimizer States (2× weights for AdamW) │
├─────────────────────────────────────────────────────────────────┤
│ Activations (scales with batch size × sequence length) │
├─────────────────────────────────────────────────────────────────┤
│ CUDA Kernels + Workspace (~500MB overhead) │
└─────────────────────────────────────────────────────────────────┘

Let's calculate each component.


Component 1: Model Weights

The foundation of memory usage. You learned this in Lesson 3:

Weights Memory = Parameters × Bytes per parameter

Precision     Bytes/Param    8B Model
FP32          4              32 GB
FP16/BF16     2              16 GB
INT8          1              8 GB
INT4/NF4      0.5            4 GB

Key insight: Quantization is powerful. Going from FP16 to NF4 reduces weight memory by 4×.
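
A couple of lines of Python reproduce the table above; only the parameter count and the byte width per value matter:

# Weight memory = parameters × bytes per parameter
params = 8e9  # an 8B model
for precision, nbytes in {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4/nf4": 0.5}.items():
    print(f"{precision:>10}: {params * nbytes / 1e9:.0f} GB")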


Component 2: Gradients

During backpropagation, you compute gradients for each parameter. For full fine-tuning:

Gradient Memory = Parameters × Bytes per gradient

Gradients typically stored at FP16:

Model       Gradient Memory (FP16)
8B model    8B × 2 = 16 GB
7B model    7B × 2 = 14 GB

Why this matters: Full fine-tuning doubles your weight memory just for gradients.

LoRA Reduces Gradient Memory

With LoRA, you only compute gradients for the adapter parameters:

LoRA Parameters = 2 × rank × hidden_dim × num_target_modules × layers

For Llama-8B with LoRA rank 16, targeting Q and V:

LoRA params = 2 × 16 × 4,096 × 2 × 32 = 8.4M parameters
Gradient memory = 8.4M × 2 bytes = 16.8 MB

Compare: 16 GB (full) vs 16.8 MB (LoRA) = 1000× reduction in gradient memory.
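
You don't have to do this arithmetic by hand: once a model is wrapped with PEFT, it reports the trainable-parameter count directly. A minimal sketch, assuming you have access to the (gated) Llama 3 weights on Hugging Face; any causal LM works the same way:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,
    target_modules=["q_proj", "v_proj"],  # Q and V projections, as in the example above
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Reports trainable vs. total parameters (a few million trainable for rank 16 on Q and V)
model.print_trainable_parameters()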


Component 3: Optimizer States

Optimizers like AdamW maintain state for each trainable parameter:

State                Purpose                  Memory
First moment (m)     Momentum                 1× parameters (FP32)
Second moment (v)    Adaptive learning rate   1× parameters (FP32)

For full fine-tuning with AdamW:

Optimizer Memory = Parameters × 8 bytes (two FP32 tensors)
8B model = 8B × 8 = 64 GB

This is often the memory killer. You need 64 GB just for optimizer states on an 8B model.

LoRA Reduces Optimizer Memory

With LoRA's 8.4M trainable parameters:

Optimizer memory = 8.4M × 8 = 67 MB

Compare: 64 GB (full) vs 67 MB (LoRA) = ~1000× reduction.
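
If you do need full fine-tuning, an 8-bit AdamW (via bitsandbytes) is a common middle ground: both moment tensors are stored in 8 bits instead of 32, roughly quartering optimizer-state memory. A minimal sketch, assuming bitsandbytes is installed:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    optim="paged_adamw_8bit",  # 8-bit AdamW states via bitsandbytes (~2 bytes/param instead of 8)
)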


Component 4: Activations

Activations are the intermediate values computed during the forward pass; they must be kept around for the backward pass.

Activation memory depends on:

  • Batch size
  • Sequence length
  • Number of layers
  • Hidden dimension

Rough estimate:

Activations ≈ 2 × layers × batch_size × seq_length × hidden_dim × bytes

For Llama-8B with batch_size=1, seq_length=2048:

Activations = 2 × 32 × 1 × 2,048 × 4,096 × 2 bytes
= 1.07 GB per sample

Batch size scales linearly: batch_size=4 → 4.3 GB of activations.
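
The same estimate as a small helper, using the Llama-8B shapes from above; it is handy when sizing the batch/sequence trade-offs discussed later:

def activation_gb(batch_size: int, seq_length: int,
                  num_layers: int = 32, hidden_dim: int = 4096,
                  bytes_per_value: int = 2) -> float:
    """Rough activation-memory estimate (GB) from the formula above, with Llama-8B defaults."""
    return 2 * num_layers * batch_size * seq_length * hidden_dim * bytes_per_value / 1e9

print(f"{activation_gb(1, 2048):.2f} GB")   # ≈ 1.07 GB for batch 1
print(f"{activation_gb(4, 2048):.2f} GB")   # ≈ 4.29 GB for batch 4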

Gradient Checkpointing Trades Compute for Memory

Instead of storing all activations, recompute them during the backward pass:

Without checkpointing: store activations for every layer (high memory)
With checkpointing: store only every N-th layer's activations and recompute the rest during the backward pass (lower memory, ~30% slower)

Memory savings: Often 60-70% reduction in activation memory.
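
Enabling it in transformers is a one-line change; a minimal sketch (gpt2 is just a small stand-in model so the example runs anywhere):

from transformers import AutoModelForCausalLM, TrainingArguments

# Option 1: enable recomputation directly on the model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()

# Option 2: let the Trainer turn it on for you
training_args = TrainingArguments(
    output_dir="./output",
    gradient_checkpointing=True,
)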


The Complete VRAM Formula

Putting it all together for full fine-tuning:

Total VRAM = Weights + Gradients + Optimizer + Activations + Overhead

= (P × Bw) + (P × Bg) + (P × Bo) + (2 × L × B × S × H × Ba) + 0.5GB

Where:
P = Parameters
Bw = Bytes per weight (2 for FP16, 0.5 for 4-bit)
Bg = Bytes per gradient (2 for FP16)
Bo = Bytes for optimizer (8 for AdamW)
L = Layers
B = Batch size
S = Sequence length
H = Hidden dimension
Ba = Bytes per activation (2 for FP16)

Example: Full Fine-Tuning Llama-8B

Parameters: 8B
Precision: FP16
Batch size: 1
Sequence: 2048
Layers: 32
Hidden: 4096

Weights: 8B × 2 = 16 GB
Gradients: 8B × 2 = 16 GB
Optimizer: 8B × 8 = 64 GB
Activations: 2 × 32 × 1 × 2048 × 4096 × 2 = 1 GB
Overhead: 0.5 GB

Total: 16 + 16 + 64 + 1 + 0.5 = 97.5 GB

Result: Full fine-tuning Llama-8B requires ~100 GB VRAM. That's an A100 80GB... times two.


Making It Fit: QLoRA

QLoRA (Quantized LoRA) combines two techniques:

  1. 4-bit quantization: Reduce weight memory
  2. LoRA: Reduce gradient and optimizer memory

QLoRA VRAM Calculation

Quantized Weights: 8B × 0.5 = 4 GB
LoRA Gradients: 8.4M × 2 = 0.017 GB
LoRA Optimizer: 8.4M × 8 = 0.067 GB
Activations: ~1 GB (with gradient checkpointing)
Overhead: 0.5 GB

Total: 4 + 0.017 + 0.067 + 1 + 0.5 ≈ 5.6 GB

Result: QLoRA drops the requirement from 97.5 GB to about 5.6 GB. This fits on a T4 with room to spare!

Why QLoRA Works

The magic: the quantized weights are used in the forward pass, but gradients flow only through the small LoRA adapters (see the sketch after this list). You get:

  • Memory of a 4-bit model
  • Trainability (can backprop through LoRA)
  • Quality approaching full fine-tuning
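
A minimal sketch of loading the quantized base model (the model id is illustrative and gated on Hugging Face; the LoRA adapters are then attached exactly as in the configuration shown later):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute runs in bf16 while weights stay 4-bit
)

# ~4 GB of weight memory for an 8B model instead of ~16 GB at FP16
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)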

Batch Size and Sequence Length Trade-offs

Given fixed GPU memory, you can trade batch size for sequence length:

Activation memory ∝ batch_size × sequence_length
Estimated fit for QLoRA on an 8B model with a 15 GB T4 (from the activation formula above, no gradient checkpointing; actual usage varies):

Seq Length    Batch 1    Batch 4    Batch 8
512           Fits       Fits       Fits
1024          Fits       Fits       Fits
2048          Fits       Fits       Maybe
4096          Fits       Maybe      No

For QLoRA on T4 (15GB), typical configurations:

  • Batch 1, Seq 2048: ~6 GB (comfortable)
  • Batch 4, Seq 512: ~7 GB (comfortable)
  • Batch 2, Seq 1024: ~6.5 GB (comfortable)

Gradient Accumulation: Larger Effective Batches

Can't fit batch_size=8 in memory? Use gradient accumulation:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=2,   # actual batch held in memory
    gradient_accumulation_steps=4,   # accumulate gradients over 4 steps
    # effective batch size = 2 × 4 = 8
)

This gives you the training stability of larger batches without the memory cost.


Working Through a Scenario Together

Let's plan a training job step by step. Imagine you're preparing to fine-tune Llama-3-8B for Task API on a Colab T4 (15GB VRAM).

Your initial specification:

  • Model: Llama-3-8B
  • Target: Task API JSON generation
  • Hardware: T4 (15GB VRAM)
  • Data: 500 examples, average 300 tokens each

First attempt: Full fine-tuning at FP16

Weights:    16 GB
Gradients: 16 GB
Optimizer: 64 GB
Total: 96 GB → IMPOSSIBLE on T4

Second attempt: What about just freezing most layers?

You ask AI: "Can I freeze layers 0-28 and only train layers 29-31?"

AI suggests: "That would reduce gradients and optimizer to ~10% (3 layers of 32). But you still need to load all weights: 16GB for weights alone exceeds your 15GB VRAM."

Key insight from AI: "You need quantization just to load the model, regardless of training strategy."

Third attempt: QLoRA

4-bit weights: 4 GB
LoRA adapters: 0.1 GB
Activations: 1 GB (with grad checkpointing)
Overhead: 0.5 GB
Total: 5.6 GB → Fits with margin!

Your refinement: "Wait, I have 15GB but only using 5.6GB. Can I increase batch size for faster training?"

AI calculates: "With 9.4 GB headroom:

  • Activation memory per sample: ~1 GB
  • You could theoretically do batch_size=8, but that's 8 GB activations
  • Total: 4 + 8 + 0.5 = 12.5 GB
  • That's cutting it close. Recommend batch_size=4 with seq_length=1024."

Final configuration:

# QLoRA configuration for T4
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    # Note: T4 GPUs have no native bfloat16 support; torch.float16 is a common choice there
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch 16
    gradient_checkpointing=True,
)
# max_seq_length=1024 is passed to the SFT trainer/config (e.g., TRL's SFTConfig),
# not to TrainingArguments.

Estimated VRAM: ~8-10 GB, comfortable on T4.

This iterative process, where you refine your approach based on calculations and constraints, is exactly how practitioners plan real training jobs.


Quick Reference: VRAM Estimates

Use these tables for rapid estimation:

Inference (Model Loading Only)

Model    FP16        INT8       INT4/NF4
1B       2 GB        1 GB       0.5 GB
7-8B     14-16 GB    7-8 GB     4-5 GB
13B      26 GB       13 GB      7 GB
70B      140 GB      70 GB      35-40 GB

Training (QLoRA with Gradient Checkpointing)

Model    Batch 1, Seq 2K    Batch 4, Seq 512    Batch 4, Seq 1K
8B       ~6 GB              ~7 GB               ~9 GB
13B      ~9 GB              ~10 GB              ~13 GB
70B      ~45 GB             ~50 GB              ~60 GB

GPU Recommendations

GPU          VRAM     Model Sizes                            Training
T4           15 GB    Up to 8B (quantized)                   QLoRA
RTX 3090     24 GB    Up to 13B (quantized)                  QLoRA
RTX 4090     24 GB    Up to 13B (quantized)                  QLoRA
A100 40GB    40 GB    Up to 13B (FP16) or 70B (quantized)    Full FT or QLoRA
A100 80GB    80 GB    Up to 70B (quantized)                  QLoRA

Python: VRAM Calculator

Use this code to estimate VRAM for your specific configuration:

def estimate_vram_gb(
    params_billions: float,
    weight_precision: str = "fp16",
    training: bool = True,
    lora: bool = False,
    lora_rank: int = 16,
    gradient_checkpointing: bool = True,
    batch_size: int = 1,
    seq_length: int = 2048,
    num_layers: int = 32,
    hidden_dim: int = 4096
) -> dict:
    """
    Estimate VRAM requirements for LLM inference or training.

    Returns breakdown by component.
    """
    params = params_billions * 1e9

    # Weight memory by precision
    precision_bytes = {
        "fp32": 4,
        "fp16": 2,
        "bf16": 2,
        "int8": 1,
        "int4": 0.5,
        "nf4": 0.5
    }
    weight_bytes = precision_bytes[weight_precision]
    weights_gb = (params * weight_bytes) / 1e9

    if not training:
        # Inference only
        overhead_gb = 0.5
        return {
            "weights": weights_gb,
            "overhead": overhead_gb,
            "total": weights_gb + overhead_gb
        }

    # Training components
    if lora:
        # LoRA trainable parameters (rough estimate)
        # 2 matrices × rank × hidden × 2 target modules × layers
        lora_params = 2 * lora_rank * hidden_dim * 2 * num_layers
        grad_params = lora_params
        optim_params = lora_params
    else:
        grad_params = params
        optim_params = params

    # Gradients (FP16)
    gradients_gb = (grad_params * 2) / 1e9

    # Optimizer (AdamW: 2× FP32)
    optimizer_gb = (optim_params * 8) / 1e9

    # Activations (rough estimate)
    activation_base = 2 * num_layers * batch_size * seq_length * hidden_dim * 2
    if gradient_checkpointing:
        activation_base *= 0.4  # ~60% savings
    activations_gb = activation_base / 1e9

    overhead_gb = 0.5

    total = weights_gb + gradients_gb + optimizer_gb + activations_gb + overhead_gb

    return {
        "weights": round(weights_gb, 2),
        "gradients": round(gradients_gb, 2),
        "optimizer": round(optimizer_gb, 2),
        "activations": round(activations_gb, 2),
        "overhead": overhead_gb,
        "total": round(total, 2)
    }

# Examples
print("Llama-8B Full Fine-tuning (FP16):")
print(estimate_vram_gb(8, "fp16", training=True, lora=False, gradient_checkpointing=False))

print("\nLlama-8B QLoRA (4-bit):")
print(estimate_vram_gb(8, "nf4", training=True, lora=True))

print("\nLlama-8B Inference Only (4-bit):")
print(estimate_vram_gb(8, "nf4", training=False))

Output:

Llama-8B Full Fine-tuning (FP16):
{'weights': 16.0, 'gradients': 16.0, 'optimizer': 64.0, 'activations': 1.07, 'overhead': 0.5, 'total': 97.57}

Llama-8B QLoRA (4-bit):
{'weights': 4.0, 'gradients': 0.02, 'optimizer': 0.07, 'activations': 0.43, 'overhead': 0.5, 'total': 5.01}

Llama-8B Inference Only (4-bit):
{'weights': 4.0, 'overhead': 0.5, 'total': 4.5}

Common Mistakes and Fixes

Mistake                      Symptom                           Fix
Forgot optimizer states      OOM during first training step    Use QLoRA or an 8-bit optimizer
Batch size too high          OOM after a few steps             Reduce batch size, use gradient accumulation
Sequence too long            OOM during forward pass           Reduce max_seq_length or use gradient checkpointing
No gradient checkpointing    OOM even with batch=1             Enable gradient_checkpointing=True
Loading FP16 on small GPU    OOM during model load             Use 4-bit quantization from the start

The OOM Debugging Checklist

When you get CUDA out of memory:

  1. Is model loaded at correct precision?
  2. Is gradient checkpointing enabled?
  3. What's the batch size and sequence length?
  4. Is LoRA/QLoRA properly configured?
  5. Any other models or tensors already in GPU memory? (Check with the snippet below.)
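
PyTorch can answer item 5 directly; a short check you can run before training:

import torch

if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1e9   # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / 1e9     # memory held by the caching allocator
    total = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"Allocated {allocated:.2f} GB | Reserved {reserved:.2f} GB | Device total {total:.2f} GB")
    # print(torch.cuda.memory_summary())  # detailed per-allocation breakdown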

Try With AI

Now that you understand VRAM budgeting, practice planning with your AI companion.

Prompt 1: Plan Your Training Configuration

I want to fine-tune Llama-3-8B on Colab Free Tier (T4, 15GB VRAM).

My training data:
- 1,000 examples
- Average length: 400 tokens
- Maximum length: 1,200 tokens
- Format: instruction/response pairs

Help me:
1. Calculate expected VRAM for different configurations
2. Choose between max_seq_length 512, 1024, or 2048
3. Determine the largest batch size I can use
4. Decide on gradient accumulation for effective batch size 16

Show your calculations and explain trade-offs.

What you're learning: Applying VRAM estimation to real planning decisions.

Prompt 2: Optimize an Existing Configuration

My current training setup on a T4:
- Llama-3-8B with QLoRA (4-bit)
- Batch size: 1
- Sequence length: 2048
- Gradient checkpointing: enabled
- Estimated VRAM: ~6GB

But training is slow (only using batch 1). I have 9GB of unused VRAM.
Help me optimize:

1. Should I increase batch size or sequence length?
2. What batch size can I safely use?
3. How does this affect training speed and quality?
4. If I increase batch size to 4, should I reduce gradient accumulation?

I want faster training without OOM crashes.

What you're learning: You specify your constraints, and together you converge on an optimized configuration that neither would have reached alone.

Prompt 3: Explore the Memory-Compute Trade-off

I'm confused about when to use gradient checkpointing vs. just reducing
batch size. Both save memory, but in different ways.

Help me understand:
1. What exactly does gradient checkpointing trade (memory for what)?
2. When is it better to reduce batch size instead?
3. For training quality, is it better to use checkpointing with batch=4
or no checkpointing with batch=1?
4. How do these choices interact with learning rate and convergence?

Walk me through the decision framework.

What you're learning: Understanding the deeper trade-offs helps you make better configuration decisions beyond just "will it fit."

Safety Note

Before starting a long training run, always do a quick test with 1-2 batches to verify your VRAM calculations are correct. Training jobs can run for hours; discovering an OOM error after 30 minutes wastes compute credits and time. The T4 on Colab Free Tier also has session limits (~90 minutes of continuous GPU use), so save checkpoints frequently.
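
One way to run that quick test, sketched below: reuse your configuration but cap the run with max_steps, then read PyTorch's peak-memory counter. This assumes you build your trainer (for example TRL's SFTTrainer) the same way as for the full run.

import torch
from transformers import TrainingArguments

# Same settings as the real run, but stop after 2 optimizer steps
smoke_args = TrainingArguments(
    output_dir="./smoke-test",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    max_steps=2,   # enough steps to hit peak memory
)

torch.cuda.reset_peak_memory_stats()
# trainer = ...      # build your trainer with smoke_args and your dataset, then:
# trainer.train()
print(f"Peak VRAM during smoke test: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")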