VRAM Budget Planning
You've counted the parameters. You know an 8B model has 8 billion weights. Now the critical question: how much GPU memory do you actually need to train it? The answer isn't just "8 billion times bytes per parameter." Training requires memory for weights, gradients, optimizer states, and activations. Getting this calculation wrong means your training crashes with "CUDA out of memory."
This lesson teaches you to plan your VRAM budget before starting any training job. You'll learn the components of GPU memory usage, how to calculate requirements for different scenarios, and what optimization techniques let you fit larger models on smaller GPUs.
Why VRAM Planning Matters
Consider this common scenario:
You: Load 8B model on T4 (15GB VRAM) for training
GPU: CUDA out of memory
You: But 8B at FP16 is only 16GB... wait, 16GB > 15GB
GPU: Also you forgot gradients, optimizer, and activations
You: How much memory do I actually need?
The model weights are just the starting point. Training adds significant memory overhead.
The VRAM Stack
GPU Memory Usage During Training:
┌─────────────────────────────────────────────────────────────────┐
│ Model Weights (base memory) │
├─────────────────────────────────────────────────────────────────┤
│ Gradients (same size as weights for full fine-tuning) │
├─────────────────────────────────────────────────────────────────┤
│ Optimizer States (2× weights for AdamW) │
├─────────────────────────────────────────────────────────────────┤
│ Activations (scales with batch size × sequence length) │
├─────────────────────────────────────────────────────────────────┤
│ CUDA Kernels + Workspace (~500MB overhead) │
└─────────────────────────────────────────────────────────────────┘
Let's calculate each component.
Component 1: Model Weights
The foundation of memory usage. You learned this in Lesson 3:
Weights Memory = Parameters × Bytes per parameter
| Precision | Bytes/Param | 8B Model |
|---|---|---|
| FP32 | 4 | 32 GB |
| FP16/BF16 | 2 | 16 GB |
| INT8 | 1 | 8 GB |
| INT4/NF4 | 0.5 | 4 GB |
Key insight: Quantization is powerful. Going from FP16 to NF4 reduces weight memory by 4×.
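A quick sanity check of these numbers in Python (a minimal sketch; the bytes-per-parameter values mirror the table above):

```python
# Bytes per parameter for common precisions
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5, "nf4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory needed just to hold the weights, before any training overhead."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

print(weight_memory_gb(8, "fp16"))  # 16.0 GB
print(weight_memory_gb(8, "nf4"))   # 4.0 GB
```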
Component 2: Gradients
During backpropagation, you compute gradients for each parameter. For full fine-tuning:
Gradient Memory = Parameters × Bytes per gradient
Gradients typically stored at FP16:
| Model | Gradient Memory (FP16) |
|---|---|
| 8B model | 8B × 2 = 16 GB |
| 7B model | 7B × 2 = 14 GB |
Why this matters: Full fine-tuning doubles your weight memory just for gradients.
LoRA Reduces Gradient Memory
With LoRA, you only compute gradients for the adapter parameters:
LoRA Parameters = 2 × rank × hidden_dim × num_target_modules × layers
For Llama-8B with LoRA rank 16, targeting Q and V:
LoRA params = 2 × 16 × 4,096 × 2 × 32 = 8.4M parameters
Gradient memory = 8.4M × 2 bytes = 16.8 MB
Compare: 16 GB (full) vs 16.8 MB (LoRA) = 1000× reduction in gradient memory.
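The same arithmetic as a short sketch, using the approximation above (each adapted module gets two matrices, each with roughly rank × hidden_dim parameters):

```python
def lora_param_count(rank: int, hidden_dim: int, target_modules: int, layers: int) -> int:
    # Two matrices (A and B) per adapted module, each ~rank × hidden_dim parameters
    return 2 * rank * hidden_dim * target_modules * layers

lora_params = lora_param_count(rank=16, hidden_dim=4096, target_modules=2, layers=32)
print(lora_params)                                     # 8_388_608 (~8.4M)
print(lora_params * 2 / 1e6, "MB of FP16 gradients")   # ~16.8 MB
```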
Component 3: Optimizer States
Optimizers like AdamW maintain state for each trainable parameter:
| State | Purpose | Memory |
|---|---|---|
| First moment (m) | Momentum | 1× parameters (FP32) |
| Second moment (v) | Adaptive learning rate | 1× parameters (FP32) |
For full fine-tuning with AdamW:
Optimizer Memory = Parameters × 8 bytes (two FP32 tensors)
8B model = 8B × 8 = 64 GB
This is often the memory killer. You need 64 GB just for optimizer states on an 8B model.
LoRA Reduces Optimizer Memory
With LoRA's 8.4M trainable parameters:
Optimizer memory = 8.4M × 8 = 67 MB
Compare: 64 GB (full) vs 67 MB (LoRA) = ~1000× reduction.
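As a sketch, the optimizer-state arithmetic for AdamW (two FP32 tensors per trainable parameter, i.e. 8 bytes each):

```python
def adamw_state_gb(trainable_params: float) -> float:
    # First moment (4 bytes) + second moment (4 bytes) per trainable parameter
    return trainable_params * 8 / 1e9

print(adamw_state_gb(8e9))    # 64.0 GB for full fine-tuning of an 8B model
print(adamw_state_gb(8.4e6))  # ~0.067 GB for the LoRA adapters
```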
Component 4: Activations
The intermediate values computed during the forward pass. These are stored for the backward pass.
Activation memory depends on:
- Batch size
- Sequence length
- Number of layers
- Hidden dimension
Rough estimate:
Activations ≈ 2 × layers × batch_size × seq_length × hidden_dim × bytes
For Llama-8B with batch_size=1, seq_length=2048:
Activations = 2 × 32 × 1 × 2,048 × 4,096 × 2 bytes
= 1.07 GB per sample
Batch size scales linearly: batch_size=4 → 4.3 GB of activations.
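A sketch of this rough estimate (the real number depends on the attention implementation and fused kernels, so treat it as an order-of-magnitude guide):

```python
def activation_gb(layers: int, batch_size: int, seq_length: int,
                  hidden_dim: int, bytes_per_value: int = 2) -> float:
    # Rough rule of thumb: ~2 stored tensors of size batch × seq × hidden per layer
    return 2 * layers * batch_size * seq_length * hidden_dim * bytes_per_value / 1e9

print(activation_gb(32, 1, 2048, 4096))  # ~1.07 GB
print(activation_gb(32, 4, 2048, 4096))  # ~4.3 GB -- scales linearly with batch size
```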
Gradient Checkpointing Trades Compute for Memory
Instead of storing all activations, recompute them during backward pass:
Without checkpointing: Store all layer activations (high memory)
With checkpointing: Store every N-th layer, recompute between (low memory, ~30% slower)
Memory savings: Often 60-70% reduction in activation memory.
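In the Hugging Face stack this is a single switch rather than something you implement yourself. A minimal sketch (the model ID is a placeholder):

```python
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # placeholder ID

# Option 1: enable it directly on the model
model.gradient_checkpointing_enable()

# Option 2: let the Trainer handle it
training_args = TrainingArguments(output_dir="out", gradient_checkpointing=True)
```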
The Complete VRAM Formula
Putting it all together for full fine-tuning:
Total VRAM = Weights + Gradients + Optimizer + Activations + Overhead
= (P × Bw) + (P × Bg) + (P × Bo) + (2 × L × B × S × H × Ba) + 0.5GB
Where:
P = Parameters
Bw = Bytes per weight (2 for FP16, 0.5 for 4-bit)
Bg = Bytes per gradient (2 for FP16)
Bo = Bytes for optimizer (8 for AdamW)
L = Layers
B = Batch size
S = Sequence length
H = Hidden dimension
Ba = Bytes per activation (2 for FP16)
Example: Full Fine-Tuning Llama-8B
Parameters: 8B
Precision: FP16
Batch size: 1
Sequence: 2048
Layers: 32
Hidden: 4096
Weights: 8B × 2 = 16 GB
Gradients: 8B × 2 = 16 GB
Optimizer: 8B × 8 = 64 GB
Activations: 2 × 32 × 1 × 2048 × 4096 × 2 = 1 GB
Overhead: 0.5 GB
Total: 16 + 16 + 64 + 1 + 0.5 = 97.5 GB
Result: Full fine-tuning Llama-8B requires ~100 GB VRAM. That's an A100 80GB... times two.
Making It Fit: QLoRA
QLoRA (Quantized LoRA) combines two techniques:
- 4-bit quantization: Reduce weight memory
- LoRA: Reduce gradient and optimizer memory
QLoRA VRAM Calculation
Quantized Weights: 8B × 0.5 = 4 GB
LoRA Gradients: 8.4M × 2 = 0.017 GB
LoRA Optimizer: 8.4M × 8 = 0.067 GB
Activations: ~1 GB (with gradient checkpointing)
Overhead: 0.5 GB
Total: 4 + 0.017 + 0.067 + 1 + 0.5 ≈ 5.6 GB
Result: QLoRA drops 97.5 GB to 5.6 GB. This fits on a T4 with room to spare!
Why QLoRA Works
The magic: the quantized weights are used in the forward pass, but gradients flow only through the small LoRA adapters (a minimal loading sketch follows this list). You get:
- Memory of a 4-bit model
- Trainability (can backprop through LoRA)
- Quality approaching full fine-tuning
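Here is a minimal sketch of how these pieces are wired together with the Hugging Face stack (transformers + peft + bitsandbytes). The model ID is a placeholder, and q_proj/v_proj as target modules mirror the earlier LoRA example; adapt both to your model:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # fp16 compute; use bf16 on Ampere+ GPUs
)

# Frozen 4-bit base weights (~4 GB for an 8B model)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",          # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Small trainable LoRA adapters (~8.4M params) on top
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # confirms only the adapters are trainable
```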
Batch Size and Sequence Length Trade-offs
Given fixed GPU memory, you can trade batch size for sequence length:
Activation memory ∝ batch_size × sequence_length
| GPU Memory | Batch 1 | Batch 4 | Batch 8 |
|---|---|---|---|
| Seq 512 | ✓ | ✓ | ✓ |
| Seq 1024 | ✓ | ✓ | Maybe |
| Seq 2048 | ✓ | Maybe | ✗ |
| Seq 4096 | Maybe | ✗ | ✗ |
For QLoRA on T4 (15GB), typical configurations:
- Batch 1, Seq 2048: ~6 GB (comfortable)
- Batch 4, Seq 512: ~7 GB (comfortable)
- Batch 2, Seq 1024: ~6.5 GB (comfortable)
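A sketch of the scaling behind this table and the configurations above, using the rough activation estimate from earlier (Llama-8B-like shapes: 32 layers, hidden size 4096; the numbers illustrate the proportionality, not exact fit thresholds):

```python
# Activation memory (GB): 2 × layers × batch × seq × hidden × 2 bytes (FP16, no checkpointing)
def act_gb(batch: int, seq: int, layers: int = 32, hidden: int = 4096) -> float:
    return 2 * layers * batch * seq * hidden * 2 / 1e9

for seq in (512, 1024, 2048, 4096):
    row = " | ".join(f"batch {b}: {act_gb(b, seq):5.2f} GB" for b in (1, 4, 8))
    print(f"seq {seq:>4}: {row}")
```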
Gradient Accumulation: Effective Larger Batches
Can't fit batch_size=8 in memory? Use gradient accumulation:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",              # any local output directory
    per_device_train_batch_size=2,     # actual batch held in memory
    gradient_accumulation_steps=4,     # accumulate gradients over 4 steps
    # Effective batch size = 2 × 4 = 8
)
This gives you the training stability of larger batches without the memory cost.
Working Through a Scenario Together
Let's plan a training job step by step. Imagine you're preparing to fine-tune Llama-3-8B for Task API on a Colab T4 (15GB VRAM).
Your initial specification:
- Model: Llama-3-8B
- Target: Task API JSON generation
- Hardware: T4 (15GB VRAM)
- Data: 500 examples, average 300 tokens each
First attempt: Full fine-tuning at FP16
Weights: 16 GB
Gradients: 16 GB
Optimizer: 64 GB
Total: 96 GB → IMPOSSIBLE on T4
Second attempt: What about just freezing most layers?
You ask AI: "Can I freeze layers 0-28 and only train layers 29-31?"
AI suggests: "That would reduce gradients and optimizer to ~10% (3 layers of 32). But you still need to load all weights: 16GB for weights alone exceeds your 15GB VRAM."
Key insight from AI: "You need quantization just to load the model, regardless of training strategy."
Third attempt: QLoRA
4-bit weights: 4 GB
LoRA adapters: 0.1 GB
Activations: 1 GB (with grad checkpointing)
Overhead: 0.5 GB
Total: 5.6 GB → Fits with margin!
Your refinement: "Wait, I have 15GB but only using 5.6GB. Can I increase batch size for faster training?"
AI calculates: "With 9.4 GB headroom:
- Activation memory per sample: ~1 GB
- You could theoretically do batch_size=8, but that's 8 GB activations
- Total: 4 + 8 + 0.5 = 12.5 GB
- That's cutting it close. Recommend batch_size=4 with seq_length=1024."
Final configuration:
# QLoRA configuration for T4
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 (Turing) has no native bfloat16; use fp16
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="task-api-qlora",       # any local output directory
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch 16
    gradient_checkpointing=True,
)

# max_seq_length=1024 is set on the trainer (e.g. trl's SFTTrainer/SFTConfig),
# not on TrainingArguments.
Estimated VRAM: ~8-10 GB, comfortable on T4.
This iterative process, where you refine your approach based on calculations and constraints, is exactly how practitioners plan real training jobs.
Quick Reference: VRAM Estimates
Use these tables for rapid estimation:
Inference (Model Loading Only)
| Model | FP16 | INT8 | INT4/NF4 |
|---|---|---|---|
| 1B | 2 GB | 1 GB | 0.5 GB |
| 7-8B | 14-16 GB | 7-8 GB | 4-5 GB |
| 13B | 26 GB | 13 GB | 7 GB |
| 70B | 140 GB | 70 GB | 35-40 GB |
Training (QLoRA with Gradient Checkpointing)
| Model | Batch 1, Seq 2K | Batch 4, Seq 512 | Batch 4, Seq 1K |
|---|---|---|---|
| 8B | ~6 GB | ~7 GB | ~9 GB |
| 13B | ~9 GB | ~10 GB | ~13 GB |
| 70B | ~45 GB | ~50 GB | ~60 GB |
GPU Recommendations
| GPU | VRAM | Model Sizes | Training |
|---|---|---|---|
| T4 | 15 GB | Up to 8B (quantized) | QLoRA |
| RTX 3090 | 24 GB | Up to 13B (quantized) | QLoRA |
| RTX 4090 | 24 GB | Up to 13B (quantized) | QLoRA |
| A100 40GB | 40 GB | Up to 13B (FP16) or 70B (quantized) | Full FT (small models) or QLoRA |
| A100 80GB | 80 GB | Up to 70B (quantized) | QLoRA |
Python: VRAM Calculator
Use this code to estimate VRAM for your specific configuration:
def estimate_vram_gb(
    params_billions: float,
    weight_precision: str = "fp16",
    training: bool = True,
    lora: bool = False,
    lora_rank: int = 16,
    gradient_checkpointing: bool = True,
    batch_size: int = 1,
    seq_length: int = 2048,
    num_layers: int = 32,
    hidden_dim: int = 4096,
) -> dict:
    """
    Estimate VRAM requirements for LLM inference or training.
    Returns breakdown by component.
    """
    params = params_billions * 1e9

    # Weight memory by precision
    precision_bytes = {
        "fp32": 4,
        "fp16": 2,
        "bf16": 2,
        "int8": 1,
        "int4": 0.5,
        "nf4": 0.5,
    }
    weight_bytes = precision_bytes[weight_precision]
    weights_gb = (params * weight_bytes) / 1e9

    if not training:
        # Inference only
        overhead_gb = 0.5
        return {
            "weights": weights_gb,
            "overhead": overhead_gb,
            "total": weights_gb + overhead_gb,
        }

    # Training components
    if lora:
        # LoRA trainable parameters (rough estimate)
        # 2 matrices × rank × hidden × 2 target modules × layers
        lora_params = 2 * lora_rank * hidden_dim * 2 * num_layers
        grad_params = lora_params
        optim_params = lora_params
    else:
        grad_params = params
        optim_params = params

    # Gradients (FP16)
    gradients_gb = (grad_params * 2) / 1e9

    # Optimizer (AdamW: 2× FP32)
    optimizer_gb = (optim_params * 8) / 1e9

    # Activations (rough estimate)
    activation_base = 2 * num_layers * batch_size * seq_length * hidden_dim * 2
    if gradient_checkpointing:
        activation_base *= 0.4  # ~60% savings
    activations_gb = activation_base / 1e9

    overhead_gb = 0.5
    total = weights_gb + gradients_gb + optimizer_gb + activations_gb + overhead_gb

    return {
        "weights": round(weights_gb, 2),
        "gradients": round(gradients_gb, 2),
        "optimizer": round(optimizer_gb, 2),
        "activations": round(activations_gb, 2),
        "overhead": overhead_gb,
        "total": round(total, 2),
    }
# Examples
print("Llama-8B Full Fine-tuning (FP16, no gradient checkpointing):")
print(estimate_vram_gb(8, "fp16", training=True, lora=False, gradient_checkpointing=False))

print("\nLlama-8B QLoRA (4-bit, gradient checkpointing on by default):")
print(estimate_vram_gb(8, "nf4", training=True, lora=True))

print("\nLlama-8B Inference Only (4-bit):")
print(estimate_vram_gb(8, "nf4", training=False))
Output:
Llama-8B Full Fine-tuning (FP16, no gradient checkpointing):
{'weights': 16.0, 'gradients': 16.0, 'optimizer': 64.0, 'activations': 1.07, 'overhead': 0.5, 'total': 97.57}
Llama-8B QLoRA (4-bit, gradient checkpointing on by default):
{'weights': 4.0, 'gradients': 0.02, 'optimizer': 0.07, 'activations': 0.43, 'overhead': 0.5, 'total': 5.01}
Llama-8B Inference Only (4-bit):
{'weights': 4.0, 'overhead': 0.5, 'total': 4.5}
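The defaults assume Llama-8B-like shapes. For other models, pass the layer count and hidden size explicitly; for example, 40 layers and a hidden size of 5120 correspond to Llama-2-13B-class models. Note that this rule of thumb tends to land somewhat below measured usage (fragmentation and temporary buffers are not modeled), so leave a few GB of headroom:

```python
# 13B-class model, QLoRA, batch 1, seq 2048
print(estimate_vram_gb(13, "nf4", training=True, lora=True,
                       num_layers=40, hidden_dim=5120))
# -> total around 8 GB; compare with the ~9 GB quick-reference estimate above
```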
Common Mistakes and Fixes
| Mistake | Symptom | Fix |
|---|---|---|
| Forgot optimizer states | OOM during first training step | Use QLoRA or 8-bit optimizer |
| Batch size too high | OOM after a few steps | Reduce batch size, use gradient accumulation |
| Sequence too long | OOM during forward pass | Reduce max_seq_length or use gradient checkpointing |
| No gradient checkpointing | OOM even with batch=1 | Enable gradient_checkpointing=True |
| Loading FP16 on small GPU | OOM during model.load | Use 4-bit quantization from the start |
The OOM Debugging Checklist
When you get CUDA out of memory, work through this checklist (a memory-inspection snippet follows the list):
- Is model loaded at correct precision?
- Is gradient checkpointing enabled?
- What's the batch size and sequence length?
- Is LoRA/QLoRA properly configured?
- Any other models/tensors in GPU memory?
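For the last item, a short sketch using PyTorch's built-in memory counters to see what is actually sitting on the GPU (run it after loading the model and again after one training step; the printed values depend on your setup):

```python
import torch

def gpu_memory_report() -> None:
    """Print total, reserved, and allocated GPU memory in GB."""
    if not torch.cuda.is_available():
        print("No CUDA device found")
        return
    total = torch.cuda.get_device_properties(0).total_memory / 1e9
    reserved = torch.cuda.memory_reserved(0) / 1e9
    allocated = torch.cuda.memory_allocated(0) / 1e9
    print(f"total {total:.1f} GB | reserved {reserved:.1f} GB | allocated {allocated:.1f} GB")

gpu_memory_report()
```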
Try With AI
Now that you understand VRAM budgeting, practice planning with your AI companion.
Prompt 1: Plan Your Training Configuration
I want to fine-tune Llama-3-8B on Colab Free Tier (T4, 15GB VRAM).
My training data:
- 1,000 examples
- Average length: 400 tokens
- Maximum length: 1,200 tokens
- Format: instruction/response pairs
Help me:
1. Calculate expected VRAM for different configurations
2. Choose between max_seq_length 512, 1024, or 2048
3. Determine the largest batch size I can use
4. Decide on gradient accumulation for effective batch size 16
Show your calculations and explain trade-offs.
What you're learning: Applying VRAM estimation to real planning decisions.
Prompt 2: Optimize an Existing Configuration
My current training setup on a T4:
- Llama-3-8B with QLoRA (4-bit)
- Batch size: 1
- Sequence length: 2048
- Gradient checkpointing: enabled
- Estimated VRAM: ~6GB
But training is slow (only using batch 1). I have 9GB of unused VRAM.
Help me optimize:
1. Should I increase batch size or sequence length?
2. What batch size can I safely use?
3. How does this affect training speed and quality?
4. If I increase batch size to 4, should I reduce gradient accumulation?
I want faster training without OOM crashes.
What you're learning: You specify your constraints, and together you converge on an optimized configuration that neither would have reached alone.
Prompt 3: Explore the Memory-Compute Trade-off
I'm confused about when to use gradient checkpointing vs. just reducing
batch size. Both save memory, but in different ways.
Help me understand:
1. What exactly does gradient checkpointing trade (memory for what)?
2. When is it better to reduce batch size instead?
3. For training quality, is it better to use checkpointing with batch=4
or no checkpointing with batch=1?
4. How do these choices interact with learning rate and convergence?
Walk me through the decision framework.
What you're learning: Understanding the deeper trade-offs helps you make better configuration decisions beyond just "will it fit."
Safety Note
Before starting a long training run, always do a quick test with 1-2 batches to verify your VRAM calculations are correct. Training jobs can run for hours; discovering an OOM error after 30 minutes wastes compute credits and time. Colab Free Tier sessions can also be disconnected or hit usage limits partway through a run, so save checkpoints frequently.
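One way to run that quick test, sketched with standard TrainingArguments fields (the output directory and step counts are placeholder values):

```python
from transformers import TrainingArguments

smoke_test_args = TrainingArguments(
    output_dir="smoke-test",           # placeholder directory
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    max_steps=5,                       # just enough steps to hit peak memory once
    logging_steps=1,
)
# If these few steps finish without OOM, relaunch with the full schedule
# (and a low save_steps so checkpoints survive a dropped session).
```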