Parameter Counting and Model Sizes

When someone says "Llama-3-8B has 8 billion parameters," what does that actually mean? And why does it matter for fine-tuning? This lesson demystifies model sizes by teaching you to count parameters yourself. You'll understand exactly where those billions come from and how model size relates to capability, cost, and memory requirements.

Knowing how to count parameters transforms you from someone who reads model cards to someone who can verify them. When you understand that roughly 70% of an LLM's parameters live in the FFN layers, you can make informed decisions about which components to target during fine-tuning.


What is a Parameter?

A parameter is a learnable weight in a neural network. During training, these weights are adjusted to minimize loss. In a transformer:

  • Embedding weights: Map tokens to vectors (and back)
  • Attention weights: Q, K, V projections and output projection
  • FFN weights: Up-projection and down-projection matrices
  • Layer norm weights: Scale and shift parameters

Every parameter is a floating-point number. At FP16 precision, each parameter takes 2 bytes. So:

Model memory (bytes) = Parameters × Bytes per parameter
8 billion × 2 bytes = 16 GB

This is why an 8B model needs roughly 16GB of VRAM just to load the weights at FP16.
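The same arithmetic in Python, as a quick sanity check (a minimal sketch; 1 GB is taken as 10^9 bytes and the parameter count is rounded to 8 billion):

# Weight-only memory at FP16: 2 bytes per parameter
params = 8e9                 # ~8 billion parameters
bytes_per_param = 2          # FP16 / BF16
print(f"{params * bytes_per_param / 1e9:.0f} GB")  # -> 16 GB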


The Parameter Counting Formula

Here's the formula for counting transformer parameters:

Total Parameters = Embedding + (Attention + FFN + LayerNorm) × Layers + Output Head

Let's break down each component.

1. Embedding Layer

The embedding table maps vocabulary tokens to hidden vectors:

Embedding Parameters = Vocabulary Size × Hidden Dimension

For Llama-3-8B:

  • Vocabulary: 128,256 tokens
  • Hidden dimension: 4,096
Embedding = 128,256 × 4,096 = 525,336,576 ≈ 525M parameters

2. Attention Layer (per layer)

Each attention layer has:

| Component | Formula | Llama-3-8B |
| --- | --- | --- |
| Q projection | hidden × hidden | 4,096 × 4,096 = 16.8M |
| K projection | hidden × kv_heads × head_dim | 4,096 × 8 × 128 = 4.2M |
| V projection | hidden × kv_heads × head_dim | 4,096 × 8 × 128 = 4.2M |
| Output projection | hidden × hidden | 4,096 × 4,096 = 16.8M |

Note: Llama-3 uses Grouped Query Attention (GQA), where K and V projections are smaller than Q. This saves parameters and memory.

Attention per layer ≈ 16.8M + 4.2M + 4.2M + 16.8M = 42M parameters
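The GQA savings are easy to see in a few lines. This is a sketch using Llama-3-8B's published shapes (32 query heads, 8 KV heads, head dimension 128); the variable names are mine:

hidden, n_heads, n_kv_heads = 4096, 32, 8
head_dim = hidden // n_heads                 # 128

q_proj = hidden * (n_heads * head_dim)       # 4096 x 4096 = 16.8M
k_proj = hidden * (n_kv_heads * head_dim)    # 4096 x 1024 =  4.2M
v_proj = hidden * (n_kv_heads * head_dim)    # 4096 x 1024 =  4.2M
o_proj = (n_heads * head_dim) * hidden       # 4096 x 4096 = 16.8M

gqa_total = q_proj + k_proj + v_proj + o_proj
mha_total = 4 * hidden * hidden              # what full multi-head attention would cost

print(f"GQA: {gqa_total/1e6:.1f}M vs MHA: {mha_total/1e6:.1f}M per layer")
# GQA: 41.9M vs MHA: 67.1M per layer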

3. FFN Layer (per layer)

The Feed-Forward Network has:

| Component | Formula | Llama-3-8B |
| --- | --- | --- |
| Gate projection | hidden × intermediate | 4,096 × 14,336 = 58.7M |
| Up projection | hidden × intermediate | 4,096 × 14,336 = 58.7M |
| Down projection | intermediate × hidden | 14,336 × 4,096 = 58.7M |

Llama uses a gated FFN (SwiGLU), so there are three projections instead of two:

FFN per layer = 58.7M × 3 = 176M parameters
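To see what the third matrix costs, here is a quick sketch comparing the gated SwiGLU FFN to a classic two-matrix FFN at Llama-3-8B's shapes:

hidden, intermediate = 4096, 14336

gated_ffn = 3 * hidden * intermediate   # gate + up + down (SwiGLU)
plain_ffn = 2 * hidden * intermediate   # up + down (classic FFN)

print(f"gated: {gated_ffn/1e6:.1f}M, plain: {plain_ffn/1e6:.1f}M per layer")
# gated: 176.2M, plain: 117.4M per layer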

4. Layer Normalization (per layer)

Each layer has 2 normalization layers. Counting a scale and a shift for each gives an upper bound:

LayerNorm per layer = 2 × (2 × hidden) = 2 × (2 × 4,096) = 16,384 ≈ 0.016M

(Llama-3 actually uses RMSNorm, which has only a scale parameter, so the true count is half this.) Either way, it is negligible compared to attention and FFN.

5. Output Head

The output projection maps hidden states back to vocabulary:

Output Head = Hidden × Vocabulary = 4,096 × 128,256 = 525M

Note: Some models tie the output head to the embedding weights (same matrix). Llama-3-8B uses separate weights; the smaller Llama-3.2 models (1B and 3B) tie them.
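Whether the head is tied changes the total by a full vocab × hidden matrix. A quick sketch of what a separate head adds at Llama-3's vocabulary size:

vocab, hidden = 128_256, 4_096
untied_head = vocab * hidden
print(f"Separate output head: {untied_head / 1e6:.0f}M parameters")  # -> 525M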


Complete Calculation: Llama-3-8B

Now let's sum everything:

| Component | Formula | Parameters |
| --- | --- | --- |
| Embedding | 128,256 × 4,096 | 525M |
| Attention × 32 layers | 42M × 32 | 1,344M |
| FFN × 32 layers | 176M × 32 | 5,632M |
| LayerNorm × 32 layers | 0.016M × 32 | 0.5M |
| Output Head | 128,256 × 4,096 | 525M |
| Total | | 8,027M ≈ 8B |

The name "8B" comes from this calculation. The actual parameter count varies slightly based on exact architecture details.

Where Do the Parameters Live?

Parameter Distribution in Llama-3-8B:
┌─────────────────────────────────────────────────────────────────┐
│ FFN (70%) │
│ ████████████████████████████████████████████████████████████████│
├─────────────────────────────────────────────────────────────────┤
│ Attention (17%) │
│ ████████████████ │
├─────────────────────────────────────────────────────────────────┤
│ Embedding + Output (13%) │
│ ████████████ │
└─────────────────────────────────────────────────────────────────┘

Key insight: The FFN layers contain roughly 70% of all parameters, while attention accounts for only about 17%. That imbalance is one reason adapter methods like LoRA often target the attention matrices: they are a smaller, more focused intervention that still steers the model's behavior.
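Here is a minimal sketch that reproduces this split from the rounded per-layer numbers in the tables above:

layers = 32
ffn = 176e6 * layers            # gate + up + down projections
attention = 42e6 * layers       # Q, K, V, O projections
embed_and_head = 525e6 * 2      # embedding table + output head

total = ffn + attention + embed_and_head
for name, value in [("FFN", ffn), ("Attention", attention), ("Embedding+Output", embed_and_head)]:
    print(f"{name}: {value / total:.0%}")
# FFN: 70%, Attention: 17%, Embedding+Output: 13%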


Python Code: Count Parameters Programmatically

You can verify parameter counts with code:

from transformers import AutoConfig

def count_parameters(model_name: str) -> dict:
    """Count parameters by component for a transformer model."""

    # Load config only (no weights)
    config = AutoConfig.from_pretrained(model_name)

    # Extract architecture details
    vocab_size = config.vocab_size
    hidden_size = config.hidden_size
    num_layers = config.num_hidden_layers
    intermediate_size = config.intermediate_size
    num_heads = config.num_attention_heads
    num_kv_heads = getattr(config, "num_key_value_heads", num_heads)
    head_dim = hidden_size // num_heads

    # Per-component parameters
    embedding = vocab_size * hidden_size

    # Attention (Q and O are full-size, K/V are grouped under GQA)
    q_proj = hidden_size * hidden_size
    k_proj = hidden_size * (num_kv_heads * head_dim)
    v_proj = hidden_size * (num_kv_heads * head_dim)
    o_proj = hidden_size * hidden_size
    attention_per_layer = q_proj + k_proj + v_proj + o_proj

    # FFN (gated: gate + up + down)
    ffn_per_layer = 3 * hidden_size * intermediate_size

    # Layer norms (upper bound: 2 norms × (scale + shift); Llama's RMSNorm has scale only)
    layernorm_per_layer = 4 * hidden_size

    # Total per layer
    per_layer = attention_per_layer + ffn_per_layer + layernorm_per_layer

    # Output head (sometimes tied to the embedding, but counted separately here)
    output_head = hidden_size * vocab_size

    total = embedding + (per_layer * num_layers) + output_head

    return {
        "model": model_name,
        "vocab_size": vocab_size,
        "hidden_size": hidden_size,
        "num_layers": num_layers,
        "embedding": embedding,
        "attention_per_layer": attention_per_layer,
        "ffn_per_layer": ffn_per_layer,
        "total_attention": attention_per_layer * num_layers,
        "total_ffn": ffn_per_layer * num_layers,
        "output_head": output_head,
        "total_parameters": total,
        "total_billions": total / 1e9,
    }

# Count parameters for Llama-3.2-1B
result = count_parameters("meta-llama/Llama-3.2-1B")
for key, value in result.items():
    if isinstance(value, int) and value > 1000:
        print(f"{key}: {value:,}")
    else:
        print(f"{key}: {value}")

Output:

model: meta-llama/Llama-3.2-1B
vocab_size: 128,256
hidden_size: 2,048
num_layers: 16
embedding: 262,668,288
attention_per_layer: 10,485,760
ffn_per_layer: 50,331,648
total_attention: 167,772,160
total_ffn: 805,306,368
output_head: 262,668,288
total_parameters: 1,498,546,176
total_billions: 1.498546176

This approach works for Llama-style decoder models (gated FFN, optional GQA); other architectures need small adjustments to the formulas. Try it with meta-llama/Llama-3.1-8B to verify the 8B calculation.


Model Size Tiers: Choosing the Right Model

Models come in distinct size tiers, each with different trade-offs:

| Size Tier | Examples | VRAM (FP16) | Quality | Latency | Use Case |
| --- | --- | --- | --- | --- | --- |
| 1-3B | Llama-3.2-1B, Phi-3-mini | 2-6 GB | Good for focused tasks | Very fast | Mobile, edge, rapid prototyping |
| 7-8B | Llama-3-8B, Mistral-7B | 14-16 GB | Strong general capability | Fast | Best for Colab T4 fine-tuning |
| 13B | Llama-2-13B | 26 GB | Better reasoning | Moderate | Single high-end GPU |
| 30-34B | CodeLlama-34B | 68 GB | Excellent capability | Slow | Multi-GPU or cloud |
| 70B+ | Llama-3-70B | 140+ GB | Near-frontier performance | Very slow | Cluster deployment |

The Sweet Spot: 7-8B Models

For fine-tuning on consumer hardware (like Colab T4), 7-8B models hit the sweet spot:

  1. Fits with quantization: 8B in 4-bit = ~5GB, leaving room for training
  2. Strong capability: Sufficient parameters for complex reasoning
  3. Practical latency: Generates ~30 tokens/second on T4
  4. Good fine-tuning transfer: Enough capacity to learn domain patterns

When to Go Smaller (1-3B)

  • Mobile or edge deployment
  • Latency-critical applications
  • Very narrow task (classification, extraction)
  • Limited compute budget

When to Go Larger (70B+)

  • Complex multi-step reasoning required
  • Frontier quality expectations
  • Production with cloud GPU budget
  • RAG over large knowledge bases

Scaling Laws: The Pattern Behind Model Sizes

Research by OpenAI and others revealed predictable relationships between model size and performance:

The Core Relationship

These are three separate power laws, each measured with the other factors held fixed:

Loss ∝ Parameters^(-0.076)
Loss ∝ Compute^(-0.050)
Loss ∝ Data^(-0.095)

In plain terms:

  • Doubling parameters reduces loss by ~5%
  • Doubling compute reduces loss by ~3.5%
  • Doubling data reduces loss by ~6.5%

Implications for Model Selection

| If you want... | You need... |
| --- | --- |
| 10% better loss | ~4× parameters |
| Meaningful quality jump | ~10× parameters |
| Human-level on benchmark | Depends on task complexity |
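A quick sanity check of these figures from the exponents above (a sketch; the exponents are approximate published fits, and real-world gains depend heavily on data quality and task):

alpha_n, alpha_c, alpha_d = 0.076, 0.050, 0.095

# Loss reduction from doubling each factor independently
for name, alpha in [("parameters", alpha_n), ("compute", alpha_c), ("data", alpha_d)]:
    reduction = 1 - 2 ** (-alpha)
    print(f"Doubling {name}: ~{reduction:.1%} lower loss")
# Doubling parameters: ~5.1% lower loss
# Doubling compute: ~3.4% lower loss
# Doubling data: ~6.4% lower loss

# How many times more parameters for a 10% lower loss?
factor = 0.9 ** (-1 / alpha_n)
print(f"10% lower loss needs ~{factor:.1f}x the parameters")  # ~4.0x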

Compute-Optimal Training (Chinchilla)

The Chinchilla research showed that many models were undertrained for their size:

Optimal tokens = 20 × Parameters

For an 8B model: 20 × 8B = 160B tokens of training data

Why this matters for fine-tuning: Your fine-tuning dataset is tiny compared to pre-training (thousands vs. billions of tokens). You're making surgical adjustments, not training from scratch. This is why LoRA (small adapter matrices) works so well.


From Parameters to Memory

Converting parameters to memory requirements:

Memory (bytes) = Parameters × Bytes per parameter

Precision Levels

| Precision | Bytes/Param | 8B Model | Notes |
| --- | --- | --- | --- |
| FP32 | 4 | 32 GB | Full precision, rarely used |
| FP16/BF16 | 2 | 16 GB | Standard training precision |
| INT8 | 1 | 8 GB | 8-bit quantization |
| INT4/NF4 | 0.5 | 4 GB | 4-bit quantization (QLoRA) |

Quick Memory Estimates

Use this mental shortcut:

VRAM for loading ≈ Parameters (B) × 2 (for FP16)
VRAM for loading ≈ Parameters (B) × 0.5 (for INT4)

Examples:

  • Llama-3-8B at FP16: 8 × 2 = 16 GB
  • Llama-3-8B at INT4: 8 × 0.5 = 4 GB
  • Llama-3-70B at FP16: 70 × 2 = 140 GB
  • Llama-3-70B at INT4: 70 × 0.5 = 35 GB
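The shortcut is easy to wrap in a tiny helper (a sketch; the function and dictionary names are mine, and it estimates weights only, ignoring KV cache and framework overhead):

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billions: float, precision: str = "fp16") -> float:
    """Estimate VRAM (GB) needed just to hold the weights: billions of params x bytes per param."""
    return params_billions * BYTES_PER_PARAM[precision]

for model, size in [("Llama-3-8B", 8), ("Llama-3-70B", 70)]:
    for precision in ("fp16", "int4"):
        print(f"{model} @ {precision}: {weight_vram_gb(size, precision):.0f} GB")
# Llama-3-8B @ fp16: 16 GB
# Llama-3-8B @ int4: 4 GB
# Llama-3-70B @ fp16: 140 GB
# Llama-3-70B @ int4: 35 GB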

Note: Training requires additional memory for gradients, optimizer states, and activations. Lesson 4 covers the complete VRAM budget.


Practice: Estimate Model Parameters

Given this configuration, estimate the total parameters:

Model: Custom-LLM-2B
- Vocabulary: 50,000
- Hidden dimension: 2,048
- Layers: 24
- FFN intermediate: 8,192
- Attention heads: 16 (no GQA)

Work through it:

  1. Embedding: 50,000 × 2,048 = 102,400,000 = 102.4M

  2. Attention per layer:

    • Q: 2,048 × 2,048 = 4.2M
    • K: 2,048 × 2,048 = 4.2M (no GQA, full size)
    • V: 2,048 × 2,048 = 4.2M
    • O: 2,048 × 2,048 = 4.2M
    • Total attention: 16.8M per layer
  3. FFN per layer:

    • 3 × 2,048 × 8,192 = 50.3M per layer
  4. Per layer total: 16.8M + 50.3M = 67.1M

  5. All layers: 67.1M × 24 = 1,610M

  6. Output head: 2,048 × 50,000 = 102.4M

  7. Total: 102.4M + 1,610M + 102.4M = 1,815M ≈ 1.8B

The model name "2B" is approximately correct (the ~10% difference comes from rounding and minor components).
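A small helper makes the same estimate reusable (a sketch; the function name and signature are mine, and it assumes full attention plus a gated FFN, ignoring layer norms):

def estimate_params(vocab, hidden, layers, intermediate, tied_head=False):
    """Rough decoder-only parameter estimate with full (non-GQA) attention and a gated FFN."""
    embedding = vocab * hidden
    attention = 4 * hidden * hidden          # Q, K, V, O, all full-size
    ffn = 3 * hidden * intermediate          # gate + up + down
    head = 0 if tied_head else vocab * hidden
    return embedding + layers * (attention + ffn) + head

total = estimate_params(vocab=50_000, hidden=2_048, layers=24, intermediate=8_192)
print(f"{total / 1e9:.2f}B")  # -> 1.82B, matching the ~1.8B hand calculation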


Verifying Model Cards

Model cards on Hugging Face claim parameter counts. Verify them:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype="auto",
    device_map="cpu",
)

# Count actual loaded parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"In billions: {total_params / 1e9:.2f}B")

Output:

Total parameters: 1,235,814,400
Trainable parameters: 1,235,814,400
In billions: 1.24B

This matches the expected ~1.2B for Llama-3.2-1B. The loaded model comes in below the formula-based estimate from earlier because Llama-3.2-1B ties its output head to the embedding matrix, so that ~263M-parameter matrix is stored (and counted) only once.


Summary Table: Common Models

Reference table for quick lookups:

| Model | Parameters | Layers | Hidden | Heads | FFN | VRAM (FP16) | VRAM (4-bit) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B | 1.2B | 16 | 2,048 | 32 | 8,192 | 2.4 GB | 0.7 GB |
| Llama-3-8B | 8.0B | 32 | 4,096 | 32 | 14,336 | 16 GB | 5 GB |
| Mistral-7B | 7.2B | 32 | 4,096 | 32 | 14,336 | 14.4 GB | 4.5 GB |
| Llama-3-70B | 70B | 80 | 8,192 | 64 | 28,672 | 140 GB | 40 GB |

Try With AI

Now that you can count parameters, explore the implications with your AI companion.

Prompt 1: Verify Your Understanding

I calculated that Llama-3-8B has these parameter distributions:
- FFN: 5.6B (70%)
- Attention: 1.3B (17%)
- Embeddings: 1.0B (13%)

If I'm fine-tuning with LoRA targeting only Q and V matrices in attention:
1. What percentage of total parameters am I adapting?
2. Why is LoRA effective despite touching so few parameters?
3. Would targeting FFN layers be more or less effective?

Help me connect parameter distribution to fine-tuning strategy.

What you're learning: Understanding parameter distribution informs fine-tuning decisions.

Prompt 2: Size Selection for Your Use Case

I want to fine-tune a model for Task API (structured task management).
My constraints:
- Colab Free Tier (T4 GPU, 15GB VRAM)
- Want to train, not just inference
- Task complexity: extracting entities and generating JSON

Help me analyze:
1. What size model should I target? Explain the trade-off.
2. Is 1B sufficient for JSON generation or do I need 7B?
3. How would my choice change if I had an A100 (40GB VRAM)?

Walk me through the selection process.

What you're learning: Selecting the right model size balances capability, constraints, and use case.

Prompt 3: Scaling Law Implications

I learned about scaling laws: doubling parameters gives ~5% loss improvement.

But I'm fine-tuning, not pre-training. Help me understand:
1. Do scaling laws apply to fine-tuning the same way as pre-training?
2. For a narrow task (Task API), is an 8B model "overkill"?
3. What's the relationship between task complexity and model size?

Challenge my assumption that bigger is always better for fine-tuning.

What you're learning: Scaling laws provide context, but fine-tuning follows different dynamics than pre-training.

Safety Note

When loading large models for parameter counting, use device_map="cpu" to avoid GPU memory issues. For models above 13B, use low_cpu_mem_usage=True or just load the config (AutoConfig.from_pretrained) which doesn't download weights. Loading 70B weights on a consumer machine will exhaust RAM and potentially crash your system.