
vLLM Production Architecture

Ollama is excellent for local development and single-user inference. But what happens when you need to serve 1,000 concurrent users? When your latency budget is 100ms? When you are running on expensive GPU clusters and every percentage point of efficiency matters?

This is where vLLM enters the picture. vLLM is a high-throughput LLM serving engine designed for production workloads. This lesson explains how it works and when you should consider it for your deployments.

Note: vLLM requires substantial GPU resources (typically 24GB+ VRAM for 7B models). This lesson focuses on conceptual understanding and architecture. Hands-on labs would require GPU resources beyond typical Colab free tier limits.

The Production Serving Challenge

Consider the differences between development and production serving:

| Aspect | Development (Ollama) | Production (vLLM) |
|---|---|---|
| Concurrent users | 1-10 | 100-10,000+ |
| Latency target | <1 second | <100ms (p99) |
| GPU utilization | 20-60% | 80-95% |
| Cost optimization | Not critical | Primary concern |
| Batching | None | Continuous |
| Memory efficiency | Good | Maximum |

Production serving requires squeezing maximum performance from expensive GPU hardware while maintaining strict latency guarantees.

Understanding KV Cache

To understand vLLM's innovations, you first need to understand the KV cache problem.

What is KV Cache?

During inference, transformer models compute attention over all previous tokens. Without caching, this means recomputing attention for every token position at every step.

Without KV Cache (quadratically slow):
Step 1: Compute attention for tokens [1]
Step 2: Compute attention for tokens [1, 2]
Step 3: Compute attention for tokens [1, 2, 3]
...
Step N: Compute attention for tokens [1, 2, ..., N]
Total computations: 1 + 2 + 3 + ... + N = O(N^2)

The KV cache stores the Key and Value projections from previous tokens:

With KV Cache (linear):
Step 1: Compute K, V for token 1, cache them
Step 2: Compute K, V for token 2, cache, attend over [1, 2]
Step 3: Compute K, V for token 3, cache, attend over [1, 2, 3]
...
Step N: Compute K, V for token N, cache, attend over [1..N]
Total: each K and V is computed only once, O(N) projection work instead of O(N^2)
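
To make the difference concrete, here is a toy single-head decoding loop in NumPy that caches K and V as it goes (an illustrative sketch only: random weights, no real model, and names like decode_step are invented for this example):

```python
import numpy as np

d = 64                                    # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []                 # grows by one entry per generated token

def decode_step(x):
    """One decoding step: compute K/V for the new token only, reuse the cache."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)   # (t, d): all tokens so far
    scores = K @ q / np.sqrt(d)                   # attend over cached tokens plus the new one
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                            # attention output for this position

for _ in range(5):                                # five decoding steps
    out = decode_step(rng.standard_normal(d))
print(f"Cache holds {len(k_cache)} K/V pairs")    # one per token: O(N) memory, O(1) new projections per step
```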

The Memory Problem

KV cache is essential for performance, but it consumes significant memory:

KV Cache Size per Request:
┌───────────────────────────────────────────────────────────────┐
│ Size = 2 × num_layers × num_heads × head_dim × context_len │
│ × bytes_per_element │
│ │
│ Example (LLaMA 7B, 4096 context): │
│ Size = 2 × 32 × 32 × 128 × 4096 × 2 bytes (FP16) │
│ Size ≈ 2 GB per request │
└───────────────────────────────────────────────────────────────┘

With 10 concurrent requests at the full 4096-token context, you need roughly 20 GB just for KV cache. With 100 requests, you need about 200 GB. This does not scale.
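
A quick check of this arithmetic (a minimal sketch; the layer, head, and dimension values are the commonly cited LLaMA 7B figures used above):

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size for one request: K and V for every layer, FP16 by default."""
    return 2 * num_layers * num_heads * head_dim * context_len * bytes_per_elem

per_request = kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128, context_len=4096)
print(f"Per request:  {per_request / 2**30:.1f} GiB")    # ~2.0 GiB
print(f"10 requests:  {10 * per_request / 2**30:.0f} GiB")
print(f"100 requests: {100 * per_request / 2**30:.0f} GiB")
```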

Static vs Dynamic Allocation

Traditional serving (including Ollama) pre-allocates KV cache for maximum context length:

Traditional KV Cache Allocation:
┌────────────────────────────────────────────────────────────┐
│ Request 1: Using 512 tokens [████░░░░░░░░░░░░░░░░░░░░░░] │
│ Request 2: Using 128 tokens [██░░░░░░░░░░░░░░░░░░░░░░░░] │
│ Request 3: Using 1024 tokens [████████░░░░░░░░░░░░░░░░░░] │
│ │
│ Each bar is 4096 tokens allocated │
│ Memory waste: ~86% (allocated but unused) │
└────────────────────────────────────────────────────────────┘

PagedAttention: vLLM's Core Innovation

PagedAttention applies virtual memory concepts to KV cache management. Instead of allocating contiguous memory per request, it allocates memory in fixed-size blocks (pages).

How PagedAttention Works

PagedAttention Block Structure:
┌─────────────────────────────────────────────────────────────┐
│ Physical Memory: Divided into fixed-size blocks (16 tokens) │
│ │
│ Block Pool: │
│ [Block 0] [Block 1] [Block 2] [Block 3] [Block 4] [Block 5] │
│ │
│ Block Table (per request): │
│ Request 1: [Block 0] → [Block 2] → [Block 4] (48 tokens) │
│ Request 2: [Block 1] → [Block 3] (32 tokens) │
│ Request 3: [Block 5] (16 tokens) │
│ │
│ New blocks allocated only when needed │
│ Completed requests return blocks to pool │
└─────────────────────────────────────────────────────────────┘
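
The sketch below mimics that block-table bookkeeping in plain Python (a toy allocator for illustration, not vLLM's internals; the class and method names are invented):

```python
BLOCK_SIZE = 16  # tokens per block, matching the diagram

class BlockPool:
    """Toy pool of fixed-size KV-cache blocks with a per-request block table."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # indices of free physical blocks
        self.tables = {}                      # request_id -> list of block indices

    def append_token(self, request_id, num_tokens_so_far):
        table = self.tables.setdefault(request_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:        # current block full (or first token)
            table.append(self.free.pop(0))             # allocate a new block on demand
        return table

    def release(self, request_id):
        self.free.extend(self.tables.pop(request_id))  # return blocks to the pool

pool = BlockPool(num_blocks=6)
for t in range(48):                  # request 1 generates 48 tokens -> 3 blocks
    pool.append_token("req1", t)
for t in range(20):                  # request 2 generates 20 tokens -> 2 blocks
    pool.append_token("req2", t)
print(pool.tables)                   # {'req1': [0, 1, 2], 'req2': [3, 4]}
pool.release("req1")                 # blocks 0-2 become available for new requests
```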

Memory Efficiency Comparison

| Scenario | Traditional | PagedAttention | Improvement |
|---|---|---|---|
| 10 requests, avg 500 tokens | 40 GB | 5 GB | 8x |
| 100 requests, avg 200 tokens | 400 GB | 20 GB | 20x |
| Mixed workload | High fragmentation | Minimal waste | Variable |

PagedAttention typically achieves 2-4x higher throughput compared to traditional memory management.

Prefix Caching

PagedAttention enables efficient prefix sharing. When multiple requests share the same system prompt:

Traditional: Each request caches system prompt separately
┌────────────────────────────────────────────────────────────┐
│ Request 1: [System Prompt Copy 1] [User Query 1] │
│ Request 2: [System Prompt Copy 2] [User Query 2] │
│ Request 3: [System Prompt Copy 3] [User Query 3] │
│ │
│ Memory: 3x system prompt │
└────────────────────────────────────────────────────────────┘

PagedAttention: Share system prompt blocks
┌────────────────────────────────────────────────────────────┐
│ Shared System Prompt Blocks: [A] [B] [C] │
│ │
│ Request 1 Block Table: [A] [B] [C] [D1] │
│ Request 2 Block Table: [A] [B] [C] [D2] │
│ Request 3 Block Table: [A] [B] [C] [D3] │
│ │
│ Memory: 1x system prompt + 3x unique content │
└────────────────────────────────────────────────────────────┘

For applications with consistent system prompts (like your Task API model), this can reduce memory usage by 30-50%.
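
A minimal sketch of the sharing idea, keyed on block content and tracked with reference counts (again a toy illustration, not vLLM's actual data structures):

```python
from collections import defaultdict

class SharedPrefixCache:
    """Toy cache: identical prefix blocks are stored once and shared by reference."""
    def __init__(self):
        self.blocks = {}                      # content key -> physical block id
        self.refcount = defaultdict(int)
        self.next_block = 0

    def get_or_create(self, block_tokens):
        key = tuple(block_tokens)
        if key not in self.blocks:            # first time we see this prefix block
            self.blocks[key] = self.next_block
            self.next_block += 1
        self.refcount[key] += 1               # another request now points at it
        return self.blocks[key]

cache = SharedPrefixCache()
system_prompt = list(range(48))               # 48 token ids -> 3 blocks of 16

for request in range(3):                      # three requests share the same system prompt
    table = [cache.get_or_create(system_prompt[i:i + 16]) for i in range(0, 48, 16)]
    print(f"request {request} block table: {table}")   # same ids every time: [0, 1, 2]

print(f"physical blocks used: {cache.next_block}")      # 3, not 9
```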

Continuous Batching

Traditional batching waits for a batch to complete before starting a new one. Continuous batching adds and removes requests dynamically.

Traditional (Static) Batching

Static Batching Timeline:
┌─────────────────────────────────────────────────────────────┐
│ Time → │
│ │
│ Batch 1: [R1─────] [R2──────] [R3────] │
│ ↓────────────────────────────↓ │
│ Start End (wait for longest) │
│ Batch 2 starts ─────→ │
│ │
│ GPU idle between R1 completion and Batch 1 end │
└─────────────────────────────────────────────────────────────┘

Continuous Batching

Continuous Batching Timeline:
┌─────────────────────────────────────────────────────────────┐
│ Time → │
│ │
│ Batch: [R1─────] [R2──────────] [R3────] │
│ ↑ ↑ ↑ │
│ [R4 joins] [R5 joins] [R6 joins] │
│ ↓ ↓ │
│ [R1 done] [R3 done] │
│ │
│ Requests join and leave dynamically │
│ GPU always processing at capacity │
└─────────────────────────────────────────────────────────────┘
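
A toy event-loop sketch of this scheduling policy (illustrative only; a real scheduler also enforces token and memory budgets at every step):

```python
import collections

# (request id, tokens still to generate) -- arbitrary example workload
waiting = collections.deque([("R1", 3), ("R2", 6), ("R3", 2), ("R4", 4), ("R5", 5)])
running = {}
MAX_BATCH = 3

step = 0
while waiting or running:
    # Admit new requests whenever a slot frees up, instead of waiting for the batch to drain.
    while waiting and len(running) < MAX_BATCH:
        req_id, remaining = waiting.popleft()
        running[req_id] = remaining

    # One decoding step: every running request produces exactly one token.
    for req_id in list(running):
        running[req_id] -= 1
        if running[req_id] == 0:
            del running[req_id]          # finished requests leave immediately

    step += 1
    print(f"step {step}: running={sorted(running)}")
```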

Throughput Impact

| Batching Strategy | Throughput | Latency Distribution |
|---|---|---|
| No batching | 1x baseline | Consistent |
| Static batching (size 8) | 3-4x | High variance |
| Continuous batching | 6-10x | Low variance |

Continuous batching increases throughput while maintaining predictable latency.

Advanced Optimizations

Tensor Parallelism

For models too large for a single GPU, vLLM supports tensor parallelism across multiple GPUs:

Tensor Parallelism (2 GPUs):
┌─────────────────────────────────────────────────────────────┐
│ Model Layer │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Weight Matrix W (8192 x 8192) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ↓ Split │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ GPU 0: W[:, :4096] │ │ GPU 1: W[:, 4096:] │ │
│ └────────────────────┘ └────────────────────┘ │
│ ↓ Gather the two output halves │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Combined output │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
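
Because each half of a column-split matrix produces a disjoint slice of the output, the gathered result is exactly the single-GPU result. That is easy to verify with NumPy (a single-process sketch; on real hardware each half lives on a different GPU and the gather is a collective operation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8192))          # a batch of 4 activation vectors
W = rng.standard_normal((8192, 8192))       # full weight matrix

# Column-parallel split: each "GPU" holds half of the output columns.
out_gpu0 = x @ W[:, :4096]
out_gpu1 = x @ W[:, 4096:]
combined = np.concatenate([out_gpu0, out_gpu1], axis=1)   # gather step

assert np.allclose(combined, x @ W)         # identical to the single-GPU result
```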

Speculative Decoding

Use a smaller "draft" model to propose tokens, then verify with the main model:

Speculative Decoding:
┌────────────────────────────────────────────────────────────┐
│ 1. Draft model (small, fast) proposes N tokens │
│ "The quick brown fox" → [jumps, over, the, lazy, dog] │
│ │
│ 2. Main model verifies all N tokens in parallel │
│ [✓jumps, ✓over, ✓the, ✗lazy (main model predicts "slow"), -] │
│ │
│ 3. Accept verified prefix, resample from divergence │
│ Accept: [jumps, over, the] │
│ Resample starting from position 4 │
│ │
│ Result: 3 tokens in time of 1 main model forward pass │
└────────────────────────────────────────────────────────────┘

Speculative decoding can speed up generation by 2-3x on workloads where the draft model's proposals are accepted frequently.
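
A toy sketch of the propose-verify-accept loop with greedy verification (the draft and main models here are stand-in functions, not real LLMs):

```python
def speculative_step(prefix, draft_model, main_model, num_draft_tokens=5):
    """Propose tokens with the draft model, keep the prefix the main model agrees with."""
    # 1. Draft model proposes N tokens autoregressively (cheap).
    proposed = []
    for _ in range(num_draft_tokens):
        proposed.append(draft_model(prefix + proposed))

    # 2. Main model scores every position; in a real engine this is one parallel forward pass.
    verified = [main_model(prefix + proposed[:i]) for i in range(num_draft_tokens)]

    # 3. Accept the longest agreeing prefix, then take the main model's token at the divergence.
    accepted = []
    for p, v in zip(proposed, verified):
        if p != v:
            accepted.append(v)        # resample from the point of divergence
            break
        accepted.append(p)
    return prefix + accepted

# Stand-in "models": the draft guesses greedily, the main model diverges at "lazy".
draft = lambda toks: ["jumps", "over", "the", "lazy", "dog"][len(toks) - 4]
main = lambda toks: ["jumps", "over", "the", "slow", "fox"][len(toks) - 4]

print(speculative_step(["The", "quick", "brown", "fox"], draft, main))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'slow']
```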

vLLM vs Ollama: Decision Framework

When should you choose each solution?

Choose Ollama When

  • Development and testing environment
  • Single-user or low-concurrency scenarios
  • Local deployment on consumer hardware
  • Rapid prototyping with multiple models
  • Simplicity is more important than efficiency

Choose vLLM When

  • Production deployment with many concurrent users
  • Strict latency SLAs (p99 < 100ms)
  • GPU cost optimization is critical
  • Need advanced features (prefix caching, speculative decoding)
  • Deploying on cloud GPU instances

Hybrid Approach

Many teams use both:

Development Pipeline:
┌─────────────────────────────────────────────────────────────┐
│ Local Development → Ollama (simple, fast iteration) │
│ ↓ │
│ Staging → vLLM on single GPU (production-like testing) │
│ ↓ │
│ Production → vLLM cluster (scaled deployment) │
└─────────────────────────────────────────────────────────────┘

vLLM Architecture Overview

A typical vLLM deployment architecture:

Production vLLM Architecture:
┌─────────────────────────────────────────────────────────────┐
│ Load Balancer (nginx/HAProxy) │
│ ↓ (round-robin) │
├─────────────────────────────────────────────────────────────┤
│ vLLM Workers (GPU instances) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Worker 1 │ │ Worker 2 │ │ Worker 3 │ │
│ │ 2x A100 │ │ 2x A100 │ │ 2x A100 │ │
│ │ TP=2 │ │ TP=2 │ │ TP=2 │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Shared Model Storage (S3/GCS/NFS) │
│ - model.safetensors │
│ - tokenizer files │
│ - config.json │
└─────────────────────────────────────────────────────────────┘

Key Configuration Parameters

| Parameter | Description | Typical Value |
|---|---|---|
| --tensor-parallel-size | GPUs per worker | 1, 2, 4, 8 |
| --gpu-memory-utilization | VRAM fraction to use | 0.85-0.95 |
| --max-num-batched-tokens | Max tokens per batch | 8192-32768 |
| --max-model-len | Max context length | 4096-131072 |
| --block-size | PagedAttention block size | 16 |
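
The same knobs are exposed through vLLM's offline Python API. A minimal sketch, assuming the vllm package is installed on a GPU machine; argument names mirror the CLI flags above but may differ slightly between versions, and the model id is only an example:

```python
# Offline batched inference with vLLM (sketch; check your installed version's docs
# for the exact argument names and supported models).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",   # any Hugging Face model id or local path
    tensor_parallel_size=2,                  # --tensor-parallel-size
    gpu_memory_utilization=0.90,             # --gpu-memory-utilization
    max_model_len=4096,                      # --max-model-len
)

params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["Summarize PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```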

Cost Comparison

Estimating costs for a production workload (1M requests/day, 500 tokens average):

| Deployment | Hardware | Monthly Cost | Throughput |
|---|---|---|---|
| Ollama (CPU) | 32-core server | $300-500 | Too slow |
| Ollama (GPU) | 1x RTX 4090 | $200-400 | 50-100 req/s |
| vLLM (single) | 1x A100 80GB | $1,500-2,500 | 500-1000 req/s |
| vLLM (cluster) | 4x A100 80GB | $6,000-10,000 | 2000-4000 req/s |

For 1M requests/day (~12 req/s average), a single A100 with vLLM handles the load comfortably with margin for peaks.
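
The back-of-envelope arithmetic behind that claim (average rate only; real capacity planning should also budget for peak traffic, which can be several times the average):

```python
requests_per_day = 1_000_000
tokens_per_request = 500                      # prompt + completion, from the table above

avg_rps = requests_per_day / 86_400           # seconds per day
avg_tokens_per_s = avg_rps * tokens_per_request

print(f"Average load: {avg_rps:.1f} req/s, {avg_tokens_per_s:,.0f} tokens/s")
# Average load: 11.6 req/s, 5,787 tokens/s -- well within the 500-1000 req/s
# figure quoted above for a single A100, leaving ample headroom for peaks.
```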

Reflect on Your Skill

Update your model-serving skill with production architecture considerations:

## Production Serving (vLLM)

### When to Consider vLLM
- >100 concurrent users
- <100ms latency requirement
- GPU cost optimization needed
- Cloud deployment

### Key Concepts
- PagedAttention: Virtual memory for KV cache
- Continuous batching: Dynamic request scheduling
- Prefix caching: Share system prompt blocks

### Decision Checklist
[ ] Calculate concurrent user load
[ ] Define latency SLA
[ ] Estimate GPU costs
[ ] Consider hybrid (Ollama dev → vLLM prod)

Try With AI

Use your AI companion (Claude, ChatGPT, Gemini, or similar).

Prompt 1: Architecture Analysis

I am planning production deployment for my Task API model with:
- Expected load: 500,000 requests per day
- Average prompt: 100 tokens, average response: 50 tokens
- P99 latency target: 200ms
- Model: 7B parameters

Help me analyze:
1. What is the throughput requirement in requests per second?
2. Would vLLM on a single A100 handle this load?
3. What configuration would you recommend?
4. What would be the approximate monthly GPU cost?

What you are learning: Capacity planning. Production deployments require quantitative analysis of load and resources.

Prompt 2: Compare Optimization Strategies

I need to reduce latency for my production LLM deployment. Current P99 is 500ms, target is 100ms.

Compare these optimization strategies:
A) Upgrade from A10G to A100 GPU
B) Enable speculative decoding with 1B draft model
C) Reduce context length from 4096 to 2048
D) Increase batch size and accept higher latency variance

For each, explain the tradeoffs and expected improvement.

What you are learning: Optimization tradeoffs. Each technique has costs and benefits that depend on your specific workload.

Prompt 3: Design Production Architecture

I need to design a production serving architecture that:
1. Handles 2000 requests per second at peak
2. Has 99.9% uptime SLA
3. Deploys my 13B parameter model
4. Uses AWS infrastructure
5. Minimizes cost while meeting requirements

Help me design:
- GPU instance types and count
- Load balancing strategy
- Failover approach
- Estimated monthly cost

What you are learning: Production architecture design. Real deployments balance performance, reliability, and cost.

Safety Note

When planning production deployments, always include a rollback plan. vLLM and other serving frameworks are rapidly evolving. A configuration that works today may behave differently after an update. Maintain the ability to quickly revert to a known-good configuration.