vLLM Production Architecture
Ollama is excellent for local development and single-user inference. But what happens when you need to serve 1,000 concurrent users? When your latency budget is 100ms? When you are running on expensive GPU clusters and every percentage point of efficiency matters?
This is where vLLM enters the picture. vLLM is a high-throughput LLM serving engine designed for production workloads. This lesson explains how it works and when you should consider it for your deployments.
Note: vLLM requires substantial GPU resources (typically 24GB+ VRAM for 7B models). This lesson focuses on conceptual understanding and architecture. Hands-on labs would require GPU resources beyond typical Colab free tier limits.
The Production Serving Challenge
Consider the differences between development and production serving:
| Aspect | Development (Ollama) | Production (vLLM) |
|---|---|---|
| Concurrent users | 1-10 | 100-10,000+ |
| Latency target | <1 second | <100ms (p99) |
| GPU utilization | 20-60% | 80-95% |
| Cost optimization | Not critical | Primary concern |
| Batching | None | Continuous |
| Memory efficiency | Good | Maximum |
Production serving requires squeezing maximum performance from expensive GPU hardware while maintaining strict latency guarantees.
Understanding KV Cache
To understand vLLM's innovations, you first need to understand the KV cache problem.
What is KV Cache?
During inference, transformer models compute attention over all previous tokens. Without caching, this means recomputing attention for every token position at every step.
Without KV Cache (quadratic cost):
Step 1: Compute attention for tokens [1]
Step 2: Compute attention for tokens [1, 2]
Step 3: Compute attention for tokens [1, 2, 3]
...
Step N: Compute attention for tokens [1, 2, ..., N]
Total computations: 1 + 2 + 3 + ... + N = O(N^2)
The KV cache stores the Key and Value projections from previous tokens:
With KV Cache (linear):
Step 1: Compute K, V for token 1, cache them
Step 2: Compute K, V for token 2, cache, attend over [1, 2]
Step 3: Compute K, V for token 3, cache, attend over [1, 2, 3]
...
Step N: Compute K, V for token N, cache, attend over [1..N]
Total K, V computations: O(N) (each token's K and V are computed once, then reused)
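The same idea in a minimal NumPy sketch (single head, toy dimensions, no batching): each step projects only the newest token and attends over the cached keys and values.

```python
import numpy as np

def attend(q, K, V):
    # Single-query attention over all cached positions.
    scores = K @ q / np.sqrt(len(q))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = rng.standard_normal((3, d, d))

K_cache, V_cache = [], []
for x in rng.standard_normal((10, d)):        # ten decode steps
    K_cache.append(x @ Wk)                    # project ONLY the new token...
    V_cache.append(x @ Wv)                    # ...and cache its K and V
    out = attend(x @ Wq, np.array(K_cache), np.array(V_cache))
```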
The Memory Problem
KV cache is essential for performance, but it consumes significant memory:
KV Cache Size per Request:
┌──────────────────────────────────────────────────────────────┐
│ Size = 2 × num_layers × num_heads × head_dim × context_len   │
│        × bytes_per_value                                      │
│                                                               │
│ Example (LLaMA 7B, 4096 context, FP16):                       │
│ Size = 2 × 32 × 32 × 128 × 4096 × 2 bytes                     │
│ Size = ~2 GB per request                                      │
└──────────────────────────────────────────────────────────────┘
With 10 concurrent requests at 4096 context, you need about 20 GB just for KV cache. With 100 requests, you need about 200 GB. This does not scale.
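You can verify the arithmetic with a few lines of Python (the layer and head counts are the standard LLaMA 7B values; real engines add some overhead on top of this):

```python
# KV cache size for a LLaMA-7B-class model with full multi-head attention, FP16.
num_layers, num_heads, head_dim = 32, 32, 128
context_len, bytes_per_value = 4096, 2          # FP16 = 2 bytes per value

per_request = 2 * num_layers * num_heads * head_dim * context_len * bytes_per_value
print(f"per request : {per_request / 2**30:.1f} GiB")   # ~2.0 GiB
print(f"10 requests : {10 * per_request / 2**30:.0f} GiB")
print(f"100 requests: {100 * per_request / 2**30:.0f} GiB")
```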
Static vs Dynamic Allocation
Traditional serving (including Ollama) pre-allocates KV cache for maximum context length:
Traditional KV Cache Allocation:
┌────────────────────────────────────────────────────────────┐
│ Request 1: Using 512 tokens [████░░░░░░░░░░░░░░░░░░░░░░] │
│ Request 2: Using 128 tokens [██░░░░░░░░░░░░░░░░░░░░░░░░] │
│ Request 3: Using 1024 tokens [████████░░░░░░░░░░░░░░░░░░] │
│ │
│ Each bar is 4096 tokens allocated │
│ Memory waste: ~80% (allocated but unused) │
└────────────────────────────────────────────────────────────┘
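A quick calculation shows where the waste figure comes from; the exact number for the diagram's token counts is about 86%, in the same ballpark as the rounded figure above.

```python
max_context = 4096
used_tokens = [512, 128, 1024]                   # actual usage per request (diagram above)

allocated = len(used_tokens) * max_context       # tokens reserved up front per request
waste = 1 - sum(used_tokens) / allocated
print(f"wasted KV cache capacity: {waste:.0%}")  # ~86%
```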
PagedAttention: vLLM's Core Innovation
PagedAttention applies virtual memory concepts to KV cache management. Instead of allocating contiguous memory per request, it allocates memory in fixed-size blocks (pages).
How PagedAttention Works
PagedAttention Block Structure:
┌─────────────────────────────────────────────────────────────┐
│ Physical Memory: Divided into fixed-size blocks (16 tokens) │
│ │
│ Block Pool: │
│ [Block 0] [Block 1] [Block 2] [Block 3] [Block 4] [Block 5] │
│ │
│ Block Table (per request): │
│ Request 1: [Block 0] → [Block 2] → [Block 4] (48 tokens) │
│ Request 2: [Block 1] → [Block 3] (32 tokens) │
│ Request 3: [Block 5] (16 tokens) │
│ │
│ New blocks allocated only when needed │
│ Completed requests return blocks to pool │
└─────────────────────────────────────────────────────────────┘
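A toy allocator makes the block-table idea concrete. This is illustrative only; vLLM's real block manager also handles block sharing, swapping, and eviction.

```python
BLOCK_SIZE = 16   # tokens per block, matching the diagram above

class BlockPool:
    """Toy PagedAttention-style allocator: fixed-size blocks, per-request block tables."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                       # request id -> list of block ids

    def append_token(self, request_id: str, tokens_so_far: int) -> None:
        # A new physical block is needed only when the previous one is full.
        if tokens_so_far % BLOCK_SIZE == 0:
            self.block_tables.setdefault(request_id, []).append(self.free_blocks.pop())

    def finish(self, request_id: str) -> None:
        # Completed requests return their blocks to the shared pool immediately.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
```

Because blocks return to a shared pool the moment a request finishes, memory fragments far less than with per-request contiguous allocations.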
Memory Efficiency Comparison
| Scenario | Traditional | PagedAttention | Improvement |
|---|---|---|---|
| 10 requests, avg 500 tokens | 20 GB | 2.5 GB | 8x |
| 100 requests, avg 200 tokens | 200 GB | 10 GB | 20x |
| Mixed workload | High fragmentation | Minimal waste | Variable |
PagedAttention typically achieves 2-4x higher throughput compared to traditional memory management.
Prefix Caching
PagedAttention enables efficient prefix sharing. When multiple requests share the same system prompt:
Traditional: Each request caches system prompt separately
┌────────────────────────────────────────────────────────────┐
│ Request 1: [System Prompt Copy 1] [User Query 1] │
│ Request 2: [System Prompt Copy 2] [User Query 2] │
│ Request 3: [System Prompt Copy 3] [User Query 3] │
│ │
│ Memory: 3x system prompt │
└────────────────────────────────────────────────────────────┘
PagedAttention: Share system prompt blocks
┌────────────────────────────────────────────────────────────┐
│ Shared System Prompt Blocks: [A] [B] [C] │
│ │
│ Request 1 Block Table: [A] [B] [C] [D1] │
│ Request 2 Block Table: [A] [B] [C] [D2] │
│ Request 3 Block Table: [A] [B] [C] [D3] │
│ │
│ Memory: 1x system prompt + 3x unique content │
└────────────────────────────────────────────────────────────┘
For applications with consistent system prompts (like your Task API model), this can reduce memory usage by 30-50%.
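In vLLM this is exposed as automatic prefix caching. Below is a minimal sketch of the offline API; the flag name and defaults have shifted between vLLM releases, so treat it as illustrative and check the docs for your installed version. The model name and prompts are placeholders.

```python
from vllm import LLM, SamplingParams

SYSTEM = "You are the Task API assistant. Answer in JSON."   # shared prefix

# Requests that share the same system prompt reuse its cached KV blocks.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", enable_prefix_caching=True)
params = SamplingParams(max_tokens=128)

prompts = [f"{SYSTEM}\n\nUser: {q}" for q in ("Add a task", "List tasks", "Delete task 3")]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```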
Continuous Batching
Traditional (static) batching waits for every request in a batch to finish before the next batch can start, so short requests sit idle behind long ones. Continuous batching instead admits and retires requests dynamically, at every decode step.
Traditional (Static) Batching
Static Batching Timeline:
┌─────────────────────────────────────────────────────────────┐
│ Time → │
│ │
│ Batch 1: [R1─────] [R2──────] [R3────] │
│ ↓────────────────────────────↓ │
│ Start End (wait for longest) │
│ Batch 2 starts ─────→ │
│ │
│ GPU idle between R1 completion and Batch 1 end │
└─────────────────────────────────────────────────────────────┘
Continuous Batching
Continuous Batching Timeline:
┌─────────────────────────────────────────────────────────────┐
│ Time → │
│ │
│ Batch: [R1─────] [R2──────────] [R3────] │
│ ↑ ↑ ↑ │
│ [R4 joins] [R5 joins] [R6 joins] │
│ ↓ ↓ │
│ [R1 done] [R3 done] │
│ │
│ Requests join and leave dynamically │
│ GPU always processing at capacity │
└─────────────────────────────────────────────────────────────┘
Throughput Impact
| Batching Strategy | Throughput | Latency Distribution |
|---|---|---|
| No batching | 1x baseline | Consistent |
| Static batching (size 8) | 3-4x | High variance |
| Continuous batching | 6-10x | Low variance |
Continuous batching increases throughput while keeping latency predictable, because no request waits for an entire batch to drain.
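The scheduling idea reduces to a small loop. This is a conceptual sketch, not vLLM's actual scheduler; `model_step` is a hypothetical batched decode function.

```python
from collections import deque

MAX_RUNNING = 32          # cap on requests decoded together (illustrative)
waiting: deque = deque()  # requests that have arrived but not yet been admitted
running: list = []        # requests currently in the batch

def scheduler_step(model_step):
    # 1. Admit waiting requests while there is room in the running batch.
    while waiting and len(running) < MAX_RUNNING:
        running.append(waiting.popleft())

    # 2. One batched decode step produces one token for every running request.
    finished = model_step(running)

    # 3. Finished requests leave immediately, freeing their slots (and KV blocks)
    #    for the next step instead of waiting for the whole batch to drain.
    for request in finished:
        running.remove(request)
```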
Advanced Optimizations
Tensor Parallelism
For models too large for a single GPU, vLLM supports tensor parallelism across multiple GPUs:
Tensor Parallelism (2 GPUs):
┌─────────────────────────────────────────────────────────────┐
│ Model Layer │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Weight Matrix W (8192 x 8192) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ↓ Split │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ GPU 0: W[:, :4096] │ │ GPU 1: W[:, 4096:] │ │
│ └────────────────────┘ └────────────────────┘ │
│ ↓ All-reduce │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Combined output │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
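The column split in the diagram is easy to sanity-check with NumPy (toy sizes; real tensor parallelism also splits attention heads and pairs column- and row-parallel layers so that a single all-reduce per block suffices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024))        # a small activation batch (toy size)
W = rng.standard_normal((1024, 1024))     # the full weight matrix

# Each "GPU" holds half of the output columns and computes its slice independently.
W_gpu0, W_gpu1 = W[:, :512], W[:, 512:]
partial0, partial1 = x @ W_gpu0, x @ W_gpu1

# Gathering the partial outputs reproduces the unsplit layer exactly.
combined = np.concatenate([partial0, partial1], axis=1)
assert np.allclose(combined, x @ W)
```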
Speculative Decoding
Use a smaller "draft" model to propose tokens, then verify with the main model:
Speculative Decoding:
┌────────────────────────────────────────────────────────────┐
│ 1. Draft model (small, fast) proposes N tokens │
│ "The quick brown fox" → [jumps, over, the, lazy, dog] │
│ │
│ 2. Main model verifies all N tokens in parallel │
│    [✓jumps, ✓over, ✓the, ✗lazy, -]                         │
│ │
│ 3. Accept verified prefix, resample from divergence │
│ Accept: [jumps, over, the] │
│ Resample starting from position 4 │
│ │
│ Result: 3 tokens in time of 1 main model forward pass │
└────────────────────────────────────────────────────────────┘
Speculative decoding can speed up generation by 2-3x on suitable workloads; the gains are largest when the GPU is not already saturated by large batches.
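The verify-and-accept logic can be sketched in a few lines. This is the simplified greedy-agreement variant, not the full rejection-sampling procedure from the research literature, and `draft_model` / `main_model` are hypothetical objects, not vLLM APIs.

```python
def speculative_step(context, draft_model, main_model, k=5):
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    draft_tokens = draft_model.propose(context, num_tokens=k)

    # 2. The main model scores all k positions in ONE batched forward pass,
    #    yielding its own preferred token at each position.
    main_tokens = main_model.predict_each_position(context, draft_tokens)

    # 3. Accept the longest prefix where both models agree; at the first
    #    disagreement, keep the main model's token and stop.
    accepted = []
    for drafted, preferred in zip(draft_tokens, main_tokens):
        accepted.append(preferred)
        if drafted != preferred:
            break
    return accepted   # several tokens for the cost of one main-model pass
```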
vLLM vs Ollama: Decision Framework
When should you choose each solution?
Choose Ollama When
- Development and testing environment
- Single-user or low-concurrency scenarios
- Local deployment on consumer hardware
- Rapid prototyping with multiple models
- Simplicity is more important than efficiency
Choose vLLM When
- Production deployment with many concurrent users
- Strict latency SLAs (p99 < 100ms)
- GPU cost optimization is critical
- Need advanced features (prefix caching, speculative decoding)
- Deploying on cloud GPU instances
Hybrid Approach
Many teams use both:
Development Pipeline:
┌─────────────────────────────────────────────────────────────┐
│ Local Development → Ollama (simple, fast iteration) │
│ ↓ │
│ Staging → vLLM on single GPU (production-like testing) │
│ ↓ │
│ Production → vLLM cluster (scaled deployment) │
└─────────────────────────────────────────────────────────────┘
vLLM Architecture Overview
A typical vLLM deployment architecture:
Production vLLM Architecture:
┌─────────────────────────────────────────────────────────────┐
│ Load Balancer (nginx/HAProxy) │
│ ↓ (round-robin) │
├─────────────────────────────────────────────────────────────┤
│ vLLM Workers (GPU instances) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Worker 1 │ │ Worker 2 │ │ Worker 3 │ │
│ │ 2x A100 │ │ 2x A100 │ │ 2x A100 │ │
│ │ TP=2 │ │ TP=2 │ │ TP=2 │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Shared Model Storage (S3/GCS/NFS) │
│ - model.safetensors │
│ - tokenizer files │
│ - config.json │
└─────────────────────────────────────────────────────────────┘
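Because each vLLM worker exposes an OpenAI-compatible API, clients talk to the cluster through the load balancer exactly as they would to any OpenAI-style endpoint. A hedged sketch using the openai Python client follows; the host, port, and model name are placeholders for your own deployment.

```python
from openai import OpenAI

# Point the standard OpenAI client at the load balancer in front of the vLLM workers.
client = OpenAI(base_url="http://llm-load-balancer.internal:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Summarize today's open tasks."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```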
Key Configuration Parameters
| Parameter | Description | Typical Value |
|---|---|---|
| --tensor-parallel-size | GPUs per worker | 1, 2, 4, 8 |
| --gpu-memory-utilization | VRAM fraction to use | 0.85-0.95 |
| --max-num-batched-tokens | Max tokens per batch | 8192-32768 |
| --max-model-len | Max context length | 4096-131072 |
| --block-size | PagedAttention block size | 16 |
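The same parameters are available through vLLM's Python API. A sketch follows; the argument names track the CLI flags in recent releases, but verify against the version you have installed, and the model name is a placeholder.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,          # --tensor-parallel-size
    gpu_memory_utilization=0.90,     # --gpu-memory-utilization
    max_num_batched_tokens=8192,     # --max-num-batched-tokens
    max_model_len=4096,              # --max-model-len
    block_size=16,                   # --block-size
)
```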
Cost Comparison
Estimating costs for a production workload (1M requests/day, 500 tokens average):
| Deployment | Hardware | Monthly Cost | Throughput |
|---|---|---|---|
| Ollama (CPU) | 32-core server | $300-500 | Too slow |
| Ollama (GPU) | 1x RTX 4090 | $200-400 | 50-100 req/s |
| vLLM (single) | 1x A100 80GB | $1,500-2,500 | 500-1000 req/s |
| vLLM (cluster) | 4x A100 80GB | $6,000-10,000 | 2000-4000 req/s |
For 1M requests/day (~12 req/s average), a single A100 with vLLM handles the load comfortably with margin for peaks.
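A quick back-of-envelope check supports this; the 3x peak factor is an assumption, and the capacity figure is the low end of the table above.

```python
requests_per_day = 1_000_000
avg_rps = requests_per_day / 86_400        # ~11.6 requests/second on average
peak_rps = avg_rps * 3                     # assumed 3x peak-to-average ratio
a100_capacity_rps = 500                    # low end of the table's 500-1000 req/s

print(f"average: {avg_rps:.1f} req/s, assumed peak: {peak_rps:.0f} req/s")
print(f"headroom at peak: {a100_capacity_rps / peak_rps:.0f}x")
```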
Reflect on Your Skill
Update your model-serving skill with production architecture considerations:
## Production Serving (vLLM)
### When to Consider vLLM
- >100 concurrent users
- <100ms latency requirement
- GPU cost optimization needed
- Cloud deployment
### Key Concepts
- PagedAttention: Virtual memory for KV cache
- Continuous batching: Dynamic request scheduling
- Prefix caching: Share system prompt blocks
### Decision Checklist
[ ] Calculate concurrent user load
[ ] Define latency SLA
[ ] Estimate GPU costs
[ ] Consider hybrid (Ollama dev → vLLM prod)
Try With AI
Use your AI companion (Claude, ChatGPT, Gemini, or similar).
Prompt 1: Architecture Analysis
I am planning production deployment for my Task API model with:
- Expected load: 500,000 requests per day
- Average prompt: 100 tokens, average response: 50 tokens
- P99 latency target: 200ms
- Model: 7B parameters
Help me analyze:
1. What is the throughput requirement in requests per second?
2. Would vLLM on a single A100 handle this load?
3. What configuration would you recommend?
4. What would be the approximate monthly GPU cost?
What you are learning: Capacity planning. Production deployments require quantitative analysis of load and resources.
Prompt 2: Compare Optimization Strategies
I need to reduce latency for my production LLM deployment. Current P99 is 500ms, target is 100ms.
Compare these optimization strategies:
A) Upgrade from A10G to A100 GPU
B) Enable speculative decoding with 1B draft model
C) Reduce context length from 4096 to 2048
D) Increase batch size and accept higher latency variance
For each, explain the tradeoffs and expected improvement.
What you are learning: Optimization tradeoffs. Each technique has costs and benefits that depend on your specific workload.
Prompt 3: Design Production Architecture
I need to design a production serving architecture that:
1. Handles 2000 requests per second at peak
2. Has 99.9% uptime SLA
3. Deploys my 13B parameter model
4. Uses AWS infrastructure
5. Minimizes cost while meeting requirements
Help me design:
- GPU instance types and count
- Load balancing strategy
- Failover approach
- Estimated monthly cost
What you are learning: Production architecture design. Real deployments balance performance, reliability, and cost.
Safety Note
When planning production deployments, always include a rollback plan. vLLM and other serving frameworks are rapidly evolving. A configuration that works today may behave differently after an update. Maintain the ability to quickly revert to a known-good configuration.