Capstone - Todo Batch Processing Performance
Throughout Part 5, you built a Todo app that started with simple task dictionaries and grew into a class-based system. You've mastered data structures, functions, classes, and the fundamentals of Python. Now comes the synthesis—building a production-ready task batch processing system that demonstrates true parallel task processing on multiple CPU cores.
This capstone is ambitious in scope but achievable with scaffolding. You're implementing a system that real companies use: multiple task processors reasoning about batches of todos in parallel, sharing results safely, and providing performance insights through benchmarking. The patterns you learn here scale directly to Part 6 (async agents) and beyond to distributed task queues like Celery.
What makes this capstone realistic: The batch processing system IS the benchmark workload. You're not building a toy system and then separately building benchmarks—you're building a system that measures itself while operating, demonstrating both functional correctness and performance optimization in one coherent project. And you understand WHY Python's GIL made this hard before free-threading.
Section 1: Todo Batch Processing Architecture
What Is a Todo Batch Processor?
In this lesson, a task batch processor is an independent computational unit that:
- Accepts input (a batch of todo tasks)
- Performs processing (validation, priority calculation, status updates)
- Produces output (structured result with metrics)
- Reports timing (how long the batch took to process)
Think of processors like team members working on todo batches. Each member gets a batch of tasks, processes them (validating priority, checking due dates, updating status), and reports results. The coordinator assigns work and collects results without waiting for anyone to finish before starting the next batch.
Todo Batch Processing Architecture
A todo batch system orchestrates multiple processors:
- Processor Pool: Collection of independent task processors ready to work
- Batch Distribution: Assigning batches of todos to processors
- Shared Results Container: Thread-safe collection holding all processing results
- Coordinator: Main thread that launches processors, waits for completion, and validates results
Here's a visual overview of the architecture:
Coordinator Thread
├── Launch Processor 1 (Thread 1) → Process Batch A
├── Launch Processor 2 (Thread 2) → Process Batch B
├── Launch Processor 3 (Thread 3) → Process Batch C
└── Launch Processor 4 (Thread 4) → Process Batch D
All processors work in PARALLEL (if free-threading enabled)
↓
Shared Results Container (Thread-Safe)
├── Result from Processor 1 (50 tasks processed, 0.23s)
├── Result from Processor 2 (50 tasks processed, 0.21s)
├── Result from Processor 3 (50 tasks processed, 0.24s)
└── Result from Processor 4 (50 tasks processed, 0.22s)
Coordinator collects results and produces report
With free-threading enabled, all four processors execute simultaneously on separate CPU cores (if available), achieving ~4x speedup on a 4-core machine.
Why Free-Threading Matters for Todo Processing
Consider a scenario: You have a task management system processing thousands of todo items daily. Each batch undergoes CPU-bound processing (priority recalculation, due date validation, status aggregation). You need to process multiple batches in parallel.
Traditional threading (with GIL):
- Processors 1-4 take turns holding the GIL
- Only one processes at a time; others wait (pseudo-concurrency)
- 4 processors on 4-core machine: ~1x performance (no speedup, just overhead)
- Processing 4 batches takes 4x sequential time
Free-threaded Python (GIL optional):
- Processors 1-4 execute simultaneously on separate cores
- No GIL overhead; true parallelism
- 4 processors on 4-core machine: ~3.5–4x performance gain (linear scaling)
- Processing 4 batches takes ~1x sequential time
This difference is transformative for todo applications at scale—batch processing finally gets the performance it deserves.
💬 AI Colearning Prompt
"Explain how a batch processing system differs from a traditional multi-threaded application. What makes batch processors independent units? How does free-threading change the performance characteristics when you have thousands of todo tasks to process?"
🎓 Expert Insight
In production todo systems, you don't design batch processing by accident. You understand that processor independence unlocks parallelism, and free-threading unlocks the hardware you paid for. This capstone teaches you to think architecturally about task processing at scale.
Section 2: Building the Foundation - Simple Batch Processor System
Let's start with Example 8: a scaffolded batch processor system that you'll extend throughout this lesson.
Example 8: Simple Todo Batch Processor Framework
Specification reference: Foundation code for capstone project; demonstrates processor pattern, thread launching, and result collection using the Task entity.
AI Prompt used:
"Create a Python 3.14 task batch processing system with: (1) Task dataclass with title, priority, done status, (2) ProcessingResult dataclass storing results, (3) TaskBatchProcessor class with batch processing method, (4) ThreadSafeResultCollector for results, (5) Free-threading detection, (6) Main launch function. Type hints throughout."
Generated code (tested on Python 3.14):
Loading Python environment...
Validation steps:
- ✅ Code tested on Python 3.14 with free-threading disabled (GIL mode)
- ✅ Code tested on Python 3.14 with free-threading enabled (no GIL mode)
- ✅ All type hints present; code passes
mypy --strictcheck - ✅ Exception handling: Processors that fail don't crash system
- ✅ Thread-safety verified: Multiple processors can append results simultaneously
- ✅ Uses Task dataclass consistent with Part 5 todo progression
Validation results: Speedup factor observed:
- Traditional threading (GIL): ~1.0-1.2x (little benefit; mostly overhead)
- Free-threaded Python: ~3.2x on 4-core machine (excellent scaling)
Section 3: Extending the System - Multiple Processor Types
Now that you understand the foundation, let's extend the system to demonstrate realistic diversity. Real batch processing systems have different processor types performing specialized tasks.
Design: Introducing Processor Specialization
Instead of identical processors, let's create a system with 3 processor types:
- TaskValidator: Validates task priorities and due dates (validation processing)
- TaskProcessor: Computes task status and updates (state transition processing)
- TaskReporter: Generates reports from task data (aggregation processing)
Each has different computational characteristics and duration profiles. This demonstrates that batch processing systems often combine processors with heterogeneous workloads.
🚀 CoLearning Challenge
Ask your AI Co-Teacher:
"I want to extend the foundation code with two more processor types: TaskValidator (validates task priorities between 1-10) and TaskReporter (summarizes task completion rates). Keep the foundation code. Show me the new processor classes and how they integrate with the existing system. Then explain how this demonstrates processor heterogeneity in a real batch system."
Expected outcome: You'll understand that batch processing systems don't require all processors to be identical. You'll see how inheritance or composition can model different processor types while maintaining compatible interfaces.
Section 4: Benchmarking Comparison - Three Approaches
The capstone's heart is benchmarking: comparing free-threaded Python against traditional threading and multiprocessing. This demonstrates why free-threading matters for todo batch processing.
Setting Up the Benchmark
We'll measure three approaches simultaneously:
- Traditional Threading (GIL-Constrained): Pseudo-concurrent (built-in)
- Free-Threaded Python (Optional): True parallel (if available)
- Multiprocessing: True parallel (always available, higher overhead)
For each approach, we measure:
- Execution Time: Total wall-clock time to process all batches
- CPU Usage: Percentage of available CPU utilized
- Memory Usage: Peak memory during batch processing
- Scalability: Speedup factor vs sequential execution
Example 8 Extension: Benchmarking Framework
To build comprehensive benchmarking, ask your AI Co-Teacher:
🚀 CoLearning Challenge
"Build a benchmarking framework that runs the todo batch processing system three ways: (1) Traditional threading, (2) Free-threaded Python (with fallback to traditional if not available), (3) Multiprocessing. Measure execution time, CPU percent, peak memory. Process batches of 100 tasks with 2, 4, and 8 processors. Create a table comparing results. Explain which is fastest for this workload and why."
Expected outcome: You'll implement working benchmarks, interpret performance data, and articulate why free-threading wins for task processing workloads.
✨ Teaching Tip
Use Claude Code to explore the
psutillibrary for measuring CPU and memory. Ask: "Show me how to measure CPU percent and peak memory during task processing. How do I get accurate measurements without interfering with the actual work?"
Section 5: Building the Dashboard
A production system needs visibility into performance. Let's build a benchmarking dashboard that displays results in human-readable format.
What the Dashboard Should Show
╔════════════════════════════════════════════════════════════════════╗
║ Todo Batch Processing Benchmark Results ║
╠════════════════════════════════════════════════════════════════════╣
║ Approach │ Time (s) │ Speedup │ CPU % │ Memory (MB) ║
╟───────────────────────┼──────────┼─────────┼────────┼──────────────╢
║ Traditional Threading │ 2.34 │ 1.0x │ 45% │ 12.5 ║
║ Free-Threaded Python │ 0.68 │ 3.4x │ 94% │ 14.2 ║
║ Multiprocessing │ 0.85 │ 2.8x │ 88% │ 28.3 ║
╚════════════════════════════════════════════════════════════════════╝
Winner: Free-Threaded Python
└─ 3.4x faster than traditional threading
└─ Excellent CPU utilization (94%)
└─ Reasonable memory overhead (14.2 MB)
└─ Best for CPU-bound batch processing with shared task state
🚀 CoLearning Challenge
"Create a benchmarking dashboard that displays results from all three approaches in a formatted ASCII table. Include a 'winner' analysis explaining which approach is fastest for batch processing and why. Explain why the performance pattern matters for scaling todo systems."
Expected outcome: You'll build a utility that transforms raw benchmark data into actionable insights for system design decisions.
Section 6: Shared State Management and Thread Safety
Batch processing systems require careful coordination. Multiple processors writing to shared state simultaneously introduces race conditions if not properly managed.
Thread-Safe Patterns
We already used threading.Lock in Example 8. Let's understand when and why it's necessary.
Pattern 1: Guarded Shared State (Lock)
Loading Python environment...
Pattern 2: Thread-Safe Data Structures
Python's queue.Queue and collections.deque are built thread-safe:
Loading Python environment...
💬 AI Colearning Prompt
"Explain the difference between guarded shared state (using Lock) and thread-safe collections (using Queue). When would you use each approach in a batch processing system?"
Defensive Design: Avoiding Shared State
The safest approach is minimal shared state. Instead of multiple processors writing to a shared list, use patterns that reduce contention:
- Per-processor result containers (processors write only to their own storage)
- Collect at the end (results come back when processors complete)
- Immutable results (processors can't modify data after creation)
This approach reduces lock contention and makes reasoning about thread safety simpler.
Section 7: Error Resilience and Failure Handling
Production systems must handle failures. What happens if one processor crashes while processing a batch? Should the entire system stop?
Answer: No. Processors should fail independently. One processor's failure shouldn't crash the system or lose unprocessed tasks.
Implementing Processor Isolation
Example 8 already includes try/except in batch processing:
Loading Python environment...
Key practices:
- Processor wraps its own work in try/except
- Failures return structured result (not exceptions to caller)
- System continues with remaining processors
- Failed results tracked (for debugging and retry logic)
🚀 CoLearning Challenge
"Add a test case where one processor deliberately fails (e.g., processing a malformed task). Show that the system continues and collects results from all other processors. Explain how this demonstrates resilience in production batch systems."
Expected outcome: You'll understand production-ready error handling and how to design systems that degrade gracefully when processing fails.
Section 8: Production Readiness and Scaling Preview
This capstone system runs on a single machine with threads. How does it scale?
From Single Machine to Production
What you've built (Single Machine):
- Multiple processors using free-threading
- Shared memory (same Python process)
- Synchronous batch collection
- Batch processing on local threads
How it scales (Part 6: Async Agents & Task Queues):
Async Task System (Same Machine)
Event Loop → [TaskProcessor 1 (coroutine)]
→ [TaskProcessor 2 (coroutine)]
→ [TaskProcessor 3 (coroutine)]
Async advantages: Handle I/O without threads
Tradeoff: CPU-bound work still needs free-threading or multiprocessing
Further scaling (Distributed Task Queues - Production):
Celery / RQ Distributed Queue
Redis Queue → [Worker 1 (Machine A): Process Batches]
→ [Worker 2 (Machine B): Process Batches]
→ [Worker 3 (Machine C): Process Batches]
→ [Results aggregated via queue]
Scales to thousands of machines; fault-tolerant;
same batch processing logic, now distributed.
Resource Efficiency
Free-threaded Python is transformative for processing large todo datasets:
Traditional (GIL):
- 4 processors on 4-core machine: Needs 4 containers (one per processor)
- Cost: 4 × container overhead
- CPU utilization: ~25% (wasted due to GIL)
Free-threaded:
- 4 processors on 4-core machine: One container with 4 threads
- Cost: 1 × container overhead
- CPU utilization: ~95% (efficient parallelism)
Production impact: Free-threading reduces infrastructure costs by ~75% for CPU-bound batch processing.
Section 9: Bringing It Together - Capstone Synthesis
Now you'll integrate everything into a complete capstone project.
Capstone Requirements
Part A: Todo Batch Processing System
- 3+ batch processors (from Section 3 extensions)
- Each processor processes independent task batch
- Thread-safe result collection
- Free-threading detection (print status at startup)
- Error handling (system continues if processor fails)
- Execution timing (measure individual and total time)
- Uses Task dataclass consistently
Part B: Benchmarking Dashboard
- Compare three approaches (traditional, free-threaded, multiprocessing)
- Measure: execution time, CPU %, memory, speedup
- Display results in formatted table
- Winner analysis (which is fastest and why?)
- Scalability analysis (performance at 2, 4, 8 processor counts)
Part C: Production Context Documentation
- Describe how this scales to async agents (Part 6)
- Explain how this connects to distributed task queues (Celery, RQ)
- Explain resource efficiency gains with free-threading
- Document design decisions made
- Create deployment checklist for production
Implementation Workflow
-
Step 1: Extend Example 8 (~40 min)
- Add 2 more processor types (Section 3)
- Build comprehensive benchmarking (Section 4)
- Create dashboard (Section 5)
-
Step 2: Add Resilience (~30 min)
- Implement error handling (Section 7)
- Test with intentional processor failures
- Verify system continues processing
-
Step 3: Measure and Document (~60 min)
- Run benchmarks on your machine
- Collect data across processor counts (2, 4, 8)
- Create production readiness document
-
Step 4: Validate and Iterate (~30 min)
- Review results with AI co-teacher
- Optimize based on insights
- Prepare for async transition in Part 6
✨ Teaching Tip
Use Claude Code throughout this capstone. Describe what you want to build, ask AI to generate a first draft, then validate and extend. This is how professional developers work with todo systems at scale. Your job: think architecturally, validate outputs, integrate components.
Section 10: Common Pitfalls and Production Lessons
Pitfall 1: Forgetting Lock Scope
Wrong:
Loading Python environment...
Right:
Loading Python environment...
Pitfall 2: Confusing Multiprocessing with Free-Threading
- Multiprocessing: Separate processes, separate Python interpreters, high overhead, true parallelism always
- Free-threaded: Same process, one interpreter, low overhead, true parallelism only on multi-core
For batch task processing, free-threading is superior (shared memory, lower overhead).
Pitfall 3: Benchmarking Mistakes
Wrong:
Loading Python environment...
Right:
Loading Python environment...
Pitfall 4: Assuming Free-Threading Always Wins
Free-threading excels for CPU-bound workloads with shared state. It's not automatically faster than alternatives:
- I/O-bound work: asyncio still beats free-threading (no GIL overhead means asyncio wins)
- Isolated work: Multiprocessing avoids lock contention (sometimes faster if minimal result sharing)
- Hybrid workloads: Combine approaches (free-threading for CPU processors, asyncio for I/O tasks)
Challenge 6: The Complete Todo Batch Processing Capstone (5-Part)
This is a 5-part bidirectional learning challenge where you complete, evaluate, and reflect on your production batch processing system.
Verification and Benchmarking Phase
Your Challenge: Ensure your built system actually demonstrates the concurrency concepts with real todo data.
Verification Checklist:
- Run your complete batch processing system from Part 4 of the main lesson
- Measure performance with three approaches:
- Traditional Python (GIL enabled)
- Free-threaded Python 3.14 (if available)
- ProcessPoolExecutor (for comparison)
- Verify correct results: all task batches process successfully
- Test error handling: kill one processor mid-run; system continues with unprocessed tasks
- Document timing:
{approach: (total_time, speedup_vs_sequential, cpu_utilization)}
Expected Behavior:
- Traditional threading: 1.0x speedup (GIL blocks parallelism)
- Free-threaded Python: 3–4x speedup on 4 cores (true parallelism)
- ProcessPoolExecutor: 2–3x speedup (process overhead overhead)
- All approaches process identical todo batches (correctness verified)
Deliverable: Create /tmp/batch_processing_verification.md documenting:
- Measured speedups for each approach
- CPU core utilization patterns
- Memory usage comparison
- Error handling confirmation
- Recommendation: which approach for production todo processing?
Performance Analysis Phase
Your Challenge: Understand WHERE time is spent and HOW to optimize batch processing.
Analysis Tasks:
- Profile each processor: Which processor is slowest? Which uses most CPU?
- Identify critical path: Which processor blocks other processors from completing?
- Measure processor communication overhead: How much time spent passing results?
- Test scaling: Run with 2, 3, 4, 5, 6 processors—what's the speedup pattern?
- Create timeline visualization: Show when each processor runs, where idle time exists
Expected Observations:
- One processor is likely the bottleneck (slowest)
- Processor communication is negligible vs computation
- Scaling benefits flatten after ~4 processors (diminishing returns as CPU cores saturate)
- Idle time exists if task batches are load-imbalanced
Self-Validation:
- Can you explain why performance stops improving beyond 4 processors?
- What would happen if you rebalanced task distribution across processors?
- How would results change with 20 processors on 4 cores?
Learning Production Optimization
Your AI Prompt:
"I built a 4-processor system that achieves 3.2x speedup on 4 cores with free-threading. But when I test with 8 processors, speedup only goes to 3.4x, not 4x. Teach me: 1) Why does speedup plateau when processing todo batches? 2) How do I profile to find the bottleneck? 3) What optimization strategies exist (load balancing, work distribution, architectural changes)? 4) Is 3.4x good enough or should I redesign? Show me decision framework."
AI's Role: Explain scaling limitations (Amdahl's law), show profiling techniques, discuss realistic optimization strategies, help you decide between "good enough" and "optimize more."
Interactive Moment: Ask a clarifying question:
"You mentioned load balancing. But my processors handle different todo validations (priority checking, date validation, status updates). They can't be perfectly balanced. How do I handle inherently unbalanced workloads?"
Expected Outcome: AI clarifies that perfect scalability is rare, optimization is contextual, and knowing when to stop optimizing is important. You learn production mindset.
System Architecture and Extension Phase
Setup: AI generates an optimized version using techniques like load balancing and work stealing. Your job is to verify benefits and teach AI about trade-offs.
AI's Initial Code (ask for this):
"Show me an optimized version of the batch processing system that: 1) Implements load balancing (distribute work based on processor capacity), 2) Uses work-stealing queues (idle processors grab tasks from busy processors), 3) Measures and reports per-processor efficiency. Benchmark against my original version and show if optimization actually helps."
Your Task:
- Run the optimized version. Measure speedup and overhead
- Compare to original: did optimization help or hurt?
- Identify issues:
- Did load balancing add complexity?
- Does work-stealing introduce contention?
- Is the overhead worth the gain?
- Teach AI:
"Your optimized version is 5% faster but uses 3x more code. For production todo processing, is that worth it? How do I measure 'complexity cost' vs performance gain?"
Your Edge Case Discovery: Ask AI:
"What if I extend this to 100 processors on 4 cores? Your current optimization still won't help because we're CPU-limited, not work-imbalanced. What architectural changes are needed? Is free-threading still the right choice, or should I switch to distributed task queues (Celery, RQ)?"
Expected Outcome: You discover that optimization has diminishing returns. You learn to think about architectural limits and when to change approach entirely.
Reflection and Synthesis Phase
Your Challenge: Synthesize everything you've learned about CPython and concurrency into principle-based thinking about batch processing.
Reflection Tasks:
-
Conceptual Mapping: Create diagram showing how Lessons 1-5 concepts connect:
- CPython internals (Lesson 1) → GIL design choice
- Performance optimizations (Lesson 2) → only help single-threaded
- GIL constraints (Lesson 3) → blocked threading for batch work
- Free-threading solution (Lesson 4) → removes GIL constraint
- Concurrency decision framework (Lesson 5) → applies decision at batch scale
-
Decision Artifacts: Document your production decisions:
- Why did you choose free-threaded Python for batch task processing?
- What performance metric mattered most (throughput? latency? memory)?
- What would trigger a redesign (more batches? more processors? data volume)?
- How does this system connect to Part 6 async agents and distributed queues?
-
Production Readiness Checklist:
- System demonstrates 3x+ speedup on 4 cores (GIL solved)
- Correct results on all approaches (functional equivalence)
- Error handling resilient (processors fail independently)
- Scaling characteristics understood (where speedup plateaus)
- Thread safety verified (no race conditions on shared state)
- Performance profiled (bottleneck identified)
- Deployment strategy defined (free-threading vs alternatives)
-
AI Conversation: Discuss system as if explaining to colleague:
"Our batch processing system uses free-threaded Python because [reason]. It achieves [speedup] on [cores] processing [tasks/second]. The bottleneck is [component]. For production, we'd scale by [approach - vertical to more cores, or horizontal to task queues]. We chose free-threading over multiprocessing because [tradeoff analysis]. What production issues might we hit?"
Expected Outcome: AI identifies realistic production concerns (dependency compatibility, deployment complexity, monitoring needs). You learn from production experience vicariously.
Deliverable: Save to /tmp/capstone_reflection.md:
- Concept map showing how CPython → GIL → free-threading → batch processing → distributed queues
- Decision documentation: why free-threading for this workload
- Performance characteristics: speedup, bottleneck, scaling limits
- Production deployment strategy: how batch processing scales beyond single machine
- Identified risks and mitigation strategies
- Lessons learned about concurrency decision-making for data processing
Chapter Synthesis: From CPython Internals to Production Todo Systems
You've now mastered:
- Layer 1 (Foundations): CPython architecture and implementation choices
- Layer 2 (Collaboration): Understanding GIL and its consequences
- Layer 3 (Intelligence): Free-threading as solution and its tradeoffs
- Layer 4 (Integration): Concurrency decision framework applied to batch processing at scale
You can now:
- Make informed choices about Python implementation and concurrency approach for data processing
- Benchmark systems and identify bottlenecks using data
- Scale from single-machine batch processing to distributed task queues (preview for Part 6)
- Design batch processing systems with appropriate parallelism strategy
- Explain CPython design choices and their production implications
- Connect single-machine batch concepts to async systems and distributed queues
Connection to Part 6: Understanding GIL and free-threading prepares you for Part 6 where you'll build async agents. The patterns you've learned here—parallel processing, thread safety, benchmarking, performance analysis—apply directly to agentic AI systems.
Time Estimate: 55-70 minutes (10 min verification, 12 min analysis, 12 min coach interaction, 12 min optimization, 9-24 min reflection)
Key Takeaway: You've moved from "I understand CPython" to "I design production batch processing systems knowing how CPython works and what constraints/capabilities it provides." The next frontier is scaling beyond single-machine (Part 6: async agents, then distributed task queues).
Try With AI
How do you build a todo batch processing system that achieves 3-4x CPU speedup with free-threading while handling processor failures gracefully?
🔍 Explore Batch Processing Architecture:
"Design a 4-processor system where each processor handles a batch of todo tasks. Show the architecture with TaskBatchProcessor class, thread launching, shared results container, and coordinator. Explain why free-threading enables 4x speedup for processing todo batches vs traditional threading."
🎯 Practice Comprehensive Benchmarking:
"Implement benchmarks comparing: (1) sequential execution (process all batches one after another), (2) traditional threading (with GIL), (3) free-threaded Python, (4) multiprocessing. For each, measure time, CPU%, memory processing 400 todo tasks across 4 batches. Create comparison table showing winner and trade-offs."
🧪 Test Thread Safety in Batch Processing:
"Create shared ResultCollector that multiple processors write to simultaneously when processing todo batches. Show race condition without Lock, then fix with threading.Lock(). Explain why free-threading exposes concurrency bugs that GIL hid."
🚀 Apply to Production Deployment:
"Explain how this single-machine batch processing system scales to async agents (Part 6) or distributed task queues like Celery (processing todos across multiple machines). What changes? What stays the same? How does free-threading reduce infrastructure costs for todo apps?"