TDD Philosophy for Agent Development
Your agent test suite costs $1,847 per month to run.
Every time you push code, 50 test cases fire. Each test calls OpenAI, generating roughly 2,000 tokens at $0.003 per 1K tokens. That's $0.006 per test, or $0.30 per push. With 20 pushes per day, 30 days per month, CI alone burns $180. Add local test runs, debugging sessions, and multi-call integration tests, and the bill climbs to roughly $1,847 per month, all just to validate that your code works.
But here's the real problem: tests that cost money don't run often.
When running tests feels expensive, developers skip them. They push untested code. They merge pull requests with failing coverage. The agent ships with bugs that a $0.30 test run would have caught—bugs that now cost $3,000 in customer support and reputation damage.
This is the $1,847 test suite problem: your testing infrastructure is so expensive that you can't afford to use it properly.
There's a better way. And it costs exactly $0.00 per test run.
The TDD Philosophy: Test First, Then Implement
Test-Driven Development flips the traditional coding workflow on its head.
Traditional approach:
- Write code
- Manually test it
- Hope it works
- Maybe write tests later (but probably not)
TDD approach:
- Write a failing test that defines desired behavior
- Write minimum code to make test pass
- Refactor while keeping tests green
- Repeat
The key insight: you write the test before the implementation exists.
This sounds backwards until you understand what it accomplishes:
| Benefit | Why It Matters |
|---|---|
| Specification clarity | Writing a test forces you to define exactly what the code should do |
| Instant feedback | You know immediately when something breaks |
| Confidence to refactor | Tests catch regressions, so you can improve code structure freely |
| Documentation that runs | Tests show how code is meant to be used—and they prove it works |
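
To make the cycle concrete, here is a minimal sketch of the test-first ordering, using a hypothetical `priority_label` helper (not part of any project in this book): the test is written first and fails, then the smallest implementation makes it pass.

```python
# Step 1 (red): write the test before the implementation exists.
def test_priority_label_formats_high_priority():
    assert priority_label("high") == "[HIGH]"


# Step 2 (green): write the minimum code that makes the test pass.
def priority_label(priority: str) -> str:
    return f"[{priority.upper()}]"

# Step 3 (refactor): improve structure (validation, constants, types)
# while the test stays green -- writing a new failing test first for
# each new behavior.
```

Running pytest after step 1 fails with a `NameError`; after step 2 it passes. That red-to-green transition is the feedback loop.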
For agent development, TDD provides an additional critical benefit: isolation from expensive LLM calls.
When you test agent code with TDD, you mock the LLM responses. The test validates your code logic—endpoint behavior, database operations, tool execution—without ever calling OpenAI or Anthropic.
This is how you get from $1,847/month to $0/month.
The Critical Distinction: TDD vs Evals
Here's where most developers get confused.
TDD and Evals both involve testing AI agents. But they test fundamentally different things using fundamentally different approaches.
| Aspect | TDD (This Chapter) | Evals (Chapter 47) |
|---|---|---|
| Question | Does the code work correctly? | Does the LLM reason well? |
| Nature | Deterministic | Probabilistic |
| Output | Pass/Fail | Scores (0-1) |
| Tests | Functions, APIs, DB operations | Response quality, faithfulness |
| Speed | Fast (mocked LLM) | Slow (real LLM calls) |
| Cost | Zero (no API calls) | High (API calls required) |
Let me break this down with concrete examples.
TDD Tests Code Correctness
Consider an agent that manages tasks. TDD tests answer questions like:
- Does the `/tasks` endpoint return a 201 status when creating a task?
- Does the database constraint prevent duplicate task titles?
- Does the `create_task` tool function parse arguments correctly?
- Does the error handler return proper JSON when the LLM times out?
These questions have deterministic answers. Given the same input, you always get the same output. The test either passes or fails—there's no score between 0 and 1.
And critically: none of these questions require actually calling an LLM.
You mock the LLM response. You provide fake JSON that represents what OpenAI would return. Your code processes that fake response exactly as it would process a real one. If the code logic is correct, the test passes.
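
As a sketch of what that looks like in practice, suppose you have a hypothetical `extract_tool_call` helper that pulls the tool name and arguments out of a chat-completion payload. The test feeds it a hand-written dict shaped like an OpenAI response; no network call ever happens:

```python
import json

# Hand-written payload shaped like an OpenAI chat completion with a tool call.
FAKE_COMPLETION = {
    "choices": [{
        "message": {
            "tool_calls": [{
                "function": {
                    "name": "create_task",
                    "arguments": '{"title": "Buy groceries"}',
                }
            }]
        }
    }]
}


def extract_tool_call(payload: dict) -> tuple[str, dict]:
    """Your own parsing logic -- exactly the kind of code TDD verifies."""
    call = payload["choices"][0]["message"]["tool_calls"][0]["function"]
    return call["name"], json.loads(call["arguments"])


def test_extract_tool_call_reads_name_and_arguments():
    name, args = extract_tool_call(FAKE_COMPLETION)
    assert name == "create_task"
    assert args == {"title": "Buy groceries"}
```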
Evals Measure LLM Reasoning Quality
Now consider different questions:
- When asked "What are my high-priority tasks?", does the agent give a helpful response?
- Does the agent correctly interpret ambiguous requests?
- Are the agent's explanations accurate and trustworthy?
- Does the agent refuse inappropriate requests?
These questions require real LLM calls because you're measuring how the LLM thinks—not how your code processes the LLM's output.
There's no way to mock this. The whole point is to see what GPT-4 or Claude actually produces when given your prompt. And the answer isn't pass/fail—it's a quality score that might vary across runs.
Why the Distinction Matters
If you conflate TDD and Evals, you make expensive mistakes:
Mistake 1: Testing code logic with real LLM calls
Your endpoint handler is broken—it doesn't properly extract the task title from the response. You run a test that calls OpenAI, gets a response, and fails... but is the bug in your code or in the LLM's output? You can't tell without mocking.
Mistake 2: Trying to mock LLM reasoning quality
You create a mock that returns "Here are your high-priority tasks: Task 1, Task 2." Your test passes. But when users hit the real system, the LLM returns rambling, unhelpful responses. You tested your mock, not your agent.
The rule: TDD for code correctness (mocked, fast, free). Evals for LLM quality (real calls, slow, expensive).
What TDD Tests for Agents
Here's what belongs in your TDD test suite:
1. API Endpoint Behavior
```python
async def test_create_task_returns_201(client):
    response = await client.post(
        "/api/tasks",
        json={"title": "Test Task", "priority": "high"},
    )
    assert response.status_code == 201
    assert response.json()["title"] == "Test Task"
```
Output:
```
tests/test_tasks.py::test_create_task_returns_201 PASSED
```
This test verifies your FastAPI endpoint handles requests correctly. No LLM involved.
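
For context, the endpoint under test might look roughly like this; the route path and model fields are assumptions chosen to match the test above, not a prescribed implementation:

```python
from fastapi import APIRouter, status
from pydantic import BaseModel

router = APIRouter()


class TaskIn(BaseModel):
    title: str
    priority: str = "medium"


@router.post("/api/tasks", status_code=status.HTTP_201_CREATED)
async def create_task(payload: TaskIn) -> dict:
    # Persistence is omitted in this sketch; the test only pins down the
    # contract: 201 on success, title echoed back in the response body.
    return {"title": payload.title, "priority": payload.priority}
```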
2. Database Operations
```python
async def test_cascade_delete_removes_subtasks(session):
    project = Project(name="Test Project")
    session.add(project)
    await session.commit()

    task = Task(title="Test Task", project_id=project.id)
    session.add(task)
    await session.commit()

    await session.delete(project)
    await session.commit()

    result = await session.get(Task, task.id)
    assert result is None  # Cascade deleted
```
Output:
```
tests/test_models.py::test_cascade_delete_removes_subtasks PASSED
```
This test verifies your SQLModel relationships work correctly. No LLM involved.
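
The cascade the test relies on has to be declared somewhere. A sketch of how the models might do that with SQLModel (field names and the ORM-level cascade are assumptions; your schema may differ):

```python
from typing import Optional

from sqlmodel import Field, Relationship, SQLModel


class Project(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    # Deleting a Project also deletes its Tasks via the ORM-level cascade.
    tasks: list["Task"] = Relationship(
        back_populates="project",
        sa_relationship_kwargs={"cascade": "all, delete-orphan"},
    )


class Task(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    title: str
    project_id: Optional[int] = Field(default=None, foreign_key="project.id")
    project: Optional[Project] = Relationship(back_populates="tasks")
```

A database-level `ON DELETE CASCADE` on the foreign key is an equivalent approach; either way, the test above is what proves the behavior actually holds.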
3. Tool Function Logic
```python
def test_validate_input_rejects_injection():
    malicious = "'; DROP TABLE users; --"
    with pytest.raises(ValidationError):
        validate_input(malicious)
```
Output:
```
tests/test_tools.py::test_validate_input_rejects_injection PASSED
```
This test verifies your tool's input validation catches security threats. No LLM involved.
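
A minimal sketch of what `validate_input` could look like; the pattern list and the `ValidationError` class are illustrative only, and real defenses should rely on parameterized queries rather than string inspection:

```python
import re


class ValidationError(Exception):
    """Raised when user-supplied input looks unsafe."""


# Deliberately rough patterns for demonstration purposes.
_SUSPICIOUS = re.compile(r"(;|--|\bDROP\b|\bDELETE\b|\bUNION\b)", re.IGNORECASE)


def validate_input(value: str) -> str:
    if _SUSPICIOUS.search(value):
        raise ValidationError(f"Rejected suspicious input: {value!r}")
    return value
```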
4. Pipeline Flow with Mocked LLM
```python
@respx.mock
async def test_agent_creates_task_on_request(client):
    # Mock what OpenAI would return
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(200, json={
            "choices": [{
                "message": {
                    "tool_calls": [{
                        "function": {
                            "name": "create_task",
                            "arguments": '{"title": "Buy groceries"}'
                        }
                    }]
                }
            }]
        })
    )

    response = await client.post(
        "/api/agent/chat",
        json={"message": "Create a task to buy groceries"}
    )
    assert response.status_code == 200

    # Verify the task was actually created in the database
    tasks = await client.get("/api/tasks")
    assert any(t["title"] == "Buy groceries" for t in tasks.json())
```
Output:
```
tests/test_agent.py::test_agent_creates_task_on_request PASSED
```
This test verifies your entire pipeline—from API to LLM processing to database—works correctly. The LLM is mocked, so the test is fast and free.
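
For reference, the `client` fixture used in these examples can be wired to call the app in-process, so even the "integration" test above never opens a socket. A sketch assuming FastAPI, httpx's `ASGITransport`, and an async-capable pytest plugin (the `app.main` import path is an assumption):

```python
import httpx
import pytest

from app.main import app  # assumed location of your FastAPI application


@pytest.fixture
async def client():
    # ASGITransport dispatches requests directly to the ASGI app:
    # no server process, no network, no API keys.
    transport = httpx.ASGITransport(app=app)
    async with httpx.AsyncClient(transport=transport, base_url="http://test") as c:
        yield c
```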
What TDD Does NOT Test
These questions require Evals, not TDD:
| Question | Why It's an Eval |
|---|---|
| Is the agent's response helpful? | Requires human judgment or LLM-as-judge |
| Does the agent interpret ambiguous requests correctly? | Depends on LLM reasoning, not code logic |
| Are responses safe and appropriate? | Requires running against content filters |
| Does the agent stay on topic? | Measures LLM behavior, not code behavior |
| Is the explanation accurate and trustworthy? | Requires factuality evaluation against ground truth |
You'll learn to build evaluation systems for these questions in Chapter 47. For now, the key insight: don't try to TDD these qualities by mocking LLM responses.
When you mock the response, you're testing that your code handles that specific mock correctly. You're not testing that your prompt produces good outputs when fed to the real LLM.
The Cost Calculation: Why This Matters
Let's revisit that $1,847/month test suite.
Without mocking:
- 50 tests per run
- 2,000 tokens per test (request + response)
- $0.003 per 1K tokens (GPT-4o pricing)
- Cost per test: $0.006
- Cost per full run: $0.30
- Runs per day: 20 (CI/CD on every push)
- Monthly cost: $0.30 x 20 x 30 = $180
That seems manageable—until you consider:
- Multiple developers running tests locally
- Test runs during debugging (10x more frequent)
- Integration tests that make 5-10 LLM calls each
Realistic monthly cost: $1,000-2,000
With mocking:
- 50 tests per run
- Zero API calls
- Cost per test: $0.00
- Cost per full run: $0.00
- Monthly cost: $0.00
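
If you want to rerun this arithmetic against your own numbers, it fits in a few lines; the figures below are the same assumptions used above:

```python
def monthly_test_cost(tests_per_run: int, tokens_per_test: int,
                      price_per_1k_tokens: float, runs_per_day: int,
                      days_per_month: int = 30) -> float:
    cost_per_test = tokens_per_test / 1000 * price_per_1k_tokens
    return cost_per_test * tests_per_run * runs_per_day * days_per_month


# Unmocked CI: 50 tests x 2,000 tokens at $0.003/1K, 20 runs/day.
print(monthly_test_cost(50, 2000, 0.003, 20))  # 180.0
# Mocked suite: zero tokens per test, therefore zero dollars.
print(monthly_test_cost(50, 0, 0.003, 20))     # 0.0
```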
But cost isn't even the main benefit.
Speed comparison:
| Approach | Time per Test | Full Suite (50 tests) |
|---|---|---|
| Real LLM calls | 2-5 seconds | 2-4 minutes |
| Mocked responses | 10-50 milliseconds | <3 seconds |
When tests take 3 seconds, you run them constantly. After every change. Before every commit. During every code review.
When tests take 4 minutes, you run them once before merging and hope nothing broke.
The difference: catching bugs in 3 seconds versus discovering them in production.
Try With AI
Use your AI assistant to practice applying the TDD vs Evals framework.
Prompt 1: Categorize Agent Behaviors
I'm building a customer support agent that:
1. Creates support tickets in our database
2. Retrieves customer order history via API
3. Generates helpful responses to customer questions
4. Escalates complex issues to human agents
5. Validates customer identity before accessing accounts
Help me categorize each behavior: Should it be tested with TDD
(deterministic, mockable) or Evals (requires real LLM, measures
quality)? For each, explain your reasoning.
What you're learning: The skill of categorizing behaviors is foundational to building an effective test strategy. You need to know which tests to mock and which tests require real LLM calls before you can build either.
Prompt 2: Calculate Your Testing Cost
I'm planning tests for an agent with these characteristics:
- 30 API endpoints to test
- 10 database operations to validate
- 5 agent tools that need unit tests
- 3 integration flows that test the full pipeline
If I use real LLM calls for all tests:
- Average 1,500 tokens per call
- GPT-4o at $0.0025 per 1K tokens
- I run tests 15 times per day
- 22 working days per month
Calculate my monthly testing cost. Then calculate what I save by
mocking the LLM calls and only using real calls for the 3 integration
flows.
What you're learning: Quantifying costs makes the case for test infrastructure investment. When you can say "mocking saves us $X per month," you can justify the time spent learning these patterns.
Prompt 3: Apply to Your Domain
I'm building an agent for [describe your domain—e.g., "scheduling
medical appointments" or "processing insurance claims" or "tutoring
students in math"].
Based on the TDD vs Evals framework from this lesson, help me identify:
1. Three code behaviors I should test with TDD (deterministic,
mockable, tests code correctness)
2. Two quality aspects I should measure with Evals (probabilistic,
requires real LLM, measures reasoning quality)
For each, suggest what a test or eval might look like at a high level.
What you're learning: Applying abstract frameworks to concrete domains is how knowledge becomes skill. Your domain expertise combined with the TDD/Evals framework produces a test strategy tailored to your specific agent.
Reflect on Your Skill
If you created the agent-tdd skill in Lesson 0, let's verify and improve it.
Test Your Skill:
Using my agent-tdd skill, explain when to use TDD versus Evals for
testing an AI agent. Does my skill correctly distinguish:
- Deterministic tests from probabilistic evaluations?
- Code correctness from LLM reasoning quality?
- Zero-cost mocked tests from expensive real-call tests?
Identify Gaps:
After running that prompt, ask yourself:
- Does my skill include the comparison table from this lesson?
- Does it explain the cost implications of unmocked tests?
- Does it list what TDD tests versus what it doesn't test?
Improve Your Skill:
If you found gaps, update your skill:
My agent-tdd skill needs a clearer TDD vs Evals section. Add:
1. The six-dimension comparison table (Question, Nature, Output,
Tests, Speed, Cost)
2. A "What TDD Tests" section listing: endpoint behavior, DB
operations, tool logic, pipeline flow with mocked LLM
3. A "What TDD Does NOT Test" section: response quality,
interpretation accuracy, safety, topic adherence
4. A cost analysis showing why mocked tests enable rapid iteration
Your skill grows as you learn. By the end of this chapter, your agent-tdd skill will encode everything you've learned—a reusable asset for every future agent project.