TDD Philosophy for Agent Development
Your agent test suite costs $1,847 per month to run.
Every time you push code, 50 test cases fire. Each test calls OpenAI, generating roughly 2,000 tokens at $0.003 per 1K tokens. That's $0.006 per test, or $0.30 per push. With 20 pushes per day, 30 days per month, CI alone burns $180. Add local test runs, debugging sessions, and multi-call integration tests, and the bill climbs to roughly $1,847 per month, all just to validate that your code works.
But here's the real problem: tests that cost money don't run often.
When running tests feels expensive, developers skip them. They push untested code. They merge pull requests with failing coverage. The agent ships with bugs that a $0.30 test run would have caught—bugs that now cost $3,000 in customer support and reputation damage.
This is the $1,847 test suite problem: your testing infrastructure is so expensive that you can't afford to use it properly.
There's a better way. And it costs exactly $0.00 per test run.
The TDD Philosophy: Test First, Then Implement
Test-Driven Development flips the traditional coding workflow on its head.
Traditional approach:
- Write code
- Manually test it
- Hope it works
- Maybe write tests later (but probably not)
TDD approach:
- Write a failing test that defines desired behavior
- Write minimum code to make test pass
- Refactor while keeping tests green
- Repeat
The key insight: you write the test before the implementation exists.
This sounds backwards until you understand what it accomplishes:
| Benefit | Why It Matters |
|---|---|
| Specification clarity | Writing a test forces you to define exactly what the code should do |
| Instant feedback | You know immediately when something breaks |
| Confidence to refactor | Tests catch regressions, so you can improve code structure freely |
| Documentation that runs | Tests show how code is meant to be used—and they prove it works |
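
To make the cycle concrete, here is a minimal sketch of the test-first ordering, using a hypothetical `priority_label` helper (not part of any project in this book): the test is written first and fails, then the smallest implementation makes it pass.

```python
# Step 1 (red): write the test before the implementation exists.
def test_priority_label_formats_high_priority():
    assert priority_label("high") == "[HIGH]"


# Step 2 (green): write the minimum code that makes the test pass.
def priority_label(priority: str) -> str:
    return f"[{priority.upper()}]"

# Step 3 (refactor): improve structure (validation, constants, types)
# while the test stays green -- writing a new failing test first for
# each new behavior.
```

Running pytest after step 1 fails with a `NameError`; after step 2 it passes. That red-to-green transition is the feedback loop.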
For agent development, TDD provides an additional critical benefit: isolation from expensive LLM calls.
When you test agent code with TDD, you mock the LLM responses. The test validates your code logic—endpoint behavior, database operations, tool execution—without ever calling OpenAI or Anthropic.
This is how you get from $1,847/month to $0/month.
The Critical Distinction: TDD vs Evals
Here's where most developers get confused.
TDD and Evals both involve testing AI agents. But they test fundamentally different things using fundamentally different approaches.
| Aspect | TDD (This Chapter) | Evals (Chapter 47) |
|---|---|---|
| Question | Does the code work correctly? | Does the LLM reason well? |
| Nature | Deterministic | Probabilistic |
| Output | Pass/Fail | Scores (0-1) |
| Tests | Functions, APIs, DB operations | Response quality, faithfulness |
| Speed | Fast (mocked LLM) | Slow (real LLM calls) |
| Cost | Zero (no API calls) | High (API calls required) |
Let me break this down with concrete examples.
TDD Tests Code Correctness
Consider an agent that manages tasks. TDD tests answer questions like:
- Does the `/tasks` endpoint return a 201 status when creating a task?
- Does the database constraint prevent duplicate task titles?
- Does the `create_task` tool function parse arguments correctly?
- Does the error handler return proper JSON when the LLM times out?
These questions have deterministic answers. Given the same input, you always get the same output. The test either passes or fails—there's no score between 0 and 1.
And critically: none of these questions require actually calling an LLM.
You mock the LLM response. You provide fake JSON that represents what OpenAI would return. Your code processes that fake response exactly as it would process a real one. If the code logic is correct, the test passes.
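
As a sketch of what that looks like in practice, suppose you have a hypothetical `extract_tool_call` helper that pulls the tool name and arguments out of a chat-completion payload. The test feeds it a hand-written dict shaped like an OpenAI response; no network call ever happens:

```python
import json

# Hand-written payload shaped like an OpenAI chat completion with a tool call.
FAKE_COMPLETION = {
    "choices": [{
        "message": {
            "tool_calls": [{
                "function": {
                    "name": "create_task",
                    "arguments": '{"title": "Buy groceries"}',
                }
            }]
        }
    }]
}


def extract_tool_call(payload: dict) -> tuple[str, dict]:
    """Your own parsing logic -- exactly the kind of code TDD verifies."""
    call = payload["choices"][0]["message"]["tool_calls"][0]["function"]
    return call["name"], json.loads(call["arguments"])


def test_extract_tool_call_reads_name_and_arguments():
    name, args = extract_tool_call(FAKE_COMPLETION)
    assert name == "create_task"
    assert args == {"title": "Buy groceries"}
```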
Evals Measure LLM Reasoning Quality
Now consider different questions:
- When asked "What are my high-priority tasks?", does the agent give a helpful response?
- Does the agent correctly interpret ambiguous requests?
- Are the agent's explanations accurate and trustworthy?
- Does the agent refuse inappropriate requests?
These questions require real LLM calls because you're measuring how the LLM thinks—not how your code processes the LLM's output.
There's no way to mock this. The whole point is to see what GPT-4 or Claude actually produces when given your prompt. And the answer isn't pass/fail—it's a quality score that might vary across runs.
Why the Distinction Matters
If you conflate TDD and Evals, you make expensive mistakes:
Mistake 1: Testing code logic with real LLM calls
Your endpoint handler is broken—it doesn't properly extract the task title from the response. You run a test that calls OpenAI, gets a response, and fails... but is the bug in your code or in the LLM's output? You can't tell without mocking.
Mistake 2: Trying to mock LLM reasoning quality
You create a mock that returns "Here are your high-priority tasks: Task 1, Task 2." Your test passes. But when users hit the real system, the LLM returns rambling, unhelpful responses. You tested your mock, not your agent.
The rule: TDD for code correctness (mocked, fast, free). Evals for LLM quality (real calls, slow, expensive).
What TDD Tests for Agents
Here's what belongs in your TDD test suite:
1. API Endpoint Behavior
```python
async def test_create_task_returns_201(client):
    response = await client.post(
        "/api/tasks",
        json={"title": "Test Task", "priority": "high"},
    )
    assert response.status_code == 201
    assert response.json()["title"] == "Test Task"
```
Output:
```
tests/test_tasks.py::test_create_task_returns_201 PASSED
```
This test verifies your FastAPI endpoint handles requests correctly. No LLM involved.
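
For context, the endpoint under test might look roughly like this; the route path and model fields are assumptions chosen to match the test above, not a prescribed implementation:

```python
from fastapi import APIRouter, status
from pydantic import BaseModel

router = APIRouter()


class TaskIn(BaseModel):
    title: str
    priority: str = "medium"


@router.post("/api/tasks", status_code=status.HTTP_201_CREATED)
async def create_task(payload: TaskIn) -> dict:
    # Persistence is omitted in this sketch; the test only pins down the
    # contract: 201 on success, title echoed back in the response body.
    return {"title": payload.title, "priority": payload.priority}
```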
2. Database Operations
```python
async def test_cascade_delete_removes_subtasks(session):
    project = Project(name="Test Project")
    session.add(project)
    await session.commit()

    task = Task(title="Test Task", project_id=project.id)
    session.add(task)
    await session.commit()

    await session.delete(project)
    await session.commit()

    result = await session.get(Task, task.id)
    assert result is None  # Cascade deleted
```
Output:
```
tests/test_models.py::test_cascade_delete_removes_subtasks PASSED
```
This test verifies your SQLModel relationships work correctly. No LLM involved.
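
The cascade the test relies on has to be declared somewhere. A sketch of how the models might do that with SQLModel (field names and the ORM-level cascade are assumptions; your schema may differ):

```python
from typing import Optional

from sqlmodel import Field, Relationship, SQLModel


class Project(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    # Deleting a Project also deletes its Tasks via the ORM-level cascade.
    tasks: list["Task"] = Relationship(
        back_populates="project",
        sa_relationship_kwargs={"cascade": "all, delete-orphan"},
    )


class Task(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    title: str
    project_id: Optional[int] = Field(default=None, foreign_key="project.id")
    project: Optional[Project] = Relationship(back_populates="tasks")
```

A database-level `ON DELETE CASCADE` on the foreign key is an equivalent approach; either way, the test above is what proves the behavior actually holds.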
3. Tool Function Logic
```python
def test_validate_input_rejects_injection():
    malicious = "'; DROP TABLE users; --"
    with pytest.raises(ValidationError):
        validate_input(malicious)
```
Output:
```
tests/test_tools.py::test_validate_input_rejects_injection PASSED
```
This test verifies your tool's input validation catches security threats. No LLM involved.
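
A minimal sketch of what `validate_input` could look like; the pattern list and the `ValidationError` class are illustrative only, and real defenses should rely on parameterized queries rather than string inspection:

```python
import re


class ValidationError(Exception):
    """Raised when user-supplied input looks unsafe."""


# Deliberately rough patterns for demonstration purposes.
_SUSPICIOUS = re.compile(r"(;|--|\bDROP\b|\bDELETE\b|\bUNION\b)", re.IGNORECASE)


def validate_input(value: str) -> str:
    if _SUSPICIOUS.search(value):
        raise ValidationError(f"Rejected suspicious input: {value!r}")
    return value
```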
4. Pipeline Flow with Mocked LLM
```python
@respx.mock
async def test_agent_creates_task_on_request(client):
    # Mock what OpenAI would return
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(200, json={
            "choices": [{
                "message": {
                    "tool_calls": [{
                        "function": {
                            "name": "create_task",
                            "arguments": '{"title": "Buy groceries"}'
                        }
                    }]
                }
            }]
        })
    )

    response = await client.post(
        "/api/agent/chat",
        json={"message": "Create a task to buy groceries"}
    )
    assert response.status_code == 200

    # Verify the task was actually created in the database
    tasks = await client.get("/api/tasks")
    assert any(t["title"] == "Buy groceries" for t in tasks.json())
```
Output:
```
tests/test_agent.py::test_agent_creates_task_on_request PASSED
```
This test verifies your entire pipeline—from API to LLM processing to database—works correctly. The LLM is mocked, so the test is fast and free.
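
For reference, the `client` fixture used in these examples can be wired to call the app in-process, so even the "integration" test above never opens a socket. A sketch assuming FastAPI, httpx's `ASGITransport`, and an async-capable pytest plugin (the `app.main` import path is an assumption):

```python
import httpx
import pytest

from app.main import app  # assumed location of your FastAPI application


@pytest.fixture
async def client():
    # ASGITransport dispatches requests directly to the ASGI app:
    # no server process, no network, no API keys.
    transport = httpx.ASGITransport(app=app)
    async with httpx.AsyncClient(transport=transport, base_url="http://test") as c:
        yield c
```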
What TDD Does NOT Test
These questions require Evals, not TDD:
| Question | Why It's an Eval |
|---|---|
| Is the agent's response helpful? | Requires human judgment or LLM-as-judge |
| Does the agent interpret ambiguous requests correctly? | Depends on LLM reasoning, not code logic |
| Are responses safe and appropriate? | Requires running against content filters |
| Does the agent stay on topic? | Measures LLM behavior, not code behavior |
| Is the explanation accurate and trustworthy? | Requires factuality evaluation against ground truth |
You'll learn to build evaluation systems for these questions in Chapter 47. For now, the key insight: don't try to TDD these qualities by mocking LLM responses.
When you mock the response, you're testing that your code handles that specific mock correctly. You're not testing that your prompt produces good outputs when fed to the real LLM.
The Cost Calculation: Why This Matters
Let's revisit that $1,847/month test suite.
Without mocking:
- 50 tests per run
- 2,000 tokens per test (request + response)
- $0.003 per 1K tokens (GPT-4o pricing)
- Cost per test: $0.006
- Cost per full run: $0.30
- Runs per day: 20 (CI/CD on every push)
- Monthly cost: $0.30 x 20 x 30 = $180
That seems manageable—until you consider:
- Multiple developers running tests locally
- Test runs during debugging (10x more frequent)
- Integration tests that make 5-10 LLM calls each
Realistic monthly cost: $1,000-2,000
With mocking:
- 50 tests per run
- Zero API calls
- Cost per test: $0.00
- Cost per full run: $0.00
- Monthly cost: $0.00
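
If you want to rerun this arithmetic against your own numbers, it fits in a few lines; the figures below are the same assumptions used above:

```python
def monthly_test_cost(tests_per_run: int, tokens_per_test: int,
                      price_per_1k_tokens: float, runs_per_day: int,
                      days_per_month: int = 30) -> float:
    cost_per_test = tokens_per_test / 1000 * price_per_1k_tokens
    return cost_per_test * tests_per_run * runs_per_day * days_per_month


# Unmocked CI: 50 tests x 2,000 tokens at $0.003/1K, 20 runs/day.
print(monthly_test_cost(50, 2000, 0.003, 20))  # 180.0
# Mocked suite: zero tokens per test, therefore zero dollars.
print(monthly_test_cost(50, 0, 0.003, 20))     # 0.0
```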
But cost isn't even the main benefit.
Speed comparison:
| Approach | Time per Test | Full Suite (50 tests) |
|---|---|---|
| Real LLM calls | 2-5 seconds | 2-4 minutes |
| Mocked responses | 10-50 milliseconds | <3 seconds |
When tests take 3 seconds, you run them constantly. After every change. Before every commit. During every code review.
When tests take 4 minutes, you run them once before merging and hope nothing broke.
The difference: catching bugs in 3 seconds versus discovering them in production.
Try With AI
Use your AI assistant to practice applying the TDD vs Evals framework.
Prompt 1: Categorize Agent Behaviors
I'm building a customer support agent that:
1. Creates support tickets in our database
2. Retrieves customer order history via API
3. Generates helpful responses to customer questions
4. Escalates complex issues to human agents
5. Validates customer identity before accessing accounts
Help me categorize each behavior: Should it be tested with TDD
(deterministic, mockable) or Evals (requires real LLM, measures
quality)? For each, explain your reasoning.
What you're learning: The skill of categorizing behaviors is foundational to building an effective test strategy. You need to know which tests to mock and which tests require real LLM calls before you can build either.
Prompt 2: Calculate Your Testing Cost
I'm planning tests for an agent with these characteristics:
- 30 API endpoints to test
- 10 database operations to validate
- 5 agent tools that need unit tests
- 3 integration flows that test the full pipeline
If I use real LLM calls for all tests:
- Average 1,500 tokens per call
- GPT-4o at $0.0025 per 1K tokens
- I run tests 15 times per day
- 22 working days per month
Calculate my monthly testing cost. Then calculate what I save by
mocking the LLM calls and only using real calls for the 3 integration
flows.
What you're learning: Quantifying costs makes the case for test infrastructure investment. When you can say "mocking saves us $X per month," you can justify the time spent learning these patterns.
Prompt 3: Apply to Your Domain
I'm building an agent for [describe your domain—e.g., "scheduling
medical appointments" or "processing insurance claims" or "tutoring
students in math"].
Based on the TDD vs Evals framework from this lesson, help me identify:
1. Three code behaviors I should test with TDD (deterministic,
mockable, tests code correctness)
2. Two quality aspects I should measure with Evals (probabilistic,
requires real LLM, measures reasoning quality)
For each, suggest what a test or eval might look like at a high level.
What you're learning: Applying abstract frameworks to concrete domains is how knowledge becomes skill. Your domain expertise combined with the TDD/Evals framework produces a test strategy tailored to your specific agent.
Reflect on Your Skill
If you created the agent-tdd skill in Lesson 0, let's verify and improve it.
Test Your Skill:
Using my agent-tdd skill, explain when to use TDD versus Evals for
testing an AI agent. Does my skill correctly distinguish:
- Deterministic tests from probabilistic evaluations?
- Code correctness from LLM reasoning quality?
- Zero-cost mocked tests from expensive real-call tests?
Identify Gaps:
After running that prompt, ask yourself:
- Does my skill include the comparison table from this lesson?
- Does it explain the cost implications of unmocked tests?
- Does it list what TDD tests versus what it doesn't test?
Improve Your Skill:
If you found gaps, update your skill:
My agent-tdd skill needs a clearer TDD vs Evals section. Add:
1. The six-dimension comparison table (Question, Nature, Output,
Tests, Speed, Cost)
2. A "What TDD Tests" section listing: endpoint behavior, DB
operations, tool logic, pipeline flow with mocked LLM
3. A "What TDD Does NOT Test" section: response quality,
interpretation accuracy, safety, topic adherence
4. A cost analysis showing why mocked tests enable rapid iteration
Your skill grows as you learn. By the end of this chapter, your agent-tdd skill will encode everything you've learned—a reusable asset for every future agent project.