From Reading to Specifying: What Is TDG?
Emma turns her laptop toward James. On the screen:
```python
def celsius_to_fahrenheit(celsius: float) -> float: ...
def test_freezing():
    assert celsius_to_fahrenheit(0.0) == 32.0
def test_boiling():
    assert celsius_to_fahrenheit(100.0) == 212.0
```
Two words you already know from Chapter 45: `def` means "define a function" and `assert` means "insist this is true; fail loudly if it is not." Two new symbols also appear: `-> float` and `...`. James works through them aloud in the dialogue below; Lesson 2 teaches them formally, one at a time.
"That is the whole specification," she says.
James counts the lines. "Five lines? At my old job, our project specs were thirty-page documents that nobody read."
Emma almost smiles. "How many of those specs got implemented exactly as written?"
James shrugs. "...Maybe none."
Emma taps the screen. "What do you think each line does?"
James leans forward. "Okay, let me work through this. The first line names the function and says it takes a float and returns a float. The three dots mean the body is empty, no implementation yet. And the two test functions say what the correct answers must be: zero Celsius is thirty-two Fahrenheit, a hundred Celsius is two-twelve."
Emma waits. "And then?"
"If the body is empty, you tell Claude Code to fill it in. It reads the types and the tests and writes the formula." He pauses. "That is SDD. But the specification is Python instead of English."
Emma nods. "Same method. Different language."
From English to Python
Think back to Chapter 16. You learned a three-step workflow called Spec-Driven Development (SDD):
- You describe what you want in plain English, in a markdown file
- AI builds it by reading your description and generating the result
- You check the result by reading what the AI produced and deciding if it matches what you asked for
You have been doing this in every chapter since. This chapter's version of the same workflow is called Test-Driven Generation (TDG). The only thing that changes is step 1, how you describe what you want:
| | SDD (Chapter 16) | TDG (This Chapter) |
|---|---|---|
| Step 1: You describe | English sentences in a markdown file | Python types and tests |
| Step 2: AI builds | AI implements | AI implements |
| Step 3: You check | You read and review | pytest + pyright check automatically, then you read |
Instead of writing an English sentence like "Make a function that converts Celsius to Fahrenheit and returns a decimal number," you write the same idea in Python:
```python
def celsius_to_fahrenheit(celsius: float) -> float: ...
def test_freezing():
    assert celsius_to_fahrenheit(0.0) == 32.0
def test_boiling():
    assert celsius_to_fahrenheit(100.0) == 212.0
```
Five lines. The function signature says what it does. The `-> float` says what it returns. The `...` (an "ellipsis": three dots) says the body is empty. The two tests say what the correct answers must be. That is the specification, more precise than any English paragraph could be.
You are not expected to write these five lines yet. This lesson is about understanding the loop: reading, not doing. Lesson 2 teaches the three new vocabulary words (`return`, `-> float`, `...`) one at a time, and you will write your first specification there. For now, read and follow along.
The TDG Loop
TDG follows five steps. You will use this exact loop in every lesson of this chapter, and in every chapter after this one.
Step 1: Specify. Write a function stub (the function's name, its types, and `...` where the body should be) plus two test assertions. A stub is a placeholder: it defines the function's shape but leaves the body empty for AI to fill in. This is your specification. It says what the function must do without saying how.
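As a sketch of what Step 1 produces, here is a hypothetical specification for a `triple` function (illustrative only; the worked example later in this lesson uses a different function):

```python
# Specification only: a stub plus two tests. No implementation yet.
def triple(n: int) -> int: ...

def test_triple_positive():
    assert triple(2) == 6

def test_triple_zero():
    assert triple(0) == 0
```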
Step 2: Check types. Run `uv run pyright`. The type checker must pass. If pyright reports an error, your specification has a type problem; fix it before going further. (This is like running a spell-check on your English spec.)
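Continuing the hypothetical `triple` sketch, here is the kind of specification mistake this step catches: a test that feeds the function a string when the stub promises an int. Pyright flags it before AI ever sees the spec:

```python
def triple(n: int) -> int: ...

def test_triple_bad():
    # pyright reports, roughly: Argument of type "str" cannot be
    # assigned to parameter "n" of type "int"
    assert triple("2") == 6
```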
Step 3: Generate. Tell Claude Code: "Implement the function that passes these tests. Do not modify the tests." AI reads your types and tests, then writes the function body.
Step 4: Verify. Run `uv run pytest -v`. The tests must pass. If they fail, go back to Step 3 and re-prompt. If they pass, go to Step 5.
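To make the FAIL branch concrete, suppose (hypothetically) AI's first attempt at the `triple` stub were `return n + 4`. The first test would pass by coincidence (2 + 4 = 6), but the second would catch the bug. The output would look roughly like this:

```
$ uv run pytest -v
tests/test_triple.py::test_triple_positive PASSED
tests/test_triple.py::test_triple_zero FAILED

    def test_triple_zero():
>       assert triple(0) == 0
E       assert 4 == 0

1 failed, 1 passed
```

This is one reason each specification carries two assertions: a second, different test case catches implementations that pass the first one by accident.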
Step 5: Read. Apply PRIMM from Chapter 45. Read the generated code. Predict what it does for a new input. Build a trace table (the step-by-step prediction technique from Chapter 45) if the logic is not obvious. This is where your Chapter 45 skills earn their keep: you do not trust AI output; you verify it with your eyes.
```
┌────────────┐     ┌───────────┐     ┌─────────────┐     ┌───────────┐     ┌─────────┐
│ 1. Specify │────▶│ 2. Check  │────▶│ 3. Generate │────▶│ 4. Verify │────▶│ 5. Read │
│ stub+tests │     │   types   │     │    (AI)     │     │  pytest   │     │  PRIMM  │
└────────────┘     └───────────┘     └─────────────┘     └───────────┘     └─────────┘
                                            ▲                  │
                                            └──── if FAIL ─────┘
```
That is the entire method. Five steps. The same loop, every time.
A Complete Worked Example (Read, Don't Do)
Let us walk through one full TDG cycle so you can see how every step works. You do not need to type anything; just read and predict.
Step 1: Specify
The function: `double`. It takes an integer and returns an integer that is twice the input.
```python
# smartnotes/math_helpers.py
def double(n: int) -> int: ...
```

```python
# tests/test_math_helpers.py
from smartnotes.math_helpers import double

def test_double_positive():
    assert double(5) == 10

def test_double_zero():
    assert double(0) == 0
```
Five lines of specification (the stub plus two tests). The import line tells Python where to find the function: "go to the file `smartnotes/math_helpers.py` and bring in the `double` function so the tests can use it." You saw imports in Chapter 44 when setting up SmartNotes. You do not need to memorize this pattern; each lesson provides the import line for you to copy. The key point: this line connects your test file to your function file.
Step 2: Check Types
```
$ uv run pyright
0 errors, 0 warnings, 0 informations
```
Pyright passes. Why? Because `...` (the ellipsis) makes the function a stub: pyright trusts the type annotations and ignores the missing body. The types are consistent: `n: int` goes in, `-> int` comes out. No type errors.
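For contrast, a sketch of what happens without the ellipsis: a body of `pass` implicitly returns None, so pyright would no longer treat the function as a stub and would flag the missing return value:

```python
def double(n: int) -> int:
    # pyright reports, roughly: Function with declared return type "int"
    # must return value on all code paths
    pass
```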
Step 3: Generate
The prompt to Claude Code:
```
Implement the `double` function in smartnotes/math_helpers.py
so that all tests in tests/test_math_helpers.py pass.
Do not modify the tests.
```
The AI reads the stub and the tests, and replaces the `...` with:
```python
def double(n: int) -> int:
    return n * 2
```
One line of implementation. `return n * 2` means the function hands back `n` multiplied by 2. (You will learn what `return` does in Lesson 2.)
Step 4: Verify
```
$ uv run pytest tests/test_math_helpers.py -v
tests/test_math_helpers.py::test_double_positive PASSED
tests/test_math_helpers.py::test_double_zero PASSED

2 passed
```
Both tests pass. GREEN: the function does what the specification demanded.
Step 5: Read (PRIMM)
Now apply Chapter 45. Read the generated code: `return n * 2`.
Predict: What does `double(7)` return? Work it out: 7 * 2 = 14. The function returns 14.
Predict: What does `double(-3)` return? Work it out: -3 * 2 = -6. The function returns -6.
Does this match your understanding of "double"? Yes. Multiplying by 2 doubles a number, including negatives. The implementation is correct, and you know why it is correct because you read it.
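If you later want to confirm predictions like these yourself (optional; this lesson is read-only), a Python session started with `uv run python` from the project folder would look like this:

```python
>>> from smartnotes.math_helpers import double
>>> double(7)
14
>>> double(-3)
-6
```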
Before continuing, make sure you can answer these:
- In the worked example, what would `double(-5)` return?
- Which step of the TDG loop uses pytest?
- In which step do you apply PRIMM from Chapter 45?
Answers: -10 (because -5 * 2 = -10), Step 4 (Verify), and Step 5 (Read).
Your Specification Leverage
Count the lines. You wrote 5 lines of specification (1 stub line + 2 test definitions + 2 assertions). AI generated 1 line of implementation (the stub line was already yours; AI only filled in the body). The ratio is small here because the function is small. In later chapters, your specifications will be 5-10 lines and AI will generate 20-50 lines. The ratio grows, but the loop stays the same.
This is the core idea behind TDG: your 5 lines of specification leverage AI to produce correct, tested, type-checked code. You write the what. AI writes the how. Pytest and pyright verify the result. Your Chapter 45 skills let you read and understand what AI wrote.
If you already know TDD
If you have used Test-Driven Development before, TDG is the same discipline (Red-Green-Refactor) but AI writes the Green step. Your role shifts from implementation to verification. Kent Beck described TDD in Extreme Programming Explained (1999) and Test-Driven Development: By Example (2003). TDG is an emergent adaptation of that discipline for AI-assisted development, described by multiple practitioners since 2023.
If TDG is clear but the SDD connection feels abstract, that is normal. The connection will become concrete when you write your first test in Lesson 2. For now, the key takeaway is: you write the specification (types + tests), AI writes the implementation.
Why Tests Make AI Better
Research shows that tests improve AI output accuracy by roughly 9-46% compared to English-only specifications; the studies below supply the specifics.
Research: why tests improve AI accuracy
You might wonder: does writing tests actually improve what AI generates? Yes. Research confirms it.
- A study of 15 programmers found that giving AI a test suite improved code accuracy by roughly 46 percentage points compared to natural language alone. Programmers in the study also reported less cognitive load, because the tests did the explaining for them [Fakhoury et al., "LLM-Based Test-Driven Interactive Code Generation," IEEE TSE 2024].
- A separate study found that including tests alongside problem descriptions improved AI code generation by roughly 8-12% across standard benchmarks. The largest gains came from AI using failed tests to fix its own mistakes [Mathews and Nagappan, "Test-Driven Development and LLM-based Code Generation," ASE 2024].
- Anthropic, the company behind Claude Code, calls giving AI a way to verify its own output "the single highest-leverage thing you can do." Their best practices guide recommends providing tests, screenshots, or expected outputs so Claude can check itself [Anthropic, "Claude Code Best Practices," 2025].
The pattern is consistent: tests are better specifications than English. They are precise, unambiguous, and machine-verifiable. When you give AI a test suite, it generates better code than when you describe what you want in words.
A study testing TDG at the scale of entire code repositories found something surprising: a concise, diverse test suite can be more effective than a comprehensive one [Hu et al., "TENET: Leveraging Tests Beyond Validation for Code Generation," 2025, preprint (a research paper shared publicly before peer review)]. Two well-chosen tests produce better AI output than ten redundant tests. More is not always better; clarity beats quantity. This is why this chapter starts with two assertions per function, not twenty.
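To see "clarity beats quantity" in miniature, compare two hypothetical suites for the `double` function from the worked example:

```python
from smartnotes.math_helpers import double

# Diverse: two tests that probe different behavior
def test_double_positive():
    assert double(5) == 10  # the ordinary case

def test_double_zero():
    assert double(0) == 0   # edge case: catches "add instead of multiply" bugs

# Redundant: ten more tests of the same kind add little signal:
#   double(2) == 4, double(3) == 6, double(4) == 8, ...
```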
SDD to TDG: The Translation
If you completed Chapter 16, you already know the three-phase SDD workflow. TDG maps directly onto it:
| SDD Phase | What You Did in Ch 16 | What You Do in TDG |
|---|---|---|
| Specification | Wrote a markdown spec describing what you wanted | Write a function stub with types + test assertions |
| Implementation | AI generated files from your spec | AI generates the function body from your stub + tests |
| Verification | You reviewed the output against your spec | pytest checks automatically; you read the code with PRIMM |
The method is the same. The precision is higher. English specifications can be ambiguous: "convert the temperature" could mean Celsius to Fahrenheit or Fahrenheit to Celsius. A Python test that says `assert celsius_to_fahrenheit(0.0) == 32.0` cannot be misunderstood.
Try With AI
If Claude Code is not already running, open your terminal, navigate to your SmartNotes project folder, and type claude. You should see the Claude Code prompt ready for input. If you need a refresher, Chapter 44 covers the setup.
Once Claude Code is running, try these prompts. You are exploring, not building. Use them to see TDG in action before you do it yourself in Lesson 2.
Prompt 1: Check Your Understanding
```
I just read about Test-Driven Generation (TDG). Here is my
understanding: "TDG means you write tests first, AI writes
the code, and pytest verifies the result."

Is my summary accurate? What important detail am I missing?
```
Read AI's response. Did it mention the type-checking step (pyright)? Did it clarify that you also read the generated code, not just run the tests? Compare its corrections to the five-step loop from this lesson.
What you're learning: You are formulating your own understanding first, then using AI to identify gaps, the same pattern you will use when reviewing AI-generated code.
Prompt 2: Compare TDG to Just Asking AI
```
What is the difference between these two approaches?

Approach A: "Write me a function that converts Celsius to Fahrenheit."

Approach B: I write this first:

    def celsius_to_fahrenheit(celsius: float) -> float: ...
    assert celsius_to_fahrenheit(0.0) == 32.0
    assert celsius_to_fahrenheit(100.0) == 212.0

Then I say: "Implement the function that passes these tests."

Why does Approach B produce more reliable results? What can
go wrong with Approach A that cannot go wrong with Approach B?
```
What you're learning: You are understanding WHY specifications matter, not just how to write them. This is the conceptual argument for TDG over "just ask AI."
Prompt 3: Spot the Specification Bug
```
Here is a TDG specification:

    def double(n: int) -> int: ...

    def test_double():
        assert double(3) == 9

Is the specification correct? If not, what is wrong: the
function name, the test value, or both? What should it say?
```
Read AI's response. The test says `double(3) == 9`, but doubling 3 is 6, not 9. This is a specification error: the test defines the wrong "correct" answer. An implementation could satisfy the test (by returning 9 for an input of 3, tripling instead of doubling), but the function would not actually double anything.
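The corrected specification fixes the expected value, not the function name. A sketch:

```python
def double(n: int) -> int: ...

def test_double():
    assert double(3) == 6  # doubling 3 gives 6, not 9
```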
What you're learning: Specifications can be wrong too. The test is not automatically correct just because you wrote it. You need to verify your own tests against common sense, the same way you verify AI-generated code.
PRIMM-AI+ Practice: Understanding the TDG Loop
In Chapter 45 you used /predict and /investigate to read Python code. Here you use the same commands for a different job: verifying AI-generated code. The commands work the same way. The question changes from "what does this code print?" to "did AI implement my specification correctly?"
This lesson is primarily a reading lesson, but the PRIMM-AI+ Run step asks you to use Claude Code to see the loop in action. This is a brief, guided interaction, not a full hands-on exercise. You will do your first real TDG cycle in Lesson 2.
Predict [AI-FREE]
Press Shift+Tab to enter Plan Mode before predicting. In Plan Mode, Claude Code will discuss and analyze without making changes.
Read the following specification without running it. Predict: if AI implemented this function correctly, what would test_negate_positive check? What value would negate(4) need to return for the test to pass?
```python
def negate(n: int) -> int: ...

def test_negate_positive():
    assert negate(4) == -4

def test_negate_zero():
    assert negate(0) == 0
```
Write your prediction and a confidence score from 1-5 before continuing.
Check your prediction
`negate(4)` must return -4 for the test to pass. The function negates (flips the sign of) its input. `negate(0)` must return 0 because negating zero is still zero. If you predicted this correctly with confidence 4-5, your reading skills from Chapter 45 are working.
Run
Press Shift+Tab to exit Plan Mode. In Claude Code, type:
```
Create a file smartnotes/negate.py with this stub:

    def negate(n: int) -> int: ...

Create a file tests/test_negate.py with:

    from smartnotes.negate import negate

    def test_negate_positive():
        assert negate(4) == -4

    def test_negate_zero():
        assert negate(0) == 0

Then implement the negate function so all tests pass.
```
Run `uv run pytest tests/test_negate.py -v`. Compare the result to your prediction.
Investigate
In Claude Code, type `/investigate @smartnotes/negate.py` to examine the generated implementation. Write a one-sentence explanation of how it works. Common implementations: `return -n`, `return n * -1`, or `return 0 - n`. All three are correct (see the sketch below). Which did AI choose?
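For reference, here are the three bodies side by side; any one of them satisfies the specification:

```python
def negate(n: int) -> int:
    return -n  # unary minus: flips the sign directly

# Equivalent alternatives AI sometimes generates:
#   return n * -1  # multiply by negative one
#   return 0 - n   # subtract from zero
```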
Error Taxonomy: If the tests fail, classify the error. Is it a type error (wrong types), a logic error (wrong calculation), or a specification error (the tests themselves are wrong)? In this case, the specification is correct, so any failure is in AI's implementation.
Modify
Change the specification: add a third test that checks `negate(-7) == 7` (a sketch follows below). Predict: will the existing implementation pass this new test without changes? Run it and compare.
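The added test might look like this (the test name is a suggestion; any name starting with `test_` works):

```python
def test_negate_negative():
    assert negate(-7) == 7
```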
Make [Mastery Gate]
Without looking back at this lesson, explain the five steps of the TDG loop in your own words. Then explain: what is the difference between TDG and the SDD workflow from Chapter 16? If you can answer both questions, you have the conceptual foundation for the rest of this chapter.
Common Mistakes
| Mistake | What Goes Wrong | How to Avoid It |
|---|---|---|
| Skipping Step 5 (Read) | You trust the green bar without understanding the code; subtle bugs go undetected | Always predict output for a new input after GREEN |
| Writing tests after implementation | Tests become rubber stamps that confirm what AI wrote, instead of specifications that define what it should write | Write tests first. The specification defines "correct" |
| Confusing TDG with "just asking AI" | Without tests, you cannot verify whether AI's code is correct. You are guessing, not specifying | TDG = tests first, then AI generates. The order matters |
James looks at his notes. Five steps. He has not written a single line of code yet, but the method already makes sense. Five lines of specification. One loop. Every time.
"It is like a purchase order," he says. "At my old job, the best purchase orders were short. Item, quantity, delivery date, acceptance criteria. One page. The supplier figured out how to source it and ship it. I did not tell them which warehouse to pull from or which truck to load."
Emma stops in the doorway. "That is exactly right. Your specification says what the function must do. The AI figures out how. The tests are your acceptance criteria: if the delivery does not match, you reject it."
James grins. "Five lines of purchase order. One line of delivery."
"And the ratio gets better," Emma says. "Right now it is five to one because the function is tiny. Later, your specifications stay around five to ten lines while AI generates twenty, thirty, fifty. The purchase order stays short. The delivery gets bigger."
She pauses, then adds: "I think the research on test accuracy is solid, that nine to forty-six percent improvement. But I am not entirely sure how well those numbers transfer to beginners versus experienced developers. The studies mostly tested professionals. Your mileage may vary."
"Noted," James says. "So what is next?"
"Lesson 2. Three new Python words: return, the arrow annotation, and the ellipsis. Then you write your first real specification and see your first RED test, failing on purpose, because the body is still empty."