AI Generates, You Verify
Emma stands up. "You have the stub. You have the tests. Pyright passes. Pytest fails. You know the next step." She picks up her coffee. "I will be back in ten minutes."
James watches her leave. He looks at his terminal. Two red failures. The stub with three dots where the body should be. He knows the next step: prompt Claude Code to fill in the dots.
He types the prompt. Claude Code generates one line. He stares at it. return celsius * 9 / 5 + 32. He recognizes the formula from school: multiply by nine-fifths, add thirty-two. He runs the tests.
2 passed
Green. Both tests pass. James sits back. Five lines of specification. One line of implementation. And it works.
When Emma comes back, James shows her the terminal. "Green," he says. "Both tests pass. We are done, right?"
Emma sets down her coffee. "What does the generated code do?"
"It converts Celsius to Fahrenheit. The tests prove it."
Emma crosses her arms. "The tests prove it returns the right number for zero and a hundred. What about every other number?"
James hesitates. "Okay, let me look at the actual line." He reads it again. "It multiplies celsius by 9, divides by 5, adds 32. That is the formula."
Emma leans against the desk. "And if I give it minus forty?"
James does the math in his head:
- -40 x 9 = -360
- -360 / 5 = -72
- -72 + 32 = -40
"Minus forty. The same number. That is weird."
"That is the crossover point. Celsius and Fahrenheit meet at minus forty." She glances at the green test output. "The tests do not check that. Should they?"
James adds a third test before Emma finishes her coffee.
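The third test he adds might look like this (one plausible version; the test name is a choice, not a requirement):

```python
def test_crossover_point():
    # -40 is where the Celsius and Fahrenheit scales meet
    assert celsius_to_fahrenheit(-40.0) == -40.0
```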
Step 3: Generate
In Lesson 2, you wrote a function stub and two tests. Pyright passed (types are valid). Pytest failed (no implementation). You are at the RED stage, exactly where the TDG loop says you should be.
Now you move to Step 3: ask Claude Code to write the implementation.
Open Claude Code in your SmartNotes project and use this prompt:
Implement the celsius_to_fahrenheit function in
smartnotes/temperature.py so that all tests in
tests/test_temperature.py pass. Do not modify the tests.
That last sentence, "Do not modify the tests," is important. AI sometimes tries to change the tests instead of fixing the implementation. Your tests are the specification. They define what "correct" means. The implementation must match them, not the other way around.
Before prompting AI, save your test file in Git. A commit is like saving a snapshot of your file. If anything changes later, you can always go back to this snapshot. In Claude Code, type:
Commit the file tests/test_temperature.py with the message
"test: add celsius_to_fahrenheit specification"
Claude Code will run the Git commands for you. To verify the commit succeeded, type "show me my recent commits". You should see your test file in the latest entry. If AI later modifies your tests despite your instruction, you can ask Claude Code "show me what changed in my test file" to see the difference. Your specification is sacred; protect it.
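If you prefer to run Git yourself instead of asking Claude Code, the equivalent commands look roughly like this (standard Git, shown for reference only):

```
git add tests/test_temperature.py
git commit -m "test: add celsius_to_fahrenheit specification"
git log --oneline -3                # verify the commit landed
git diff tests/test_temperature.py  # see any later changes to the specification
```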
What AI Generates
The AI reads your stub (the function signature and types) and your tests (the expected input-output pairs). It replaces the ... with:
```python
def celsius_to_fahrenheit(celsius: float) -> float:
    return celsius * 9 / 5 + 32
```
One line of implementation. The formula: multiply by 9, divide by 5, add 32. This is the standard Celsius-to-Fahrenheit conversion formula: F = C x 9/5 + 32.
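Before moving to Step 4, you can sanity-check the formula yourself with a quick one-liner (an optional convenience, not part of the loop):

```
$ uv run python -c "print(0.0 * 9 / 5 + 32, 100.0 * 9 / 5 + 32)"
32.0 212.0
```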
Step 4: Verify
Run both tools:
```
$ uv run pyright
0 errors, 0 warnings, 0 informations

$ uv run pytest tests/test_temperature.py -v
tests/test_temperature.py::test_freezing_point PASSED
tests/test_temperature.py::test_boiling_point PASSED

2 passed
```
GREEN. Both tests pass. The type checker is clean. Your specification demanded that celsius_to_fahrenheit(0.0) return 32.0 and that celsius_to_fahrenheit(100.0) return 212.0. The implementation delivers both.
Step 5: Read (PRIMM)
The tests pass, but you are not done. Step 5 says: read the generated code. Apply PRIMM from Chapter 45. Do not just trust the green bar. Understand how the implementation works.
Predict
Look at the generated line: return celsius * 9 / 5 + 32
Predict: What does celsius_to_fahrenheit(37.0) return? (37°C is normal human body temperature.)
Work it out by hand:
- 37.0 x 9 = 333.0
- 333.0 / 5 = 66.6
- 66.6 + 32 = 98.6
Your prediction: 98.6. That is the well-known body temperature in Fahrenheit. The formula checks out.
Predict: What does celsius_to_fahrenheit(-40.0) return?
Work it out:
- -40.0 x 9 = -360.0
- -360.0 / 5 = -72.0
- -72.0 + 32 = -40.0
Your prediction: -40.0. The same number. This is the crossover point where Celsius and Fahrenheit are equal. When your predictions match your domain knowledge, you have strong evidence that the implementation uses the real formula rather than values matched to your tests.
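You can confirm both predictions against the real function (assuming the module path used earlier in this lesson):

```
$ uv run python -c "from smartnotes.temperature import celsius_to_fahrenheit; print(celsius_to_fahrenheit(37.0), celsius_to_fahrenheit(-40.0))"
98.6 -40.0
```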
Trace Table
Build a trace table for celsius_to_fahrenheit(100.0):
| Expression | Value | How |
|---|---|---|
| `celsius` | 100.0 | Function parameter |
| `celsius * 9` | 900.0 | 100.0 x 9 |
| `celsius * 9 / 5` | 180.0 | 900.0 / 5 |
| `celsius * 9 / 5 + 32` | 212.0 | 180.0 + 32 |
| return value | 212.0 | Handed back to caller |
The trace confirms: the function returns 212.0 for input 100.0. This matches the test assertion assert celsius_to_fahrenheit(100.0) == 212.0.
Before continuing, verify your understanding:
- In the trace table, what is the value of `celsius * 9` when `celsius` is `100.0`?
- Why do we predict output for inputs NOT in our original tests?
- What is the crossover point where Celsius equals Fahrenheit?
Answers: 900.0, because tests only check the cases we wrote (reading checks the logic for all inputs), and -40.
The Ratio Starts Small
Count what happened:
- You wrote: 6 lines (1 stub + 1 import + 2 test functions with 2 assertions)
- AI wrote: 1 line (the return statement with the formula)
- Total working code: 7 lines, fully tested, type-checked
This is a tiny example. The ratio is 6:1; you wrote more than AI. But the ratio flips as functions get more complex. In later chapters, your specification will stay around 5-10 lines while AI generates 20, 30, or 50 lines of implementation. Your leverage grows. The loop stays the same.
The Trust Gap
Why read the generated code at all? The tests pass. Pyright is clean. Why not just move on?
Because tests only check the cases you wrote. Your two tests check 0°C and 100°C. They do not check -40°C, 37°C, or 1000°C. A function could pass both tests but still be wrong for other inputs. For example, AI could write if celsius == 0: return 32.0 and elif celsius == 100: return 212.0 with no general formula. That would pass both tests but fail for every other input. The term for writing specific values directly into code instead of using a formula is hardcoding, and it is one of the most common AI mistakes.
This is the trust gap: the distance between "the tests pass" and "the code is correct for all inputs." In the Stack Overflow 2025 developer survey, 66% of developers cite "solutions that are almost right, but not quite" as a frustration with AI tools. The code compiles, the obvious tests pass, but there is a subtle flaw you do not notice until production (the live version of the software that real users rely on).
Reading the generated code is how you close the trust gap. The tests verify the cases you specified. Your PRIMM reading verifies the logic. Together, they give you confidence that the function works for all inputs, not just the two you tested.
The "Do not modify the tests" rule is the TDG equivalent of treating tests as a contract. In professional workflows, the specification (contract) is owned by the consumer, not the producer. If you have worked with consumer-driven contract testing (Pact, Spring Cloud Contract), the discipline is the same: the tests define the expected behavior, and the implementation must conform. AI is the producer. You are the consumer. Protect the contract.
Sometimes AI generates code using Python features not covered in this chapter, such as list comprehensions, conditional expressions, or helper functions. If the generated code looks unfamiliar, that is not a problem with your reading skills. In Claude Code, type: "Rewrite this using only basic arithmetic and a return statement." You will get a simpler version you can read and verify with PRIMM.
Your Independent TDG Cycle
You just completed a full TDG cycle with guidance. Now do one independently.
Your function: fahrenheit_to_celsius, the inverse of what you just built.
The formula: C = (F - 32) x 5/9
Known values:
- 32°F = 0°C (freezing point)
- 212°F = 100°C (boiling point)
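You can check the inverse formula against the boiling point by hand before writing any code:

- 212 - 32 = 180
- 180 x 5 = 900
- 900 / 9 = 100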
Your steps:
1. Add the stub to the same `smartnotes/temperature.py` file where `celsius_to_fahrenheit` lives:

   ```python
   def fahrenheit_to_celsius(fahrenheit: float) -> float:
       ...
   ```

2. Add the import and two tests to the same `tests/test_temperature.py` file. Update your existing import line to include both functions, then add the test functions below your existing tests. Your file should look like this when you are done:

   ```python
   # Update your import line to include both functions
   from smartnotes.temperature import celsius_to_fahrenheit, fahrenheit_to_celsius

   # These tests are already in your file (do not re-type them)
   def test_freezing_point():
       assert celsius_to_fahrenheit(0.0) == 32.0

   def test_boiling_point():
       assert celsius_to_fahrenheit(100.0) == 212.0

   # Add everything below this line
   def test_freezing_f_to_c():
       assert fahrenheit_to_celsius(32.0) == 0.0

   def test_boiling_f_to_c():
       assert fahrenheit_to_celsius(212.0) == 100.0
   ```

3. Run `uv run pyright`. It should show `0 errors, 0 warnings, 0 informations`.

4. Run `uv run pytest tests/test_temperature.py -v`. You should see RED:

   ```
   test_freezing_f_to_c FAILED
   test_boiling_f_to_c FAILED
   ```

5. Prompt Claude Code: "Implement fahrenheit_to_celsius so all tests pass. Do not modify the tests."

6. Run `uv run pytest tests/test_temperature.py -v` again. You should see GREEN:

   ```
   test_freezing_f_to_c PASSED
   test_boiling_f_to_c PASSED
   ```

7. Read the generated code (a sketch of the likely result follows this list). Predict: what does `fahrenheit_to_celsius(98.6)` return? (Hint: normal body temperature in Celsius is a familiar number.)
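For reference, AI will most likely generate a single line applying the inverse formula. A sketch of one plausible result (your generated code may differ in detail):

```python
def fahrenheit_to_celsius(fahrenheit: float) -> float:
    # Inverse of the conversion: C = (F - 32) x 5/9
    return (fahrenheit - 32) * 5 / 9
```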
If you completed all seven steps and your prediction in step 7 was correct, you have done a full TDG cycle independently. That is the mastery gate for this lesson.
Try With AI
Prompt 1: The Trust Gap in Practice
I have a function that passes both of its tests. But I only
wrote two tests. How many inputs exist that I have NOT tested?
Is a function that passes two tests "correct" or just
"correct for two inputs"? What is the difference?
What you're learning: The trust gap between "tests pass" and "code is correct for all inputs." Two tests prove two points. They do not prove every point. This is why Step 5 (Read) exists: reading the generated code catches patterns that a small number of tests cannot.
Prompt 2: Edge Case Discovery
I have a celsius_to_fahrenheit function with these tests:
- celsius_to_fahrenheit(0.0) == 32.0
- celsius_to_fahrenheit(100.0) == 212.0
What are three additional test cases I should consider?
Explain why each one is a useful edge case.
Read the suggestions. Do they include -40 degrees (the crossover point where Celsius and Fahrenheit are equal)? Negative values? Very large numbers? Evaluate whether each suggestion adds real value or is redundant.
What you're learning: You are reviewing AI-generated test suggestions, evaluating specification quality, not just code quality. This is a different skill from reading code: you are asking "did I test enough?" not "is the code correct?"
Prompt 3: What Could a Hardcoded Solution Look Like?
Suppose AI generated this implementation for celsius_to_fahrenheit:
```python
def celsius_to_fahrenheit(celsius: float) -> float:
    if celsius == 0.0:
        return 32.0
    if celsius == 100.0:
        return 212.0
    return 0.0
```
Would my two tests pass? Is this implementation correct?
What does this tell me about the limits of testing?
What you're learning: A hardcoded implementation can pass every test you wrote and still be completely wrong for every other input. This is the strongest argument for Step 5 (Read): after GREEN, you must read the generated code and ask "does this implementation use a real formula, or is it just matching my test values?"
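To see the gap concretely, here is a small self-contained demonstration (a hypothetical scratch file, not part of SmartNotes) showing that the hardcoded version satisfies both original assertions yet fails for an untested input:

```python
def hardcoded_c_to_f(celsius: float) -> float:
    # Fake implementation that only "knows" the two tested values
    if celsius == 0.0:
        return 32.0
    if celsius == 100.0:
        return 212.0
    return 0.0

assert hardcoded_c_to_f(0.0) == 32.0      # passes, matches test_freezing_point
assert hardcoded_c_to_f(100.0) == 212.0   # passes, matches test_boiling_point
print(hardcoded_c_to_f(37.0))             # prints 0.0, not 98.6: wrong for every untested input
```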
PRIMM-AI+ Practice: Verifying AI Output
Predict [AI-FREE]
Press Shift+Tab to enter Plan Mode before predicting.
Suppose AI generated this implementation for a function called km_to_miles:
```python
def km_to_miles(km: float) -> float:
    return km * 0.621371
```
Predict: what does km_to_miles(10.0) return? What does km_to_miles(0.0) return? Write your predictions and a confidence score from 1-5.
Check your prediction
km_to_miles(10.0) returns 6.21371. The calculation: 10.0 x 0.621371 = 6.21371.
km_to_miles(0.0) returns 0.0. The calculation: 0.0 x 0.621371 = 0.0.
If you predicted both correctly, you can read and evaluate simple AI-generated code.
Run
Press Shift+Tab to exit Plan Mode. Create the stub and tests for km_to_miles. Use the test values from your prediction. Run the full TDG cycle: pyright, pytest (RED), prompt, pytest (GREEN).
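A sketch of what the stub and tests could look like, assuming the function lives in `smartnotes/km_to_miles.py` as the Investigate step below implies (file and test names here are suggestions, not requirements):

```python
# smartnotes/km_to_miles.py -- the stub you write
def km_to_miles(km: float) -> float:
    ...
```

```python
# tests/test_km_to_miles.py -- the specification
from smartnotes.km_to_miles import km_to_miles

def test_ten_km():
    assert km_to_miles(10.0) == 6.21371

def test_zero_km():
    assert km_to_miles(0.0) == 0.0
```

Note that the exact-equality assertion in `test_ten_km` may trip over floating-point precision; the Error Taxonomy note below covers that case.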
Investigate
In Claude Code, type /investigate @smartnotes/km_to_miles.py to examine the generated implementation. Is the conversion factor accurate? (The actual factor is approximately 0.621371. Check whether AI used this value or a rounded version like 0.62.) If AI rounded, is the rounding acceptable for your tests?
Error Taxonomy: If the tests fail because of floating-point precision (e.g., 6.21371 vs 6.213710000000001), classify this as a precision error, a subtype of logic error (not one of the five named categories in Chapter 43) caused by how computers store decimal numbers. In later chapters, you will learn to use pytest.approx() for these cases.
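As a preview, the approx-based version of that assertion looks like this (pytest's real `approx` helper, shown early for reference; the import path assumes the sketch above):

```python
import pytest
from smartnotes.km_to_miles import km_to_miles  # hypothetical module path

def test_ten_km_approx():
    # approx tolerates tiny floating-point differences (default relative tolerance: 1e-6)
    assert km_to_miles(10.0) == pytest.approx(6.21371)
```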
Modify
Add a third test: km_to_miles(42.195), the marathon distance. To get the exact expected value, run uv run python -c "print(42.195 * 0.621371)" in your terminal rather than rounding by hand. Predict: will the existing implementation pass this new test? Run it.
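One way to write the marathon test is to paste the printed value into the assertion. Another is to compute the expected value with the same expression, as in this sketch (exact equality holds only if the implementation also computes km * 0.621371 directly):

```python
def test_marathon_distance():
    # 42.195 km is the marathon distance; expected value uses the same factor
    assert km_to_miles(42.195) == 42.195 * 0.621371
```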
Make [Mastery Gate]
Complete one full TDG cycle for a function you choose, from stub to GREEN to reading the generated code. Pick a simple unit conversion or a SmartNotes domain function. Your cycle must include all five steps: specify, check types, generate, verify, read. If you can do this without looking back at the lesson instructions, you own the TDG loop.
Common Mistakes
| Mistake | What Goes Wrong | How to Avoid It |
|---|---|---|
| Letting AI modify your tests | Your specification changes to match AI's output; you lose the definition of "correct" | Always say "Do not modify the tests" and commit tests before prompting |
| Skipping PRIMM after GREEN | A hardcoded implementation could pass your two tests but fail for all other inputs | Predict output for a new input not in your tests, then verify |
| Not committing tests before prompting | If AI modifies your test file, you have no way to recover the original specification | Commit your test file in Git before every AI prompt |
James commits the green tests. Two functions. Two TDG cycles. Both green.
He catches himself thinking about what function to specify next, not what formula to write, but what specification to design. The shift happened somewhere between the first stub and the second green bar.
He mentions it to Emma later. "It is like learning to write purchase orders instead of building furniture," he says. "You describe the result, someone else builds it, and you inspect what arrives."
Emma pauses. "That is a better analogy than mine." She tilts her head. "The loop is a circle: specify, check, generate, verify, read. When it works, you stop seeing the circle. You just see the function."