
Axiom VII: Tests Are the Specification

You ask your AI assistant to write a function that calculates shipping costs. It returns clean, well-documented code. The function handles domestic orders perfectly. You deploy it. Three days later, customer support floods with complaints: international orders are charged zero shipping. The function looked correct. It ran without errors. It even had a docstring explaining what it did. But nobody defined what "correct" actually meant for international orders, so the AI made a reasonable assumption that happened to be wrong.

Now imagine an alternative: before asking the AI for any implementation, you write five tests. One test asserts domestic orders get standard rates. Another asserts international orders get a surcharge. A third tests free shipping thresholds. A fourth tests invalid inputs. A fifth tests boundary conditions. You hand these tests to the AI and say: "Write the implementation that passes all five." The AI generates code. You run the tests. Two fail. You tell the AI: "Tests 3 and 5 are failing. Fix the implementation." It regenerates. All five pass. You accept the code.

The difference is not that you tested after the fact. The difference is that your tests were the specification. They defined correctness before any implementation existed. The implementation was generated to match the specification, not the other way around.

The Problem Without This Axiom

When you skip tests-first development with AI, you fall into a predictable failure pattern:

You describe what you want in natural language. "Write a function that calculates shipping costs based on weight, destination, and order total." This feels precise, but it is ambiguous. What are the weight brackets? What counts as "international"? What is the free shipping threshold? Does it return a float, a Decimal, or an integer in cents?

The AI fills in the gaps with assumptions. It picks reasonable defaults. Weight brackets at 1kg, 5kg, 10kg. International means non-US. Free shipping above $50. Returns a float. Each assumption is plausible. Some are wrong for your business.

You verify by reading the code. You scan the implementation, check the logic, and convince yourself it looks right. But reading code is not the same as running it. Your eyes skip edge cases. You miss the off-by-one error at the 5kg boundary. You overlook the case where destination is None.

Bugs appear in production. The code that "looked right" fails on real data. Now you are debugging generated code you did not write, trying to understand the AI's assumptions, fixing issues that would never have existed if correctness had been defined upfront.

This pattern is not unique to AI. It is the oldest problem in software development: ambiguous specifications produce correct-looking code that does the wrong thing. But AI amplifies the problem because it generates plausible code faster than you can verify it by reading.

The Axiom Defined

Test-Driven Generation (TDG): Write tests FIRST that define correct behavior, then prompt AI: "Write the implementation that passes these tests." Tests are the specification. The implementation is disposable.

This axiom transforms tests from a verification tool into a specification language. Tests are not something you write after the code to check it works. Tests are the precise, executable definition of what "works" means.

Three consequences follow:

  1. Tests are permanent. Implementations are disposable. If the AI generates bad code, you do not debug it. You throw it away and regenerate. The tests remain unchanged because they define the requirement, not the solution.

  2. Tests are precise where natural language is ambiguous. "Calculate shipping costs" is vague. assert calculate_shipping(weight_kg=5.0, destination="UK", order_total=45.99) == 12.50 is unambiguous. The test says exactly what the function must return for exactly those inputs.

  3. Tests enable parallel generation. You can ask the AI to generate ten different implementations. Run all ten against your tests. Keep the one that passes. This is selection, not debugging.
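If you want to make that selection step concrete, a minimal sketch might look like the following. The candidate filenames and the test file name are placeholders, not part of any real tooling:

```python
# Selection, not debugging: try each candidate implementation against the
# same specification and keep the first one that passes.
import shutil
import subprocess

CANDIDATES = ["candidate_1.py", "candidate_2.py", "candidate_3.py"]  # AI-generated variants

for candidate in CANDIDATES:
    shutil.copy(candidate, "shipping.py")                    # swap this implementation in
    result = subprocess.run(["pytest", "test_shipping.py", "-q"])
    if result.returncode == 0:
        print(f"{candidate} passes the specification -- keeping it")
        break
else:
    print("No candidate passed; regenerate and try again")
```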

From Principle to Axiom

In Chapter 4, you learned Principle 3: Verification as Core Step. That principle taught you to verify every action an agent takes, to never trust output without checking it, and to build verification into your workflow rather than treating it as optional cleanup.

Axiom VII takes that principle and sharpens it into a specific practice:

| Principle 3 | Axiom VII |
| --- | --- |
| Verify that actions succeeded | Define what "success" means before the action |
| Check work after it is done | Specify correct behavior before generation |
| Verification is reactive | Specification is proactive |
| "Did this work?" | "What does working look like?" |
| Catches errors | Prevents errors from being accepted |

The principle says: always verify. The axiom says: design through verification. Write the verification first, and it becomes the specification that guides generation.

This distinction matters in practice. A developer who follows Principle 3 might generate code, then write tests to check it. A developer who follows Axiom VII writes tests first, then generates code that must pass them. The first developer is verifying. The second developer is specifying.

TDG: The AI-Era Testing Workflow

Test-Driven Generation adapts the classic TDD cycle for AI-powered development. Here is how the two compare:

TDD (Traditional)

Write failing test → Write implementation → Refactor → Repeat

In TDD, you write both the test and the implementation yourself. The test guides your implementation decisions. Refactoring improves code quality while keeping tests green.

TDG (AI-Era)

Write failing test → Prompt AI with test + types → Run tests → Accept or Regenerate

In TDG, you write the test yourself but the AI generates the implementation. If tests fail, you do not debug. You regenerate. The implementation is disposable because you can always get another one. The test is permanent because it encodes your requirements.

The TDG Workflow in Detail

Step 1: Write Failing Tests

Define what correct behavior looks like. Be specific about inputs, outputs, edge cases, and error conditions:

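A specification for the shipping example might look like the sketch below. The signature matches Step 2; the bracket prices, the $5.00 international surcharge, the $75 domestic free-shipping threshold, and the error messages are illustrative stand-ins for your real business rules:

```python
# test_shipping.py -- a sketch; the rates, thresholds, and messages are illustrative.
import pytest
from shipping import calculate_shipping

# Weight brackets (domestic baseline rates)
def test_domestic_under_1kg():
    assert calculate_shipping(weight_kg=0.5, destination="US", order_total=20.00) == 4.99

def test_domestic_1_to_5kg():
    assert calculate_shipping(weight_kg=3.0, destination="US", order_total=20.00) == 7.50

def test_domestic_over_5kg():
    assert calculate_shipping(weight_kg=7.5, destination="US", order_total=20.00) == 12.00

# International surcharge
def test_international_surcharge():
    assert calculate_shipping(weight_kg=5.0, destination="UK", order_total=45.99) == 12.50

# Free shipping threshold (domestic only, strictly above 75.00)
def test_domestic_free_shipping_above_threshold():
    assert calculate_shipping(weight_kg=3.0, destination="US", order_total=80.00) == 0.00

def test_domestic_no_free_shipping_at_threshold():
    assert calculate_shipping(weight_kg=3.0, destination="US", order_total=75.00) == 7.50

# Boundary: exactly 1kg falls in the 1-5kg bracket
def test_boundary_exactly_1kg():
    assert calculate_shipping(weight_kg=1.0, destination="US", order_total=20.00) == 7.50

# International orders never qualify for free shipping
def test_international_no_free_shipping():
    assert calculate_shipping(weight_kg=3.0, destination="UK", order_total=80.00) > 0.00

def test_international_heavy_no_free_shipping():
    assert calculate_shipping(weight_kg=7.5, destination="UK", order_total=120.00) > 0.00

# Error conditions
def test_negative_weight_raises():
    with pytest.raises(ValueError, match="weight must be positive"):
        calculate_shipping(weight_kg=-1.0, destination="US", order_total=20.00)

def test_blank_destination_raises():
    with pytest.raises(ValueError, match="destination is required"):
        calculate_shipping(weight_kg=1.0, destination="", order_total=20.00)
```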

Notice what these tests accomplish: they define the weight brackets (under 1kg, 1-5kg, over 5kg), the international surcharge amount, the free shipping threshold, and all error conditions. Someone reading these tests knows exactly what the function must do without seeing any implementation.

Step 2: Prompt AI with Tests + Types

Give the AI your tests and any type annotations that constrain the solution:

Here are my pytest tests for a shipping calculator (see test_shipping.py above).

Write the implementation in shipping.py that passes all these tests.

Constraints:
- Function signature: calculate_shipping(weight_kg: float, destination: str, order_total: float) -> float
- Use only standard library
- Raise ValueError for invalid inputs with the exact messages tested

Step 3: Run Tests on AI Output

pytest test_shipping.py -v

If all tests pass, the implementation matches your specification. If some fail, you have two options: regenerate the entire implementation, or show the AI the failing tests and ask it to fix only those cases.

Step 4: Accept or Regenerate

If tests pass: accept the implementation. It conforms to your specification. You do not need to read it line by line (though you may want to check for obvious inefficiencies).

If tests fail: do not debug the generated code. Tell the AI which tests fail and ask for a new implementation. The tests are right. The implementation is wrong. Regenerate.

Tests 8 and 9 are failing. International orders should NOT get free shipping
even when order_total exceeds 75.00. Fix the implementation.

This is the power of TDG: you never argue with the AI about correctness. The tests define correctness. Either the code passes or it does not.

Writing Effective Specifications (Tests)

Good TDG tests are specifications, not implementation checks. The distinction is critical.

Specify Behavior, Not Implementation

A behavior specification says what the function must do:

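For example (the cheapest_product function and catalog module are hypothetical):

```python
# Behavior test: asserts only on the observable result for a given input.
from catalog import cheapest_product

def test_returns_the_cheapest_product():
    products = [
        {"name": "mug", "price": 12.00},
        {"name": "pen", "price": 2.50},
        {"name": "lamp", "price": 30.00},
    ]
    assert cheapest_product(products)["name"] == "pen"
```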

An implementation check says how the function must work:

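By contrast, something like the following couples the test to the internals. It assumes the implementation happens to expose a private helper named _sort_by_price, which is exactly the problem:

```python
# Implementation-coupled test: asserts on HOW the answer is produced.
from unittest.mock import patch
import catalog

def test_cheapest_product_sorts_the_list():
    products = [
        {"name": "mug", "price": 12.00},
        {"name": "pen", "price": 2.50},
    ]
    # Spies on a private helper -- breaks if the AI uses min() or a heap instead.
    with patch.object(catalog, "_sort_by_price", wraps=catalog._sort_by_price) as spy:
        catalog.cheapest_product(products)
        spy.assert_called_once()
```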

The first test remains valid whether the function uses sorting, a heap, or a linear scan. The second test breaks if you refactor the internals, even if behavior is preserved. In TDG, implementation-coupled tests are especially harmful because they prevent the AI from choosing the best approach.

Use pytest Fixtures for Shared State

When multiple tests need the same setup, use fixtures to keep tests focused on assertions:

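For instance (Cart and apply_discount are hypothetical names; the 10% discount above $50 is an illustrative rule):

```python
import pytest
from cart import Cart, apply_discount

@pytest.fixture
def loaded_cart():
    """A cart with three items totalling 60.00, shared by every test below."""
    cart = Cart()
    cart.add_item("book", price=25.00)
    cart.add_item("mug", price=15.00)
    cart.add_item("pen", price=20.00)
    return cart

def test_discount_applies_above_fifty(loaded_cart):
    assert apply_discount(loaded_cart) == 54.00   # 10% off a 60.00 cart

def test_discount_does_not_remove_items(loaded_cart):
    apply_discount(loaded_cart)
    assert len(loaded_cart.items) == 3
```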

Fixtures define the world your tests operate in. When you send these to the AI, the fixture tells it exactly what data structures and setup the implementation must support.

Use Parametrize for Specification Tables

When a function has many input-output pairs, pytest.mark.parametrize expresses the specification as a table:

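A sketch of the pattern (price_with_tax and the regional rates are illustrative):

```python
import pytest
from pricing import price_with_tax

@pytest.mark.parametrize(
    "net, region, expected",
    [
        (100.00, "CA", 107.25),   # 7.25% sales tax
        (100.00, "OR", 100.00),   # no sales tax
        (100.00, "NY", 104.00),   # 4% state tax
        (0.00,   "CA", 0.00),     # zero stays zero
    ],
)
def test_price_with_tax(net, region, expected):
    assert price_with_tax(net, region) == pytest.approx(expected)
```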

This is a specification table. It says: "For these exact inputs, produce these exact outputs." The AI can implement any algorithm it wants as long as it matches the table. This pattern works especially well for data transformation functions where business rules are complex.

Use Markers for Test Categories

Organize tests by category so you can run subsets:

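For example (parse_order_id, lookup_order, and the db_session fixture are placeholders; register the markers in pytest.ini so pytest does not warn about unknown marks):

```python
import pytest
from orders import parse_order_id, lookup_order

# pytest.ini should declare these markers:
# [pytest]
# markers =
#     unit: fast tests with no I/O
#     integration: tests that touch real services

@pytest.mark.unit
def test_parse_order_id():
    assert parse_order_id("ORD-00042") == 42

@pytest.mark.integration
def test_order_lookup_reads_database(db_session):  # db_session assumed to come from conftest.py
    order = lookup_order(db_session, order_id=42)
    assert order.status == "shipped"
```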

Run specific categories: pytest -m unit for fast feedback, pytest -m integration for thorough checks.

The Test Pyramid

Not all tests are created equal. The test pyramid organizes tests by scope and cost:

```
              /\
             /  \           E2E Tests (few)
            / E2E \         Full system, real dependencies
           /-------\        Slow, expensive, high confidence
          /         \
         /Integration\      Integration Tests (some)
        /             \     Multiple components, real I/O
       /---------------\    Medium speed, medium confidence
      /                 \
     /    Unit Tests     \  Unit Tests (many)
    /                     \ Single function, no I/O
   /-----------------------\  Fast, cheap, focused
```

| Level | What It Tests | Speed | Cost | When to Use |
| --- | --- | --- | --- | --- |
| Unit | Single function, pure logic | Milliseconds | Free | Every function with business logic |
| Integration | Components working together | Seconds | Low | API endpoints, database queries |
| E2E | Full system behavior | Minutes | High | Critical user workflows |

TDG at Each Level

Unit tests are your primary TDG specification. They define individual function behavior precisely:

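For example, a pure business-rule function (loyalty_discount and its tiers are illustrative):

```python
from discounts import loyalty_discount

def test_loyalty_discount_tiers():
    assert loyalty_discount(years=0) == 0.00    # no tenure, no discount
    assert loyalty_discount(years=1) == 0.05    # 5% after the first year
    assert loyalty_discount(years=5) == 0.10    # capped at 10%
```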

Integration tests define how components interact:

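For example, an API endpoint exercised against a real temporary database. The Flask-style app factory and schema helper are assumptions, not a fixed API:

```python
import pytest
from app import create_app
from app.db import init_schema

@pytest.fixture
def client(tmp_path):
    app = create_app(database_url=f"sqlite:///{tmp_path}/test.db")
    init_schema(app)
    return app.test_client()

def test_create_order_persists_and_returns_id(client):
    response = client.post("/orders", json={"item": "mug", "quantity": 2})
    assert response.status_code == 201
    assert response.get_json()["order_id"] > 0
```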

E2E tests define user-visible behavior:

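For example, a browser-driven checkout flow. This is a Playwright-style sketch; the URL and selectors are placeholders:

```python
from playwright.sync_api import sync_playwright

def test_checkout_flow_shows_confirmation():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://staging.example.com/shop")   # placeholder staging URL
        page.click("text=Add to cart")
        page.click("text=Checkout")
        page.fill("#email", "buyer@example.com")
        page.click("text=Place order")
        assert "Thank you" in page.inner_text("#confirmation")
        browser.close()
```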

For TDG, aim for this distribution: 70% unit, 20% integration, 10% E2E. Unit tests are the most effective specifications because they are precise, fast, and independent.

Coverage as a Metric

Code coverage measures how much of your implementation is exercised by tests. For TDG work, target 80% minimum coverage:

pytest --cov=shipping --cov-report=term-missing

Coverage tells you where your specification has gaps. If a branch is not covered, it means you have not specified what should happen in that case, and the AI's assumption is unverified.

But coverage is a floor, not a ceiling. 100% line coverage does not mean your specification is complete. A function can have every line executed but still be wrong for inputs you did not test. Coverage catches omissions. Good test design catches incorrect behavior.

Anti-Patterns

These patterns undermine TDG. Recognize and avoid them:

| Anti-Pattern | Why It Fails | TDG Alternative |
| --- | --- | --- |
| Testing after implementation | Tests confirm what code does, not what it should do. You test the AI's assumptions instead of your requirements. | Write tests first. The tests define requirements. |
| Tests coupled to implementation | Mocking internals, checking call order, asserting private state. Tests break on any refactor, preventing regeneration. | Test inputs and outputs only. Any correct implementation should pass. |
| No tests ("it's just a script") | Without specification, you cannot regenerate. Every bug requires manual debugging of code you did not write. | Even scripts need specs. Three tests beat zero tests. |
| AI-generated tests for AI-generated code | Circular logic: the same assumptions that produce wrong code produce wrong tests. Neither catches the other's errors. | You write tests (the specification). AI writes implementation (the solution). |
| Happy-path-only testing | Only testing the expected case. Edge cases, error conditions, and boundary values are unspecified. AI handles them however it wants. | Test the sad path. Test boundaries. Test invalid inputs. |
| Overly rigid assertions | Asserting exact floating-point values, exact string formatting, exact timestamps. Tests fail on valid implementations. | Use pytest.approx(), pattern matching, and relative assertions where appropriate. |
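The last row in practice might look like this (calculate_tax and render_receipt are hypothetical):

```python
import re
import pytest
from billing import calculate_tax, render_receipt

def test_tax_is_close_enough():
    assert calculate_tax(19.99) == pytest.approx(1.45, abs=0.01)   # tolerate float rounding

def test_receipt_mentions_the_order_number():
    receipt = render_receipt(order_id=42)
    assert re.search(r"Order #\d+", receipt)                       # pattern, not exact formatting
```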

The Circular Testing Trap

The most dangerous anti-pattern deserves special attention. When you ask AI to generate both the implementation and the tests, you get circular validation:

You: "Write a function to calculate tax and tests for it."
AI: [Writes function that uses 7% rate]
AI: [Writes tests that assert 7% rate]

The tests pass. Everything looks correct. But you never specified what the tax rate should be. The AI assumed 7%. If your business requires 8.5%, both the code and the tests are wrong, and neither catches the other.

In TDG, you are the specification authority. You decide what correct means. The AI is the implementation engine. It figures out how to achieve what you specified. Never delegate both roles to the AI.

Safety Note

TDG does not replace security review or performance testing. Tests specify functional correctness: given these inputs, produce these outputs. They do not automatically catch:

  • Security vulnerabilities: SQL injection, path traversal, authentication bypass. These require security-specific testing (SAST tools, penetration testing).
  • Performance issues: An implementation that passes all functional tests might be O(n^2) when O(n) is required. Add explicit performance assertions for critical paths.
  • Concurrency bugs: Race conditions may not manifest in sequential test execution. Use stress testing for concurrent code.
  • Resource leaks: Memory leaks, file handle leaks, connection pool exhaustion. Requires runtime monitoring (Axiom X).

TDG gives you functional correctness. Combine it with Axiom IX (Verification is a Pipeline) and Axiom X (Observability) for comprehensive quality assurance.

Try With AI

Prompt 1: Your First TDG Cycle (Experiencing the Workflow)

I want to practice Test-Driven Generation. Here is my specification as pytest tests:

```python
import pytest
from converter import temperature_convert

def test_celsius_to_fahrenheit():
    assert temperature_convert(0, "C", "F") == 32.0

def test_fahrenheit_to_celsius():
    assert temperature_convert(212, "F", "C") == 100.0

def test_celsius_to_kelvin():
    assert temperature_convert(0, "C", "K") == 273.15

def test_invalid_unit_raises():
    with pytest.raises(ValueError, match="Unknown unit"):
        temperature_convert(100, "C", "X")

def test_below_absolute_zero_raises():
    with pytest.raises(ValueError, match="below absolute zero"):
        temperature_convert(-300, "C", "K")
```

Write the implementation in converter.py that passes all 5 tests. Do NOT modify the tests. The tests are the specification.


What you're learning: The core TDG rhythm. You wrote the specification (tests). The AI generates the implementation. You run the tests to verify. If they pass, you accept. If they fail, you regenerate. Notice how the tests precisely define behavior (including error messages) without dictating how the conversion is calculated internally.

Prompt 2: Specification Design (Writing Tests That Specify, Not Constrain)

I need to build a function called summarize_scores(scores: list[int]) -> dict that takes a list of student test scores (0-100) and returns a summary dictionary.

Help me write pytest tests that SPECIFY the behavior without constraining the implementation. I want to test:

  • Normal case (mix of scores)
  • Empty list (edge case)
  • All same scores
  • Invalid scores (negative, above 100)
  • Single score

For each test, explain:

  1. What behavior am I specifying?
  2. Why is this a behavior test, not an implementation test?
  3. What implementation freedom does the AI retain?

Do NOT write the implementation yet. I want to understand specification design first.


What you're learning: The difference between specifying behavior and constraining implementation. Good TDG tests say "given this input, produce this output" without saying "use this algorithm" or "call this internal method." You are learning to leave implementation freedom for the AI while being precise about what correctness means.

Prompt 3: TDG for Your Domain (Applying to Real Work)

I'm building [describe a real feature you need: a pricing calculator, a data validator, a text parser, a scheduling function, etc.].

Help me apply Test-Driven Generation:

  1. First, ask me 5 clarifying questions about the expected behavior:

    • What are the inputs and their types?
    • What are the outputs?
    • What are the edge cases?
    • What errors should be raised and when?
    • What are the business rules?
  2. Based on my answers, write a complete pytest test file that serves as the specification. Include: fixtures for test data, parametrize for rule tables, edge case tests, error tests.

  3. Then generate the implementation that passes all tests.

  4. Finally, suggest 3 additional tests I might have missed that would make my specification more complete.

Walk me through each step so I understand the TDG process for my specific domain.


What you're learning: Applying TDG to your own problems. The clarifying questions teach you what information a specification needs. The test file shows you how to structure a complete specification. The additional tests reveal gaps in your thinking. This is the skill that transfers: learning to think in specifications rather than implementations, regardless of what you are building.

---

*Next: Axiom VIII explores how version control provides the persistent memory layer that stores both your specifications and the implementations they generate.*