All Tests Green
James ran the command. Three failures. Two in the content tools, one in the code sandbox.
"Two in content, one in submit_code." He stared at the terminal output. Lines of red, tracebacks, assertion errors.
Emma leaned over. "Read the assertion messages. Not the stack trace. The assertion tells you what the test expected and what it got. That is the diagnosis."
James scrolled past the traceback to the assertion line. AssertionError: expected status 'blocked' for free tier accessing chapter 10, got chapter content instead. He read it aloud. "The tier gate did not block the request."
"Good. Now describe that to Claude Code. Not 'the test failed.' Tell it what was expected, what happened, and which tool."
You are doing exactly what James is doing. Running pytest, reading failures, describing them to Claude Code, and running again until every test passes. Then you go further: you find the edge cases Claude Code missed.
Step 1: Run the Suite
Open Claude Code in your tutorclaw-mcp project. Run the test suite:
uv run pytest -v
The -v flag shows each test by name with its result. You will see a list of test names followed by PASSED or FAILED.
Read the output. You are looking for two things:
- How many tests passed. This is your starting count.
- Which tests failed. Each failure shows an assertion message.
If every test passes on the first run, skip to Step 3 (adding edge cases). If tests fail, continue to Step 2.
Step 2: The Debug Cycle
For each failing test, follow this cycle:
Read the assertion message. Scroll past the traceback (the chain of file paths and line numbers) and find the line that starts with AssertionError or assert. This line tells you what the test expected and what it actually got.
| Failure Pattern | What the Assertion Tells You |
|---|---|
| Missing data directory | FileNotFoundError: No such file or directory: 'data/learners.json' |
| JSON file not initialized | json.JSONDecodeError when reading an empty file vs a missing file |
| Tier gating not applied | expected 'blocked', got actual content |
| Confidence out of range | assert 0.0 <= confidence <= 1.0 fails after a delta pushes the score past the boundary |
| Duplicate registration | expected new ID, got existing ID or expected error, got success |
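The "confidence out of range" row describes a clamping bug. A minimal sketch of the expected behavior, assuming the tool stores confidence as a float in the 0.0 to 1.0 range (apply_delta is an illustrative name, not TutorClaw's actual function):

```python
# Hypothetical sketch: applying a delta to a confidence score while
# clamping to [0.0, 1.0]. The function name and signature are assumptions
# for illustration, not code from the TutorClaw project.
def apply_delta(confidence: float, delta: float) -> float:
    """Add a delta to a confidence score, clamped to the [0.0, 1.0] range."""
    return max(0.0, min(1.0, confidence + delta))
```

A test for this boundary would assert that a positive delta at 1.0 stays at 1.0 and a negative delta at 0.0 stays at 0.0.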
Describe the failure to Claude Code. Copy the assertion message and send it with context:
This test failed:
test_get_chapter_content_free_tier_blocked
The assertion says: expected the tool to return a blocked status
for a free tier learner accessing chapter 10, but it returned
the actual chapter content instead.
The tier check in get_chapter_content should block chapters
above 5 for free tier learners. Fix this.
Notice what this message contains: the test name, the assertion in plain language, what should have happened, and a clear instruction. Claude Code does not need the full traceback. It needs the diagnosis.
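For reference, a fix for this failure might look roughly like the sketch below. The "free" tier label, the chapter-5 cutoff, and the return shape are assumptions drawn from the lesson's scenario, not TutorClaw's actual code:

```python
# Hypothetical sketch of the tier gate described in the prompt above.
# Constant, tier names, and return shape are illustrative assumptions.
FREE_TIER_MAX_CHAPTER = 5

def check_tier_access(tier: str, chapter: int) -> dict:
    """Block free-tier learners from chapters past the cutoff."""
    if tier == "free" and chapter > FREE_TIER_MAX_CHAPTER:
        return {"status": "blocked",
                "reason": f"chapter {chapter} requires a paid tier"}
    return {"status": "ok"}
```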
Run the tests again. After Claude Code applies the fix:
uv run pytest -v
Check whether the previously failing test now passes. Check whether the fix broke any other test (a fix in one tool can affect shared state that another tool depends on).
Repeat. If new failures appear, follow the same cycle. Run, read, describe, fix, run. Each cycle should reduce the failure count. If the count goes up instead of down, the fix introduced a regression. Describe that to Claude Code:
Your fix for the tier gating broke the update_progress test.
Before your change, update_progress passed. After your change,
it fails with this assertion:
[paste the new assertion]
Fix the tier gating without breaking update_progress.
This back-and-forth is the normal workflow. You steer; Claude Code implements. The test suite is the referee that tells both of you whether the fix worked.
Step 3: Add Edge Cases
All existing tests pass. Now think about what Claude Code did not test.
When Claude Code generated the test suite in Lesson 11, it tested the scenarios you described: valid inputs, invalid inputs, tier gating, and state persistence. But it tested from the center of each scenario, not the edges.
Edge cases live at boundaries. Ask yourself these questions:
What happens on first run? The data/learners.json file does not exist yet. No learner has ever registered. Does register_learner create the file, or does it crash looking for a file that is not there?
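One defensive pattern for the first-run case, sketched under the assumption that learner records live in data/learners.json as a JSON object (load_learners is an illustrative helper, not TutorClaw's real function):

```python
# Hypothetical sketch: read the learners file, creating the directory and
# an empty-but-valid JSON file on first run instead of crashing with
# FileNotFoundError. The path and helper name are illustrative.
import json
from pathlib import Path

def load_learners(path: str = "data/learners.json") -> dict:
    """Return the learner records, initializing storage on first run."""
    p = Path(path)
    if not p.exists():
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_text("{}")  # valid empty JSON, so json.loads never fails
    return json.loads(p.read_text())
```

Initializing with "{}" rather than an empty file also sidesteps the json.JSONDecodeError row from the failure table.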
What happens at the limits? Confidence is a score between 0.0 and 1.0. If a learner's confidence is already 1.0 and assess_response returns a positive delta of 0.1, does the confidence stay at 1.0 or overflow to 1.1? Same question at the bottom: confidence at 0.0 with a negative delta.
What happens with invalid boundaries? A learner requests chapter 0. Or chapter 999. Chapter 0 does not exist. Chapter 999 is far beyond your content library. Does the tool return a clear error or crash with a KeyError?
What happens with empty input? submit_code receives an empty string as the code to execute. Does the sandbox handle it gracefully?
What happens with duplicates? A learner named "Alice" registers. Then another registration request for "Alice" arrives. Should the tool return the existing learner, create a second one, or return an error?
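Whichever duplicate policy you pick, the test should state it explicitly. A sketch of one reasonable policy, returning the existing learner's ID instead of creating a second record (the signature and in-memory dict are illustrative assumptions, not TutorClaw's storage):

```python
# Hypothetical sketch of one duplicate-registration policy: reuse the
# existing record for a known name. Signature and record shape are
# illustrative assumptions.
import uuid

def register_learner(name: str, learners: dict) -> str:
    """Return the existing ID for a known name, or register a new learner."""
    for learner_id, record in learners.items():
        if record["name"] == name:
            return learner_id  # duplicate: reuse the existing record
    learner_id = str(uuid.uuid4())
    learners[learner_id] = {"name": name, "confidence": 0.0}
    return learner_id
```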
Describe these edge cases to Claude Code:
All tests pass. Now I want to add edge case tests. Here are
the scenarios:
1. register_learner when data/learners.json does not exist yet
(first run, no data directory or empty directory)
2. assess_response when confidence is already 1.0 and the
delta is positive (should clamp at 1.0, not exceed it)
3. get_chapter_content when the learner requests chapter 0
or chapter 999 (neither exists)
4. submit_code when the code string is empty
5. register_learner when a learner with the same name
already exists
Write tests for each of these. If any test reveals a bug
in the tool code, fix the tool code too.
Claude Code writes the tests. Some will pass immediately (the tool already handles the case). Some will fail (the tool does not). For each failure, apply the same cycle from Step 2: describe, fix, run, until the tool handles the edge and the test passes.
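To make the shape concrete, here is what one of the requested edge case tests might look like. The submit_code below is a stand-in stub so the sketch runs on its own; in your project the test would import the real tool instead:

```python
# Hypothetical sketch of edge case 4 (empty code submission). submit_code
# is a stub standing in for the real sandbox tool; its return shape is an
# illustrative assumption.
def submit_code(code: str) -> dict:
    """Stub: reject empty submissions instead of invoking the sandbox."""
    if not code.strip():
        return {"status": "error", "message": "empty code submission"}
    return {"status": "ok"}

def test_submit_code_empty_string():
    result = submit_code("")
    assert result["status"] == "error"
```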
Run the full suite again:
uv run pytest -v
Count the total. Your starting number from Step 1 has grown. Every new test is a new guard.
Step 4: The Green Suite
Every test passes. Count them.
uv run pytest -v --tb=no | tail -1
That final line shows the total count: something like 28 passed. This number is your safety net.
From this point forward, every change you make to TutorClaw runs against this suite. When you add tier gating in Lesson 13, the tier gating gets new tests, and the existing tests prove nothing else broke. When you wire Stripe in Lesson 14, the payment tool gets its own tests. The existing 28 (or however many you have) stay green.
The key insight: these tests run against JSON files. They test what each tool accepts and what it returns. They do not test how data is stored. If you later replace JSON with PostgreSQL, you change the storage functions. The tests still pass because they verify tool behavior, not storage internals. You swap the data layer; the tool interface stays the same; the tests stay green.
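The distinction fits in a few lines. In this sketch the store is a plain dict; swapping it for JSON files or PostgreSQL behind the same two functions leaves the test untouched, because the test only exercises the tool interface (the functions and store here are illustrative, not TutorClaw's code):

```python
# Hypothetical sketch of a storage-agnostic test. Only the tool interface
# (register_learner / get_learner_state) appears in the test; the dict
# behind it could be replaced by any storage layer.
_store: dict = {}

def register_learner(name: str) -> str:
    learner_id = f"learner-{len(_store) + 1}"
    _store[learner_id] = {"name": name, "confidence": 0.0}
    return learner_id

def get_learner_state(learner_id: str) -> dict:
    return _store[learner_id]

def test_round_trip():
    lid = register_learner("Ada")
    assert get_learner_state(lid)["name"] == "Ada"
```

A test that instead asserted "data/learners.json exists and contains a key" would break the moment the storage moved to a database.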
That portability is why you built tests before integrations. The test suite is the contract. Everything you build from here is tested against that contract.
Try With AI
Exercise 1: Find More Edges
Review the test suite for TutorClaw. Which boundary conditions
are still not tested? Think about: what happens when JSON files
are corrupted (invalid JSON), when two tools modify the same
learner record in rapid succession, and when a tool receives
a learner_id that does not exist in the system.
List 3 untested boundaries and write tests for them.
What you are learning: Edge case identification is a skill that improves with practice. Claude Code generates tests from requirements, but boundary conditions require someone who thinks about what could go wrong in production. Corrupted files, missing records, and race conditions are the failures users find first.
Exercise 2: Coverage by Tool
Group all TutorClaw tests by which tool they exercise.
Show me a count: how many tests per tool? Which tool has the
fewest tests? Add 2 more tests for the tool with the lowest
coverage.
What you are learning: Test coverage is not evenly distributed. The tools you described in the most detail got the most tests. The ones you described briefly got fewer. Reviewing coverage by tool reveals where your descriptions were thin, not just where the code is weak.
Exercise 3: The Database Swap Thought Experiment
Imagine I replace data/learners.json with a PostgreSQL database.
Walk me through which test files would change, which would stay
the same, and which would need new tests. Do not make the change.
Just explain the impact on the test suite.
What you are learning: Tests that verify tool inputs and outputs survive infrastructure changes. Tests that verify storage details (file existence, JSON structure) break when the storage changes. Understanding this distinction helps you write tests that protect behavior rather than implementation. The tools are the stable interface; the storage behind them is replaceable.
James ran the final command. Green across the board.
"Twenty-eight tests. All green." He scrolled through the list. register_learner: 5 tests. get_learner_state: 3 tests. get_chapter_content: 4 tests. assess_response: 4 tests. Every tool covered.
He thought for a moment. "What if a learner registers twice with the same name?"
"Write the test."
He described the scenario to Claude Code. Thirty seconds later, test twenty-nine existed. He ran the suite. Green.
"Twenty-nine."
Emma smiled. "My first test suite for a 9-tool server had 11 tests. I wrote each one by hand. You described a test matrix to Claude Code and got 28 on the first pass. I spent a weekend writing 11. You spent one lesson describing requirements and one lesson finding edges."
"That feels uneven."
"It is. The skill is not writing tests. The skill is knowing what to test. You identified the edges Claude Code missed: first run, confidence limits, duplicate names, empty input. That is the human value. Describing the matrix is the mechanical part." She closed her laptop. "That suite is your safety net. Every change from now on runs against these 29 tests. When you add tier gating in the next lesson, the gating logic gets new tests. The existing ones prove nothing else broke."
James nodded. Twenty-nine guards. All standing.