The Test Suite
Emma walked in to find James scrolling through his WhatsApp conversation with TutorClaw. Nine tools, all working. He had tested each one by hand over the past three sessions: register a learner, fetch content, generate guidance, assess a response, submit code, get an upgrade URL.
"Nine tools. All working from WhatsApp. How do you know they still work tomorrow?"
James looked up. "I test them."
"Every tool, every time? By hand?"
"That is why I need a test suite." James paused. "But I do not know where to start. Nine tools, each with different behaviors. Some are gated by tier. Some write to JSON files. That is a lot of combinations."
Emma pulled a chair over. "Start with a matrix. List every tool down the left side. List every kind of test across the top. Fill in the cells. That is your test design. Then describe it to Claude Code and let it write the actual tests."
"I design what to test. Claude Code writes how to test it."
"Exactly. The matrix is the hard part. The pytest code is the easy part."
You are doing exactly what James is doing. You will design a test matrix for all 9 TutorClaw tools, describe your test requirements to Claude Code, and let it generate a complete pytest suite. Some tests will fail. That is normal. Lesson 12 is dedicated to fixing failures and adding edge cases.
Step 1: Design the Test Matrix
Before sending anything to Claude Code, design the matrix on paper (or in a text file). Every tool gets tested across four categories. Not every category applies to every tool.
The four test categories:
| Category | What It Tests | Which Tools Need It |
|---|---|---|
| Valid input | Correct parameters produce correct response | All 9 tools |
| Invalid input | Missing or wrong parameters produce a clear error | All 9 tools |
| Tier gating | Free tier blocked from premium content, paid tier gets access | Only gated tools: get_chapter_content, get_exercises, submit_code |
| State persistence | Register a learner, simulate a restart (reload JSON), verify data survives | Only state tools: register_learner, get_learner_state, update_progress |
Your test matrix:
| Tool | Valid Input | Invalid Input | Tier Gating | State Persistence |
|---|---|---|---|---|
| register_learner | Yes | Yes | No | Yes |
| get_learner_state | Yes | Yes | No | Yes |
| update_progress | Yes | Yes | No | Yes |
| get_chapter_content | Yes | Yes | Yes | No |
| get_exercises | Yes | Yes | Yes | No |
| generate_guidance | Yes | Yes | No | No |
| assess_response | Yes | Yes | No | No |
| submit_code | Yes | Yes | Yes | No |
| get_upgrade_url | Yes | Yes | No | No |
Count the cells: 9 tools with valid input (9 tests) + 9 tools with invalid input (9 tests) + 3 tools with tier gating (3 tests) + 3 tools with state persistence (3 tests) = at least 24 tests across the suite.
This matrix is the design artifact. It tells you exactly what coverage you need before a single line of test code exists.
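To make the matrix concrete, here is a sketch of what one row might become in pytest. The `register_learner` function below is a minimal stand-in written for illustration, not the real TutorClaw tool; the actual tests Claude Code generates will target your implementation, and the exact return shape (`learner_id`, `error` keys) is an assumption.

```python
# Sketch of one matrix row (register_learner) as pytest tests.
# register_learner is a stand-in, NOT the real TutorClaw tool: it writes
# learner state to a JSON file under a caller-supplied directory.
import json
from pathlib import Path

def register_learner(name, data_dir):
    if not name:
        return {"error": "name is required"}  # clear error, not a crash
    path = Path(data_dir) / "learners.json"
    learners = json.loads(path.read_text()) if path.exists() else {}
    learner_id = f"learner-{len(learners) + 1}"
    learners[learner_id] = {"name": name, "tier": "free"}
    path.write_text(json.dumps(learners))
    return {"learner_id": learner_id}

# One test per matrix cell: valid input, invalid input, state persistence.
def test_register_learner_valid(tmp_path):
    assert "learner_id" in register_learner("Ada", tmp_path)

def test_register_learner_missing_name(tmp_path):
    assert "error" in register_learner("", tmp_path)

def test_register_learner_persists_after_reload(tmp_path):
    result = register_learner("Ada", tmp_path)
    # Simulate a restart by rereading the JSON from disk
    reloaded = json.loads((tmp_path / "learners.json").read_text())
    assert reloaded[result["learner_id"]]["name"] == "Ada"
```

Each test takes pytest's built-in `tmp_path` fixture, which supplies a fresh temporary directory per test, so the persistence test exercises a real write-then-reload cycle without touching production data.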
Step 2: Plan the Test Organization
Tests are easier to maintain when they are grouped by tool category, not crammed into one giant file. The test files match the tool groups from Lessons 3 through 6:
| Test File | Tools Covered | Why This Grouping |
|---|---|---|
| tests/test_state_tools.py | register_learner, get_learner_state, update_progress | All three write and read JSON state files |
| tests/test_content_tools.py | get_chapter_content, get_exercises | Both read local content and have tier gating |
| tests/test_pedagogy_tools.py | generate_guidance, assess_response | Both implement PRIMM-Lite logic |
| tests/test_code_tools.py | submit_code | Code execution with tier gating |
| tests/test_monetization_tools.py | get_upgrade_url | Stripe checkout link generation |
Five test files. Each covers a logical group. When a state tool breaks, you look in test_state_tools.py, not in a 500-line monolith.
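Because both content tools share tier gating, their tests in test_content_tools.py follow one pattern. Here is a hedged sketch: `get_chapter_content` is a stand-in for the real tool, and the "chapters 1 and 2 are free" boundary is an assumption made purely for illustration.

```python
# Hypothetical tier-gating tests. get_chapter_content is a stand-in for the
# real TutorClaw tool; the free/premium boundary below is an assumption.
FREE_CHAPTERS = {1, 2}

def get_chapter_content(chapter, tier):
    if chapter not in FREE_CHAPTERS and tier == "free":
        # Gated: return an upgrade prompt instead of the content
        return {"error": "upgrade required", "upgrade": True}
    return {"chapter": chapter, "content": f"Chapter {chapter} text"}

def test_get_chapter_content_free_tier_blocked():
    result = get_chapter_content(3, tier="free")
    assert result.get("upgrade") is True
    assert "content" not in result

def test_get_chapter_content_paid_tier_allowed():
    assert "content" in get_chapter_content(3, tier="paid")

def test_get_chapter_content_free_chapter_allowed():
    # Free tier still gets the free chapters
    assert "content" in get_chapter_content(1, tier="free")
```

Note the third test: gating tests should check both sides of the boundary, not just that blocking happens.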
Step 3: Describe Test Requirements to Claude Code
Open Claude Code in your tutorclaw-mcp project. Send this prompt:
I need a pytest test suite for all 9 TutorClaw tools. Here are my
requirements:
TEST ORGANIZATION:
- tests/test_state_tools.py (register_learner, get_learner_state,
update_progress)
- tests/test_content_tools.py (get_chapter_content, get_exercises)
- tests/test_pedagogy_tools.py (generate_guidance, assess_response)
- tests/test_code_tools.py (submit_code)
- tests/test_monetization_tools.py (get_upgrade_url)
TEST CATEGORIES (apply per tool as appropriate):
1. Valid input: Call with correct parameters, verify correct response
2. Invalid input: Call with missing or wrong parameters, verify the
tool returns a clear error (not a crash)
3. Tier gating (for get_chapter_content, get_exercises, submit_code
only): Free tier learner blocked from premium content, paid tier
learner gets access
4. State persistence (for register_learner, get_learner_state,
update_progress only): Register a learner, write data to JSON,
reload the JSON from disk, verify data survived
ISOLATION:
All tests must use a temporary directory for JSON state files and
content files. No test should read from or write to the production
data directory. Use pytest fixtures (like tmp_path) for isolation.
Build this test suite.
Notice what this prompt contains:
- File organization (which test file covers which tools)
- Test categories with clear criteria for each
- Which categories apply to which tools (the matrix from Step 1)
- Isolation requirement (temporary directories)
It does not contain any Python. No def test_ functions. No assert statements. No fixture code. Those are implementation decisions. Claude Code handles them.
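For reference, the isolation requirement typically turns into a shared fixture in tests/conftest.py. The sketch below assumes a file layout (a learners.json state file plus a content/ folder of markdown chapters) that is illustrative only; Claude Code will match the fixture to however your tools actually locate their data.

```python
# Sketch of a conftest.py isolation fixture. The learners.json + content/
# layout is an assumption for illustration, not the real TutorClaw layout.
from pathlib import Path

import pytest

def seed_data_dir(root: Path) -> Path:
    """Populate a fresh directory with empty state and one sample chapter."""
    (root / "learners.json").write_text("{}")
    content = root / "content"
    content.mkdir()
    (content / "chapter1.md").write_text("# Chapter 1: Variables")
    return root

@pytest.fixture
def data_dir(tmp_path):
    # tmp_path is unique per test, so no test can read or write production
    # data, and no test can see another test's writes
    return seed_data_dir(tmp_path)
```

Any test that needs files simply takes `data_dir` as an argument and passes it to the tool under test.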
Step 4: Review What Claude Code Built
Claude Code generates the five test files. Before running anything, review the structure. Ask:
Show me the test function names in each file. I want to verify
coverage against my test matrix before running anything.
Claude Code should list something like:
tests/test_state_tools.py:
  test_register_learner_valid
  test_register_learner_missing_name
  test_register_learner_persists_after_reload
  test_get_learner_state_valid
  test_get_learner_state_unknown_learner
  test_get_learner_state_persists_after_reload
  test_update_progress_valid
  test_update_progress_missing_learner_id
  test_update_progress_persists_after_reload
tests/test_content_tools.py:
  test_get_chapter_content_valid
  test_get_chapter_content_invalid_chapter
  test_get_chapter_content_free_tier_blocked
  test_get_chapter_content_paid_tier_allowed
  test_get_exercises_valid
  test_get_exercises_invalid_chapter
  test_get_exercises_free_tier_blocked
  test_get_exercises_paid_tier_allowed
Compare these names against your matrix. Every cell in the matrix should have at least one matching test function. If a cell is missing, steer Claude Code:
The matrix says submit_code needs a tier gating test. I do not see
test_submit_code_free_tier_blocked in test_code_tools.py. Add it.
This is the same describe-steer-verify cycle from building the tools. Now you are applying it to tests.
Step 5: Run the Suite
uv run pytest -v
The -v flag shows each test name and its result. You will see a mix of passes and failures.
Expected output looks something like:
tests/test_state_tools.py::test_register_learner_valid PASSED
tests/test_state_tools.py::test_register_learner_missing_name PASSED
tests/test_state_tools.py::test_register_learner_persists_after_reload PASSED
tests/test_content_tools.py::test_get_chapter_content_valid PASSED
tests/test_content_tools.py::test_get_chapter_content_free_tier_blocked FAILED
tests/test_pedagogy_tools.py::test_generate_guidance_valid PASSED
tests/test_pedagogy_tools.py::test_assess_response_valid FAILED
tests/test_code_tools.py::test_submit_code_valid PASSED
tests/test_code_tools.py::test_submit_code_free_tier_blocked FAILED
...
Some tests pass. Some fail. That is the point. A test suite that passes 100% on the first run is either trivial or lying. The failures tell you where the problems are.
Step 6: Read the Failures (Do Not Fix Yet)
Look at each failure message. Note which tool, which category, and what went wrong. Do not ask Claude Code to fix anything yet. That is Lesson 12.
For now, record the failures. Ask Claude Code:
Summarize the test failures. For each failing test, tell me:
1. Which tool
2. Which test category (valid, invalid, tier gating, persistence)
3. What the expected result was
4. What the actual result was
Do not fix anything. Just report.
This is your test report. It is a precise map of what works and what does not. No guessing, no "it seemed fine from WhatsApp." The suite proves it.
James ran uv run pytest -v and watched the output scroll. Green dots. Red dots. A final summary line.
"Three failures." He frowned.
"Good." Emma leaned forward and read the screen. "Three failures out of twenty-four tests. You know exactly where the problems are. That is the point of a test suite."
"But they were working from WhatsApp."
"Working from WhatsApp means the happy path works when you test it by hand, once, with the exact input you happened to type. Working from a test suite means every path works, every time, with every kind of input." She pointed at the three red lines. "Those failures are a gift. You know the tool name, the test category, and the exact mismatch between expected and actual."
James stared at the failures. "I thought I would have to debug all of this myself."
"You describe the failures to Claude Code. It fixes them. Same workflow as building the tools." Emma paused, then added more quietly: "I will be honest. I never know if a test suite is thorough enough. Three categories per tool is a solid start. Edge cases always surprise you later. But you have a foundation now, and that matters more than perfection."
She tapped the screen. "Lesson 12. Fix the failures. Add the edge cases that this run revealed. Get to green."
Try With AI
Exercise 1: Audit the Test Matrix
Review the test matrix and ask Claude Code whether any categories are missing:
Here is my test matrix for TutorClaw:
- Valid input: all 9 tools
- Invalid input: all 9 tools
- Tier gating: get_chapter_content, get_exercises, submit_code
- State persistence: register_learner, get_learner_state, update_progress
Are there test categories I am missing? What about boundary conditions
like empty strings, very long names, or special characters in learner
names? Suggest additional test categories but do not write the tests.
What you are learning: A test matrix is a living document. The first version covers the obvious categories. Reviewing it with AI reveals gaps you had not considered, like input boundary conditions and concurrent access patterns.
Exercise 2: Explain the Isolation Requirement
Ask Claude Code to explain what would happen without test isolation:
My TutorClaw tests use temporary directories for JSON files. What
would happen if the tests read from and wrote to the production
data/learners.json file instead? Give me three specific problems
that would occur.
What you are learning: Test isolation is not a best practice you memorize. It is a design decision with concrete consequences. Understanding the failure mode (corrupted production data, tests that pass locally but fail on a clean machine, tests that interfere with each other) makes the decision obvious rather than arbitrary.
Exercise 3: Compare to Manual Testing
Ask Claude Code to compare your automated suite to the manual WhatsApp testing you did in Lesson 8:
In Lesson 8, I tested TutorClaw by sending WhatsApp messages and
checking the responses. Now I have a pytest suite. What does the
pytest suite catch that WhatsApp testing misses? What does WhatsApp
testing catch that pytest misses? When would I use each?
What you are learning: Automated tests and manual integration tests serve different purposes. Neither replaces the other. Knowing when to use each saves you from both false confidence (all tests pass, but the real flow is broken) and wasted effort (manually testing what a script can verify in seconds).