The Test Suite
Emma walked in to find James scrolling through his WhatsApp conversation with TutorClaw. Nine tools, all working. He had tested each one by hand over the past three sessions: register a learner, fetch content, generate guidance, assess a response, submit code, get an upgrade URL.
"Nine tools. All working from WhatsApp. How do you know they still work tomorrow?"
James looked up. "I test them."
"Every tool, every time? By hand?"
"That is why I need a test suite." James paused. "But I do not know where to start. Nine tools, each with different behaviors. Some are gated by tier. Some write to JSON files. That is a lot of combinations."
Emma pulled a chair over. "Start with a matrix. List every tool down the left side. List every kind of test across the top. Fill in the cells. That is your test design. Then describe it to Claude Code and let it write the actual tests."
"I design what to test. Claude Code writes how to test it."
"Exactly. The matrix is the hard part. The pytest code is the easy part."
You are doing exactly what James is doing. You will design a test matrix for all 9 TutorClaw tools, describe your test requirements to Claude Code, and let it generate a complete pytest suite. Some tests will fail. That is normal. Lesson 12 is dedicated to fixing failures and adding edge cases.
Step 1: Design the Test Matrix
Before sending anything to Claude Code, design the matrix on paper (or in a text file). Every tool gets tested across four categories. Not every category applies to every tool.
The four test categories:
| Category | What It Tests | Which Tools Need It |
|---|---|---|
| Valid input | Correct parameters produce correct response | All 9 tools |
| Invalid input | Missing or wrong parameters produce a clear error | All 9 tools |
| Tier gating | Free tier blocked from premium content, paid tier gets access | Only gated tools: get_chapter_content, get_exercises, submit_code |
| State persistence | Register a learner, simulate a restart (reload JSON), verify data survives | Only state tools: register_learner, get_learner_state, update_progress |
Your test matrix:
| Tool | Valid Input | Invalid Input | Tier Gating | State Persistence |
|---|---|---|---|---|
| register_learner | Yes | Yes | No | Yes |
| get_learner_state | Yes | Yes | No | Yes |
| update_progress | Yes | Yes | No | Yes |
| get_chapter_content | Yes | Yes | Yes | No |
| get_exercises | Yes | Yes | Yes | No |
| generate_guidance | Yes | Yes | No | No |
| assess_response | Yes | Yes | No | No |
| submit_code | Yes | Yes | Yes | No |
| get_upgrade_url | Yes | Yes | No | No |
Count the cells: 9 tools with valid input (9 tests) + 9 tools with invalid input (9 tests) + 3 tools with tier gating (3 tests) + 3 tools with state persistence (3 tests) = at least 24 tests across the suite.
This matrix is the design artifact. It tells you exactly what coverage you need before a single line of test code exists.
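To make the matrix concrete, here is a sketch of what one row might become in pytest. The `register_learner` function below is a minimal stand-in written for illustration, not the real TutorClaw tool; the actual tests Claude Code generates will target your implementation, and the exact return shape (`learner_id`, `error` keys) is an assumption.

```python
# Sketch of one matrix row (register_learner) as pytest tests.
# register_learner is a stand-in, NOT the real TutorClaw tool: it writes
# learner state to a JSON file under a caller-supplied directory.
import json
from pathlib import Path

def register_learner(name, data_dir):
    if not name:
        return {"error": "name is required"}  # clear error, not a crash
    path = Path(data_dir) / "learners.json"
    learners = json.loads(path.read_text()) if path.exists() else {}
    learner_id = f"learner-{len(learners) + 1}"
    learners[learner_id] = {"name": name, "tier": "free"}
    path.write_text(json.dumps(learners))
    return {"learner_id": learner_id}

# One test per matrix cell: valid input, invalid input, state persistence.
def test_register_learner_valid(tmp_path):
    assert "learner_id" in register_learner("Ada", tmp_path)

def test_register_learner_missing_name(tmp_path):
    assert "error" in register_learner("", tmp_path)

def test_register_learner_persists_after_reload(tmp_path):
    result = register_learner("Ada", tmp_path)
    # Simulate a restart by rereading the JSON from disk
    reloaded = json.loads((tmp_path / "learners.json").read_text())
    assert reloaded[result["learner_id"]]["name"] == "Ada"
```

Each test takes pytest's built-in `tmp_path` fixture, which supplies a fresh temporary directory per test, so the persistence test exercises a real write-then-reload cycle without touching production data.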
Step 2: Plan the Test Organization
Tests are easier to maintain when they are grouped by tool category, not crammed into one giant file. The test files match the tool groups from Lessons 3 through 6:
| Test File | Tools Covered | Why This Grouping |
|---|---|---|
| tests/test_state_tools.py | register_learner, get_learner_state, update_progress | All three write and read JSON state files |
| tests/test_content_tools.py | get_chapter_content, get_exercises | Both read local content and have tier gating |
| tests/test_pedagogy_tools.py | generate_guidance, assess_response | Both implement PRIMM-Lite logic |
| tests/test_code_tools.py | submit_code | Code execution with tier gating |
| tests/test_monetization_tools.py | get_upgrade_url | Stripe checkout link generation |
Five test files. Each covers a logical group. When a state tool breaks, you look in test_state_tools.py, not in a 500-line monolith.
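Because both content tools share tier gating, their tests in test_content_tools.py follow one pattern. Here is a hedged sketch: `get_chapter_content` is a stand-in for the real tool, and the "chapters 1 and 2 are free" boundary is an assumption made purely for illustration.

```python
# Hypothetical tier-gating tests. get_chapter_content is a stand-in for the
# real TutorClaw tool; the free/premium boundary below is an assumption.
FREE_CHAPTERS = {1, 2}

def get_chapter_content(chapter, tier):
    if chapter not in FREE_CHAPTERS and tier == "free":
        # Gated: return an upgrade prompt instead of the content
        return {"error": "upgrade required", "upgrade": True}
    return {"chapter": chapter, "content": f"Chapter {chapter} text"}

def test_get_chapter_content_free_tier_blocked():
    result = get_chapter_content(3, tier="free")
    assert result.get("upgrade") is True
    assert "content" not in result

def test_get_chapter_content_paid_tier_allowed():
    assert "content" in get_chapter_content(3, tier="paid")

def test_get_chapter_content_free_chapter_allowed():
    # Free tier still gets the free chapters
    assert "content" in get_chapter_content(1, tier="free")
```

Note the third test: gating tests should check both sides of the boundary, not just that blocking happens.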
Step 3: Describe Test Requirements to Claude Code
Open Claude Code in your tutorclaw-mcp project. Send this prompt:
I need a pytest test suite for all 9 TutorClaw tools. Here are my
requirements:
TEST ORGANIZATION:
- tests/test_state_tools.py (register_learner, get_learner_state,
update_progress)
- tests/test_content_tools.py (get_chapter_content, get_exercises)
- tests/test_pedagogy_tools.py (generate_guidance, assess_response)
- tests/test_code_tools.py (submit_code)
- tests/test_monetization_tools.py (get_upgrade_url)
TEST CATEGORIES (apply per tool as appropriate):
1. Valid input: Call with correct parameters, verify correct response
2. Invalid input: Call with missing or wrong parameters, verify the
tool returns a clear error (not a crash)
3. Tier gating (for get_chapter_content, get_exercises, submit_code
only): Free tier learner blocked from premium content, paid tier
learner gets access
4. State persistence (for register_learner, get_learner_state,
update_progress only): Register a learner, write data to JSON,
reload the JSON from disk, verify data survived
ISOLATION:
All tests must use a temporary directory for JSON state files and
content files. No test should read from or write to the production
data directory. Use pytest fixtures (like tmp_path) for isolation.
Build this test suite.
Notice what this prompt contains:
- File organization (which test file covers which tools)
- Test categories with clear criteria for each
- Which categories apply to which tools (the matrix from Step 1)
- Isolation requirement (temporary directories)
It does not contain any Python. No def test_ functions. No assert statements. No fixture code. Those are implementation decisions. Claude Code handles them.
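For reference, the isolation requirement typically turns into a shared fixture in tests/conftest.py. The sketch below assumes a file layout (a learners.json state file plus a content/ folder of markdown chapters) that is illustrative only; Claude Code will match the fixture to however your tools actually locate their data.

```python
# Sketch of a conftest.py isolation fixture. The learners.json + content/
# layout is an assumption for illustration, not the real TutorClaw layout.
from pathlib import Path

import pytest

def seed_data_dir(root: Path) -> Path:
    """Populate a fresh directory with empty state and one sample chapter."""
    (root / "learners.json").write_text("{}")
    content = root / "content"
    content.mkdir()
    (content / "chapter1.md").write_text("# Chapter 1: Variables")
    return root

@pytest.fixture
def data_dir(tmp_path):
    # tmp_path is unique per test, so no test can read or write production
    # data, and no test can see another test's writes
    return seed_data_dir(tmp_path)
```

Any test that needs files simply takes `data_dir` as an argument and passes it to the tool under test.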
Step 4: Review What Claude Code Built
Claude Code generates the five test files. Before running anything, review the structure. Ask:
Show me the test function names in each file. I want to verify
coverage against my test matrix before running anything.
Claude Code should list something like:
tests/test_state_tools.py:
  test_register_learner_valid
  test_register_learner_missing_name
  test_register_learner_persists_after_reload
  test_get_learner_state_valid
  test_get_learner_state_unknown_learner
  test_get_learner_state_persists_after_reload
  test_update_progress_valid
  test_update_progress_missing_learner_id
  test_update_progress_persists_after_reload
tests/test_content_tools.py:
  test_get_chapter_content_valid
  test_get_chapter_content_invalid_chapter
  test_get_chapter_content_free_tier_blocked
  test_get_chapter_content_paid_tier_allowed
  test_get_exercises_valid
  test_get_exercises_invalid_chapter
  test_get_exercises_free_tier_blocked
  test_get_exercises_paid_tier_allowed
Compare these names against your matrix. Every cell in the matrix should have at least one matching test function. If a cell is missing, steer Claude Code:
The matrix says submit_code needs a tier gating test. I do not see
test_submit_code_free_tier_blocked in test_code_tools.py. Add it.
This is the same describe-steer-verify cycle from building the tools. Now you are applying it to tests.
Step 5: Run the Suite
uv run pytest -v
The -v flag shows each test name and its result. You will see a mix of passes and failures.
Expected output looks something like:
tests/test_state_tools.py::test_register_learner_valid PASSED
tests/test_state_tools.py::test_register_learner_missing_name PASSED
tests/test_state_tools.py::test_register_learner_persists_after_reload PASSED
tests/test_content_tools.py::test_get_chapter_content_valid PASSED
tests/test_content_tools.py::test_get_chapter_content_free_tier_blocked FAILED
tests/test_pedagogy_tools.py::test_generate_guidance_valid PASSED
tests/test_pedagogy_tools.py::test_assess_response_valid FAILED
tests/test_code_tools.py::test_submit_code_valid PASSED
tests/test_code_tools.py::test_submit_code_free_tier_blocked FAILED
...
Some tests pass. Some fail. That is the point. A test suite that passes 100% on the first run is either trivial or lying. The failures tell you where the problems are.
Step 6: Read the Failures (Do Not Fix Yet)
Look at each failure message. Note which tool, which category, and what went wrong. Do not ask Claude Code to fix anything yet. That is Lesson 12.
For now, record the failures. Ask Claude Code:
Summarize the test failures. For each failing test, tell me:
1. Which tool
2. Which test category (valid, invalid, tier gating, persistence)
3. What the expected result was
4. What the actual result was
Do not fix anything. Just report.
This is your test report. It is a precise map of what works and what does not. No guessing, no "it seemed fine from WhatsApp." The suite proves it.
James ran uv run pytest -v and watched the output scroll. Green dots. Red dots. A final summary line.
"Three failures." He frowned.
"Good." Emma leaned forward and read the screen. "Three failures out of twenty-four tests. You know exactly where the problems are. That is the point of a test suite."
"But they were working from WhatsApp."
"Working from WhatsApp means the happy path works when you test it by hand, once, with the exact input you happened to type. Working from a test suite means every path works, every time, with every kind of input." She pointed at the three red lines. "Those failures are a gift. You know the tool name, the test category, and the exact mismatch between expected and actual."
James stared at the failures. "I thought I would have to debug all of this myself."
"You describe the failures to Claude Code. It fixes them. Same workflow as building the tools." Emma paused, then added more quietly: "I will be honest. I never know if a test suite is thorough enough. Three categories per tool is a solid start. Edge cases always surprise you later. But you have a foundation now, and that matters more than perfection."
She tapped the screen. "Lesson 12. Fix the failures. Add the edge cases that this run revealed. Get to green."
Try With AI
Exercise 1: Audit the Test Matrix
Review the test matrix and ask Claude Code whether any categories are missing:
Here is my test matrix for TutorClaw:
- Valid input: all 9 tools
- Invalid input: all 9 tools
- Tier gating: get_chapter_content, get_exercises, submit_code
- State persistence: register_learner, get_learner_state, update_progress
Are there test categories I am missing? What about boundary conditions
like empty strings, very long names, or special characters in learner
names? Suggest additional test categories but do not write the tests.
What you are learning: A test matrix is a living document. The first version covers the obvious categories. Reviewing it with AI reveals gaps you had not considered, like input boundary conditions and concurrent access patterns.
Exercise 2: Explain the Isolation Requirement
Ask Claude Code to explain what would happen without test isolation:
My TutorClaw tests use temporary directories for JSON files. What
would happen if the tests read from and wrote to the production
data/learners.json file instead? Give me three specific problems
that would occur.
What you are learning: Test isolation is not a best practice you memorize. It is a design decision with concrete consequences. Understanding the failure mode (corrupted production data, tests that pass locally but fail on a clean machine, tests that interfere with each other) makes the decision obvious rather than arbitrary.
Exercise 3: Compare to Manual Testing
Ask Claude Code to compare your automated suite to the manual WhatsApp testing you did in Lesson 8:
In Lesson 8, I tested TutorClaw by sending WhatsApp messages and
checking the responses. Now I have a pytest suite. What does the
pytest suite catch that WhatsApp testing misses? What does WhatsApp
testing catch that pytest misses? When would I use each?
What you are learning: Automated tests and manual integration tests serve different purposes. Neither replaces the other. Knowing when to use each saves you from both false confidence (all tests pass, but the real flow is broken) and wasted effort (manually testing what a script can verify in seconds).