Harden and Polish
Emma pulled up James's TutorClaw on her phone and sent the closest thing WhatsApp allows to an empty message: a single space, then send.
The response came back fast: a wall of Python. TypeError, KeyError, a file path from James's laptop, a line number deep inside server.py.
"Your product just leaked its implementation to the user," Emma said, turning the screen toward him.
James squinted at the traceback. "That is just an edge case. Nobody sends an empty message."
"Every user is an edge case generator. A two-year-old borrows a phone and mashes the keyboard. Someone pastes an emoji where a name should go. A learner types chapter 9999 because they are curious." She set the phone down. "You built a product that works when inputs are perfect. Now make it work when inputs are not."
You are doing exactly what James is doing. TutorClaw works when everything goes right. Now you make it handle the cases where everything goes wrong.
In this lesson, you send malformed inputs to every tool, observe the failures, describe hardening requirements to Claude Code, add structured logging, and verify the result. By the end, every bad input produces a clear message instead of a crash, and every tool call leaves a structured record in a log file.
Step 1: Send Malformed Inputs
Before fixing anything, see what breaks. Open Claude Code in your tutorclaw-mcp project and ask it to test each tool with bad inputs:
Test each of the 9 TutorClaw tools with malformed inputs and show
me what happens. Try these specific cases:
- register_learner with an empty string for the name
- get_learner_state with a learner_id that does not exist
- get_chapter_content with chapter number -1
- get_chapter_content with chapter number 9999
- submit_code with code that tries to import os
- get_upgrade_url for a learner_id that is not in the system
- assess_response with an empty answer string
- update_progress with a negative confidence value
Run each one and show me the exact error response.
Categorize what you see:
| Failure Type | What It Looks Like | Why It Matters |
|---|---|---|
| Crash | Python traceback returned to the user | Leaks file paths, line numbers, internal variable names |
| Confusing error | "Error: None" or "KeyError: learner-xyz" | User has no idea what to do differently |
| Silent success | Tool accepts chapter -1 and returns empty content | No signal that input was wrong |
Step 2: Describe Hardening to Claude Code
Now describe the fix. Tell Claude Code what "valid" means for each tool:
I want to add input validation and clear error handling to all 9
TutorClaw tools. Here are the rules:
For ALL tools:
- No empty strings for any required text parameter
- Return a clear error message that tells the user what was wrong
and what to do instead (not a stack trace, not a Python exception)
- Never expose file paths, line numbers, or internal variable names
in error responses
Specific validation rules:
- register_learner: name must be 1-200 characters, no control
characters
- get_learner_state: if learner_id does not exist, return
"Learner not found. Register first with your name."
- get_chapter_content: chapter number must be a positive integer
within the range of available chapters
- get_exercises: same chapter range validation
- submit_code: reject any code containing import statements for
os, sys, subprocess, or shutil (basic safety)
- get_upgrade_url: if learner_id does not exist, return
"Learner not found" (not a crash)
- assess_response: answer must be a non-empty string
- update_progress: confidence must be between 0.0 and 1.0
- generate_guidance: stage must be one of the valid PRIMM stages
Wrap all tool handlers in error handling so that unexpected errors
return "Something went wrong. Please try again." instead of a
traceback.
Review the changes Claude Code proposes before approving. Check: does each validation rule produce a message that helps the user fix their input? Is there a catch-all handler for unexpected errors?
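As a reference point, the validate-then-catch-all shape Claude Code proposes usually looks something like this minimal sketch. `MAX_CHAPTER`, `load_chapter`, and the response dict shape are illustrative assumptions, not the real server code:

```python
# Sketch of the validate-then-catch-all pattern; MAX_CHAPTER, load_chapter,
# and the {"status", "message"} response shape are illustrative assumptions.
MAX_CHAPTER = 12

def load_chapter(chapter: int) -> str:
    # Stand-in for the real content loader.
    return f"Chapter {chapter} content"

def get_chapter_content(chapter: int) -> dict:
    # 1. Validate before doing any work, and answer with a user-facing message.
    if not isinstance(chapter, int) or not 1 <= chapter <= MAX_CHAPTER:
        return {
            "status": "error",
            "message": f"Invalid chapter number. Choose a chapter between 1 and {MAX_CHAPTER}.",
        }
    # 2. Catch-all so unexpected failures never leak a traceback to the user.
    try:
        return {"status": "success", "content": load_chapter(chapter)}
    except Exception:
        return {"status": "error", "message": "Something went wrong. Please try again."}
```

The structure is the same for every tool: validation first, real work inside a catch-all, and every exit path returns a response object rather than raising.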
Step 3: Add Structured Logging
Validation tells users what went wrong. Logging tells you what happened:
Add JSON-structured logging to the TutorClaw server. Every tool
call should log a JSON object with these fields:
- timestamp: ISO 8601 format
- tool_name: which tool was called
- learner_id: who called it (or "anonymous" if not available)
- parameters: the input parameters with any sensitive values
replaced by "[REDACTED]"
- result_status: "success" or "error"
- error_message: the error message if status is "error", omitted
if success
- duration_ms: how long the tool call took in milliseconds
Write logs to data/tutorclaw.log, one JSON object per line.
Use Python's built-in logging module, not print statements.
One JSON object per line means each log entry is a complete, parseable record. You can filter by tool name, find all errors for a specific learner, or calculate average response times. Print statements give you none of that.
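A minimal sketch of what that setup might look like, assuming each tool handler calls a shared helper after it finishes. The helper name `log_tool_call` is illustrative; the field names follow the Step 3 schema:

```python
import json
import logging
from datetime import datetime, timezone
from pathlib import Path

# Illustrative setup sketch; the path and field names follow the Step 3 schema.
Path("data").mkdir(exist_ok=True)
logger = logging.getLogger("tutorclaw")
logger.setLevel(logging.INFO)
handler = logging.FileHandler("data/tutorclaw.log")
handler.setFormatter(logging.Formatter("%(message)s"))  # the message is already JSON
logger.addHandler(handler)

def log_tool_call(tool_name, learner_id, parameters,
                  result_status, error_message=None, duration_ms=0):
    """Write one JSON object per line, one entry per tool call."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool_name": tool_name,
        "learner_id": learner_id or "anonymous",
        "parameters": parameters,  # redact sensitive values before passing in
        "result_status": result_status,
        "duration_ms": duration_ms,
    }
    if error_message is not None:
        entry["error_message"] = error_message
    logger.info(json.dumps(entry))
```

A handler would time its work with `time.perf_counter()`, redact sensitive parameter values, and call this helper on both the success and error paths.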
Step 4: Verify Hardening
Resend the same malformed inputs from Step 1:
Run the same malformed input tests from earlier:
- register_learner with empty name
- get_learner_state with a nonexistent learner_id
- get_chapter_content with chapter -1 and 9999
- submit_code with code that imports os
- get_upgrade_url for a nonexistent learner
- assess_response with empty answer
- update_progress with negative confidence
Show me the error response for each one. Then show me the last
10 entries in data/tutorclaw.log.
Compare the results to Step 1:
| Tool | Before | After |
|---|---|---|
| register_learner("") | Python TypeError traceback | "Name is required. Provide a name between 1 and 200 characters." |
| get_chapter_content(-1) | Empty response, no error | "Invalid chapter number. Choose a chapter between 1 and N." |
| submit_code("import os") | Code executed successfully | "Code contains restricted imports (os). Remove them and try again." |
Check the log file. Each malformed input should have produced a structured entry with these fields:
| Field | Example Value |
|---|---|
| timestamp | 2026-04-04T14:23:01.442Z |
| tool_name | register_learner |
| learner_id | anonymous |
| parameters | name: (empty) |
| result_status | error |
| error_message | Name is required. Provide a name between 1 and 200 characters. |
| duration_ms | 2 |
If any tool still crashes, still returns a confusing message, or does not appear in the log, describe the specific gap to Claude Code and have it fix that tool.
Step 5: Update the Test Suite
The pytest suite from Lessons 11-12 tested the happy path. Hardening added new behavior that needs test coverage:
Add hardening tests to the pytest suite. For each tool, add tests
for:
- Empty string inputs where strings are required
- Out-of-range numeric values (negative, zero, impossibly large)
- Nonexistent learner_ids
- Invalid enum values (wrong PRIMM stage names)
- Restricted code submissions (import os, import subprocess)
Each test should verify two things:
1. The tool returns a clear error message (not a traceback)
2. The tool does not crash (returns a proper response object)
Run the full suite after adding the tests.
Run the suite:
uv run pytest
All tests, both the original suite and the new hardening tests, should pass.
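For reference, one hardening test might look like this sketch. The `register_learner` function here is a stand-in for importing the real tool from your server, and the `{"status", "message"}` response shape is an assumption:

```python
# Sketch of a hardening test. register_learner is a stand-in for the real
# server import, and the response dict shape is an illustrative assumption.

def register_learner(name: str) -> dict:
    if not name or not name.strip():
        return {"status": "error",
                "message": "Name is required. Provide a name between 1 and 200 characters."}
    return {"status": "success", "learner_id": f"learner-{name.lower()}"}

def test_register_learner_rejects_empty_name():
    result = register_learner("")
    # 1. A clear error message, not a traceback
    assert result["status"] == "error"
    assert "Name is required" in result["message"]
    # 2. The call returned a proper response object instead of raising
    assert isinstance(result, dict)
```

Each malformed input from Step 1 becomes one such test, so a future change that reintroduces a crash fails the suite immediately.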
Try With AI
Exercise 1: Audit the Error Messages
Review every error message Claude Code wrote for your validation layer:
List every error message in the TutorClaw server. For each one,
evaluate: does this message tell the user (1) what went wrong,
(2) what they should do instead? Flag any message that fails
either test.
What you are learning: A good error message is a tiny piece of documentation. "Invalid input" fails both tests. "Chapter number must be between 1 and 12. You sent -1." passes both. The quality of your error messages determines whether users retry with correct input or give up.
Exercise 2: Stress the Logging
Generate a burst of tool calls and analyze the log:
Call register_learner 5 times with valid names, then call
get_chapter_content 3 times (2 valid, 1 invalid), then call
submit_code with restricted code once.
After all calls complete, read data/tutorclaw.log and answer:
- How many total entries are there?
- How many have result_status "error"?
- What is the average duration_ms across all calls?
- Which tool was called most frequently?
What you are learning: Structured logs are queryable data. When TutorClaw has real users, you can answer "which tool fails most often?" and "are response times getting slower?" without adding any new code. The log file is your operations dashboard.
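If you want to check Claude Code's answers yourself, one-object-per-line logs take only a few lines of Python to summarize. The field names follow the Step 3 schema; `summarize_log` is an illustrative helper, not part of the server:

```python
import json
from collections import Counter

def summarize_log(path="data/tutorclaw.log"):
    # Parse one JSON object per line, skipping blank lines.
    with open(path) as f:
        entries = [json.loads(line) for line in f if line.strip()]
    errors = sum(1 for e in entries if e["result_status"] == "error")
    avg_ms = sum(e["duration_ms"] for e in entries) / len(entries) if entries else 0
    by_tool = Counter(e["tool_name"] for e in entries)
    return {
        "total": len(entries),
        "errors": errors,
        "avg_duration_ms": avg_ms,
        "top_tool": by_tool.most_common(1)[0][0] if by_tool else None,
    }
```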
Exercise 3: Design a Log Alert Rule
Think about what patterns in the log would indicate a problem worth investigating:
Look at the structured log format for TutorClaw. If I wanted to
set up alerts for production, what three log patterns would you
monitor? For each pattern, explain what it would catch and why
it matters.
Example pattern: more than 10 errors from the same learner_id
in 5 minutes (possible confused user or automated abuse).
What you are learning: Structured logs are not just for debugging after something breaks. They are the foundation for monitoring and alerting. The fields you chose in Step 3 (tool_name, learner_id, result_status, duration_ms) are exactly the fields that monitoring systems query. Good logging design happens before you need the logs.
James resent every malformed input from the morning. Empty names, impossible chapter numbers, restricted imports. Each one came back with a clear sentence telling the user what went wrong and how to fix it.
He opened the log file. Neat rows of JSON, one per line. Timestamp, tool name, learner ID, status, duration. Every call recorded.
"The product feels professional now," he said.
"Professional is a word for 'it does not leak its internals when surprised.'" Emma closed her laptop halfway, then paused. "I shipped a product once that returned database connection strings in error messages. Host, port, username, the full connection URL. A security researcher found it through normal usage, not even trying to break anything." She opened the laptop back up. "Responsible disclosure, fortunately. But we pushed an emergency patch at 2 AM. That is why I care about error messages more than features."
James looked at the log file again. "So we have validation, logging, and tests for both. What is left?"
"Publishing. Your product works. Your product handles surprises. Your product records what happens. Now other people need to be able to install it." She pointed at the ClawHub tab in his browser. "Lesson 20."