Build the Code and Upgrade Tools
James counted on his fingers. "Seven tools done. Two to go. Code execution and the upgrade link."
Emma looked up from her terminal. "The code tool is the dangerous one. A learner sends Python code and your server runs it. Think about that for a second."
"So we need a sandbox."
"Eventually. Right now we need a mock that is good enough to test the tutoring flow. A subprocess with a timeout. Five seconds. If the code finishes, return the output. If it hangs, kill it. Basic safety: block imports of os and subprocess so nobody deletes your files from inside a tutoring session."
"And the upgrade URL?"
"Placeholder. A string that looks like a Stripe checkout link. Real Stripe comes after tests." She turned back to her screen. "Two tools, both mocked. One hour."
You are doing exactly what James is doing. Two more tools to build, both using mock implementations that are good enough to test the full product flow. Production-grade sandboxing and real Stripe integration come later. Right now, the goal is completing the tool surface.
Tool 7: submit_code
This tool lets a learner submit Python code and get the output back. In production, you would run that code in a Docker container or a browser-based sandbox like Pyodide. For now, a subprocess with a timeout is sufficient to test the tutoring loop.
Open Claude Code in your tutorclaw-mcp project and describe the tool:
Add a submit_code tool to the TutorClaw MCP server. It takes a string
of Python code from the learner. It runs the code in a subprocess with
a 5-second timeout. It captures stdout and stderr and returns both.
Include basic safety checks before execution:
- Reject code that imports os, subprocess, or shutil
- Reject code that uses open() for file access
If the code times out, return an error message saying the code took
too long. If the safety check fails, return an error saying which
import or function was blocked.
This is a mock sandbox. Real sandboxing comes later. The mock is
sufficient for testing the tutoring flow.
Notice what you told Claude Code and what you left out. You described the behavior (run code, capture output), the constraints (timeout, blocked imports), and the scope (mock, not production). You did not describe how to implement subprocess calls or how to parse import statements. Those are implementation decisions for Claude Code.
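To make the behavior concrete, here is a minimal sketch of what Claude Code might produce. All names here (`submit_code`, `BLOCKED_PATTERNS`) are illustrative, not what the agent will necessarily generate, and the substring-based safety check is deliberately naive, exactly the kind of implementation decision you left to Claude Code:

```python
import subprocess
import sys

# Naive substring check; an AST-based check would be stricter.
# These names are hypothetical -- Claude Code may structure this differently.
BLOCKED_PATTERNS = ["import os", "import subprocess", "import shutil", "open("]

def submit_code(code: str) -> dict:
    """Run learner-submitted code in a subprocess with a 5-second timeout (mock sandbox)."""
    for pattern in BLOCKED_PATTERNS:
        if pattern in code:
            return {"error": f"Blocked: {pattern!r} is not allowed in the sandbox."}
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=5,
        )
    except subprocess.TimeoutExpired:
        return {"error": "Code took too long (5-second limit)."}
    return {"stdout": result.stdout, "stderr": result.stderr}
```

The point of the sketch is the shape: check first, run second, and return errors as data the agent can read rather than raising exceptions that crash the tool call.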
Review and Steer
When Claude Code returns the spec, check three things:
| Check | What to Look For |
|---|---|
| Tool description | Specific enough that the agent knows to call this tool when a learner submits code, not when they ask a question about code |
| Safety checks | The blocked imports and functions are listed explicitly, not hidden behind a vague "security check" |
| Timeout behavior | The spec says what happens when code times out (error message, not a crash) |
If the description is vague, steer it:
Make the tool description more specific. It should say: "Execute
learner-submitted Python code in a sandboxed subprocess. Call this
when a learner submits code for evaluation, not when they ask
questions about code."
Once the spec looks right, approve the build:
The spec looks good. Build this.
Verify submit_code
After Claude Code finishes building, test the tool with three cases:
Case 1: Valid code that finishes quickly
Call submit_code with print("hello"). You should get back stdout containing hello.
Case 2: Code that runs forever
Call submit_code with while True: pass. The tool should return a timeout error after 5 seconds, not hang indefinitely.
Case 3: Blocked import
Call submit_code with import os; os.listdir("."). The tool should reject the code before running it, returning an error that names the blocked import.
All three cases passing means the mock sandbox works for testing purposes. It is not production-safe (a determined user could still cause problems), but it is sufficient to test the tutoring flow where a learner submits exercises.
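The timeout behavior in Case 2 is worth understanding in isolation, because it is what `subprocess.run` gives you for free. A few standalone lines demonstrate it (this runs for about five seconds):

```python
import subprocess
import sys

# subprocess.run raises TimeoutExpired and kills the child process
# when the timeout elapses -- the mock sandbox relies on exactly this.
try:
    subprocess.run(
        [sys.executable, "-c", "while True: pass"],
        capture_output=True, timeout=5,
    )
    timed_out = False
except subprocess.TimeoutExpired:
    timed_out = True

print(timed_out)
```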
Tool 8: get_upgrade_url
This tool creates a checkout link for learners who want to upgrade from free to paid. Real Stripe integration comes in Lesson 14. For now, a placeholder URL is enough to prove the upgrade flow works.
Describe the tool to Claude Code:
Add a get_upgrade_url tool to the TutorClaw MCP server. It takes a
learner_id. It checks the learner's current tier from the JSON state.
If the learner is on the free tier, return a mock upgrade URL like
"https://checkout.stripe.com/mock-session-id".
If the learner is already on the paid tier, return an error saying
the learner is already upgraded.
This is a placeholder. Real Stripe checkout replaces this in Lesson 14.
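A plausible shape for the placeholder, assuming a state file like the one your L3 state tools maintain. The file name `state.json` and the `{"learners": {id: {"tier": ...}}}` layout are assumptions; match whatever your server actually uses:

```python
import json
from pathlib import Path

# Assumed state file and layout -- adjust to your server's actual schema.
STATE_FILE = Path("state.json")
MOCK_CHECKOUT_URL = "https://checkout.stripe.com/mock-session-id"

def get_upgrade_url(learner_id: str) -> dict:
    """Return a mock checkout URL for free-tier learners."""
    state = json.loads(STATE_FILE.read_text())
    learner = state["learners"].get(learner_id)
    if learner is None:
        return {"error": f"Unknown learner: {learner_id}"}
    if learner["tier"] == "paid":
        return {"error": "Learner is already upgraded."}
    # MOCK: replace with a real Stripe Checkout session in Lesson 14.
    return {"upgrade_url": MOCK_CHECKOUT_URL}
```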
Verify get_upgrade_url
Two test cases:
Case 1: Free-tier learner requests upgrade
Call get_upgrade_url with a learner ID that has a free tier. You should get back the mock URL.
Case 2: Paid-tier learner requests upgrade
Call get_upgrade_url with a learner ID that has a paid tier (you may need to manually edit the JSON state file to set a learner's tier to "paid" for this test). You should get an error message indicating the learner is already upgraded.
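The manual edit for Case 2 can be scripted so the test is repeatable. This snippet assumes the same hypothetical `state.json` layout as above; the learner ID is made up, and the seeding step only exists so the snippet runs standalone:

```python
import json
from pathlib import Path

state_path = Path("state.json")  # hypothetical file name -- match your server's
state = json.loads(state_path.read_text()) if state_path.exists() else {"learners": {}}

# Flip (or create) a learner's tier to "paid" for the Case 2 test.
state.setdefault("learners", {}).setdefault("learner-123", {})["tier"] = "paid"
state_path.write_text(json.dumps(state, indent=2))
```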
Both cases passing means the upgrade flow will work when you wire all nine tools together in the next lesson.
Nine Tools Complete
Step back and count. Over four lessons, you described nine tools to Claude Code:
| Lesson | Tools Built | Count |
|---|---|---|
| L3: State Tools | register_learner, get_learner_state, update_progress | 3 |
| L4: Content Tools | get_chapter_content, get_exercises | 2 |
| L5: Pedagogy Tools | generate_guidance, assess_response | 2 |
| L6: Code and Upgrade Tools | submit_code, get_upgrade_url | 2 |
| Total | | 9 |
Every tool was built the same way: describe what you need, steer the spec, let Claude Code implement, verify the result. The tools themselves are different (state management, content delivery, teaching methodology, code execution, payments), but the workflow never changed.
Two of those tools are mocks. That is intentional. The submit_code tool runs real code in a subprocess, which is enough to test whether the tutoring loop works. The get_upgrade_url tool returns a placeholder, which is enough to test whether the tier-gating flow works. Neither mock blocks progress on the product. Both will be replaced when the product needs production infrastructure.
A mock becomes technical debt when you forget it exists. Write a comment in each tool marking it: "MOCK: replace with real Stripe checkout in Lesson 14" in get_upgrade_url, and "MOCK: replace with a production sandbox" in submit_code. The comment is a reminder, not an apology.
Try With AI
Exercise 1: Add a Third Safety Constraint
The current submit_code mock blocks os, subprocess, shutil, and open(). There are other dangerous operations a learner could attempt. Describe an additional constraint to Claude Code:
The submit_code tool should also reject code that uses eval() or
exec(). These functions can execute arbitrary code and bypass the
import restrictions. Add this check and update the tests.
Run the tests after the change. Does the new constraint work without breaking existing tests?
What you are learning: Safety constraints are additive. Each one narrows what untrusted code can do. The skill is knowing which constraints matter for your threat model and which ones are overkill for a mock.
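Substring matching catches `eval(` and `exec(`, but Python's `ast` module lets you check the parsed structure instead, which is harder to sneak past. This is one possible direction Claude Code might take; the function name and blocklists are illustrative:

```python
import ast
from typing import Optional

# Illustrative blocklists matching the constraints described in this lesson.
BLOCKED_CALLS = {"eval", "exec", "open"}
BLOCKED_IMPORTS = {"os", "subprocess", "shutil"}

def safety_violation(code: str) -> Optional[str]:
    """Return a description of the first blocked construct, or None if clean."""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return f"Syntax error: {e.msg}"
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = [alias.name for alias in node.names]
            module = getattr(node, "module", None)
            for name in names + ([module] if module else []):
                if name and name.split(".")[0] in BLOCKED_IMPORTS:
                    return f"Blocked import: {name}"
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                return f"Blocked call: {node.func.id}()"
    return None
```

Even this is not production-safe (attribute tricks like `__builtins__` lookups slip through), which is Exercise 2's point: know the gap, then decide when to close it.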
Exercise 2: Test the Boundary Between Mock and Real
Ask Claude Code to evaluate where your mock falls short:
Compare the submit_code mock to a real sandboxed code execution
environment. What can a determined user do in the current mock that
they could not do in a Docker-based sandbox? List specific attack
vectors.
What you are learning: Mocks have known limitations. The value of a mock is not that it is safe. The value is that it lets you test the product flow while you defer the safety investment. Understanding the gap between mock and production helps you decide when to replace it.
Exercise 3: Design a Better Mock Error Message
When get_upgrade_url is called for a paid learner, the error message matters because the agent reads it and decides what to tell the learner. Ask Claude Code to improve the error:
The error message from get_upgrade_url when the learner is already
paid says "learner is already upgraded." Rewrite it so the agent
knows to congratulate the learner and suggest they explore paid
content instead. The error message is context for the agent, not
a message shown to the learner.
What you are learning: Tool error messages are instructions to the agent. A good error message tells the agent what to do next, not just what went wrong. This is the same context engineering principle from tool descriptions, applied to error paths.
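One way the improved error could look as structured data. The field names and wording here are illustrative; the exact phrasing is yours to steer:

```python
# Illustrative agent-facing error payload. The "agent_hint" field tells the
# agent what to do next, not just what went wrong.
ALREADY_PAID_ERROR = {
    "error": "already_upgraded",
    "agent_hint": (
        "This learner is already on the paid tier. Congratulate them and "
        "suggest exploring paid-tier content instead of offering a checkout link."
    ),
}
```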
James scrolled through his terminal. "Nine tools. register_learner, get_learner_state, update_progress, get_chapter_content, get_exercises, generate_guidance, assess_response, submit_code, get_upgrade_url." He counted them off on his fingers. "Four lessons. All working."
"All working in isolation," Emma corrected. "Each tool does its job when you call it directly. But nobody is calling them in sequence yet. A real tutoring session starts with register, moves to content, generates guidance, assesses a response, maybe executes submitted code. That is a chain of tool calls, not nine separate calls."
"So the next step is wiring them together?"
Emma nodded, then paused. "I will say this, though. I have shipped mocks that became permanent. Three years later someone finds a subprocess call with a five-second timeout in production and wonders who thought that was acceptable." She pointed at his screen. "Set a reminder. Write a comment. Something that says: this is a mock, replace it by Lesson 14. If you do not mark it, you will forget it."
"You sound like you are speaking from experience."
"I am speaking from a post-mortem." She picked up her coffee. "Next lesson: wire all nine tools into one server and run a complete tutoring flow. Isolation is over. Integration starts."