
Build the Code and Upgrade Tools

James counted on his fingers. "Seven tools done. Two to go. Code execution and the upgrade link."

Emma looked up from her terminal. "The code tool is the dangerous one. A learner sends Python code and your server runs it. Think about that for a second."

"So we need a sandbox."

"Eventually. Right now we need a mock that is good enough to test the tutoring flow. A subprocess with a timeout. Five seconds. If the code finishes, return the output. If it hangs, kill it. Basic safety: block imports of os and subprocess so nobody deletes your files from inside a tutoring session."

"And the upgrade URL?"

"Placeholder. A string that looks like a Stripe checkout link. Real Stripe comes after tests." She turned back to her screen. "Two tools, both mocked. One hour."


You are doing exactly what James is doing. Two more tools to build, both using mock implementations that are good enough to test the full product flow. Production-grade sandboxing and real Stripe integration come later. Right now, the goal is completing the tool surface.

Tool 7: submit_code

This tool lets a learner submit Python code and get the output back. In production, you would run that code in a Docker container or a browser-based sandbox like Pyodide. For now, a subprocess with a timeout is sufficient to test the tutoring loop.

Open Claude Code in your tutorclaw-mcp project and describe the tool:

Add a submit_code tool to the TutorClaw MCP server. It takes a string
of Python code from the learner. It runs the code in a subprocess with
a 5-second timeout. It captures stdout and stderr and returns both.

Include basic safety checks before execution:
- Reject code that imports os, subprocess, or shutil
- Reject code that uses open() for file access

If the code times out, return an error message saying the code took
too long. If the safety check fails, return an error saying which
import or function was blocked.

This is a mock sandbox. Real sandboxing comes later. The mock is
sufficient for testing the tutoring flow.
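To make the prompt concrete, here is one plausible shape for what Claude Code might build: a minimal sketch assuming a plain Python function, with hypothetical names (`submit_code`, `BLOCKED_MODULES`). Tool registration and the exact return format depend on your MCP framework.

```python
import re
import subprocess
import sys

# Modules the mock sandbox refuses to let learner code import.
BLOCKED_MODULES = ("os", "subprocess", "shutil")

def submit_code(code: str) -> dict:
    """Run learner-submitted Python code in a subprocess with a 5-second
    timeout. MOCK: replace with a real sandbox (Docker, Pyodide) later."""
    # Safety check: reject blocked imports before execution.
    for mod in BLOCKED_MODULES:
        if re.search(rf"\b(import|from)\s+{mod}\b", code):
            return {"error": f"Blocked import: {mod}"}
    # Safety check: reject direct file access via open().
    if re.search(r"\bopen\s*\(", code):
        return {"error": "Blocked function: open()"}
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=5,
        )
    except subprocess.TimeoutExpired:
        return {"error": "Code took too long: 5-second timeout exceeded"}
    return {"stdout": result.stdout, "stderr": result.stderr}
```

Note the order: the safety checks run before the subprocess is ever spawned, so blocked code is rejected without executing at all.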

Notice what you told Claude Code and what you left out. You described the behavior (run code, capture output), the constraints (timeout, blocked imports), and the scope (mock, not production). You did not describe how to implement subprocess calls or how to parse import statements. Those are implementation decisions for Claude Code.

Review and Steer

When Claude Code returns the spec, check three things:

- Tool description: specific enough that the agent knows to call this tool when a learner submits code, not when they ask a question about code
- Safety checks: the blocked imports and functions are listed explicitly, not hidden behind a vague "security check"
- Timeout behavior: the spec says what happens when code times out (error message, not a crash)

If the description is vague, steer it:

Make the tool description more specific. It should say: "Execute
learner-submitted Python code in a sandboxed subprocess. Call this
when a learner submits code for evaluation, not when they ask
questions about code."

Once the spec looks right, approve the build:

The spec looks good. Build this.

Verify submit_code

After Claude Code finishes building, test the tool with three cases:

Case 1: Valid code that finishes quickly

Call submit_code with print("hello"). You should get back stdout containing hello.

Case 2: Code that runs forever

Call submit_code with while True: pass. The tool should return a timeout error after 5 seconds, not hang indefinitely.

Case 3: Blocked import

Call submit_code with import os; os.listdir("."). The tool should reject the code before running it, returning an error that names the blocked import.

All three cases passing means the mock sandbox works for testing purposes. It is not production-safe (a determined user could still cause problems), but it is sufficient to test the tutoring flow where a learner submits exercises.

Tool 8: get_upgrade_url

This tool creates a checkout link for learners who want to upgrade from free to paid. Real Stripe integration comes in Lesson 14. For now, a placeholder URL is enough to prove the upgrade flow works.

Describe the tool to Claude Code:

Add a get_upgrade_url tool to the TutorClaw MCP server. It takes a
learner_id. It checks the learner's current tier from the JSON state.

If the learner is on the free tier, return a mock upgrade URL like
"https://checkout.stripe.com/mock-session-id".

If the learner is already on the paid tier, return an error saying
the learner is already upgraded.

This is a placeholder. Real Stripe checkout replaces this in Lesson 14.
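One plausible implementation sketch follows. The state-file name (`learners.json`) and its shape (a dict of learner records keyed by ID, each with a `"tier"` field) are assumptions; your state tools from Lesson 3 define the real layout.

```python
import json

STATE_FILE = "learners.json"  # assumed name and shape of the JSON state

def get_upgrade_url(learner_id: str) -> dict:
    """Return a mock Stripe checkout URL for free-tier learners.
    MOCK: replace with real Stripe checkout in Lesson 14."""
    with open(STATE_FILE) as f:
        state = json.load(f)
    learner = state.get(learner_id)
    if learner is None:
        return {"error": f"Unknown learner: {learner_id}"}
    if learner.get("tier") == "paid":
        return {"error": "Learner is already upgraded."}
    return {"upgrade_url": "https://checkout.stripe.com/mock-session-id"}
```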

Verify get_upgrade_url

Two test cases:

Case 1: Free-tier learner requests upgrade

Call get_upgrade_url with a learner ID that has a free tier. You should get back the mock URL.

Case 2: Paid-tier learner requests upgrade

Call get_upgrade_url with a learner ID that has a paid tier (you may need to manually edit the JSON state file to set a learner's tier to "paid" for this test). You should get an error message indicating the learner is already upgraded.

Both cases passing means the upgrade flow will work when you wire all nine tools together in the next lesson.

Nine Tools Complete

Step back and count. Over four lessons, you described nine tools to Claude Code:

- L3 (State Tools): register_learner, get_learner_state, update_progress (3 tools)
- L4 (Content Tools): get_chapter_content, get_exercises (2 tools)
- L5 (Pedagogy Tools): generate_guidance, assess_response (2 tools)
- L6 (Code and Upgrade Tools): submit_code, get_upgrade_url (2 tools)

Total: 9 tools.

Every tool was built the same way: describe what you need, steer the spec, let Claude Code implement, verify the result. The tools themselves are different (state management, content delivery, teaching methodology, code execution, payments), but the workflow never changed.

Two of those tools are mocks. That is intentional. The submit_code tool runs real code in a subprocess, which is enough to test whether the tutoring loop works. The get_upgrade_url tool returns a placeholder, which is enough to test whether the tier-gating flow works. Neither mock blocks progress on the product. Both will be replaced when the product needs production infrastructure.

When Mocks Become Debt

A mock becomes technical debt when you forget it exists. Write a comment in both tools: "MOCK: Replace with real implementation in L14 (Stripe) and production sandbox." The comment is a reminder, not an apology.

Try With AI

Exercise 1: Add a Third Safety Constraint

The current submit_code mock blocks os, subprocess, shutil, and open(). There are other dangerous operations a learner could attempt. Describe an additional constraint to Claude Code:

The submit_code tool should also reject code that uses eval() or
exec(). These functions can execute arbitrary code and bypass the
import restrictions. Add this check and update the tests.
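The added check might look like this: a hypothetical helper (`check_dynamic_execution` is an illustrative name; where it slots in depends on how Claude Code structured the existing safety checks):

```python
import re

def check_dynamic_execution(code: str):
    """Reject eval() and exec(), which can execute arbitrary code and
    bypass the import checks. Returns an error string if blocked,
    or None if the code passes."""
    match = re.search(r"\b(eval|exec)\s*\(", code)
    if match:
        return f"Blocked function: {match.group(1)}()"
    return None
```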

Run the tests after the change. Does the new constraint work without breaking existing tests?

What you are learning: Safety constraints are additive. Each one narrows what untrusted code can do. The skill is knowing which constraints matter for your threat model and which ones are overkill for a mock.

Exercise 2: Test the Boundary Between Mock and Real

Ask Claude Code to evaluate where your mock falls short:

Compare the submit_code mock to a real sandboxed code execution
environment. What can a determined user do in the current mock that
they could not do in a Docker-based sandbox? List specific attack
vectors.

What you are learning: Mocks have known limitations. The value of a mock is not that it is safe. The value is that it lets you test the product flow while you defer the safety investment. Understanding the gap between mock and production helps you decide when to replace it.

Exercise 3: Design a Better Mock Error Message

When get_upgrade_url is called for a paid learner, the error message matters because the agent reads it and decides what to tell the learner. Ask Claude Code to improve the error:

The error message from get_upgrade_url when the learner is already
paid says "learner is already upgraded." Rewrite it so the agent
knows to congratulate the learner and suggest they explore paid
content instead. The error message is context for the agent, not
a message shown to the learner.

What you are learning: Tool error messages are instructions to the agent. A good error message tells the agent what to do next, not just what went wrong. This is the same context engineering principle from tool descriptions, applied to error paths.
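For instance, an agent-directed error string (illustrative wording, not the tool's actual output) might read:

```python
# Illustrative agent-directed error string, not the tool's actual output.
# It tells the agent what to do next, not just what went wrong.
ALREADY_PAID_ERROR = (
    "This learner is already on the paid tier. Do not offer an upgrade "
    "link. Congratulate them and suggest exploring paid content next."
)
```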


James scrolled through his terminal. "Nine tools. register_learner, get_learner_state, update_progress, get_chapter_content, get_exercises, generate_guidance, assess_response, submit_code, get_upgrade_url." He counted them off on his fingers. "Four lessons. All working."

"All working in isolation," Emma corrected. "Each tool does its job when you call it directly. But nobody is calling them in sequence yet. A real tutoring session starts with register, moves to content, generates guidance, assesses a response, maybe executes submitted code. That is a chain of tool calls, not nine separate calls."

"So the next step is wiring them together?"

Emma nodded, then paused. "I will say this, though. I have shipped mocks that became permanent. Three years later someone finds a subprocess call with a five-second timeout in production and wonders who thought that was acceptable." She pointed at his screen. "Set a reminder. Write a comment. Something that says: this is a mock, replace it by Lesson 14. If you do not mark it, you will forget it."

"You sound like you are speaking from experience."

"I am speaking from a post-mortem." She picked up her coffee. "Next lesson: wire all nine tools into one server and run a complete tutoring flow. Isolation is over. Integration starts."