Build the Pedagogy Tools
James stared at the WhatsApp response from TutorClaw. He had sent "Teach me about variables" and got back a wall of text: definitions, examples, exercises, all at once.
"It fetches the content and shows it," he said, scrolling through the message. "When I trained new hires at the warehouse, I never did this. I walked them through it. Asked them what they thought would happen before they tried the forklift. Watched them try. Then asked why it went sideways."
Emma looked at the screen. "You just described a pedagogical framework. Predict what happens, run it, investigate why the result was different from the prediction." She pulled up a chair. "That three-step loop has a name: PRIMM-Lite. Build it."
You are in the same position as James. Your content tools deliver raw material, but they do not teach. In this lesson, you will describe two pedagogy tools to Claude Code that turn TutorClaw from a content dump into a tutor that walks learners through material the way James trained warehouse staff.
PRIMM-Lite: The Three-Stage Teaching Loop
Before you describe anything to Claude Code, you need to understand what you are asking it to build. PRIMM-Lite is a simplified version of the PRIMM teaching methodology (Predict, Run, Investigate, Modify, Make), trimmed to its first three stages:
| Stage | What Happens | Example |
|---|---|---|
| Predict | The tutor shows code and asks "What do you think this will do?" before revealing the output | "Look at this loop. What will it print?" |
| Run | The tutor reveals the actual output and asks "Does this match your prediction?" | "Here is what it actually prints. Were you right?" |
| Investigate | The tutor asks "Why did the output differ?" or "What would happen if you changed X?" | "Why did it print 5 instead of 4? What if you changed the range?" |
The three stages cycle for each topic. A learner who predicts correctly advances quickly. A learner who predicts incorrectly spends more time in the Investigate stage, building understanding before moving forward.
Two pieces make this work as MCP tools:
- generate_guidance takes the learner's current stage and confidence, plus the chapter content, and returns a stage-appropriate prompt
- assess_response takes the learner's answer, the current stage, and the expected concepts, and returns a confidence_delta (positive or negative) with feedback
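Before sending anything to Claude Code, it can help to picture the shapes involved. Here is a minimal sketch of the two tools' inputs and outputs; the dataclass and field names are illustrative assumptions, not the spec Claude Code will produce:

```python
from dataclasses import dataclass

@dataclass
class LearnerState:
    stage: str         # "predict", "run", or "investigate"
    confidence: float  # 0.0 to 1.0

@dataclass
class Guidance:
    """What generate_guidance returns."""
    content: str                 # stage-appropriate text shown to the learner
    system_prompt_addition: str  # instruction telling the agent HOW to behave this turn

@dataclass
class Assessment:
    """What assess_response returns."""
    confidence_delta: float  # within [-0.3, 0.3]
    feedback: str            # what the learner got right, or what to focus on
    recommendation: str      # suggested next action for the agent, not a command
```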
Step 1: Describe generate_guidance to Claude Code
Open Claude Code in your tutorclaw-mcp project. You need to explain PRIMM-Lite as a requirement, not as code. Send this message:
I want to add a tool called generate_guidance. Here is how it works:
TutorClaw uses a 3-stage teaching loop called PRIMM-Lite:
- Predict: Show code without output, ask the learner to predict what it does
- Run: Reveal the output, ask if it matched their prediction
- Investigate: Ask why the output differed, or what would change if they modified the code
The tool takes two inputs:
- learner_state: includes the learner's current PRIMM stage (predict, run, or investigate) and their confidence score (0.0 to 1.0)
- chapter_content: the raw content for the current chapter
The tool output should have two parts:
- content: the stage-appropriate text to show the learner
- system_prompt_addition: a short instruction (under 200 tokens) telling
the agent HOW to behave this turn. For example, at the predict stage:
"The learner is in the PREDICT stage. Ask what they think this code
will do. Wait for their answer before revealing the output."
The system_prompt_addition is important because it steers the agent's
behavior for this turn. Without it, the agent might ignore the PRIMM
stage and just dump the content.
Spec this before building.
The system_prompt_addition field is the key design insight in this tool. The content is what the learner sees. The system_prompt_addition is what the agent follows. This is a tool that returns instructions, not just data.
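One way the spec might realize this per-stage steering is a simple lookup from stage to instruction. A sketch with assumed wording; the exact phrasing is whatever your spec settles on:

```python
# Illustrative per-stage instructions; the real spec may word these differently.
STAGE_INSTRUCTIONS = {
    "predict": (
        "The learner is in the PREDICT stage. Show the code without its "
        "output and ask what they think it will do. Wait for their answer "
        "before revealing the output."
    ),
    "run": (
        "The learner is in the RUN stage. Reveal the actual output and ask "
        "whether it matched their prediction."
    ),
    "investigate": (
        "The learner is in the INVESTIGATE stage. Ask why the output "
        "differed from the prediction, or what a small change would do."
    ),
}

def system_prompt_addition(stage: str) -> str:
    """Return the behavioral instruction for the current PRIMM-Lite stage."""
    return STAGE_INSTRUCTIONS[stage]
```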
Review the spec Claude Code produces. The key things to check:
| Element | What to Verify |
|---|---|
| Stage handling | Does the tool produce different output for each of the three stages? |
| system_prompt_addition | Does each stage return a different instruction? Predict should tell the agent to wait for a prediction. Run should tell the agent to compare. Investigate should tell the agent to ask probing questions. |
| Confidence awareness | Does it adjust prompt difficulty based on confidence? A learner at 0.2 confidence needs simpler prompts than one at 0.8 |
| Tool description | Is it specific enough that the agent knows when to call this versus get_chapter_content? |
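The confidence-awareness check can be made concrete with a threshold rule. A toy sketch, assuming three hypothetical difficulty tiers; the real spec may scale difficulty differently:

```python
def scaffold_level(confidence: float) -> str:
    """Map a 0.0-1.0 confidence score to a prompt difficulty tier (illustrative thresholds)."""
    if confidence < 0.4:
        return "supportive"   # shorter code, hints included
    if confidence < 0.7:
        return "standard"
    return "challenging"      # longer snippets, no hints
```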
If the tool description is vague ("generates teaching stuff"), steer it:
The tool description needs to be specific. It should say: this tool
generates stage-appropriate teaching prompts using the PRIMM-Lite
methodology. Call this tool when the agent needs to teach a concept,
not when the learner asks for raw content.
Once the spec looks right, tell Claude Code to build it:
The spec looks good. Build this.
Run the tests after it finishes:
uv run pytest
Step 2: Describe assess_response to Claude Code
The second pedagogy tool evaluates whether the learner's answer demonstrates understanding. Send this message:
Add a tool called assess_response. It evaluates a learner's answer
during a PRIMM-Lite teaching session.
It takes three inputs:
- answer_text: what the learner wrote
- primm_stage: which stage (predict, run, or investigate) they are in
- expected_concepts: a list of concepts the answer should demonstrate understanding of
It returns three things:
- confidence_delta: a float between -0.3 and 0.3 that adjusts the learner's confidence score. Positive means the answer showed understanding. Negative means it did not.
- feedback: text explaining what the learner got right or what to focus on next.
- recommendation: a short suggestion for the agent's next action, like "advance to the run stage" or "stay in predict with a simpler example." This is a suggestion the agent can follow, not a command.
A strong answer like "it prints numbers 1 through 5 because range generates
a sequence" should return a positive delta around 0.2.
A vague answer like "it does stuff" should return a negative delta around -0.1
with feedback pointing the learner to the specific concept they missed.
Spec this before building.
Review the spec. Watch for:
- Does the confidence_delta range make sense? Too wide (like -1.0 to 1.0) makes single answers swing the learner's state too far. Too narrow (like -0.01 to 0.01) makes progress invisible.
- Does the tool description distinguish this from generate_guidance? One generates prompts, the other evaluates answers. The agent must never confuse them.
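To see why the delta range matters, here is a toy scorer that matches concept keywords and clamps the result. It is purely illustrative: the real tool will judge meaning rather than substrings, and it also takes primm_stage, omitted here for brevity:

```python
def assess_response(answer_text: str, expected_concepts: list[str]) -> float:
    """Toy scorer: reward mentioned concepts, clamp the delta to [-0.3, 0.3]."""
    answer = answer_text.lower()
    hits = sum(1 for c in expected_concepts if c.lower() in answer)
    if hits == 0:
        return -0.1  # vague answer: small negative nudge
    raw = 0.1 * hits
    return max(-0.3, min(0.3, raw))  # clamp so one answer cannot swing state too far
```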
Approve and build:
The spec looks good. Build this.
Run tests:
uv run pytest
Step 3: Verify Both Tools
Now test the pedagogy tools together. Ask Claude Code to call them in sequence:
Test the pedagogy tools for me:
1. Call generate_guidance with a learner in the "predict" stage at
confidence 0.5, using chapter 1 content. Show me what it returns.
2. Call assess_response with the answer "I think it will print hello
world" for the predict stage, with expected concepts of
"print function" and "string output". Show me the confidence delta
and feedback.
3. Call assess_response again with the answer "I don't know" for the
same stage and concepts. Show me how the delta differs.
You are looking for three things:
- generate_guidance at predict stage returns a prompt that shows code and asks the learner to predict, not a prompt that reveals the answer
- assess_response with a reasonable answer returns a positive confidence_delta and encouraging feedback
- assess_response with a vague answer returns a negative confidence_delta and feedback that points the learner toward what they missed
If any of these are wrong, describe the problem to Claude Code and steer the fix. The describe-steer-verify cycle from Chapter 57 applies to every tool you build.
Two Tools, One Teaching Method
Step back and look at what these two tools do together. Before this lesson, TutorClaw could store learner state and fetch content. It had a filing cabinet and a bookshelf. Now it has a teaching method.
When the agent needs to teach a concept, it calls generate_guidance to get a stage-appropriate prompt. When the learner responds, it calls assess_response to evaluate the answer. The confidence_delta feeds back into the learner state (via update_progress from Lesson 3), and the next call to generate_guidance adjusts accordingly. A learner who keeps answering well moves through stages quickly. A learner who struggles gets more support at the current stage.
The agent orchestrates this loop by reading tool descriptions, not by following hardcoded logic. You described the methodology, Claude Code built the implementation, and the agent will use the descriptions to call the right tool at the right time.
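The feedback half of that loop can be sketched as a small update rule. The threshold, reset value, and advancement logic below are assumptions for illustration; in TutorClaw this bookkeeping belongs to update_progress:

```python
STAGES = ["predict", "run", "investigate"]

def apply_assessment(state: dict, confidence_delta: float, threshold: float = 0.6) -> dict:
    """Fold an assess_response delta back into learner state (toy update rule).

    Assumed rule: clamp confidence to [0, 1]; advance to the next PRIMM-Lite
    stage once confidence clears the threshold, resetting for the new stage.
    """
    confidence = max(0.0, min(1.0, state["confidence"] + confidence_delta))
    stage = state["stage"]
    if confidence >= threshold and stage != "investigate":
        stage = STAGES[STAGES.index(stage) + 1]  # advance to the next stage
        confidence = 0.5                          # reset for the new stage
    return {"stage": stage, "confidence": confidence}
```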
Try With AI
Exercise 1: Test All Three Stages
Walk through one complete PRIMM-Lite cycle by calling generate_guidance at each stage:
Call generate_guidance three times for the same chapter content:
1. Learner in "predict" stage, confidence 0.5
2. Learner in "run" stage, confidence 0.5
3. Learner in "investigate" stage, confidence 0.5
Show me all three responses side by side so I can compare them.
What you are learning: Each stage should produce a qualitatively different response. Predict asks for a prediction. Run reveals the answer and asks for comparison. Investigate probes deeper. If the three responses look similar, the tool's stage handling needs steering.
Exercise 2: Confidence Boundaries
Test what happens at extreme confidence values:
Call generate_guidance twice for the same chapter and stage (predict):
1. Learner confidence at 0.1 (struggling)
2. Learner confidence at 0.9 (confident)
How do the prompts differ? Should they differ?
What you are learning: A learner at 0.1 confidence needs simpler, more supportive prompts than one at 0.9. If the tool produces identical prompts regardless of confidence, you may want to steer Claude Code to add confidence-aware difficulty scaling.
Exercise 3: Edge Case Responses
Test assess_response with responses that are technically correct but miss the point:
Call assess_response for the "investigate" stage with expected concepts
"loop iteration" and "range function". Use this answer:
"The code runs and gives output"
What confidence_delta does it return? Now try:
"The for loop runs 5 times because range(5) generates 0,1,2,3,4
and the loop variable takes each value in sequence"
Compare the two deltas and the feedback text.
What you are learning: The quality gap between a vague answer and a specific one should produce a meaningful difference in confidence_delta. If both answers return similar scores, the assessment logic is too generous and needs tighter criteria.
James called generate_guidance with a learner in the predict stage. The response came back: a code snippet with the question "What do you think this will print?"
"It teaches instead of dumping." He grinned. "At the warehouse, the new hires who predicted first always learned the safety protocols faster. Something about committing to an answer before you see the result."
Emma nodded. "That commitment is the whole trick. Predict forces the learner to engage before they can passively scroll." She paused. "I shipped a tutor once without any pedagogical framework. Just search and display. Users said it was a search engine with extra steps. We bolted PRIMM onto it in a weekend and retention tripled." She shrugged. "Should have done it from the start."
"So we have state tools, content tools, and now pedagogy tools." James counted on his fingers. "What is left?"
"Two more. A way to run code and a way to get paid." Emma pointed at his screen. "Lesson 6."