Expected vs Actual
Emma handed James a blank sheet. "Before you send a single WhatsApp message, write down what you think should happen. Not what the tools do. What the EXPERIENCE should be."
James thought about it. "The learner says 'teach me about variables.' The tutor should check who they are, pull the chapter content, and then... not dump it. Ask them to predict first. That is the whole point of PRIMM."
"Write that down. The exact sequence. What the agent should call, what it should say, what it should NOT say."
James wrote. Five minutes. A page of expectations. Then Emma said: "Now test it. Send the message. Compare what you wrote to what actually happens."
You are doing exactly what James is doing. Before testing, you write down what a good tutoring session looks like. Then you test and grade the gaps. The gaps you find here become the motivation for AGENTS.md (Lesson 9) and context engineering (Lesson 10).
Step 1: Write Your Expectations
Before opening WhatsApp, write down what SHOULD happen when a learner sends "Teach me about variables in Chapter 1." Use this template:
Expected tool chain:
- Agent calls get_learner_state (or register_learner if new)
- Agent calls get_chapter_content for Chapter 1
- Agent calls generate_guidance for the predict stage
- Agent responds with a prediction question, NOT a content dump
Expected response qualities:
- Asks the learner to predict before showing answers (PRIMM predict stage)
- Does not dump the entire chapter content
- Mentions or references the specific topic (variables)
- Feels like a tutor, not a search engine
Write your version of this. Be specific. You are creating the acceptance criteria for your product.
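One way to make the expectations concrete is to write them as data. The sketch below is illustrative only: the tool names come from the TutorClaw server in earlier lessons, but the structure itself is a suggestion, not a required format.

```python
# Hypothetical acceptance criteria for the first learner message.
# Tool names match the TutorClaw MCP server; the data layout is illustrative.
EXPECTED_TOOL_CHAIN = [
    "get_learner_state",    # or "register_learner" for a brand-new learner
    "get_chapter_content",  # fetch Chapter 1
    "generate_guidance",    # predict stage, not a content dump
]

EXPECTED_QUALITIES = {
    "asks_prediction": True,        # PRIMM predict stage comes first
    "dumps_full_chapter": False,    # no wall of text
    "mentions_topic": "variables",  # response names the specific topic
}

print(len(EXPECTED_TOOL_CHAIN), "tools expected on the first message")
```

Writing the criteria as data instead of prose pays off in Step 4, where you compare them against what the dashboard actually logged.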
Step 2: Test from WhatsApp
Now send the message:
I want to learn about variables in Chapter 1
Wait for the response. This is different from the Chapter 57 test, where one message triggered one tool. Here, the agent should chain multiple tools:
- register_learner or get_learner_state: The agent checks if you are a known learner. If not, it registers you first.
- get_chapter_content: The agent fetches the content for Chapter 1.
- generate_guidance with the predict stage: The agent produces a PRIMM-Lite prompt asking you to predict what a variable does before showing the answer.
The response you receive asks you to think first. It does not hand you the definition. These are the pedagogy tools at work: generate_guidance is shaping the interaction into a teaching session.
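The first-message chain can be sketched as plain function calls. The stub bodies below are placeholders; the real implementations live on the MCP server, and only the tool names and call order come from the lesson.

```python
# Stub tools so the sketch runs locally; the real ones live on the MCP server.
def get_learner_state(learner_id):
    return None  # pretend this learner is unknown

def register_learner(learner_id):
    return {"id": learner_id, "progress": {}}

def get_chapter_content(chapter):
    return "A variable stores a value so you can use it later."

def generate_guidance(stage, content, state):
    if stage == "predict":
        # Predict stage: ask a question instead of dumping `content`.
        return "Before I explain: what do you think a variable does?"
    return content

def handle_first_message(learner_id):
    state = get_learner_state(learner_id)
    if state is None:
        state = register_learner(learner_id)  # first contact: register
    content = get_chapter_content(chapter=1)  # fetch the source material
    return generate_guidance("predict", content, state)

print(handle_first_message("james"))
```

The key move is the last line: the agent has the full chapter content in hand and deliberately withholds it, returning a prediction question instead.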
Continue the Conversation
Reply with a prediction. Something like:
I think a variable stores a value so you can use it later
The agent processes your reply through a second sequence of tools:
- assess_response: Evaluates your prediction against the expected understanding.
- update_progress: Records the interaction and adjusts your confidence score.
- generate_guidance with the run stage: Produces the next part of the lesson, now showing the actual content with guidance tailored to your prediction.
You are having a tutoring conversation. The agent is not running a script. It is selecting tools based on what you said, evaluating your response, and adapting the next step. Each message triggers a different combination of tools.
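The second turn can be sketched the same way. Again the stubs are placeholders (the keyword check and the confidence formula are invented for illustration); only the tool names and their order come from the lesson.

```python
# Stubs for the reply turn; real implementations live on the MCP server.
def assess_response(prediction):
    # Naive keyword check, purely illustrative of an assessment score.
    return 1.0 if "store" in prediction.lower() else 0.4

def update_progress(state, score):
    # Hypothetical confidence update: blend old confidence with the new score.
    state["confidence"] = round(0.7 * state.get("confidence", 0.5) + 0.3 * score, 2)
    return state

def generate_guidance_run(content, score):
    # Run stage: show the actual content, tailored to the prediction quality.
    prefix = "Exactly right." if score > 0.8 else "Close."
    return f"{prefix} {content}"

def handle_reply(state, reply):
    score = assess_response(reply)         # evaluate the prediction
    state = update_progress(state, score)  # record + adjust confidence
    return generate_guidance_run("A variable names a stored value.", score)

state = {"id": "james", "confidence": 0.5}
print(handle_reply(state, "I think a variable stores a value so you can use it later"))
```

Notice that a different tool set fires on this turn: assessment and progress tracking replace registration and content fetching, which is exactly what the dashboard badges should show.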
Step 3: Verify Tool Badges in the Dashboard
Go back to the dashboard. Find the conversation log for the messages you sent. For each message, you should see multiple tool badges showing which tools fired.
Your first message might show three or four badges: get_learner_state (or register_learner), get_chapter_content, generate_guidance.
Your reply might show two or three badges: assess_response, update_progress, generate_guidance.
This is the visible difference between Chapter 57 and Chapter 58. In Chapter 57, one message produced one badge. Here, one message produces multiple badges because the agent is orchestrating tools into a workflow.
| What You See | What It Means |
|---|---|
| Single tool badge | Agent called one tool (Ch57 pattern) |
| Multiple tool badges | Agent chained tools into a sequence (Ch58 pattern) |
| No tool badge | Agent generated a response without calling any tool (check if something went wrong) |
The tool badges are your proof that the product is working. A well-phrased text response without badges could be the agent hallucinating an answer from training data. The badges confirm the tools actually ran.
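The table above can be encoded as a small classifier over your conversation log. The log structure here is hypothetical (your dashboard export may look different); the three cases mirror the table exactly.

```python
# Hypothetical badge check: classify each dashboard entry the way the table does.
def classify(badges):
    if not badges:
        return "no tool badge: possible hallucination, investigate"
    if len(badges) == 1:
        return "single tool (Ch57 pattern)"
    return f"{len(badges)} tools chained (Ch58 pattern)"

# Illustrative log: one list of badges per message you sent.
log = [
    ["get_learner_state", "get_chapter_content", "generate_guidance"],
    ["assess_response", "update_progress", "generate_guidance"],
    [],
]
for badges in log:
    print(classify(badges))
```

Running a check like this over a real export turns badge verification from an eyeball exercise into something you can repeat after every change to the server.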
Step 4: Grade the Gaps
Now compare your expectations from Step 1 to what actually happened.
| Expectation | Actual | Gap? |
|---|---|---|
| Agent calls get_learner_state first | Did it? Check the first badge. | If it called get_chapter_content first, the session start order is wrong |
| Response asks learner to predict | Did the response ask a prediction question? | If it dumped content, generate_guidance is not shaping the response |
| Does not dump entire chapter | Was the response a wall of text? | If so, the agent ignored the system_prompt_addition |
| Feels like a tutor | Would you come back to this tomorrow? | If it feels like a search engine, identity is missing |
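The rows above can be turned into a minimal gap-grader, assuming you recorded the badges and the response text for each message. The checks and thresholds below are illustrative stand-ins for the judgments in the table ("wall of text" becomes a length cutoff, "asks to predict" becomes a question-mark check).

```python
# A minimal gap-grader; checks mirror the table, thresholds are illustrative.
def grade(actual_tools, response_text):
    gaps = []
    if not actual_tools or actual_tools[0] not in ("get_learner_state", "register_learner"):
        gaps.append("session did not start with a learner lookup")
    if "?" not in response_text:
        gaps.append("response never asked the learner a question")
    if len(response_text) > 600:
        gaps.append("response looks like a content dump")
    return gaps

# A session that skipped the learner lookup and dumped content:
print(grade(["get_chapter_content", "generate_guidance"],
            "A variable stores a value. " * 40))

# A session that matches the expectations:
print(grade(["get_learner_state", "get_chapter_content", "generate_guidance"],
            "What do you think a variable does?"))
```

An empty gap list means the session matched your written expectations; every entry in a non-empty list maps to one of the lessons listed below.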
Write down every gap you found. These gaps are your TODO list for the next four lessons:
- Tool ordering problems → L9 (AGENTS.md) defines the session protocol
- Wrong tool selected → L10 (context engineering) rewrites descriptions
- Tests needed for these behaviors → L11-L12 (test suite)
- No personality → L17 (dedicated agent with SOUL.md)
The gap list is the most valuable artifact in this lesson. It turns "this works" into "this works the way I intended."
Try With AI
Exercise 1: Map the Tool Chain
I just sent my TutorClaw agent the message "I want to learn about
variables in Chapter 1" and saw multiple tool badges in the
dashboard. Walk me through why the agent selected those specific
tools in that specific order. What would happen if one of those
tools was missing from the server?
What you are learning: The agent selects tools based on their descriptions and the user's message. Understanding the selection logic helps you predict which tools fire for different messages. Removing a tool does not cause an error; the agent works around it, but the experience degrades.
Exercise 2: Stress Test with Ambiguity
Send these three messages to TutorClaw and predict which tools
will fire for each one before checking the dashboard:
1. "Quiz me on Chapter 2"
2. "How am I doing overall?"
3. "I want to upgrade to the paid plan"
Map each message to its expected tool chain, then verify.
What you are learning: Different message types trigger different tool combinations. "Quiz me" should invoke get_exercises. "How am I doing" should invoke get_learner_state. "Upgrade" should invoke get_upgrade_url. Predicting before checking builds intuition for how tool descriptions drive selection.
Exercise 3: Compare Ch57 and Ch58
In Chapter 57, I connected one tool and tested it from WhatsApp.
In Chapter 58, I connected nine tools and tested a full tutoring
session. Compare the two experiences: what was the same in the
connection process, what was different in the agent's behavior,
and what made the multi-tool response feel like a product instead
of a single function call?
What you are learning: The connection process is identical. The product difference comes entirely from tool count and tool descriptions. More tools give the agent more choices. Better descriptions give it better judgment. The protocol does not change; the experience does.
James scrolled through the dashboard. Four tool badges on the first message. Three on his reply. The agent had selected different tools for each turn based on what he said.
"In Chapter 57, I had one tool and one badge," he said. "Now I have nine tools and the agent is chaining them into a tutoring session. The connection was the same two commands. The experience is completely different."
He took a screenshot of the conversation. The WhatsApp thread showed a tutor that asked him to predict, evaluated his answer, and adjusted the next step. The dashboard showed exactly which tools made that happen.
Emma looked at the screenshot and then at the dashboard. "Tool chaining does not equal coherent experience, though."
James looked up. "What do you mean? This worked perfectly."
"My first multi-tool product looked impressive in the dashboard. Five tools firing, badges everywhere. But the actual conversation felt disjointed. The agent would call generate_guidance and then immediately dump content without waiting for the learner to respond. Or it would call assess_response on a message that was not actually an answer." She paused. "The tools worked. The orchestration was wrong."
"So how did you fix it?"
"AGENTS.md." Emma pointed at the project directory. "A document that tells the agent how to use the tools. When to call each one. The order of operations for a tutoring session. You have nine working tools. Next lesson, you write the instruction manual that makes them work together coherently."