The Full Engine

James opened the Lesson 2 spec on the left side of his screen. Nine tools, designed on a blank sheet three weeks ago. He opened his terminal on the right side. Nine tools, running, tested, monetized, published.

"Everything on the spec is built," he said.

Emma did not move. "Prove it. Walk through every line of that spec and show me where it lives in the code."

James started at the top. register_learner: built in Lesson 3, JSON persistence, test suite passing. get_learner_state: same lesson, same file, same tests. He kept going. Tool by tool, lesson by lesson, matching the paper description to the running implementation.

"I can account for every one," he said after five minutes.

"Good. Now tell me what is missing."


You are doing exactly what James is doing. Open your Lesson 2 spec (or the table below) and walk through every design decision you made. Your job: verify that each one became real code.

The Spec-vs-Implementation Inventory

This table maps every commitment from Lesson 2 to the lesson where you built it and the evidence that it works.

| L2 Spec | Built In | Status | Evidence |
| --- | --- | --- | --- |
| register_learner | L3 | Done | JSON persistence, test suite (L11-L12) |
| get_learner_state | L3 | Done | JSON persistence, test suite (L11-L12) |
| update_progress | L3 | Done | Confidence scoring, test suite (L11-L12) |
| get_chapter_content | L4 | Done | Local markdown files, tier gated (L13) |
| get_exercises | L4 | Done | Local files, tier gated (L13) |
| generate_guidance | L5 | Done | PRIMM-Lite methodology, three stages |
| assess_response | L5 | Done | Confidence scoring, stage advancement |
| submit_code | L6 | Done | Mock sandbox with subprocess |
| get_upgrade_url | L6, L14 | Done | Mock in L6, real Stripe in L14 |
| Tier gating | L13 | Done | check_tier() enforcement, exchange counting |
| Test suite | L11-L12 | Done | All green: valid, invalid, tier, state persistence |
| Context engineering | L9-L10 | Done | AGENTS.md orchestration, two-layer descriptions |
| Agent identity | L17 | Done | SOUL.md and IDENTITY.md |
| Channel routing | L18 | Done | Keyword triggers, agent binding |
| Hardening | L19 | Done | Input validation, structured JSON logging |
| ClawHub publishing | L20 | Done | Package manifest, clawhub publish |

Every row maps to a lesson and a test. Nothing from the original spec was skipped.
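The tier-gating row in the table names check_tier() enforcement and exchange counting. A minimal sketch of that pattern might look like the following; the actual signature, tier names, and the FREE_TIER_LIMIT value are assumptions, not TutorClaw's real code:

```python
# Hypothetical sketch of tier gating with exchange counting.
# Tier names, the limit, and function signatures are illustrative.

FREE_TIER_LIMIT = 10  # assumed cap on free-tier exchanges

TIER_ORDER = {"free": 0, "pro": 1}

def check_tier(learner: dict, required_tier: str) -> bool:
    """Return True if the learner's tier grants access to a gated tool."""
    have = TIER_ORDER.get(learner.get("tier", "free"), 0)
    need = TIER_ORDER.get(required_tier, 0)
    return have >= need

def count_exchange(learner: dict) -> dict:
    """Increment the counter used to enforce the free-tier exchange cap."""
    learner["exchanges"] = learner.get("exchanges", 0) + 1
    return learner

learner = {"tier": "free", "exchanges": 0}
assert check_tier(learner, "free")        # free tool: allowed
assert not check_tier(learner, "pro")     # pro tool: blocked
```

The point of the sketch is that gating lives in one small function, so every tool that needs enforcement calls the same check instead of duplicating the logic.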

Your Turn

Open your own Lesson 2 notes or scroll back to the tool contracts. Check each tool against your implementation. If you find something that drifted from the original design, note what changed and why. Intentional changes (you found a better approach during build) are fine. Unintentional drift (you forgot a field or skipped a constraint) is worth fixing.

What You Built vs What Production Needs

The product works. But it works locally, for one user, on your machine. Here is what changes when real users show up, and why none of these changes affect the product itself.

| What You Have | What Production Adds | Why Later |
| --- | --- | --- |
| JSON files | PostgreSQL (Neon) | JSON does not scale past hundreds of learners. Concurrent writes corrupt data. |
| Local content files | Cloudflare R2 with Workers | Global delivery needs a CDN. Access control needs edge logic. |
| Mock sandbox | Docker container sandbox | Security isolation for arbitrary code execution from untrusted users. |
| Local server | VPS with Docker Compose | The server needs to run around the clock for real learners. |
| Keyword routing | Intent classification | Keywords miss nuanced messages. "I want to get better at coding" has no keyword match. |
| Test mode Stripe | Production Stripe | Real money, real compliance, real webhook verification. |

Look at that table carefully. Every item in the "What Production Adds" column is an infrastructure upgrade. Not a single item changes a tool interface. The inputs and outputs of register_learner are the same whether the data goes to a JSON file or a PostgreSQL table.

This is the key insight: the tests verify the contract, not the implementation. Your test suite calls register_learner with a name and expects a learner_id back. It does not care whether that ID came from a JSON file or a database row. When you swap the storage layer, the tests still pass because the tool still fulfills its contract.
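A contract test of the kind described here might look like this minimal sketch. The stand-in implementation and field names are assumptions; the point is what the test does and does not assert:

```python
# Minimal sketch of a contract test: it asserts on inputs and outputs only,
# never on where the data is stored. Names and fields are illustrative.
import uuid

def register_learner(name: str) -> dict:
    # Stand-in implementation; could be JSON-file or database backed.
    return {"learner_id": str(uuid.uuid4()), "name": name}

def test_register_learner_contract():
    result = register_learner("James")
    assert "learner_id" in result       # contract: an ID comes back
    assert result["name"] == "James"    # contract: the name round-trips
    # No assertion about JSON files or database rows: storage is invisible.

test_register_learner_contract()
```

Because nothing in the test mentions the storage layer, replacing the body of register_learner leaves the test untouched.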

That separation is not an accident. You designed it in Lesson 2 when you wrote the tool contracts. Contracts define the interface. Implementation fills in the details. Swapping the details is a deployment task, not a redesign.
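The interface/implementation split can be sketched with a small storage abstraction. The Storage protocol and class names below are illustrative, not TutorClaw's actual code:

```python
# Sketch of the interface/implementation split: tools depend on the
# interface, and the backend can be swapped at deploy time.
from typing import Protocol

class Storage(Protocol):
    def save(self, learner_id: str, record: dict) -> None: ...
    def load(self, learner_id: str) -> dict: ...

class JsonStorage:
    """In-memory stand-in for a JSON file on disk."""
    def __init__(self) -> None:
        self._data: dict[str, dict] = {}
    def save(self, learner_id: str, record: dict) -> None:
        self._data[learner_id] = record
    def load(self, learner_id: str) -> dict:
        return self._data[learner_id]

# A Postgres-backed class with the same two methods could replace
# JsonStorage at deploy time; every caller and every contract test
# stays unchanged.
store: Storage = JsonStorage()
store.save("abc123", {"name": "James", "tier": "free"})
assert store.load("abc123")["name"] == "James"
```

Swapping JsonStorage for a database-backed class is exactly the "deployment task, not a redesign" the paragraph describes.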

The Verification Ladder

You can evaluate any agent product with these seven levels. Each level builds on the one before it.

| Level | Question | TutorClaw Lesson |
| --- | --- | --- |
| 1 | Does each tool work in isolation? | L3-L6 (unit tests for each tool) |
| 2 | Do tools work together? | L7-L8 (all nine wired, full session) |
| 3 | Does the agent select the right tool? | L9-L10 (AGENTS.md, context engineering) |
| 4 | Does the product handle edge cases? | L12, L19 (edge case tests, hardening) |
| 5 | Does the product make money? | L13-L15 (tier gating, Stripe, payment flow) |
| 6 | Does the product have identity? | L17 (SOUL.md, IDENTITY.md) |
| 7 | Does the product degrade gracefully? | L16 (shim skill, offline fallback) |

Most tutorials stop at Level 2. They show you how to wire tools together and call it done. TutorClaw reaches Level 7 because a product that crashes when the server goes down is not a product anyone will pay for.

The ladder also tells you what to fix first when something breaks. If Level 1 fails (a tool returns wrong output), nothing above it matters. Fix the tool, then work back up. If Level 3 fails (the agent picks the wrong tool), the tools themselves might be fine; the descriptions need refinement. Each level isolates a different class of problem.

Try With AI

Exercise 1: Audit Your Own Product

Open my Lesson 2 tool specifications and my current TutorClaw
codebase side by side. For each of the 9 tools, compare the
original spec (name, inputs, outputs, tier access) against the
actual implementation. List any deviations: fields that changed,
constraints that were added or removed, or behaviors that differ
from the original design. For each deviation, tell me whether it
was an improvement or a regression.

What you are learning: Spec-vs-implementation auditing is a professional skill that applies to any software project. The ability to trace design decisions through implementation builds confidence that nothing was missed and surfaces intentional improvements that should be documented.

Exercise 2: Plan the Database Migration

I am planning to migrate TutorClaw from JSON file storage to
PostgreSQL on Neon. Walk me through what changes and what stays
the same. Which files need rewriting? Which tool interfaces
remain identical? Which tests break and which pass without
modification? Give me a concrete migration checklist.

What you are learning: The separation between interface and implementation is not abstract theory. When you can predict which tests break and which survive a storage migration, you understand the architecture at a level that makes the migration safe instead of scary.

Exercise 3: Apply the Verification Ladder

I have a different agent product idea: a customer support agent
with 5 tools (search_knowledge_base, create_ticket, escalate,
get_order_status, send_followup). Apply the 7-level verification
ladder to this product. What does each level look like? Which
levels are hardest for a support agent compared to a tutoring
agent? Where would you start building?

What you are learning: The verification ladder is not TutorClaw-specific. It works for any agent product. Applying it to a different domain forces you to think about the framework itself, not just the specific product you built. The hardest levels vary by domain: monetization is hard for support agents, tool selection is hard for tutors.


James finished the audit. Every spec item mapped to a lesson and a test. "Nothing is missing," he said. Then he paused. "But it runs on my laptop."

Emma nodded. "What does production need?"

"A database instead of JSON. A real server instead of localhost. A CDN for the content files. And a real sandbox instead of subprocess." He counted on his fingers. "Four infrastructure changes. But the tools stay the same."

"Why do the tools stay the same?"

"Because the tests verify the contract. register_learner takes a name and returns an ID. The test does not care where the ID is stored."

Emma smiled. "That is the sentence I was waiting for." She leaned back. "You described nine tools on a blank sheet. Claude Code built them. You wrote the tests, the descriptions, the identity, and the shim. You published a product. Everything else is making it bigger."

James looked at the terminal. All tests green. Dashboard showing nine tools connected. A Stripe webhook log with test payments processed. A ClawHub package published. He had spent three weeks building this, and every piece of it worked.

"How long did it take you to build your first product this way?" he asked.

Emma hesitated. "Longer than you."

"Because you were learning the tools?"

"No. Because I hand-coded everything. The state management, the tier logic, the webhook handler. I wrote every function myself because I did not trust the tools to get it right." She paused. "You built in days what took me weeks. Not because you are faster. Because you spent your time on product decisions, not implementation details."

James thought about that. His warehouse background had taught him something useful: the manager who obsesses over packing tape never ships on time. You hire good people, give them clear instructions, and verify the result. Claude Code was not that different.

"What is next?" he asked.

"The quiz," Emma said. "Fifty questions. If you can answer them, you understood what you built. If you cannot, you know where to look." She stood up. "And after the quiz, Chapter 59 asks the question you have not asked yet: does TutorClaw make money? Not 'can it accept payments,' because you proved that. The real question: what does each tutoring session cost you, and is the margin sustainable?"

James pulled up his Stripe test dashboard. Payments processed, webhooks fired, tiers upgraded. The product worked. The economics were next.