LLM Pricing for Product Builders
The Great Inversion answered one question and raised another. James now understood that TutorClaw's operating cost was $50-70 per month because the learner provides their own LLM. But that raised an obvious follow-up.
"If my learners are paying for their own tokens," James said, "then I need to know what that costs them. If the cheapest option is ten dollars a day, the inversion looks good on my balance sheet but terrible for adoption."
Emma pulled up a pricing table. "Look at the rightmost column first. Then look at the leftmost. Tell me the ratio."
You are doing exactly what James is doing. The Great Inversion shifts LLM costs to the learner. Now you need to understand the range of what those costs look like, because it directly affects whether learners will use your product.
The 37x Range
Claude Sonnet output costs $15 per million tokens. GPT-5 Nano costs $0.40 per million tokens. That is a 37x difference across the practical range most learners will choose from. Claude Opus sits above this range at $75/M (187x compared to Nano), but few learners use Opus for daily tutoring. Here is the full pricing landscape:
| Model | Input / 1M tokens | Output / 1M tokens | Cost Tier |
|---|---|---|---|
| Claude Opus 4.6 | $15.00 | $75.00 | Premium |
| Claude Sonnet 4.5 | $3.00 | $15.00 | High |
| GPT-5.4 | $2.50 | $15.00 | High |
| GPT-5.4 mini | $0.75 | $4.50 | Mid |
| DeepSeek V3.2 | $0.28 | $0.42 | Low |
| GPT-5 Nano | $0.05 | $0.40 | Ultra-low |
| DeepSeek V3.2 (cache hit) | $0.028 | $0.42 | Near-free |
This table is the terrain your learners navigate. A learner using Claude Opus pays 300x more per input token than a learner using GPT-5 Nano. On output tokens (which dominate tutoring conversations because the tutor generates more text than the learner), the practical range is 37x: Sonnet at $15/M to Nano at $0.40/M.
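The spread becomes concrete when you convert per-million prices into per-exchange costs. A minimal sketch, using the output prices from the table above and assuming a 500-output-token exchange (the same exchange size used in the scenarios later in this chapter):

```python
# Output price per 1M tokens, taken from the pricing table above.
OUTPUT_PRICE_PER_M = {
    "Claude Opus 4.6": 75.00,
    "Claude Sonnet 4.5": 15.00,
    "GPT-5.4 mini": 4.50,
    "GPT-5 Nano": 0.40,
}

TOKENS_PER_EXCHANGE = 500  # assumed size of one tutoring reply

for model, price in OUTPUT_PRICE_PER_M.items():
    cost = TOKENS_PER_EXCHANGE * price / 1_000_000
    print(f"{model}: ${cost:.4f} per exchange")

# The practical range: Sonnet vs. Nano.
ratio = OUTPUT_PRICE_PER_M["Claude Sonnet 4.5"] / OUTPUT_PRICE_PER_M["GPT-5 Nano"]
print(f"Sonnet/Nano output-price ratio: {ratio:.1f}x")
```

Per exchange, every model costs fractions of a cent; the 37x ratio only becomes visible to a learner once multiplied across dozens of daily exchanges.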
In Architecture 4, the learner chooses their model in OpenClaw. TutorClaw's skill documentation recommends models by use case: Claude Sonnet for best pedagogical quality, GPT-5.4 mini for cost-effectiveness, DeepSeek for budget. But the operator does not control this choice. The learner picks what they can afford.
Why This Matters to You (the Operator)
You do not pay for these tokens. So why should you care?
Because your product's reputation depends on the learner's experience. If a learner picks the cheapest model and gets confused, garbled tutoring, they do not blame the model. They blame TutorClaw. Your product is what they see; the model is invisible infrastructure.
This creates a product design constraint: TutorClaw must work acceptably across the entire 37x range. Not identically. Not perfectly. But acceptably. A learner on GPT-5 Nano should still get useful pedagogical guidance, even if the natural language is less polished than what Claude Sonnet produces.
The MCP server is what makes this possible. TutorClaw's MCP server returns structured tool responses: which chapter the learner is in, which PRIMM stage to use, which exercise to present, what hints to offer. This pedagogical structure comes from the server, not the model. The LLM wraps that structure in natural language, but the core teaching logic is model-independent.
A weak model might produce a clunky sentence. It will not produce the wrong chapter or the wrong exercise, because that information comes from the MCP server's tool response.
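The division of labor is easiest to see as data. Below is a hypothetical shape for one MCP tool response; the field names are illustrative, not TutorClaw's actual schema. The model's only job is to rephrase these facts in natural language:

```python
# Hypothetical MCP tool response. The pedagogical facts are fixed
# server-side, regardless of which LLM renders them for the learner.
tool_response = {
    "chapter": 12,
    "primm_stage": "Predict",  # PRIMM: Predict, Run, Investigate, Modify, Make
    "exercise": "trace-the-loop-output",
    "hints": [
        "Read the loop condition before the loop body.",
        "Track the counter variable on paper, one iteration at a time.",
    ],
    "success_criteria": "Learner states the final printed value before running the code.",
}

# A weak model may word a hint clumsily, but it cannot change which
# chapter, stage, or exercise the learner is in -- those come from here.
for key in ("chapter", "primm_stage", "exercise"):
    print(f"{key}: {tool_response[key]}")
```

What varies across the 37x range is only the prose wrapped around this structure; what stays constant is everything in it.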
Cost Per Accepted Output
The naive way to evaluate model costs is cost per token. The correct metric is:
Cost Per Accepted Output = (Token Cost + Human Correction Cost) / Accepted Outputs
Work through this with two scenarios.
Scenario A: Budget model. GPT-5 Nano at $0.40/M output tokens. Each tutoring exchange uses approximately 500 output tokens. Token cost per exchange: $0.0002. But the model produces incorrect or confusing pedagogical guidance 40% of the time, requiring the learner to re-prompt or abandon the exchange. If we estimate a correction cost of $0 in direct fees but account for only 60% of outputs being accepted:
- CPAO = $0.0002 / 0.60 = $0.00033 per accepted output
Scenario B: Premium model. Claude Sonnet at $15/M output tokens. Same 500 output tokens per exchange. Token cost per exchange: $0.0075. The model produces incorrect guidance only 5% of the time. 95% of outputs are accepted:
- CPAO = $0.0075 / 0.95 = $0.00789 per accepted output
In pure token economics, Scenario B costs 24x more per accepted output than Scenario A. The budget model wins on price even after adjusting for failures.
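The two scenarios reduce to one line of arithmetic each. A quick check, using the chapter's numbers (the acceptance rates are estimates, not measurements):

```python
def cpao(price_per_m, tokens, acceptance_rate):
    """Cost Per Accepted Output: token cost per exchange / acceptance rate."""
    token_cost = tokens * price_per_m / 1_000_000
    return token_cost / acceptance_rate

nano = cpao(0.40, 500, 0.60)     # Scenario A: GPT-5 Nano, 60% accepted
sonnet = cpao(15.00, 500, 0.95)  # Scenario B: Claude Sonnet, 95% accepted

print(f"Scenario A: ${nano:.5f} per accepted output")
print(f"Scenario B: ${sonnet:.5f} per accepted output")
print(f"Ratio: {sonnet / nano:.0f}x")  # rounds to 24x
```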
But CPAO does not capture trust erosion. A learner who gets confused guidance 40% of the time does not just re-prompt. They stop trusting the tutor. They disengage. They cancel their subscription to the premium intelligence tier. The operator loses revenue not from token costs (which the operator does not pay) but from churn caused by poor model quality on the learner's end.
This is the product builder's dilemma: you cannot control the model, but the model's quality affects your revenue. The MCP server's structured responses narrow the gap. Even when a weak model produces mediocre natural language, the pedagogical scaffolding (chapter position, exercise type, hint sequence) comes from the server. This does not eliminate the quality gap, but it makes budget models usable instead of unusable.
Your CPAO Calculation
Pick two models from the pricing table. Estimate a failure rate for each (how often the output would need correction in a tutoring context). Calculate the CPAO for both.
| | Model A: ___ | Model B: ___ |
|---|---|---|
| Output price per 1M tokens | $ ___ | $ ___ |
| Tokens per exchange | ___ | ___ |
| Token cost per exchange | $ ___ | $ ___ |
| Estimated acceptance rate | ___% | ___% |
| CPAO | **$ ___** | **$ ___** |
The model with the lower CPAO is the better value per successful interaction. But remember: CPAO measures direct cost efficiency. It does not measure the indirect cost of learner frustration, disengagement, or lost trust.
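If you prefer to fill in the worksheet programmatically, here is a hedged helper. It also takes an optional per-failure correction cost, anticipating the extension in Exercise 2 below; the example acceptance rates are placeholder assumptions, not benchmarks:

```python
def cpao(output_price_per_m, tokens_per_exchange, acceptance_rate,
         correction_cost_per_failure=0.0):
    """Cost Per Accepted Output.

    Token cost per exchange, plus the expected human-correction cost
    (correction cost per failure times the failure rate), divided by
    the fraction of outputs that are accepted.
    """
    token_cost = tokens_per_exchange * output_price_per_m / 1_000_000
    expected_correction = correction_cost_per_failure * (1 - acceptance_rate)
    return (token_cost + expected_correction) / acceptance_rate

# Example worksheet fill-in: DeepSeek V3.2 vs. GPT-5.4 mini output
# prices from the table, with assumed acceptance rates of 70% and 90%.
print(f"DeepSeek V3.2: ${cpao(0.42, 500, 0.70):.5f}")
print(f"GPT-5.4 mini:  ${cpao(4.50, 500, 0.90):.5f}")
```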
Try With AI
Exercise 1: Daily Cost Across Model Tiers
A learner uses TutorClaw for 50 exchanges per day. Each exchange
involves 4 MCP tool calls and approximately 500 output tokens per
call (2,000 output tokens per exchange).
Using this pricing table:
- Claude Opus 4.6: $75/M output tokens
- Claude Sonnet 4.5: $15/M output tokens
- GPT-5.4 mini: $4.50/M output tokens
- GPT-5 Nano: $0.40/M output tokens
Calculate the daily token cost for each model. Then calculate the
monthly cost (30 days). Present the results in a table showing
daily cost, monthly cost, and the ratio compared to the cheapest option.
What you are learning: Turning per-token prices into real monthly costs makes the 37x range tangible. A learner considering Claude Opus for tutoring needs to know this costs them dollars per day, not fractions of a cent per token. Product builders need these numbers to write honest model recommendations.
Exercise 2: Cost Per Accepted Output Comparison
I am comparing two LLM models for an AI tutoring product:
Model A: $0.40/M output tokens, but produces incorrect or confusing
pedagogical guidance 40% of the time (60% acceptance rate). Each
tutoring exchange uses 500 output tokens.
Model B: $15/M output tokens, produces incorrect guidance only 5%
of the time (95% acceptance rate). Same 500 output tokens per exchange.
Calculate the Cost Per Accepted Output for both models using this
formula: CPAO = Token Cost Per Exchange / Acceptance Rate
Now add a twist: if each failed exchange costs $2.00 in human
correction time (a teacher reviewing and fixing the guidance),
recalculate using: CPAO = (Token Cost + Correction Cost) / Accepted Outputs
Which model is cheaper now? At what acceptance rate would Model A
become cheaper than Model B even with correction costs?
What you are learning: Cost Per Accepted Output is the metric that matters for product decisions. When you add human correction costs (which are real in educational products), cheap models with high failure rates can become more expensive than premium models. The crossover point tells you where the economics flip.
Exercise 3: Why Structured MCP Responses Reduce Model Dependence
TutorClaw uses an MCP server that returns structured tool responses.
When a learner asks for help with Chapter 12, the MCP server returns
data like: chapter number, PRIMM stage, exercise type, hints, and
success criteria. The LLM then wraps this structured data in natural
language for the learner.
Explain why this architecture reduces the impact of model quality
differences. What stays constant regardless of which model the learner
uses? What varies? If the learner switches from Claude Sonnet to
GPT-5 Nano, what specific parts of the tutoring experience change,
and what parts stay the same?
What you are learning: The MCP server is the reason Architecture 4 works across the 37x price range. By separating pedagogical intelligence (server-side, model-independent) from natural language generation (LLM-side, model-dependent), TutorClaw ensures that the core teaching quality is consistent even when the language quality varies. This separation is a design decision with direct economic consequences.
James ran the daily cost numbers for a few models. "Fifty exchanges a day on Claude Opus is $7.50. On GPT-5 Nano, it is four cents. My learners are making that choice every day."
"And some of them will pick the cheapest option regardless of quality," Emma said. "That is the part I am genuinely uncertain about. Model costs have dropped by roughly 10x over two years. I do not know where they will be next year. The pricing table we just studied could look completely different in six months."
"Does that break the architecture?"
"No. That is the point. Architecture 4 is designed to be indifferent to model pricing. Whether the cheapest model costs $0.40 per million or $0.04 per million, the operator's costs stay the same. The question I cannot answer is which budget tier will remain viable for actual tutoring. A model that costs almost nothing but produces confused output is not a bargain; it is a product risk. The MCP server's structured responses help, but they do not eliminate the gap entirely."
James thought about it in supply chain terms. "It is like recommending suppliers to your franchise operators. You can suggest the premium vendor, but the franchisee picks what they can afford. Your job is to make the recipe work with whatever ingredients they source. The recipe is the intelligence. The ingredients are the model."
"That is exactly the design constraint. And the next question is: what does the recipe itself cost to produce and deliver? You know what the learner pays for tokens. Next, you need to see what Panaversity pays for everything else."