Sampling — Servers Calling LLMs
Your research tool needs to summarize documents. Your code analysis tool needs to explain complex patterns. Your customer support server needs to reason about edge cases. These tools require AI inference. But you face a problem:
If the server manages its own API key, it's tightly coupled to a specific model provider. It becomes expensive (the server pays for every inference call), complex (error handling, rate limiting), and insecure (keys embedded in every deployment).
If the server asks the client to call Claude, costs shift to the client, complexity vanishes, and the server stays provider-agnostic. This is sampling.
Sampling is the MCP mechanism that lets servers request LLM inference from clients without managing keys, credentials, or provider relationships. The server asks, "Hey client, can you run Claude on this?" The client responds, "Sure, here's what Claude says." Deterministic operations stay on the server. Reasoning operations route through the client's models.
This is how you build hybrid systems: fast operations on the server, reasoning on the frontier model, scaling without complexity.
The Sampling Problem and Solution
Let's compare the naive approach with what sampling gives you:
The Naive Approach (Don't Do This)
```python
import os

import anthropic

# `mcp` is your FastMCP server instance, created elsewhere

@mcp.tool()
async def summarize(text_to_summarize: str) -> str:
    """Summarize research findings."""
    # PROBLEM: the server needs its own API key
    client = anthropic.AsyncAnthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
    message = await client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"Summarize: {text_to_summarize}",
        }],
    )
    return message.content[0].text
```
Problems:
- Server has API key embedded (security risk)
- Server pays for inference (cost leakage)
- Server depends on Anthropic specifically (no provider flexibility)
- Server must handle Claude errors directly (complexity)
- Server can't use client's context window (isolated reasoning)
The Sampling Approach (Do This Instead)
```python
from mcp.server.fastmcp import Context
from mcp.types import SamplingMessage, TextContent

@mcp.tool()
async def summarize(text_to_summarize: str, ctx: Context) -> str:
    """Summarize research findings using sampling."""
    # The client provides inference; the server just specifies intent
    result = await ctx.session.create_message(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(type="text", text=f"Summarize: {text_to_summarize}"),
            )
        ],
        max_tokens=1000,
        system_prompt="You are a research expert summarizing academic findings",
    )
    # result.content is a single content object, not a list
    if result.content.type == "text":
        return result.content.text
    raise ValueError("Sampling returned non-text content")
```
Benefits:
- No API keys on server (secure)
- Client pays for inference (aligned costs)
- Client determines model (flexible)
- Client handles errors (centralized)
- Server can leverage client's context (powerful)
How Sampling Works: The Complete Flow
The pattern involves two sides working together:
Server Side: Request Inference
```python
from mcp.server.fastmcp import Context
from mcp.types import SamplingMessage, TextContent

@mcp.tool()
async def analyze_code(code_snippet: str, ctx: Context) -> dict:
    """Analyze Python code using the client's LLM."""
    # Server specifies WHAT to do
    result = await ctx.session.create_message(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(
                    type="text",
                    text=f"Identify bugs and suggest fixes:\n{code_snippet}",
                ),
            )
        ],
        max_tokens=2000,
        system_prompt=(
            "You are an expert Python code reviewer. "
            "Focus on bugs, security issues, and performance."
        ),
    )
    # Unpack the response (result.content is a single content object)
    if result.content.type == "text":
        return {
            "analysis": result.content.text,
            "source": result.model,  # the client reports which model it actually used
        }
    raise ValueError("Code analysis failed")
```
What's happening:
- Server creates a `SamplingMessage` (not a regular message; it signals "route this to the client")
- Server calls `ctx.session.create_message()` (the method that issues a sampling request)
- Server specifies a `system_prompt` (instructions for the model)
- Server awaits the response from the client's Claude
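Under the hood, `create_message` travels to the client as a JSON-RPC request with the method `sampling/createMessage`. A rough sketch of that payload, shown as a Python dict (field names follow the MCP spec; the `id` and abbreviated values are illustrative):

```python
# Approximate wire payload for the analyze_code sampling request above
sampling_request = {
    "jsonrpc": "2.0",
    "id": 1,  # illustrative request id
    "method": "sampling/createMessage",
    "params": {
        "messages": [
            {
                "role": "user",
                "content": {"type": "text", "text": "Identify bugs and suggest fixes:\n..."},
            }
        ],
        "maxTokens": 2000,
        "systemPrompt": "You are an expert Python code reviewer. ...",
    },
}
```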
Client Side: Handle Sampling Callback
```python
import os

import anthropic
from mcp import ClientSession
from mcp.types import (
    CreateMessageRequestParams,
    CreateMessageResult,
    TextContent,
)

async def sampling_callback(context, params: CreateMessageRequestParams) -> CreateMessageResult:
    """Intercept the server's sampling requests and route them through Claude."""
    # The client has the API key and model access; the server doesn't
    client = anthropic.AsyncAnthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
    # Convert MCP SamplingMessage objects into the Anthropic API format
    response = await client.messages.create(
        model="claude-3-5-sonnet-20241022",  # Client decides the model
        max_tokens=params.maxTokens,         # MCP request params use camelCase fields
        system=params.systemPrompt or "",
        messages=[
            {"role": m.role, "content": m.content.text}
            for m in params.messages
            if isinstance(m.content, TextContent)
        ],
    )
    # Return the result to the server (content is a single object, not a list)
    return CreateMessageResult(
        model="claude-3-5-sonnet-20241022",
        role="assistant",
        content=TextContent(type="text", text=response.content[0].text),
    )

# Register the callback when creating the session
# (read_stream and write_stream come from your transport, e.g. stdio_client)
session = ClientSession(
    read_stream,
    write_stream,
    sampling_callback=sampling_callback,
)
```
What's happening:
- Client defines `sampling_callback` (it intercepts `sampling/createMessage` requests)
- Client owns the API key (the server never sees it)
- Client calls Claude (it picks the model and handles provider errors)
- Client returns a structured response to the server
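The callback is also the natural place to absorb provider failures. A minimal sketch, assuming the `ErrorData` type and `INTERNAL_ERROR` code from `mcp.types` (the SDK lets a sampling callback return an error instead of a result):

```python
from mcp.types import INTERNAL_ERROR, ErrorData

async def safe_sampling_callback(context, params: CreateMessageRequestParams):
    """Wrap sampling_callback so provider errors reach the server as JSON-RPC errors."""
    try:
        return await sampling_callback(context, params)
    except anthropic.APIError as exc:
        # The server sees this as a failed create_message call it can handle
        return ErrorData(code=INTERNAL_ERROR, message=f"Sampling failed: {exc}")
```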
Comparison: Three Approaches
| Factor | Direct Server Call | Sampling | API Pass-Through |
|---|---|---|---|
| Server complexity | High (API management) | Low (just request) | Medium |
| Cost management | Server pays | Client pays | Hidden |
| Security | Keys on server (risky) | Keys on client only (safe) | Keys duplicated |
| Model flexibility | Fixed to server's choice | Client controls model | Fixed upstream |
| Error handling | Server handles | Client handles | Mixed |
| Context awareness | Server context only | Client's full context | Limited |
| Use case | Never (anti-pattern) | Inference requests | Data transform |
Key insight: Sampling shifts both cost and responsibility to the client (where they belong). The server focuses on what it knows best; the client focuses on inference.
Real-World Sampling Example: Research Assistant
Here's a complete example combining research operations (server) with AI reasoning (sampling):
Server: Research Tool
```python
import json

import httpx
from mcp.server.fastmcp import Context
from mcp.types import SamplingMessage, TextContent

@mcp.tool()
async def research_topic(topic: str, ctx: Context) -> dict:
    """Research a topic and synthesize findings."""
    # Step 1: Server fetches raw research (deterministic)
    # NOTE: illustrative endpoint; the real arXiv API returns Atom XML,
    # so a production version would parse XML or use a JSON wrapper service
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://api.arxiv.org/query",
            params={"search_query": f"all:{topic}", "max_results": 5},
        )
        papers = response.json()

    # Step 2: Server extracts key information (deterministic)
    research_data = {
        "papers": [
            {
                "title": p["title"],
                "abstract": p["summary"][:500],  # First 500 chars
            }
            for p in papers[:5]
        ]
    }

    # Step 3: Server asks Claude to synthesize (sampling)
    synthesis = await ctx.session.create_message(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(
                    type="text",
                    text=f"""Synthesize these research papers into a coherent summary.

Papers:
{json.dumps(research_data['papers'], indent=2)}

Create:
1. Key findings
2. Research gaps
3. Future directions""",
                ),
            )
        ],
        max_tokens=1500,
        system_prompt="You are a research analyst synthesizing academic findings for executives.",
    )

    # Step 4: Return synthesized findings (content is a single object)
    if synthesis.content.type == "text":
        return {
            "raw_papers": len(research_data["papers"]),
            "synthesis": synthesis.content.text,
        }
    raise ValueError("Synthesis failed")
```
Client Usage:
```python
# Client registers a callback (anthropic_client is an AsyncAnthropic
# instance created as in the earlier example)
async def handle_research_sampling(context, params: CreateMessageRequestParams) -> CreateMessageResult:
    """Route research synthesis through Claude."""
    response = await anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=params.maxTokens,
        system=params.systemPrompt or "",
        messages=[
            {"role": m.role, "content": m.content.text}
            for m in params.messages
            if isinstance(m.content, TextContent)
        ],
    )
    return CreateMessageResult(
        model="claude-3-5-sonnet-20241022",
        role="assistant",
        content=TextContent(type="text", text=response.content[0].text),
    )

# When the client calls the research tool
response = await session.call_tool(
    name="research_topic",
    arguments={"topic": "transformer architectures"},
)
# Output:
# {
#   "raw_papers": 5,
#   "synthesis": "Transformers have evolved from..."
# }
```
What you see:
- Server fetches data, formats it
- Server asks Claude to synthesize (sampling)
- Claude responds through client
- Server returns synthesized result to caller
The server never needed an API key. The client never needed to manage fetching. Clean separation.
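To see the whole loop in one place, here is a minimal client wiring sketch, assuming a stdio transport, the `handle_research_sampling` callback above, and a hypothetical server script named `research_server.py`:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server as a subprocess over stdio (hypothetical script name)
    server = StdioServerParameters(command="python", args=["research_server.py"])
    async with stdio_client(server) as (read_stream, write_stream):
        async with ClientSession(
            read_stream,
            write_stream,
            sampling_callback=handle_research_sampling,
        ) as session:
            await session.initialize()
            result = await session.call_tool(
                "research_topic", {"topic": "transformer architectures"}
            )
            print(result.content)

asyncio.run(main())
```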
When to Use Sampling
Use sampling when:
- Server needs to reason about data it collected
- Tool requires natural language understanding
- You want client to bear inference costs
- Tool should work with any frontier model (not tied to one provider)
- You need the client's broader context
Examples:
- Summarizing documents (server retrieves, Claude synthesizes)
- Analyzing code (server parses, Claude explains)
- Content moderation (server checks patterns, Claude evaluates intent; sketched after this list)
- Decision support (server fetches data, Claude recommends action)
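The moderation case shows how the two halves combine: a deterministic pattern check on the server, with sampling reserved for the ambiguous remainder. A minimal sketch, assuming the imports and `mcp` instance from the earlier examples (the regex list and the yes/no protocol are illustrative choices, not a real policy):

```python
import re

BLOCKLIST = [r"\bssn\s*:\s*\d{3}-\d{2}-\d{4}\b"]  # illustrative pattern list

@mcp.tool()
async def moderate(text: str, ctx: Context) -> dict:
    """Moderate content: fast rules first, LLM judgment only when needed."""
    # Deterministic pass: cheap, instant, no sampling required
    if any(re.search(p, text, re.IGNORECASE) for p in BLOCKLIST):
        return {"verdict": "blocked", "method": "pattern"}
    # Ambiguous content: ask the client's model to evaluate intent
    result = await ctx.session.create_message(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(
                    type="text",
                    text=f"Is this content harmful? Answer yes or no.\n{text}",
                ),
            )
        ],
        max_tokens=10,
        system_prompt="You are a content moderator. Answer only 'yes' or 'no'.",
    )
    is_harmful = result.content.type == "text" and "yes" in result.content.text.lower()
    return {"verdict": "blocked" if is_harmful else "allowed", "method": "sampling"}
```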
Don't use sampling when:
- Operation is purely deterministic (pure computation, data lookup; see the counter-example after this list)
- Operation must complete in under 500ms (sampling has latency)
- Server can't format data properly (client shouldn't guess intent)
- Cost must be server-controlled (sampling routes to client)
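By contrast, a purely deterministic tool has no reason to pay sampling's round trip. A hypothetical but representative example:

```python
@mcp.tool()
async def word_count(text: str) -> dict:
    """Count words: pure computation, so sampling would add latency for no benefit."""
    words = text.split()
    return {"words": len(words), "characters": len(text)}
```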
Try With AI
Part 1: Review the Code
You have a document processing server that needs to classify documents. Read this code snippet:
```python
@mcp.tool()
async def classify_document(content: str, ctx: Context):
    """Classify document using sampling."""
    classification = await ctx.session.create_message(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(
                    type="text",
                    text=f"Classify this document:\n{content}",
                ),
            )
        ],
        max_tokens=200,
    )
    return classification.content.text
```
Ask Claude: "This server tool uses sampling to classify documents. What's missing from the implementation that would make it production-ready?"
Pay attention to Claude's suggestions about:
- System prompts (does Claude have guidance?)
- Error handling (what if sampling fails?)
- Response validation (is the classification in the expected format?)
Part 2: Implement Error Handling
Claude will likely suggest handling sampling failures. Based on its suggestions, ask: "Show me how to add retry logic and graceful fallback if sampling fails."
Review the code Claude generates. Ask yourself:
- Does this handle the case where sampling times out?
- Does this prevent infinite retry loops?
- Does the fallback return sensible output?
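For comparison with Claude's answer, here is one possible shape for that logic, a sketch only (the two-retry limit, the category list, and the "other" fallback are arbitrary choices):

```python
async def classify_with_fallback(content: str, ctx: Context, retries: int = 2) -> str:
    """Try sampling up to `retries` times, then fall back to a safe default."""
    for _attempt in range(retries):
        try:
            result = await ctx.session.create_message(
                messages=[
                    SamplingMessage(
                        role="user",
                        content=TextContent(
                            type="text",
                            text=f"Classify this document:\n{content}",
                        ),
                    )
                ],
                max_tokens=200,
                system_prompt="Classify as: report, email, contract, or other. One word only.",
            )
            if result.content.type == "text":
                return result.content.text.strip().lower()
        except Exception:
            continue  # the loop is bounded, so retries cannot run forever
    return "other"  # graceful fallback: a safe default classification
```

A production version would also bound each attempt's latency, for example by wrapping the `create_message` call in `asyncio.wait_for`, which covers the timeout case.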
Part 3: Design a Sampling Workflow
You're building a code review server. It needs to:
- Parse Python code (deterministic)
- Ask Claude for security issues (sampling)
- Ask Claude for performance suggestions (sampling)
- Return both analyses
Ask Claude: "Design a tool that does code review using two separate sampling calls—one for security, one for performance. Should these be separate tools or one tool with two sampling requests?"
Compare Claude's recommendation to your instinct. Ask: "Why did you recommend this approach? What's the tradeoff?"
This question forces you to think about sampling composition—when to use multiple samples vs when to combine them.
Part 4: Evaluate Tradeoffs
You're deciding: should your research server call Claude for each paper (5 sampling calls) or fetch all papers then call Claude once?
Ask Claude: "Compare these two approaches: (A) sample Claude for each paper's summary individually, or (B) fetch all papers, then sample Claude once for synthesis. What are the latency, cost, and quality tradeoffs?"
Notice what emerges: Claude will explain reasoning you might not have considered (parallelization, context window efficiency, response quality). Ask yourself: "Which tradeoff matters most for my use case?"
This is sampling in action—not just moving code around, but reasoning about when AI involvement creates value vs when it adds latency without benefit.
Reflect on Your Skill
You built an mcp-server skill in Lesson 0. Test and improve it based on what you learned.
Test Your Skill
Using my mcp-server skill, create a tool that uses sampling to call an LLM through the client.
Does my skill explain when to use sampling vs direct API calls, and how to implement context.session.create_message()?
Identify Gaps
Ask yourself:
- Did my skill include sampling patterns and SamplingMessage structures?
- Did it explain the tradeoffs between server-side API calls and client-side sampling?
Improve Your Skill
If you found gaps:
My mcp-server skill is missing sampling implementation patterns.
Update it to include when sampling is appropriate, how to use context.session.create_message(), and the architectural benefits of delegating LLM calls to clients.