File Checkpointing: Recovering from Agent Mistakes

Your agent just modified 12 files and made a critical error in one of them. In a traditional agent system, those changes are permanent, and you revert them by hand, one file at a time.

The Claude Agent SDK takes a different approach: it snapshots file state at checkpoints during execution. If your agent makes a mistake, you rewind.

This is the difference between agents you can let loose on real projects and agents you have to watch constantly. File checkpointing makes agents resilient.

The Problem: Risky Exploration

Agents are bold. They explore multiple solutions. They refactor code. They try different architectures. This exploration is valuable—it discovers approaches you wouldn't think of.

But exploration is risky:

1. Agent decides to restructure your authentication flow
2. Makes changes to 8 files
3. One file has a syntax error
4. Now you have 8 files in an inconsistent state

Without checkpointing, you're manually reverting. With checkpointing, you rewind to the moment before exploration started and try a different approach.
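
To make the idea concrete, here is a minimal in-memory sketch of snapshot-and-rewind. It is a conceptual model only: `ToyCheckpointStore` and its methods are hypothetical names invented for this illustration, and nothing here reflects how the SDK stores checkpoints internally.

```python
import copy
import uuid


class ToyCheckpointStore:
    """In-memory stand-in for a workspace (hypothetical, not the SDK's implementation)."""

    def __init__(self, files):
        self.files = dict(files)   # path -> file contents
        self._snapshots = {}       # checkpoint id -> copy of all file contents

    def checkpoint(self):
        """Record the current contents and return an id that identifies them."""
        checkpoint_id = str(uuid.uuid4())
        self._snapshots[checkpoint_id] = copy.deepcopy(self.files)
        return checkpoint_id

    def rewind(self, checkpoint_id):
        """Restore every file to its contents at the given checkpoint."""
        self.files = copy.deepcopy(self._snapshots[checkpoint_id])


store = ToyCheckpointStore({"auth.py": "def login(): ...", "session.py": "TTL = 60"})
before = store.checkpoint()

# The "agent" edits two files and breaks one of them.
store.files["auth.py"] = "async def login(): ..."
store.files["session.py"] = "TTL = = 60"  # syntax error

# One rewind puts both files back, including the one that was edited correctly.
store.rewind(before)
print(store.files["session.py"])  # TTL = 60
```

The point of the model is that a rewind restores the whole set of files together, so you never end up with some files from the failed attempt and some from before it.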

Real-World Scenario: Code Refactoring

# Your agent receives this request:
# "Refactor our auth module from callbacks to async/await"

# The agent:
# 1. Analyzes current code (Checkpoint A)
# 2. Starts refactoring auth.py
# 3. Updates middleware.py
# 4. Updates error handling
# 5. Syntax error in session_manager.py (Checkpoint B)

# WITHOUT CHECKPOINTING:
# You manually revert session_manager.py,
# then discover auth.py has a logical error,
# then middleware.py doesn't compile,
# then you give up and revert manually

# WITH CHECKPOINTING:
# await client.rewind_files(checkpoint_a)
# Agent resumes with a different approach

Checkpointing lets your agent be exploratory without being destructive.

Enabling File Checkpointing

File checkpointing is disabled by default. Enable it in ClaudeAgentOptions:

Basic Enablement

import asyncio

from claude_agent_sdk import ClaudeAgentOptions, ClaudeSDKClient

async def main():
    options = ClaudeAgentOptions(
        allowed_tools=["Read", "Edit", "Bash"],
        enable_file_checkpointing=True,
        extra_args={"replay-user-messages": None}  # Required to get checkpoint UUIDs
    )

    async with ClaudeSDKClient(options) as client:
        await client.query("Refactor the auth module to use async/await")

        async for message in client.receive_response():
            # Python SDK messages are typed objects, so report the class name
            print(f"Message type: {type(message).__name__}")
            # More on capturing UUIDs below

asyncio.run(main())

Output (illustrative; the exact sequence depends on the run):

Message type: SystemMessage
Message type: AssistantMessage
Message type: UserMessage
Message type: AssistantMessage
Message type: ResultMessage

That's it. Checkpointing is now active. Every change the agent makes through its file-editing tools is snapshotted and associated with a checkpoint UUID.

Configuration Explained

Parameter	Purpose	Required?
`enable_file_checkpointing=True`	Activate checkpoint tracking	Yes
`extra_args={"replay-user-messages": None}`	Include checkpoint UUIDs in messages	Yes (for recovery)

Why replay-user-messages? This flag tells the SDK to emit UserMessage objects that contain checkpoint UUIDs. Without it, you can still use checkpointing, but you won't know the checkpoint IDs to rewind to.

Capturing Checkpoint UUIDs

File states are identified by checkpoint UUIDs. A new UUID is generated each time the agent executes a tool that modifies files.

Identifying UserMessage Objects

import asyncio
from datetime import datetime, timezone

from claude_agent_sdk import (
    ClaudeAgentOptions,
    ClaudeSDKClient,
    ResultMessage,
    UserMessage,
)

async def main():
    options = ClaudeAgentOptions(
        allowed_tools=["Read", "Edit"],
        enable_file_checkpointing=True,
        extra_args={"replay-user-messages": None}
    )

    checkpoints = {}

    async with ClaudeSDKClient(options) as client:
        await client.query("Add error handling to auth.py")

        async for message in client.receive_response():
            # Capture checkpoints from UserMessage objects
            if isinstance(message, UserMessage) and message.uuid:
                checkpoint_id = message.uuid
                checkpoints[checkpoint_id] = {
                    # Record when we saw the checkpoint; the message itself
                    # is not assumed to carry a timestamp
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                    "description": "After agent tool execution"
                }
                print(f"Checkpoint created: {checkpoint_id}")

            # The result message marks the end of the agent's turn
            if isinstance(message, ResultMessage):
                print(f"Agent completed. Created {len(checkpoints)} checkpoints")
                break

asyncio.run(main())

Output:

Checkpoint created: 550e8400-e29b-41d4-a716-446655440000
Checkpoint created: 550e8400-e29b-41d4-a716-446655440001
Checkpoint created: 550e8400-e29b-41d4-a716-446655440002
Agent completed. Created 3 checkpoints

Each UUID represents a specific file state. If anything goes wrong after checkpoint 550e8400-e29b-41d4-a716-446655440001, you can rewind to that exact state.
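
Because the order of checkpoints determines which state you get back, it can help to collect them into an ordered list. The sketch below does that against a fake message stream so it runs without the SDK; `FakeUserMessage`, `FakeAssistantMessage`, `fake_stream`, and `collect_checkpoints` are hypothetical names for this illustration. In real code you would pass `client.receive_response()` and the SDK's `UserMessage` class instead.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeUserMessage:
    """Hypothetical stand-in for the SDK's UserMessage."""
    uuid: Optional[str] = None


@dataclass
class FakeAssistantMessage:
    """Hypothetical stand-in for any other message type."""
    text: str = ""


async def fake_stream():
    """Yield a mixed sequence of messages, as a response stream might."""
    yield FakeUserMessage(uuid="uuid-1")
    yield FakeAssistantMessage(text="Editing auth.py")
    yield FakeUserMessage(uuid=None)       # messages without a UUID are skipped
    yield FakeUserMessage(uuid="uuid-2")
    yield FakeUserMessage(uuid="uuid-2")   # duplicates are recorded once


async def collect_checkpoints(stream, message_type):
    """Return the UUIDs of `message_type` messages in arrival order, without duplicates."""
    seen = []
    async for message in stream:
        if isinstance(message, message_type) and message.uuid and message.uuid not in seen:
            seen.append(message.uuid)
    return seen


checkpoints = asyncio.run(collect_checkpoints(fake_stream(), FakeUserMessage))
print(checkpoints)  # ['uuid-1', 'uuid-2']
```

With an ordered list, `checkpoints[0]` is the earliest state you can return to and `checkpoints[-1]` the most recent, which makes the choice of rewind target explicit.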

Storing Checkpoints Strategically

In production, you'd store checkpoints with semantic meaning:

from datetime import datetime, timezone

named_checkpoints = {}  # semantic label -> checkpoint UUID
checkpoint_log = []     # every checkpoint, in chronological order

async for message in client.receive_response():
    if isinstance(message, UserMessage) and message.uuid:
        # Store with a description for later reference; each new checkpoint
        # overwrites the label, so it ends up pointing at the latest state
        named_checkpoints["after_authentication"] = message.uuid

        # And store chronologically
        checkpoint_log.append({
            "id": message.uuid,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "order": len(checkpoint_log)
        })

Best practice: Save checkpoint IDs to a file or database during long-running agent sessions. This way, if your program crashes, you can still recover.
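
One way to follow this practice is a small JSON file that is rewritten atomically whenever a new checkpoint arrives. This is a sketch using only the standard library; `save_checkpoints`, `load_checkpoints`, and the file layout are hypothetical choices, not part of the SDK. The session ID is stored alongside the checkpoints on the assumption that you will need it to resume the session before rewinding.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoints(path, session_id, checkpoint_ids):
    """Atomically write the session id and checkpoint ids to a JSON file."""
    path = Path(path)
    payload = {"session_id": session_id, "checkpoints": list(checkpoint_ids)}
    # Write to a temp file in the same directory, then rename it over the
    # target, so a crash mid-write never leaves a half-written file behind.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)
    os.replace(tmp_name, path)


def load_checkpoints(path):
    """Return (session_id, checkpoint_ids), or (None, []) if nothing was saved."""
    path = Path(path)
    if not path.exists():
        return None, []
    data = json.loads(path.read_text(encoding="utf-8"))
    return data.get("session_id"), data.get("checkpoints", [])


with tempfile.TemporaryDirectory() as workdir:
    state_file = Path(workdir) / "checkpoints.json"
    missing = load_checkpoints(state_file)
    save_checkpoints(state_file, "session-123", ["uuid-1", "uuid-2"])
    restored = load_checkpoints(state_file)

print(restored)  # ('session-123', ['uuid-1', 'uuid-2'])
```

Calling `save_checkpoints` inside the message loop, each time a new UUID is captured, keeps the file current even if the process dies before the agent finishes.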

Recovering with rewind_files()

When an agent makes a mistake, use rewind_files() (rewindFiles() in the TypeScript SDK) to restore files to a previous checkpoint.

Basic Recovery Pattern

import asyncio

from claude_agent_sdk import (
    ClaudeAgentOptions,
    ClaudeSDKClient,
    ResultMessage,
    UserMessage,
)

async def main():
    options = ClaudeAgentOptions(
        allowed_tools=["Read", "Edit"],
        enable_file_checkpointing=True,
        extra_args={"replay-user-messages": None}
    )

    last_good_checkpoint = None

    async with ClaudeSDKClient(options) as client:
        # Task 1: Initial refactoring
        await client.query("Refactor database.py to use connection pooling")

        async for message in client.receive_response():
            # Keep only the first checkpoint: it marks the files as they were
            # before this task changed anything
            if (isinstance(message, UserMessage) and message.uuid
                    and last_good_checkpoint is None):
                last_good_checkpoint = message.uuid

            if isinstance(message, ResultMessage):
                result = message.result or ""  # result can be None
                print(f"Task 1 result: {result}")

                # Check if refactoring succeeded (the text match is a rough heuristic)
                if message.is_error or "error" in result.lower():
                    print(f"Error detected! Rewinding to checkpoint: {last_good_checkpoint}")

                    # Rewind files to known-good state
                    await client.rewind_files(last_good_checkpoint)
                    print("Files recovered. Ready for retry.")
                else:
                    print("Refactoring successful!")

                break

asyncio.run(main())

Output:

Task 1 result: Added connection pooling. All tests pass.
Refactoring successful!

If there had been an error:

Task 1 result: Syntax error in pool configuration
Error detected! Rewinding to checkpoint: 550e8400-e29b-41d4-a716-446655440001
Files recovered. Ready for retry.

All file changes after that checkpoint are undone instantly.
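
Deciding whether to rewind by searching the result text for the word "error" is fragile: a successful summary may mention errors, and a broken file may go unreported. A sturdier signal, sketched here under the assumption that the project is Python, is to parse the files yourself. `find_syntax_errors` is a hypothetical helper built on the standard library's `ast` module; a test run or linter would serve the same role.

```python
import ast
import tempfile
from pathlib import Path


def find_syntax_errors(root):
    """Return {relative_path: message} for every .py file under root that fails to parse."""
    root = Path(root)
    problems = {}
    for path in sorted(root.rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            problems[path.relative_to(root).as_posix()] = f"line {exc.lineno}: {exc.msg}"
    return problems


# Demo: one valid file and one file with a deliberate syntax error.
with tempfile.TemporaryDirectory() as workdir:
    Path(workdir, "good.py").write_text("x = 1\n", encoding="utf-8")
    Path(workdir, "bad.py").write_text("def broken(:\n    pass\n", encoding="utf-8")
    problems = find_syntax_errors(workdir)

print(sorted(problems))  # ['bad.py']
```

In a recovery loop, a non-empty result from a check like this would be the trigger for `client.rewind_files(...)`, instead of or in addition to the text match.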

Recovering and Retrying

import asyncio

from claude_agent_sdk import (
    ClaudeAgentOptions,
    ClaudeSDKClient,
    ResultMessage,
    UserMessage,
)

async def refactor_with_recovery(client, task_description):
    """Execute a refactoring with automatic recovery on error."""

    starting_checkpoint = None
    success = False

    # Attempt refactoring
    await client.query(task_description)

    async for message in client.receive_response():
        # The first checkpoint of the task marks the files before any edits
        if (isinstance(message, UserMessage) and message.uuid
                and starting_checkpoint is None):
            starting_checkpoint = message.uuid

        if isinstance(message, ResultMessage):
            result_text = message.result or ""  # result can be None
            if not message.is_error and "error" not in result_text.lower():
                success = True
                print(f"Success: {result_text}")
            else:
                print(f"Failed: {result_text}")
                if starting_checkpoint:
                    print(f"Recovering to checkpoint: {starting_checkpoint}")
                    await client.rewind_files(starting_checkpoint)
            break

    return success

# Usage
async def main():
    options = ClaudeAgentOptions(
        allowed_tools=["Read", "Edit"],
        enable_file_checkpointing=True,
        extra_args={"replay-user-messages": None}
    )

    async with ClaudeSDKClient(options) as client:
        success = await refactor_with_recovery(
            client,
            "Refactor our logging to use structured JSON format"
        )

        if not success:
            print("Refactoring failed and files were recovered.")
            # Now try a different approach
            await client.query("Create a new logging module in parallel without modifying existing code")
            async for message in client.receive_response():
                if isinstance(message, ResultMessage):
                    print(f"Second attempt: {message.result}")

asyncio.run(main())
Checkpoint Strategy: When to Checkpoint

Checkpointing happens automatically, but you should understand when to use recovery strategically.

Pattern 1: Risky Operations

Checkpoint BEFORE attempting risky modifications:

# Run inside an async function, with options configured as in Basic Enablement
async with ClaudeSDKClient(options) as client:
    # Get baseline checkpoint
    baseline_checkpoint = None

    # Baseline operation (intended to be read-only, so files stay unchanged)
    await client.query("Read the entire codebase and summarize architecture")
    async for message in client.receive_response():
        if isinstance(message, UserMessage) and message.uuid:
            baseline_checkpoint = message.uuid
        if isinstance(message, ResultMessage):
            break

    print(f"Baseline checkpoint: {baseline_checkpoint}")

    # NOW attempt risky operation
    await client.query("Refactor dependency injection throughout the codebase")

    async for message in client.receive_response():
        if isinstance(message, ResultMessage):
            # result can be None, and the match should ignore case
            if message.is_error or "error" in (message.result or "").lower():
                # Recover to baseline
                await client.rewind_files(baseline_checkpoint)
                print("Recovered to baseline. Ready to try different approach.")
            break

Pattern 2: Multi-Approach Comparison

Compare multiple refactoring approaches by checkpointing:

# Run inside an async function; baseline_checkpoint is captured as in Pattern 1
async with ClaudeSDKClient(options) as client:
    approaches = {}

    # Approach A: Event-driven refactor
    approach_a_checkpoint = None
    await client.query("Refactor to event-driven architecture")
    async for message in client.receive_response():
        if isinstance(message, UserMessage) and message.uuid:
            approach_a_checkpoint = message.uuid
        if isinstance(message, ResultMessage):
            approaches["event-driven"] = {
                "checkpoint": approach_a_checkpoint,
                "result": message.result or ""
            }
            break

    # Rewind to baseline for Approach B
    await client.rewind_files(baseline_checkpoint)

    # Approach B: Service-oriented refactor
    approach_b_checkpoint = None
    await client.query("Refactor to service-oriented architecture")
    async for message in client.receive_response():
        if isinstance(message, UserMessage) and message.uuid:
            approach_b_checkpoint = message.uuid
        if isinstance(message, ResultMessage):
            approaches["service-oriented"] = {
                "checkpoint": approach_b_checkpoint,
                "result": message.result or ""
            }
            break

    # Compare both approaches
    print("Approach A (Event-Driven):")
    print(approaches["event-driven"]["result"])
    print("\nApproach B (Service-Oriented):")
    print(approaches["service-oriented"]["result"])

    # Keep the better one. The files currently hold Approach B's result.
    # Placeholder criterion: replace with a real comparison (tests, review, metrics).
    prefer_event_driven = "event-driven" in approaches["event-driven"]["result"].lower()
    if prefer_event_driven:
        # Assumption: checkpoints recorded during Approach A can still be
        # restored after the rewind to baseline; verify with your SDK version
        await client.rewind_files(approaches["event-driven"]["checkpoint"])
        print("Event-driven looks better. Restored that checkpoint.")
    else:
        print("Service-oriented looks better. Keeping the current files.")
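
Result summaries only describe each approach. To compare the code itself, you can capture the files each approach produced before rewinding and diff the two captures afterwards. The sketch below does this with the standard library's `difflib`; `read_tree` and `diff_trees` are hypothetical helpers, and the demo simulates the two approaches by writing a file twice.

```python
import difflib
import tempfile
from pathlib import Path


def read_tree(root, pattern="*.py"):
    """Return {relative_path: contents} for files under root matching pattern."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): path.read_text(encoding="utf-8")
        for path in sorted(root.rglob(pattern))
    }


def diff_trees(tree_a, tree_b, label_a="approach-a", label_b="approach-b"):
    """Return a unified diff between two captures made with read_tree."""
    lines = []
    for name in sorted(set(tree_a) | set(tree_b)):
        old = tree_a.get(name, "").splitlines(keepends=True)
        new = tree_b.get(name, "").splitlines(keepends=True)
        lines.extend(difflib.unified_diff(old, new, f"{label_a}/{name}", f"{label_b}/{name}"))
    return "".join(lines)


# Demo: the second write stands in for "rewind, then run Approach B".
with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir, "app.py")
    target.write_text("import events\nevents.emit('saved')\n", encoding="utf-8")
    tree_a = read_tree(workdir)  # capture Approach A before rewinding
    target.write_text("import services\nservices.save()\n", encoding="utf-8")
    tree_b = read_tree(workdir)  # capture Approach B

diff_text = diff_trees(tree_a, tree_b)
print(diff_text)
```

Capturing the tree before each rewind also means the comparison does not depend on being able to restore an earlier approach's checkpoint later.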

Checkpoint Overhead and Session Management

File checkpointing is meant to add little overhead, but checkpoints belong to a session, so understanding the session lifecycle matters.

Session Persistence with Checkpoints

# Checkpoints remain usable when you resume the same session
# Save session ID and checkpoint during execution

session_id = None
final_checkpoint = None

async with ClaudeSDKClient(options) as client:
    await client.query("Initial task")

    async for message in client.receive_response():
        if isinstance(message, UserMessage) and message.uuid:
            final_checkpoint = message.uuid

        if isinstance(message, ResultMessage):
            # The result message carries the session ID
            session_id = message.session_id
            break

# Save these
print(f"Session: {session_id}, Checkpoint: {final_checkpoint}")

# Later: Resume and potentially rewind
async with ClaudeSDKClient(
    ClaudeAgentOptions(
        enable_file_checkpointing=True,
        resume=session_id
    )
) as client:
    # Files on disk are as the previous run left them

    # Try new approach
    await client.query("Implement feature X")

    # If that fails, rewind to final_checkpoint
    async for message in client.receive_response():
        if isinstance(message, ResultMessage):
            if message.is_error or "error" in (message.result or "").lower():
                await client.rewind_files(final_checkpoint)
                print("Reverted to saved checkpoint")
            # Stop only once the result has arrived
            break

Why This Matters for Digital FTEs

Checkpointing is the resilience mechanism that makes agents safe for production:

Without checkpointing: A buggy agent modifies 50 files incorrectly. Your customer's codebase is corrupted. You manually fix it. Trust is destroyed.

With checkpointing: A buggy agent modifies 50 files. You detect the error immediately. One command reverts everything. Customer never knows anything went wrong. Agent tries a different approach. Success.

This is why checkpoint-enabled agents can be bold. They can explore aggressive refactorings, complex architectural changes, and risky optimizations—because failure is instantly recoverable.

Try With AI

Use Claude Code or the Agent SDK for these checkpointing exercises.

Prompt 1: Enable Checkpointing and Capture UUIDs

Set up a Claude Agent SDK session with file checkpointing enabled.
Create a small Python file with intentionally poor formatting (inconsistent
indentation, long lines, unclear variable names). Then ask the agent to:
1. Reformat the file properly
2. Show me how to capture checkpoint UUIDs during execution
3. Display the checkpoint IDs for reference

Your task: Write the code that enables checkpointing, runs the refactoring,
and prints all captured checkpoint UUIDs.

What you're learning: How to configure checkpointing and identify the UUID objects that represent file states. Notice that each tool execution creates a new checkpoint—you can recover to any of them.

Prompt 2: Recover from a Deliberate Mistake

Using the same Python file from Prompt 1, ask the agent to:
1. Add a new function that has a syntax error (intentionally buggy)
2. Capture the checkpoint ID BEFORE adding the function
3. Let the agent try to fix the error
4. If the fix fails, show how rewindFiles() would recover to the
   pre-broken state

Your task: Demonstrate checkpoint capture → file modification → recovery.
Print the checkpoint IDs at each step so I can see the sequence.

What you're learning: The recovery pattern—how to detect when an agent task fails and use rewindFiles() to instantly restore files to a known-good state. This is the core resilience mechanism for production agents.

Prompt 3: Compare Multiple Refactoring Approaches

Create a multi-approach comparison:
1. Start with a basic TODO app with global state
2. Refactor to approach A: Dependency injection (checkpoint this state)
3. Rewind files to before approach A
4. Refactor to approach B: Observer pattern (checkpoint this state)
5. Show me the code differences between the two approaches

Your task: Implement the multi-approach pattern—refactoring to two different
architectures from the same starting point, comparing which one is cleaner.
Store checkpoint IDs so you can switch between them.

**Safety note**: Always capture a baseline checkpoint before risky refactorings.
This is how production agents safely explore multiple solution paths.

---

## Reflect on Your Skill

You built a `claude-agent` skill in Lesson 0. Test and improve it based on what you learned.

### Test Your Skill

Using my claude-agent skill, enable file checkpointing and implement recovery. Does my skill cover checkpoint UUIDs and rewindFiles()?

### Identify Gaps

Ask yourself:
- Did my skill explain enable_file_checkpointing configuration?
- Did it show how to capture checkpoint UUIDs and use rewindFiles()?

### Improve Your Skill

If you found gaps:

My claude-agent skill is missing file checkpointing patterns. Update it to include:

Checkpointing enablement
UUID capture from UserMessage
Recovery with rewindFiles()

---

What you're learning: Strategic checkpoint use—how teams would use checkpointing in real projects to safely explore architectural alternatives and compare solutions before committing to one.
