
The Context Engineering Playbook: Decision Frameworks for Quality

You've spent nine lessons learning the physics of context engineering. Position sensitivity. Attention budgets. Signal versus noise. Progress files. Memory injection. Context isolation. These aren't abstract concepts anymore. You understand WHY Claude forgets things, WHEN to compact versus clear, and HOW to structure information for maximum attention allocation.

Now it's time to put it all together.

This lesson is the capstone. You'll learn decision frameworks that tell you WHICH technique to apply WHEN. You'll understand how to allocate your context budget across components. And you'll build something real: a production-quality agent that demonstrates every technique from this chapter working in concert.

More importantly, you'll connect everything back to the thesis that drives this entire book.

Back to the Thesis

In Chapter 1, you learned the Agent Factory paradigm: domain experts manufacturing Digital FTEs powered by AI. The thesis:

General Agents BUILD Custom Agents.

You've spent three chapters learning the tools (Chapter 3) and the physics (this chapter). Now answer the question that actually matters:

What separates a $50/month agent from a $5,000/month agent?

Not the model. Every law firm, marketing agency, research lab, and consulting practice has access to Claude, GPT, and Gemini. Your competitor can spin up the same frontier model in minutes.

Not the basic prompts. Prompts plateau quickly. You can only polish a single instruction so much before diminishing returns set in.

The context engineering discipline.

A well-engineered agent:

  • Maintains consistency across sessions (progress files + session architecture)
  • Doesn't degrade over long conversations (attention budget management)
  • Applies tacit knowledge automatically (memory injection)
  • Scales to complex multi-step tasks without drift (context isolation)
  • Improves from use (memory consolidation)

This is what clients pay for. Whether you're building a contract review assistant, a campaign planning agent, a literature synthesis tool, or a code reviewer—the value isn't raw intelligence. Everyone has access to that. Clients pay for reliability, consistency, and domain expertise that accumulates rather than resets.

This chapter gave you the quality control toolkit. This lesson shows you how to apply it.

The Context Engineering Decision Tree

When you encounter a context problem, you need a decision framework. Not "try everything and see what works" but "diagnose the specific problem and apply the specific solution."

Here's the decision tree:

Is context utilization > 70%?
├─ YES → Compaction needed (Lesson 6)
│        Is the current task complete?
│        ├─ YES → /clear (fresh start)
│        └─ NO  → /compact with custom instructions
│                 "Preserve: [decisions, file changes, current goal]"
└─ NO  → Continue, but monitor (/context every 10 messages in yellow zone)

Is this task multi-session?
├─ YES → Progress file architecture (Lesson 7)
│        - Create claude-progress.txt
│        - Decompose into 10-15 granular tasks
│        - Use session initialization protocol
│        - Commit at session boundaries
└─ NO  → Single session workflow (standard)

Is workflow drifting from original intent?
├─ YES → Memory injection needed (Lesson 8)
│        - Set up vector DB for domain knowledge
│        - Implement PreToolUse hook
│        - Extract thinking blocks for embedding
│        - Add deduplication layer
└─ NO  → UserPromptSubmit injection is sufficient

Is work distributed across multiple agents?
├─ YES → Context isolation (Lesson 9)
│        - Use orchestrator pattern
│        - Fresh context per subagent
│        - Summary-based communication
│        - Careful tool access control
└─ NO  → Single agent workflow

Each branch points to a specific lesson and technique. The decision tree isn't "which sounds good"—it's "what does the diagnosis say?"
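
If it helps to see the same diagnosis as code, here is a minimal sketch of the tree as a function. The thresholds and lesson numbers come straight from the tree above; the function itself is purely illustrative, not part of any tooling.

# A minimal sketch of the decision tree as a diagnostic function.
# Inputs mirror the four questions; outputs name the matching technique and lesson.

def diagnose(utilization: float, task_complete: bool, multi_session: bool,
             drifting: bool, multi_agent: bool) -> list[str]:
    actions = []

    # Question 1: context utilization
    if utilization > 0.70:
        if task_complete:
            actions.append("/clear — task done, start fresh (Lesson 6)")
        else:
            actions.append("/compact with custom preserve instructions (Lesson 6)")
    else:
        actions.append("Continue, but run /context every 10 messages in the yellow zone")

    # Question 2: session span
    if multi_session:
        actions.append("Progress file architecture: claude-progress.txt, 10-15 granular tasks (Lesson 7)")

    # Question 3: drift
    if drifting:
        actions.append("Memory injection: vector DB + PreToolUse hook (Lesson 8)")

    # Question 4: distribution
    if multi_agent:
        actions.append("Context isolation: orchestrator, fresh context per subagent (Lesson 9)")

    return actions


# Example: 78% utilization, mid-task, single session, drifting, single agent
print(diagnose(0.78, task_complete=False, multi_session=False,
               drifting=True, multi_agent=False))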

Reading the Decision Tree

Let's trace through realistic scenarios across different domains.


Scenario A: Legal — Contract Review Assistant

You're reviewing a complex commercial lease agreement. Your AI assistant has been running for 45 minutes, and it's starting to miss liability clauses it caught earlier.

Decision 1: Is context > 70%?

Run /context. Output: Context: 156,000 / 200,000 tokens (78%)

Yes. You're in the orange zone. Compaction needed.

Decision 2: Is the task complete?

No—you're mid-review. You don't want to lose the context of which sections have been analyzed and what red flags were identified.

Action: /compact Focus on the lease review findings and clauses already analyzed. Discard the tangent about formatting preferences from messages 12-18.

Now you're back to ~40% utilization with the important context preserved.

Decision 3: Is this multi-session?

No—you'll finish this contract today.

No action needed for progress files.

Decision 4: Is workflow drifting?

You started reviewing for "indemnification and liability risks" but Claude has shifted to "general grammar corrections." That's drift.

Action: You could implement memory injection, but for a single session, a simpler fix works: re-state your goal explicitly. "Return focus to liability and indemnification clauses. Grammar is out of scope for this review."


Scenario B: Marketing — Campaign Planning Assistant

You're developing a product launch campaign. The AI has been helping for an hour, brainstorming messaging across multiple channels. Now its suggestions are becoming generic, losing the brand voice you established earlier.

Decision 1: Is context > 70%?

Run /context. Output: Context: 145,000 / 200,000 tokens (72%)

Yes, approaching the danger zone. Proactive compaction recommended.

Decision 2: Is the task complete?

No—you've done social media but still need email sequences and landing page copy.

Action: /compact Preserve: brand voice guidelines, approved messaging themes, target persona details. Discard: rejected tagline brainstorms, competitor research tangent.

Decision 3: Is this multi-session?

Yes—this campaign will take several days to develop fully.

Action: Create a progress file:

# Campaign: [Product] Launch

## Approved Direction

- Brand voice: [captured decisions]
- Core messaging: [finalized themes]
- Target persona: [documented]

## Completed

- Social media concepts (Instagram, LinkedIn)

## Next Session

- Email nurture sequence
- Landing page copy

Decision 4: Is workflow drifting?

The suggestions lost your brand's distinctive voice. That's knowledge drift.

Action: Memory injection would help for future sessions. For now, re-inject the brand voice document explicitly: "Review our brand voice guidelines before the next suggestion. All copy must match this tone."


Scenario C: Research — Literature Synthesis Assistant

You're synthesizing 30 papers for a systematic review. The AI has been helping categorize findings, but it's starting to misattribute claims to the wrong papers.

Decision 1: Is context > 70%?

Yes—you've loaded substantial paper summaries.

Decision 2: Is the task complete?

No—you're at paper 18 of 30.

Action: /compact Preserve: synthesis table with paper citations, methodology classification schema. Discard: detailed quotes from papers 1-10 (keep citations only).

Decision 3: Is this multi-session?

Definitely—this is a week-long project.

Action: Your progress file should track which papers have been analyzed, key findings extracted, and where conflicts exist between sources.

Decision 4: Is workflow drifting?

Attribution errors suggest memory pollution from overlapping contexts.

Action: Consider multi-round processing—analyze 5 papers per session, produce a mini-synthesis, then combine syntheses in a final session with fresh context.


Scenario D: Consulting — Proposal Development Assistant

You're developing a consulting proposal. The AI has been helping with the executive summary, scope definition, and pricing rationale. Now it's suggesting deliverables that don't match what you discussed with the client.

Decision 1: Is context > 70%?

Run /context. Moderate usage—not the primary issue.

Decision 2: Is the task complete?

No—you still need the implementation timeline and team bios section.

No compaction needed yet.

Decision 3: Is this multi-session?

Yes—proposals typically span multiple working sessions with client feedback loops.

Action: Create a progress file capturing: client requirements (from discovery call), agreed scope boundaries, pricing approach, and key differentiators.

Decision 4: Is workflow drifting?

Yes—the deliverables don't match client needs. Classic drift.

Action: Re-inject the client requirements document. "Review the discovery call notes before suggesting deliverables. Every deliverable must trace to a stated client need."


Scenario E: Development — Code Review Assistant

You're reviewing a large pull request. The AI has been running for 45 minutes, and quality is degrading—it's missing security issues it caught earlier.

Decision 1: Is context > 70%?

Run /context. Output: Context: 156,000 / 200,000 tokens (78%)

Yes. Compaction needed.

Decision 2: Is the task complete?

No—you're mid-review.

Action: /compact Focus on the PR review findings and files already analyzed. Discard the debugging tangent from messages 12-18.

Decision 3: Is this multi-session?

No—you'll finish this review today.

No action needed for progress files.

Decision 4: Is workflow drifting?

You started reviewing for "security issues" but Claude has shifted to "code style." That's drift.

Action: Re-state your goal explicitly. "Return focus to security issues. Code style is out of scope for this review."


The decision tree didn't tell you to use every technique. It told you which ones match your situation—regardless of your domain.

Context Budget Allocation

Context isn't free. Every token you add competes for attention with every other token. Understanding budget allocation helps you make tradeoffs.

| Component        | Recommended % | Notes                                              |
| ---------------- | ------------- | -------------------------------------------------- |
| System prompt    | 5-10%         | Core identity, constraints (Anthropic-controlled)  |
| CLAUDE.md        | 5-10%         | Project-specific context (your control)            |
| Tool definitions | 10-15%        | Available actions (grows with enabled tools)       |
| Message history  | 30-40%        | Conversation state (accumulates over session)      |
| Tool outputs     | 20-30%        | File reads, command results                        |
| Reserve buffer   | 10-15%        | Room for growth, unexpected needs                  |

The reserve buffer is critical. If you're running at 90% utilization, any file read might push you into degradation territory. Keep headroom.
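
To make those numbers concrete, here is a small sketch that checks a measured token breakdown against the recommended ranges and the 70%/85% thresholds. The component names and the example snapshot are illustrative; substitute whatever /context reports for your session.

# Illustrative check of a token breakdown against the allocation table above.
RECOMMENDED = {            # component: (min %, max %) of the context window
    "system_prompt": (5, 10),
    "claude_md": (5, 10),
    "tool_definitions": (10, 15),
    "message_history": (30, 40),
    "tool_outputs": (20, 30),
}

def check_budget(tokens_by_component: dict[str, int], window: int = 200_000) -> None:
    used = sum(tokens_by_component.values())
    utilization = used / window
    print(f"Utilization: {utilization:.0%}  (reserve: {window - used:,} tokens)")
    if utilization > 0.85:
        print("RED zone: mandatory compact before continuing")
    elif utilization > 0.70:
        print("ORANGE zone: compaction needed")

    for name, tokens in tokens_by_component.items():
        low, high = RECOMMENDED.get(name, (0, 100))
        share = 100 * tokens / window
        note = "" if low <= share <= high else "  <- outside recommended range"
        print(f"{name:>16}: {share:4.1f}%{note}")

# Example: a late-session snapshot where message history dominates
check_budget({
    "system_prompt": 10_000,
    "claude_md": 10_000,
    "tool_definitions": 16_000,
    "message_history": 80_000,
    "tool_outputs": 44_000,
})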

Budget Allocation by Session Phase

These percentages shift throughout a session:

Early session (first 10 messages):

  • System prompt: 15%
  • CLAUDE.md: 15%
  • Tool definitions: 20%
  • Message history: 10%
  • Tool outputs: 25%
  • Reserve: 15%

Mid session (messages 20-40):

  • System prompt: 8%
  • CLAUDE.md: 8%
  • Tool definitions: 12%
  • Message history: 35%
  • Tool outputs: 27%
  • Reserve: 10%

Late session (messages 50+):

  • System prompt: 5%
  • CLAUDE.md: 5%
  • Tool definitions: 8%
  • Message history: 50%
  • Tool outputs: 27%
  • Reserve: 5%

Notice how message history grows to dominate. This is why conversations degrade—the useful context (CLAUDE.md, tools) gets proportionally smaller as conversation noise accumulates.

The Seven Token Budgeting Strategies

When you're over budget, these strategies help you reclaim tokens without losing quality:

Strategy 1: Summarize Large Text Blocks

Before including a large document, summarize it:

Instead of pasting the entire 5,000-word contract,
summarize the key clauses relevant to this review in 200 words.

Instead of including the full research paper,
extract the methodology and key findings relevant to our synthesis.

Instead of loading the complete brand guidelines,
summarize the voice and tone principles for this campaign.

When to use: Documents > 2,000 tokens that you need for reference but don't need verbatim.
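
A rough pre-check can make this decision mechanical. The sketch below uses the common rule of thumb of roughly 4 characters per token (an approximation, not a real tokenizer) together with the 2,000-token guideline above; the helper is illustrative only.

# Rough pre-check: paste small documents, summarize large ones.
# ~4 characters per token is an estimate, not a tokenizer.

def estimate_tokens(text: str) -> int:
    return len(text) // 4

def include_or_summarize(document: str, threshold: int = 2_000) -> str:
    if estimate_tokens(document) <= threshold:
        return document              # small enough to include verbatim
    # Otherwise, request a focused summary instead of pasting the whole thing
    return ("[Document exceeds ~2,000 tokens. Summarize the sections relevant "
            "to the current task in ~200 words before including.]")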

Strategy 2: Chunk Documents into Vector DB

Store large knowledge bases in a vector database. Retrieve only relevant chunks:

# Instead of loading 50 pages of precedent contracts
query = "indemnification clause commercial lease"
relevant_chunks = vector_db.search(query, top_k=3)
# Include only the 3 most relevant clauses

# Instead of loading entire research corpus
query = "methodology randomized controlled trial"
relevant_chunks = vector_db.search(query, top_k=5)
# Include only the relevant methodology sections

When to use: Reference materials, documentation, knowledge bases that are too large to include whole.
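
If you want a concrete version of this pattern, here is a minimal sketch using ChromaDB as the vector store. Any embedding database works the same way; the collection name, IDs, and clause text are placeholders.

# A minimal sketch of chunk-and-retrieve using ChromaDB
# (any vector store works; names and documents here are placeholders).
import chromadb

client = chromadb.Client()                       # in-memory store for the example
collection = client.get_or_create_collection("precedent_contracts")

# One-time indexing: store each clause/section as its own chunk
collection.add(
    ids=["lease-07-indemnification", "lease-07-termination"],
    documents=[
        "Tenant shall indemnify Landlord against claims arising from ...",
        "Either party may terminate for convenience with 60 days notice ...",
    ],
)

# At review time: retrieve only the few chunks relevant to the question
results = collection.query(
    query_texts=["indemnification clause commercial lease"],
    n_results=2,
)
for doc in results["documents"][0]:
    print(doc)                                   # include only these in context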

Strategy 3: Offload to External Memory

Not everything needs to live in context. Use external storage:

  • Database for structured data
  • Files for persistent state
  • Vector DB for semantic search

Pass IDs or references, not full content.

When to use: Data that changes, accumulates, or exceeds context limits.
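
Here is a minimal sketch of the pass-references-not-content idea, using SQLite as the external store. The table name, column names, and document ID are illustrative.

# Store the full artifact externally; keep only a short reference in the conversation.
import sqlite3

conn = sqlite3.connect("agent_memory.db")
conn.execute("CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, body TEXT)")

def offload(doc_id: str, body: str) -> str:
    """Persist the full text and return a short reference to use in context."""
    conn.execute("INSERT OR REPLACE INTO documents VALUES (?, ?)", (doc_id, body))
    conn.commit()
    return f"[doc:{doc_id}]"            # this reference goes in context, not the body

def fetch(doc_id: str) -> str:
    """Pull the full text back only when a step actually needs it."""
    row = conn.execute("SELECT body FROM documents WHERE id = ?", (doc_id,)).fetchone()
    return row[0] if row else ""

ref = offload("lease-2024-007", "FULL 5,000-word lease text ...")
print(ref)                              # -> [doc:lease-2024-007]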

Strategy 4: Use Relevancy Checks

Before including content, verify it's needed:

# Before including a file
if "authentication" in file_content.lower() and task == "auth_review":
    include_file()
else:
    skip_file()

When to use: When you're including files "just in case" rather than because they're definitely needed.

Strategy 5: Structure Prompts Wisely

System messages persist across turns. User messages accumulate in history. Structure accordingly:

  • System messages: Stable rules, identity, constraints
  • User messages: Task-specific instructions, changing context
  • Tool results: Ephemeral data, can be summarized or dropped

When to use: When designing your context architecture from scratch.
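
One way to keep that separation honest is to assemble the three kinds of content from different buckets. The sketch below uses plain dictionaries rather than any specific SDK; the field names and the summarization step are placeholders.

# Keep stable rules, task instructions, and ephemeral data in separate buckets.
# Plain dictionaries here; field names are illustrative, not tied to a specific SDK.

SYSTEM_RULES = (
    "You are a contract review assistant. "
    "Flag indemnification clauses over $500K for partner review."
)  # stable identity and constraints: persists across every turn

conversation = [
    {"role": "user", "content": "Review section 7 for liability risks."},
]  # task-specific instructions: accumulate in history

def record_tool_result(name: str, raw_output: str) -> None:
    """Tool results are ephemeral: append a short summary, keep the raw dump external."""
    summary = raw_output[:400]  # placeholder for a real summarization step
    conversation.append({"role": "user", "content": f"[{name} summary] {summary}"})

def build_request() -> dict:
    return {"system": SYSTEM_RULES, "messages": conversation}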

Strategy 6: Monitor Real-Time

Set guardrails and check regularly:

# In CLAUDE.md

## Context Monitoring

- Run /context every 10 messages when above 50%
- At 70%+: Stop and suggest /compact
- At 85%+: Mandatory compact before continuing

When to use: Always. Monitoring should be automatic, not an afterthought.

Strategy 7: Multi-Round Processing

For tasks requiring more context than fits, process in rounds:

Round 1: Analyze files 1-5, produce summary
Round 2: Analyze files 6-10, produce summary
Round 3: Synthesize summaries into final output

Each round uses fresh context. The final round operates on summaries, not raw data.

When to use: Analysis tasks requiring more input than context allows.
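
Here is a minimal sketch of the round structure, where analyze() and synthesize() are stand-ins for however you actually invoke the agent on a slice of input.

# Multi-round processing: each round sees only its own slice plus prior summaries.
# analyze() and synthesize() are stand-ins for your actual agent calls.

def analyze(files: list[str]) -> str:
    # Stand-in: run the agent on this slice with fresh context, return its summary.
    return f"Summary of {len(files)} file(s): ..."

def synthesize(summaries: list[str]) -> str:
    # Stand-in: run the agent over the summaries only, return the final output.
    return "\n".join(summaries)

def multi_round(files: list[str], batch_size: int = 5) -> str:
    summaries = []
    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        summaries.append(analyze(batch))      # Round N: raw inputs -> summary
    return synthesize(summaries)              # Final round: summaries -> output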

The When-to-Use Framework

Techniques map to situations. Here's the quick reference:

| Situation                          | Primary Technique            | Lesson |
| ---------------------------------- | ---------------------------- | ------ |
| Context is noisy                   | Signal-to-noise audit        | 4      |
| Critical instructions ignored      | Position optimization        | 3      |
| Work spans multiple days           | Progress files               | 7      |
| Agent drifts from original intent  | Memory injection             | 8      |
| Multiple agents conflicting        | Context isolation            | 9      |
| Context utilization > 70%          | Compaction or clear          | 6      |
| Same task, different quality       | Attention budget management  | 2      |
| Claude "forgets" discussed topics  | Position + compaction        | 3, 6   |
| Tacit knowledge not being applied  | Memory extraction + storage  | 5      |

Notice that problems often have multiple contributing causes. Here are examples across domains:

Legal: "Claude keeps missing liability clauses" might be:

  • Position problem (clause types are in middle of CLAUDE.md)
  • Noise problem (review criteria buried in 400 lines of context)
  • Budget problem (context is at 80%, criteria don't get attention)

Marketing: "Claude keeps going off-brand" might be:

  • Position problem (brand voice buried in reference section)
  • Noise problem (too many example campaigns diluting the signal)
  • Knowledge problem (brand guidelines never properly extracted)

Research: "Claude keeps misattributing findings" might be:

  • Budget problem (too many papers loaded simultaneously)
  • Scalability problem (synthesis task too complex for single context)
  • Persistence problem (paper details lost between sessions)

Consulting: "Claude keeps suggesting irrelevant deliverables" might be:

  • Drift problem (original client requirements faded)
  • Position problem (scope constraints not in prime attention zone)
  • Knowledge problem (industry-specific patterns not documented)

Start with the most likely cause. If it doesn't resolve, check the next.

Quality Assessment Criteria

How do you know if your Digital FTE is production-ready? Apply these four criteria.

The four quality criteria (Consistency, Persistence, Scalability, Knowledge) are your Digital FTE's performance review metrics. A score below 3/5 on any criterion means the agent is not ready for client deployment.

1. Consistency

The test: Does it give the same quality answer at turn 1 vs turn 50?

How to measure:

  1. Define a standard test task
  2. Run it at session start
  3. Work for 30-45 minutes on other things
  4. Run the identical test task
  5. Compare outputs

Scoring:

  • 5: Outputs indistinguishable
  • 4: Minor quality difference, conclusions identical
  • 3: Noticeable quality drop, conclusions mostly correct
  • 2: Significant quality drop, some conclusions wrong
  • 1: Dramatically worse, major errors

What affects it: Attention budget management, compaction timing, position sensitivity.

2. Persistence

The test: Can it resume work after a 24-hour break?

How to measure:

  1. Work on a multi-session task for 2 hours
  2. Close the session completely
  3. Wait 24 hours
  4. Resume with claude --continue or fresh session + progress file
  5. Measure: How long to re-establish context? How much is remembered?

Scoring:

  • 5: Picks up exactly where you left off (under 5 min reconstruction)
  • 4: Minor re-explanation needed (5-10 min)
  • 3: Significant re-explanation (10-20 min)
  • 2: Most context lost (20+ min or rework)
  • 1: Effectively starting over

What affects it: Progress file quality, decision documentation, session exit protocol.

3. Scalability

The test: Can it handle 10-step tasks without drift?

How to measure:

  1. Define a complex task with 10+ sequential steps
  2. Start the task
  3. Track: Does each step align with the original goal?
  4. Note any drift points

Scoring:

  • 5: All steps aligned, goal maintained throughout
  • 4: Minor drift corrected with brief redirect
  • 3: Required 2-3 explicit corrections
  • 2: Significant drift, substantial corrections needed
  • 1: Lost the plot, task failed or required restart

What affects it: Memory injection, context isolation, explicit goal statements.

4. Knowledge

The test: Does it apply domain expertise automatically?

How to measure:

  1. Establish domain-specific rules (standards, terminology, preferences)
  2. Work on tasks that should trigger those rules
  3. Track: Did Claude apply them without prompting?

Domain examples:

  • Legal: Does it flag indemnification issues using your firm's risk thresholds?
  • Marketing: Does it match brand voice without being reminded?
  • Research: Does it apply your citation format and methodology criteria?
  • Consulting: Does it structure recommendations per your firm's framework?
  • Development: Does it follow coding standards and architecture patterns?

Scoring:

  • 5: Domain rules applied consistently, unprompted
  • 4: Applied most rules, occasional misses
  • 3: Required occasional reminders
  • 2: Required frequent reminders
  • 1: Domain rules ignored, generic responses

What affects it: CLAUDE.md signal quality, memory extraction, tacit knowledge documentation.
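
If you want the readiness rule in executable form, here is a small scorecard sketch that applies the below-3/5-on-any-criterion threshold described above.

# A small scorecard applying the readiness rule: any criterion below 3/5 fails the gate.
from dataclasses import dataclass

@dataclass
class QualityScores:
    consistency: int   # 1-5
    persistence: int   # 1-5
    scalability: int   # 1-5
    knowledge: int     # 1-5

    def production_ready(self) -> bool:
        return min(self.consistency, self.persistence,
                   self.scalability, self.knowledge) >= 3

scores = QualityScores(consistency=4, persistence=5, scalability=4, knowledge=4)
print(scores.production_ready())   # True: ready for production use with monitoring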

Lab: Build Your First Production-Quality Agent

Objective: Apply the full context engineering toolkit to build an agent worth showing to clients.

Duration: 120 minutes active work (may span multiple sessions)

Deliverable: A production-quality specialized agent with quality verification evidence.

Choose Your Domain

Select an agent type that matches your expertise:

  • Legal: Contract Review Assistant — Reviews contracts against your firm's risk criteria, flags problematic clauses, suggests negotiation points based on your jurisdiction and practice area
  • Marketing: Campaign Strategy Assistant — Develops campaigns in your brand voice, maintains messaging consistency across channels, applies your audience segmentation methodology
  • Research: Literature Synthesis Assistant — Synthesizes academic papers per your methodology, maintains attribution accuracy, applies your quality assessment criteria
  • Business: Process Improvement Assistant — Analyzes workflows using your framework, identifies optimization opportunities, structures recommendations per your consulting methodology
  • Development: Code Review Assistant — Analyzes code against your team's standards, catches architectural violations, enforces your security requirements

The domain should be specific enough that generic agents fail. Your agent should have an unfair advantage because it knows YOUR context—your firm's standards, your industry's terminology, your methodology.

Phase 1: Foundation (30 minutes)

Step 1: Audit Your CLAUDE.md

Apply the signal-to-noise audit from Lesson 4:

Read my CLAUDE.md. For each line, classify as:
- SIGNAL: Claude couldn't figure this out from code/context
- NOISE: Claude could infer this, it's wasting tokens

Target: < 60 lines of pure signal.

Step 2: Optimize for Position

Apply the three-zone strategy from Lesson 3:

# ═══════════════════════════════════════════════════
# ZONE 1: FIRST 10% — Critical constraints
# ═══════════════════════════════════════════════════

## Non-Negotiables

- [Your most critical rules here]

# ═══════════════════════════════════════════════════
# ZONE 2: MIDDLE 80% — Reference material
# ═══════════════════════════════════════════════════

## Project Context

[Less critical information]

# ═══════════════════════════════════════════════════
# ZONE 3: LAST 10% — Workflow instructions
# ═══════════════════════════════════════════════════

## How to Start Work

[Session protocols]

Step 3: Establish Baseline

Run your test task. Record the output quality. This is your "before" measurement.

Phase 2: Persistence Architecture (30 minutes)

Step 1: Create Progress File Template

Based on your domain, create claude-progress.txt:

# Agent: [Your Agent Name]

## Domain Knowledge Status

### Captured Rules

- [Rules you've documented]

### Pending Extraction

- [Tacit knowledge still in your head]

## Quality Metrics

### Last Assessment: [Date]

- Consistency: [score]/5
- Persistence: [score]/5
- Scalability: [score]/5
- Knowledge: [score]/5

## Session Log

### Session 1 ([Date])

- Focus: Foundation setup
- Changes: [what you did]
- Next: [what comes next]

Step 2: Extract Tacit Knowledge

Apply the extraction protocol from Lesson 5:

I'm building a [domain] agent. Interview me to extract tacit knowledge:

1. What decisions do I make that aren't written anywhere?
2. What would a new team member need months to figure out?
3. What patterns do I follow that feel like "common sense" but aren't obvious?
4. What mistakes do beginners make that I catch automatically?

For each answer, turn it into a concrete instruction for the agent.

Domain-specific examples:

  • Legal: "Always flag indemnification clauses over $500K" / "Watch for choice of law issues in cross-border contracts"
  • Marketing: "Our brand never uses exclamation points" / "Lead with benefit, not feature"
  • Research: "Exclude studies with n less than 30 unless qualitative" / "Always note funding sources for bias assessment"
  • Business: "Quantify impact in dollars, not percentages" / "Always include quick win recommendations"
  • Development: "Never approve PRs that change auth without security review" / "Prefer composition over inheritance"

Step 3: Document Decisions

As you make choices about your agent's behavior, record them:

## Agent Design Decisions

- Decision: Always suggest tests before implementation
- Rationale: Prevents "implement then forget to test" pattern
- Alternative rejected: Test-after (too often skipped)

- Decision: Use specific file paths, never relative
- Rationale: Subagents need absolute paths (Chapter 3 lesson)
- Alternative rejected: Context-dependent paths (error-prone)

Phase 3: Memory System (30 minutes)

Step 1: Define Your Memory Schema

What memories should your agent have? Define categories:

## Memory Categories

### Domain Rules

- Type: Persistent (never changes)
- Example: "Always use TypeScript strict mode"
- Injection: Session start

### Project Decisions

- Type: Evolving (changes with project)
- Example: "We chose PostgreSQL over MySQL because..."
- Injection: When topic arises

### Learned Preferences

- Type: Accumulated (grows over time)
- Example: "User prefers detailed explanations"
- Injection: PreToolUse (based on thinking)

Step 2: Build Initial Memory Store

Create a memories file or vector DB. Examples by domain:

# memories/domain-rules.md (Legal Example)

## Risk Thresholds

- Flag any indemnification clause over $500K for partner review
- Automatic escalation for unlimited liability provisions
- Cross-border contracts require conflicts of law analysis

## Jurisdiction Patterns

- California contracts: watch for CCPA compliance requirements
- Delaware entities: verify certificate of good standing
- International: check GDPR data processing requirements

## Common Issues

- Force majeure clauses often missing pandemic language (post-2020 standard)
- Assignment clauses need change-of-control carveouts
- Termination for convenience requires 60-day minimum in our practice

# memories/domain-rules.md (Marketing Example)

## Brand Voice

- Tone: confident but not arrogant, helpful not pushy
- Never use: "revolutionary," "game-changing," exclamation points
- Always use: active voice, second person ("you"), concrete benefits

## Channel Standards

- LinkedIn: professional tone, industry insights, thought leadership
- Instagram: aspirational, visual-first, lifestyle connection
- Email: personalization required, subject lines under 40 chars

## Campaign Patterns

- Always lead with customer pain point, not product feature
- Include social proof within first 3 sentences
- CTAs should be specific actions, not "Learn More"

# memories/domain-rules.md (Development Example)

## Coding Standards

- Use 2-space indentation (not 4, not tabs)
- Prefer early returns over nested conditionals
- All functions must have JSDoc comments

## Architecture Patterns

- Services live in /src/services
- Never import directly from node_modules in components
- All API calls go through the API layer

## Common Gotchas

- The auth token expires after 1 hour; refresh proactively
- The legacy endpoint uses XML, not JSON
- Rate limiting kicks in at 100 req/min

Step 3: Configure Memory Injection

Add to your CLAUDE.md:

## Memory Protocol

Before starting work, read: memories/domain-rules.md
When encountering unfamiliar patterns, check: memories/decisions.md
At session end, note any new learnings for: memories/learned.md

For advanced implementation, set up PreToolUse hooks per Lesson 8.
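
As a preview of that setup, the sketch below shows the kind of script a PreToolUse hook could invoke: it reads the hook's JSON payload from stdin, does a crude keyword match against your memories file, and prints any hits. The payload fields, file path, and matching logic are assumptions for illustration; the real wiring is covered in Lesson 8.

# Illustrative memory-lookup script a PreToolUse hook could invoke (see Lesson 8).
# Payload fields, the memories path, and the keyword matching are assumptions, not a spec.
import json
import sys
from pathlib import Path

MEMORY_FILE = Path("memories/domain-rules.md")

def relevant_memories(query: str) -> list[str]:
    """Return memory bullet lines sharing at least one keyword with the query."""
    keywords = {w.lower() for w in query.split() if len(w) > 3}
    hits = []
    for line in MEMORY_FILE.read_text().splitlines():
        if line.startswith("-") and keywords & set(line.lower().split()):
            hits.append(line)
    return hits

if __name__ == "__main__":
    payload = json.load(sys.stdin)                     # hook passes JSON on stdin
    query = json.dumps(payload.get("tool_input", {}))  # crude: match on the tool input text
    for memory in relevant_memories(query):
        print(memory)                                  # surfaced back per your hook setup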

Phase 4: Quality Gate (30 minutes)

Step 1: Consistency Test

  1. Start fresh session
  2. Run your standard test task
  3. Work 30 minutes on unrelated things
  4. Run the identical test task
  5. Score consistency (1-5)

Step 2: Persistence Test

  1. Work on a multi-step task for 30 minutes
  2. Close the session
  3. Wait at least 4 hours (or until tomorrow)
  4. Resume and measure reconstruction time
  5. Score persistence (1-5)

Step 3: Scalability Test

  1. Define a 10-step task in your domain
  2. Execute it start to finish
  3. Track drift points (where did focus shift?)
  4. Score scalability (1-5)

Step 4: Knowledge Test

  1. Remove explicit reminders of domain rules from your prompt
  2. Run a task that should trigger those rules
  3. Did the agent apply them unprompted?
  4. Score knowledge (1-5)

Deliverable

Your final deliverable is a folder containing:

my-production-agent/
├── CLAUDE.md               # < 60 lines, position-optimized
├── claude-progress.txt     # Progress tracking template
├── memories/
│   ├── domain-rules.md     # Persistent domain knowledge
│   ├── decisions.md        # Design decisions with rationale
│   └── learned.md          # Accumulated preferences
└── quality-assessment.md   # Scores with evidence

The quality-assessment.md should look like:

# Agent Quality Assessment

## Test Date: [Date]

## Scores

| Criterion | Score | Evidence |
| ----------- | ----- | ------------------------------------------------------- |
| Consistency | 4/5 | Minor quality drop at turn 45, conclusions identical |
| Persistence | 5/5 | Resumed after 24h in 3 minutes via progress file |
| Scalability | 4/5 | One drift correction at step 7, redirected successfully |
| Knowledge | 4/5 | Applied 8/10 domain rules unprompted |

## Overall: Ready for production use with monitoring

## Improvement Plan

1. Add explicit drift checkpoints at step 5 and 10
2. Document the 2 missed domain rules more prominently
3. Consider PreToolUse memory injection for complex tasks

This is your prototype for the Digital FTE manufacturing process you'll refine throughout the book. By the end of this lab, you'll have a production-quality Digital FTE that demonstrates the difference between "using AI" and "selling AI solutions."

Connecting to Principle 5

Everything in this chapter supports Principle 5: "Persisting State in Files."

You now understand:

  • WHY files work: They survive session boundaries where memory doesn't
  • HOW to structure them: Position sensitivity, signal density, progress tracking
  • WHEN to use them: Multi-session work, knowledge persistence, decision documentation

Without context engineering discipline, "persisting state in files" is a cargo-cult ritual. With it, file-based state becomes a superpower.

In Chapter 6, you'll learn Principle 5 explicitly, along with the other six principles of general agent problem solving. This chapter gave you the physics. Chapter 6 gives you the practices.

What You Learned

This capstone lesson synthesized the entire chapter:

  1. The decision tree tells you which technique to apply for specific situations. Diagnose first, then act.

  2. Context budget allocation shows where your tokens go. Message history dominates late sessions, which is why quality degrades.

  3. The seven token budgeting strategies help you reclaim budget: summarize, chunk, offload, relevancy check, structure, monitor, multi-round.

  4. The when-to-use framework maps situations to techniques. Most problems have multiple contributing causes.

  5. Four quality criteria define production-readiness: consistency, persistence, scalability, knowledge.

  6. The production agent lab applies everything to build something real.

The difference between a $50/month chatbot and a $5,000/month Digital FTE is now concrete. It's not magic. It's discipline. Context engineering is the manufacturing quality control that makes your AI solutions worth buying.

Try With AI

Prompt 1: Domain-Specific Quality Assessment

I'm building a [describe your domain: legal/marketing/research/business/development] agent.

Assess my agent's production-readiness:

1. Consistency (1-5): Would I get the same quality at turn 1 vs turn 50?
- What evidence supports your score?
- What's the biggest risk to consistency in my domain?

2. Persistence (1-5): Can work resume after a session break?
- What mechanisms exist for continuity?
- What domain-specific context gets lost when sessions end?

3. Scalability (1-5): Can it handle complex multi-step tasks without drift?
- Where would drift most likely occur for [domain] work?
- What safeguards exist against losing the original brief?

4. Knowledge (1-5): Does domain expertise get applied automatically?
- What [industry/field] rules are documented?
- What tacit expertise still lives only in my head?

Be honest and specific. I want actionable improvement targets that would make this agent worth $5,000/month to a client.

What you're learning: Quality assessment requires evidence, not gut feeling. This prompt forces you to justify each score with domain-specific observations, building the habit of evidence-based evaluation. You're learning to see your agent the way a paying client in your industry would.

Prompt 2: The Client Pitch

I built a [describe your agent in one sentence — include your domain].

Write a client-facing description for [target buyer: law firm partner / CMO / research director / COO / engineering manager]. The constraints:
- Don't mention "context engineering" (clients don't care about internals)
- Don't mention technical mechanisms (vector DB, hooks, etc.)
- Focus on outcomes: reliability, consistency, domain expertise, time savings

Structure it as:
1. The problem this solves for [target buyer] (2-3 sentences)
2. What makes this agent different from generic AI (2-3 bullet points)
3. Expected results (specific, measurable outcomes in their language)

This is how I'd pitch this to someone who might pay $5,000/month for it.

What you're learning: Technical quality must translate to business value. Your context engineering discipline is invisible to clients—a law firm sees "catches 95% of liability issues," not "memory injection prevents drift." This prompt trains you to articulate value in terms that matter to buyers in YOUR industry, which is essential for the Agent Factory thesis of building sellable Digital FTEs.

Prompt 3: The Improvement Roadmap

Based on my [domain] agent setup, what are the 3 highest-impact improvements?

For each improvement:
1. What specific problem it solves (in my domain context)
2. Estimated effort (hours)
3. Expected impact on quality scores
4. Dependencies (what must exist first)

Consider domain-specific factors:
- [Legal: compliance requirements, partner review workflows]
- [Marketing: brand consistency, multi-channel coordination]
- [Research: citation accuracy, methodology rigor]
- [Business: quantifiable outcomes, stakeholder communication]
- [Development: code quality, security requirements]

Prioritize by effort-to-impact ratio. I want the quick wins that move the needle most for my specific domain.

What you're learning: Continuous improvement requires domain-aware prioritization. Not every improvement matters equally in every field. This prompt trains you to think economically about quality investments—maximizing impact per hour invested for your specific industry. You're learning that quality manufacturing is iterative, not one-shot.

Chapter Summary

You've completed the context engineering discipline. Here's what you now know:

| Lesson | Core Insight                                          | Applies To                                                  |
| ------ | ----------------------------------------------------- | ----------------------------------------------------------- |
| 1      | Context, not prompts, determines agent value          | All domains: your expertise IS the value                     |
| 2      | More context isn't better; the 70% threshold matters  | Contract review, research synthesis, code analysis           |
| 3      | Position determines attention; middle = danger zone   | Critical rules (legal thresholds, brand voice, standards)    |
| 4      | 30-60% of context is noise; audit ruthlessly          | Reference materials, style guides, precedent files           |
| 5      | Tacit knowledge must be extracted and documented      | The expertise that makes YOU valuable                        |
| 6      | Know when to compact vs clear; lifecycle matters      | Long reviews, multi-document analysis, deep research         |
| 7      | Progress files enable multi-session work              | Week-long projects across any domain                         |
| 8      | Memory injection prevents workflow drift              | Maintaining focus on original brief/requirements             |
| 9      | Clean context beats dirty state for multi-agent       | Complex workflows with multiple specialized agents           |
| 10     | Decision frameworks turn knowledge into practice      | Knowing WHICH technique for WHICH problem                    |

This isn't abstract theory. You've built a production-quality agent for YOUR domain. You've measured its quality against concrete criteria. You've created artifacts you can show to clients—whether they're law firms, marketing agencies, research institutions, or engineering teams.

The Agent Factory thesis is now operational: you know how to manufacture Digital FTEs with quality control. Whether your expertise is in contracts, campaigns, citations, or code—the discipline you've learned here applies to all of them.

Welcome to professional context engineering.