
The Three Core Operational Constraints of LLMs

Before you can effectively orchestrate AI agents or build Digital FTEs, you need to understand the fundamental nature of the technology you're working with. Large Language Models (LLMs)—the reasoning engines powering Claude Opus 4.5, GPT-5.2, Gemini 3, and every AI coding agent built on them—operate under three core constraints that shape everything about how you work with them. Misunderstanding these constraints is the root cause of most frustrations developers have with AI tools.

These aren't bugs to be fixed or limitations to be worked around. They're fundamental characteristics of how LLMs work. Every methodology in this book—from Spec-Driven Development to context engineering to the OODA loop—exists because of these constraints. Understanding them transforms you from someone who fights the technology to someone who works with it.


Constraint 1: LLMs Are Stateless

The Reality: Every time you send a message to an LLM, it has no memory of previous interactions. The model doesn't "remember" your last conversation, your preferences, or even what you said five minutes ago in the same chat session. This is true for every model—Claude Opus 4.5, GPT-5.2, Gemini 3—regardless of how advanced they are.

What "Stateless" Actually Means:

Think of each API call to an LLM as a completely fresh start—like talking to someone with total amnesia. The model receives your message, processes it, generates a response, and then immediately forgets everything. The next message arrives to a blank slate.

The Illusion of Memory:

When you have a conversation in ChatGPT or Claude, it appears the model remembers earlier messages. But here's what's actually happening: the application (not the model) stores your conversation history, and with each new message, the entire conversation is re-sent to the model. The model reads the whole conversation from scratch every single time.

Message 1: You -> "Hi, my name is Maya"
[Model processes, responds, forgets everything]

Message 2: You -> "What's my name?"
[Application sends: "Hi, my name is Maya" + "What's my name?"]
[Model reads full history, responds "Maya", forgets everything]

Message 3: You -> "I'm building a registration system"
[Application sends: Message 1 + Message 2 + Response 2 + Message 3]
[Model reads ENTIRE history from scratch, responds, forgets everything]
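
To make this concrete, here is a minimal sketch of the loop a chat application runs around a stateless model. The call_model function is a hypothetical stand-in for whatever provider API you use; the point is that the application, not the model, owns the history and re-sends all of it on every call.

# Minimal sketch of how a chat application fakes memory around a stateless model.
def call_model(messages: list[dict]) -> str:
    # Hypothetical stand-in: substitute your provider's chat API call here.
    return f"(reply generated from {len(messages)} messages of re-sent history)"

history: list[dict] = []   # the application, not the model, owns this list

def send(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = call_model(history)          # the ENTIRE history goes out on every call
    history.append({"role": "assistant", "content": reply})
    return reply

send("Hi, my name is Maya")
send("What's my name?")   # only answerable because message 1 was re-sent, not remembered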

Why This Matters for AI-Native Development:

  1. Context must be explicitly provided: The model doesn't know about your project unless you tell it—every time. This is why AGENTS.md exists: it provides persistent context that gets loaded into every interaction.

  2. Specifications aren't optional—they're essential: When you provide a detailed specification, you're not just being thorough; you're giving the model the only information it will ever have about your requirements.

  3. Session continuity is an application feature, not a model feature: Tools like Claude Code maintain context by re-injecting it with every interaction. Understanding this helps you work within the system rather than expecting "memory" that doesn't exist.

  4. Long projects require explicit state management: For complex projects, you need to maintain specifications, decisions, and context in files (like SPEC.md or PROJECT_CONTEXT.md) that can be re-loaded into each session.
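
A minimal sketch of this kind of state re-injection, assuming the file names above (AGENTS.md, SPEC.md) and a chat-style messages API:

from pathlib import Path

def build_session_messages(task: str) -> list[dict]:
    # Persistent, version-controlled context that survives between sessions.
    context_files = ["AGENTS.md", "SPEC.md"]
    system_prompt = "\n\n".join(
        Path(name).read_text() for name in context_files if Path(name).exists()
    )
    return [
        {"role": "system", "content": system_prompt},  # re-injected at the start of every session
        {"role": "user", "content": task},
    ]

messages = build_session_messages("Implement the rate limiting described in SPEC.md")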

Practical Implications:

What You Might Expect | What Actually Happens | How to Adapt
"It remembers my coding style" | Model has no memory of past sessions | Provide style guides in AGENTS.md or specifications
"It knows my project" | Each message starts fresh | Give the model access to your codebase; use AI-first IDEs and/or coding agents
"It learned from our last conversation" | No learning occurs between sessions | Document decisions and re-inject context
"It will remember this preference" | Preferences are forgotten immediately | Encode preferences in persistent configuration files

Constraint 2: LLMs Are Probabilistic, Not Deterministic

The Reality: Given the exact same input, an LLM will often produce different outputs. Unlike traditional software that returns consistent results, LLMs generate responses by sampling from probability distributions, introducing inherent variability. This applies equally to Claude Opus 4.5, GPT-5.2, and Gemini 3—the probabilistic nature is fundamental to how all transformer-based models work, not a limitation of older generations.

What "Probabilistic" Actually Means:

When an LLM generates text, it doesn't have one "correct" answer stored somewhere. Instead, at each step, it calculates the probability of many possible next tokens (words or word-parts) and selects one. This selection process involves randomness.

Prompt: "The best programming language for web development is"

Run 1: "JavaScript, because..." (probability sampled)
Run 2: "TypeScript, given..." (different sample from same distribution)
Run 3: "Python with Django..." (yet another valid sample)

Each response is valid and reasonable—but they're different. The model isn't making mistakes; it's working exactly as designed.

The Temperature Factor:

All frontier models—Claude Opus 4.5, GPT-5.2, and Gemini 3—have a "temperature" parameter that controls how much randomness influences output:

  • Temperature = 0: The model always picks the highest-probability token. More deterministic, but can be repetitive and less creative.
  • Temperature = 0.7 (typical default): Balanced creativity and coherence. The same prompt will yield varied but sensible responses.
  • Temperature = 1.0+: High creativity, but increased risk of incoherent or unexpected outputs.

Even at temperature = 0, subtle variations can occur due to floating-point calculations and batching.
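
A minimal sketch of temperature-scaled sampling, using made-up scores for a handful of candidate tokens, shows why the same prompt can legitimately produce different continuations:

import math
import random

def sample_next_token(logits: dict[str, float], temperature: float) -> str:
    # Scale the raw scores by temperature, then convert to probabilities (softmax).
    scaled = {tok: score / max(temperature, 1e-9) for tok, score in logits.items()}
    total = sum(math.exp(s) for s in scaled.values())
    probs = {tok: math.exp(s) / total for tok, s in scaled.items()}
    # Pick one token at random according to those probabilities.
    return random.choices(list(probs), weights=list(probs.values()))[0]

# Made-up scores for candidate continuations of the prompt above.
logits = {"JavaScript": 2.1, "TypeScript": 1.9, "Python": 1.7, "Rust": 0.4}
print([sample_next_token(logits, temperature=0.7) for _ in range(5)])
# Different runs print different mixes; near temperature 0 the top token dominates.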

Why This Matters for AI-Native Development:

  1. You cannot expect identical code from identical prompts: Running the same specification through an AI agent twice may produce two different (but both valid) implementations. This isn't a bug—it's the nature of the technology.

  2. Validation becomes essential, not optional: Because outputs vary, you must validate that any given output meets your specification. This is why SDD's "Validate" phase isn't bureaucracy—it's a fundamental requirement.

  3. Specifications provide anchoring: While outputs vary, a clear specification constrains the space of valid outputs. Vague prompts yield wildly varying results; precise specifications yield variations within acceptable bounds.

  4. Test-Driven Development takes on new importance: Tests define the invariants that must hold regardless of implementation variation. When AI generates code, tests verify that the probabilistic output meets deterministic requirements (see the sketch after this list).

  5. Iteration is the norm, not the exception: Plan for 1-2 refinement cycles even with perfect specifications. The first output might be 95% correct but need adjustment. This is normal—factor it into your workflow.
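
As a sketch of point 4, the assertions below pin down invariants for a prime-checking function. The implementation shown is just one plausible AI-generated version; any valid implementation, however it is written, must satisfy the same assertions:

def is_prime(n: int) -> bool:
    # One possible AI-generated implementation; the assertions below do not care which.
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

# Invariants that must hold for ANY valid implementation.
assert is_prime(2) and is_prime(13) and is_prime(97)
assert not is_prime(0) and not is_prime(1) and not is_prime(91)   # 91 = 7 * 13
assert all(not is_prime(n) for n in range(4, 100, 2))             # even numbers > 2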

Practical Implications:

What You Might Expect | What Actually Happens | How to Adapt
"Same prompt, same code" | Different outputs each run | Use specs to constrain variation; validate outputs
"It gave me this last time" | No guarantee of consistency | Version control what worked; don't rely on reproducibility
"Fix this bug" yields the same fix | Multiple valid fixes exist | Review each fix; test against requirements
"Generate this test suite" twice | Different test cases emerge | Define required coverage explicitly; merge the best of multiple runs

The Power of Probabilistic Outputs:

This isn't purely a limitation—it's also a strength. Probabilistic generation means:

  • Creative problem-solving: The model can suggest approaches you hadn't considered
  • Multiple valid solutions: You can generate several implementations and choose the best
  • Exploration of solution space: Running a prompt multiple times can reveal different angles on a problem

The key is working with this characteristic rather than fighting it.


Constraint 3: Context Is Limited, Not Infinite

The Reality: LLMs have a fixed "context window"—a maximum amount of text they can process at once. This window must hold everything: the system prompt, your conversation history, uploaded files, and the model's response. Once it is full, older content gets pushed out or truncated.

What "Context Window" Actually Means:

Think of the context window as the model's working memory—everything it can "see" at any given moment. As of early 2026, the frontier models have impressive but still finite context windows:

Model | Context Window | Approximate Equivalent
GPT-5.2 (OpenAI) | 256K tokens | ~200,000 words / ~600 pages
Claude Opus 4.5 (Anthropic) | 200K tokens | ~150,000 words / ~500 pages
Gemini 3 Pro (Google) | 2M tokens | ~1,500,000 words / ~5,000 pages

Even Gemini 3's impressive 2-million-token window—the largest among frontier models—fills up quickly on enterprise codebases. And larger context windows come with tradeoffs: increased latency, higher costs, and potential "lost in the middle" effects where information in the center of a very long context gets less attention than content at the beginning or end.

These numbers sound huge—until you realize what fills them:

System prompt (AGENTS.md, skills, instructions): 5,000-20,000 tokens
Your specification: 1,000-5,000 tokens
Conversation history: 2,000-50,000 tokens
Relevant code files: 10,000-100,000 tokens
Model's response: 1,000-10,000 tokens
-------------------------------------------
TOTAL: Can exceed limit quickly on complex projects
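
A back-of-the-envelope budget check makes this tangible. The four-characters-per-token ratio is a rough assumption, not a real tokenizer, and the stand-in strings below would be your actual files and history:

def rough_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English prose and code.
    return len(text) // 4

CONTEXT_WINDOW = 200_000  # tokens; adjust to the model you are using

planned_context = {
    "system prompt": "x" * 40_000,        # stand-ins: replace with real file contents
    "specification": "x" * 12_000,
    "code files": "x" * 300_000,
    "conversation so far": "x" * 80_000,
}

used = sum(rough_tokens(text) for text in planned_context.values())
print(f"~{used:,} of {CONTEXT_WINDOW:,} tokens used ({used / CONTEXT_WINDOW:.0%}), "
      f"leaving {CONTEXT_WINDOW - used:,} for the response")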

The Consequences of Limited Context:

  1. Information gets lost: In long conversations, early messages may be truncated or summarized. The model "forgets" not because it has bad memory, but because the information no longer fits.

  2. Large codebases don't fit: A moderately complex project might have 50,000-500,000+ lines of code. Even the largest context windows can't hold it all.

  3. Context is zero-sum: Every token spent on conversation history is a token not available for code, specifications, or response generation.

Why This Matters for AI-Native Development:

  1. Context engineering is a core skill: Deciding what goes into the context window—and what doesn't—directly impacts output quality. This is why MCP (Model Context Protocol) and progressive skill loading exist.

  2. AI-First IDEs solve context problems: Tools like Cursor and Windsurf don't just send your code to the model. They intelligently select relevant code—files you're editing, imports, related functions—to maximize the value of limited context (a toy version of this selection follows this list).

  3. Specifications compress intent: A well-written specification conveys requirements in fewer tokens than scattered conversations. "Build a registration system with these 10 criteria" is more efficient than a 50-message back-and-forth that discovered those criteria.

  4. Project structure matters: How you organize code affects what context the AI can access. Small, well-named files with clear responsibilities are easier to selectively include than monolithic files.

  5. Conversation management becomes strategy: Long debugging sessions may need to be "reset" by starting a fresh conversation with only the relevant context, rather than dragging along thousands of tokens of irrelevant history.
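
A toy version of the selection step from point 2, scoring files by keyword overlap with your question and stopping at a token budget. Real tools use far smarter ranking (embeddings, dependency graphs); this only illustrates the shape of the problem:

from pathlib import Path

def select_files(question: str, root: str = ".", budget_tokens: int = 30_000) -> list[str]:
    keywords = {w.strip("?.,").lower() for w in question.split() if len(w) > 3}
    scored = []
    for path in Path(root).rglob("*.py"):            # only Python files, for brevity
        text = path.read_text(errors="ignore")
        score = sum(text.lower().count(k) for k in keywords)
        if score:
            scored.append((score, path, len(text) // 4))   # ~4 chars per token
    selected, used = [], 0
    for score, path, tokens in sorted(scored, reverse=True):
        if used + tokens <= budget_tokens:
            selected.append(str(path))
            used += tokens
    return selected

print(select_files("Why does session expiry fail after a password reset?"))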

Practical Implications:

What You Might Expect | What Actually Happens | How to Adapt
"It can see my whole codebase" | Only selected files fit in context | Use AI-First IDEs with smart context selection
"It remembers our whole conversation" | Old messages get truncated | Keep conversations focused; start fresh for new topics
"More context is always better" | Noise dilutes relevant information | Curate context carefully; quality over quantity
"It knows everything in my repo" | Context window has hard limits | Structure repos for selective inclusion; use AGENTS.md for orientation

Context Management Strategies:

For Specifications: Front-load the most important constraints. If context is truncated, critical requirements at the top survive while nice-to-haves at the bottom may be lost.

For Code: Reference files by path rather than pasting entire contents when the AI has file system access. Let it retrieve what it needs.

For Conversations: When a conversation becomes unwieldy, summarize progress and start fresh:

"Let's reset. We're building a user registration system.
We've decided on:
- PostgreSQL database
- bcrypt for passwords
- AWS SES for email

Now I need help with the rate limiting implementation."

For Projects: Maintain a PROJECT_CONTEXT.md file that captures critical decisions, architecture overview, and current status. This becomes your "state injection" for new sessions.


How the Three Constraints Interconnect

These constraints aren't isolated—they compound and interact:

Stateless + Limited Context: Because the model doesn't remember previous sessions, you must re-inject context every time. But context is limited, so you must re-inject efficiently. This is why AGENTS.md files are concise rather than exhaustive.

Probabilistic + Stateless: Each session starts fresh and produces variable output. Without persistent state, you can't even guarantee the model will approach a problem the same way twice. This is why version control and explicit documentation of decisions matter.

Probabilistic + Limited Context: When context is constrained, the model has less information to anchor its probabilistic generation. Vague specifications plus limited context yield wildly varying outputs. Clear specifications within context constraints yield useful variation within bounds.


The Methodological Response

Every practice in this book exists because of these three constraints:

Constraint | Creates the Need For | Addressed By
Stateless | Persistent context that survives sessions | AGENTS.md, SPEC.md, MCP, Skills
Probabilistic | Validation of variable outputs | SDD's Validate phase, TDD, Quality Gates
Limited Context | Efficient representation of requirements | Specification-first thinking, context engineering

Spec-Driven Development addresses all three:

  • Specifications persist across sessions (counters statelessness)
  • Clear specs constrain probabilistic outputs (reduces variation)
  • Concise specs maximize use of context window (respects limits)

AGENTS.md is a direct response to statelessness—a persistent file that gets injected into context to give every session consistent baseline knowledge.

MCP (Model Context Protocol) addresses the context limit by allowing agents to dynamically retrieve information rather than requiring everything upfront.

Test-Driven Development accepts probabilistic outputs by defining invariants (tests) that any valid implementation must satisfy, regardless of how it's generated.


The Developer's Mental Model

Understanding these constraints transforms how you think about AI collaboration:

Old Mental Model (Incorrect):

  • "The AI is like a knowledgeable colleague who remembers our work together"
  • "If I explain something once, it knows it"
  • "The same prompt should give me the same result"

New Mental Model (Correct):

  • "The AI is like a brilliant expert with amnesia who I brief from scratch each time"
  • "I must provide all relevant context every session, concisely"
  • "I specify what I need, validate what I get, and expect iteration"

This mental model isn't pessimistic—it's pragmatic. When you understand the constraints, you stop fighting them and start designing workflows that work with them. That's when AI becomes genuinely productive rather than frustrating.


The Hallucination Risk

LLMs can confidently generate code that looks correct but contains subtle bugs, references non-existent APIs, or implements logic that doesn't match your intent. This isn't lying—it's the probabilistic nature producing confident-sounding outputs from statistical patterns. This is why validation isn't optional: you cannot trust AI-generated code without verification.
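
One cheap, partial check is to confirm that the modules and attributes referenced by generated code actually exist before you trust it. A minimal sketch (it catches phantom APIs, not logic bugs; the names checked below are illustrative, not real model output):

import importlib

def api_exists(module_name: str, attr: str | None = None) -> bool:
    # True only if the module imports and (optionally) exposes the attribute.
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return attr is None or hasattr(module, attr)

print(api_exists("hashlib", "sha256"))   # True: real API
print(api_exists("hashlib", "sha513"))   # False: plausible-sounding but nonexistent
print(api_exists("requests_turbo"))      # False: hallucinated package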


Context and Cost

Every token in the context window costs money. Frontier model APIs charge per input and output token. A poorly managed context (stuffing irrelevant files, long conversation histories) directly increases costs. Efficient specifications and smart context engineering aren't just about quality—they're about economics at scale.
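
The arithmetic is simple but worth internalizing. The per-token prices below are placeholders, not any provider's actual rates; the shape of the calculation is what matters:

# Hypothetical prices, in dollars per million tokens; substitute your provider's rates.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000 * PRICE_PER_M_INPUT
            + output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT)

# A stuffed context vs. a curated one, for the same 2,000-token answer:
print(f"${call_cost(150_000, 2_000):.3f} per call with a stuffed context")
print(f"${call_cost(20_000, 2_000):.3f} per call with a curated context")
# Multiply by hundreds of agent calls per day and curation pays for itself.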


Try With AI

Constraint Exploration Exercise

Use your AI companion to explore these constraints firsthand:

I want to understand how LLMs work. Let's do an experiment.

1. STATELESSNESS: Ask me 3 questions about myself (name, favorite color, current project).
I'll answer. Then I want you to explain: Do you actually "remember" my answers,
or are you reading them from the conversation history? What would happen if
this conversation's history was cleared?

2. PROBABILISTIC: Generate a Python function to check if a number is prime.
Then generate it again without looking at the first version.
Compare the two implementations—why are they different?

3. CONTEXT LIMITS: Explain what happens if I paste an entire 50,000-line codebase
into this conversation. What would get lost? How should I share large codebases
with you instead?

Help me build intuition for how these constraints affect how I should work with you.

What you're learning: Direct experience with the constraints. This builds visceral understanding that reading about them cannot provide.

Context Engineering Exercise

I'm going to practice context engineering. Here's a challenge:

I have a user authentication system with:
- 15 source files (about 3,000 lines total)
- 5 database migrations
- 3 configuration files
- An existing README

I want your help debugging a session management bug.

Help me figure out:
1. What's the minimum context you need to help effectively?
2. What information would just add noise?
3. How should I structure my question to maximize your helpfulness
within context limits?
4. If you could only see 3 files, which would be most valuable?

Don't make assumptions about my code—ask clarifying questions
that help you understand what context you actually need.

What you're learning: How to think about context curation. This skill directly impacts how effective your AI collaboration becomes.


Summary: The Three Constraints

Constraint | What It Means | Your Response
Stateless | No memory between sessions; conversation history is re-sent each time | Persist context in files (AGENTS.md, specs); design for fresh starts
Probabilistic | Variable outputs from identical inputs; no guaranteed reproducibility | Validate all outputs; use specs to constrain variation; embrace iteration
Limited Context | Fixed window for all information; content competes for space | Engineer context carefully; prioritize quality over quantity; use smart tools

Understanding these constraints is prerequisite knowledge for everything else in this book. They explain why the methodologies exist and how to apply them effectively. With this foundation, you're ready to learn the specific techniques that transform these constraints from limitations into design parameters.