Axiom X: Observability Extends Verification
It is 2:47 AM. Your phone buzzes with an alert from a customer: "The AI assistant is giving wrong answers about pricing." You check your test suite -- all 347 tests pass. You check your CI pipeline -- the last deployment was green across every stage. You check your type system -- no errors. Every pre-deployment check says the system works correctly. But in production, right now, it does not.
You open your logs. Nothing useful -- just print("Processing request...") scattered through the code. You check metrics. There are none. You have no idea how many requests are affected, when the problem started, or what changed. Your comprehensive test suite, your type system, your CI pipeline -- none of them can tell you what is happening right now, in production, to real users.
This is the gap that Axiom X closes. Tests verify behavior before deployment. Observability verifies behavior during deployment. Without both, your verification system has a blind spot the size of production itself.
The Problem Without This Axiom
Consider what the first nine axioms give you:
- Axiom I (Shell as Orchestrator): You can coordinate tools and workflows
- Axiom V (Types Are Guardrails): You catch structural errors at compile time
- Axiom VII (Tests Are the Specification): You verify behavior against specifications
- Axiom IX (Verification is a Pipeline): You automate all pre-deployment checks
This is powerful. But it all happens before your code reaches users. Once deployed, you are blind. The system could be:
- Responding correctly to tests but slowly degrading under real load
- Passing all type checks but consuming 10x expected tokens per request
- Green on CI but silently returning stale cached responses
- Functioning perfectly except for one edge case that affects 5% of users
Pre-deployment verification answers: "Does this code work correctly?" Post-deployment observability answers: "Is this code working correctly right now?" Both questions matter. Neither answer substitutes for the other.
The Axiom Defined
Axiom X: Observability Extends Verification. Runtime monitoring extends pre-deployment verification. Tests verify behavior before deployment; observability verifies behavior IN production. Together they form a complete verification system.
The word "extends" is precise. Observability does not replace testing -- it extends the verification boundary from "before deployment" to "always." Think of it as the verification spectrum:
| Phase | Tools | What It Catches | When |
|---|---|---|---|
| Pre-deployment | Linting, types, tests, CI | Logic errors, type mismatches, regressions | Before users see it |
| Post-deployment | Logs, metrics, traces, alerts | Performance degradation, edge cases, real-world failures | While users experience it |
A system with only pre-deployment verification is like a car that passes inspection but has no dashboard gauges. You verified it works -- but you have no way to know when it stops working.
From Principle to Axiom
In Chapter 4, Principle 7 introduced observability as visibility into what AI is doing -- seeing agent actions, understanding rationale, tracing execution. That principle focused on trust: if you cannot see what the agent does, you cannot trust it.
Axiom X takes this further. The principle is about human-AI collaboration transparency. The axiom is about production engineering discipline:
| Principle 7 (Chapter 4) | Axiom X (This Lesson) |
|---|---|
| See what the AI did | Monitor what the system is doing continuously |
| Activity logs for debugging | Structured logs, metrics, traces for operations |
| Trust through visibility | Confidence through measurement |
| Developer experience | Production reliability |
| "What happened?" | "What is happening right now, and is it normal?" |
Principle 7 gave you the mindset: make things visible. Axiom X gives you the engineering toolkit: structured observability as a first-class system concern, not an afterthought.
The Three Pillars of Observability
Production observability rests on three complementary pillars. Each answers a different question, and no single pillar suffices alone.
Pillar 1: Logs (What Happened?)
Logs are structured records of discrete events. They tell you what the system did at specific moments.
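A minimal sketch of the contrast, using only the standard library; the event names and field values are illustrative:

```python
import json
import sys
from datetime import datetime, timezone

# Free-form logging: readable by a human, hard to search or aggregate.
print("Processing request req-abc-123 for user user-456...")

# Structured logging: each event is a machine-parseable record.
def log_event(event: str, **fields) -> None:
    record = {
        "event": event,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    json.dump(record, sys.stdout)
    sys.stdout.write("\n")

log_event("request_received", request_id="req-abc-123", user_id="user-456")
log_event("request_completed", request_id="req-abc-123", duration_ms=182)
```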
Structured logs use key-value pairs instead of free-form strings. This makes them machine-parseable -- you can search, filter, and aggregate across millions of log entries programmatically.
Pillar 2: Metrics (How Much? How Fast?)
Metrics are numerical measurements over time. They tell you about system behavior in aggregate.
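A sketch assuming the prometheus_client package (any metrics library follows the same pattern); the metric names, labels, and simulated work are illustrative:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Counter: a value that only goes up (requests handled, errors seen).
REQUESTS = Counter("agent_requests_total", "Requests handled", ["status"])
# Histogram: a distribution, from which percentiles like p95 are derived.
LATENCY = Histogram("agent_request_latency_seconds", "Request latency in seconds")

def handle_request() -> None:
    with LATENCY.time():                       # records how long this block takes
        time.sleep(random.uniform(0.05, 0.3))  # stand-in for real work
    REQUESTS.labels(status="success").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a scraper such as Prometheus
    while True:
        handle_request()
```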
Metrics answer questions like: "How many requests per second are we handling?" "What is the 95th percentile response time?" "Is error rate increasing?" These are questions logs cannot answer efficiently -- you would need to scan every log entry and compute aggregates yourself.
Pillar 3: Traces (Where Did Time Go?)
Traces follow a single request through your entire system, showing how time was spent across components.
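A sketch assuming the OpenTelemetry API; exporter configuration is omitted, so the spans are no-ops until an SDK is wired up. The span names and sleep durations are illustrative:

```python
import time

from opentelemetry import trace

tracer = trace.get_tracer("chat.request")

def handle_request(prompt: str) -> str:
    # The parent span covers the whole request; child spans show where the time went.
    with tracer.start_as_current_span("handle_request"):
        with tracer.start_as_current_span("validate_input"):
            cleaned = prompt.strip()
        with tracer.start_as_current_span("call_model"):
            time.sleep(0.5)           # stand-in for the model call
            answer = f"echo: {cleaned}"
        with tracer.start_as_current_span("store_result"):
            time.sleep(0.05)          # stand-in for database storage
        return answer
```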
A trace might reveal: "This request took 4.2 seconds total -- 0.1s for validation, 3.8s waiting for the model, 0.3s for database storage." Without traces, you only know the total time. With traces, you know exactly where the bottleneck is.
Why All Three Together
| Scenario | Logs Alone | Metrics Alone | Traces Alone | All Three |
|---|---|---|---|---|
| "Why is the system slow?" | Shows individual slow requests | Shows 95th percentile is high | Shows where time is spent | Full picture: which requests, how many, and exactly why |
| "Is something broken?" | Shows error messages | Shows error rate is 5% | Shows which service fails | Full picture: what errors, how widespread, and the exact failure path |
| "How much does this cost?" | Shows per-request token counts | Shows total token usage trend | Shows which operations consume tokens | Full picture: cost per user, per feature, trending over time |
Python Observability Toolkit
Here is a practical implementation using the tools you will encounter in Python development.
Structured Logging with structlog
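A minimal sketch assuming the structlog package; the request and user IDs are the illustrative values shown in the output below:

```python
import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,                     # adds "level": "info"
        structlog.processors.TimeStamper(fmt="iso", utc=True),  # adds an ISO timestamp
        structlog.processors.JSONRenderer(),                    # one JSON object per line
    ]
)

log = structlog.get_logger()
log.info(
    "request_processing_started",
    request_id="req-abc-123",
    user_id="user-456",
)
```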
This produces machine-parseable JSON output:
{"event": "request_processing_started", "request_id": "req-abc-123", "user_id": "user-456", "level": "info", "timestamp": "2025-06-15T14:32:15.123Z"}
Log Levels: Signal vs. Noise
Choosing the right log level determines whether your logs are useful or overwhelming:
| Level | Purpose | Example | Production Visibility |
|---|---|---|---|
| DEBUG | Development details | Variable values, loop iterations | Off in production |
| INFO | Normal operations | Request started, task completed | Always visible |
| WARNING | Unexpected but handled | Retry succeeded, fallback used | Always visible |
| ERROR | Failures requiring attention | API call failed, invalid input | Triggers alert |
| CRITICAL | System-level failures | Database down, out of memory | Wakes someone up |
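A sketch of the same levels in code, continuing with structlog; the events, the retry threshold, and the simulated failure are illustrative:

```python
import structlog

log = structlog.get_logger()

def call_payment_api(order_id: str, attempt: int) -> None:
    log.debug("payment_request_built", order_id=order_id, attempt=attempt)  # development detail
    log.info("payment_started", order_id=order_id)                          # normal operation
    if attempt > 1:
        log.warning("payment_retry", order_id=order_id, attempt=attempt)    # unexpected but handled
    try:
        raise TimeoutError("gateway did not respond")                        # stand-in for a real failure
    except TimeoutError as exc:
        log.error("payment_failed", order_id=order_id, error=str(exc))      # needs attention
        # log.critical(...) is reserved for system-level failures
        # that should wake someone up, e.g. the database being down.
```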
Correlation IDs: Connecting the Dots
Without correlation, debugging distributed systems is impossible. A correlation ID ties all log entries for a single request together:
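A sketch of one way to do this with structlog's bind(); the request flow and field names are illustrative:

```python
import uuid

import structlog

def handle_request(payload: dict) -> None:
    # Generate one correlation ID per request and bind it once;
    # every subsequent log call through this logger carries it automatically.
    log = structlog.get_logger().bind(correlation_id=str(uuid.uuid4()))

    log.info("request_received", size=len(payload))
    log.info("model_call_started")
    log.info("model_call_finished", duration_ms=3800)
    log.info("request_completed")
```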
Now every log entry in that request's lifecycle shares the same correlation_id. When something fails, you search for that ID and see the complete story.
Observability for AI Agents
AI agents introduce observability challenges that traditional web applications do not face. You need to monitor dimensions that did not exist before.
Dimension 1: Token Usage Tracking
Tokens are both your cost driver and your quality signal.
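A sketch of per-request token accounting; the field names, the baseline budget, and the shape of the usage data are assumptions, not any specific provider's API:

```python
from dataclasses import dataclass

import structlog

log = structlog.get_logger()

@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int

    @property
    def total(self) -> int:
        return self.prompt_tokens + self.completion_tokens

def record_token_usage(request_id: str, usage: TokenUsage, baseline_total: int = 1500) -> None:
    log.info(
        "token_usage",
        request_id=request_id,
        prompt_tokens=usage.prompt_tokens,
        completion_tokens=usage.completion_tokens,
        total_tokens=usage.total,
    )
    # Flag requests that blow past the expected budget -- both a cost and a quality signal.
    if usage.total > baseline_total:
        log.warning("token_budget_exceeded", request_id=request_id, total_tokens=usage.total)

record_token_usage("req-abc-123", TokenUsage(prompt_tokens=900, completion_tokens=2400))
```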
Dimension 2: Response Quality Metrics
Unlike traditional APIs, AI responses can be "correct" structurally but poor in quality.
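A sketch of recording quality signals; the refusal markers and rating scale are illustrative heuristics -- real systems combine heuristics, user feedback, and sampled review:

```python
import structlog

log = structlog.get_logger()

# Illustrative markers for detecting refusals; tune for your own agent.
REFUSAL_MARKERS = ("i cannot help", "i'm unable to")

def record_response_quality(request_id: str, response: str, user_rating: int | None) -> None:
    refused = any(marker in response.lower() for marker in REFUSAL_MARKERS)
    log.info(
        "response_quality",
        request_id=request_id,
        response_chars=len(response),
        refused=refused,
        user_rating=user_rating,   # e.g. a 1-5 rating or thumbs up/down from the UI
    )
    if refused or (user_rating is not None and user_rating <= 2):
        log.warning("low_quality_response", request_id=request_id)

record_response_quality("req-abc-123", "Our premium plan costs $49/month.", user_rating=5)
```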
Dimension 3: Error Rate Monitoring
AI agents fail differently from traditional software -- they can fail silently by producing plausible but wrong output.
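A sketch of error-rate tracking; the outcome categories, including the AI-specific "silent_failure", are illustrative:

```python
from collections import Counter

import structlog

log = structlog.get_logger()

class ErrorRateMonitor:
    """Tracks outcomes by category and reports the overall error rate."""

    def __init__(self) -> None:
        self.outcomes: Counter[str] = Counter()

    def record(self, outcome: str) -> None:
        # "silent_failure" marks output that parsed fine but failed a
        # downstream validation check -- plausible but wrong.
        self.outcomes[outcome] += 1
        log.info("agent_outcome", outcome=outcome)

    def error_rate(self) -> float:
        total = sum(self.outcomes.values())
        errors = total - self.outcomes["success"]
        return errors / total if total else 0.0

monitor = ErrorRateMonitor()
for outcome in ["success", "success", "timeout", "success", "silent_failure"]:
    monitor.record(outcome)
log.info("error_rate", rate=monitor.error_rate())  # 0.4 in this toy sample
```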
Dimension 4: Cost Per Operation
AI agents have variable per-request costs unlike fixed-infrastructure services.
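A sketch of per-operation cost tracking; the per-token prices are placeholders, not real provider rates:

```python
import structlog

log = structlog.get_logger()

# Placeholder prices in dollars per 1,000 tokens -- substitute your provider's actual rates.
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015

def record_operation_cost(operation: str, prompt_tokens: int, completion_tokens: int) -> float:
    cost = (
        prompt_tokens / 1000 * INPUT_PRICE_PER_1K
        + completion_tokens / 1000 * OUTPUT_PRICE_PER_1K
    )
    log.info(
        "operation_cost",
        operation=operation,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        cost_usd=round(cost, 6),
    )
    return cost

record_operation_cost("summarize_conversation", prompt_tokens=1200, completion_tokens=400)
# cost = 1.2 * 0.003 + 0.4 * 0.015 = 0.0096 dollars for this operation
```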
The Feedback Loop: Observe, Insight, Improve, Verify
Observability is not just about watching -- it drives a continuous improvement cycle:
1. OBSERVE: Collect logs, metrics, and traces from production
2. INSIGHT: "Response times spike every Monday at 9 AM"
3. IMPROVE: Add request queuing to handle the traffic burst
4. VERIFY: Write a load test (Axiom VII) that simulates Monday morning
5. DEPLOY: The CI pipeline (Axiom IX) validates the fix
6. OBSERVE: Monitor production to confirm the improvement, and repeat
This is where observability and testing become a unified system. Observability discovers problems that tests did not anticipate. Those discoveries become new tests. New tests prevent regressions. Observability confirms the fix works in production. The verification system grows stronger with each cycle.
The Complete System: All Ten Axioms
This is the final axiom. Together, the ten axioms form a coherent system for agentic software development. Here is how they compose:
ORCHESTRATION
- Axiom I: Shell as Orchestrator -- the shell coordinates all tools, agents, and workflows.
SPECIFICATION
- Axiom II: Knowledge is Markdown -- requirements, designs, and context live in markdown.
- Axiom III: Programs Over Scripts -- production work uses proper programs with structure.
ARCHITECTURE
- Axiom IV: Composition Over Monoliths -- systems are built from small, composable units.
- Axiom V: Types Are Guardrails -- type systems catch structural errors before runtime.
- Axiom VI: Data is Relational -- data follows relational patterns for integrity.
VERIFICATION
- Axiom VII: Tests Are the Specification -- tests define and verify correct behavior.
- Axiom VIII: Version Control is Memory -- Git tracks every change for accountability.
- Axiom IX: Verification is a Pipeline -- CI/CD automates all verification steps.
- Axiom X: Observability Extends Verification -- runtime monitoring verifies behavior in production.
Trace a feature through the complete system:
- Shell orchestrates (I): You use Claude Code to coordinate the development workflow
- Spec in markdown (II): Requirements are captured in a spec.md file
- Proper program (III): Implementation follows program structure, not ad-hoc scripting
- Composed from units (IV): The feature is built from focused, reusable components
- Types enforce contracts (V): Interfaces between components are type-checked
- Data stored relationally (VI): Persistent data follows relational integrity patterns
- Tests specify behavior (VII): Test-Driven Generation defines what "correct" means
- Git remembers everything (VIII): Every change is tracked, reversible, attributable
- Pipeline verifies (IX): CI runs linting, types, tests, and builds automatically
- Production is observed (X): Logs, metrics, and traces confirm the feature works for real users
No single axiom is sufficient. A system with tests but no observability is blind in production. A system with observability but no tests has no baseline for "correct." A system with both but no types catches errors too late. The axioms are not a menu to choose from -- they are a system that works together.
Anti-Patterns
| Anti-Pattern | Why It Fails | The Fix |
|---|---|---|
| Print statements in production | Unstructured, no levels, no context, lost when process restarts | Use structlog with JSON output and persistent log aggregation |
| No error alerting | "We'll notice eventually" means users notice first | Define alert thresholds; wake someone for CRITICAL, notify for ERROR |
| Logging everything at DEBUG | Noise overwhelms signal; storage costs explode | Use appropriate log levels; DEBUG off in production |
| No correlation between requests | Impossible to trace a single user's journey through the system | Add correlation IDs; bind context at request start |
| Observability as afterthought | "Add monitoring later" means after the first production incident | Design observability into the system from the start, like testing |
| Metrics without baselines | "Is 200ms response time good or bad?" -- you cannot answer | Establish baselines first; alert on deviation, not absolute values |
| Monitoring only happy paths | You only track successful responses; failures are invisible | Instrument error paths with the same rigor as success paths |
Try With AI
Prompt 1: Replace Print Statements with Structured Logging
I have a Python script that uses print statements for debugging. Help me refactor it
to use structlog with proper production observability.
Here is my current code:
def process_order(order):
    print(f"Processing order {order.id}")
    if order.total > 1000:
        print("Large order detected!")
    try:
        result = charge_payment(order)
        print(f"Payment successful: {result}")
    except Exception as e:
        print(f"ERROR: Payment failed: {e}")
    print("Order processing complete")
For each print statement, help me understand:
1. What log level should this be? (DEBUG, INFO, WARNING, ERROR, CRITICAL)
2. What structured context should I add? (key-value pairs)
3. Why is the structured version better for production debugging?
Then show me the complete refactored version using structlog with JSON output.
What you're learning: The difference between development-time debugging (print statements) and production-grade observability (structured logging). You are learning to think about each log statement in terms of its audience (human developer vs. log aggregation system) and its purpose (debugging vs. monitoring vs. alerting).
Prompt 2: Design AI Agent Monitoring
I am building an AI agent that helps customers with product recommendations.
Help me design a comprehensive observability strategy.
The agent:
- Receives natural language queries from customers
- Searches a product catalog (vector database)
- Generates personalized recommendations using an LLM
- Tracks which recommendations led to purchases
For each of the three observability pillars, help me define:
LOGS: What events should I log? At what levels? With what context?
METRICS: What numerical measurements matter? What are healthy baselines?
TRACES: What spans should I create? Where are the likely bottlenecks?
Also help me think about AI-specific monitoring:
- How do I detect when the agent is giving poor recommendations?
- How do I track cost per recommendation?
- What alerts should wake me up at 2 AM vs. notify me in the morning?
Walk me through the design decisions, explaining why each choice matters.
What you're learning: How to design observability for AI-specific systems where "correctness" is harder to define than in traditional software. You are learning to think about quality signals, cost tracking, and the unique failure modes of AI agents -- where the system can be "up" but producing poor results.
Prompt 3: The Full Verification Spectrum
I want to understand how all ten axioms of agentic development work together.
Take a concrete feature -- for example, adding a "summarize conversation" button
to an AI chat application -- and trace it through all ten axioms:
1. Shell as Orchestrator: How does the shell coordinate this work?
2. Knowledge is Markdown: Where do requirements live?
3. Programs Over Scripts: How is the implementation structured?
4. Composition Over Monoliths: What components make up this feature?
5. Types Are Guardrails: What type contracts exist between components?
6. Data is Relational: How is conversation data stored?
7. Tests Are the Specification: What tests define "correct"?
8. Version Control is Memory: How are changes tracked?
9. Verification is a Pipeline: What does CI check?
10. Observability Extends Verification: What do you monitor in production?
For each axiom, give me a concrete example specific to this feature.
Then help me see: where would the system fail if I skipped any single axiom?
What you're learning: Systems thinking -- how individual engineering practices compose into a coherent development methodology. You are learning to see the ten axioms not as separate rules but as an interconnected system where each axiom addresses a gap that the others leave open. This is the core insight of agentic development: rigorous engineering practices applied systematically, not selectively.
Safety Note
Observability systems handle sensitive data. Production logs may contain user inputs, personal information, or proprietary content. Always apply data minimization: log what you need for debugging and monitoring, not everything available. Sanitize personally identifiable information before it enters log aggregation. Apply retention policies -- not every log needs to live forever. And remember that observability infrastructure itself needs security: access to production logs should be as controlled as access to production databases.