Axiom IX: Verification is a Pipeline
It's 11pm on a Thursday. You've spent three hours prompting Claude Code to build a new feature — a task scheduling module with priority queues, retry logic, and SQLModel persistence. The code looks clean. The AI even wrote tests. You run them locally, they pass. You push your branch, merge the PR into main, and go to bed.
At 2am, the deployment fails. The feature imported a package that isn't in requirements.txt. The type annotations use syntax that only works on Python 3.12, but production runs 3.11. One of the "passing" tests was actually testing the wrong function due to a copy-paste error the AI made. And the retry logic pulls in a dependency with a known vulnerability that a security audit would have flagged instantly.
None of these failures are exotic. They're routine — the kind of mistakes that slip through when the only verification is "I ran it on my machine and it seemed fine." The problem wasn't the code quality. The problem was the verification process: manual, incomplete, and inconsistent.
This is where Axiom IX draws the line: verification is not something you do. It's something your infrastructure does, automatically, every single time.
The Problem Without This Axiom
Without automated verification pipelines, teams fall into predictable failure patterns:
The "Works on My Machine" Trap. Developer A's code passes locally because they have Python 3.12, a specific OS, and packages installed from last month's experiment. Developer B pulls the same code and it breaks. Production uses neither developer's environment. Without a standardized verification environment, "it works" means nothing.
The "I Already Tested It" Illusion. You run pytest and see green. But did you run the linter? The type checker? The security audit? The formatting check? Manual verification is inherently incomplete because humans skip steps — especially at 11pm, especially when the AI assured them the code was correct.
The "Review Fatigue" Problem. When AI generates 500 lines of code in 30 seconds, a human reviewer can't meaningfully verify every line. They skim, approve, merge. Without automated checks catching the issues, subtle bugs accumulate. The more code AI generates, the more essential automated verification becomes.
The Axiom Defined
Axiom IX: CI/CD automates the verification of all changes. The pipeline runs linting, type checking, tests, security audits, and deployment — every time, without exception. CI enforces reality: if the pipeline fails, the code doesn't ship.
This axiom transforms verification from a human discipline problem into an infrastructure guarantee. You don't need to remember to run the linter. The pipeline runs it. You don't need to trust that tests pass. The pipeline proves it. You don't need to hope there are no security vulnerabilities. The pipeline checks.
The pipeline is the gatekeeper that never sleeps, never gets tired, and never decides "it's probably fine."
From Principle to Axiom
In Part 1, you learned Principle 3: Verification as Core Step — the mindset that every action should be verified. You learned to check that files exist after creating them, to confirm commands succeeded before moving on, to validate outputs before declaring victory.
Axiom IX elevates that principle from personal discipline to infrastructure enforcement. The relationship is:
| Principle 3 (Mindset) | Axiom IX (Infrastructure) |
|---|---|
| "Always verify your work" | "The pipeline always verifies all work" |
| Relies on human discipline | Runs automatically on every push |
| Can be forgotten or skipped | Cannot be bypassed (branch protection) |
| Checks what you remember to check | Checks everything, every time |
| Individual responsibility | Team-wide guarantee |
The principle taught you why to verify. The axiom teaches you how to make verification unavoidable.
The Verification Pyramid
Not all checks are equal. They form a pyramid: quick, mechanical checks at the wide base catch the most common issues, while the deeper checks stacked above them catch problems the lower levels cannot see:
                    ┌────────────────────┐
                    │   Security Audit   │  ← Last gate, catches known vulnerabilities
                    │    (pip-audit)     │
                ┌───┴────────────────────┴───┐
                │     Integration Tests      │  ← Slowest, tests component interactions
                │  (pytest -m integration)   │
            ┌───┴────────────────────────────┴───┐
            │             Unit Tests             │  ← Tests individual functions
            │              (pytest)              │
        ┌───┴────────────────────────────────────┴───┐
        │               Type Checking                │  ← Catches type errors statically
        │                 (pyright)                  │
    ┌───┴────────────────────────────────────────────┴───┐
    │                      Linting                       │  ← Catches bugs, style issues
    │                    (ruff check)                    │
┌───┴────────────────────────────────────────────────────┴───┐
│                         Formatting                         │  ← Fastest, catches style drift
│                   (ruff format --check)                    │
└─────────────────────────────────────────────────────────────┘
Each level catches different categories of problems:
| Level | Tool | What It Catches | Speed |
|---|---|---|---|
| 1. Formatting | ruff format --check | Inconsistent style, whitespace | < 1 second |
| 2. Linting | ruff check | Unused imports, bad patterns, common bugs | 1-2 seconds |
| 3. Type Checking | pyright | Type mismatches, missing attributes, wrong signatures | 3-10 seconds |
| 4. Unit Tests | pytest | Logic errors, broken functions, regressions | 5-30 seconds |
| 5. Integration Tests | pytest -m integration | Component interaction failures | 30-120 seconds |
| 6. Security Audit | pip-audit | Known vulnerabilities in dependencies | 5-15 seconds |
Why the pyramid matters: If formatting fails, you don't need to wait for tests to run. Each level gates the next. Fast failures save time.
Why Each Level Exists
Level 1 — Formatting catches the trivial but distracting. AI-generated code often uses inconsistent indentation, trailing whitespace, or mixed quote styles. ruff format --check enforces a single style without debate. This is the cheapest possible check — if your code can't even be formatted consistently, there's no point running deeper analysis.
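As a small illustration, here is the kind of drift ruff format normalizes; the add_task helper is hypothetical, and both versions behave identically — only the style differs:

```python
# Hypothetical before/after illustration of what ruff format normalizes.

# Before: legal Python, but inconsistent quoting and spacing.
def add_task(name,priority = 1):
    return {'name':name,"priority":priority}

# After ruff format: one canonical style, no debate.
def add_task(name, priority=1):
    return {"name": name, "priority": priority}
```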
Level 2 — Linting catches the bugs hiding in plain sight. Unused imports that slow startup. Variables assigned but never read. f-strings without placeholders. assert statements with tuples (always truthy). These are patterns that look correct to a casual reader but indicate real problems.
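To make those categories concrete, here is a hypothetical snippet in which every line trips one of ruff's default pyflakes-derived rules:

```python
# Hypothetical snippet: each flagged line below triggers a default ruff check.
import os  # F401: os imported but unused


def check_task(name: str) -> None:
    status = "pending"  # F841: local variable assigned but never used
    print(f"checking task")  # F541: f-string without any placeholders
    assert (name, "name must not be empty")  # F631: assert on a tuple is always true
```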
Level 3 — Type Checking catches the structural errors. You call task.priorty instead of task.priority — the linter won't catch the typo, but the type checker knows Task has no attribute priorty. You pass a string where an integer is expected. You return None from a function declared to return Task. Static type checking is like having a compiler for Python.
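A minimal sketch of what pyright reports before the code ever runs; the Task dataclass here is hypothetical, echoing the task-manager example used throughout this chapter:

```python
# Hypothetical sketch: both problems below are caught statically by pyright.
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int


def describe(task: Task) -> str:
    # pyright flags this: Task has no attribute "priorty"
    return f"{task.name}: {task.priorty}"


# pyright flags this: a str is passed where the "priority" parameter expects an int
describe(Task(name="backup", priority="high"))
```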
Level 4 — Unit Tests catch the logic errors. The sorting algorithm puts high-priority tasks last instead of first. The retry logic retries indefinitely instead of stopping at 3. The date parser handles "2025-01-15" but crashes on "2025-1-5". These are semantic bugs that only running the code can reveal.
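A single pytest case is enough to catch the reversed priority ordering described above; the schedule function and its contract are hypothetical:

```python
# Hypothetical unit test for a schedule() helper that should order tasks
# by ascending priority number (1 = most urgent).
from task_manager.scheduler import schedule  # hypothetical module and function


def test_high_priority_tasks_come_first() -> None:
    tasks = [
        {"name": "cleanup", "priority": 3},
        {"name": "backup", "priority": 1},
    ]
    ordered = schedule(tasks)
    assert [t["priority"] for t in ordered] == [1, 3]
```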
Level 5 — Integration Tests catch the interaction failures. The API endpoint works in isolation, but when the database is under load, it times out. The scheduler creates tasks correctly, but the notification system doesn't pick them up. Components that work alone may fail together.
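Integration tests are usually separated with a pytest marker so they can be selected with -m integration, as the pyramid table shows. In the sketch below, the client and database fixtures and the /tasks endpoint are hypothetical, and the "integration" marker would need to be registered in your pytest configuration:

```python
# Hypothetical integration test: exercises the API and the database together.
import pytest


@pytest.mark.integration
def test_created_task_is_visible_in_the_database(client, database):
    response = client.post("/tasks", json={"name": "backup", "priority": 1})
    assert response.status_code == 201
    assert database.get_task(response.json()["id"]) is not None
```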
Level 6 — Security Audit catches the invisible threats. A dependency you installed six months ago now has a known CVE. A package you've never heard of (three levels deep in your dependency tree) was compromised. pip-audit checks every installed package against vulnerability databases.
GitHub Actions: Your Pipeline
GitHub Actions is the CI platform you'll use most often for Python projects. It runs your verification pipeline on every push and pull request, using YAML workflow files stored in your repository.
Here's a complete, production-ready CI workflow:
# .github/workflows/ci.yml
name: CI Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  verify:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.11", "3.12"]

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Cache pip dependencies
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements*.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install -r requirements-dev.txt

      # Level 1: Formatting
      - name: Check formatting
        run: ruff format --check .

      # Level 2: Linting
      - name: Run linter
        run: ruff check .

      # Level 3: Type checking
      - name: Type check
        run: pyright

      # Level 4: Unit tests
      - name: Run unit tests
        run: pytest tests/ -m "not integration" -v --tb=short

      # Level 5: Integration tests
      - name: Run integration tests
        run: pytest tests/ -m integration -v --tb=short

      # Level 6: Security audit
      - name: Security audit
        run: pip-audit
Let's break down what each section does:
Triggers (on:): The pipeline runs on every push to main and every pull request targeting main. No code reaches main without passing all checks.
Matrix Testing (strategy.matrix): The pipeline runs against both Python 3.11 and 3.12. This catches version-specific issues: code that uses PEP 695 type-parameter syntax parses on 3.12 but is a SyntaxError on 3.11, and a standard-library module added in a newer release simply won't import on the older one. If AI generates code using a feature newer than what production runs, the matrix catches it.
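A concrete (hypothetical) snippet that the 3.11 leg of the matrix would reject at parse time, even though it "works on my machine" under 3.12:

```python
# Hypothetical: valid on Python 3.12+, a SyntaxError on 3.11.
type TaskId = int  # PEP 695 type alias statement (3.12+)


def first_task[T](items: list[T]) -> T:  # PEP 695 generic function syntax (3.12+)
    return items[0]
```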
Caching (actions/cache): Dependencies are cached between runs using a hash of your requirements files. First run installs everything; subsequent runs restore from cache unless dependencies change. This can cut pipeline time from 3 minutes to 45 seconds.
Sequential Steps: The verification pyramid runs top-to-bottom. If formatting fails, linting never runs. This "fail fast" approach gives you the quickest possible feedback.
Branch Protection: Making CI Mandatory
A pipeline that runs but can be ignored is theater, not verification. To make CI truly enforceable:
- Go to your repository Settings > Branches > Branch protection rules
- Enable "Require status checks to pass before merging"
- Select the verify job as a required check
- Enable "Require branches to be up to date before merging"
Now the pipeline isn't advisory — it's a gate. If CI fails, the merge button is disabled. No exceptions, no overrides (unless you're a repository admin, and even then it shows a warning).
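If you prefer to configure this from code rather than clicking through the UI, GitHub's REST API exposes a branch-protection endpoint. The sketch below uses the requests library and assumes a token in the GITHUB_TOKEN environment variable, placeholder owner/repo names, and that your matrix produces checks named "verify (3.11)" and "verify (3.12)":

```python
# Minimal sketch: set branch protection via the GitHub REST API.
# OWNER, REPO, and the check names are hypothetical; adjust to your setup.
import os

import requests

OWNER, REPO, BRANCH = "your-org", "your-repo", "main"

response = requests.put(
    f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{BRANCH}/protection",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={
        # Required checks: every matrix leg must pass, and the branch must be up to date.
        "required_status_checks": {
            "strict": True,
            "contexts": ["verify (3.11)", "verify (3.12)"],
        },
        "enforce_admins": True,
        "required_pull_request_reviews": None,
        "restrictions": None,
    },
    timeout=30,
)
response.raise_for_status()
```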
Local CI: The Makefile
Waiting 2-3 minutes for GitHub Actions to tell you about a formatting error wastes time. The solution: run the same checks locally before pushing. A Makefile gives you a single command that mirrors your CI pipeline:
# Makefile
.PHONY: format lint typecheck test test-integration security fix-format fix-lint ci ci-full quick clean

# Individual targets
format:
	ruff format --check .

lint:
	ruff check .

typecheck:
	pyright

test:
	pytest tests/ -m "not integration" -v --tb=short

test-integration:
	pytest tests/ -m integration -v --tb=short

security:
	pip-audit

# Fix targets (auto-correct what can be fixed)
fix-format:
	ruff format .

fix-lint:
	ruff check --fix .

# The full CI pipeline — same checks as GitHub Actions
ci: format lint typecheck test security
	@echo "All CI checks passed."

# Run everything including integration tests
ci-full: format lint typecheck test test-integration security
	@echo "All CI checks (including integration) passed."

# Quick check — just formatting and linting (< 5 seconds)
quick:
	ruff format --check . && ruff check .

clean:
	find . -type d -name __pycache__ -exec rm -rf {} +
	find . -type d -name .pytest_cache -exec rm -rf {} +
	find . -type d -name .ruff_cache -exec rm -rf {} +
The Workflow With Local CI
The daily development workflow becomes:
# 1. Write code (or have AI generate it)
# 2. Quick check — catches 80% of issues in 2 seconds
make quick
# 3. Fix any formatting/linting issues automatically
make fix-format
make fix-lint
# 4. Full CI check before pushing
make ci
# 5. Only push if CI passes
git add -A && git commit -m "feat: add task priority queue" && git push
This workflow means you almost never see CI failures on GitHub. The pipeline becomes a safety net, not a bottleneck. You catch issues in 5 seconds locally rather than waiting 3 minutes for a remote failure.
Why a Makefile?
You might wonder: why not a shell script? Or a Python script? The Makefile has specific advantages:
- Convention: Most open-source projects use Makefiles. Developers know to look for one.
- Self-documenting: The Makefile lists every available target in one place, so new contributors can see the whole workflow at a glance.
- Dependency-aware: Make only runs targets whose dependencies changed (though for CI checks, you usually want to run everything).
- Composable: ci is just format + lint + typecheck + test + security chained together.
- Universal: Make is available on virtually every Unix-like system. No extra dependencies.
Why CI Matters MORE With AI-Generated Code
Traditional CI protects against human error. With AI-generated code, the case for CI becomes overwhelming:
AI generates faster than humans can review. A skilled developer writes 50-100 lines of meaningful code per hour. Claude Code can generate 500 lines in 30 seconds. The review bottleneck shifts — you physically cannot read and verify every line. The pipeline must catch what your eyes miss.
AI makes confident-looking mistakes. When a human writes buggy code, there are often signals — commented-out attempts, TODO markers, inconsistent naming that reveals uncertainty. AI-generated code looks polished even when it's wrong. A function with a perfect docstring, clean type annotations, and logical structure might still have a subtle off-by-one error. The pipeline doesn't care how polished the code looks — it checks whether it works.
AI doesn't remember project conventions. You told the AI "use ruff for linting" in your system prompt, but it generated code with unused imports anyway. You specified "all functions need type annotations" but the AI forgot on the helper functions. The pipeline enforces conventions the AI might forget between prompts.
AI introduces dependency risks. AI might suggest pip install cool-package for a feature, pulling in a package you've never audited. Without pip-audit in your pipeline, you won't know that cool-package depends on sketchy-lib which has a known remote code execution vulnerability.
Anti-Patterns: What Bad CI Looks Like
| Anti-Pattern | Why It Fails | The Fix |
|---|---|---|
| "I tested it on my machine" | Different environment, different results | CI runs in a standardized container |
| CI that only runs tests | Types, linting, security all skipped | Implement the full verification pyramid |
| Ignoring flaky tests | "It's just flaky" normalizes CI failures | Fix or quarantine flaky tests immediately |
| No local CI equivalent | Surprises at push time, slow feedback | Create make ci that mirrors the pipeline |
| CI takes 30+ minutes | Developers push without waiting, bypass CI | Cache aggressively, parallelize jobs, fail fast |
| Optional CI (no branch protection) | "I'll merge anyway, it's urgent" | Required status checks, no exceptions |
| Secrets in code | API keys committed, exposed in logs | Use GitHub Secrets and environment variables |
| CI that passes but doesn't check enough | False confidence from green checkmarks | Audit what CI actually verifies quarterly |
The Most Dangerous Anti-Pattern
The single most destructive CI anti-pattern is normalizing failures. It starts small: "That test is flaky, just re-run it." Then it becomes: "CI is red but it's not related to my change." Then: "We'll fix CI next sprint." Within weeks, the pipeline is permanently red, nobody looks at it, and you've lost your automated safety net entirely.
The rule: CI must always be green on main. If a test is flaky, fix it or delete it. If a check is wrong, fix the check. Never let "red is normal" become your team's culture.
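Quarantining, when you choose it over an immediate fix, should be explicit and visible rather than a shrug at a red build. One lightweight pattern is pytest's xfail marker with a reason pointing at the tracking issue; the test below is hypothetical:

```python
# Hypothetical quarantine: the test still runs and is reported, but a failure
# no longer turns main red while the underlying flakiness is being fixed.
import pytest


@pytest.mark.xfail(reason="Flaky under parallel runs; see tracking issue", strict=False)
def test_scheduler_survives_concurrent_submissions():
    ...
```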
CI as Culture
The deepest lesson of Axiom IX isn't technical — it's cultural. CI is a statement about values:
"If it's not in CI, it's not enforced." You can write all the coding standards documents you want. You can send emails about "please run the linter before pushing." None of it works. The only coding standards that actually get followed are the ones enforced by the pipeline. If you care about consistent formatting — put it in CI. If you care about type safety — put it in CI. If you care about security — put it in CI.
"The pipeline is the source of truth." When someone asks "does this code work?" the answer isn't "I think so" or "it worked when I tested it." The answer is: "CI is green." The pipeline is the objective arbiter. It doesn't have opinions, biases, or bad days.
"Fast CI is kind CI." A pipeline that takes 30 minutes punishes developers for pushing code. They'll batch changes, push less frequently, and avoid small improvements because "it's not worth waiting for CI." A pipeline that takes 2 minutes encourages small, frequent pushes — exactly the behavior you want. Invest in making CI fast: cache dependencies, parallelize jobs, skip unnecessary work.
Putting It All Together
Here's how Axiom IX connects to the axioms before and after it:
- Axiom VII (Tests Are the Specification) gives you tests. Axiom IX runs them automatically.
- Axiom VIII (Version Control is Memory) gives you commits. Axiom IX verifies them before they reach main.
- Axiom X (Observability Extends Verification) takes over where CI stops — monitoring the code after it ships.
Together, they form a continuous verification chain: tests define correctness, CI proves it before deployment, and observability confirms it in production.
Safety Note: Secrets and CI
Never put secrets directly in your workflow files or source code. CI environments handle sensitive data through environment variables and encrypted secrets:
# WRONG - secret hard-coded in the workflow, exposed to anyone who can read the repo
- name: Deploy
  run: curl -H "Authorization: Bearer sk-abc123..." https://api.example.com/deploy

# RIGHT - secret stored in GitHub Secrets and passed in via an environment variable
- name: Deploy
  env:
    DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
  run: curl -H "Authorization: Bearer $DEPLOY_TOKEN" https://api.example.com/deploy
Rules for CI secrets:
- Store all tokens, keys, and passwords in GitHub Settings > Secrets and variables > Actions
- Never echo or print secret values in CI logs
- Rotate secrets regularly (at minimum quarterly)
- Use the minimum permissions necessary (read-only tokens where possible)
- Secrets are not available in pull requests from forks (this is a security feature, not a bug)
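On the application side, the counterpart rule is that code only ever reads secrets from the environment. A minimal sketch, with a hypothetical variable name:

```python
# Minimal sketch: secrets reach the code only through environment variables
# injected by CI (GitHub Secrets) or by your local shell, never from literals.
import os

DEPLOY_TOKEN = os.environ["DEPLOY_TOKEN"]  # fails loudly if the secret is missing
```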
Try With AI
Prompt 1: Build Your CI Pipeline
I have a Python project with this structure:
my_project/
├── src/
│   └── task_manager/
│       ├── __init__.py
│       ├── models.py      (SQLModel Task class)
│       ├── scheduler.py   (priority queue logic)
│       └── api.py         (FastAPI endpoints)
├── tests/
│   ├── test_models.py
│   ├── test_scheduler.py
│   └── test_api.py
├── requirements.txt
├── requirements-dev.txt
├── pyproject.toml
└── Makefile
Help me create:
1. A GitHub Actions workflow (.github/workflows/ci.yml) that implements the full
verification pyramid (formatting, linting, types, tests, security)
2. A Makefile with targets for each level plus a combined `make ci`
3. A pyproject.toml section configuring ruff and pyright
For each file, explain what each section does and why it's there.
Then show me how to add branch protection so CI is mandatory.
What you're learning: Translating the verification pyramid concept into actual infrastructure files. You're practicing the skill of configuring CI tools — not just knowing they exist, but knowing how to wire them together into a cohesive pipeline that runs automatically on every push.
Prompt 2: Diagnose CI Failures
My CI pipeline is failing with these errors. Help me understand and fix each one:
Error 1 (ruff format):
src/task_manager/scheduler.py: would reformat
Error 2 (ruff check):
src/task_manager/models.py:3:1: F401 `os` imported but unused
src/task_manager/api.py:15:5: B006 Do not use mutable data structures for argument defaults
Error 3 (pyright):
src/task_manager/scheduler.py:42:12 - error: Argument of type "str" cannot be assigned
to parameter "priority" of type "int"
Error 4 (pytest):
FAILED tests/test_scheduler.py::test_priority_order - AssertionError:
assert [Task(p=3), Task(p=1)] == [Task(p=1), Task(p=3)]
Error 5 (pip-audit):
Name Version ID Fix Versions
requests 2.28.0 PYSEC-2023-74 2.31.0
For each error:
- What verification level caught it?
- Why didn't a lower level catch it?
- What's the fix?
- What would have happened if this reached production?
What you're learning: Reading and interpreting CI failure messages across all pyramid levels. You're building the diagnostic skill of understanding which tool caught which category of error, and why the layered approach matters — each level catches problems invisible to the levels below it.
Prompt 3: CI for Your Domain
I'm building [describe your project: a web scraper, a data pipeline, a CLI tool,
an API service, a machine learning model, etc.].
Help me design a CI pipeline that goes beyond the basic verification pyramid.
Consider domain-specific checks:
- What additional checks make sense for my type of project?
(e.g., API contract tests, data validation, model accuracy thresholds,
documentation generation, container builds, load tests)
- What should run on every push vs. nightly vs. weekly?
- How do I handle checks that need external services (databases, APIs)?
- What's the right balance between thoroughness and speed?
Design a CI pipeline with:
1. Fast checks (< 2 min) — run on every push
2. Medium checks (< 10 min) — run on PRs to main
3. Slow checks (< 30 min) — run nightly or on release branches
Show me the GitHub Actions YAML for all three tiers.
What you're learning: Extending the verification pyramid beyond generic Python checks to domain-specific validation. You're practicing the architectural skill of designing CI pipelines that balance thoroughness (catch everything) with speed (developers don't bypass it), tailored to the specific risks and requirements of your project type.
Summary
Axiom IX transforms verification from a human discipline problem into an infrastructure guarantee. The verification pyramid (formatting, linting, types, tests, integration tests, security) runs automatically on every push. GitHub Actions enforces it remotely; Makefiles mirror it locally for fast feedback. With AI generating code faster than humans can review, the pipeline becomes your most reliable quality gate — the gatekeeper that never sleeps.
The next axiom, Observability Extends Verification, picks up where CI leaves off. CI verifies code before deployment. Observability verifies it after — monitoring behavior in production where no test suite can reach.