Axiom III: Programs Over Scripts

Last Tuesday, you wrote a quick Python script to rename 200 image files in a folder. Fifteen lines. No imports beyond os and re. It worked perfectly on the first run, and you felt productive.

Then your colleague asked: "Can I use that for our client deliverables?" Suddenly you needed to handle files with spaces in their names, log which files were renamed, skip files that already matched the pattern, and report errors instead of crashing silently. Your 15-line script grew to 80 lines of tangled if-statements. A week later, the script renamed a client's final deliverable incorrectly, and nobody knew why because there were no logs, no tests, and no way to reproduce the issue.

This is the script-to-program boundary. Crossing it without recognizing you've crossed it is one of the most common sources of production failures in AI-era development.

The Problem Without This Axiom

Without "Programs Over Scripts," developers fall into a dangerous pattern: they write quick scripts, those scripts work for the immediate problem, and then those scripts quietly become production infrastructure. Nobody announces "this script is now load-bearing code." It just happens, one convenience at a time.

The consequences compound:

  • A data processing script runs in production for months. One day the input format changes slightly. The script crashes at 2 AM with no error message beyond KeyError: 'timestamp'. Nobody knows what it expected or why.
  • An AI agent generates a utility function. It works for the test case. Three weeks later, it fails on edge cases the AI never considered. There are no tests to reveal this, and no type annotations to show what the function actually expects.
  • A deployment script uses hardcoded paths. It works on the author's machine. On the CI server, it fails silently and deploys a broken build.

The root cause is the same every time: code that grew beyond script-level complexity while retaining script-level discipline.

The Axiom Defined

Axiom III: Production work requires proper programs, not ad-hoc scripts. Programs have types, tests, error handling, and CI integration. Scripts are for exploration; programs are for shipping.

This axiom draws a clear line: scripts serve exploration and experimentation; programs serve reliability and collaboration. Both are valuable. The failure mode is not writing scripts. The failure mode is shipping scripts as if they were programs.

From Principle to Axiom

In Chapter 3, you learned Principle 2: Code as Universal Interface -- the idea that code solves problems precisely where prose fails. Code is unambiguous. Code is executable. Code is the language machines understand natively.

Axiom III builds on that foundation: if code is your universal interface, then the quality of that code determines the reliability of your interface. A vague specification is bad. A vague program is worse, because it compiles and runs -- giving the false appearance of correctness while hiding fragility beneath the surface.

Principle 2 says: use code to solve problems. Axiom III says: make that code worthy of the problems it solves.

The principle is about choosing the right medium. The axiom is about discipline within that medium.

The Script-to-Program Continuum

Scripts and programs are not binary categories. They exist on a continuum, and code naturally moves along it as its responsibilities grow. The key is recognizing when your code has moved far enough that script-level practices become dangerous.

| Dimension | Script | Program |
| --- | --- | --- |
| Purpose | Explore, prototype, one-off task | Reliable, repeatable, shared |
| Type annotations | None or minimal | Complete on all public interfaces |
| Error handling | Bare except or crash-and-fix | Specific exceptions with recovery |
| Tests | Manual verification ("it printed the right thing") | Automated test suite (pytest) |
| CLI interface | Hardcoded values, sys.argv[1] | Typed CLI (typer/click/argparse) |
| Dependencies | pip install globally | Locked in pyproject.toml (uv) |
| Configuration | Magic strings in source | Typed config objects or env vars |
| Documentation | Comments (maybe) | Docstrings, README, usage examples |
| CI/CD | None | Linted, type-checked, tested on every push |

When Does a Script Become a Program?

A script should become a program when any of these conditions become true:

  1. Someone else will run it. If another human (or an automated system) depends on your code, it needs to communicate its expectations through types and handle failures gracefully.
  2. It will run more than once. One-off scripts can crash and you re-run them with a fix. Repeated execution requires reliability.
  3. It processes important data. If the input or output matters (client files, financial records, deployment artifacts), silent failures are unacceptable.
  4. It grew beyond 50 lines. This is not a strict threshold, but complexity compounds. Beyond 50 lines, you cannot hold the full logic in your head while debugging.
  5. An AI generated it. AI-generated code deserves extra scrutiny because you did not write it line-by-line. Types and tests become your verification layer.

A Script Becomes a Program: Concrete Example

Here is a real progression. First, the script version -- quick, functional, fragile:

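A minimal sketch of what that first script might look like (the folder path and the renaming rule, lowercasing names and replacing spaces with underscores, are illustrative assumptions):

```python
import os
import re

FOLDER = "/Users/me/Desktop/photos"  # hardcoded path

for name in os.listdir(FOLDER):
    if not re.search(r"\.(jpe?g|png)$", name, re.IGNORECASE):
        continue
    new_name = re.sub(r"\s+", "_", name).lower()
    os.rename(os.path.join(FOLDER, name), os.path.join(FOLDER, new_name))
    print(f"renamed {name} -> {new_name}")
```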

This works. It also has no error handling, no way to preview changes, no protection against overwriting files, no tests, no type information, and hardcoded paths. When it fails, it fails silently or mid-operation, leaving your folder in an inconsistent state.

Now, the program version:

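A sketch of the program version, assuming the same renaming rule and the image_renamer package layout used in the pyproject.toml shown later (module, function, and option names are illustrative):

```python
"""image_renamer/cli.py -- a sketch of the program version."""
import logging
import re
from pathlib import Path

import typer

app = typer.Typer()
logger = logging.getLogger("image_renamer")

IMAGE_PATTERN = re.compile(r"\.(jpe?g|png)$", re.IGNORECASE)


def normalize_filename(name: str) -> str:
    """Lowercase the name and replace runs of whitespace with underscores."""
    return re.sub(r"\s+", "_", name).lower()


def plan_renames(folder: Path) -> list[tuple[Path, Path]]:
    """Return (source, target) pairs for image files that need renaming."""
    if not folder.is_dir():
        raise FileNotFoundError(f"Not a directory: {folder}")
    plan: list[tuple[Path, Path]] = []
    for source in sorted(folder.iterdir()):
        if not IMAGE_PATTERN.search(source.name):
            continue  # not an image file
        target = source.with_name(normalize_filename(source.name))
        if target != source:
            plan.append((source, target))
    return plan


@app.command()
def rename(
    folder: Path = typer.Argument(..., help="Folder containing images to rename"),
    dry_run: bool = typer.Option(False, "--dry-run", help="Show planned changes without renaming"),
) -> None:
    """Normalize image filenames in FOLDER."""
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    for source, target in plan_renames(folder):
        if target.exists():
            logger.warning("Skipping %s: %s already exists", source.name, target.name)
            continue
        if dry_run:
            logger.info("Would rename %s -> %s", source.name, target.name)
        else:
            source.rename(target)
            logger.info("Renamed %s -> %s", source.name, target.name)


if __name__ == "__main__":
    app()
```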

And the tests that verify it:

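A sketch of the accompanying pytest suite, exercising the helpers from the sketch above (seven tests, matching the summary table below):

```python
"""tests/test_rename.py -- a sketch of the test suite for the program above."""
from pathlib import Path

import pytest

from image_renamer.cli import normalize_filename, plan_renames


def test_normalize_lowercases_and_replaces_spaces() -> None:
    assert normalize_filename("Holiday Photo 01.JPG") == "holiday_photo_01.jpg"


def test_normalize_keeps_hyphens() -> None:
    assert normalize_filename("client-final.PNG") == "client-final.png"


def test_plan_ignores_non_image_files(tmp_path: Path) -> None:
    (tmp_path / "notes.txt").write_text("not an image")
    assert plan_renames(tmp_path) == []


def test_plan_skips_already_normalized_names(tmp_path: Path) -> None:
    (tmp_path / "already_clean.jpg").write_bytes(b"")
    assert plan_renames(tmp_path) == []


def test_plan_includes_files_needing_rename(tmp_path: Path) -> None:
    (tmp_path / "Messy Name.JPG").write_bytes(b"")
    [(source, target)] = plan_renames(tmp_path)
    assert source.name == "Messy Name.JPG"
    assert target.name == "messy_name.jpg"


def test_plan_collapses_runs_of_spaces(tmp_path: Path) -> None:
    (tmp_path / "two  spaces.png").write_bytes(b"")
    [(_, target)] = plan_renames(tmp_path)
    assert target.name == "two_spaces.png"


def test_missing_folder_raises_specific_error(tmp_path: Path) -> None:
    with pytest.raises(FileNotFoundError):
        plan_renames(tmp_path / "does-not-exist")
```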

Notice what changed:

| Aspect | Script | Program |
| --- | --- | --- |
| Errors | Crashes on missing folder | Raises specific exceptions with context |
| Safety | Can overwrite files | Checks for conflicts, skips with warning |
| Preview | No way to see what will happen | --dry-run flag shows planned changes |
| Types | None | Full annotations on all functions |
| Testing | "I ran it and it looked right" | 7 automated tests covering edge cases |
| Interface | Edit source code to change folder | CLI with --help, arguments, options |
| Logging | print() | Structured logging with levels |

The Python Discipline Stack

Python is flexible enough to power everything from throwaway scripts to production systems. The discipline stack is what transforms Python from "quick and loose" into "verified and reliable." Four tools form the foundation:

| Tool | Role | What It Catches |
| --- | --- | --- |
| uv | Dependency management | Wrong versions, missing packages, environment conflicts |
| pyright | Static type checker | Wrong argument types, missing attributes, incompatible returns |
| ruff | Linter and formatter | Unused imports, style violations, common bugs, inconsistent formatting |
| pytest | Test runner | Logic errors, edge cases, regressions after changes |

These tools form layers of verification, each catching a different class of defect:

Layer 4: pytest   → Does the logic produce correct results?
Layer 3: pyright  → Do the types align across function boundaries?
Layer 2: ruff     → Does the code follow consistent patterns?
Layer 1: uv       → Are the dependencies resolved and reproducible?

How They Work Together

A minimal pyproject.toml that activates the full stack:

[project]
name = "image-renamer"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = ["typer>=0.9.0"]

[project.scripts]
image-renamer = "image_renamer.cli:app"

[tool.pyright]
pythonVersion = "3.12"
typeCheckingMode = "standard"

[tool.ruff]
target-version = "py312"
line-length = 100

[tool.ruff.lint]
select = ["E", "F", "I", "UP", "B", "SIM"]

[tool.pytest.ini_options]
testpaths = ["tests"]

Running the full stack:

# Install dependencies in an isolated environment
uv sync

# Check types (catches mismatched arguments, wrong return types)
uv run pyright src/

# Lint and format (catches style issues, common bugs)
uv run ruff check src/ tests/
uv run ruff format src/ tests/

# Run tests (catches logic errors)
uv run pytest

Each tool catches problems the others miss. Pyright will not tell you that your rename logic is wrong -- that requires tests. Pytest will not tell you that you are passing a str where a Path is expected -- that requires pyright. Ruff will not tell you either of those things, but it will catch the unused import and the inconsistent formatting that make code harder to read and maintain.
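As a small illustration of that division of labor, here is a hypothetical function that uv, ruff, and pyright all accept without complaint, while a single test exposes the logic error:

```python
from pathlib import Path


def count_images(folder: Path) -> int:
    """Meant to count all image files, but the logic only looks at .jpg."""
    return len(list(folder.glob("*.jpg")))  # type-correct, lint-clean, and still wrong


def test_count_images_includes_png(tmp_path: Path) -> None:
    (tmp_path / "photo.png").write_bytes(b"")
    assert count_images(tmp_path) == 1  # only this test reveals the missing formats
```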

Why AI-Generated Code Requires Program Discipline

When you write code yourself, you build a mental model of how it works as you type each line. You know the assumptions, the edge cases you considered, and the shortcuts you took deliberately. AI-generated code has none of this implicit understanding. You receive finished output with no trace of the reasoning behind it.

This creates three specific risks that program discipline addresses:

1. Types Catch Hallucinated APIs

AI models sometimes generate code that calls functions or methods that do not exist, or passes arguments in the wrong order. Type checking catches this immediately:

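A sketch of both failure modes; the functions and call sites here are invented for illustration:

```python
from pathlib import Path


def clean_up(folder: Path, days_to_keep: int) -> None:
    """Delete stale temp files (simplified; days_to_keep is ignored here)."""
    for path in folder.glob("*.tmp"):
        # Hallucinated API: pathlib.Path has no .remove() method (the real call is .unlink()).
        # pyright reports an unknown attribute here before the code is ever executed.
        path.remove()


def nightly_cleanup() -> None:
    # Swapped arguments: pyright reports that an int and a Path do not match the
    # declared parameter types, instead of letting the call fail at runtime.
    clean_up(30, Path("/var/tmp/app"))
```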

Without pyright, this code would crash at runtime when a user first triggers that code path -- possibly in production, possibly weeks later. With pyright, you catch it before you ever run the code.

2. Tests Prevent Drift

AI does not remember previous sessions. Each time you ask it to modify code, it works from the current file content without understanding the history of decisions that shaped it. Tests encode your expectations permanently:

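A sketch of such a drift guard, reusing the normalize_filename helper from the renamer example above (its exact behavior is assumed from that sketch):

```python
from image_renamer.cli import normalize_filename


def test_normalize_preserves_hyphens() -> None:
    # Encodes a deliberate decision: hyphens stay; only whitespace becomes underscores.
    # If a later AI edit strips hyphens, this assertion fails on the next test run.
    assert normalize_filename("Client-Final Deliverable.JPG") == "client-final_deliverable.jpg"
```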

When a future AI edit accidentally changes normalize_filename to strip hyphens, this test fails immediately. The test is your memory; the AI has none.

3. CI Enforces Standards Across Sessions

You might forget to run pyright before committing. The AI certainly will not remind you. CI (Continuous Integration) enforces the discipline stack on every push, regardless of who or what wrote the code:

# .github/workflows/check.yml
name: Verify
on: [push, pull_request]

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv sync
      - run: uv run pyright src/
      - run: uv run ruff check .
      - run: uv run pytest

This pipeline does not care whether a human or an AI wrote the code. It applies the same standards to both. Code that fails any check does not merge. This is your safety net against AI-generated code that looks correct but contains subtle issues.

Anti-Patterns: Scripts Masquerading as Programs

Recognizing these patterns helps you catch code that has outgrown its script-level discipline:

| Anti-Pattern | Why It Fails | Program Alternative |
| --- | --- | --- |
| Jupyter notebooks as production code | No tests, no types, cell execution order matters, hidden state between cells | Extract logic into modules, test independently |
| def process(data): (no type hints) | Callers cannot verify they pass correct types; AI cannot validate its own output | def process(data: list[Record]) -> Summary: |
| Bare except Exception: | Hides real errors, makes debugging impossible | Catch specific exceptions: except FileNotFoundError: |
| DB_HOST = "localhost" in source | Breaks in any environment besides your machine | DB_HOST = os.environ["DB_HOST"] or typed config |
| "It's too simple to test" | Simple code becomes complex code; tests document expected behavior | Even one test proves the function works and prevents regressions |
| python my_script.py input.csv | No --help, no validation, no discoverability | typer or argparse with typed arguments |
| pip install in global environment | Different projects conflict; "works on my machine" syndrome | uv with locked pyproject.toml |
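As one concrete before-and-after, here is a sketch combining two of these fixes: specific exceptions instead of a bare except, and environment-based configuration instead of a hardcoded host (names and the settings format are illustrative):

```python
import os
from pathlib import Path

# Instead of DB_HOST = "localhost" buried in source, read it from the environment:
DB_HOST = os.environ["DB_HOST"]  # fails loudly at startup if the variable is missing


def load_settings(path: Path) -> dict[str, str]:
    """Read key=value pairs from a settings file."""
    try:
        text = path.read_text()
    except FileNotFoundError as exc:
        # Specific exception, with context -- not a bare `except Exception: pass`.
        raise FileNotFoundError(f"Settings file not found: {path}") from exc
    return dict(line.split("=", 1) for line in text.splitlines() if "=" in line)
```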

The "Too Simple to Test" Trap

This anti-pattern deserves special attention because it sounds reasonable. A function that adds two numbers does not need a test. But production code is never that simple for long. The function that "just renames files" eventually needs to handle Unicode filenames, skip hidden files, preserve file permissions, and log operations. Each addition is "too simple to test" individually, but together they create untested complexity.

The cost of adding a test is low. The cost of debugging production failures in untested code is high. Write the test.
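A minimal sketch of what that looks like in practice (the function is hypothetical):

```python
def strip_extension(filename: str) -> str:
    """Return the filename without its final extension."""
    return filename.rsplit(".", 1)[0]


def test_strip_extension_keeps_earlier_dots() -> None:
    # A one-minute test that pins down behavior for names with multiple dots.
    assert strip_extension("report.final.pdf") == "report.final"
```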

The Decision Framework

When you sit down to write code -- or when an AI generates code for you -- ask these questions in order:

1. Will this code run more than once?
YES → It needs tests.

2. Will someone else read or run this code?
YES → It needs types and docstrings.

3. Does this code handle external input (files, APIs, user input)?
YES → It needs specific error handling.

4. Will this code run in CI or production?
YES → It needs all of the above, plus packaging (pyproject.toml).

5. Did an AI generate this code?
YES → Apply extra scrutiny. Run pyright. Add tests for edge cases
the AI may not have considered.

If you answered YES to any question, your code has moved past the script boundary. Apply program discipline proportional to the number of YES answers.

Try With AI

Prompt 1: Transform a Script into a Program

Here is a Python script I wrote to [describe your actual script -- processing CSV data,
calling an API, generating reports, etc.]:

[paste your script here]

Help me transform this into a proper program. Specifically:
1. Add type annotations to all functions
2. Replace bare except blocks with specific exceptions
3. Add a typer CLI interface so I can pass arguments
4. Write 3-5 pytest tests covering the main logic and one edge case
5. Create a pyproject.toml with pyright and ruff configuration

Walk me through each change and explain what class of bug it prevents.

What you're learning: The mechanical process of applying program discipline to existing code. By watching the transformation step-by-step, you internalize which changes catch which categories of bugs, and you develop an intuition for what "production-ready" looks like compared to "it works on my machine."

Prompt 2: Audit AI-Generated Code

I asked an AI to generate this Python function:

```python
def fetch_user_data(user_id):
    import requests
    resp = requests.get(f"http://api.example.com/users/{user_id}")
    data = resp.json()
    return {"name": data["name"], "email": data["email"], "age": data["age"]}
```

Audit this code against the "Programs Over Scripts" axiom. For each issue you find:

  1. Name the specific anti-pattern
  2. Explain what could go wrong in production
  3. Show the fixed version with proper types, error handling, and structure

Then write 3 pytest tests that would catch the most dangerous failure modes.


What you're learning: Critical evaluation of AI-generated code. You are building the skill of reading code skeptically -- identifying missing error handling, absent type information, and implicit assumptions. This is the core verification skill for AI-era development: the AI generates, you verify.

Prompt 3: Design a Discipline Stack for Your Project

I'm starting a new Python project that will [describe your project: a CLI tool for file processing / an API client / a data pipeline / etc.].

Help me set up the complete Python discipline stack from scratch:

  1. Project structure (src layout with pyproject.toml)
  2. uv configuration for dependency management
  3. pyright configuration (what strictness level and why)
  4. ruff rules (which rule sets to enable for my use case)
  5. pytest setup with a single example test
  6. A pre-commit hook or Makefile that runs all four tools in sequence

Explain WHY each configuration choice matters -- don't just give me the config, help me understand what each setting protects against.


What you're learning: Setting up verification infrastructure from the ground up. Understanding the "why" behind each tool configuration builds judgment about when to be strict (public APIs, shared code) versus lenient (prototypes, experiments). You are learning to create environments where bad code cannot survive.

Safety Note

The "Programs Over Scripts" axiom is about production code. It is explicitly not about exploration. When you are experimenting with a new idea, prototyping a concept, or running a one-time data transformation, scripts are the right tool. The axiom does not say "never write scripts." It says "do not ship scripts as programs."

The danger is not writing a quick script. The danger is the moment that quick script becomes load-bearing infrastructure without anyone applying program discipline. Recognize that moment. When it arrives, stop and apply types, tests, error handling, and packaging before the script accumulates dependencies and expectations it was never built to handle.