Updated Feb 16, 2026

The Testing Loop

You have a number. 186.38. You stare at it. Is it right?

You built sum.py in Lesson 1. It ran without errors. Output appeared. No red text. Your terminal says Total: 186.38 and you have no way to know if that's correct without doing the math yourself. And if you could do the math yourself, you wouldn't need the script.

This is the verification paradox — you built a tool to do work you can't easily check, which means you're trusting code you've never proven works. Most people stop here. Output appeared, no red text, must be fine. This assumption has cost companies millions and produced bugs that lurked for months before anyone noticed.

But there's a trick. What if you tested with numbers you CAN add in your head?

10 + 20 + 30 = 60. If the script says 60, it works. If it says anything else, you've caught a bug before it touched real data.

The Verification Conversation

Open Claude Code and try this:

You: I have sum.py that sums numbers from stdin. I want to verify it
works correctly. Create a test file with 10, 20, 30 -- numbers I can
add in my head. Expected answer: 60. Run it and check.

Watch what happens:

Claude Code: Good idea -- testing with known answers. Let me set that up.

[Creates: test_simple.txt with contents: 10, 20, 30]

[Runs: cat test_simple.txt | python sum.py]
Total: 60.0

Matches expected (60). Let me also test with decimals:

[Creates: test_decimals.txt with contents: 10.5, 20.5, 19.0]

[Runs: cat test_decimals.txt | python sum.py]
Total: 50.0

Matches expected (50). And negative numbers:

[Creates: test_negative.txt with contents: 100, -30, -20]

[Runs: cat test_negative.txt | python sum.py]
Total: 50.0

All three cases verified:
- Simple integers: 60.0 (correct)
- Decimals: 50.0 (correct)
- Negatives: 50.0 (correct)

Notice what the agent did. You asked for one test, and it volunteered two more -- decimals and negatives. That's the agent teaching you something: one test case isn't enough. Different input types can trigger different bugs.

But also notice what YOU did. You chose test data with a known answer. That's the human contribution the agent can't make on its own -- it doesn't know which numbers are easy for you to verify mentally.

Why Exit Code 0 Is a Lie

Every command in Bash produces an exit code. Check it with $?:

cat test_simple.txt | python sum.py
echo $?

Output:

Total: 60.0
0

Exit code 0 means: "the script ran without crashing."

It does NOT mean: "the script produced the right answer."

Here's proof. Consider this buggy version:

# buggy_sum.py - Has a silent bug
import sys

total = 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # BUG: Skips lines starting with digits > 5
    if line[0] in '6789':
        continue
    total += float(line)

print(f"Total: {total}")

Run it:

echo -e "10\n60\n30" | python buggy_sum.py

Output:

Total: 40.0

Check the exit code:

echo $?

Output:

0

Exit code 0. No errors. No warnings. But 40 is wrong -- the answer should be 100. The script silently skipped "60" because it starts with '6'.

Exit codes catch crashes. They don't catch logic errors. That's why verification with known answers matters. Your test data of 10, 20, 30 would have passed -- all start with digits 1-3. Only by testing with numbers like 60, 70, or 80 would you catch this bug.
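The same point can be checked from Python rather than by eye. The sketch below is only an illustration: it embeds the buggy script from above as a string, runs it in a child process, and prints the exit code next to the known answer. The helper name `run_buggy` and the two test cases are assumptions made for this example, not part of the lesson's files.

```python
import subprocess
import sys

# The buggy script from above, kept as a string so this sketch is self-contained.
BUGGY_SOURCE = """
import sys

total = 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    if line[0] in '6789':  # the silent bug: drops lines starting with 6-9
        continue
    total += float(line)

print(f"Total: {total}")
"""


def run_buggy(numbers):
    """Feed numbers to the buggy script on stdin; return (exit_code, total)."""
    stdin_text = "\n".join(str(n) for n in numbers) + "\n"
    result = subprocess.run(
        [sys.executable, "-c", BUGGY_SOURCE],
        input=stdin_text,
        capture_output=True,
        text=True,
    )
    # Parse the number out of the "Total: <value>" line.
    total = float(result.stdout.strip().split(": ")[1])
    return result.returncode, total


# Expected answers are worked out by hand first, then compared.
cases = [
    ([10, 20, 30], 60.0),
    ([10, 60, 30], 100.0),
]

for numbers, expected in cases:
    exit_code, actual = run_buggy(numbers)
    verdict = "PASS" if actual == expected else "FAIL"
    print(f"{numbers}: exit code {exit_code}, expected {expected}, got {actual} -> {verdict}")
```

Both runs report exit code 0, but only the first one passes. The exit code is identical in the passing and the failing case, so the comparison against the known answer is the only thing that separates them.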

The Exit Code Table

Common exit codes you'll encounter:

Exit Code	Meaning	Example
0	Success -- command completed without crashing	Script ran, output appeared
1	General error -- something went wrong	Python raised an exception
2	Misuse of command -- wrong arguments	`python --no-such-flag`
127	Command not found -- typo or missing program	`pythn sum.py`
130	Interrupted by Ctrl+C	You cancelled a long run

One caution: $? holds the exit code of the most recent command (for a pipeline like the one above, that is the last command in it, here python). Run echo $? immediately after the command you care about -- any command in between overwrites it.
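To see a few of these codes without touching sum.py, here is a small sketch that launches short Python child processes and reads their exit codes. The helper `exit_code_of` is a name invented for this example, and the one-line programs are stand-ins, not the lesson's scripts.

```python
import subprocess
import sys


def exit_code_of(python_args):
    """Run the current Python interpreter with the given args; return its exit code."""
    result = subprocess.run(
        [sys.executable, *python_args],
        capture_output=True,  # keep tracebacks and usage text out of the terminal
        text=True,
    )
    return result.returncode


clean = exit_code_of(["-c", "print('Total: 60.0')"])      # runs to completion
crashed = exit_code_of(["-c", "float('abc')"])            # unhandled ValueError
bad_args = exit_code_of(["--no-such-flag"])               # interpreter rejects the option
explicit = exit_code_of(["-c", "import sys; sys.exit(3)"])  # script picks its own code

print(f"clean run:           {clean}")
print(f"unhandled exception: {crashed}")
print(f"unknown option:      {bad_args}")
print(f"sys.exit(3):         {explicit}")
```

Each call returns its own exit code, so nothing gets overwritten the way $? does in the shell. Codes 127 and 130 are left out because they come from the shell and from signals, not from Python itself.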

The Verification Pattern

Here's a prompt pattern that works for almost any tool:

"Verify [tool] works correctly. Create test data with a known answer
[X] and check that output matches."

This works because:

- Known answer first. You calculate the expected result before running the tool.
- Simple test data. Numbers you can add in your head (10 + 20 + 30 = 60).
- Multiple cases. Test integers, decimals, negatives, edge cases.
- Comparison. Output must match expectation exactly.

Pattern Variations

What You're Testing	The Prompt
Sum script	"Verify sum.py with test data 10, 20, 30 (expected: 60)"
Average script	"Verify average.py with test data 10, 20, 30 (expected: 20)"
Max script	"Verify max.py with test data 10, 50, 30 (expected: 50)"
Filter script	"Verify filter.py keeps only numbers > 20 from 10, 30, 50 (expected: 30, 50)"

The tool changes. The verification pattern stays the same.
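As a rough sketch of "the tool changes, the pattern stays the same", the snippet below runs one `verify` helper against four different tools using the test data from the table. Everything here is an assumption for illustration: `verify` and `keep_above_20` are invented names, and Python's built-in `sum`, `max`, and `statistics.mean` stand in for sum.py, max.py, and average.py.

```python
from statistics import mean


def verify(name, tool, data, expected):
    """Run tool(data), compare with the answer worked out by hand, report the result."""
    actual = tool(data)
    passed = actual == expected
    status = "PASS" if passed else "FAIL"
    print(f"{status}  {name}: expected {expected}, got {actual}")
    return passed


def keep_above_20(numbers):
    """Hypothetical stand-in for filter.py: keep only numbers greater than 20."""
    return [n for n in numbers if n > 20]


# Same helper, different tools; every expected value was computed by hand.
results = [
    verify("sum", sum, [10, 20, 30], 60),
    verify("average", mean, [10, 20, 30], 20),
    verify("max", max, [10, 50, 30], 50),
    verify("filter > 20", keep_above_20, [10, 30, 50], [30, 50]),
]
print(f"{sum(results)} of {len(results)} checks passed")
```

The helper never needs to know what the tool does. It only needs data, an expected answer, and a comparison.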

The Zero-Trust Philosophy

This approach embodies a mindset: assume everything is broken until proven otherwise.

When Claude Code summed your expenses, it wrote a script, ran it, and showed output. Exit code 0. Everything looked fine. But the agent cannot verify that $186.38 is correct for YOUR expenses unless you give it test data with known answers. The agent ran code -- it didn't validate business logic.

The workflow:

1. Agent generates and runs code -- script executes, output appears
2. You request verification -- "test this with known data"
3. Agent creates test cases -- simple data with calculable answers
4. Comparison proves correctness -- output matches expectation

This applies to code from any source. Trust is earned through verification, not granted through origin.

Checkpoint: Verify YOUR sum.py

Stop reading. Create a file called test_simple.txt with three numbers: 10, 20, 30. Run your sum.py from Lesson 1 against it. Does it say 60?

You: Create test_simple.txt with 10, 20, 30 on separate lines.
Then run: cat test_simple.txt | python sum.py
Expected output: Total: 60.0

If it says 60 -- your script works for simple integers. Now try edge cases:

You: Test sum.py with these edge cases:
- Empty file (expected: 0 or 0.0)
- Single number: just "42.5" (expected: 42.5)
- File with blank lines mixed in between numbers

If any test fails, you've discovered a bug before it touched real data. Fix it now — Lesson 3 builds on a working sum.py.
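If you would rather see the edge cases as code, the sketch below checks the same three situations in-process. It assumes a sum.py that reads one number per line and skips blank lines; `sum_stream` is a hypothetical stand-in written for this example, so your real script may behave differently (which is exactly what the test is for).

```python
import io


def sum_stream(stream):
    """Hypothetical stand-in for sum.py: add one number per line, skipping blank lines."""
    total = 0.0
    for line in stream:
        line = line.strip()
        if not line:
            continue
        total += float(line)
    return total


# io.StringIO behaves like stdin here: iterating over it yields one line at a time.
edge_cases = [
    ("empty file", "", 0.0),
    ("single number", "42.5\n", 42.5),
    ("blank lines mixed in", "10\n\n20\n\n\n30\n", 60.0),
]

for label, text, expected in edge_cases:
    actual = sum_stream(io.StringIO(text))
    status = "PASS" if actual == expected else "FAIL"
    print(f"{status}  {label}: expected {expected}, got {actual}")
```

Each case pairs an input with an answer you can confirm at a glance, which keeps the edge cases as easy to trust as 10 + 20 + 30.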

Your testing loop works beautifully on clean number lists. But real bank data isn't clean. Open an actual bank statement CSV and you'll find merchant names with commas inside them, dollar signs mixed into amounts, and header rows that aren't numbers at all. Your carefully tested sum.py is about to meet the real world — and the real world cheats. (And somewhere in that data, "DR PEPPER SNAPPLE" is waiting to be counted as a medical expense. But that's a problem for later.)

Try With AI

Prompt 1: Discover Edge Cases

What edge cases might break a script that sums numbers from stdin?
Think about unusual inputs: empty files, non-numeric lines, very
large numbers, special characters. List cases I should test.

What you're learning: Defensive thinking. The agent anticipates failure modes you haven't considered -- dollar signs in data, overflow on large numbers, Unicode characters. Your verification becomes more thorough than anything you'd design alone.

Prompt 2: Automate Verification

I have 5 test cases for sum.py. Help me write a simple bash script
that runs all tests and reports which passed and which failed. Each
test should compare actual output to expected output.

What you're learning: Test automation. Instead of manually running tests one at a time, you build a script that runs them all and reports results. This is how professionals keep code correct over time.

Prompt 3: Debug a Failure

My sum.py gives wrong output on this test:
- Input: 10, 60, 30
- Expected: 100
- Actual: 40

The script works fine on other inputs. Exit code is 0.
Help me find the bug. What could cause 60 to be skipped?

What you're learning: Root cause analysis through collaboration. You present a specific failure with evidence (expected vs. actual), and the agent helps you reason backward from symptoms to cause. Notice you're teaching the agent what the problem is -- it can't debug without your observation that 60 is being skipped. This is collaborative debugging where both sides contribute something the other can't.
