
Parsing Real Data

Try This First: 10 Seconds

Run this in your terminal before reading on:

echo '2024-01-07,"AMAZON, INC.",-89.50' | awk -F',' '{print $3}'

What did you expect to see? What did you actually get?

Your first Python command handled clean numbers. Real data fights back. You download your bank statement, point your tool at it, and get a number that looks reasonable, until you notice the $89.50 Amazon order is missing from the total. The culprit is "AMAZON, INC.", a comma hiding inside quotes. Your script split that merchant name in half and summed the wrong column. No error. No warning. Just a wrong total.

This is the same lesson as the File Processing chapter, just in a different domain: real folders had hidden files and spaces in filenames; real CSVs have commas inside quoted fields.

Checkpoint: Use YOUR Bank Data

Download your actual bank statement as CSV. Most banks offer this under "Download Transactions" or "Export" in online banking.

If your bank doesn't offer CSV export, use the test data below instead.

Privacy note: Your bank CSV may contain account numbers. Before processing, you can remove sensitive columns. The scripts only need the Description and Amount columns.

The Problem

Here's what a bank statement CSV looks like:

Date,Description,Amount
2024-01-02,Coffee Shop,-5.50
2024-01-03,Grocery Store,-127.43
2024-01-07,"AMAZON, INC.",-89.50

You need to sum the Amount column. Simple, right? Let's ask Claude Code.

The First Attempt

Ask Claude Code to sum the Amount column. It reaches for awk, which is fast and usually a reasonable choice:

Claude Code: I'll extract and sum the third column.

[Runs: awk -F',' '{if(NR>1) sum+=$3} END{print sum}' bank-statement.csv]

-132.93

The number looks plausible. But with the Amazon order at -89.50, the three rows should sum to -222.43. Check the problem row:

echo '2024-01-07,"AMAZON, INC.",-89.50' | awk -F',' '{print $3}'

The output is INC." instead of -89.50. awk split on the comma inside "AMAZON, INC.", saw four fields instead of three, and silently treated INC." as zero when summing. Some rows work; some silently fail. The trap only triggers on rows with quoted fields, which makes it the worst kind of bug.
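You can reproduce the same failure in Python. A minimal sketch: a naive split on commas makes exactly the mistake awk does, while `csv.reader` keeps the quoted field intact:

```python
import csv
import io

line = '2024-01-07,"AMAZON, INC.",-89.50'

# Naive comma split cuts the quoted merchant name in half.
naive_fields = line.split(',')
print(naive_fields)   # 4 fields; the "amount" slot now holds ' INC."'

# csv.reader applies the quoting rules, so the merchant stays whole.
csv_fields = next(csv.reader(io.StringIO(line)))
print(csv_fields)     # ['2024-01-07', 'AMAZON, INC.', '-89.50']
```

The naive split yields four fields and puts garbage where the amount should be; the csv parser yields the three fields the row actually contains.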

The agent gave you a working command that exited cleanly. But it was wrong. The agent doesn't know your data has commas inside quoted fields unless you tell it. This is where your verification catches what the agent missed.

Redirecting the Agent

Bring your discovery back to Claude Code:

You: The awk approach breaks on rows like "AMAZON, INC." where
commas hide inside quoted fields. The total is wrong. I need a
solution that handles CSV quoting correctly.

The agent switches to Python's csv module. The key lines of sum-expenses.py:

import csv
import sys

reader = csv.reader(sys.stdin)  # understands CSV quoting rules
next(reader)                    # skip the header row
for row in reader:
    amount = float(row[2])      # Amount is the 3rd column (index 2)

The agent tests it on the problem data and gets the correct total. The full script is in your working directory. The important thing is what just happened: the agent's first instinct (awk) was reasonable but wrong for your data. You caught the error through verification, told the agent exactly what went wrong, and it switched to the right tool. The agent knew the csv module. You knew your data. Neither could have solved this alone.

Why Filter for Negatives?

Notice the script checks if amount < 0 before summing. Bank CSVs use negative numbers for debits (money out) and positive numbers for credits (refunds, deposits). Without this filter, a $500 refund would offset $500 of real spending, silently skewing your total. This is the kind of bug that passes every test with expense-only data and breaks the moment real data includes a refund. If your bank uses a different convention (separate Debit/Credit columns, or all-positive amounts), tell Claude Code about your format and it will adapt the filter.
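Putting the pieces together, here is a minimal sketch of what sum-expenses.py plausibly looks like with the negative filter in place (the function name and sample rows are illustrative; the installed script reads sys.stdin instead of a list):

```python
import csv

def sum_expenses(lines):
    """Sum the debits (negative amounts) from an iterable of CSV lines."""
    reader = csv.reader(lines)
    next(reader)                   # skip the header row
    total = 0.0
    for row in reader:
        if not row:                # tolerate blank lines
            continue
        amount = float(row[2])     # Amount is the 3rd column (index 2)
        if amount < 0:             # debits only; refunds stay out
            total += amount
    return total

sample = [
    'Date,Description,Amount\n',
    '2024-01-02,Coffee Shop,-5.50\n',
    '2024-01-07,"AMAZON, INC.",-89.50\n',
    '2024-01-09,"Refund from store",500.00\n',
]
print(f"{sum_expenses(sample):.2f}")  # -95.00; the refund is excluded
```

With the filter, the $500 refund never touches the total; remove the `amount < 0` check and the same data would report -95.00 + 500.00 = 405.00, a nonsense figure for spending.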

Python's csv module understands quoting rules. It knows that commas inside quotes don't count as separators. The module handles:

  • Quoted fields with commas inside
  • Escaped quotes within quoted fields
  • Different line ending styles
  • Edge cases you haven't thought of yet
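One of those edge cases, escaped quotes, is worth seeing once. Inside a quoted field, a doubled quote ("") stands for a single literal quote character, and `csv.reader` decodes it along with the embedded comma (the merchant name here is made up):

```python
import csv
import io

# A doubled quote ("") inside a quoted field means one literal quote.
tricky = '2024-01-05,"JOE""S DINER, LLC",-42.00'

row = next(csv.reader(io.StringIO(tricky)))
print(row)  # ['2024-01-05', 'JOE"S DINER, LLC', '-42.00']
```

A hand-rolled split has no chance against a row like this; the csv module handles it without any extra code.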

When to Use Which

| Data Type | Use awk | Use csv module |
| --- | --- | --- |
| Log files with consistent delimiters | Yes | |
| Tab-separated data | Yes | |
| Data you control (no embedded delimiters) | Yes | |
| Bank exports | | Yes |
| Downloaded datasets | | Yes |
| Any external CSV | | Yes |

Rule of thumb: If the CSV came from outside your control, use a proper CSV parser.

The Pattern

Here's the pattern for handling structured data:

"I have [structured data type]. Process [specific column/field].
Be careful - [edge case that could break naive parsing]."

Mentioning the edge case helps the agent choose robust solutions. Compare:

| Prompt | Agent's Likely Response |
| --- | --- |
| "Sum the third column of this CSV" | Might use awk (faster, simpler) |
| "Sum the Amount column. Some merchant names have commas" | Uses csv module (handles edge cases) |

The second prompt gives context that guides the agent to the right tool. You're teaching the agent about your data, and the agent is teaching you which tools handle that data correctly.

Three lessons in, you've built two scripts that work and one habit that matters more than both of them: never trust output you haven't verified. That habit just saved you from a silent CSV parsing bug that the agent itself introduced.

Make It Permanent

Your scripts work. But try this: close your terminal, open a new one, navigate to a different folder, and run sum-expenses. You'll get "command not found." The script exists somewhere on your machine, and you can't use it without remembering the exact path. If you have to remember where a tool lives, it's not a tool yet.

Ask Claude Code to fix this:

You: I want to use sum-expenses from anywhere on my system.
Make it a permanent command I can run like: cat file.csv | sum-expenses
Claude Code: I'll set that up.

[Runs: mkdir -p ~/tools]
[Runs: cp sum-expenses.py ~/tools/sum-expenses.py]
[Runs: chmod +x ~/tools/sum-expenses.py]
[Runs: echo $SHELL]
/bin/zsh

[Appends to ~/.zshrc:]
alias sum-expenses='python3 ~/tools/sum-expenses.py'

[Runs: source ~/.zshrc]

Done! Try it from any directory:
cat any_file.csv | sum-expenses

The agent organized your script into ~/tools, made it executable, checked your shell, and reloaded the config. You described the outcome; the agent handled every step.

Checkpoint: Prove It's Permanent
  1. Close your terminal completely
  2. Open a brand new terminal
  3. Navigate to your home directory: cd ~
  4. Type: sum-expenses
  5. If you see usage info or an error about missing input, your command is installed
  6. If you see "command not found": check your alias in ~/.zshrc (or ~/.bashrc if you use bash)

You've just installed your second Unix-style command. The first (sum.py) handles numbers. This one handles structured data. The pattern is the same: stdin in, result out, pipes connect everything. But one tool isn't a toolkit, and the next lesson shows why that matters.


Try With AI

Prompt 1: Understand the CSV Trap

Show me what happens when I try to parse "AMAZON, INC.",-89.50 with awk.
Why does it give the wrong result? What does Python's csv module do differently?

What you're learning: How bringing a specific failure changes what the agent can give you. "Parse this CSV" produces a generic answer. "Parse this CSV: here's the line that breaks it" produces a targeted solution. The failure you discovered through verification is the input that directs the agent to the right tool. Your observation was the work; the agent's knowledge of the csv module was the mechanism.

Prompt 2: Extend the Parser

My sum-expenses.py works well. Can you modify it to:
1. Accept a column number as an argument (so I can sum any column)
2. Handle empty lines gracefully
3. Show how many transactions were processed

Keep the stdin pattern so it still works with pipes.

What you're learning: Iterative improvement. You have working code and clear requirements. The agent extends functionality while preserving the composable pattern.
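One plausible shape for the extended script (the function name and sample data are illustrative, not the agent's actual output; in the installed version the column number would come from sys.argv and the rows from sys.stdin):

```python
import csv

def sum_column(lines, col):
    """Sum one column of a CSV; returns (total, rows_processed)."""
    reader = csv.reader(lines)
    next(reader)                   # skip the header row
    total, count = 0.0, 0
    for row in reader:
        if not row:                # handle empty lines gracefully
            continue
        total += float(row[col])
        count += 1
    return total, count

sample = ['Date,Description,Amount\n', '\n', '2024-01-02,Coffee Shop,-5.50\n']
total, count = sum_column(sample, 2)
print(f"Processed {count} transaction(s), total {total:.2f}")
```

All three requirements fit in a few lines, and because input is still an iterable of lines, the stdin/pipe pattern is untouched.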

Prompt 3: Adapt to Your Bank Format

My bank's CSV has different column names than the example. The amount
column is called 'Debit' instead of 'Amount' and there's a separate
'Credit' column. How do I modify sum-expenses.py to handle my specific
bank format?

What you're learning: The most common director move in data work: you specify your data's actual schema (column name: 'Debit', not 'Amount'; separate credits column), the agent adapts the implementation. You don't need to know how csv.DictReader works. You need to know what your data looks like. That knowledge is yours; the implementation is the agent's.
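For a Debit/Credit format, the agent would likely reach for `csv.DictReader`, which keys each row by its header names. A sketch under the assumption that credit-only rows leave the Debit cell empty (column names and sample rows are illustrative):

```python
import csv

def sum_debits(lines):
    """Sum the Debit column of a CSV with separate Debit/Credit columns."""
    reader = csv.DictReader(lines)  # keys each row by its header name
    total = 0.0
    for row in reader:
        if row["Debit"]:            # credit-only rows leave Debit empty
            total += float(row["Debit"])
    return total

sample = [
    'Date,Description,Debit,Credit\n',
    '2024-01-02,Coffee Shop,5.50,\n',
    '2024-01-09,Store Refund,,500.00\n',
]
print(f"{sum_debits(sample):.2f}")  # 5.50
```

Because DictReader looks fields up by name rather than position, this version also survives banks that reorder their columns.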