
Migrations: Schema History, Not Schema Memory

The boss has a new request for the Reconciler. Every expense row should now carry a confidence_score so finance can flag low-confidence matches for review.

The agent says, "I'll add a column."

You stop the agent.

In production, the difference between "add a column" and "add a column safely" is the difference between a 30-second change and a 30-minute outage. The Reconciler is running. Other Workers are reading the expenses table. There is real data behind the column. A migration file is the contract that says "here is the change, here is how to undo it, here is the version of the schema we are moving from and the version we are moving to." Without that file, you are at the mercy of whatever the agent remembered last.

This lesson teaches you to read migrations, apply them on a Neon branch, and refuse anything destructive without an explicit override.

Key Terms for This Lesson
  • Migration: a versioned file that describes a schema change. The same file runs on dev, on staging, on production. It produces the same schema everywhere.
  • DDL (Data Definition Language): the SQL that changes structure (CREATE TABLE, ALTER TABLE, DROP TABLE). Distinct from DML (INSERT, UPDATE, DELETE) which changes rows.
  • Idempotent migration: a migration that produces the same end state whether it runs once or twice. Safe to retry.
  • Upgrade: the function (or block) that moves the schema forward. Usually called upgrade() in Alembic.
  • Downgrade: the function that moves the schema back. Usually called downgrade(). Your rollback path.
  • Neon branch: a copy-on-write branch of your Neon database. Cheap. Disposable. The right place to test a migration before it touches your real data.

Why Schema Memory Fails

If you let the agent "just edit the model" or "just alter the table," three things break.

  1. No history: a week later, nobody can say what the schema looked like on Tuesday.
  2. No replay: dev and staging drift away from production because the change was made once on production and never recorded.
  3. No rollback: if the change breaks something, the only way back is to remember the old shape and type it from memory. At 11pm. With the boss watching.

A migration file fixes all three. It is checked into git. It runs the same way everywhere. Its downgrade() block is the rollback you wrote at the time you knew best, not the one you guess at under pressure.

The agent writes the migration file. You read the DDL. The two roles do not change.

What a Migration File Looks Like

Here is the smallest example: add a confidence_score column to expenses. The DDL is the part you read. The Python wrapping is the part the agent writes.

# alembic/versions/2026_05_11_add_confidence_score.py
"""add confidence_score to expenses

Revision ID: ab12cd34
Revises: 9f0e1d2c
Create Date: 2026-05-11 09:00:00
"""

from alembic import op
import sqlalchemy as sa


def upgrade() -> None:
    op.add_column(
        "expenses",
        sa.Column("confidence_score", sa.Numeric(4, 3), nullable=True),
    )


def downgrade() -> None:
    op.drop_column("expenses", "confidence_score")

Two things to read:

  1. upgrade() adds a confidence_score column on expenses. It is a numeric(4,3) (so values like 0.987). It is nullable, so existing rows keep going without backfilling.
  2. downgrade() drops the same column. Your rollback path. If you change one, you change both.

The DDL underneath, in plain SQL, is just:

-- upgrade
ALTER TABLE expenses ADD COLUMN confidence_score NUMERIC(4,3);

-- downgrade
ALTER TABLE expenses DROP COLUMN confidence_score;

That is what the database actually runs. The Python is the wrapper that records "this revision was applied" so the next migration knows where it is starting from.
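That bookkeeping lives in a small table Alembic creates and manages itself, named alembic_version, holding the single current revision ID. You can query it directly (a sketch; the example value matches the file header above):

```sql
-- Alembic records the currently applied revision in a one-row table.
SELECT version_num FROM alembic_version;
-- e.g. ab12cd34  (the Revision ID from the migration file header)
```

If this query returns a different revision than you expect, the database and your migration files disagree about where the schema is, and that is worth resolving before running anything else.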

The Drop Guard

A DROP TABLE, DROP COLUMN, or ALTER TABLE ... DROP CONSTRAINT is irreversible against real data. The downgrade may add the column back, but it cannot bring back the rows you deleted with it.

The rule for this chapter (and for your career):

Never let an agent run a destructive migration without (a) a Neon branch you can revert to, (b) a verified backup of the table, and (c) an explicit human approval on the change.

If the agent hands you something like this:

def upgrade() -> None:
    op.drop_column("expenses", "old_legacy_field")


def downgrade() -> None:
    op.add_column(
        "expenses",
        sa.Column("old_legacy_field", sa.Text(), nullable=True),
    )

Read the upgrade() first. The data in old_legacy_field is gone the moment that runs. The downgrade() brings the column back. It does not bring the values back. Refuse the migration. Demand the three preconditions. Then run it on a branch, confirm nothing else breaks, and only then run it on the main database.

What to demand instead, in business English:

  • "Show me the count of non-null old_legacy_field values before you drop it."
  • "Confirm there is a Neon branch I can fail back to if this turns out wrong."
  • "Confirm the data has been exported somewhere I can read it for the next 90 days."

Three short sentences. They cost you a minute. They have saved many careers.
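In SQL, the first and third demands might look like this (a sketch against the example above; the id column and the export path are assumptions, and COPY to a server-side file needs appropriate privileges -- psql's \copy is the client-side equivalent):

```sql
-- 1. How much data would the drop destroy?
SELECT COUNT(*) AS non_null_rows
FROM expenses
WHERE old_legacy_field IS NOT NULL;

-- 3. Export the values somewhere readable before any drop runs.
COPY (
    SELECT id, old_legacy_field
    FROM expenses
    WHERE old_legacy_field IS NOT NULL
) TO '/backups/expenses_old_legacy_field.csv' WITH (FORMAT csv, HEADER true);
```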

PRIMM-AI+ Practice: Add confidence_score Safely

Predict [AI-FREE]

Before you let any tool run, write down:

  1. What columns exist on expenses today?
  2. What columns will exist after upgrade() runs?
  3. What columns will exist after downgrade() runs?
  4. If a row already exists in expenses with no confidence_score, what value will the new column hold for that row, and why?
  5. Confidence score (1 to 5) on each answer.

If your prediction for question 4 is anything other than "NULL," re-read the migration. The column was added with nullable=True and no default, so existing rows get NULL.
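You can watch the NULL-for-existing-rows rule happen in a few lines of Python's built-in sqlite3 module (a stand-in for Postgres here; the rule for a nullable column with no default is the same):

```python
import sqlite3

# In-memory database standing in for the real expenses table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO expenses (amount) VALUES (42.50)")

# The upgrade: add a nullable column with no default.
conn.execute("ALTER TABLE expenses ADD COLUMN confidence_score NUMERIC")

# The pre-existing row holds NULL for the new column.
row = conn.execute("SELECT confidence_score FROM expenses").fetchone()
print(row[0])  # None -- SQL NULL surfaces as Python None
```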

Run

Apply the migration on a Neon branch, not on your main database.

Create a Neon branch from my current main branch called
`migration-confidence-score`. Switch the connection string to that
branch. Apply the migration that adds confidence_score to expenses.
Show me the new schema for the expenses table. Then run the downgrade.
Show me the schema again so I can confirm the column is gone.

What you should see:

  • After upgrade(): expenses now has a confidence_score numeric(4,3) column. Existing rows have NULL for it. Row count is unchanged.
  • After downgrade(): the column is gone. Row count is still unchanged. Other columns are untouched.
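If you would rather check the schema yourself than trust the agent's printout, a standard information_schema query works on any Postgres, including a Neon branch:

```sql
-- Columns of expenses, in order, with type and nullability.
SELECT column_name, data_type, numeric_precision, numeric_scale, is_nullable
FROM information_schema.columns
WHERE table_name = 'expenses'
ORDER BY ordinal_position;

-- Sanity check: the row count should be identical before and after.
SELECT COUNT(*) FROM expenses;
```

Run it after the upgrade and again after the downgrade; the only difference between the two outputs should be the confidence_score row.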

Investigate

Write your own explanation first.

  1. Why did existing rows survive the upgrade() without backfilling? (Hint: nullable.)
  2. Why is the order in downgrade() exactly the inverse of upgrade()? (Hint: it has to undo the same step.)
  3. What would have changed if the agent had written nullable=False without a default?

Then ask the agent these targeted questions:

  1. "What error would the database throw if I tried ALTER TABLE expenses ADD COLUMN confidence_score NUMERIC(4,3) NOT NULL against a non-empty table? Why?"
  2. "Walk me through the smallest safe sequence to add a NOT NULL column with no default value to a non-empty table."
  3. "If I forgot to write the downgrade(), how would I roll this change back in production?"

You should be able to answer your version of (3) before reading the agent's: "I would restore from the Neon branch I took before applying the migration."

Modify

Change one constraint. Ask the agent for a follow-up migration that adds a NOT NULL constraint on confidence_score with a default value of 0.000.

Predict before you run:

  • Will existing NULL values be filled with 0.000?
  • Will the table be locked during the change?
  • What does downgrade() need to do to be a real rollback (drop the constraint, drop the default, both)?

Apply on the branch. Verify. Roll back. Compare with your prediction. The point is not the syntax: the point is that you are reading the migration before it runs.
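One safe shape for that follow-up migration, as a hedged sketch (the revision header is omitted, and the backfill step is an assumption about what your data needs; read it against your own table before running):

```python
from alembic import op
import sqlalchemy as sa


def upgrade() -> None:
    # 1. Backfill existing NULLs first, so the NOT NULL constraint can succeed.
    op.execute(
        "UPDATE expenses SET confidence_score = 0.000 "
        "WHERE confidence_score IS NULL"
    )
    # 2. Then tighten the column: add the default and forbid NULL.
    op.alter_column(
        "expenses",
        "confidence_score",
        nullable=False,
        server_default=sa.text("0.000"),
    )


def downgrade() -> None:
    # Undo both: drop the default and allow NULL again. Note the backfilled
    # zeros stay -- a downgrade cannot tell them apart from real data.
    op.alter_column(
        "expenses",
        "confidence_score",
        nullable=True,
        server_default=None,
    )
```

Notice the asymmetry: the downgrade restores the constraint and the default, but not the NULLs. That is exactly the kind of one-way detail you should catch while reading, before the migration runs.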

Make [Mastery Gate]

Write a one-sentence brief in business English:

"Add a reconciler_run_id column to expenses so we can trace which run created each row. The column should be nullable for backfill, of type uuid, and indexed because we will join on it later."

Hand the brief to the agent. Ask for the migration file.

Read the file. Confirm:

  • The upgrade() adds a uuid column with nullable=True and creates an index.
  • The downgrade() drops the index first, then drops the column.
  • The order matters: the downgrade() should drop the index explicitly before the column. (PostgreSQL will cascade-drop a dependent index when the column goes, but an explicit drop keeps the rollback readable and portable to databases that will not.)
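A file passing those checks might look like the following sketch (the revision header is omitted, and the index name is an assumption; your agent's version may differ in detail):

```python
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import UUID


def upgrade() -> None:
    op.add_column(
        "expenses",
        sa.Column("reconciler_run_id", UUID(), nullable=True),
    )
    op.create_index(
        "ix_expenses_reconciler_run_id", "expenses", ["reconciler_run_id"]
    )


def downgrade() -> None:
    # Inverse of upgrade(): drop the index explicitly, then the column.
    op.drop_index("ix_expenses_reconciler_run_id", table_name="expenses")
    op.drop_column("expenses", "reconciler_run_id")
```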

Apply on the branch. Verify the column exists and the index exists. Roll back. Verify both are gone.

The gate passes when you can approve a migration in writing ("yes, I read this, I confirm upgrade() does X and downgrade() does Y, and I tested both on a branch") and when you can refuse a destructive one with the three preconditions named.

What Failure This Prevents

This lesson kills three failure modes at once: schema drift between dev/staging/production (because the migration file is the same everywhere), irreversible destructive changes (because the drop guard makes them require explicit approval), and forgotten history (because every change is now a versioned file in git).

Try With AI

Prompt 1: Read a Migration

Show me a sample Alembic migration file that adds a `created_by uuid`
column to a `tasks` table, nullable, indexed. Do not explain anything.
After you show it, I will narrate `upgrade()` and `downgrade()` in
plain English. Then you tell me whether I missed anything.

What you're learning: Reading-first habit. The migration file is a contract. Your job is to read the contract before you sign it. The agent's job is to be the one who has to argue if your reading is wrong.

Prompt 2: Refuse a Destructive Migration

You wrote a migration that drops the `legacy_notes` column from the
`expenses` table because the column "is no longer used." Before I let
this run, list the three preconditions I should require: branch,
backup, approval. For each one, give me the exact thing I should ask
you to do.

What you're learning: The drop-guard ritual. By forcing the agent to write out the preconditions, you build the muscle of refusing first and asking for evidence second. By the third time you do this, it becomes automatic.

Prompt 3: Migration with NOT NULL on a Live Table

The Reconciler now requires every expense to have a confidence_score.
Write the smallest migration that takes the existing nullable column
and makes it NOT NULL with a default of 0.000. Walk me through the
order of operations. Tell me which step locks the table and for how
long.

What you're learning: Reality. A NOT NULL migration on a live table is one of the most common production incidents in this whole topic. Reading the agent's plan in advance, and refusing if the table-lock window is too long, is the difference between a smooth Thursday and getting paged on Friday.

Checkpoint

  • I can read a migration file and narrate upgrade() and downgrade() in plain English.
  • I applied a migration on a Neon branch, verified the change, and ran the downgrade successfully.
  • I rejected at least one destructive migration in this lesson and named the three preconditions.
  • I can explain why "let the agent just alter the table" is not the same as "run a migration."
  • I am ready to look at the run-state journal in the next lesson.
