Skip to main content

RLS and Roles: Policy in the Database

The Reconciler ran clean for two weeks. Tuesday night a bug shipped.

Somewhere in the agent's code, a variable holding user_id got reused. While processing Alice's bank statement, the Worker read Bob's user_id from the previous iteration. It categorised Alice's transactions correctly. Then it wrote the expense rows. Under Bob's name.

By morning Bob had forty new expenses he did not recognise. Alice's report was blank. The boss asked the obvious question: "the database knew the difference between Alice and Bob, right? Why did it let this happen?"

The honest answer was uncomfortable. "The database did not know who the Worker thought it was. The application told it. We trusted the application." That is the failure RLS exists to prevent.

The application layer is a layer. It is not a wall. A buggy Worker, a swapped variable, a leaked session: any of these can send the wrong user_id down the wire. The database accepts the value. The application was supposed to be the only line of defence. There is no second line.

This lesson teaches the smallest fix: put the wall inside the database itself.

Key Terms for This Lesson
  • Row-Level Security (RLS): A Postgres feature that filters which rows a query can see and which rows it can change, based on a policy attached to the table.
  • USING: The clause of an RLS policy that filters rows on read. A row that fails the USING check does not appear in the result set.
  • WITH CHECK: The clause of an RLS policy that rejects writes. A row that fails the WITH CHECK is refused at INSERT or UPDATE.
  • Principal: The identity the database thinks is making the request. For a Worker, this is the role plus the tenant context.
  • Tenant context: A session-local variable (usually app.current_user_id) that says "this transaction is acting on behalf of this user".
  • Role: A named database identity (worker_read, worker_write, worker_admin) with a defined set of permissions.
  • Boundary Error: A silent authorization failure. The query does not crash; it returns rows the principal should not see, or zero rows the principal should see.

Where Policy Belongs

The Reconciler bug above is not a database bug. It is a layering bug. The application was the only thing checking that Alice's transactions wrote to Alice's user_id. The database accepted whatever value came down the wire.

Two layers should know the rule. The application enforces UX (Alice cannot click the button to edit Bob's data). The database enforces the wall (even if the application asks, the database refuses).

LayerWhat it enforcesWhat it catches
ApplicationThe UX path. A user only sees their own data in the dashboard.Honest users using the product the right way.
Database (RLS)The contract. Even a buggy Worker or a leaked session cannot read or write outside its scope.Bugs, leaks, attacks, the wrong variable, the swapped session.

Without the database layer, a single bug above the database becomes a multi-tenant data leak. With it, the same bug becomes an empty result set or a rejected write. Visible. Recoverable. Not a press release.

Roles, the Minimal Set

Workers should never connect as the database owner. They should connect as a role with the smallest permission set that lets them do their job.

RolePurposeWhat it can do
worker_readWorkers that only read SoR dataSELECT on permitted tables, filtered by RLS USING
worker_writeWorkers that perform writesINSERT and UPDATE on permitted tables, limited by RLS WITH CHECK
worker_adminMigrations, role grants, ops fixesDDL, GRANT, can bypass RLS when explicitly needed

The agent writes the grants. You read them.

CREATE ROLE worker_write;
GRANT SELECT, INSERT, UPDATE ON expenses TO worker_write;
GRANT SELECT, INSERT ON outbox_messages TO worker_write;
GRANT SELECT, INSERT ON agent_events TO worker_write;
-- callback to Lesson 6: REVOKE UPDATE, DELETE ON agent_events FROM worker_write

Two things to read. The role gets only the verbs it needs (no DROP, no TRUNCATE). The grants are explicit per table; nothing is granted at the database or schema level by mistake.

A Worker that connects as worker_admin to "make things easier" is one bug away from a DROP TABLE expenses. Connect as the smallest role that works.

USING vs WITH CHECK

A policy has two clauses. Each one does a different job.

ALTER TABLE expenses ENABLE ROW LEVEL SECURITY;

CREATE POLICY expenses_tenant_isolation ON expenses
FOR ALL
TO worker_write
USING (user_id = current_setting('app.current_user_id', true)::uuid)
WITH CHECK (user_id = current_setting('app.current_user_id', true)::uuid);

Read it left to right.

ClauseWhat it doesWhat it catches
USINGFilters rows on read. A row whose user_id does not match the session's tenant context disappears from the result set.A Worker trying to read another user's data.
WITH CHECKValidates rows on write. An INSERT or UPDATE that would land a row whose user_id does not match the tenant context is rejected.A Worker trying to write a row for another user.

The two clauses usually carry the same condition. The Tuesday night bug above would have been caught by WITH CHECK: the Worker tried to INSERT an expense with Bob's user_id while the tenant context said Alice. The database would have refused the write.

If the policy only has USING, writes pass through. If the policy only has WITH CHECK, reads are unrestricted. Both are partial walls. You want both clauses.

The Silent Zero-Row Failure Mode

This is the most important warning in the lesson.

When tenant context is missing or wrong, RLS does not throw. The query does not error. The application gets back a result set, just an empty one. The INSERT does not crash; it silently affects zero rows.

-- Session forgot to call SET LOCAL app.current_user_id.
SELECT * FROM expenses;
-- Returns zero rows. No error.

This is a real failure mode, not a theoretical one. The shape:

  1. A Worker starts a new transaction.
  2. The Worker forgets to set app.current_user_id (a bug, a refactor, a missing config).
  3. The Worker runs a SELECT.
  4. RLS evaluates USING (user_id = current_setting('app.current_user_id', true)::uuid). The setting is missing, so the comparison is not true for any row.
  5. The result set is empty.
  6. The Worker logs "no work to do" and exits cleanly.

The Worker did not error. The Worker did not do anything either. Alice's expenses sat unread. The next morning, finance notices the dashboard is empty.

The verification habit that catches this is simple: do not just check that reads ran without error. Check that they returned the expected rows. A zero-row result for a tenant you know has data is the visible signal of a missing or wrong tenant context.

Setting Tenant Context per Worker Session

The Worker tells the database who it is acting for at the start of every transaction.

-- At the start of the Worker's per-user transaction:
SELECT set_config('app.current_user_id', $alice_uuid, true);
-- The third argument `true` means SET LOCAL: scope is the transaction only.

-- Then the Worker runs its work:
INSERT INTO expenses (worker_id, user_id, category, amount, spent_at, idempotency_key)
VALUES ($worker_id, $alice_uuid, 'Food', 12.50, '2026-05-09', $key)
ON CONFLICT (worker_id, idempotency_key) DO NOTHING;

Three things to read.

  1. The set_config call comes first, before any SELECT or INSERT. If it is missing, every subsequent statement runs without a tenant context.
  2. The third argument is true. This makes the setting transaction-scoped. When the transaction ends, the context goes with it. A subsequent transaction in the same session starts with a clean slate.
  3. The user_id in the INSERT must match the context. Otherwise WITH CHECK refuses the write.

The pattern is mechanical. Every Worker transaction starts with set_config. Every Worker review checks that the call is present and that the context matches the user the Worker is acting for.

PRIMM-AI+ Practice: The Identity Flip Drill

Predict [AI-FREE]

Three rows exist in expenses. Alice owns two, Bob owns one. RLS is enabled on the table with the policy above. You will run SELECT * FROM expenses in three separate transactions: one with Alice's context set, one with Bob's, one with no context set at all.

Before running anything, write down:

  • How many rows does Alice's session see?
  • How many rows does Bob's session see?
  • How many rows does the no-context session see?
  • Does any of the three sessions error?
  • Your confidence score from 1 to 5.

You should be able to answer all four before reading further.

Run

Ask Claude Code to run three separate transactions against your Neon branch.

Transaction A:

BEGIN;
SELECT set_config('app.current_user_id', '<alice-uuid>', true);
SELECT id, user_id, category, amount FROM expenses ORDER BY id;
COMMIT;

Transaction B: same but with <bob-uuid>.

Transaction C: same but skip the set_config line entirely.

What you should see:

SessionTenant contextRows returnedErrored?
AliceAlice's UUID2No
BobBob's UUID1No
Noneunset0No

The third row is the lesson. Zero rows. No error. A bug that looks like "nothing to do".

Investigate

Write your own one-paragraph explanation:

  1. Alice's session saw only Alice's two rows because the USING clause filtered out Bob's row.
  2. Bob's session saw only Bob's one row for the same reason.
  3. The no-context session saw zero rows because current_setting('app.current_user_id', true) returned NULL, so the equality check was not true for any row.
  4. None of the three sessions errored. RLS filtering is silent by design.

Then ask the agent:

  1. "If a Worker forgets to call set_config and then runs a SELECT, how would the on-call engineer notice? Walk through the logs."
  2. "An attacker controls the value passed to set_config. What can they do? What can they NOT do? Be specific."
  3. "I see only USING in this policy, no WITH CHECK. What writes would still be possible that USING does not prevent?"

The third question is where the WITH CHECK clause earns its keep. A USING-only policy filters reads but allows any write. A Worker with a swapped user_id would happily INSERT a row for the wrong tenant. The row would then disappear from that Worker's own reads (because USING filters it back out), but it would still be there.

This is the /boundary-probe habit in plain language: confirm the database role, confirm the tenant context, confirm the policy exists, read the USING clause, then read the WITH CHECK clause. Do that before editing the business query. Many "no rows" bugs are boundary errors, not query bugs.

Modify

Change one rule. Add a separate admin read policy:

CREATE POLICY expenses_admin_read ON expenses
FOR SELECT
TO worker_admin
USING (true);

Predict the difference. Then run a fourth transaction as the database role worker_admin with no app.current_user_id.

What you should see: the admin session reads all three rows from expenses. The bypass is tied to the database role, not to a session variable the Worker can casually set. That is the point. An admin path may exist, but it must be role-scoped, narrow, and auditable through append-only events. If the agent proposes OR current_setting('app.role') = 'admin' inside a normal Worker policy, reject it: that turns a tenant wall into a string flag.

Make [Mastery Gate]

The brief in business English:

"The Reconciler must only write expenses for the user it is currently scoped to. If it tries to INSERT a row whose user_id does not match the tenant context set at the start of the transaction, the database must reject the write. The Reconciler should also only read its own user's expenses; another user's rows must not appear in any SELECT."

Hand this brief to the agent. Ask for:

  1. The CREATE POLICY statement.
  2. The set_config call the Worker must run as the first statement in each per-user transaction.
  3. A short test plan: three transactions (Alice, Bob, unset) plus one write attempt with a mismatched user_id.

You read each piece. The gate passes when you can:

  1. Point at the USING clause and explain which read it filters.
  2. Point at the WITH CHECK clause and explain which write it rejects.
  3. Name what happens in the unset-context case and why the team must verify reads return the expected rows, not just that they did not error.

If the agent ships USING without WITH CHECK, you reject. If the agent forgets the set_config call, you reject. If the agent says "the application will handle it", you reject.

Try With AI

Prompt 1: The Identity-Flip Drill

I have an `expenses` table with RLS enabled and this policy:

CREATE POLICY expenses_tenant ON expenses
FOR ALL
TO worker_write
USING (user_id = current_setting('app.current_user_id', true)::uuid)
WITH CHECK (user_id = current_setting('app.current_user_id', true)::uuid);

Walk me through what each of these three transactions sees and does:

A) SET LOCAL app.current_user_id = '<alice-uuid>';
SELECT * FROM expenses;
B) SET LOCAL app.current_user_id = '<bob-uuid>';
INSERT INTO expenses (user_id, ...) VALUES ('<alice-uuid>', ...);
C) -- no SET LOCAL
SELECT * FROM expenses;

For each one, name the rows returned, the rows written, and whether
any statement errors.

What you're learning: Recognising the three classic RLS outcomes. Filtered read. Refused write. Silent empty read. The third one is the trap; the drill builds the reflex to catch it.

Prompt 2: USING vs WITH CHECK Recognition

I will paste three RLS policies. For each one, tell me whether it
fully isolates tenants, partially isolates them, or does almost nothing.
Name the specific failure each gap allows.

A) USING (user_id = current_setting('app.current_user_id', true)::uuid)
-- no WITH CHECK
B) WITH CHECK (user_id = current_setting('app.current_user_id', true)::uuid)
-- no USING
C) USING (user_id = current_setting('app.current_user_id', true)::uuid)
WITH CHECK (user_id = current_setting('app.current_user_id', true)::uuid)

What you're learning: USING and WITH CHECK do different jobs. A USING-only policy lets bad writes through. A WITH CHECK-only policy lets bad reads through. Both clauses are usually needed and they usually carry the same condition.

Prompt 3: Silent Zero-Row Trap

A Worker calls SELECT id, amount FROM expenses and gets back zero rows.
List five different reasons this could happen, ordered from least
alarming to most alarming. For each one, describe how I would tell
the difference by reading another table or running another query.

What you're learning: Zero rows is a signal, not an answer. It can mean "the table is empty for this tenant" (fine), "the tenant context is missing" (a Worker bug), "the role lacks SELECT" (a config bug), or "RLS is filtering for a wrong reason" (a policy bug). Differentiating between these is the verification habit that turns RLS from "looks like it works" to "I can prove it works".

Checkpoint

  • I can explain why the application layer alone is not a wall for multi-tenant data.
  • I can read a CREATE POLICY statement and name what USING filters and what WITH CHECK rejects.
  • I can recognise the silent zero-row failure mode and name the verification habit that catches it.
  • I can read the minimum role grants for worker_write and confirm the role only has the verbs it needs.
  • I have run the three-session identity flip drill and watched Alice, Bob, and the unset session each return the expected rows.
  • I rejected at least one agent-proposed policy during practice for missing WITH CHECK or for adding an unaudited admin bypass.

Flashcards Study Aid