SOS Services
← Engineering
engineering · Vimal Bahuguna

Multi-tenant RLS at studio scale

Postgres row-level security is the simplest answer to tenant isolation. The simplicity ends the moment you have to migrate 46 tables under live load.

If you’re building B2B SaaS on Postgres, sooner or later somebody asks: “how do you stop tenant A’s queries from returning tenant B’s rows?” There are basically two answers — a per-tenant schema (clean, expensive), or row-level security (cheap, dangerous if you get it wrong).

We chose RLS. Across two Supabase projects, more than 80 tables, and 38 active tenants. This post is what bit us along the way.

The mental model

Every table that holds tenant-scoped data gets a tenant_id column. A RLS policy on the table says: only rows where tenant_id matches the current session’s tenant claim are visible. Service-role traffic bypasses RLS (so background workers can do their thing). End-user JWTs carry the tenant claim, and every policy checks it.

CREATE POLICY "Tenant isolation" ON conversion_jobs
  FOR ALL USING (
    auth.role() = 'authenticated'
    AND tenant_id = (auth.jwt() ->> 'tenant_id')::uuid
  );

In theory you write this once and forget it. In practice every table needs one, every new feature needs a policy review, and the database refuses to let you turn it on in bulk while anyone is connected.

The drift incident

Our biggest production Postgres has 905 migrations. We discovered three weeks ago that local was at migration 905 and prod was also at 905 — but they were different 905s. Somebody had applied a hotfix migration directly through the Supabase MCP without committing the corresponding .sql file. Local kept advancing. Prod kept advancing. The numbers matched but the content didn’t.

The fix took a day. We replayed both sides, file by file, against a third reference checkout. The lesson stuck: every apply_migration MCP call must commit the matching .sql file at the same timestamp, no exceptions. We added this to our agent memory so future sessions can’t repeat the mistake.

The 46-table bulk migration

When we onboarded our second Supabase project (Aviation AI Pro), we needed to retrofit RLS onto 46 existing tables in one pass. Wrote the migration, ran it through MCP, watched the lock timeout fire. Tried again with smaller batches — same thing.

Turned out three Postgres connections from 2026-05-11 were still alive, holding AccessShareLock on 11 of the tables we needed to ALTER. They were zombie connections from a Python worker that had crashed without closing its pool. Nothing in the dashboard showed them as a problem. They just sat there blocking every DDL forever.

The fix:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - interval '24 hours';

After that, the RLS migration applied in small batches. We split it into chunks of 8 tables each — enough to fit inside Supabase MCP’s 60-second query budget.

What changed in our process

Three changes stuck:

  1. Every migration commits the SQL file. No agent-only-applied changes.
  2. Migrations that touch >5 tables get split. Anything larger times out under any kind of contention.
  3. Idle-in-transaction connections older than 24h get killed nightly. Cron job, not optional.

RLS is still the right answer for studio-scale multi-tenancy — it puts the isolation rule next to the data instead of relying on every query writer to remember. But the path to fully-enforced RLS isn’t “write a policy and you’re done.” It’s a long tail of operational lessons that only show up when you’ve already shipped.

Worth the cost. Eventually.