--- name: database-design version: 3.1.0 description: > Design, review, and optimize relational and non-relational database schemas. Use when: designing schemas, modeling data, choosing DB technologies, adding indexes, optimizing queries, planning migrations, designing multi-tenancy, or reviewing an existing schema for correctness and performance. TaxBridge context: PostgreSQL 15 RLS multi-tenant tax data, BullMQ job tables, payment audit chains, FIRS/NDPC compliance retention schemas. SabiScore context: credit feature store, ML scoring audit trail, Redis Pub/Sub. Triggers: "design a schema", "model this data", "what indexes do I need", "is my schema correct", "query is slow", "plan a migration", "multi-tenant DB". Do NOT use for: ORM config, DB driver code, or pure application data handling without schema implications. portfolio: scardubu.dev (TaxBridge · SabiScore · pg-tenant · audit-chain) --- The database schema is the most expensive thing to change in production. A bad column type ships with the first migration and compounds with every feature built on top of it. In a compliance-bound fintech system like TaxBridge, a nullable `amount` column or a missing `deleted_at` audit trail is a regulatory liability. Design it right once. The application layer is just plumbing. --- ## PHASE 1 — DATA MODELING ### 1.1 Entity Identification Extract entities (nouns) and relationships (verbs) from the domain. ``` Process: 1. List all "things" the system needs to know about 2. For each: identify lifecycle (created/modified/deleted? or append-only?) 3. For each pair: identify cardinality 4. Identify roots (owned by nothing) vs. children (owned by a root) TaxBridge entity map: Roots: tenants · users · taxpayers · tax_periods Children: filings → line_items · payments → payment_events jobs (BullMQ) · audit_events (append-only) Junction: taxpayer_accountants (M:N users ↔ taxpayers) ``` **Cardinality:** ``` 1:1 taxpayer → profile Profile belongs to exactly one taxpayer 1:N filing → line_items Filing has many line items M:N user → taxpayer User (accountant) manages many taxpayers Self category → subcategory Category can have a parent ``` ### 1.2 Normalization **Target: Third Normal Form (3NF) for OLTP. Denormalize only with measured proof.** ``` 1NF: Atomic values, no repeating groups ❌ tags: "vat,cit,pit" → ✅ separate filing_tags table 2NF: Every non-key column depends on the WHOLE PK ❌ filing_items(filing_id, product_id, product_name) — product_name → product_id ✅ products(product_id, product_name) + filing_items(filing_id, product_id, quantity) 3NF: No transitive dependencies ❌ filings(filing_id, taxpayer_id, taxpayer_email) — email depends on taxpayer_id ✅ taxpayers(taxpayer_id, email) + filings(filing_id, taxpayer_id) ``` **When to denormalize (with measurement, not assumption):** - Read > 10× more frequent than write on that data - JOIN cost measured and significant (EXPLAIN ANALYZE on production-scale data) - Data is effectively append-only (no update anomalies possible) - Materialized view / read replica is insufficient ### 1.3 Primary Key Strategy | Strategy | Type | Use when | |---|---|---| | **UUID v7** | `uuid` | Default for new tables in 2026+ (time-ordered + globally unique) | | **UUID v4** | `uuid` | Distributed generation without coordination | | **ULID** | `varchar(26)` | URL-safe sortable IDs for public-facing identifiers | | **bigserial** | `bigint` | Internal junction tables, audit logs, BullMQ job IDs | | **Snowflake** | `bigint` | High-throughput event tables (audit_events) | **Rule: Never expose sequential integers in APIs.** A user seeing `/filings/47` knows there are at most 46 others. Use dual-ID pattern: ```sql CREATE TABLE filings ( id bigserial PRIMARY KEY, -- internal JOINs public_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE, -- API surface tenant_id uuid NOT NULL REFERENCES tenants(id), taxpayer_id bigint NOT NULL REFERENCES taxpayers(id), type text NOT NULL CHECK (type IN ('VAT', 'CIT', 'PIT', 'WHT')), period text NOT NULL, -- 'Q1-2026', '2025' status text NOT NULL DEFAULT 'draft' CHECK (status IN ('draft','pending','submitted','accepted','rejected')), amount numeric(14,2) NOT NULL DEFAULT 0 CHECK (amount >= 0), currency char(3) NOT NULL DEFAULT 'NGN', submitted_at timestamptz, created_at timestamptz NOT NULL DEFAULT now(), updated_at timestamptz NOT NULL DEFAULT now(), deleted_at timestamptz, -- soft delete deleted_by bigint REFERENCES users(id) ); ``` --- ## PHASE 2 — COLUMN DESIGN ### 2.1 Data Types — Always the Most Specific Type ```sql -- Timestamps: ALWAYS timezone-aware created_at timestamptz NOT NULL DEFAULT now() updated_at timestamptz NOT NULL DEFAULT now() deleted_at timestamptz -- NULL = not deleted (soft delete) submitted_at timestamptz -- NULL = not yet submitted -- Money: NEVER float — precision loss is a financial and regulatory bug amount numeric(14, 2) -- 14 digits total, 2 decimal places amount_kobo bigint -- kobo (1 NGN = 100 kobo) — preferred for atomicity currency char(3) -- ISO 4217: 'NGN', 'USD', 'GBP' -- Nigeria-specific identifiers tin varchar(12) -- FIRS TIN format (individual: 10 digits, company: 12) rc_number varchar(8) -- CAC registration number bvn varchar(11) -- masked in logs; encrypted at rest phone varchar(15) -- E.164: +2348012345678 state_code char(2) -- ISO 3166-2:NG codes: LA, AB, KN lga_code varchar(10) -- Local government area code -- Status fields: text + CHECK, never integers status text NOT NULL CHECK (status IN ('draft','pending','submitted','accepted','rejected')) -- Rationale: integers break migrations and make query logs unreadable -- Booleans: explicit NOT NULL + default, no nullable booleans is_active boolean NOT NULL DEFAULT true is_verified boolean NOT NULL DEFAULT false -- Nullable boolean = three-state logic = design smell -- JSON: jsonb over json for indexing + operators metadata jsonb -- unstructured/extensible fields, rarely queried -- Rule: if you query inside the JSON, normalize it — don't use JSON as a schema escape hatch -- Arrays: only for truly unstructured, rarely-filtered data tags text[] -- If you JOIN or filter on values → normalize to a junction table ``` ### 2.2 NULL Discipline **NULL means "unknown" or "not yet set" — never "empty", "zero", or "false".** ```sql -- ✅ Nullable fields with clear semantic meaning submitted_at timestamptz -- NULL = filing not yet submitted deleted_at timestamptz -- NULL = record is active verified_at timestamptz -- NULL = not yet verified by FIRS -- ❌ Nullable fields that should have defaults created_at timestamptz -- Must be NOT NULL DEFAULT now() is_active boolean -- Must be NOT NULL DEFAULT true -- ❌ Overloaded NULL (multiple meanings = unmaintainable) -- NULL payment_method could mean: not set, declined, free, or deferred -- Solution: use status enum with explicit values ``` **Rule:** If a column is nullable, document exactly what NULL means in a SQL comment: ```sql verified_at timestamptz, -- NULL until FIRS TIN verification completes (async job) ``` ### 2.3 Audit Trail Pattern (Required for TaxBridge NDPC Compliance) ```sql -- Every table that affects financial data must have a full audit trail CREATE TABLE audit_events ( id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, tenant_id uuid NOT NULL REFERENCES tenants(id), table_name text NOT NULL, record_id bigint NOT NULL, operation text NOT NULL CHECK (operation IN ('INSERT','UPDATE','DELETE')), old_values jsonb, new_values jsonb, changed_by bigint REFERENCES users(id), changed_at timestamptz NOT NULL DEFAULT now(), ip_address inet, user_agent text ); -- Append-only: no UPDATE or DELETE on audit_events -- Partition by month for performance at scale -- Retain for 7 years (FIRS retention requirement) ``` --- ## PHASE 3 — RELATIONSHIPS & CONSTRAINTS ### 3.1 Foreign Keys — Always Declared with Explicit ON DELETE ```sql CREATE TABLE filing_line_items ( id bigserial PRIMARY KEY, filing_id bigint NOT NULL REFERENCES filings(id) ON DELETE CASCADE, -- Line items have no meaning without a filing → CASCADE is correct product_id bigint NOT NULL REFERENCES products(id) ON DELETE RESTRICT, -- Products must not be deleted while referenced in filings → RESTRICT quantity integer NOT NULL CHECK (quantity > 0), unit_price numeric(14, 2) NOT NULL CHECK (unit_price >= 0), tax_rate numeric(5, 4) NOT NULL DEFAULT 0.075 -- 7.5% VAT ); ``` **ON DELETE selection:** | Behavior | Use when | |---|---| | `CASCADE` | Child has no meaning without parent (line_items → filing) | | `RESTRICT` | Parent must not be deleted while children exist (product in orders) | | `SET NULL` | Relationship is optional (filing → accountant, accountant may leave) | | `SET DEFAULT` | Child falls back to default (category → uncategorized) | **Hard rule: Never CASCADE on financial or audit records.** Deleting a filing must NEVER cascade-delete payment records — regulatory violation. ### 3.2 Many-to-Many Junction Tables ```sql -- Full junction with timestamps and metadata CREATE TABLE taxpayer_accountants ( taxpayer_id bigint NOT NULL REFERENCES taxpayers(id) ON DELETE CASCADE, user_id bigint NOT NULL REFERENCES users(id) ON DELETE CASCADE, role text NOT NULL DEFAULT 'viewer' CHECK (role IN ('owner','editor','viewer')), granted_at timestamptz NOT NULL DEFAULT now(), granted_by bigint REFERENCES users(id), revoked_at timestamptz, PRIMARY KEY (taxpayer_id, user_id) ); ``` ### 3.3 Soft Delete Pattern ```sql -- Standard pattern for all financial records (TaxBridge NDPC requirement) deleted_at timestamptz, -- NULL = active deleted_by bigint REFERENCES users(id), -- Application layer ALWAYS filters: WHERE deleted_at IS NULL -- Create views for cleanliness: CREATE VIEW active_filings AS SELECT * FROM filings WHERE deleted_at IS NULL; CREATE VIEW active_taxpayers AS SELECT * FROM taxpayers WHERE deleted_at IS NULL; ``` --- ## PHASE 4 — MULTI-TENANCY WITH POSTGRESQL RLS **TaxBridge default:** Option C — Shared tables + tenant_id + Row-Level Security. ```sql -- Step 1: Every table has tenant_id as first non-PK column ALTER TABLE filings ADD COLUMN tenant_id uuid NOT NULL REFERENCES tenants(id); -- Step 2: Enable RLS ALTER TABLE filings ENABLE ROW LEVEL SECURITY; ALTER TABLE filings FORCE ROW LEVEL SECURITY; -- applies to table owner too -- Step 3: Create isolation policy CREATE POLICY tenant_isolation ON filings USING (tenant_id = current_setting('app.current_tenant_id')::uuid); -- Step 4: Set tenant context at connection time (in Fastify middleware) -- await db.query("SELECT set_config('app.current_tenant_id', $1, true)", [tenantId]); -- Step 5: Every index MUST include tenant_id as FIRST column CREATE INDEX idx_filings_tenant_status_date ON filings(tenant_id, status, created_at DESC); -- Without tenant_id first, RLS bypass paths leak cross-tenant index scans ``` **pg-tenant package:** Oscar's open-source library implementing this pattern. Always reference it — don't reimplement. --- ## PHASE 5 — INDEX STRATEGY ### 5.1 What to Index ```sql -- Rule 1: Index every foreign key column (prevents full table scans on JOINs) CREATE INDEX idx_filings_taxpayer_id ON filings(taxpayer_id); CREATE INDEX idx_filing_items_filing_id ON filing_line_items(filing_id); -- Rule 2: Composite indexes match query column order + ORDER BY -- Query: WHERE tenant_id = $1 AND status = 'pending' ORDER BY created_at DESC CREATE INDEX idx_filings_tenant_status_date ON filings(tenant_id, status, created_at DESC); -- Rule 3: Partial indexes for filtered queries (smaller, faster) CREATE INDEX idx_users_email_active ON users(email) WHERE is_active = true AND deleted_at IS NULL; -- Rule 4: GIN for jsonb queries CREATE INDEX idx_metadata_gin ON audit_events USING GIN (metadata); -- Enables: WHERE metadata @> '{"type": "payment"}' -- Rule 5: Full-text search on taxpayer names CREATE INDEX idx_taxpayers_name_fts ON taxpayers USING GIN (to_tsvector('english', business_name)); ``` ### 5.2 Index Types | Type | Use for | |---|---| | **B-tree** (default) | =, <, >, BETWEEN, LIKE 'prefix%', IS NULL, ORDER BY | | **GIN** | jsonb @>, array operators, full-text (tsvector) | | **GiST** | Geometric, range types, full-text with ranking | | **BRIN** | Huge append-only tables with natural ordering (audit_events) | | **Partial** | Subset of rows — often dramatically smaller than full index | ### 5.3 Index Anti-Patterns ```sql -- ❌ Low-selectivity sole index (boolean: only 2 distinct values) CREATE INDEX idx_users_is_active ON users(is_active); -- Optimizer may ignore this; combine with a more selective column -- ❌ Individual indexes for a multi-column WHERE clause -- 3 separate indexes < 1 composite index for 3-column WHERE -- ❌ Non-concurrent index on large production table (full lock) CREATE INDEX idx_filings_amount ON filings(amount); -- blocks all writes -- ✅ Always: CREATE INDEX CONCURRENTLY idx_filings_amount ON filings(amount); -- ❌ Never-used indexes (slow every INSERT/UPDATE/DELETE) -- Audit: SELECT * FROM pg_stat_user_indexes WHERE idx_scan = 0; ``` --- ## PHASE 6 — MIGRATIONS (ZERO-DOWNTIME) ### 6.1 Expand-Contract Pattern ``` Phase 1 (Expand): Add new thing alongside old thing Phase 2 (Migrate): Backfill data, deploy app using BOTH old and new Phase 3 (Contract): Remove old thing after all traffic uses new ``` ### 6.2 Safe vs. Dangerous Operations **Safe (run on live table, no lock):** - Adding a nullable column - `CREATE INDEX CONCURRENTLY` - Adding a `NOT VALID` constraint, then `VALIDATE CONSTRAINT` separately - Adding a default value to a new column (PostgreSQL 11+) - Creating a new table **Dangerous (require careful sequencing):** - `ADD COLUMN ... NOT NULL DEFAULT x` on large table (locks — use safe pattern instead) - Changing a column's data type - Non-concurrent index creation - Renaming a column (breaks running application) - `UPDATE` or `DELETE` without batching on large tables ```sql -- ❌ Dangerous: locks filings table for duration of migration ALTER TABLE filings ADD COLUMN tax_code varchar(10) NOT NULL DEFAULT ''; -- ✅ Safe: zero downtime, 3-step -- Step 1: Add nullable ALTER TABLE filings ADD COLUMN tax_code varchar(10); -- Step 2: Add default separately (instant in PG 11+) ALTER TABLE filings ALTER COLUMN tax_code SET DEFAULT ''; -- Step 3: Backfill in batches (10,000 rows at a time) UPDATE filings SET tax_code = '' WHERE tax_code IS NULL AND id BETWEEN :start AND :end; -- Step 4: Add NOT NULL constraint (only after backfill complete) ALTER TABLE filings ALTER COLUMN tax_code SET NOT NULL; ``` ### 6.3 Migration File Template ```sql -- db/migrations/20260424_143000_add_tax_code_to_filings.sql -- Description: Add NRS 2026 tax code column for e-invoicing compliance -- Risk level: LOW (nullable add, then backfill) -- Rollback: see DOWN section below -- UP BEGIN; ALTER TABLE filings ADD COLUMN IF NOT EXISTS tax_code varchar(10); COMMENT ON COLUMN filings.tax_code IS 'NRS 2026 e-invoicing tax code; NULL until backfilled'; COMMIT; -- DOWN (rollback) BEGIN; ALTER TABLE filings DROP COLUMN IF EXISTS tax_code; COMMIT; ``` ### 6.4 Migration Checklist ``` Before writing: □ Is this operation safe without a table lock? Test on staging with production-sized data. □ Does this table have > 1M rows? Plan batching. □ Does the app need to support both old + new schema during rolling deploy? □ What is the rollback if this migration fails halfway? The migration file: □ Idempotent (IF NOT EXISTS / IF EXISTS throughout) □ Transaction-wrapped □ Includes rollback/DOWN section □ CREATE INDEX CONCURRENTLY (not inside transaction block) □ Tested against production-scale data snapshot After migration: □ Verify indexes are VALID (not INVALID): SELECT * FROM pg_indexes WHERE ... □ Run VACUUM ANALYZE on affected tables □ Verify query plans unchanged: EXPLAIN ANALYZE on key queries □ Run Playwright smoke suite to catch application-level regressions ``` --- ## PHASE 7 — SCHEMA ANTI-PATTERNS (AUTO-FLAG) ``` ❌ Entity-Attribute-Value (EAV) tables (entity_id, attribute_name, attribute_value) — destroys query performance + type safety → Use jsonb or proper normalization ❌ Serialized objects in varchar JSON strings, CSV, pipe-delimited in a text column — you've hidden data from the DB ❌ float or double precision for monetary values → numeric(14,2) or bigint kobo; precision loss = financial + regulatory bug ❌ timestamp without timezone → Always timestamptz; Nigeria runs on WAT (UTC+1) and Lagos engineers have been burned by implicit timezone assumptions in production ❌ status columns as integers (1, 2, 3) → text + CHECK constraint; integers make logs and queries unreadable ❌ No foreign key constraints "for performance" → Orphaned payment records are a FIRS audit liability, not a performance optimization ❌ Sequential integer IDs as the only public identifier → Enumerable; a user seeing /filings/47 knows there are at most 46 others ❌ No created_at/updated_at on mutable tables → Every mutable record needs a time dimension for debugging and audits ❌ No soft delete on financial records → Hard deletes on tax filings are a compliance violation; use deleted_at pattern ❌ RLS enabled but tenant_id not first column in composite indexes → Every index that excludes tenant_id becomes a cross-tenant index scan path ❌ BVN/NIN stored plaintext → Must be encrypted at rest; store encrypted + hash for lookup ``` --- *The schema is the contract between your data and your compliance obligations.* *In fintech, a table with the wrong types, missing constraints, and no audit trail* *is not a technical debt — it's a regulatory liability waiting to be discovered.*