---
name: eng-pii-redaction-preprocessor
description: Use when building or configuring the PII redaction layer that sanitizes user-submitted text before it is sent to an LLM or stored in a database. Covers entity detection patterns, redaction strategies, audit logging, and reconstruction for legal AI pipelines where client-confidential data must never leak into training or third-party model calls.
license: MIT
metadata:
  id: eng.PII-redaction-preprocessor
  category: eng
  jurisdictions: [__multi__]
  priority: P2
  intent: [__eng__, pii, redaction, privacy, preprocessing]
  related: [eng-tenant-isolation-row-level-security, eng-supabase-edge-functions-patterns, eng-rag-chunking-rules-legal-docs, safety-client-confidentiality-cross-tenant]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"
---

# PII Redaction Preprocessor

## What it does

The PII redaction preprocessor is a text-transformation pipeline stage that detects and removes or masks personally identifiable information (PII) from user-submitted legal content before that content is forwarded to an external LLM API call, stored in a search index, or written to a shared database table. In a multi-tenant legal AI product this is a critical safety control: a client of Firm A must never have their counterpart names, passport numbers, or contract figures embedded in a vector store that another tenant can retrieve. The preprocessor runs synchronously in the request path (≤ 50 ms budget) or, for large document uploads, as an async pre-indexing job.

## Setup / auth

No external auth required. The preprocessor is a pure-text pipeline component that can be deployed as:

- A Supabase Edge Function (`functions/redact-pii/index.ts`) invoked before any call to the LLM or embedding API.
- A middleware layer in the Express/Hono backend, mounted before the chat-route handler.
- A standalone Node.js worker for batch document ingestion.

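As a rough illustration of the middleware option listed above, the sketch below wraps a handler so that it only ever receives redacted text. It is deliberately framework-agnostic: `withRedaction`, `Redactor`, `toyRedactor`, and `chatHandler` are hypothetical names for this example, not part of Express, Hono, or the project's actual redactor, and the email pattern is a toy stand-in for real detection.

```typescript
// Hypothetical shape of a redactor: takes raw text, returns the sanitized text.
type Redactor = (text: string) => { redacted: string };

// Wraps any handler so the raw message never reaches it.
// The handler's return type is passed through, so sync and async handlers both work.
function withRedaction<R>(
  redact: Redactor,
  handler: (message: string) => R,
): (message: string) => R {
  return (message) => handler(redact(message).redacted);
}

// Toy redactor for demonstration only: replaces anything that looks like an email.
const toyRedactor: Redactor = (text) => ({
  redacted: text.replace(/[^\s@]+@[^\s@]+\.[^\s@]+/g, "[EMAIL]"),
});

// Stand-in for the chat-route handler; a real one would call the LLM here.
const chatHandler = (message: string): string => `LLM saw: ${message}`;

const guardedHandler = withRedaction(toyRedactor, chatHandler);
console.log(guardedHandler("Reach me at marie@example.org please"));
// prints "LLM saw: Reach me at [EMAIL] please"
```

Putting the gate in a wrapper rather than inside each handler means a new route cannot forget to call the redactor, as long as it is registered through the wrapper.
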

Dependencies: `@presidio/presidio-js` (if using Microsoft Presidio via REST), or a custom regex+NER approach. For MENA content, a regex-first approach with Arabic-aware patterns is more reliable than English-trained NER models.

## Capabilities

### Entity types to detect and redact

| Category | Examples | Redaction token |
|---|---|---|
| Full name | "Ahmad Al-Rashidi", "Marie Dupont" | `[PERSON]` |
| National ID / Iqama / Emirates ID | 784-XXXX-XXXXXXX-X | `[NATIONAL_ID]` |
| Passport number | any 7–9 char alphanumeric | `[PASSPORT]` |
| Phone number | +961-X-XXX-XXXX, +971-5X-XXXXXXX | `[PHONE]` |
| Email address | RFC 5321 pattern | `[EMAIL]` |
| IBAN / bank account | LB62XXXX… | `[FINANCIAL_ACCOUNT]` |
| Company registration number | Lebanon: ش.م.م XXXXXX, UAE: CN-XXXXXXXX | `[COMPANY_REG]` |
| Address / property number | plot numbers, parcel IDs | `[ADDRESS]` |
| Date of birth | explicit DOB patterns | `[DOB]` |
| Contract monetary amounts | optional — flag rather than redact | `[AMOUNT]` |

### Redaction modes

- **Replace** (default): substitute with a typed token (`[PERSON]`). Preserves sentence structure for downstream LLM comprehension.
- **Mask**: overwrite with `█` characters of equal length. Used for rendered PDFs.
- **Hash**: replace with a consistent HMAC-SHA256 keyed hash. Allows cross-document entity resolution without revealing the value. Use for analytics pipelines where re-identification must remain possible under a controlled key. Because HMAC is one-way, re-identification means recomputing the hash of a candidate value with the key (or consulting a separately protected lookup table), not decrypting the token.
- **Delete**: remove the entity and surrounding whitespace. Use only when the surrounding sentence still makes sense.

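The four redaction modes above can be sketched in a few lines. This is a minimal illustration, not the project's `redactPII`: `redactSketch`, `applyMode`, and `DETECTORS` are hypothetical names, the two regexes are simplified stand-ins for the email and phone rows of the entity table, the `[TYPE:digest]` token shape and the 16-character truncation are arbitrary choices for the example, and it assumes a Node.js runtime for `node:crypto` (an Edge Function would more likely use Web Crypto).

```typescript
import { createHmac } from "node:crypto";

type Mode = "replace" | "mask" | "hash" | "delete";

interface Detector {
  token: string;   // label used inside the redaction token, e.g. "EMAIL"
  pattern: RegExp; // must carry the "g" flag so every occurrence is replaced
}

// Simplified, illustrative patterns only; real detection needs far more coverage.
const DETECTORS: Detector[] = [
  { token: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { token: "PHONE", pattern: /\+\d{1,3}[-\s]?\d{1,3}[-\s]?\d{3}[-\s]?\d{4}/g },
];

// Turns one detected value into its redacted form for the chosen mode.
function applyMode(value: string, token: string, mode: Mode, hashKey?: string): string {
  switch (mode) {
    case "replace":
      return `[${token}]`;
    case "mask":
      // One block character per code point, so the length is preserved.
      return "█".repeat(Array.from(value).length);
    case "hash": {
      if (!hashKey) throw new Error("hash mode requires a hashKey");
      // Same key + same value always yields the same digest (keyed, one-way).
      const digest = createHmac("sha256", hashKey).update(value).digest("hex");
      return `[${token}:${digest.slice(0, 16)}]`; // truncation is an example choice
    }
    case "delete":
      return "";
  }
}

function redactSketch(text: string, mode: Mode, hashKey?: string): string {
  let out = text;
  for (const { token, pattern } of DETECTORS) {
    out = out.replace(pattern, (match) => applyMode(match, token, mode, hashKey));
  }
  // Delete mode leaves gaps behind; collapse them and trim the ends.
  if (mode === "delete") out = out.replace(/[ \t]{2,}/g, " ").trim();
  return out;
}

console.log(redactSketch("Contact jane@example.com or +961-3-123-4567.", "replace"));
// prints "Contact [EMAIL] or [PHONE]."
```

Running the same input through `"delete"` shows why that mode needs care: the sentence degrades to "Contact or .", which is exactly the case the Delete bullet warns about.
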
## Usage patterns

### Pattern 1 — Inline chat message preprocessing

```typescript
// supabase/functions/chat/index.ts (simplified)
import { redactPII } from "../_shared/pii-redactor.ts";
// detectLocale and callClaude are project helpers defined elsewhere.

// Inside the request handler: `req` is a Fetch API Request, so parse the JSON body first.
const { message: userMessage } = await req.json();

const { redacted, auditLog } = redactPII(userMessage, {
  mode: "replace",
  locale: detectLocale(userMessage), // "ar" | "en" | "fr"
});

// Forward only the redacted text to the LLM; store auditLog for compliance
const llmResponse = await callClaude(redacted);
```

### Pattern 2 — Document ingestion before embedding

```typescript
// Before chunking and embedding an uploaded contract PDF
const rawText = await extractTextFromPDF(file);

const { redacted, auditLog } = redactPII(rawText, {
  mode: "hash",
  hashKey: TENANT_HASH_KEY, // per-tenant key, loaded from Supabase Vault
});

const chunks = chunkLegalDocument(redacted); // see eng-rag-chunking-rules-legal-docs
await embedAndStore(chunks, tenantId);
```

### Pattern 3 — Audit trail

Every redaction run should emit a structured log entry:

```json
{
  "runId": "uuid",
  "tenantId": "t_xxx",
  "userId": "u_xxx",
  "entitiesFound": [
    { "type": "PERSON", "count": 3, "positions": [[12, 24], [88, 102], [211, 225]] },
    { "type": "PHONE", "count": 1, "positions": [[400, 413]] }
  ],
  "mode": "replace",
  "inputLengthChars": 4820,
  "processedAt": "2026-05-14T09:00:00Z"
}
```

Store each entry in the `pii_audit_log` table (append-only, tenant-scoped, RLS-protected).

## Permissions & safety

- The redactor must run **before** any call to the external LLM API. Never pass raw user text containing PII to Claude/GPT/Gemini without this gate.
- The audit log must be stored even when redaction finds zero entities (zero-finding logs help detect evasion).
- Hash mode keys must be per-tenant and stored in Supabase Vault (not in code or `.env` files).
- For bilingual AR/EN documents, run two passes: Arabic-aware patterns first, then English patterns.
- GDPR, PDPL (UAE Federal Decree-Law No. 45 of 2021), and Lebanon Law 81 on electronic transactions all require documented evidence that PII is handled appropriately. The audit log is that evidence.

## Failure modes

| Failure | Impact | Mitigation |
|---|---|---|
| False negative (PII not detected) | PII sent to external model | Ensemble detection: regex + NER; review audit samples weekly |
| False positive (non-PII redacted) | LLM loses useful context | Use typed tokens so the LLM can infer the entity type; tune regex patterns and NER confidence thresholds |
| Performance degradation | Chat latency spikes | Run async for large docs; time out at 200 ms, log the timeout, and block the request rather than forwarding unredacted text |
| Hash key rotation | Hashes from the old key no longer match, so cross-document resolution and re-identification break | Version hash keys; store the key version alongside each hash |
| Arabic diacritic variants | Names missed | Normalize Unicode before pattern matching |

## Related skills

- [[eng-tenant-isolation-row-level-security]] — RLS ensures redacted data is tenant-scoped in the database
- [[eng-rag-chunking-rules-legal-docs]] — chunking runs after redaction in the document pipeline
- [[eng-supabase-edge-functions-patterns]] — deployment pattern for the redactor as an Edge Function
- [[safety-client-confidentiality-cross-tenant]] — policy that mandates this technical control