---
name: eng-embedding-model-choice-legal
description: Use when selecting an embedding model for semantic search, document similarity, or retrieval-augmented generation (RAG) in a legal AI product. Covers the criteria specific to legal text (multilingual Arabic/English, domain vocabulary, long-document chunking), a comparison of leading embedding models, and configuration recommendations for legal retrieval pipelines serving MENA and common-law jurisdictions.
license: MIT
metadata:
  id: eng.embedding-model-choice-legal
  category: eng
  jurisdictions: [__multi__]
  priority: P2
  intent: [embeddings, RAG, semantic-search, vector-database, multilingual]
  related:
    - eng-context-cache-key-design
    - eng-fallback-model-cascade
    - eng-langfuse-eval-runner
    - eng-mcp-tool-registry
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"
---

# Embedding Model Choice — Legal

## What it does

Embedding models convert text into dense vector representations that enable semantic search: "find documents that are conceptually similar to this query, even if they don't share keywords." In legal AI products, embeddings power:

- Precedent retrieval (find the most similar prior NDA to the current draft)
- KB search (retrieve the most relevant knowledge-base document for this user query)
- Conflict-check similarity (find matters involving conceptually similar parties or issues)
- Document classification (identify the document type from its content)

The choice of embedding model has a large impact on retrieval quality. Legal text is domain-specific, often multilingual (Arabic + English in MENA), and involves long documents. Generic models trained on web text perform poorly on legal vocabulary without fine-tuning or careful chunking.
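To make the idea of "conceptually similar" concrete, here is a minimal sketch of similarity ranking over embeddings. The three-dimensional vectors and document names are invented for illustration; real vectors would come from whichever embedding model is chosen and would have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy vectors, NOT output of a real embedding model.
docs = {
    "nda_2023": [0.9, 0.1, 0.0],
    "lease_2021": [0.1, 0.9, 0.2],
    "spa_2022": [0.6, 0.3, 0.4],
}
query = [1.0, 0.0, 0.1]  # stands in for the embedded user query

ranking = rank_by_similarity(query, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

A vector database performs the same comparison at scale, usually with an approximate nearest-neighbour index rather than the exhaustive scan shown here.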
## Key selection criteria for legal text

| Criterion | Why it matters for legal |
|---|---|
| Multilingual quality (AR + EN) | MENA contracts are often bilingual; queries in Arabic must retrieve English documents and vice versa |
| Max input length | Contracts can exceed 10,000 tokens; short-context models require aggressive chunking that loses clause context |
| Domain vocabulary | Legal terms ("indemnification", "escrow", "قضاء", "تعويض") must be represented precisely |
| Embedding dimension | Higher dimensions are more expressive but increase storage and compute cost |
| Cost per token | Embedding all firm documents is a large one-time cost, and re-indexing makes it a recurring one |
| Latency | Real-time search requires low embedding latency; batch indexing can tolerate higher latency |
| Data privacy | Some models require sending text to a third-party API; sensitive legal documents may require local/self-hosted models |

## Model comparison

| Model | Max tokens | Dimensions | Multilingual AR | Notes |
|---|---|---|---|---|
| `text-embedding-3-large` (OpenAI) | 8191 | 3072 (configurable) | Good | Strong general legal performance; high cost at scale |
| `text-embedding-3-small` (OpenAI) | 8191 | 1536 (configurable) | Good | Lower cost; slight quality trade-off |
| `voyage-law-2` (Voyage AI) | 16000 | 1024 | Moderate | Specifically trained on legal text; best English legal recall |
| `voyage-multilingual-2` (Voyage AI) | 32000 | 1024 | Strong | Best multilingual option tested on AR-EN legal text |
| `jina-embeddings-v3` (Jina AI) | 8192 | 1024 | Strong | Open weights available; good AR support; self-hostable (verify the license terms before commercial use) |
| `e5-mistral-7b-instruct` (Microsoft) | 4096 | 4096 | Moderate | Open weights; good legal quality if fine-tuned |
| `multilingual-e5-large` | 512 | 1024 | Strong | Short context limit; requires chunking; good AR quality |

**Recommendation for MENA legal AI**:

- Primary: `voyage-multilingual-2` for AR+EN retrieval, or `voyage-law-2` for English-only legal retrieval.
- Privacy-sensitive deployments: `jina-embeddings-v3` (self-hostable) or `multilingual-e5-large` (self-hostable).
- Cost-sensitive: `text-embedding-3-small`, with its 1536 dimensions reduced to 256–512 (OpenAI supports shortening embeddings through the `dimensions` API parameter, which the models' Matryoshka Representation Learning (MRL) training makes possible).

## Chunking strategy for legal documents

Legal documents call for chunking that respects their structure. Three approaches, in increasing order of quality:

### Fixed-window chunking (simplest, weakest)

Split at N tokens with K-token overlap. Loses clause structure. Use only for short documents.

### Semantic / structure-aware chunking (recommended)

Split at structural boundaries:

1. **Headings**: split at H1/H2/H3 (contract sections, article numbers).
2. **Clause boundaries**: each numbered clause is a chunk candidate.
3. **Paragraph boundaries**: fallback if clause structure is not present.

For contracts with a defined-terms section: prepend the definitions text to every subsequent chunk that uses those terms before embedding it. This increases retrieval precision for clause-level queries.

### Hierarchical chunking (best for long contracts)

Embed at two granularities:

- **Document level**: one embedding per document, summarizing the whole.
- **Clause level**: one embedding per clause.

On retrieval: first match at document level (coarse), then re-rank at clause level (fine). On large corpora this two-stage retrieval is usually cheaper, because clause-level comparison runs only within the shortlisted documents, and often more accurate, because clauses from irrelevant documents never enter the ranking.
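The clause-boundary split and the coarse-then-fine retrieval described above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the clause-numbering regex, the function names, and the two-dimensional toy vectors are all hypothetical, and real contracts will need a more tolerant boundary detector.

```python
import math
import re

# Assumed convention: a clause starts on a line beginning "1. ", "2.1 ", "12.3.4 ", etc.
CLAUSE_START = re.compile(r"^(?=\d+\.(?:\d+\.?)*\s)", re.MULTILINE)

def split_into_clauses(text):
    """Split at numbered-clause boundaries; fall back to blank-line paragraphs."""
    parts = [p.strip() for p in CLAUSE_START.split(text) if p.strip()]
    if len(parts) <= 1:  # no clause numbering found
        parts = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return parts

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def two_stage_search(query_vec, doc_index, clause_index, top_docs=2, top_clauses=3):
    """Stage 1: shortlist documents. Stage 2: rank clauses of shortlisted documents only."""
    shortlisted = sorted(doc_index,
                         key=lambda d: cosine(query_vec, doc_index[d]),
                         reverse=True)[:top_docs]
    candidates = [(cosine(query_vec, vec), doc_id, title)
                  for doc_id in shortlisted
                  for title, vec in clause_index[doc_id]]
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:top_clauses]

contract = """1. Definitions
"Confidential Information" means any non-public information.

2. Confidentiality
2.1 Each party shall keep Confidential Information secret.
2.2 Exceptions apply to information already public.

3. Governing Law
This Agreement is governed by the laws of the DIFC."""
clauses = split_into_clauses(contract)

# Hypothetical toy vectors standing in for real model output.
doc_index = {"nda": [0.9, 0.1], "lease": [0.1, 0.9], "spa": [0.7, 0.5]}
clause_index = {
    "nda": [("Confidentiality", [1.0, 0.0]), ("Governing law", [0.5, 0.5])],
    "lease": [("Rent", [0.0, 1.0])],
    "spa": [("Warranties", [0.3, 0.8]), ("Non-disclosure", [0.8, 0.3])],
}
results = two_stage_search([1.0, 0.05], doc_index, clause_index)
for score, doc_id, title in results:
    print(f"{doc_id} / {title}: {score:.3f}")
```

In this toy run the lease never reaches the clause stage, which is where the cost saving comes from. The trade-off is that a relevant clause inside a document missed by the coarse stage cannot be recovered, so `top_docs` should be set generously.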
## Metadata to store alongside embeddings

In the vector database, store these fields alongside each chunk's embedding:

```json
{
  "chunk_id": "ulid",
  "document_id": "doc_xxx",
  "matter_id": "matter_xxx | kb_doc",
  "org_id": "org_xxx",
  "document_type": "NDA | SPA | Opinion | ...",
  "jurisdiction": "UAE | DIFC | KSA | ...",
  "governing_law": "UAE | English | ...",
  "language": "en | ar | bilingual",
  "section_title": "Article 5 — Indemnification",
  "chunk_index": 3,
  "total_chunks": 12,
  "approved_at": "ISO-8601",
  "access_tier": "firm-wide | practice | partner-only"
}
```

Use metadata filters on retrieval to enforce access control: never return a chunk to a user who lacks access to the parent document.

## Access control on retrieval

The embedding search pipeline must enforce tenant and document-access controls:

1. Apply an `org_id = current_org` filter to every query; never allow cross-tenant retrieval.
2. Apply an `access_tier IN (user's_permitted_tiers)` filter.
3. For matter-specific documents: apply `matter_id = current_matter OR matter_id = 'kb_doc'`.

These filters must be applied at the vector database query layer, not post-retrieval in application code.

## Evaluation

Evaluate embedding model quality on a legal domain test set using:

- **NDCG@10**: ranking quality of the top 10 retrieved results, weighted by graded relevance and position.
- **Recall@5**: the fraction of all relevant documents that appear in the top 5 results.
- **MRR**: mean reciprocal rank of the first relevant result.

Use Langfuse (see [[eng-langfuse-eval-runner]]) to run systematic retrieval quality evaluations with test cases curated by legal practitioners.

## Related skills

- [[eng-context-cache-key-design]]
- [[eng-fallback-model-cascade]]
- [[eng-langfuse-eval-runner]]
- [[eng-mcp-tool-registry]]