# @memberjunction/ai-vector-sync

Synchronizes MemberJunction entity records with vector databases by transforming records into embeddings through a template-based pipeline. Handles batch processing, worker-based parallelism, Entity Document management, and Entity Record Document tracking.

## Architecture

```mermaid
graph TD
    subgraph SyncPkg["@memberjunction/ai-vector-sync"]
        EVS["EntityVectorSyncer"]
        EDC["EntityDocumentCache"]
        EDTP["EntityDocumentTemplateParser"]
        BW["BatchWorker"]
    end

    subgraph Pipeline["Vectorization Pipeline"]
        FETCH["Fetch Records<br/>(batched)"] --> TEMPL["Parse Templates<br/>(text from fields)"]
        TEMPL --> EMBED["Generate Embeddings<br/>(AI model)"]
        EMBED --> UPSERT["Upsert to<br/>Vector DB"]
        UPSERT --> TRACK["Create Entity<br/>Record Documents"]
    end

    subgraph MJEntities["MemberJunction Entities"]
        ED["Entity Documents"]
        EDT["Entity Document Types"]
        ERD["Entity Record Documents"]
        VDI["Vector Indexes"]
    end

    subgraph External["External Services"]
        AI["Embedding Model<br/>(OpenAI, Mistral, etc.)"]
        VDB["Vector Database<br/>(Pinecone, etc.)"]
    end

    EVS --> EDC
    EVS --> EDTP
    EVS --> BW
    EDTP --> TEMPL
    BW --> EMBED
    BW --> UPSERT
    BW --> TRACK
    EVS --> ED
    EVS --> ERD
    BW --> AI
    BW --> VDB

    style SyncPkg fill:#2d6a9f,stroke:#1a4971,color:#fff
    style Pipeline fill:#2d8659,stroke:#1a5c3a,color:#fff
    style MJEntities fill:#b8762f,stroke:#8a5722,color:#fff
    style External fill:#7c5295,stroke:#563a6b,color:#fff
```

## Installation

```bash
npm install @memberjunction/ai-vector-sync
```

## Overview

This package converts MemberJunction entity records into vector embeddings stored in a vector database. The process is driven by **Entity Documents** -- metadata records that define which entity to vectorize, how to generate text from it (via templates), which embedding model to use, and where to store the results.

Key capabilities:

- **Batch processing** with configurable sizes for fetching, embedding, and upserting
- **Template-based text generation** using Entity Document templates that reference entity fields
- **Worker architecture** for concurrent embedding and upsert operations
- **Entity Document caching** via a singleton cache to avoid repeated database lookups
- **Default Entity Document creation** for entities that lack one
- **Resume support** via `StartingOffset` for interrupted processes
- **Entity Record Document tracking** to record which records have been vectorized

## Vectorization Flow

```mermaid
sequenceDiagram
    participant Caller
    participant EVS as EntityVectorSyncer
    participant Cache as EntityDocumentCache
    participant Parser as TemplateParser
    participant Worker as BatchWorker
    participant Model as Embedding Model
    participant VDB as Vector Database
    participant DB as MJ Database

    Caller->>EVS: VectorizeEntity(params, user)
    EVS->>EVS: Config(forceRefresh, user)
    EVS->>Cache: Refresh (loads Entity Documents)
    EVS->>Cache: GetDocument(entityDocumentID)
    Cache-->>EVS: EntityDocumentEntity
    EVS->>DB: Load template for Entity Document
    EVS->>DB: Fetch entity records (batch)

    loop For each batch
        EVS->>Parser: Parse template for each record
        Parser-->>EVS: Text strings
        EVS->>Worker: VectorizeTemplates batch
        Worker->>Model: createBatchEmbedding(texts)
        Model-->>Worker: Embedding vectors
        EVS->>Worker: UpsertVectors batch
        Worker->>VDB: createRecords(vectors)
        VDB-->>Worker: Success/failure
        EVS->>Worker: Create EntityRecordDocuments
        Worker->>DB: Save tracking records
    end

    EVS-->>Caller: VectorizeEntityResponse
```

## Core Components

### EntityVectorSyncer

The main class that orchestrates the entire vectorization process. Extends `VectorBase` from `@memberjunction/ai-vectors`.

**Key methods:**

| Method | Description |
|---|---|
| `Config(forceRefresh, contextUser)` | Initializes engines and caches; must be called before vectorization |
| `VectorizeEntity(params, contextUser)` | Runs the full vectorization pipeline for an entity |
| `GetEntityDocument(id)` | Retrieves an Entity Document by ID |
| `GetEntityDocumentByName(name, user)` | Retrieves an Entity Document by name |
| `GetActiveEntityDocuments(entityNames?, entityDocumentType?)` | Gets Active Entity Documents of a given type (default `'Record Duplicate'`; pass `'Search'` for the search-tier pool), optionally filtered by entity name. Returns `[]` (no throw) when nothing matches |
| `CreateDefaultEntityDocument(entityID, vectorDB, aiModel)` | Creates a default Entity Document when one does not exist |

### EntityDocumentCache

A singleton cache that loads all Entity Document and Entity Document Type records into memory for fast lookup.
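
The load-once, read-many behavior described above can be illustrated with a minimal sketch. It is not the package's `EntityDocumentCache` source: `CachedDocument`, `DocumentCacheSketch`, and the synchronous `loader` callback are hypothetical stand-ins (the real `Refresh` takes a user rather than a loader), chosen only to keep the example self-contained.

```typescript
// Hypothetical, trimmed-down stand-in for an Entity Document record.
interface CachedDocument {
  ID: string;
  Name: string;
  EntityID: string;
  Status: 'Active' | 'Inactive';
}

// Minimal singleton cache: one shared instance, loaded once, read many times.
class DocumentCacheSketch {
  private static _instance: DocumentCacheSketch | null = null;
  private _cache: Record<string, CachedDocument> = {};
  private _loaded = false;

  // Private constructor forces callers through `Instance`.
  private constructor() {}

  static get Instance(): DocumentCacheSketch {
    if (!DocumentCacheSketch._instance) {
      DocumentCacheSketch._instance = new DocumentCacheSketch();
    }
    return DocumentCacheSketch._instance;
  }

  get IsLoaded(): boolean {
    return this._loaded;
  }

  // `loader` stands in for the database query (an assumption of this sketch).
  // It is skipped when the cache is already loaded, unless forceRefresh is true.
  Refresh(forceRefresh: boolean, loader: () => CachedDocument[]): void {
    if (this._loaded && !forceRefresh) return;
    this._cache = {};
    for (const doc of loader()) {
      this._cache[doc.ID] = doc;
    }
    this._loaded = true;
  }

  GetDocument(id: string): CachedDocument | null {
    return this._cache[id] || null;
  }

  GetFirstActiveDocumentForEntityByID(entityID: string): CachedDocument | null {
    for (const id of Object.keys(this._cache)) {
      const doc = this._cache[id];
      if (doc.EntityID === entityID && doc.Status === 'Active') return doc;
    }
    return null;
  }
}
```

Calling `Refresh(false, loader)` twice invokes `loader` only once in this sketch; passing `true` forces a reload.
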

```mermaid
classDiagram
    class EntityDocumentCache {
        -_instance : EntityDocumentCache
        -_cache : Record~string, EntityDocumentEntity~
        -_typeCache : Record~string, EntityDocumentTypeEntity~
        +Instance : EntityDocumentCache
        +IsLoaded : boolean
        +GetDocument(id) EntityDocumentEntity
        +GetDocumentByName(name) EntityDocumentEntity
        +GetDocumentType(id) EntityDocumentTypeEntity
        +GetDocumentTypeByName(name) EntityDocumentTypeEntity
        +GetFirstActiveDocumentForEntityByID(entityID) EntityDocumentEntity
        +GetFirstActiveDocumentForEntityByName(name) EntityDocumentEntity
        +Refresh(forceRefresh, user) void
        +SetCurrentUser(user) void
    }

    style EntityDocumentCache fill:#2d6a9f,stroke:#1a4971,color:#fff
```

### EntityDocumentTemplateParser

Converts entity records into text strings by evaluating Entity Document templates. Templates use `${FieldName}` syntax to reference entity field values.

```typescript
// Template example: "${FirstName} ${LastName} works at ${Company} as ${Title}"
// With record { FirstName: 'Jane', LastName: 'Doe', Company: 'Acme', Title: 'Engineer' }
// Result: "Jane Doe works at Acme as Engineer"
```

### BatchWorker

Handles the parallel execution of embedding generation, vector database upserts, and Entity Record Document creation. Configurable batch sizes allow tuning for memory and API rate limits.
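
To make the effect of batch sizes concrete, here is a rough sketch of chunked, concurrency-limited processing. It is an illustration of the general technique, not the package's `BatchWorker`: `chunk`, `embedInBatches`, the `EmbedBatch` type, and the `concurrency` parameter are all hypothetical names introduced for this example.

```typescript
// Split an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new Error('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical embedding call: one vector per input text, in input order.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

// Send texts to `embed` in batches of `batchSize`, keeping at most
// `concurrency` batches in flight at a time. Output order matches input order
// because Promise.all preserves the order of its inputs.
async function embedInBatches(
  texts: string[],
  batchSize: number,
  concurrency: number,
  embed: EmbedBatch
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const group of chunk(chunk(texts, batchSize), concurrency)) {
    const groupResults = await Promise.all(group.map((batch) => embed(batch)));
    for (const batchVectors of groupResults) {
      vectors.push(...batchVectors);
    }
  }
  return vectors;
}
```

Smaller batches lower peak memory and per-request payload size; a lower `concurrency` reduces pressure on API rate limits at the cost of throughput.
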

## Usage

### Basic Vectorization

```typescript
import { EntityVectorSyncer } from '@memberjunction/ai-vector-sync';
import { UserInfo } from '@memberjunction/core';

// contextUser: the UserInfo for the calling user, supplied by your application
const syncer = new EntityVectorSyncer();

// Initialize (required once)
await syncer.Config(false, contextUser);

// Vectorize all records for an entity
await syncer.VectorizeEntity({
    entityID: 'entity-uuid',
    entityDocumentID: 'doc-uuid',
    listBatchCount: 50,
    VectorizeBatchCount: 50,
    UpsertBatchCount: 50
}, contextUser);
```

### Vectorize a Specific List

```typescript
await syncer.VectorizeEntity({
    entityID: 'entity-uuid',
    entityDocumentID: 'doc-uuid',
    listID: 'list-uuid' // Only records in this list
}, contextUser);
```

### Resume Interrupted Processing

```typescript
await syncer.VectorizeEntity({
    entityID: 'entity-uuid',
    entityDocumentID: 'doc-uuid',
    StartingOffset: 5000 // Skip first 5000 records
}, contextUser);
```

Note: `StartingOffset` forces OFFSET-based pagination for that run, because keyset pagination can't skip ahead without knowing the primary key at the offset. For runs from the start, the syncer **auto-promotes to keyset (seek) pagination** when the entity has a single-column orderable primary key. Each page then stays O(log N) regardless of how deep into the entity you go, which makes a meaningful difference on multi-million-row entities. When the entity has a composite primary key, the syncer falls back to `PageNumber`-based OFFSET pagination. See **[KEYSET_PAGINATION_GUIDE.md](../../../../guides/KEYSET_PAGINATION_GUIDE.md)** for details.
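
The difference between the two pagination styles can be sketched with an in-memory analogy. This is not the syncer's query code: `Row`, `offsetPage`, `keysetPage`, and `readAllByKeyset` are hypothetical, and a sorted array with binary search stands in for a database index on the primary key.

```typescript
interface Row {
  ID: string;
  Name: string;
}

// OFFSET style: skip `offset` rows, then take `pageSize`. A database has to
// walk past every skipped row, so deep pages get progressively slower.
function offsetPage(rows: Row[], offset: number, pageSize: number): Row[] {
  return rows.slice(offset, offset + pageSize);
}

// Binary search for the first row whose ID is strictly greater than `key`.
// Assumes `rows` is sorted ascending by ID; this mirrors an index seek.
function firstIndexGreaterThan(rows: Row[], key: string): number {
  let lo = 0;
  let hi = rows.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (rows[mid].ID <= key) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Keyset (seek) style: continue after the last key seen on the previous page.
// Pass null for the first page.
function keysetPage(rows: Row[], afterID: string | null, pageSize: number): Row[] {
  const start = afterID === null ? 0 : firstIndexGreaterThan(rows, afterID);
  return rows.slice(start, start + pageSize);
}

// Walk every page by carrying the last ID forward; no offset is ever needed.
function readAllByKeyset(rows: Row[], pageSize: number): Row[] {
  const seen: Row[] = [];
  let lastID: string | null = null;
  for (;;) {
    const page = keysetPage(rows, lastID, pageSize);
    if (page.length === 0) return seen;
    seen.push(...page);
    lastID = page[page.length - 1].ID;
  }
}
```

The sketch also shows why a resume point expressed as a row count cannot use keyset pagination directly: `keysetPage` needs the last ID seen, not the number of rows already processed.
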

### Manage Entity Documents

```typescript
// Look up by name
const doc = await syncer.GetEntityDocumentByName('Contacts Vectorization', contextUser);

// Get all active 'Record Duplicate' documents (the default type)
const activeDocs = await syncer.GetActiveEntityDocuments();

// Get active 'Search'-type documents for specific entities only
const searchDocs = await syncer.GetActiveEntityDocuments(['Contacts', 'Companies'], 'Search');

// Create a default document when none exists
const newDoc = await syncer.CreateDefaultEntityDocument(
    entityID,
    vectorDatabase,
    aiModel
);
```

> `GetActiveEntityDocuments` returns an **empty array** when no Active documents of the
> requested type exist; it does not throw. (A misspelled/unknown type name is logged as a
> warning.) Callers treat the empty case as "nothing to do": e.g. `VectorizeEntityAction`
> returns `Success: true` / `ResultCode: "NO_DOCUMENTS"` so the unattended daily sync job
> isn't reported as a failed run on a fresh DB.

### Search Entity Documents and the daily sync job

A `Search`-type Entity Document marks an entity as semantically searchable. It backs `Provider.SearchEntity` and the `Search Entity` action (see [guides/ENTITY_SEARCH_GUIDE.md](../../../../guides/ENTITY_SEARCH_GUIDE.md)). MemberJunction ships a **standard set** as seed metadata in `/metadata/entity-documents/`, all on the in-process `Simple Vector Service Provider` + `gte-small (Local)` stack (no API key, no cost): `MJ: Entities`, `MJ: AI Agents`, `MJ: Actions`, `MJ: AI Prompts`, and `MJ: AI Models`.

The seeded **`Entity Vector Sync - Daily`** scheduled job (cron `0 0 4 * * *`, `RunImmediatelyIfNeverRun: true`) drives the `Vectorize Entity` action with `EntityDocumentType="Search"`, vectorizing every Active Search document. Add your own by dropping a record into `.entity-documents.json` (+ a `.njk` template) and pushing with `mj sync push`; the job picks it up on the next run.
With no Active Search documents, the job is a clean no-op rather than an error.

## Configuration Types

### VectorizeEntityParams

```typescript
type VectorizeEntityParams = {
    entityID: string;              // Entity to vectorize
    entityDocumentID?: string;     // Entity Document configuration
    listID?: string;               // Optional: vectorize only this list
    listBatchCount?: number;       // Records per fetch batch (default: 50)
    VectorizeBatchCount?: number;  // Embedding batch size (default: 50)
    UpsertBatchCount?: number;     // DB upsert batch size (default: 50)
    StartingOffset?: number;       // Skip records for resume
    CurrentUser?: UserInfo;        // User context
};
```

### EntitySyncConfig

```typescript
type EntitySyncConfig = {
    EntityDocumentID: string;
    Interval: number;              // Seconds between syncs
    RunViewParams: RunViewParams;
    IncludeInSync: boolean;
    LastRunDate: string;
    VectorIndexID: number;
    VectorID: number;
};
```

## Entity Document Templates

Templates define how entity records are transformed into text for embedding generation.

```mermaid
graph LR
    ED["Entity Document"] --> TMPL["Template<br/>${Field} syntax"]
    TMPL --> PARSER["Template Parser"]
    REC["Entity Record"] --> PARSER
    PARSER --> TEXT["Plain Text"]
    TEXT --> EMBED["Embedding Model"]
    EMBED --> VEC["Vector"]

    style ED fill:#2d6a9f,stroke:#1a4971,color:#fff
    style TMPL fill:#2d8659,stroke:#1a5c3a,color:#fff
    style PARSER fill:#b8762f,stroke:#8a5722,color:#fff
    style EMBED fill:#7c5295,stroke:#563a6b,color:#fff
    style REC fill:#2d8659,stroke:#1a5c3a,color:#fff
    style TEXT fill:#b8762f,stroke:#8a5722,color:#fff
    style VEC fill:#7c5295,stroke:#563a6b,color:#fff
```

## Environment Variables

```env
# Database
DB_HOST=your-sql-server
DB_PORT=1433
DB_USERNAME=your-username
DB_PASSWORD=your-password
DB_DATABASE=your-database

# AI Models
OPENAI_API_KEY=your-openai-key
MISTRAL_API_KEY=your-mistral-key

# Vector Database
PINECONE_API_KEY=your-pinecone-key
PINECONE_HOST=your-pinecone-host
PINECONE_DEFAULT_INDEX=your-default-index

# User Context
CURRENT_USER_EMAIL=user@example.com
```

## Dependencies

| Package | Purpose |
|---|---|
| `@memberjunction/ai` | `BaseEmbeddings`, `GetAIAPIKey`, `EmbedTextsResult` |
| `@memberjunction/ai-vectordb` | `VectorDBBase`, `VectorRecord` |
| `@memberjunction/ai-vectors` | `VectorBase` base class |
| `@memberjunction/aiengine` | `AIEngine` singleton |
| `@memberjunction/core` | `Metadata`, `RunView`, `BaseEntity`, `UserInfo` |
| `@memberjunction/core-entities` | Entity type definitions |
| `@memberjunction/global` | MJGlobal class factory |
| `@memberjunction/templates` | Template engine for text generation |

## Performance Considerations

- **Batch sizes**: Adjust `listBatchCount`, `VectorizeBatchCount`, and `UpsertBatchCount` based on available memory and API rate limits
- **Long-running**: Full vectorization of large entities can take hours; use `StartingOffset` to resume
- **Worker concurrency**: The BatchWorker processes embedding and upsert operations concurrently within each batch
- **Caching**: `EntityDocumentCache` reduces database lookups for document metadata

## Development

```bash
# Build
npm run build

# Development mode
npm run start
```

## License

ISC