# @memberjunction/ai-vector-sync
Synchronizes MemberJunction entity records with vector databases by transforming records into embeddings through a template-based pipeline. Handles batch processing, worker-based parallelism, Entity Document management, and Entity Record Document tracking.
## Architecture
```mermaid
graph TD
subgraph SyncPkg["@memberjunction/ai-vector-sync"]
EVS["EntityVectorSyncer"]
EDC["EntityDocumentCache"]
EDTP["EntityDocumentTemplateParser"]
BW["BatchWorker"]
end
subgraph Pipeline["Vectorization Pipeline"]
FETCH["Fetch Records
(batched)"] --> TEMPL["Parse Templates
(text from fields)"]
TEMPL --> EMBED["Generate Embeddings
(AI model)"]
EMBED --> UPSERT["Upsert to
Vector DB"]
UPSERT --> TRACK["Create Entity
Record Documents"]
end
subgraph MJEntities["MemberJunction Entities"]
ED["Entity Documents"]
EDT["Entity Document Types"]
ERD["Entity Record Documents"]
VDI["Vector Indexes"]
end
subgraph External["External Services"]
AI["Embedding Model
(OpenAI, Mistral, etc.)"]
VDB["Vector Database
(Pinecone, etc.)"]
end
EVS --> EDC
EVS --> EDTP
EVS --> BW
EDTP --> TEMPL
BW --> EMBED
BW --> UPSERT
BW --> TRACK
EVS --> ED
EVS --> ERD
BW --> AI
BW --> VDB
style SyncPkg fill:#2d6a9f,stroke:#1a4971,color:#fff
style Pipeline fill:#2d8659,stroke:#1a5c3a,color:#fff
style MJEntities fill:#b8762f,stroke:#8a5722,color:#fff
style External fill:#7c5295,stroke:#563a6b,color:#fff
```
## Installation
```bash
npm install @memberjunction/ai-vector-sync
```
## Overview
This package converts MemberJunction entity records into vector embeddings stored in a vector database. The process is driven by **Entity Documents** -- metadata records that define which entity to vectorize, how to generate text from it (via templates), which embedding model to use, and where to store the results.
Key capabilities:
- **Batch processing** with configurable sizes for fetching, embedding, and upserting
- **Template-based text generation** using Entity Document templates that reference entity fields
- **Worker architecture** for concurrent embedding and upsert operations
- **Entity Document caching** via a singleton cache to avoid repeated database lookups
- **Default Entity Document creation** for entities that lack one
- **Resume support** via `StartingOffset` for interrupted processes
- **Entity Record Document tracking** to record which records have been vectorized
## Vectorization Flow
```mermaid
sequenceDiagram
participant Caller
participant EVS as EntityVectorSyncer
participant Cache as EntityDocumentCache
participant Parser as TemplateParser
participant Worker as BatchWorker
participant Model as Embedding Model
participant VDB as Vector Database
participant DB as MJ Database
Caller->>EVS: VectorizeEntity(params, user)
EVS->>EVS: Config(forceRefresh, user)
EVS->>Cache: Refresh (loads Entity Documents)
EVS->>Cache: GetDocument(entityDocumentID)
Cache-->>EVS: EntityDocumentEntity
EVS->>DB: Load template for Entity Document
EVS->>DB: Fetch entity records (batch)
loop For each batch
EVS->>Parser: Parse template for each record
Parser-->>EVS: Text strings
EVS->>Worker: VectorizeTemplates batch
Worker->>Model: createBatchEmbedding(texts)
Model-->>Worker: Embedding vectors
EVS->>Worker: UpsertVectors batch
Worker->>VDB: createRecords(vectors)
VDB-->>Worker: Success/failure
EVS->>Worker: Create EntityRecordDocuments
Worker->>DB: Save tracking records
end
EVS-->>Caller: VectorizeEntityResponse
```
## Core Components
### EntityVectorSyncer
The main class that orchestrates the entire vectorization process. Extends `VectorBase` from `@memberjunction/ai-vectors`.
**Key methods:**
| Method | Description |
|---|---|
| `Config(forceRefresh, contextUser)` | Initializes engines and caches; must be called before vectorization |
| `VectorizeEntity(params, contextUser)` | Runs the full vectorization pipeline for an entity |
| `GetEntityDocument(id)` | Retrieves an Entity Document by ID |
| `GetEntityDocumentByName(name, user)` | Retrieves an Entity Document by name |
| `GetActiveEntityDocuments(entityNames?, entityDocumentType?)` | Gets Active Entity Documents of a given type (default `'Record Duplicate'`; pass `'Search'` for the search-tier pool), optionally filtered by entity name. Returns `[]` (no throw) when nothing matches |
| `CreateDefaultEntityDocument(entityID, vectorDB, aiModel)` | Creates a default Entity Document when one does not exist |
### EntityDocumentCache
A singleton cache that loads all Entity Document and Entity Document Type records into memory for fast lookup.
```mermaid
classDiagram
class EntityDocumentCache {
-_instance : EntityDocumentCache
-_cache : Record~string, EntityDocumentEntity~
-_typeCache : Record~string, EntityDocumentTypeEntity~
+Instance : EntityDocumentCache
+IsLoaded : boolean
+GetDocument(id) EntityDocumentEntity
+GetDocumentByName(name) EntityDocumentEntity
+GetDocumentType(id) EntityDocumentTypeEntity
+GetDocumentTypeByName(name) EntityDocumentTypeEntity
+GetFirstActiveDocumentForEntityByID(entityID) EntityDocumentEntity
+GetFirstActiveDocumentForEntityByName(name) EntityDocumentEntity
+Refresh(forceRefresh, user) void
+SetCurrentUser(user) void
}
style EntityDocumentCache fill:#2d6a9f,stroke:#1a4971,color:#fff
```
### EntityDocumentTemplateParser
Converts entity records into text strings by evaluating Entity Document templates. Templates use `${FieldName}` syntax to reference entity field values.
```typescript
// Template example: "${FirstName} ${LastName} works at ${Company} as ${Title}"
// With record { FirstName: 'Jane', LastName: 'Doe', Company: 'Acme', Title: 'Engineer' }
// Result: "Jane Doe works at Acme as Engineer"
```
### BatchWorker
Handles the parallel execution of embedding generation, vector database upserts, and Entity Record Document creation. Configurable batch sizes allow tuning for memory and API rate limits.
## Usage
### Basic Vectorization
```typescript
import { EntityVectorSyncer } from '@memberjunction/ai-vector-sync';
import { UserInfo } from '@memberjunction/core';
const syncer = new EntityVectorSyncer();
// Initialize (required once)
await syncer.Config(false, contextUser);
// Vectorize all records for an entity
await syncer.VectorizeEntity({
entityID: 'entity-uuid',
entityDocumentID: 'doc-uuid',
listBatchCount: 50,
VectorizeBatchCount: 50,
UpsertBatchCount: 50
}, contextUser);
```
### Vectorize a Specific List
```typescript
await syncer.VectorizeEntity({
entityID: 'entity-uuid',
entityDocumentID: 'doc-uuid',
listID: 'list-uuid' // Only records in this list
}, contextUser);
```
### Resume Interrupted Processing
```typescript
await syncer.VectorizeEntity({
entityID: 'entity-uuid',
entityDocumentID: 'doc-uuid',
StartingOffset: 5000 // Skip first 5000 records
}, contextUser);
```
Note: `StartingOffset` forces OFFSET-based pagination for that run (keyset can't skip ahead without knowing the PK at the offset). For runs from the start, the syncer **auto-promotes to keyset (seek) pagination** when the entity has a single-column orderable PK — each page stays O(log N) regardless of how deep into the entity you go, which makes a meaningful difference on multi-million-row entities. Falls back to `PageNumber`-based OFFSET when the entity has a composite PK. See **[KEYSET_PAGINATION_GUIDE.md](../../../../guides/KEYSET_PAGINATION_GUIDE.md)** for details.
### Manage Entity Documents
```typescript
// Look up by name
const doc = await syncer.GetEntityDocumentByName('Contacts Vectorization', contextUser);
// Get all active 'Record Duplicate' documents (the default type)
const activeDocs = await syncer.GetActiveEntityDocuments();
// Get active 'Search'-type documents for specific entities only
const searchDocs = await syncer.GetActiveEntityDocuments(['Contacts', 'Companies'], 'Search');
// Create a default document when none exists
const newDoc = await syncer.CreateDefaultEntityDocument(
entityID, vectorDatabase, aiModel
);
```
> `GetActiveEntityDocuments` returns an **empty array** when no Active documents of the
> requested type exist — it does not throw. (A misspelled/unknown type name is logged as a
> warning.) Callers treat the empty case as "nothing to do": e.g. `VectorizeEntityAction`
> returns `Success: true` / `ResultCode: "NO_DOCUMENTS"` so the unattended daily sync job
> isn't reported as a failed run on a fresh DB.
### Search Entity Documents and the daily sync job
A `Search`-type Entity Document marks an entity as semantically searchable (it backs
`Provider.SearchEntity` / the `Search Entity` action — see
[guides/ENTITY_SEARCH_GUIDE.md](../../../../guides/ENTITY_SEARCH_GUIDE.md)). MemberJunction
ships a **standard set** as seed metadata in `/metadata/entity-documents/`, all on the
in-process `Simple Vector Service Provider` + `gte-small (Local)` stack (no API key, no cost):
`MJ: Entities`, `MJ: AI Agents`, `MJ: Actions`, `MJ: AI Prompts`, and `MJ: AI Models`.
The seeded **`Entity Vector Sync - Daily`** scheduled job (cron `0 0 4 * * *`,
`RunImmediatelyIfNeverRun: true`) drives the `Vectorize Entity` action with
`EntityDocumentType="Search"`, vectorizing every Active Search document. Add your own by
dropping a record into `.entity-documents.json` (+ a `.njk` template) and pushing with
`mj sync push` — the job picks it up on the next run. With no Active Search documents, the
job is a clean no-op rather than an error.
## Configuration Types
### VectorizeEntityParams
```typescript
type VectorizeEntityParams = {
entityID: string; // Entity to vectorize
entityDocumentID?: string; // Entity Document configuration
listID?: string; // Optional: vectorize only this list
listBatchCount?: number; // Records per fetch batch (default: 50)
VectorizeBatchCount?: number; // Embedding batch size (default: 50)
UpsertBatchCount?: number; // DB upsert batch size (default: 50)
StartingOffset?: number; // Skip records for resume
CurrentUser?: UserInfo; // User context
};
```
### EntitySyncConfig
```typescript
type EntitySyncConfig = {
EntityDocumentID: string;
Interval: number; // Seconds between syncs
RunViewParams: RunViewParams;
IncludeInSync: boolean;
LastRunDate: string;
VectorIndexID: number;
VectorID: number;
};
```
## Entity Document Templates
Templates define how entity records are transformed into text for embedding generation.
```mermaid
graph LR
ED["Entity Document"] --> TMPL["Template
${Field} syntax"]
TMPL --> PARSER["Template Parser"]
REC["Entity Record"] --> PARSER
PARSER --> TEXT["Plain Text"]
TEXT --> EMBED["Embedding Model"]
EMBED --> VEC["Vector"]
style ED fill:#2d6a9f,stroke:#1a4971,color:#fff
style TMPL fill:#2d8659,stroke:#1a5c3a,color:#fff
style PARSER fill:#b8762f,stroke:#8a5722,color:#fff
style EMBED fill:#7c5295,stroke:#563a6b,color:#fff
style REC fill:#2d8659,stroke:#1a5c3a,color:#fff
style TEXT fill:#b8762f,stroke:#8a5722,color:#fff
style VEC fill:#7c5295,stroke:#563a6b,color:#fff
```
## Environment Variables
```env
# Database
DB_HOST=your-sql-server
DB_PORT=1433
DB_USERNAME=your-username
DB_PASSWORD=your-password
DB_DATABASE=your-database
# AI Models
OPENAI_API_KEY=your-openai-key
MISTRAL_API_KEY=your-mistral-key
# Vector Database
PINECONE_API_KEY=your-pinecone-key
PINECONE_HOST=your-pinecone-host
PINECONE_DEFAULT_INDEX=your-default-index
# User Context
CURRENT_USER_EMAIL=user@example.com
```
## Dependencies
| Package | Purpose |
|---|---|
| `@memberjunction/ai` | `BaseEmbeddings`, `GetAIAPIKey`, `EmbedTextsResult` |
| `@memberjunction/ai-vectordb` | `VectorDBBase`, `VectorRecord` |
| `@memberjunction/ai-vectors` | `VectorBase` base class |
| `@memberjunction/aiengine` | `AIEngine` singleton |
| `@memberjunction/core` | `Metadata`, `RunView`, `BaseEntity`, `UserInfo` |
| `@memberjunction/core-entities` | Entity type definitions |
| `@memberjunction/global` | MJGlobal class factory |
| `@memberjunction/templates` | Template engine for text generation |
## Performance Considerations
- **Batch sizes**: Adjust `listBatchCount`, `VectorizeBatchCount`, and `UpsertBatchCount` based on available memory and API rate limits
- **Long-running**: Full vectorization of large entities can take hours; use `StartingOffset` to resume
- **Worker concurrency**: The BatchWorker processes embedding and upsert operations concurrently within each batch
- **Caching**: `EntityDocumentCache` reduces database lookups for document metadata
## Development
```bash
# Build
npm run build
# Development mode
npm run start
```
## License
ISC