# @memberjunction/ai-vectors

Core foundation package for vector operations in MemberJunction. Provides text processing utilities (chunking, extraction), base classes for vectorization pipelines, and interfaces for embedding providers and vector databases.

## Installation

```bash
npm install @memberjunction/ai-vectors
```

## What's Included

| Export | Type | Purpose |
|---|---|---|
| `TextChunker` | Class | Token-aware text splitting with sentence, paragraph, and fixed strategies |
| `TextExtractor` | Class | HTML stripping, entity decoding, MIME-type routing, token truncation |
| `VectorBase` | Class | Base class providing RunView, Metadata, AIEngine integration for subclasses |
| `IEmbedding` | Interface | Contract for single and batch text embedding generation |
| `IVectorDatabase` | Interface | Contract for vector database management (create/delete/list indexes) |
| `IVectorIndex` | Interface | Contract for CRUD operations on vector records within an index |
| `ChunkTextParams` | Type | Configuration for `TextChunker.ChunkText()` |
| `TextChunk` | Type | Output chunk with text, offsets, token count, and index |
| `PageRecordsParams` | Type | Paginated entity record retrieval configuration. Supports both OFFSET-based pagination (`PageNumber`) and keyset/seek pagination (`AfterKey`); see [KEYSET_PAGINATION_GUIDE.md](../../../../guides/KEYSET_PAGINATION_GUIDE.md). |

## Architecture

```mermaid
graph TD
    subgraph Core["@memberjunction/ai-vectors"]
        TC["TextChunker"]
        TE["TextExtractor"]
        VB["VectorBase"]
        IE["IEmbedding"]
        IVD["IVectorDatabase"]
        IVI["IVectorIndex"]
    end

    subgraph MJCore["MemberJunction Core"]
        MD["Metadata"]
        RV["RunView"]
        BE["BaseEntity"]
    end

    subgraph AIEngine["AI Engine"]
        AIM["AIEngine.Instance"]
        MOD["Embedding Models"]
        VDB["Vector Databases"]
    end

    subgraph Consumers["Consumer Packages"]
        SYNC["ai-vector-sync"]
        DUPE["ai-vector-dupe"]
    end

    VB --> MD
    VB --> RV
    VB --> BE
    VB --> AIM
    AIM --> MOD
    AIM --> VDB
    SYNC --> VB
    SYNC --> TC
    SYNC --> TE
    DUPE --> VB

    style Core fill:#2d6a9f,stroke:#1a4971,color:#fff
    style MJCore fill:#2d8659,stroke:#1a5c3a,color:#fff
    style AIEngine fill:#b8762f,stroke:#8a5722,color:#fff
    style Consumers fill:#7c5295,stroke:#563a6b,color:#fff
```

## TextChunker

Token-aware text splitting that respects natural language boundaries. All methods are static.

### Strategies

| Strategy | Splits On | Best For |
|---|---|---|
| `sentence` | Sentence-ending punctuation (`.` `!` `?`) | Prose, articles, descriptions |
| `paragraph` | Double newlines (`\n\n`) | Structured documents, Markdown, reports |
| `fixed` | Whitespace boundaries at the character limit | Logs, code, unstructured data |

### Basic Usage

```typescript
import { TextChunker, ChunkTextParams, TextChunk } from '@memberjunction/ai-vectors';

const article = `Machine learning models require training data.
The quality of training data directly impacts model performance.
Data preprocessing is a critical step in any ML pipeline.
Feature engineering transforms raw data into meaningful representations.
Good features can dramatically improve model accuracy.`;

// Sentence strategy (default)
const chunks: TextChunk[] = TextChunker.ChunkText({
    Text: article,
    MaxChunkTokens: 128,
    Strategy: 'sentence'
});

for (const chunk of chunks) {
    console.log(`Chunk ${chunk.Index}: ${chunk.TokenCount} tokens, offset ${chunk.StartOffset}-${chunk.EndOffset}`);
    console.log(chunk.Text);
}
```

### Paragraph Strategy

```typescript
const markdownDoc = `## Introduction
This document covers the architecture of our data pipeline.
It handles ingestion, transformation, and storage.

## Processing
Records are validated against schema constraints.
Invalid records are routed to a dead-letter queue.

## Storage
Processed data is stored in both relational and vector databases.
Vector embeddings enable semantic search across all records.`;

const chunks = TextChunker.ChunkText({
    Text: markdownDoc,
    MaxChunkTokens: 256,
    Strategy: 'paragraph'
});
// Each paragraph becomes a chunk (or paragraphs merge if they fit together)
```

### Fixed Strategy

```typescript
const logData = `2024-01-15T10:00:00Z INFO Server started on port 4000
2024-01-15T10:00:01Z INFO Connected to database
2024-01-15T10:00:02Z WARN High memory usage detected: 85%
2024-01-15T10:00:03Z ERROR Connection timeout after 30000ms`;

const chunks = TextChunker.ChunkText({
    Text: logData,
    MaxChunkTokens: 64,
    Strategy: 'fixed'
});
```

### Configuring Overlap

Overlap repeats trailing content from the previous chunk at the start of the next chunk, preserving context across chunk boundaries. Defaults to 10% of `MaxChunkTokens`.

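The effect of overlap can be illustrated with a simplified sliding window over words. This is only a sketch under stated assumptions, not `TextChunker`'s implementation: `slidingWindows` and `defaultOverlap` are hypothetical helpers, the window works on whole words rather than estimated tokens, and rounding the 10% default down is a guess.

```typescript
// Illustrative only: a word-level sliding window, NOT TextChunker's implementation.
// `defaultOverlap` and `slidingWindows` are hypothetical helper names.

/** Assumed default: 10% of the chunk budget, rounded down (the rounding is a guess). */
function defaultOverlap(maxChunkTokens: number): number {
    return Math.floor(maxChunkTokens * 0.1);
}

/**
 * Splits `units` into windows of at most `size` items. Each window after the
 * first starts with the last `overlap` items of the previous window.
 */
function slidingWindows(units: string[], size: number, overlap: number): string[][] {
    if (size <= 0) throw new RangeError('size must be positive');
    if (overlap < 0 || overlap >= size) throw new RangeError('overlap must be in [0, size)');

    const step = size - overlap; // how far the window advances each time
    const result: string[][] = [];
    for (let start = 0; start < units.length; start += step) {
        result.push(units.slice(start, start + size));
        if (start + size >= units.length) break; // this window reached the end
    }
    return result;
}

const sampleWords = 'a b c d e f g h'.split(' ');
const sampleWindows = slidingWindows(sampleWords, 4, 1);
console.log(sampleWindows.map(w => w.join(' ')));
// prints [ 'a b c d', 'd e f g', 'g h' ]
console.log(defaultOverlap(512));
// prints 51
```

Each window repeats the last word of the one before it (`d`, then `g`), which is the context-preserving behavior described above. With `TextChunker` itself, overlap is controlled through `OverlapTokens`:
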
```typescript
import { TextChunker } from '@memberjunction/ai-vectors';

// longDocument: any long string to split (defined elsewhere)

// Explicit overlap: 50 tokens of shared context between chunks
const overlappingChunks = TextChunker.ChunkText({
    Text: longDocument,
    MaxChunkTokens: 512,
    OverlapTokens: 50,
    Strategy: 'sentence'
});

// No overlap
const disjointChunks = TextChunker.ChunkText({
    Text: longDocument,
    MaxChunkTokens: 512,
    OverlapTokens: 0,
    Strategy: 'sentence'
});
```

### Token Estimation

`EstimateTokenCount` provides a fast approximation using the ~4 characters per token heuristic for English text. This is suitable for chunking where exact counts are not critical.

```typescript
const tokens = TextChunker.EstimateTokenCount('This is a sample sentence.');
// Returns: 7 (26 characters / 4 = 6.5, rounded up)

// For production accuracy with specific models, use tiktoken directly
// and pass the result to MaxChunkTokens for precise control
```

### TextChunk Output Shape

Each chunk includes full position metadata for traceability back to the source:

```typescript
interface TextChunk {
    Text: string;        // The chunk text content
    StartOffset: number; // Start character offset in original text
    EndOffset: number;   // End character offset (exclusive)
    TokenCount: number;  // Approximate token count
    Index: number;       // 0-based chunk index
}
```

## TextExtractor

Static utilities for extracting clean plain text from various content formats. Dependency-light (regex-based, no DOM parser required).

### HTML Extraction

```typescript
import { TextExtractor } from '@memberjunction/ai-vectors';

const html = `
This is a formatted paragraph with & entities.