--- name: vector-databases description: Vector database selection, embedding storage, approximate nearest neighbor (ANN) algorithms, and vector search optimization. Use when choosing vector stores, designing semantic search, or optimizing similarity search performance. allowed-tools: Read, Glob, Grep --- # Vector Databases ## When to Use This Skill Use this skill when: - Choosing between vector database options - Designing semantic/similarity search systems - Optimizing vector search performance - Understanding ANN algorithm trade-offs - Scaling vector search infrastructure - Implementing hybrid search (vectors + filters) **Keywords:** vector database, embeddings, vector search, similarity search, ANN, approximate nearest neighbor, HNSW, IVF, FAISS, Pinecone, Weaviate, Milvus, Qdrant, Chroma, pgvector, cosine similarity, semantic search ## Vector Database Comparison ### Managed Services | Database | Strengths | Limitations | Best For | | -------- | --------- | ----------- | -------- | | **Pinecone** | Fully managed, easy scaling, enterprise | Vendor lock-in, cost at scale | Enterprise production | | **Weaviate Cloud** | GraphQL, hybrid search, modules | Complexity | Knowledge graphs | | **Zilliz Cloud** | Milvus-based, high performance | Learning curve | High-scale production | | **MongoDB Atlas Vector** | Existing MongoDB users | Newer feature | MongoDB shops | | **Elastic Vector** | Existing Elastic stack | Resource heavy | Search platforms | ### Self-Hosted Options | Database | Strengths | Limitations | Best For | | -------- | --------- | ----------- | -------- | | **Milvus** | Feature-rich, scalable, GPU support | Operational complexity | Large-scale production | | **Qdrant** | Rust performance, filtering, easy | Smaller ecosystem | Performance-focused | | **Weaviate** | Modules, semantic, hybrid | Memory usage | Knowledge applications | | **Chroma** | Simple, Python-native | Limited scale | Development, prototyping | | **pgvector** | PostgreSQL extension | Performance limits | Postgres shops | | **FAISS** | Library, not DB, fastest | No persistence, no filtering | Research, embedded | ### Selection Decision Tree ```text Need managed, don't want operations? ├── Yes → Pinecone (simplest) or Weaviate Cloud └── No (self-hosted) └── Already using PostgreSQL? ├── Yes, <1M vectors → pgvector └── No └── Need maximum performance at scale? ├── Yes → Milvus or Qdrant └── No └── Prototyping/development? ├── Yes → Chroma └── No → Qdrant (balanced choice) ``` ## ANN Algorithms ### Algorithm Overview ```text Exact KNN: • Search ALL vectors • O(n) time complexity • Perfect accuracy • Impractical at scale Approximate NN (ANN): • Search SUBSET of vectors • O(log n) to O(1) complexity • Near-perfect accuracy • Practical at any scale ``` ### HNSW (Hierarchical Navigable Small World) ```text Layer 3: ○───────────────────────○ (sparse, long connections) │ │ Layer 2: ○───○───────○───────○───○ (medium density) │ │ │ │ │ Layer 1: ○─○─○─○─○─○─○─○─○─○─○─○─○ (denser) │││││││││││││││││││││││ Layer 0: ○○○○○○○○○○○○○○○○○○○○○○○○○ (all vectors) Search: Start at top layer, greedily descend • Fast: O(log n) search time • High recall: >95% typically • Memory: Extra graph storage ``` **HNSW Parameters:** | Parameter | Description | Trade-off | | --------- | ----------- | --------- | | `M` | Connections per node | Memory vs. recall | | `ef_construction` | Build-time search width | Build time vs. recall | | `ef_search` | Query-time search width | Latency vs. recall | ### IVF (Inverted File Index) ```text Clustering Phase: ┌─────────────────────────────────────────┐ │ Cluster vectors into K centroids │ │ │ │ ● ● ● ● │ │ /│\ /│\ /│\ /│\ │ │ ○○○○○ ○○○○○ ○○○○○ ○○○○○ │ │ Cluster 1 Cluster 2 Cluster 3 Cluster 4│ └─────────────────────────────────────────┘ Search Phase: 1. Find nprobe nearest centroids 2. Search only those clusters 3. Much faster than exhaustive ``` **IVF Parameters:** | Parameter | Description | Trade-off | | --------- | ----------- | --------- | | `nlist` | Number of clusters | Build time vs. search quality | | `nprobe` | Clusters to search | Latency vs. recall | ### IVF-PQ (Product Quantization) ```text Original Vector (128 dim): [0.1, 0.2, ..., 0.9] (128 × 4 bytes = 512 bytes) PQ Compressed (8 subvectors, 8-bit codes): [23, 45, 12, 89, 56, 34, 78, 90] (8 bytes) Memory reduction: 64x Accuracy trade-off: ~5% recall drop ``` ### Algorithm Comparison | Algorithm | Search Speed | Memory | Build Time | Recall | | --------- | ------------ | ------ | ---------- | ------ | | **Flat/Brute** | Slow (O(n)) | Low | None | 100% | | **IVF** | Fast | Low | Medium | 90-95% | | **IVF-PQ** | Very fast | Very low | Medium | 85-92% | | **HNSW** | Very fast | High | Slow | 95-99% | | **HNSW+PQ** | Very fast | Medium | Slow | 90-95% | ### When to Use Which ```text < 100K vectors: └── Flat index (exact search is fast enough) 100K - 1M vectors: └── HNSW (best recall/speed trade-off) 1M - 100M vectors: ├── Memory available → HNSW └── Memory constrained → IVF-PQ or HNSW+PQ > 100M vectors: └── Sharded IVF-PQ or distributed HNSW ``` ## Distance Metrics ### Common Metrics | Metric | Formula | Range | Best For | | ------ | ------- | ----- | -------- | | **Cosine Similarity** | `A·B / (\|\|A\|\| \|\|B\|\|)` | [-1, 1] | Normalized embeddings | | **Dot Product** | `A·B` | (-∞, ∞) | When magnitude matters | | **Euclidean (L2)** | `√Σ(A-B)²` | [0, ∞) | Absolute distances | | **Manhattan (L1)** | `Σ\|A-B\|` | [0, ∞) | High-dimensional sparse | ### Metric Selection ```text Embeddings pre-normalized (unit vectors)? ├── Yes → Cosine = Dot Product (use Dot, faster) └── No └── Magnitude meaningful? ├── Yes → Dot Product └── No → Cosine Similarity Note: Most embedding models output normalized vectors → Dot product is usually the best choice ``` ## Filtering and Hybrid Search ### Pre-filtering vs Post-filtering ```text Pre-filtering (Filter → Search): ┌─────────────────────────────────────────┐ │ 1. Apply metadata filter │ │ (category = "electronics") │ │ Result: 10K of 1M vectors │ │ │ │ 2. Vector search on 10K vectors │ │ Much faster, guaranteed filter match │ └─────────────────────────────────────────┘ Post-filtering (Search → Filter): ┌─────────────────────────────────────────┐ │ 1. Vector search on 1M vectors │ │ Return top-1000 │ │ │ │ 2. Apply metadata filter │ │ May return < K results! │ └─────────────────────────────────────────┘ ``` ### Hybrid Search Architecture ```text Query: "wireless headphones under $100" │ ┌─────┴─────┐ ▼ ▼ ┌───────┐ ┌───────┐ │Vector │ │Filter │ │Search │ │ Build │ │"wire- │ │price │ │less │ │< 100 │ │head- │ │ │ │phones"│ │ │ └───────┘ └───────┘ │ │ └─────┬─────┘ ▼ ┌───────────┐ │ Combine │ │ Results │ └───────────┘ ``` ### Metadata Index Design | Metadata Type | Index Strategy | Query Example | | ------------- | -------------- | ------------- | | **Categorical** | Bitmap/hash index | category = "books" | | **Numeric range** | B-tree | price BETWEEN 10 AND 50 | | **Keyword search** | Inverted index | tags CONTAINS "sale" | | **Geospatial** | R-tree/geohash | location NEAR (lat, lng) | ## Scaling Strategies ### Sharding Approaches ```text Naive Sharding (by ID): ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Shard 1 │ │ Shard 2 │ │ Shard 3 │ │ IDs 0-N │ │IDs N-2N │ │IDs 2N-3N│ └─────────┘ └─────────┘ └─────────┘ Query → Search ALL shards → Merge results Semantic Sharding (by cluster): ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Shard 1 │ │ Shard 2 │ │ Shard 3 │ │ Tech │ │ Health │ │ Finance │ │ docs │ │ docs │ │ docs │ └─────────┘ └─────────┘ └─────────┘ Query → Route to relevant shard(s) → Faster! ``` ### Replication ```text ┌─────────────────────────────────────────┐ │ Load Balancer │ └─────────────────────────────────────────┘ │ │ │ ▼ ▼ ▼ ┌─────────┐ ┌─────────┐ ┌─────────┐ │Replica 1│ │Replica 2│ │Replica 3│ │ (Read) │ │ (Read) │ │ (Read) │ └─────────┘ └─────────┘ └─────────┘ │ │ │ └───────────┼───────────┘ │ ┌─────────┐ │ Primary │ │ (Write) │ └─────────┘ ``` ### Scaling Decision Matrix | Scale (vectors) | Architecture | Replication | | --------------- | ------------ | ----------- | | < 1M | Single node | Optional | | 1-10M | Single node, more RAM | For HA | | 10-100M | Sharded, few nodes | Required | | 100M-1B | Sharded, many nodes | Required | | > 1B | Sharded + tiered | Required | ## Performance Optimization ### Index Build Optimization | Optimization | Description | Impact | | ------------ | ----------- | ------ | | **Batch insertion** | Insert in batches of 1K-10K | 10x faster | | **Parallel build** | Multi-threaded index construction | 2-4x faster | | **Incremental index** | Add to existing index | Avoids rebuild | | **GPU acceleration** | Use GPU for training (IVF) | 10-100x faster | ### Query Optimization | Optimization | Description | Impact | | ------------ | ----------- | ------ | | **Warm cache** | Keep index in memory | 10x latency reduction | | **Query batching** | Batch similar queries | Higher throughput | | **Reduce dimensions** | PCA, random projection | 2-4x faster | | **Early termination** | Stop when "good enough" | Lower latency | ### Memory Optimization ```text Memory per vector: ┌────────────────────────────────────────┐ │ 1536 dims × 4 bytes = 6KB per vector │ │ │ │ 1M vectors: │ │ Raw: 6GB │ │ + HNSW graph: +2-4GB (M-dependent) │ │ = 8-10GB total │ │ │ │ With PQ (64 subquantizers): │ │ 1M vectors: ~64MB │ │ = 100x reduction │ └────────────────────────────────────────┘ ``` ## Operational Considerations ### Backup and Recovery | Strategy | Description | RPO/RTO | | -------- | ----------- | ------- | | **Snapshots** | Periodic full backup | Hours | | **WAL replication** | Write-ahead log streaming | Minutes | | **Real-time sync** | Synchronous replication | Seconds | ### Monitoring Metrics | Metric | Description | Alert Threshold | | ------ | ----------- | --------------- | | **Query latency p99** | 99th percentile latency | > 100ms | | **Recall** | Search accuracy | < 90% | | **QPS** | Queries per second | Capacity dependent | | **Memory usage** | Index memory | > 80% | | **Index freshness** | Time since last update | Domain dependent | ### Index Maintenance ```text ┌─────────────────────────────────────────┐ │ Index Maintenance Tasks │ ├─────────────────────────────────────────┤ │ • Compaction: Merge small segments │ │ • Reindex: Rebuild degraded index │ │ • Vacuum: Remove deleted vectors │ │ • Optimize: Tune parameters │ │ │ │ Schedule during low-traffic periods │ └─────────────────────────────────────────┘ ``` ## Common Patterns ### Multi-Tenant Vector Search ```text Option 1: Namespace/Collection per tenant ┌─────────────────────────────────────────┐ │ tenant_1_collection │ │ tenant_2_collection │ │ tenant_3_collection │ └─────────────────────────────────────────┘ Pro: Complete isolation Con: Many indexes, operational overhead Option 2: Single collection + tenant filter ┌─────────────────────────────────────────┐ │ shared_collection │ │ metadata: { tenant_id: "..." } │ │ Pre-filter by tenant_id │ └─────────────────────────────────────────┘ Pro: Simpler operations Con: Requires efficient filtering ``` ### Real-Time Updates ```text Write Path: ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ Write │ │ Buffer │ │ Merge │ │ Request │───▶│ (Memory) │───▶│ to Index │ └─────────────┘ └─────────────┘ └─────────────┘ Strategy: 1. Buffer writes in memory 2. Periodically merge to main index 3. Search: main index + buffer 4. Compact periodically ``` ### Embedding Versioning ```text Version 1 embeddings ──┐ │ Version 2 embeddings ──┼──▶ Parallel indexes during migration │ │ ┌─────────────────────┐ └───▶│ Gradual reindexing │ │ Blue-green switch │ └─────────────────────┘ ``` ## Cost Estimation ### Storage Costs ```text Cost = (vectors × dimensions × bytes × replication) / GB × $/GB/month Example: 10M vectors × 1536 dims × 4 bytes × 3 replicas = 184 GB At $0.10/GB/month = $18.40/month storage Note: Memory (for serving) costs more than storage ``` ### Compute Costs ```text Factors: • QPS (queries per second) • Latency requirements • Index type (HNSW needs more RAM) • Filtering complexity Rule of thumb: • 1M vectors, HNSW, <50ms latency: 16GB RAM • 10M vectors, HNSW, <50ms latency: 64-128GB RAM • 100M vectors: Distributed system required ``` ## Related Skills - `rag-architecture` - Using vector databases in RAG systems - `llm-serving-patterns` - LLM inference with vector retrieval - `ml-system-design` - End-to-end ML pipeline design - `estimation-techniques` - Capacity planning for vector systems ## Version History - v1.0.0 (2025-12-26): Initial release - Vector database patterns for systems design --- ## Last Updated **Date:** 2025-12-26