---
name: using-vector-databases
description: Vector database implementation for AI/ML applications, semantic search, and RAG systems. Use when building chatbots, search engines, recommendation systems, or similarity-based retrieval. Covers Qdrant (primary), Pinecone, Milvus, pgvector, Chroma, embedding generation (OpenAI, Voyage, Cohere), chunking strategies, and hybrid search patterns.
---

# Vector Databases for AI Applications

## When to Use This Skill

Use this skill when implementing:

- **RAG (Retrieval-Augmented Generation)** systems for AI chatbots
- **Semantic search** capabilities (meaning-based, not just keyword)
- **Recommendation systems** based on similarity
- **Multi-modal AI** (unified search across text, images, audio)
- **Document similarity** and deduplication
- **Question answering** over private knowledge bases

## Quick Decision Framework

### 1. Vector Database Selection

```
START: Choosing a Vector Database

EXISTING INFRASTRUCTURE?
│
├─ Using PostgreSQL already?
│  └─ pgvector (<10M vectors, tight budget)
│     See: references/pgvector.md
│
└─ No existing vector database?
   │
   └─ OPERATIONAL PREFERENCE?
      │
      ├─ Zero-ops managed only
      │  └─ Pinecone (fully managed, excellent DX)
      │     See: references/pinecone.md
      │
      └─ Flexible (self-hosted or managed)
         │
         ├─ SCALE: <100M vectors + complex filtering ⭐
         │  └─ Qdrant (RECOMMENDED)
         │     • Best metadata filtering
         │     • Built-in hybrid search (BM25 + Vector)
         │     • Self-host: Docker/K8s
         │     • Managed: Qdrant Cloud
         │     See: references/qdrant.md
         │
         ├─ SCALE: >100M vectors + GPU acceleration
         │  └─ Milvus / Zilliz Cloud
         │     See: references/milvus.md
         │
         ├─ Embedded / No server
         │  └─ LanceDB (serverless, edge deployment)
         │
         └─ Local prototyping
            └─ Chroma (simple API, in-memory)
```

### 2. Embedding Model Selection

```
REQUIREMENTS?
│
├─ Best quality (cost no object)
│  └─ Voyage AI voyage-3 (1024d)
│     • 9.74% better than OpenAI on MTEB
│     • ~$0.12/1M tokens
│     See: references/embedding-strategies.md
│
├─ Enterprise reliability
│  └─ OpenAI text-embedding-3-large (3072d)
│     • Industry standard
│     • ~$0.13/1M tokens
│     • Matryoshka shortening: reduce to 256/512/1024d
│
├─ Cost-optimized
│  └─ OpenAI text-embedding-3-small (1536d)
│     • ~$0.02/1M tokens (6x cheaper)
│     • 90-95% of large model performance
│
├─ Multilingual (100+ languages)
│  └─ Cohere embed-v3 (1024d)
│     • ~$0.10/1M tokens
│
└─ Self-hosted / Privacy-critical
   ├─ English: nomic-embed-text-v1.5 (768d, Apache 2.0)
   ├─ Multilingual: BAAI/bge-m3 (1024d, MIT)
   └─ Long docs: jina-embeddings-v2 (768d, 8K context)
```

## Core Concepts

### Document Chunking Strategy

**Recommended defaults for most RAG systems:**

- **Chunk size:** 512 tokens (not characters)
- **Overlap:** 50 tokens (~10% overlap)

**Why these numbers?**

- 512 tokens balances context vs. precision
- Too small (128-256): fragments concepts, loses context
- Too large (1024-2048): dilutes relevance, wastes LLM tokens
- A 50-token overlap ensures sentences aren't split mid-context

See `references/chunking-patterns.md` for advanced strategies by content type; a minimal token-based chunker is sketched below.
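As a concrete illustration of those defaults, here is a minimal sliding-window chunker. This is a sketch, assuming the `tiktoken` tokenizer with the `cl100k_base` encoding; the function name and signature are illustrative, not taken from the referenced examples.

```python
# pip install tiktoken
import tiktoken


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into token-based chunks with a sliding-window overlap."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap  # advance 462 tokens per chunk at the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break  # final window already reached the end of the document
    return chunks
```

Production pipelines usually split on sentence or heading boundaries first and then apply the token window within each segment; see `references/chunking-patterns.md`.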
### Hybrid Search (Vector + Keyword)

**Hybrid Search = Vector Similarity + BM25 Keyword Matching**

```
User Query: "OAuth refresh token implementation"
                 │
          ┌──────┴──────┐
          │             │
   Vector Search   Keyword Search
    (Semantic)        (BM25)
          │             │
    Top 20 docs    Top 20 docs
          │             │
          └──────┬──────┘
                 │
     Reciprocal Rank Fusion
       (Merge + Re-rank)
                 │
       Final Top 5 Results
```

**Why hybrid matters:**

- Vector search captures semantic meaning ("OAuth refresh" ≈ "token renewal")
- Keyword search ensures exact matches (the literal string "refresh_token")
- Combined, they provide the best retrieval quality

See `references/hybrid-search.md` for implementation details; a minimal fusion sketch follows the pipeline overview below.

## Getting Started

### Python + Qdrant Example

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)

# 1. Initialize client
client = QdrantClient("localhost", port=6333)

# 2. Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
)

# 3. Insert documents with embeddings
# `chunks` is assumed to be a list of (embedding, chunk_text) pairs
points = [
    PointStruct(
        id=idx,
        vector=embedding,  # From OpenAI/Voyage/etc.
        payload={
            "text": chunk_text,
            "source": "docs/api.md",
            "section": "Authentication"
        }
    )
    for idx, (embedding, chunk_text) in enumerate(chunks)
]
client.upsert(collection_name="documents", points=points)

# 4. Search with metadata filtering
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[FieldCondition(key="section", match=MatchValue(value="Authentication"))]
    )
)
```

For complete examples, see `examples/qdrant-python/`.

### TypeScript + Qdrant Example

```typescript
import { QdrantClient } from '@qdrant/js-client-rest';

const client = new QdrantClient({ url: 'http://localhost:6333' });

// Create collection
await client.createCollection('documents', {
  vectors: { size: 1024, distance: 'Cosine' }
});

// Insert documents
await client.upsert('documents', {
  points: chunks.map((chunk, idx) => ({
    id: idx,
    vector: chunk.embedding,
    payload: { text: chunk.text, source: chunk.source }
  }))
});

// Search
const results = await client.search('documents', {
  vector: queryEmbedding,
  limit: 5,
  filter: {
    must: [
      { key: 'source', match: { value: 'docs/api.md' } }
    ]
  }
});
```

For complete examples, see `examples/typescript-rag/`.

## RAG Pipeline Architecture

### Complete Pipeline Components

```
1. INGESTION
   ├─ Document Loading (PDF, web, code, Office)
   ├─ Text Extraction & Cleaning
   ├─ Chunking (semantic, recursive, code-aware)
   └─ Embedding Generation (batch, rate-limited)

2. INDEXING
   ├─ Vector Store Insertion (batch upsert)
   ├─ Index Configuration (HNSW, distance metric)
   └─ Keyword Index (BM25 for hybrid search)

3. RETRIEVAL (Query Time)
   ├─ Query Processing (expansion, embedding)
   ├─ Hybrid Search (vector + keyword)
   ├─ Filtering & Post-Processing (metadata, MMR)
   └─ Re-Ranking (cross-encoder, LLM-based)

4. GENERATION
   ├─ Context Construction (format chunks, citations)
   ├─ Prompt Engineering (system + context + query)
   ├─ LLM Inference (streaming, temperature tuning)
   └─ Response Post-Processing (citations, validation)

5. EVALUATION (Production Critical)
   ├─ Retrieval Metrics (precision, recall, relevancy)
   ├─ Generation Metrics (faithfulness, correctness)
   └─ System Metrics (latency, cost, satisfaction)
```
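The Reciprocal Rank Fusion step named in the hybrid search flow above is small enough to sketch in full. This is a framework-free illustration; `k = 60` is the conventional default constant, and the function name is ours, not a library API:

```python
def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60, top_n: int = 5
) -> list[str]:
    """Merge ranked result lists: each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Usage: merge the top-20 lists from vector search and BM25
# final_ids = reciprocal_rank_fusion([vector_doc_ids, bm25_doc_ids], top_n=5)
```

Because each document's fused score sums `1 / (k + rank)` over every list it appears in, documents ranked well by both searches rise to the top without any score normalization across the two systems.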
## Essential Metadata for Production RAG

**Critical for filtering and relevance:**

```python
metadata = {
    # SOURCE TRACKING
    "source": "docs/api-reference.md",
    "source_type": "documentation",  # code, docs, logs, chat
    "last_updated": "2025-12-01T12:00:00Z",

    # HIERARCHICAL CONTEXT
    "section": "Authentication",
    "subsection": "OAuth 2.1",
    "heading_hierarchy": ["API Reference", "Authentication", "OAuth 2.1"],

    # CONTENT CLASSIFICATION
    "content_type": "code_example",  # prose, code, table, list
    "programming_language": "python",

    # FILTERING DIMENSIONS
    "product_version": "v2.0",
    "audience": "enterprise",  # free, pro, enterprise

    # RETRIEVAL HINTS
    "chunk_index": 3,
    "total_chunks": 12,
    "has_code": True
}
```

**Why metadata matters:**

- Enables filtering BEFORE vector search (reduces the search space)
- Improves relevance through targeted retrieval
- Supports multi-tenant systems (filter by user/org)
- Enables versioned documentation (filter by product version)

## Evaluation with RAGAS

**Use `scripts/evaluate_rag.py` for automated evaluation:**

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Answer grounded in context
    answer_relevancy,   # Answer addresses query
    context_recall,     # Retrieved docs cover ground truth
    context_precision,  # Retrieved docs are relevant
)

# Test dataset
test_data = {
    "question": ["How do I refresh OAuth tokens?"],
    "answer": ["Use /token with refresh_token grant..."],
    "contexts": [["OAuth refresh documentation..."]],
    "ground_truth": ["POST to /token with grant_type=refresh_token"]
}

# Evaluate (ragas expects a datasets.Dataset, not a plain dict)
results = evaluate(Dataset.from_dict(test_data), metrics=[
    faithfulness, answer_relevancy,
    context_recall, context_precision
])

# Production targets:
#   faithfulness:      >0.90 (minimal hallucination)
#   answer_relevancy:  >0.85 (addresses user query)
#   context_recall:    >0.80 (sufficient context retrieved)
#   context_precision: >0.75 (minimal noise)
```

## Performance Optimization

### Embedding Generation

- **Batch processing:** 100-500 chunks per batch
- **Caching:** Cache embeddings by content hash
- **Rate limiting:** Respect API provider limits (exponential backoff)

### Vector Search

- **Index type:** HNSW (Hierarchical Navigable Small World) for most cases
- **Distance metric:** Cosine for normalized embeddings
- **Pre-filtering:** Apply metadata filters before vector search
- **Result diversity:** Use MMR (Maximal Marginal Relevance) to reduce redundancy

### Cost Optimization

- **Embedding model:** Consider text-embedding-3-small for budget constraints
- **Dimension reduction:** Use Matryoshka shortening (3072d → 1024d)
- **Caching:** Implement semantic caching for repeated queries
- **Batch operations:** Group insertions/updates for efficiency

Minimal sketches of the batching/caching and MMR patterns above follow.
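First, a sketch of the embedding-generation guidance (batching, content-hash caching, exponential backoff). It assumes the OpenAI Python SDK's `client.embeddings.create`; the in-memory cache and the `embed_batch` helper are illustrative, not part of any library:

```python
import hashlib
import time

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
_cache: dict[str, list[float]] = {}  # content hash -> embedding; use Redis/SQLite in production


def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed texts, skipping anything already cached by content hash."""
    keys = [hashlib.sha256(t.encode()).hexdigest() for t in texts]
    missing = [(k, t) for k, t in zip(keys, texts) if k not in _cache]
    for start in range(0, len(missing), 100):  # stay within the 100-500 per-batch guidance
        batch = missing[start : start + 100]
        for attempt in range(5):  # exponential backoff on transient failures
            try:
                resp = client.embeddings.create(model=model, input=[t for _, t in batch])
                break
            except Exception:  # narrow to openai.RateLimitError etc. in practice
                time.sleep(2 ** attempt)
        else:
            raise RuntimeError("embedding API kept failing after retries")
        for (key, _), item in zip(batch, resp.data):  # response order matches input order
            _cache[key] = item.embedding
    return [_cache[k] for k in keys]
```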
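Second, a compact sketch of the MMR re-ranking mentioned in the vector-search bullets. It assumes the query and document embeddings are L2-normalized numpy arrays (so dot products are cosine similarities); the function name and the λ = 0.7 default are illustrative:

```python
import numpy as np


def mmr(query_vec: np.ndarray, doc_vecs: np.ndarray, top_n: int = 5, lam: float = 0.7) -> list[int]:
    """Greedy Maximal Marginal Relevance: trade query relevance against redundancy."""
    relevance = doc_vecs @ query_vec  # cosine similarity to the query
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < top_n:
        if selected:
            # Each candidate's similarity to its closest already-selected doc
            redundancy = (doc_vecs[candidates] @ doc_vecs[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        scores = lam * relevance[candidates] - (1 - lam) * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected  # indices into doc_vecs, relevant-but-diverse first
```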
## Common Workflows

### 1. Building a RAG Chatbot

- Vector database: Qdrant (self-hosted or cloud)
- Embeddings: OpenAI text-embedding-3-large
- Chunking: 512 tokens, 50 overlap, semantic splitter
- Search: Hybrid (vector + BM25)
- Integration: Frontend with the ai-chat skill

See `examples/qdrant-python/` for a complete implementation.

### 2. Semantic Search Engine

- Vector database: Qdrant or Pinecone
- Embeddings: Voyage AI voyage-3 (best quality)
- Chunking: Content-type specific (see chunking-patterns.md)
- Search: Hybrid with re-ranking
- Filtering: Pre-filter by metadata (date, category, etc.)

### 3. Code Search

- Vector database: Qdrant
- Embeddings: OpenAI text-embedding-3-large
- Chunking: AST-based (function/class boundaries)
- Metadata: Language, file path, imports
- Search: Hybrid with language filtering

See `examples/qdrant-python/` for a code-specific implementation.

## Integration with Other Skills

### Frontend Skills

- **ai-chat**: Vector DB powers the RAG pipeline behind the chat interface
- **search-filter**: Replace keyword search with semantic search
- **data-viz**: Visualize embedding spaces and similarity scores

### Backend Skills

- **databases-relational**: Hybrid approach using the pgvector extension
- **api-patterns**: Expose semantic search via REST/GraphQL
- **observability**: Monitor embedding quality and retrieval metrics

## Multi-Language Support

### Python (Primary)

- Client: `qdrant-client`
- Framework: LangChain, LlamaIndex
- See: `examples/qdrant-python/`

### Rust

- Client: `qdrant-client` (1,549 code snippets in Context7)
- Framework: raw Rust for performance-critical systems
- See: `examples/rust-axum-vector/`

### TypeScript

- Client: `@qdrant/js-client-rest`
- Framework: LangChain.js, integration with Next.js
- See: `examples/typescript-rag/`

### Go

- Client: `github.com/qdrant/go-client`
- Use case: High-performance microservices

## Troubleshooting

### Poor Retrieval Quality

1. Check the chunking strategy (too large/small?)
2. Verify metadata filtering (too restrictive?)
3. Try hybrid search instead of vector-only
4. Implement a re-ranking stage
5. Evaluate with RAGAS metrics

### Slow Performance

1. Use an HNSW index (not Flat)
2. Pre-filter with metadata before vector search
3. Reduce vector dimensions (Matryoshka shortening)
4. Batch operations (insertions, searches)
5. Consider GPU acceleration (Milvus)

### High Costs

1. Switch to text-embedding-3-small
2. Implement semantic caching
3. Reduce chunk overlap
4. Use self-hosted embeddings (nomic, bge-m3)
5. Batch embedding generation

## Qdrant Context7 Documentation

**Primary resource:** `/llmstxt/qdrant_tech_llms-full_txt`

- **Trust score:** High
- **Code snippets:** 10,154
- **Quality score:** 83.1

Access via Context7:

```
resolve-library-id({ libraryName: "Qdrant" })
get-library-docs({
  context7CompatibleLibraryID: "/llmstxt/qdrant_tech_llms-full_txt",
  topic: "hybrid search collections python",
  mode: "code"
})
```

## Additional Resources

### Reference Documentation

- `references/qdrant.md` - Comprehensive Qdrant guide
- `references/pgvector.md` - PostgreSQL pgvector extension
- `references/milvus.md` - Milvus/Zilliz for billion-scale deployments
- `references/embedding-strategies.md` - Embedding model comparison
- `references/chunking-patterns.md` - Advanced chunking techniques

### Code Examples

- `examples/qdrant-python/` - FastAPI + Qdrant RAG pipeline
- `examples/pgvector-prisma/` - PostgreSQL + Prisma integration
- `examples/typescript-rag/` - TypeScript RAG with Hono

### Automation Scripts

- `scripts/generate_embeddings.py` - Batch embedding generation
- `scripts/benchmark_similarity.py` - Performance benchmarking
- `scripts/evaluate_rag.py` - RAGAS-based evaluation

---

**Next Steps:**

1. Choose a vector database based on scale and infrastructure
2. Select an embedding model based on the quality vs. cost trade-off
3. Implement a chunking strategy suited to the content type
4. Set up hybrid search for production quality
5. Evaluate with RAGAS metrics
6. Optimize for performance and cost