{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Week 3: Keyword Search First - The Critical Foundation\n", "\n", "> **The 90% Problem:** Most RAG systems jump straight to vector search and miss the foundation that powers the best retrieval systems. We're doing it right!\n", "\n", "## ESSENTIAL SETUP - Do This First!\n", "\n", "**Before running any cells, ensure your environment is properly configured:**\n", "\n", "```bash\n", "# 1. CRITICAL: Copy the environment configuration\n", "cp .env.example .env\n", "\n", "# 2. Verify these Week 3 settings are in your .env:\n", "# OPENSEARCH__HOST=http://opensearch:9200\n", "# OPENSEARCH__INDEX_NAME=arxiv-papers\n", "# ARXIV__MAX_RESULTS=15\n", "```\n", "\n", "**Important:** Week 3 requires the `.env` file for OpenSearch connectivity and service configuration. The defaults in `.env.example` work out of the box!\n", "\n", "**Why Keyword Search First?**\n", "- **Exact Match Power:** Find specific technical terms and paper IDs precisely\n", "- **Speed & Efficiency:** BM25 is fast and doesn't require expensive embedding models\n", "- **Interpretable:** You understand exactly why papers were retrieved\n", "- **Production Reality:** Search engines such as Elasticsearch and OpenSearch are built on keyword search\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Week 3: OpenSearch Integration & BM25 Search\n", "\n", "**What We're Building This Week:**\n", "\n", "Week 3 focuses on implementing OpenSearch integration for full-text search capabilities using BM25 scoring. 
This transforms our system from a simple storage solution into a searchable knowledge base.\n", "\n", "## Week 3 Focus Areas\n", "\n", "### Core Objectives\n", "- **OpenSearch Integration**: Connect our FastAPI application to OpenSearch cluster\n", "- **Index Management**: Create and manage the arxiv-papers index with proper mappings\n", "- **BM25 Search**: Implement full-text search with relevance scoring\n", "- **Data Pipeline**: Transfer papers from PostgreSQL to OpenSearch\n", "- **Search API**: Expose search functionality through REST endpoints\n", "\n", "### What We'll Test In This Notebook\n", "1. **Infrastructure Verification** - Ensure all services from Week 1-2 are running\n", "2. **OpenSearch Service Integration** - Test client creation and health checks\n", "3. **Index Creation & Management** - Create arxiv-papers index with proper mappings\n", "4. **Data Pipeline** - Transfer papers from PostgreSQL to OpenSearch\n", "5. **BM25 Search Functionality** - Test search queries with relevance scoring\n", "6. 
**Search API Endpoints** - Verify FastAPI search endpoints work correctly\n", "\n", "### Success Metrics\n", "- OpenSearch cluster healthy and accessible\n", "- arxiv-papers index created with proper mappings\n", "- Papers successfully indexed from PostgreSQL\n", "- BM25 search returns relevant results with scores\n", "- Search API endpoints respond correctly\n", "- All components ready for production use\n", "\n", "---\n", "\n", "## Week 3 Component Status\n", "| Component | Purpose | Status |\n", "|-----------|---------|--------|\n", "| **OpenSearch Client** | Connect to OpenSearch cluster | ✅ Complete |\n", "| **Index Management** | Create and manage search indices | ✅ Complete |\n", "| **Query Builder** | Build complex search queries | ✅ Complete |\n", "| **Data Pipeline** | Transfer papers to OpenSearch | ✅ Complete |\n", "| **Search API** | REST endpoints for search | ✅ Complete |\n", "| **BM25 Scoring** | Relevance-based search results | ✅ Complete |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## IMPORTANT: Week 3 Docker Services Restart\n", "\n", "**NEW USERS OR INTEGRATION CONFLICTS**: Week 3 introduces OpenSearch integration that requires fresh container state. Use this clean restart approach:\n", "\n", "### Fresh Start (Recommended for Week 3)\n", "```bash\n", "# Complete clean slate - removes all data but ensures correct OpenSearch state\n", "docker compose down -v\n", "\n", "# Build fresh containers with latest code\n", "docker compose up --build -d\n", "```\n", "\n", "**When to use this:**\n", "- First time running Week 3 \n", "- OpenSearch connection issues\n", "- Index conflicts or mapping errors\n", "- Want to start with clean OpenSearch state\n", "\n", "**Note**: This destroys existing data but ensures you have the correct Week 3 configuration with proper OpenSearch integration.\n", "\n", "---\n", "\n", "## Prerequisites Check\n", "\n", "**Before starting:**\n", "1. Week 1 infrastructure completed\n", "2. 
Week 2 arXiv integration working\n", "3. UV environment activated\n", "4. Docker Desktop running\n", "5. Some papers already in PostgreSQL from Week 2\n", "\n", "**Why fresh containers?** Week 3 includes OpenSearch integration that requires proper cluster initialization and may conflict with existing index states.\n", "\n", "**Service Access Points:**\n", "- **FastAPI**: http://localhost:8000/docs (API documentation)\n", "- **PostgreSQL**: via API or `docker exec -it rag-postgres psql -U rag_user -d rag_db`\n", "- **OpenSearch**: http://localhost:9200/_cluster/health\n", "- **Ollama**: http://localhost:11434 (LLM service)\n", "- **Airflow**: http://localhost:8080 (Username: `admin`, Password: `admin`)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Environment Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Environment Setup and Path Configuration\n", "import sys\n", "from pathlib import Path\n", "import json\n", "import requests\n", "\n", "print(f\"Python Version: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}\")\n", "print(f\"Environment: {sys.executable}\")\n", "\n", "# Find project root and add to Python path\n", "current_dir = Path.cwd()\n", "if current_dir.name == \"week3\" and current_dir.parent.name == \"notebooks\":\n", " project_root = current_dir.parent.parent\n", "elif (current_dir / \"compose.yml\").exists():\n", " project_root = current_dir\n", "else:\n", " project_root = None\n", "\n", "if project_root and (project_root / \"compose.yml\").exists():\n", " print(f\"Project root: {project_root}\")\n", " sys.path.insert(0, str(project_root))\n", "else:\n", " print(\"Missing compose.yml - check directory\")\n", " exit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Infrastructure Verification" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Service Health Verification\n", "print(\"WEEK 3 PREREQUISITE CHECK\")\n", "print(\"=\" * 50)\n", "\n", "services_to_test = {\n", " \"FastAPI\": \"http://localhost:8000/api/v1/health\",\n", " \"PostgreSQL (via API)\": \"http://localhost:8000/api/v1/health\", \n", " \"OpenSearch\": \"http://localhost:9200/_cluster/health\",\n", " \"Airflow\": \"http://localhost:8080/health\" \n", "}\n", "\n", "all_healthy = True\n", "\n", "for service_name, url in services_to_test.items():\n", " try:\n", " response = requests.get(url, timeout=5)\n", " if response.status_code == 200:\n", " print(f\"✓ {service_name}: Healthy\")\n", " else:\n", " print(f\"✗ {service_name}: HTTP {response.status_code}\")\n", " all_healthy = False\n", " except requests.exceptions.ConnectionError:\n", " print(f\"✗ {service_name}: Not accessible\")\n", " all_healthy = False\n", " except Exception as e:\n", " print(f\"✗ {service_name}: {type(e).__name__}\")\n", " all_healthy = False\n", "\n", "print()\n", "if all_healthy:\n", " print(\"All services healthy! Ready for Week 3 OpenSearch integration.\")\n", "else:\n", " print(\"Some services need attention. Please run: docker compose up --build\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
OpenSearch Client Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# OpenSearch Client Setup\n", "from src.services.opensearch.factory import make_opensearch_client\n", "from opensearchpy import OpenSearch\n", "\n", "print(\"OPENSEARCH CLIENT SETUP\")\n", "print(\"=\" * 40)\n", "\n", "# Create OpenSearch client using factory pattern\n", "opensearch_client = make_opensearch_client()\n", "\n", "# Override for notebook execution (localhost instead of container hostname)\n", "opensearch_client.host = \"http://localhost:9200\"\n", "opensearch_client.client = OpenSearch(\n", " hosts=[\"http://localhost:9200\"],\n", " http_compress=True,\n", " use_ssl=False,\n", " verify_certs=False,\n", " ssl_assert_hostname=False,\n", " ssl_show_warn=False,\n", ")\n", "\n", "print(f\"Client configured with host: {opensearch_client.host}\")\n", "print(f\"Index name: {opensearch_client.index_name}\")\n", "\n", "# Test health check\n", "is_healthy = opensearch_client.health_check()\n", "if is_healthy:\n", " print(\"✓ OpenSearch health check: PASSED\")\n", "else:\n", " print(\"✗ OpenSearch health check: FAILED\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Index Configuration" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Display Index Configuration\n", "from src.services.opensearch.index_config import ARXIV_PAPERS_INDEX, ARXIV_PAPERS_MAPPING\n", "\n", "print(\"INDEX CONFIGURATION\")\n", "print(\"=\" * 40)\n", "print(f\"Index Name: {ARXIV_PAPERS_INDEX}\")\n", "print(f\"\\nKey Features:\")\n", "print(\"• Custom text analyzers for better search\")\n", "print(\"• Multi-field mapping (text + keyword)\")\n", "print(\"• 10 specialized fields for papers\")\n", "print(\"\\nField Types:\")\n", "\n", "properties = ARXIV_PAPERS_MAPPING[\"mappings\"][\"properties\"]\n", "for field_name, config in properties.items():\n", " field_type = config.get(\"type\")\n", " analyzer = 
config.get(\"analyzer\", \"\")\n", " if analyzer:\n", " print(f\" • {field_name}: {field_type} [{analyzer}]\")\n", " else:\n", " print(f\" • {field_name}: {field_type}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create Index if it doesn't exist\n", "print(\"INDEX CREATION\")\n", "print(\"=\" * 40)\n", "\n", "try:\n", " # Check if index already exists\n", " index_exists = opensearch_client.client.indices.exists(index=opensearch_client.index_name)\n", " \n", " if index_exists:\n", " print(f\"✓ Index '{opensearch_client.index_name}' already exists\")\n", " \n", " # Get current index statistics\n", " stats = opensearch_client.get_index_stats()\n", " if stats and 'error' not in stats:\n", " print(f\"\\nCurrent Statistics:\")\n", " print(f\" Documents: {stats.get('document_count', 0)}\")\n", " print(f\" Size: {stats.get('size_in_bytes', 0):,} bytes\")\n", " else:\n", " print(f\"Creating new index: {opensearch_client.index_name}\")\n", " \n", " # Create the index with our custom mapping\n", " success = opensearch_client.create_index()\n", " \n", " if success:\n", " print(f\"✓ Index created successfully!\")\n", " else:\n", " print(f\"✗ Index creation failed\")\n", " \n", "except Exception as e:\n", " print(f\"✗ Error with index management: {e}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Data Pipeline - Run Airflow DAG\n", "\n", "The **arxiv_paper_ingestion** DAG automatically:\n", "1. Fetches papers from arXiv API\n", "2. Stores papers in PostgreSQL\n", "3. **Indexes papers into OpenSearch**\n", "\n", "### Instructions:\n", "\n", "**Before proceeding, run the Airflow DAG:**\n", "\n", "1. Open Airflow UI: http://localhost:8080\n", "2. Login: username `admin`, password `admin`\n", "3. Find **`arxiv_paper_ingestion`** DAG\n", "4. Click the DAG name to open it\n", "5. 
Click **\"Trigger DAG\"** button (▶️ play icon)\n", "6. Wait ~10 minutes for completion\n", "7. Check that all tasks turn green\n", "\n", "Then run the cell below to verify:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Verify Data Pipeline Results\n", "print(\"VERIFYING DATA PIPELINE\")\n", "print(\"=\" * 40)\n", "\n", "stats = opensearch_client.get_index_stats()\n", "\n", "if stats and 'error' not in stats:\n", " doc_count = stats.get('document_count', 0)\n", " \n", " if doc_count > 0:\n", " print(f\"✓ Success! Found {doc_count} documents in OpenSearch\")\n", " \n", " # Show sample papers\n", " sample = opensearch_client.search_papers(\"*\", size=3)\n", " if sample.get('hits'):\n", " print(f\"\\nSample papers:\")\n", " for i, paper in enumerate(sample['hits'], 1):\n", " title = paper.get('title', 'Unknown')[:60]\n", " print(f\" {i}. {title}...\")\n", " else:\n", " print(\"⚠️ No documents in OpenSearch yet\")\n", " print(\"\\nPlease run the Airflow DAG first (see instructions above)\")\n", "else:\n", " print(\"✗ Could not retrieve index stats\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Simple BM25 Search\n", "\n", "Let's start with a simple search to demonstrate BM25 scoring:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Simple BM25 Search\n", "print(\"SIMPLE BM25 SEARCH\")\n", "print(\"=\" * 40)\n", "\n", "# Change this to any word from your papers\n", "search_term = \"learning\" # Try different terms!\n", "\n", "print(f\"Searching for: '{search_term}'\\n\")\n", "\n", "results = opensearch_client.search_papers(\n", " query=search_term,\n", " size=5\n", ")\n", "\n", "if results.get('hits'):\n", " print(f\"Found {results.get('total', 0)} total matches\\n\")\n", " \n", " for i, paper in enumerate(results['hits'], 1):\n", " print(f\"{i}. 
{paper.get('title', 'Unknown')[:70]}...\")\n", " print(f\" Score: {paper.get('score', 0):.2f}\")\n", " print(f\" arXiv ID: {paper.get('arxiv_id', 'N/A')}\\n\")\n", "else:\n", " print(\"No results found. Try searching for:\")\n", " print(\" • 'neural', 'model', 'algorithm'\")\n", " print(\" • Use '*' to see all papers\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Advanced OpenSearch Queries\n", "\n", "Now let's explore different query types using the OpenSearch Python client directly. This shows the power of BM25 without needing vectors!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.1 Match Query\n", "\n", "The `match` query is the standard query for full-text search on a single field:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Match Query - Search in title field\n", "print(\"MATCH QUERY - Single Field Search\")\n", "print(\"=\" * 40)\n", "\n", "query = {\n", " \"query\": {\n", " \"match\": {\n", " \"title\": \"machine learning\"\n", " }\n", " },\n", " \"size\": 3\n", "}\n", "\n", "response = opensearch_client.client.search(\n", " index=opensearch_client.index_name,\n", " body=query\n", ")\n", "\n", "print(f\"Found {response['hits']['total']['value']} results\\n\")\n", "\n", "for hit in response['hits']['hits']:\n", " print(f\"Title: {hit['_source']['title'][:70]}...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2 Multi-Match Query\n", "\n", "Search across multiple fields simultaneously:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Multi-Match Query - Search across multiple fields\n", "print(\"MULTI-MATCH QUERY - Search Multiple Fields\")\n", "print(\"=\" * 40)\n", "\n", "query = {\n", " \"query\": {\n", " \"multi_match\": {\n", " \"query\": \"AI Agents\",\n", " \"fields\": [\"title^2\", \"abstract\", \"authors\"], # ^2 boosts title field\n", " \"type\": \"best_fields\"\n", " }\n", " 
},\n", " \"size\": 3\n", "}\n", "\n", "response = opensearch_client.client.search(\n", " index=opensearch_client.index_name,\n", " body=query\n", ")\n", "\n", "print(f\"Found {response['hits']['total']['value']} results\\n\")\n", "\n", "for hit in response['hits']['hits']:\n", " print(f\"Title: {hit['_source']['title'][:70]}...\")\n", " print(f\"Score: {hit['_score']:.2f}\")\n", " print(f\"Authors: {', '.join(hit['_source']['authors'][:2])}...\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3 Boosting Query\n", "\n", "Boost certain results while demoting others:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Boosting Query - Promote and demote results\n", "print(\"BOOSTING QUERY - Promote/Demote Results\")\n", "print(\"=\" * 40)\n", "\n", "query = {\n", " \"query\": {\n", " \"boosting\": {\n", " \"positive\": {\n", " \"match\": {\n", " \"abstract\": \"deep learning\"\n", " }\n", " },\n", " \"negative\": {\n", " \"match\": {\n", " \"abstract\": \"multimodal\"\n", " }\n", " },\n", " \"negative_boost\": 0.1 # Reduce score of negative matches\n", " }\n", " },\n", " \"size\": 3\n", "}\n", "\n", "response = opensearch_client.client.search(\n", " index=opensearch_client.index_name,\n", " body=query\n", ")\n", "\n", "print(f\"Query: Boost 'deep learning', demote 'multimodal' papers\\n\")\n", "print(f\"Found {response['hits']['total']['value']} results\\n\")\n", "\n", "for hit in response['hits']['hits']:\n", " title = hit['_source']['title'][:70]\n", " abstract_snippet = hit['_source']['abstract'][:100]\n", " print(f\"Title: {title}...\")\n", " print(f\"Score: {hit['_score']:.2f}\")\n", " print(f\"Abstract: {abstract_snippet}...\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.4 Filter Query\n", "\n", "Filter results by specific criteria (doesn't affect scoring):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Filter Query 
- Filter by categories\n", "print(\"FILTER QUERY - Category Filtering\")\n", "print(\"=\" * 40)\n", "\n", "query = {\n", " \"query\": {\n", " \"bool\": {\n", " \"must\": [\n", " {\n", " \"match\": {\n", " \"abstract\": \"neural\"\n", " }\n", " }\n", " ],\n", " \"filter\": [\n", " {\n", " \"terms\": {\n", " \"categories\": [\"cs.AI\"]\n", " }\n", " }\n", " ]\n", " }\n", " },\n", " \"size\": 3\n", "}\n", "\n", "response = opensearch_client.client.search(\n", " index=opensearch_client.index_name,\n", " body=query\n", ")\n", "\n", "print(f\"Found {response['hits']['total']['value']} results\\n\")\n", "\n", "for hit in response['hits']['hits']:\n", " title = hit['_source']['title'][:70]\n", " categories = ', '.join(hit['_source']['categories'])\n", " print(f\"Title: {title}...\")\n", " print(f\"Categories: {categories}\")\n", " print(f\"Score: {hit['_score']:.2f}\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.5 Sorting Query\n", "\n", "Sort results by different criteria:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sorting Query - Sort by publication date\n", "print(\"SORTING QUERY - Latest Papers First\")\n", "print(\"=\" * 40)\n", "\n", "query = {\n", " \"query\": {\n", " \"match_all\": {} # Get all papers\n", " },\n", " \"sort\": [\n", " {\n", " \"published_date\": {\n", " \"order\": \"desc\" # Latest first\n", " }\n", " }\n", " ],\n", " \"size\": 5\n", "}\n", "\n", "response = opensearch_client.client.search(\n", " index=opensearch_client.index_name,\n", " body=query\n", ")\n", "\n", "print(f\"Query: All papers sorted by publication date (newest first)\\n\")\n", "\n", "for hit in response['hits']['hits']:\n", " title = hit['_source']['title'][:70]\n", " pub_date = hit['_source']['published_date'][:10]\n", " print(f\"Date: {pub_date} | {title}...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.6 Combined Query\n", "\n", "Combine multiple query types for complex 
searches:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Combined Query - Complex search with multiple criteria\n", "print(\"COMBINED QUERY - Complex Search\")\n", "print(\"=\" * 40)\n", "\n", "query = {\n", " \"query\": {\n", " \"bool\": {\n", " \"must\": [\n", " {\n", " \"multi_match\": {\n", " \"query\": \"transformer\",\n", " \"fields\": [\"title^3\", \"abstract\"],\n", " \"type\": \"best_fields\"\n", " }\n", " }\n", " ],\n", " \"filter\": [\n", " {\n", " \"range\": {\n", " \"published_date\": {\n", " \"gte\": \"2024-01-01\"\n", " }\n", " }\n", " }\n", " ],\n", " \"should\": [\n", " {\n", " \"match\": {\n", " \"categories\": \"cs.AI\"\n", " }\n", " }\n", " ]\n", " }\n", " },\n", " \"sort\": [\n", " \"_score\",\n", " {\"published_date\": {\"order\": \"desc\"}}\n", " ],\n", " \"size\": 3\n", "}\n", "\n", "response = opensearch_client.client.search(\n", " index=opensearch_client.index_name,\n", " body=query\n", ")\n", "\n", "print(f\"Complex Query:\")\n", "print(f\" • Must contain 'transformer' (title boosted 3x)\")\n", "print(f\" • Filter: published on or after 2024-01-01\")\n", "print(f\" • Prefer: cs.AI category\")\n", "print(f\" • Sort: by relevance, then date\\n\")\n", "\n", "print(f\"Found {response['hits']['total']['value']} results\\n\")\n", "\n", "for hit in response['hits']['hits']:\n", " title = hit['_source']['title'][:70]\n", " pub_date = hit['_source']['published_date'][:10]\n", " score = hit['_score']\n", " categories = ', '.join(hit['_source']['categories'][:2])\n", " \n", " print(f\"Title: {title}...\")\n", " print(f\" Date: {pub_date} | Score: {score:.2f}\")\n", " print(f\" Categories: {categories}\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "### What We Demonstrated\n", "\n", "**BM25 Search is Powerful!** Without any vector embeddings, we can:\n", "\n", "1. **Simple Search**: Basic keyword search with relevance scoring\n", "2. 
**Match Queries**: Search specific fields\n", "3. **Multi-Match**: Search across multiple fields with boosting\n", "4. **Boosting**: Promote or demote certain results\n", "5. **Filtering**: Apply filters without affecting scores\n", "6. **Sorting**: Order results by date, score, or other fields\n", "7. **Complex Queries**: Combine all techniques for sophisticated searches\n", "\n", "### Key Takeaways\n", "\n", "- **BM25 works great** for many search use cases\n", "- **No vectors needed** for effective full-text search\n", "- **Simple and fast** compared to embedding-based approaches\n", "- **Filters and sorting** make searches precise and relevant\n", "- **Field boosting** helps prioritize important content\n", "\n", "### When to Use BM25 vs Vectors\n", "\n", "**Use BM25 when:**\n", "- Searching for specific keywords or phrases\n", "- You need a fast, simple implementation\n", "- You have good text fields with clear terminology\n", "- You want explainable search results\n", "\n", "**Consider vectors when:**\n", "- You need semantic similarity (concepts, not keywords)\n", "- Dealing with synonyms and paraphrasing\n", "- You have cross-language search requirements\n", "- Queries or documents are very short\n", "\n", "Remember: **You can also combine both** (hybrid search) for best results!\n", "We will cover this next week :)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.11" } }, "nbformat": 4, "nbformat_minor": 4 }