{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Week 3: Keyword Search First - The Critical Foundation\n",
    "\n",
    "> **The 90% Problem:** Many RAG systems jump straight to vector search and skip the keyword-search foundation that the best retrieval systems are built on. We're doing it right!\n",
    "\n",
    "## ESSENTIAL SETUP - Do This First!\n",
    "\n",
    "**Before running any cells, ensure your environment is properly configured:**\n",
    "\n",
    "```bash\n",
    "# 1. CRITICAL: Copy the environment configuration\n",
    "cp .env.example .env\n",
    "\n",
    "# 2. Verify these Week 3 settings are in your .env:\n",
    "# OPENSEARCH__HOST=http://opensearch:9200\n",
    "# OPENSEARCH__INDEX_NAME=arxiv-papers\n",
    "# ARXIV__MAX_RESULTS=15\n",
    "```\n",
    "\n",
    "**Important:** Week 3 requires the `.env` file for OpenSearch connectivity and service configuration. The defaults in `.env.example` work perfectly out of the box!\n",
    "\n",
    "**Why Keyword Search First?**\n",
    "- **Exact Match Power:** Find specific technical terms and paper IDs precisely\n",
    "- **Speed & Efficiency:** BM25 is fast and doesn't require expensive embedding models\n",
    "- **Interpretable:** You understand exactly why papers were retrieved\n",
    "- **Production Reality:** Search engines like Elasticsearch and OpenSearch use BM25 keyword scoring as their default ranking function\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Week 3: OpenSearch Integration & BM25 Search\n",
    "\n",
    "**What We're Building This Week:**\n",
    "\n",
    "Week 3 focuses on implementing OpenSearch integration for full-text search capabilities using BM25 scoring. This transforms our system from a simple storage solution into a searchable knowledge base.\n",
    "\n",
    "## Week 3 Focus Areas\n",
    "\n",
    "### Core Objectives\n",
    "- **OpenSearch Integration**: Connect our FastAPI application to the OpenSearch cluster\n",
    "- **Index Management**: Create and manage the arxiv-papers index with proper mappings\n",
    "- **BM25 Search**: Implement full-text search with relevance scoring\n",
    "- **Data Pipeline**: Transfer papers from PostgreSQL to OpenSearch\n",
    "- **Search API**: Expose search functionality through REST endpoints\n",
    "\n",
    "### What We'll Test In This Notebook\n",
    "1. **Infrastructure Verification** - Ensure all services from Weeks 1-2 are running\n",
    "2. **OpenSearch Service Integration** - Test client creation and health checks\n",
    "3. **Index Creation & Management** - Create arxiv-papers index with proper mappings\n",
    "4. **Data Pipeline** - Transfer papers from PostgreSQL to OpenSearch\n",
    "5. **BM25 Search Functionality** - Test search queries with relevance scoring\n",
    "6. **Search API Endpoints** - Verify FastAPI search endpoints work correctly\n",
    "\n",
    "### Success Metrics\n",
    "- OpenSearch cluster healthy and accessible\n",
    "- arxiv-papers index created with proper mappings\n",
    "- Papers successfully indexed from PostgreSQL\n",
    "- BM25 search returns relevant results with scores\n",
    "- Search API endpoints respond correctly\n",
    "- All components ready for production use\n",
    "\n",
    "---\n",
    "\n",
    "## Week 3 Component Status\n",
    "| Component | Purpose | Status |\n",
    "|-----------|---------|--------|\n",
    "| **OpenSearch Client** | Connect to OpenSearch cluster | ✅ Complete |\n",
    "| **Index Management** | Create and manage search indices | ✅ Complete |\n",
    "| **Query Builder** | Build complex search queries | ✅ Complete |\n",
    "| **Data Pipeline** | Transfer papers to OpenSearch | ✅ Complete |\n",
    "| **Search API** | REST endpoints for search | ✅ Complete |\n",
    "| **BM25 Scoring** | Relevance-based search results | ✅ Complete |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## IMPORTANT: Week 3 Docker Services Restart\n",
    "\n",
    "**NEW USERS OR INTEGRATION CONFLICTS**: Week 3 introduces OpenSearch integration that requires fresh container state. Use this clean restart approach:\n",
    "\n",
    "### Fresh Start (Recommended for Week 3)\n",
    "```bash\n",
    "# Complete clean slate - removes all data but ensures correct OpenSearch state\n",
    "docker compose down -v\n",
    "\n",
    "# Build fresh containers with latest code\n",
    "docker compose up --build -d\n",
    "```\n",
    "\n",
    "**When to use this:**\n",
    "- First time running Week 3 \n",
    "- OpenSearch connection issues\n",
    "- Index conflicts or mapping errors\n",
    "- Want to start with clean OpenSearch state\n",
    "\n",
    "**Note**: This destroys existing data but ensures you have the correct Week 3 configuration with proper OpenSearch integration.\n",
    "\n",
    "---\n",
    "\n",
    "## Prerequisites Check\n",
    "\n",
    "**Before starting:**\n",
    "1. Week 1 infrastructure completed\n",
    "2. Week 2 arXiv integration working\n",
    "3. UV environment activated\n",
    "4. Docker Desktop running\n",
    "5. Some papers already in PostgreSQL from Week 2\n",
    "\n",
    "**Why fresh containers?** Week 3 includes OpenSearch integration that requires proper cluster initialization and may conflict with existing index states.\n",
    "\n",
    "**Service Access Points:**\n",
    "- **FastAPI**: http://localhost:8000/docs (API documentation)\n",
    "- **PostgreSQL**: via API or `docker exec -it rag-postgres psql -U rag_user -d rag_db`\n",
    "- **OpenSearch**: http://localhost:9200/_cluster/health\n",
    "- **Ollama**: http://localhost:11434 (LLM service)\n",
    "- **Airflow**: http://localhost:8080 (Username: `admin`, Password: `admin`)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Environment Setup and Path Configuration\n",
    "import sys\n",
    "from pathlib import Path\n",
    "import json\n",
    "import requests\n",
    "\n",
    "print(f\"Python Version: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}\")\n",
    "print(f\"Environment: {sys.executable}\")\n",
    "\n",
    "# Find project root and add to Python path\n",
    "current_dir = Path.cwd()\n",
    "if current_dir.name == \"week3\" and current_dir.parent.name == \"notebooks\":\n",
    "    project_root = current_dir.parent.parent\n",
    "elif (current_dir / \"compose.yml\").exists():\n",
    "    project_root = current_dir\n",
    "else:\n",
    "    project_root = None\n",
    "\n",
    "if project_root and (project_root / \"compose.yml\").exists():\n",
    "    print(f\"Project root: {project_root}\")\n",
    "    sys.path.insert(0, str(project_root))\n",
    "else:\n",
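    "    # Raise instead of calling exit(), which would stop the notebook kernel\n",
    "    raise FileNotFoundError(\"Missing compose.yml - run this notebook from the project root or notebooks/week3\")"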
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Infrastructure Verification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Service Health Verification\n",
    "print(\"WEEK 3 PREREQUISITE CHECK\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "services_to_test = {\n",
    "    \"FastAPI\": \"http://localhost:8000/api/v1/health\",\n",
    "    \"PostgreSQL (via API)\": \"http://localhost:8000/api/v1/health\", \n",
    "    \"OpenSearch\": \"http://localhost:9200/_cluster/health\",\n",
    "    \"Airflow\": \"http://localhost:8080/health\"  \n",
    "}\n",
    "\n",
    "all_healthy = True\n",
    "\n",
    "for service_name, url in services_to_test.items():\n",
    "    try:\n",
    "        response = requests.get(url, timeout=5)\n",
    "        if response.status_code == 200:\n",
    "            print(f\"✓ {service_name}: Healthy\")\n",
    "        else:\n",
    "            print(f\"✗ {service_name}: HTTP {response.status_code}\")\n",
    "            all_healthy = False\n",
    "    except requests.exceptions.ConnectionError:\n",
    "        print(f\"✗ {service_name}: Not accessible\")\n",
    "        all_healthy = False\n",
    "    except Exception as e:\n",
    "        print(f\"✗ {service_name}: {type(e).__name__}\")\n",
    "        all_healthy = False\n",
    "\n",
    "print()\n",
    "if all_healthy:\n",
    "    print(\"All services healthy! Ready for Week 3 OpenSearch integration.\")\n",
    "else:\n",
    "    print(\"Some services need attention. Please run: docker compose up --build -d\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. OpenSearch Client Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# OpenSearch Client Setup\n",
    "from src.services.opensearch.factory import make_opensearch_client\n",
    "from opensearchpy import OpenSearch\n",
    "\n",
    "print(\"OPENSEARCH CLIENT SETUP\")\n",
    "print(\"=\" * 40)\n",
    "\n",
    "# Create OpenSearch client using factory pattern\n",
    "opensearch_client = make_opensearch_client()\n",
    "\n",
    "# Override for notebook execution (localhost instead of container hostname)\n",
    "opensearch_client.host = \"http://localhost:9200\"\n",
    "opensearch_client.client = OpenSearch(\n",
    "    hosts=[\"http://localhost:9200\"],\n",
    "    http_compress=True,\n",
    "    use_ssl=False,\n",
    "    verify_certs=False,\n",
    "    ssl_assert_hostname=False,\n",
    "    ssl_show_warn=False,\n",
    ")\n",
    "\n",
    "print(f\"Client configured with host: {opensearch_client.host}\")\n",
    "print(f\"Index name: {opensearch_client.index_name}\")\n",
    "\n",
    "# Test health check\n",
    "is_healthy = opensearch_client.health_check()\n",
    "if is_healthy:\n",
    "    print(\"✓ OpenSearch health check: PASSED\")\n",
    "else:\n",
    "    print(\"✗ OpenSearch health check: FAILED\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Index Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display Index Configuration\n",
    "from src.services.opensearch.index_config import ARXIV_PAPERS_INDEX, ARXIV_PAPERS_MAPPING\n",
    "\n",
    "print(\"INDEX CONFIGURATION\")\n",
    "print(\"=\" * 40)\n",
    "print(f\"Index Name: {ARXIV_PAPERS_INDEX}\")\n",
    "properties = ARXIV_PAPERS_MAPPING[\"mappings\"][\"properties\"]\n",
    "\n",
    "print(\"\\nKey Features:\")\n",
    "print(\"• Custom text analyzers for better search\")\n",
    "print(\"• Multi-field mapping (text + keyword)\")\n",
    "# Count the fields from the mapping itself instead of hard-coding the number\n",
    "print(f\"• {len(properties)} top-level fields for papers\")\n",
    "print(\"\\nField Types:\")\n",
    "\n",
    "for field_name, config in properties.items():\n",
    "    field_type = config.get(\"type\")\n",
    "    analyzer = config.get(\"analyzer\", \"\")\n",
    "    if analyzer:\n",
    "        print(f\"  • {field_name}: {field_type} [{analyzer}]\")\n",
    "    else:\n",
    "        print(f\"  • {field_name}: {field_type}\")"
   ]
  },
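  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The \"multi-field mapping (text + keyword)\" feature printed above can be illustrated with a minimal sketch. The field choices, the `standard` analyzer, and the `ignore_above` value are assumptions picked for illustration; the project's real mapping lives in `ARXIV_PAPERS_MAPPING` and may differ.\n",
    "\n",
    "```python\n",
    "# Hypothetical mapping fragment, NOT the project's actual configuration\n",
    "example_mapping = {\n",
    "    \"mappings\": {\n",
    "        \"properties\": {\n",
    "            # Analyzed for full-text (BM25) search; the sub-field keeps the exact string\n",
    "            \"title\": {\n",
    "                \"type\": \"text\",\n",
    "                \"analyzer\": \"standard\",\n",
    "                \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "            },\n",
    "            # Exact values only: suited to term filters such as categories == \"cs.AI\"\n",
    "            \"categories\": {\"type\": \"keyword\"},\n",
    "            # Date type enables range filters and chronological sorting\n",
    "            \"published_date\": {\"type\": \"date\"},\n",
    "        }\n",
    "    }\n",
    "}\n",
    "```\n",
    "\n",
    "With a mapping shaped like this, `match` queries would target `title`, while exact-value filters, sorting, and aggregations would target `title.keyword`."
   ]
  },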
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create Index if it doesn't exist\n",
    "print(\"INDEX CREATION\")\n",
    "print(\"=\" * 40)\n",
    "\n",
    "try:\n",
    "    # Check if index already exists\n",
    "    index_exists = opensearch_client.client.indices.exists(index=opensearch_client.index_name)\n",
    "    \n",
    "    if index_exists:\n",
    "        print(f\"✓ Index '{opensearch_client.index_name}' already exists\")\n",
    "        \n",
    "        # Get current index statistics\n",
    "        stats = opensearch_client.get_index_stats()\n",
    "        if stats and 'error' not in stats:\n",
    "            print(f\"\\nCurrent Statistics:\")\n",
    "            print(f\"   Documents: {stats.get('document_count', 0)}\")\n",
    "            print(f\"   Size: {stats.get('size_in_bytes', 0):,} bytes\")\n",
    "    else:\n",
    "        print(f\"Creating new index: {opensearch_client.index_name}\")\n",
    "        \n",
    "        # Create the index with our custom mapping\n",
    "        success = opensearch_client.create_index()\n",
    "        \n",
    "        if success:\n",
    "            print(f\"✓ Index created successfully!\")\n",
    "        else:\n",
    "            print(f\"✗ Index creation failed\")\n",
    "            \n",
    "except Exception as e:\n",
    "    print(f\"✗ Error with index management: {e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Data Pipeline - Run Airflow DAG\n",
    "\n",
    "The **arxiv_paper_ingestion** DAG automatically:\n",
    "1. Fetches papers from arXiv API\n",
    "2. Stores papers in PostgreSQL\n",
    "3. **Indexes papers into OpenSearch**\n",
    "\n",
    "### Instructions:\n",
    "\n",
    "**Before proceeding, run the Airflow DAG:**\n",
    "\n",
    "1. Open Airflow UI: http://localhost:8080\n",
    "2. Login: username `admin`, password `admin`\n",
    "3. Find **`arxiv_paper_ingestion`** DAG\n",
    "4. Click the DAG name to open it\n",
    "5. Click **\"Trigger DAG\"** button (▶️ play icon)\n",
    "6. Wait ~10 minutes for completion\n",
    "7. Check that all tasks turn green\n",
    "\n",
    "Then run the cell below to verify:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Verify Data Pipeline Results\n",
    "print(\"VERIFYING DATA PIPELINE\")\n",
    "print(\"=\" * 40)\n",
    "\n",
    "stats = opensearch_client.get_index_stats()\n",
    "\n",
    "if stats and 'error' not in stats:\n",
    "    doc_count = stats.get('document_count', 0)\n",
    "    \n",
    "    if doc_count > 0:\n",
    "        print(f\"✓ Success! Found {doc_count} documents in OpenSearch\")\n",
    "        \n",
    "        # Show sample papers\n",
    "        sample = opensearch_client.search_papers(\"*\", size=3)\n",
    "        if sample.get('hits'):\n",
    "            print(f\"\\nSample papers:\")\n",
    "            for i, paper in enumerate(sample['hits'], 1):\n",
    "                title = paper.get('title', 'Unknown')[:60]\n",
    "                print(f\"  {i}. {title}...\")\n",
    "    else:\n",
    "        print(\"⚠️  No documents in OpenSearch yet\")\n",
    "        print(\"\\nPlease run the Airflow DAG first (see instructions above)\")\n",
    "else:\n",
    "    print(\"✗ Could not retrieve index stats\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Simple BM25 Search\n",
    "\n",
    "Let's start with a simple search to demonstrate BM25 scoring:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simple BM25 Search\n",
    "print(\"SIMPLE BM25 SEARCH\")\n",
    "print(\"=\" * 40)\n",
    "\n",
    "# Change this to any word from your papers\n",
    "search_term = \"learning\"  # Try different terms!\n",
    "\n",
    "print(f\"Searching for: '{search_term}'\\n\")\n",
    "\n",
    "results = opensearch_client.search_papers(\n",
    "    query=search_term,\n",
    "    size=5\n",
    ")\n",
    "\n",
    "if results.get('hits'):\n",
    "    print(f\"Found {results.get('total', 0)} total matches\\n\")\n",
    "    \n",
    "    for i, paper in enumerate(results['hits'], 1):\n",
    "        print(f\"{i}. {paper.get('title', 'Unknown')[:70]}...\")\n",
    "        print(f\"   Score: {paper.get('score', 0):.2f}\")\n",
    "        print(f\"   arXiv ID: {paper.get('arxiv_id', 'N/A')}\\n\")\n",
    "else:\n",
    "    print(\"No results found. Try searching for:\")\n",
    "    print(\"  • 'neural', 'model', 'algorithm'\")\n",
    "    print(\"  • Use '*' to see all papers\")"
   ]
  },
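  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the `Score` values above less opaque, here is a minimal pure-Python sketch of the BM25 formula. It assumes whitespace tokenization and a three-document toy corpus (both invented for illustration) and uses the common defaults `k1=1.2` and `b=0.75`. It ignores details of the real Lucene implementation, such as per-shard statistics and compressed length norms, so its numbers will not match OpenSearch exactly.\n",
    "\n",
    "```python\n",
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "\n",
    "def bm25_scores(query_terms, docs, k1=1.2, b=0.75):\n",
    "    \"\"\"Score each tokenized document against the query terms (Lucene-style BM25).\"\"\"\n",
    "    n_docs = len(docs)\n",
    "    avg_len = sum(len(doc) for doc in docs) / n_docs\n",
    "    # Document frequency: how many documents contain each term\n",
    "    df = Counter(term for doc in docs for term in set(doc))\n",
    "    scores = []\n",
    "    for doc in docs:\n",
    "        tf = Counter(doc)\n",
    "        score = 0.0\n",
    "        for term in query_terms:\n",
    "            if term not in tf:\n",
    "                continue\n",
    "            # Rare terms get a higher inverse document frequency\n",
    "            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))\n",
    "            # Term frequency saturates via k1; long documents are penalized via b\n",
    "            length_norm = k1 * (1 - b + b * len(doc) / avg_len)\n",
    "            score += idf * tf[term] / (tf[term] + length_norm)\n",
    "        scores.append(score)\n",
    "    return scores\n",
    "\n",
    "\n",
    "# Toy corpus, invented for illustration\n",
    "docs = [\n",
    "    \"deep learning for image recognition\".split(),\n",
    "    \"reinforcement learning agents learning to plan\".split(),\n",
    "    \"graph algorithms for routing\".split(),\n",
    "]\n",
    "for doc, score in zip(docs, bm25_scores([\"learning\"], docs)):\n",
    "    print(f\"{score:.3f}  {' '.join(doc)}\")\n",
    "```\n",
    "\n",
    "The second document mentions the term twice and therefore scores highest, but less than double the first: repeated terms give diminishing returns, and the longer document is normalized down. The third document does not contain the term and scores zero."
   ]
  },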
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Advanced OpenSearch Queries\n",
    "\n",
    "Now let's explore different query types using the OpenSearch Python client directly. This shows the power of BM25 without needing vectors!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.1 Match Query\n",
    "\n",
    "The `match` query is the standard query for full-text search on a single field:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match Query - Search in title field\n",
    "print(\"MATCH QUERY - Single Field Search\")\n",
    "print(\"=\" * 40)\n",
    "\n",
    "query = {\n",
    "    \"query\": {\n",
    "        \"match\": {\n",
    "            \"title\": \"machine learning\"\n",
    "        }\n",
    "    },\n",
    "    \"size\": 3\n",
    "}\n",
    "\n",
    "response = opensearch_client.client.search(\n",
    "    index=opensearch_client.index_name,\n",
    "    body=query\n",
    ")\n",
    "\n",
    "print(f\"Found {response['hits']['total']['value']} results\\n\")\n",
    "\n",
    "for hit in response['hits']['hits']:\n",
    "    print(f\"Title: {hit['_source']['title'][:70]}...\")"
   ]
  },
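  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, `match` combines the analyzed query terms with OR, so a title containing only \"learning\" still matches the query above. A hedged sketch of requiring every term instead, reusing the `opensearch_client` created earlier in this notebook (the variable name `strict_query` is made up for this example):\n",
    "\n",
    "```python\n",
    "strict_query = {\n",
    "    \"query\": {\n",
    "        \"match\": {\n",
    "            \"title\": {\n",
    "                \"query\": \"machine learning\",\n",
    "                \"operator\": \"and\",  # every term must appear (the default is \"or\")\n",
    "            }\n",
    "        }\n",
    "    },\n",
    "    \"size\": 3,\n",
    "}\n",
    "\n",
    "response = opensearch_client.client.search(\n",
    "    index=opensearch_client.index_name,\n",
    "    body=strict_query,\n",
    ")\n",
    "print(f\"Found {response['hits']['total']['value']} results\")\n",
    "```\n",
    "\n",
    "Expect the same number of results or fewer than with the default operator, since AND can only narrow the match set."
   ]
  },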
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 Multi-Match Query\n",
    "\n",
    "Search across multiple fields simultaneously:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Multi-Match Query - Search across multiple fields\n",
    "print(\"MULTI-MATCH QUERY - Search Multiple Fields\")\n",
    "print(\"=\" * 40)\n",
    "\n",
    "query = {\n",
    "    \"query\": {\n",
    "        \"multi_match\": {\n",
    "            \"query\": \"AI Agents\",\n",
    "            \"fields\": [\"title^2\", \"abstract\", \"authors\"],  # ^2 boosts title field\n",
    "            \"type\": \"best_fields\"\n",
    "        }\n",
    "    },\n",
    "    \"size\": 3\n",
    "}\n",
    "\n",
    "response = opensearch_client.client.search(\n",
    "    index=opensearch_client.index_name,\n",
    "    body=query\n",
    ")\n",
    "\n",
    "print(f\"Found {response['hits']['total']['value']} results\\n\")\n",
    "\n",
    "for hit in response['hits']['hits']:\n",
    "    print(f\"Title: {hit['_source']['title'][:70]}...\")\n",
    "    print(f\"Score: {hit['_score']:.2f}\")\n",
    "    print(f\"Authors: {', '.join(hit['_source']['authors'][:2])}...\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.3 Boosting Query\n",
    "\n",
    "Boost certain results while demoting others:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Boosting Query - Promote and demote results\n",
    "print(\"BOOSTING QUERY - Promote/Demote Results\")\n",
    "print(\"=\" * 40)\n",
    "\n",
    "query = {\n",
    "    \"query\": {\n",
    "        \"boosting\": {\n",
    "            \"positive\": {\n",
    "                \"match\": {\n",
    "                    \"abstract\": \"deep learning\"\n",
    "                }\n",
    "            },\n",
    "            \"negative\": {\n",
    "                \"match\": {\n",
    "                    \"abstract\": \"multimodal\"\n",
    "                }\n",
    "            },\n",
    "            \"negative_boost\": 0.1  # Multiply the score of negative matches by 0.1\n",
    "        }\n",
    "    },\n",
    "    \"size\": 3\n",
    "}\n",
    "\n",
    "response = opensearch_client.client.search(\n",
    "    index=opensearch_client.index_name,\n",
    "    body=query\n",
    ")\n",
    "\n",
    "print(\"Query: Boost 'deep learning', demote 'multimodal' papers\\n\")\n",
    "print(f\"Found {response['hits']['total']['value']} results\\n\")\n",
    "\n",
    "for hit in response['hits']['hits']:\n",
    "    title = hit['_source']['title'][:70]\n",
    "    abstract_snippet = hit['_source']['abstract'][:100]\n",
    "    print(f\"Title: {title}...\")\n",
    "    print(f\"Score: {hit['_score']:.2f}\")\n",
    "    print(f\"Abstract: {abstract_snippet}...\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.4 Filter Query\n",
    "\n",
    "Filter results by specific criteria (doesn't affect scoring):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter Query - Filter by categories\n",
    "print(\"FILTER QUERY - Category Filtering\")\n",
    "print(\"=\" * 40)\n",
    "\n",
    "query = {\n",
    "    \"query\": {\n",
    "        \"bool\": {\n",
    "            \"must\": [\n",
    "                {\n",
    "                    \"match\": {\n",
    "                        \"abstract\": \"neural\"\n",
    "                    }\n",
    "                }\n",
    "            ],\n",
    "            \"filter\": [\n",
    "                {\n",
    "                    \"terms\": {\n",
    "                        \"categories\": [\"cs.AI\"]\n",
    "                    }\n",
    "                }\n",
    "            ]\n",
    "        }\n",
    "    },\n",
    "    \"size\": 3\n",
    "}\n",
    "\n",
    "response = opensearch_client.client.search(\n",
    "    index=opensearch_client.index_name,\n",
    "    body=query\n",
    ")\n",
    "\n",
    "print(f\"Found {response['hits']['total']['value']} results\\n\")\n",
    "\n",
    "for hit in response['hits']['hits']:\n",
    "    title = hit['_source']['title'][:70]\n",
    "    categories = ', '.join(hit['_source']['categories'])\n",
    "    print(f\"Title: {title}...\")\n",
    "    print(f\"Categories: {categories}\")\n",
    "    print(f\"Score: {hit['_score']:.2f}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.5 Sorting Query\n",
    "\n",
    "Sort results by different criteria:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sorting Query - Sort by publication date\n",
    "print(\"SORTING QUERY - Latest Papers First\")\n",
    "print(\"=\" * 40)\n",
    "\n",
    "query = {\n",
    "    \"query\": {\n",
    "        \"match_all\": {}  # Get all papers\n",
    "    },\n",
    "    \"sort\": [\n",
    "        {\n",
    "            \"published_date\": {\n",
    "                \"order\": \"desc\"  # Latest first\n",
    "            }\n",
    "        }\n",
    "    ],\n",
    "    \"size\": 5\n",
    "}\n",
    "\n",
    "response = opensearch_client.client.search(\n",
    "    index=opensearch_client.index_name,\n",
    "    body=query\n",
    ")\n",
    "\n",
    "print(f\"Query: All papers sorted by publication date (newest first)\\n\")\n",
    "\n",
    "for hit in response['hits']['hits']:\n",
    "    title = hit['_source']['title'][:70]\n",
    "    pub_date = hit['_source']['published_date'][:10]\n",
    "    print(f\"Date: {pub_date} | {title}...\")"
   ]
  },
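  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When results are sorted only by a field such as `published_date`, OpenSearch skips relevance scoring and returns `_score` as `null` (`None` in Python). If you want date ordering and scores together, the `track_scores` option asks the engine to compute them anyway. A minimal sketch, reusing `opensearch_client` from earlier; the search term \"neural\" is an arbitrary example:\n",
    "\n",
    "```python\n",
    "query = {\n",
    "    \"query\": {\"match\": {\"abstract\": \"neural\"}},\n",
    "    \"sort\": [{\"published_date\": {\"order\": \"desc\"}}],\n",
    "    \"track_scores\": True,  # compute _score even though results are ordered by date\n",
    "    \"size\": 5,\n",
    "}\n",
    "\n",
    "response = opensearch_client.client.search(\n",
    "    index=opensearch_client.index_name,\n",
    "    body=query,\n",
    ")\n",
    "\n",
    "for hit in response[\"hits\"][\"hits\"]:\n",
    "    pub_date = hit[\"_source\"][\"published_date\"][:10]\n",
    "    print(f\"Date: {pub_date} | Score: {hit['_score']:.2f}\")\n",
    "```"
   ]
  },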
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.6 Combined Query\n",
    "\n",
    "Combine multiple query types for complex searches:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Combined Query - Complex search with multiple criteria\n",
    "print(\"COMBINED QUERY - Complex Search\")\n",
    "print(\"=\" * 40)\n",
    "\n",
    "query = {\n",
    "    \"query\": {\n",
    "        \"bool\": {\n",
    "            \"must\": [\n",
    "                {\n",
    "                    \"multi_match\": {\n",
    "                        \"query\": \"transformer\",\n",
    "                        \"fields\": [\"title^3\", \"abstract\"],\n",
    "                        \"type\": \"best_fields\"\n",
    "                    }\n",
    "                }\n",
    "            ],\n",
    "            \"filter\": [\n",
    "                {\n",
    "                    \"range\": {\n",
    "                        \"published_date\": {\n",
    "                            \"gte\": \"2024-01-01\"\n",
    "                        }\n",
    "                    }\n",
    "                }\n",
    "            ],\n",
    "            \"should\": [\n",
    "                {\n",
    "                    \"match\": {\n",
    "                        \"categories\": \"cs.AI\"\n",
    "                    }\n",
    "                }\n",
    "            ]\n",
    "        }\n",
    "    },\n",
    "    \"sort\": [\n",
    "        \"_score\",\n",
    "        {\"published_date\": {\"order\": \"desc\"}}\n",
    "    ],\n",
    "    \"size\": 3\n",
    "}\n",
    "\n",
    "response = opensearch_client.client.search(\n",
    "    index=opensearch_client.index_name,\n",
    "    body=query\n",
    ")\n",
    "\n",
    "print(f\"Complex Query:\")\n",
    "print(f\"  • Must contain 'transformer' (title boosted 3x)\")\n",
    "print(\"  • Filter: published on or after 2024-01-01\")\n",
    "print(f\"  • Prefer: cs.AI category\")\n",
    "print(f\"  • Sort: by relevance, then date\\n\")\n",
    "\n",
    "print(f\"Found {response['hits']['total']['value']} results\\n\")\n",
    "\n",
    "for hit in response['hits']['hits']:\n",
    "    title = hit['_source']['title'][:70]\n",
    "    pub_date = hit['_source']['published_date'][:10]\n",
    "    score = hit['_score']\n",
    "    categories = ', '.join(hit['_source']['categories'][:2])\n",
    "    \n",
    "    print(f\"Title: {title}...\")\n",
    "    print(f\"  Date: {pub_date} | Score: {score:.2f}\")\n",
    "    print(f\"  Categories: {categories}\\n\")"
   ]
  },
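  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One practical way to see why a document received its score is the search API's `explain` option, which attaches a per-hit breakdown of the BM25 computation. A hedged sketch, reusing `opensearch_client` from earlier; the query term and the 1000-character truncation are arbitrary choices for readability:\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "explained = opensearch_client.client.search(\n",
    "    index=opensearch_client.index_name,\n",
    "    body={\n",
    "        \"query\": {\"match\": {\"abstract\": \"transformer\"}},\n",
    "        \"explain\": True,  # include an _explanation tree with each hit\n",
    "        \"size\": 1,\n",
    "    },\n",
    ")\n",
    "\n",
    "for hit in explained[\"hits\"][\"hits\"]:\n",
    "    print(f\"Title: {hit['_source']['title'][:70]}...\")\n",
    "    # The explanation is nested; print only the first part\n",
    "    print(json.dumps(hit[\"_explanation\"], indent=2)[:1000])\n",
    "```\n",
    "\n",
    "The breakdown lists the term frequency, inverse document frequency, and length normalization that went into the score. `explain` adds overhead, so it is better suited to debugging than to production queries."
   ]
  },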
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "### What We Demonstrated\n",
    "\n",
    "**BM25 Search is Powerful!** Without any vector embeddings, we can:\n",
    "\n",
    "1. **Simple Search**: Basic keyword search with relevance scoring\n",
    "2. **Match Queries**: Search specific fields\n",
    "3. **Multi-Match**: Search across multiple fields with boosting\n",
    "4. **Boosting**: Promote or demote certain results\n",
    "5. **Filtering**: Apply filters without affecting scores\n",
    "6. **Sorting**: Order results by date, score, or other fields\n",
    "7. **Complex Queries**: Combine all techniques for sophisticated searches\n",
    "\n",
    "### Key Takeaways\n",
    "\n",
    "- **BM25 works great** for many search use cases\n",
    "- **No vectors needed** for effective full-text search\n",
    "- **Simple and fast** compared to embedding-based approaches\n",
    "- **Filters and sorting** make searches precise and relevant\n",
    "- **Field boosting** helps prioritize important content\n",
    "\n",
    "### When to Use BM25 vs Vectors\n",
    "\n",
    "**Use BM25 when:**\n",
    "- Searching for specific keywords or phrases\n",
    "- Need fast, simple implementation\n",
    "- Have good text fields with clear terminology\n",
    "- Want explainable search results\n",
    "\n",
    "**Consider vectors when:**\n",
    "- Need semantic similarity (concepts, not keywords)\n",
    "- Dealing with synonyms and paraphrasing\n",
    "- Cross-language search requirements\n",
    "- Very short queries or documents\n",
    "\n",
    "Remember: **You can also combine both** (hybrid search) for best results!\n",
    "We will cover this next week :)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}