{ "cells": [ { "cell_type": "markdown", "id": "7f3a21bc", "metadata": {}, "source": [ "# Oracle RAG Agents: Zero to Hero\n", "\n", "[![Open in Colab](https://img.shields.io/badge/Open%20in-Colab-F9AB00?style=flat-square&logo=googlecolab)](https://colab.research.google.com/github/oracle-devrel/oracle-ai-developer-hub/blob/main/notebooks/oracle_rag_agents_zero_to_hero.ipynb)\n", "\n", "-------\n" ] }, { "cell_type": "markdown", "id": "e5b7f4b9", "metadata": {}, "source": [ "## What You Will Learn\n", "\n", "In this notebook, you will learn how to:\n", "- Set up Oracle AI Database locally for AI application development.\n", "- Build a research dataset pipeline from ingestion to embeddings.\n", "- Implement keyword, vector, hybrid, and graph-based retrieval in Oracle.\n", "- Create a full RAG workflow and connect it to OpenAI models.\n", "- Build agentic workflows with tool use, orchestration, and session memory.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e322ad7c", "metadata": {}, "outputs": [], "source": [ "! pip install -Uq oracledb pandas sentence-transformers datasets einops \"numpy<2.0\"" ] }, { "cell_type": "markdown", "id": "010cf528", "metadata": {}, "source": [ "# Part 1. Oracle AI Database Local Installation" ] }, { "cell_type": "markdown", "id": "20816044", "metadata": {}, "source": [ "## 1.1 Local Installation of Oracle AI Database" ] }, { "cell_type": "markdown", "id": "c687fd63", "metadata": {}, "source": [ "**Oracle AI Database 26ai** is a **converged database** built for AI developers. \n", "It unifies **relational, document, graph and vector data** in one engine — so you can build \n", "**semantic search**, **RAG**, and **natural language to SQL** applications without leaving the database. 
\n", "\n", "Store embeddings, run vector search, and apply AI directly where your data lives — \n", "**securely, efficiently, and at scale.**\n" ] }, { "cell_type": "markdown", "id": "b0ad10f4", "metadata": {}, "source": [ "For this notebook, we will use a local installation of [Oracle AI Database](https://www.oracle.com/database/free/get-started/)." ] }, { "cell_type": "markdown", "id": "0308b61d", "metadata": {}, "source": [ "1. Install and start Docker: Docker Desktop (Mac/Windows) or Docker Engine (Linux). Make sure it’s running.\n", " - If you installed Docker Desktop on macOS, you can start it from the terminal with ```open /Applications/Docker.app```\n", "2. Pull the [Docker image](https://www.oracle.com/database/free/get-started/)\n", "3. Run a container from the Oracle image\n", "\n", " ```\n", " docker run -d \\\n", " --name oracle-full \\\n", " -p 1521:1521 -p 5500:5500 \\\n", " -e ORACLE_PWD=YourStrongPassword \\\n", " -e ORACLE_SID=FREE \\\n", " -e ORACLE_PDB=FREEPDB1 \\\n", " -v ~/oracle/full_data:/opt/oracle/oradata \\\n", " container-registry.oracle.com/database/free:latest\n", " ```" ] }, { "cell_type": "markdown", "id": "9e0da924", "metadata": {}, "source": [ "> 🚫 **Troubleshoot** \n", "> If you see the error: \n", "> *`docker: Error response from daemon: Conflict. The container name \"/oracle-full\" is already in use by container ... 
You have to remove (or rename) that container to be able to reuse that name.`* \n", ">\n", "> 🧩 **Fix:** \n", "> - Remove the existing container: \n", "> ```bash\n", "> docker rm oracle-full\n", "> ``` \n", "> - Then re-run your Docker command from **Step 3** to start a new container.\n" ] }, { "cell_type": "markdown", "id": "83c21e0d", "metadata": {}, "source": [ "After running the docker command above in your terminal, you should see the image below if you click into the container running" ] }, { "cell_type": "markdown", "id": "464c1481", "metadata": {}, "source": [ "### 1.1.1 Connecting to Oracle AI Database" ] }, { "cell_type": "code", "execution_count": null, "id": "de8defa7", "metadata": {}, "outputs": [], "source": [ "import os\n", "import oracledb\n", "import time\n", "\n", "\n", "def ensure_property_graph_privileges(\n", " target_user: str = \"VECTOR\",\n", " admin_user: str = \"SYSTEM\",\n", " admin_password: str | None = None,\n", " dsn: str = \"localhost:1521/FREEPDB1\",\n", "):\n", " \"\"\"\n", " Grant SQL Property Graph privileges required by this notebook.\n", "\n", " Uses an admin session (SYSTEM by default) to grant:\n", " - CREATE PROPERTY GRAPH (required)\n", " - READ ANY PROPERTY GRAPH (optional; useful for cross-schema graph reads)\n", "\n", " Set ORACLE_ADMIN_PWD in your environment if admin password differs.\n", " \"\"\"\n", " admin_password = (\n", " admin_password\n", " or os.environ.get(\"ORACLE_ADMIN_PWD\")\n", " or os.environ.get(\"ORACLE_PASSWORD\")\n", " or \"YourStrongPassword\"\n", " )\n", "\n", " try:\n", " with oracledb.connect(user=admin_user, password=admin_password, dsn=dsn) as admin_conn:\n", " with admin_conn.cursor() as cur:\n", " # Required for CREATE PROPERTY GRAPH DDL\n", " cur.execute(f\"GRANT CREATE PROPERTY GRAPH TO {target_user}\")\n", "\n", " # Optional in some setups/versions; ignore if unsupported\n", " try:\n", " cur.execute(f\"GRANT READ ANY PROPERTY GRAPH TO {target_user}\")\n", " print(f\"✓ Granted READ ANY 
PROPERTY GRAPH to {target_user}.\")\n", " except oracledb.DatabaseError as read_err:\n", " if \"ORA-00990\" in str(read_err):\n", " print(\"ℹ️ READ ANY PROPERTY GRAPH is not available in this setup; continuing.\")\n", " else:\n", " raise\n", "\n", " admin_conn.commit()\n", "\n", " print(f\"✓ Granted CREATE PROPERTY GRAPH to {target_user}.\")\n", "\n", " except oracledb.DatabaseError as e:\n", " print(f\"⚠️ Could not auto-grant property graph privileges as {admin_user}: {e}\")\n", " print(\" If you hit ORA-01031 later, run as admin:\")\n", " print(f\" GRANT CREATE PROPERTY GRAPH TO {target_user};\")\n", " print(f\" -- Optional if supported: GRANT READ ANY PROPERTY GRAPH TO {target_user};\")\n", "\n", "\n", "def connect_to_oracle(max_retries=3, retry_delay=5):\n", " \"\"\"\n", " Connect to Oracle database with retry logic and better error handling.\n", "\n", " Args:\n", " max_retries: Maximum number of connection attempts\n", " retry_delay: Seconds to wait between retries\n", " \"\"\"\n", " user = \"VECTOR\"\n", " password = \"VectorPwd_2025\" # password for the VECTOR schema user (ORACLE_PWD from docker run sets SYS/SYSTEM, not this user)\n", " dsn = \"localhost:1521/FREEPDB1\"\n", "\n", " for attempt in range(1, max_retries + 1):\n", " try:\n", " print(f\"Connection attempt {attempt}/{max_retries}...\")\n", " conn = oracledb.connect(\n", " user=user,\n", " password=password,\n", " dsn=dsn\n", " )\n", " print(\"✓ Connected successfully!\")\n", "\n", " # Test the connection\n", " with conn.cursor() as cur:\n", " cur.execute(\"SELECT banner FROM v$version WHERE banner LIKE 'Oracle%'\")\n", " banner = cur.fetchone()[0]\n", " print(f\"\\n{banner}\")\n", "\n", " return conn\n", "\n", " except oracledb.OperationalError as e:\n", " error_msg = str(e)\n", " print(f\"✗ Connection failed (attempt {attempt}/{max_retries})\")\n", "\n", " if \"DPY-4011\" in error_msg or \"Connection reset by peer\" in error_msg:\n", " print(\" → This usually means:\")\n", " print(\" 1. 
Database is still starting up (wait 2-3 minutes)\")\n", " print(\" 2. Listener is not bound to 0.0.0.0 (run fix_oracle_listener())\")\n", " print(\" 3. Container is not running (check with check_docker_container())\")\n", "\n", " if attempt < max_retries:\n", " print(f\"\\n Waiting {retry_delay} seconds before retry...\")\n", " time.sleep(retry_delay)\n", " else:\n", " print(\"\\n 💡 Try running:\")\n", " print(\" 1. check_docker_container() - verify container is running\")\n", " print(\" 2. fix_oracle_listener() - fix listener binding\")\n", " raise\n", " else:\n", " raise\n", " except Exception as e:\n", " print(f\"✗ Unexpected error: {e}\")\n", " raise\n", "\n", " raise ConnectionError(\"Failed to connect after all retries\")\n", "\n", "\n", "# Ensure VECTOR can create/read SQL Property Graph objects\n", "ensure_property_graph_privileges(\n", " target_user=\"VECTOR\",\n", " admin_user=\"SYSTEM\",\n", " dsn=\"localhost:1521/FREEPDB1\",\n", ")\n", "\n", "# Connect to Oracle as VECTOR\n", "conn = connect_to_oracle()\n" ] }, { "cell_type": "markdown", "id": "31988f1d", "metadata": {}, "source": [ "> 🚫 Troubleshoot: Oracle Database Free (Docker) — Connection Fix\n", ">\n", "> If you see errors like:\n", ">\n", "> ```\n", "> OperationalError: DPY-6005: cannot connect to database\n", "> DPY-4011: the database or network closed the connection\n", "> TNS-12545: Connect failed because target host or object does not exist\n", "> ```\n", ">\n", "> It means the **Oracle listener** inside the container is binding to the **container hostname** instead of `0.0.0.0`, preventing host connections.\n", ">\n", ">\n", ">\n", "> 🧩 Fix\n", "> \n", "> Run this exact command to patch the listener and restart it:\n", "> \n", "> ```bash\n", "> docker exec -it oracle-full bash -lc '\n", "> export ORACLE_HOME=${ORACLE_HOME:-/opt/oracle/product/26ai/dbhomeFree}\n", "> export PATH=$ORACLE_HOME/bin:$PATH\n", "> LISTENER_ORA=\"$ORACLE_HOME/network/admin/listener.ora\"\n", "> \n", "> echo \"== 
Fixing listener to use HOST=0.0.0.0\"\n", "> sed -i \"s/(HOST *= *[^)]*)/(HOST = 0.0.0.0)/\" \"$LISTENER_ORA\"\n", "> \n", "> echo \"== Restarting listener\"\n", "> lsnrctl stop || true\n", "> lsnrctl start\n", "> \n", "> echo \"== Re-registering services\"\n", "> echo \"ALTER SYSTEM REGISTER;\" | sqlplus -s / as sysdba\n", "> \n", "> echo \"== Listener status (first 20 lines):\"\n", "> lsnrctl status | sed -n \"1,20p\"\n", "> '\n", "> ```\n", "> \n", "> This:\n", "> - Forces the listener to bind on all interfaces (`0.0.0.0`).\n", "> - Restarts the listener.\n", "> - Re-registers the PDB service (`FREEPDB1`) with the listener.\n", "> \n", "> ---\n", "> \n" ] }, { "cell_type": "markdown", "id": "8d714bb2", "metadata": {}, "source": [ "## 1.2 Remote Access with FreeSQL.com (Coming Soon)" ] }, { "cell_type": "markdown", "id": "23c24fa8", "metadata": {}, "source": [ "---------" ] }, { "cell_type": "markdown", "id": "02032d60", "metadata": {}, "source": [ "# Part 2. Data Loading, Preparation and Embedding Generation" ] }, { "cell_type": "markdown", "id": "34d8e19e", "metadata": {}, "source": [ "## 2.1 Data Loading From Hugging Face" ] }, { "cell_type": "markdown", "id": "dc6ddae6", "metadata": {}, "source": [ "**Streaming the ArXiv Papers Dataset with Hugging Face**\n", "\n", "The following code in the next cell demonstrates how to efficiently load and stream a large dataset using the **Hugging Face `datasets`** library — a powerful tool for handling massive datasets that may not fit into memory all at once.\n", "\n", "```python\n", "import pandas as pd\n", "from datasets import load_dataset\n", "\n", "ds_stream = load_dataset(\"nick007x/arxiv-papers\", split=\"train\", streaming=True)\n", "```" ] }, { "cell_type": "markdown", "id": "5e6cc928", "metadata": {}, "source": [ "**Step-by-Step Explanation**\n", "1. 
**Importing dependencies**\n", " - `load_dataset` is imported from the `datasets` library, giving access to thousands of open datasets hosted on the Hugging Face Hub.\n", "\n", "2. **Loading the dataset**\n", " - The dataset `\"nick007x/arxiv-papers\"` refers to a public dataset on the Hugging Face Hub that contains metadata or text from research papers hosted on [arXiv.org](https://arxiv.org).\n", " - The parameter `split=\"train\"` loads the training partition of the dataset (many datasets have `train`, `validation`, and `test` splits).\n", " - The key argument `streaming=True` activates **streaming mode**, meaning the dataset is read progressively from the source without downloading it entirely to disk.\n", "\n", "3. **Why streaming mode matters**\n", " - Traditional dataset loading downloads the full dataset into memory or disk, which can be slow and memory-intensive. \n", " - Streaming allows you to process examples **as they arrive**, ideal for very large datasets or limited-resource environments.\n", " - You can iterate over the dataset like this:\n", " ```python\n", " for record in ds_stream:\n", " print(record)\n", " break\n", " ```\n", "\n", "4. **Resulting object**\n", " - The variable `ds_stream` is an instance of `datasets.IterableDataset`, not a static table. 
\n", " - You can convert a small sample into a Pandas DataFrame for inspection:\n", " ```python\n", " it = iter(ds_stream) # reuse one iterator; calling iter() each time would repeat the first record\n", " sample = [next(it) for _ in range(5)]\n", " pd.DataFrame(sample)\n", " ```" ] }, { "cell_type": "markdown", "id": "07b06e64", "metadata": {}, "source": [ "> **💡 Takeaway:** \n", "> Using `load_dataset(..., streaming=True)` enables developers and data scientists to work with **large datasets efficiently** — a perfect fit for machine learning pipelines, LLM training, or large document analysis workflows.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "43ee654d", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from datasets import load_dataset\n", "\n", "ds_stream = load_dataset(\"nick007x/arxiv-papers\", split=\"train\", streaming=True)" ] }, { "cell_type": "markdown", "id": "be8e1adc", "metadata": {}, "source": [ "The code below streams the first 1,000 ArXiv papers (feel free to use more) from the dataset, extracts their titles and abstracts, combines them into a single text field, and stores the results as structured dictionaries for later use."
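,
"\n",
"\n",
"As a side note, bounded sampling from any stream (including a Hugging Face `IterableDataset`) can also be written with `itertools.islice`. A minimal sketch, using a plain generator as a hypothetical stand-in for `ds_stream`:\n",
"\n",
"```python\n",
"from itertools import islice\n",
"\n",
"def take(stream, n):\n",
"    # Collect at most n records without materializing the rest of the stream\n",
"    return list(islice(stream, n))\n",
"\n",
"fake_stream = ({\"title\": f\"Paper {i}\"} for i in range(10_000))\n",
"print(take(fake_stream, 2))\n",
"```"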
] }, { "cell_type": "code", "execution_count": null, "id": "b5ebcc91", "metadata": {}, "outputs": [], "source": [ "sampled = []\n", "for i, item in enumerate(ds_stream):\n", " if i >= 1000:\n", " break\n", " \n", " arxiv_id = item.get(\"arxiv_id\", f\"unknown_{i}\")\n", " title = item.get(\"title\", \"\").strip()\n", " abstract = item.get(\"abstract\", \"\").strip()\n", " authors = item.get(\"authors\", [])\n", "\n", " if isinstance(authors, str):\n", " authors = [a.strip() for a in authors.split(\",\") if a.strip()]\n", " elif isinstance(authors, list):\n", " authors = [str(a).strip() for a in authors if str(a).strip()]\n", " else:\n", " authors = []\n", " \n", " # Combine title + abstract for embedding text\n", " text = f\"{title} — {abstract}\"\n", " \n", " sampled.append({\n", " \"id\": item.get(\"id\", f\"arxiv_{i}\"),\n", " \"arxiv_id\": arxiv_id,\n", " \"title\": title,\n", " \"abstract\": abstract,\n", " \"authors\": authors,\n", " \"text\": text\n", " })\n", "\n", "print(f\"✅ Streamed {len(sampled)} papers.\")\n" ] }, { "cell_type": "markdown", "id": "faaae87a", "metadata": {}, "source": [ "The code below converts the collected list of sampled paper records into a Pandas DataFrame for easier analysis, prints how many rows were loaded, and displays the first few entries to preview the dataset’s structure." 
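,
"\n",
"\n",
"For reference, a list of dicts maps directly onto DataFrame columns (one column per dict key). A tiny illustration with toy records:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"records = [\n",
"    {\"arxiv_id\": \"a1\", \"title\": \"T1\"},\n",
"    {\"arxiv_id\": \"a2\", \"title\": \"T2\"},\n",
"]\n",
"df = pd.DataFrame(records)\n",
"print(df.shape)          # (2, 2)\n",
"print(list(df.columns))  # ['arxiv_id', 'title']\n",
"```"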
] }, { "cell_type": "code", "execution_count": null, "id": "4b655b54", "metadata": {}, "outputs": [], "source": [ "# Convert the list of record dicts into a DataFrame\n", "dataset_df = pd.DataFrame(sampled)\n", "\n", "# View shape and head\n", "print(f\"✅ Loaded {len(dataset_df)} rows into DataFrame.\")\n", "dataset_df.head()" ] }, { "cell_type": "markdown", "id": "77ac879d", "metadata": {}, "source": [ "## 2.2 Embedding Generation" ] }, { "cell_type": "markdown", "id": "357d2c62", "metadata": {}, "source": [ "\n", "The code in the next cell imports the **`SentenceTransformer`** class from the `sentence_transformers` library and loads a pretrained model called **`\"nomic-ai/nomic-embed-text-v1.5\"`** from the Hugging Face Hub.\n", "\n", "```python\n", "from sentence_transformers import SentenceTransformer\n", "embedding_model = SentenceTransformer(\"nomic-ai/nomic-embed-text-v1.5\", trust_remote_code=True)\n", "```\n", "\n", "🔍 What it does\n", "- **`SentenceTransformer`** extends transformer-based models (like BERT or RoBERTa) to produce **sentence-level embeddings** — dense numerical vectors that capture the semantic meaning of text.\n", "- The model **`nomic-ai/nomic-embed-text-v1.5`** is optimized for general-purpose text embeddings and works well for tasks such as:\n", " - Semantic search \n", " - Clustering and topic modeling \n", " - Retrieval-Augmented Generation (RAG) \n", " - Similarity ranking and recommendation systems\n", "- The parameter **`trust_remote_code=True`** allows the loader to execute custom code from the model’s repository, which is required for models that define specialized architectures or preprocessing logic.\n", "\n", "In short, this code prepares a powerful embedding model that can transform text (like paper titles or abstracts) into **high-dimensional semantic vectors**, making it easier to measure meaning-based similarity across documents.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "939be3b7", 
"metadata": {}, "outputs": [], "source": [ "from sentence_transformers import SentenceTransformer\n", "embedding_model = SentenceTransformer(\"nomic-ai/nomic-embed-text-v1.5\", trust_remote_code=True)" ] }, { "cell_type": "markdown", "id": "1aeb4f07", "metadata": {}, "source": [ "The code in the next cell iterates through each document, encodes it into a normalized embedding vector while showing live progress, and prefixes each text with \"search_document: \" so that the Nomic embedding model correctly interprets it as a retrieval document, improving alignment with query embeddings during semantic search.\n", "\n", "The prefix \"search_document: \" tells the Nomic embedding model what kind of text it’s encoding — in this case, a document intended for retrieval.\n", "\n", "Nomic models like nomic-embed-text-v1.5 are trained with instructional prefixes (e.g., \"search_query:\", \"search_document:\", \"classification:\"), which guide the model to generate embeddings suited for different purposes.\n", "\n", "**Why Use the `\"search_document:\"` Prefix with Nomic Embeddings**\n", "\n", "Nomic embedding models (like **`nomic-embed-text-v1.5`**) are **instruction-tuned**, meaning they were trained with specific **task prefixes** that tell the model *how* to interpret the text you’re embedding. 
\n", "\n", "According to the [Nomic documentation](https://docs.nomic.ai/reference/api/embed-text-v-1-embedding-text-post) and the [Hugging Face model card](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5):\n", "\n", "> “Important: the text prompt must include a task instruction prefix, instructing the model which task is being performed.\n", "> For example, if you are implementing a RAG application, you embed your documents as \n", "> `search_document: ` and embed your user queries as `search_query: `.”\n", "\n", "**💡 Why this matters**\n", "- The prefix tells the model whether a text is a **document** or a **query**, ensuring both are embedded into the **same semantic space**.\n", "- Using `search_document:` for document embeddings and `search_query:` for query embeddings **improves retrieval accuracy**, since the model optimizes for similarity between matching query–document pairs.\n", "- Omitting or mismatching prefixes can lead to weaker alignment and lower recall in search or RAG workflows.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "be75832a", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import sys\n", "\n", "# Prefix for retrieval-style embeddings\n", "dataset_df[\"text_prefixed\"] = dataset_df[\"text\"].apply(lambda x: f\"search_document: {x}\")\n", "\n", "# Convert to list for iteration\n", "texts = dataset_df[\"text_prefixed\"].tolist()\n", "embs = []\n", "\n", "# Encode one text at a time with a single-line progress display\n", "total = len(texts)\n", "for i, text in enumerate(texts, start=1):\n", " embedding = embedding_model.encode(\n", " text,\n", " show_progress_bar=False, # Disable SentenceTransformer progress\n", " convert_to_numpy=True,\n", " normalize_embeddings=True\n", " )\n", " embs.append(embedding)\n", "\n", " # Print progress on the same line\n", " sys.stdout.write(f\"\\rEncoding {i}/{total} texts...\")\n", " sys.stdout.flush()\n", "\n", "# Final newline to avoid overwriting the last line\n", 
"print(f\"\\n✅ Finished encoding {len(embs)} texts.\")\n", "\n", "# Convert list of vectors to NumPy array\n", "embs = np.vstack(embs)\n", "print(f\"✅ Embeddings shape: {embs.shape}\")\n" ] }, { "cell_type": "markdown", "id": "2986b5f5", "metadata": {}, "source": [ "One important detail to note is the embedding dimensionality — the number of numerical features in each vector representation.\n", "This dimensionality determines the structure of your vector index and must remain consistent across all embeddings to ensure efficient similarity search and accurate retrieval performance." ] }, { "cell_type": "markdown", "id": "eab5614e", "metadata": {}, "source": [ "The code in the next cell converts each embedding into a list of 32-bit floating-point numbers (float32) so it can be stored in an Oracle VECTOR column, determines the embedding dimension (needed for defining the vector index), and then displays the first two rows to confirm the data and embeddings were formatted correctly."
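,
"\n",
"\n",
"A quick sanity check of both properties (one shared dimension, float32 lists) can be run on toy vectors — a minimal sketch, with `vectors` standing in for `embs`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"vectors = [np.ones(4), np.zeros(4), np.full(4, 0.5)]\n",
"\n",
"# Every embedding must share one dimension to fit the VECTOR column\n",
"dims = {v.shape[0] for v in vectors}\n",
"assert len(dims) == 1, f\"Inconsistent embedding dims: {dims}\"\n",
"\n",
"# float32 lists bind cleanly to a VECTOR(dim, FLOAT32) column\n",
"as_float32 = [v.astype(np.float32).tolist() for v in vectors]\n",
"print(len(as_float32[0]))  # 4\n",
"```"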
] }, { "cell_type": "code", "execution_count": null, "id": "53954b6b", "metadata": {}, "outputs": [], "source": [ "# store as float32 lists (ready for Oracle VECTOR column)\n", "dataset_df[\"embedding\"] = [e.astype(np.float32).tolist() for e in embs]\n", "\n", "dim = len(dataset_df[\"embedding\"].iloc[0])\n", "\n", "# View the first 2 rows of the dataset\n", "dataset_df.head(2)" ] }, { "cell_type": "markdown", "id": "58889766", "metadata": {}, "source": [ "----" ] }, { "cell_type": "markdown", "id": "c526f50c", "metadata": {}, "source": [ "# Part 3: Database Table Setup and Data Ingestion" ] }, { "cell_type": "markdown", "id": "0b9710e3", "metadata": {}, "source": [ "This code below safely resets the core paper table by first dropping any dependent graph edge tables (if they exist), then recreating `research_papers` with paper metadata and a `VECTOR` embedding column of size `dim`.\n", "\n", "This rerun-safe order avoids foreign-key drop errors when graph tables already reference `research_papers`." 
] }, { "cell_type": "markdown", "id": "0a52cfc9", "metadata": {}, "source": [ "## 3.1 Create Research Papers Table" ] }, { "cell_type": "code", "execution_count": null, "id": "833b3c02", "metadata": {}, "outputs": [], "source": [ "ddl = f\"\"\"\n", "BEGIN\n", " EXECUTE IMMEDIATE 'DROP TABLE paper_similarities';\n", "EXCEPTION WHEN OTHERS THEN\n", " IF SQLCODE != -942 THEN RAISE; END IF;\n", "END;\n", "/\n", "BEGIN\n", " EXECUTE IMMEDIATE 'DROP TABLE paper_authors';\n", "EXCEPTION WHEN OTHERS THEN\n", " IF SQLCODE != -942 THEN RAISE; END IF;\n", "END;\n", "/\n", "BEGIN\n", " EXECUTE IMMEDIATE 'DROP TABLE authors';\n", "EXCEPTION WHEN OTHERS THEN\n", " IF SQLCODE != -942 THEN RAISE; END IF;\n", "END;\n", "/\n", "BEGIN\n", " EXECUTE IMMEDIATE 'DROP TABLE research_papers CASCADE CONSTRAINTS PURGE';\n", "EXCEPTION WHEN OTHERS THEN\n", " IF SQLCODE != -942 THEN RAISE; END IF;\n", "END;\n", "/\n", "CREATE TABLE research_papers (\n", " arxiv_id VARCHAR2(255) PRIMARY KEY,\n", " title VARCHAR2(4000),\n", " abstract VARCHAR2(4000),\n", " text CLOB,\n", " embedding VECTOR({dim}, FLOAT32)\n", ")\n", "TABLESPACE USERS\n", "\"\"\"\n" ] }, { "cell_type": "markdown", "id": "d6f83f9e", "metadata": {}, "source": [ "Here, we’re taking the multi-statement SQL stored in `ddl`, splitting it by `/` so each command runs separately, and executing them one by one using the database cursor. 
\n", "After all statements finish, we call `conn.commit()` to save the changes — effectively creating the `RESEARCH_PAPERS` table — and then print a confirmation message showing the vector dimension (`dim`) used for the `embedding` column.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d20eecfe", "metadata": {}, "outputs": [], "source": [ "with conn.cursor() as cur:\n", " for stmt in ddl.split(\"/\"):\n", " if stmt.strip():\n", " cur.execute(stmt)\n", "\n", "conn.commit()\n", "print(\"✅ Table RESEARCH_PAPERS created with VECTOR dimension:\", dim)" ] }, { "cell_type": "markdown", "id": "ab2ab118", "metadata": {}, "source": [ "## 3.2 Create Indexes (Vector and Search)" ] }, { "cell_type": "markdown", "id": "7a0e7ce7", "metadata": {}, "source": [ "This code in the next cell below creates a **vector index** on the `embedding` column of the `research_papers` table to enable fast similarity search. \n", "The index, named `RP_VEC_HNSW`, uses Oracle’s `VECTOR` indexing with **HNSW (Hierarchical Navigable Small World)** organization, which stores vectors in a graph for efficient approximate nearest-neighbor traversal. \n", "It’s configured to use **cosine distance** as the similarity metric and a **target accuracy of 90%**, balancing search speed and precision. \n", "We also set HNSW build parameters (`NEIGHBORS`, `EFCONSTRUCTION`) for graph connectivity and index quality. 
\n", "Finally, `conn.commit()` saves the index creation, and a confirmation message is printed once the index is successfully built.\n", "\n", "Note: In Oracle’s vector indexing syntax, **ORGANIZATION INMEMORY NEIGHBOR GRAPH** enables the HNSW graph-based index structure.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "87b97672", "metadata": {}, "outputs": [], "source": [ "with conn.cursor() as cur:\n", " # Drop old vector indexes if they exist (supports re-running this section)\n", " for idx_name in (\"RP_VEC_HNSW\", \"RP_VEC_IVF\"):\n", " try:\n", " cur.execute(f\"DROP INDEX {idx_name}\")\n", " except oracledb.DatabaseError as e:\n", " if \"ORA-01418\" not in str(e): # ignore \"index does not exist\"\n", " raise\n", "\n", " cur.execute(\"\"\"\n", " CREATE VECTOR INDEX RP_VEC_HNSW\n", " ON research_papers(embedding)\n", " ORGANIZATION INMEMORY NEIGHBOR GRAPH\n", " DISTANCE COSINE\n", " WITH TARGET ACCURACY 90\n", " PARAMETERS (TYPE HNSW, NEIGHBORS 40, EFCONSTRUCTION 500)\n", " TABLESPACE USERS\n", " \"\"\")\n", "\n", "conn.commit()\n", "print(\"✅ Vector Index RP_VEC_HNSW (HNSW) created\")\n" ] }, { "cell_type": "markdown", "id": "106fa5b5", "metadata": {}, "source": [ "> **💡 Knowledge Checkpoint**\n", ">\n", "> HNSW (Hierarchical Navigable Small World) is a graph-based vector index that speeds up similarity search using layered small-world navigation.\n", ">\n", "> Instead of scanning all vectors, HNSW traverses graph links from coarse to fine levels to quickly approach nearest neighbors.\n", "> \n", "> In Oracle AI Database, the clause `ORGANIZATION INMEMORY NEIGHBOR GRAPH` activates this HNSW structure. 
Parameters like `NEIGHBORS` and `EFCONSTRUCTION` control graph connectivity and index build quality, while `WITH TARGET ACCURACY` controls ANN search quality.\n" ] }, { "cell_type": "markdown", "id": "6231597e", "metadata": {}, "source": [ "**Creating an Oracle Text Index for Full-Text Search**\n", "\n", "This code below creates a **full-text search index** on the `text` column of the `research_papers` table using Oracle Text.\n", "\n", "1. **Drop existing index** — removes `rp_text_idx` if it already exists, ignoring the “index does not exist” error. \n", "2. **Create new text index** — builds a `CTXSYS.CONTEXT` index, which tokenizes and indexes the text for efficient keyword searches. \n", " - `SYNC (ON COMMIT)` keeps the index automatically updated whenever new data is committed. \n", "3. **Commit and confirm** — saves the changes and prints a success message.\n", "\n", "Once created, you can use the `CONTAINS()` operator with `SCORE()` for fast, ranked keyword searches, e.g.:\n", "\n", "```sql\n", "SELECT title, SCORE(1) AS relevance\n", "FROM research_papers\n", "WHERE CONTAINS(text, 'transformer architecture', 1) > 0\n", "ORDER BY SCORE(1) DESC;\n", "```\n", "\n", "This turns your text column into a search-optimized field, similar to how search engines handle full-text retrieval.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "52720503", "metadata": {}, "outputs": [], "source": [ "with conn.cursor() as cur:\n", " # Drop old index if it exists\n", " try:\n", " cur.execute(\"DROP INDEX rp_text_idx\")\n", " except oracledb.DatabaseError as e:\n", " if \"ORA-01418\" not in str(e): # ignore \"index does not exist\"\n", " raise\n", "\n", " # Create a TEXT index on the 'text' column only\n", " cur.execute(\"\"\"\n", " CREATE INDEX rp_text_idx\n", " ON research_papers(text)\n", " INDEXTYPE IS CTXSYS.CONTEXT\n", " PARAMETERS ('SYNC (ON COMMIT)')\n", " \"\"\")\n", "\n", "conn.commit()\n", "print(\"✅ Oracle Text index (rp_text_idx) created successfully 
on TEXT column.\")\n" ] }, { "cell_type": "markdown", "id": "6f2e0bf1", "metadata": {}, "source": [ "### 3.2.1 Create Relational Tables for Graph Retrieval\n", "\n", "To support Oracle SQL Property Graph retrieval, we add normalized edge tables for authorship and paper similarity. This keeps the existing `research_papers` vector/text setup intact while adding graph traversal data.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ec534faf", "metadata": {}, "outputs": [], "source": [ "graph_tables_ddl = \"\"\"\n", "BEGIN\n", " EXECUTE IMMEDIATE 'DROP TABLE paper_similarities';\n", "EXCEPTION WHEN OTHERS THEN\n", " IF SQLCODE != -942 THEN RAISE; END IF;\n", "END;\n", "/\n", "BEGIN\n", " EXECUTE IMMEDIATE 'DROP TABLE paper_authors';\n", "EXCEPTION WHEN OTHERS THEN\n", " IF SQLCODE != -942 THEN RAISE; END IF;\n", "END;\n", "/\n", "BEGIN\n", " EXECUTE IMMEDIATE 'DROP TABLE authors';\n", "EXCEPTION WHEN OTHERS THEN\n", " IF SQLCODE != -942 THEN RAISE; END IF;\n", "END;\n", "/\n", "CREATE TABLE authors (\n", " author_name VARCHAR2(512) PRIMARY KEY\n", ")\n", "TABLESPACE USERS\n", "/\n", "CREATE TABLE paper_authors (\n", " arxiv_id VARCHAR2(255) NOT NULL,\n", " author_name VARCHAR2(512) NOT NULL,\n", " CONSTRAINT pk_paper_authors PRIMARY KEY (arxiv_id, author_name),\n", " CONSTRAINT fk_pa_paper FOREIGN KEY (arxiv_id) REFERENCES research_papers(arxiv_id),\n", " CONSTRAINT fk_pa_author FOREIGN KEY (author_name) REFERENCES authors(author_name)\n", ")\n", "TABLESPACE USERS\n", "/\n", "CREATE TABLE paper_similarities (\n", " source_arxiv_id VARCHAR2(255) NOT NULL,\n", " target_arxiv_id VARCHAR2(255) NOT NULL,\n", " sim_score NUMBER(8,6) NOT NULL,\n", " rank_no NUMBER(5) NOT NULL,\n", " CONSTRAINT pk_paper_similarities PRIMARY KEY (source_arxiv_id, target_arxiv_id),\n", " CONSTRAINT fk_ps_src FOREIGN KEY (source_arxiv_id) REFERENCES research_papers(arxiv_id),\n", " CONSTRAINT fk_ps_tgt FOREIGN KEY (target_arxiv_id) REFERENCES research_papers(arxiv_id),\n", " 
CONSTRAINT ck_ps_not_self CHECK (source_arxiv_id <> target_arxiv_id)\n", ")\n", "TABLESPACE USERS\n", "\"\"\"\n", "\n", "with conn.cursor() as cur:\n", " for stmt in graph_tables_ddl.split(\"/\"):\n", " if stmt.strip():\n", " cur.execute(stmt)\n", "\n", " cur.execute(\"CREATE INDEX idx_pa_author ON paper_authors(author_name)\")\n", " cur.execute(\"CREATE INDEX idx_ps_source ON paper_similarities(source_arxiv_id)\")\n", " cur.execute(\"CREATE INDEX idx_ps_target ON paper_similarities(target_arxiv_id)\")\n", "\n", "conn.commit()\n", "print(\"✅ Graph tables created: AUTHORS, PAPER_AUTHORS, PAPER_SIMILARITIES\")\n" ] }, { "cell_type": "markdown", "id": "fd2d74db", "metadata": {}, "source": [ "In the code below we’re taking all the research paper records (including their embeddings) from the DataFrame and inserting them into the `RESEARCH_PAPERS` table in Oracle.\n", "\n", "Let’s break down what’s happening:\n", "\n", "\n", "1. Convert each embedding for Oracle compatibility\n", "```python\n", "embedding_array = array.array('f', row.get(\"embedding\"))\n", "```\n", "Oracle’s `VECTOR` column expects the data as a compact binary float array, not a Python list. \n", "So here we convert each embedding into an `array.array('f')` — this ensures the data binds correctly and efficiently when inserting.\n", "\n", "\n", "2. Prepare all rows for insertion\n", "For every paper, we build a tuple containing its metadata (`arxiv_id`, `title`, `abstract`, `text`) and the formatted `embedding_array`. \n", "These tuples are collected in a list called `rows`.\n", "\n", "3. Insert each row with a progress bar\n", "```python\n", "for row in tqdm(rows):\n", " cur.execute(\"\"\"\n", " INSERT INTO research_papers (arxiv_id, title, abstract, text, embedding)\n", " VALUES (:1, :2, :3, :4, :5)\n", " \"\"\", row)\n", "```\n", "We loop through each row, inserting it into the table using Oracle’s parameterized query syntax. 
\n", "The `tqdm` progress bar gives real-time feedback on the insertion process — helpful when inserting hundreds or thousands of embeddings.\n", "\n", "\n", "4. Commit everything\n", "Finally, `conn.commit()` saves all the inserted records permanently in the database. You’ll see a confirmation message once the operation completes successfully.\n" ] }, { "cell_type": "markdown", "id": "08324270", "metadata": {}, "source": [ "## 3.3 Ingest Data into Oracle" ] }, { "cell_type": "code", "execution_count": null, "id": "be6bdde3", "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm\n", "import array\n", "\n", "\n", "rows = []\n", "for i, row in dataset_df.iterrows():\n", " # Convert embedding list to array.array for proper VECTOR binding\n", " embedding_array = array.array('f', row.get(\"embedding\"))\n", " \n", " rows.append((\n", " row.get(\"arxiv_id\"),\n", " row.get(\"title\"),\n", " row.get(\"abstract\"),\n", " row.get(\"text\"),\n", " embedding_array\n", " ))\n", "\n", "print(f\"Preparing to insert {len(rows)} rows into RESEARCH_PAPERS...\")\n", "\n", "with conn.cursor() as cur:\n", " for row in tqdm(rows):\n", " cur.execute(\n", " \"\"\"\n", " INSERT INTO research_papers (arxiv_id, title, abstract, text, embedding)\n", " VALUES (:1, :2, :3, :4, :5)\n", " \"\"\", \n", " row\n", " )\n", " \n", "conn.commit()\n", "print(\"✅ Data inserted successfully\")" ] }, { "cell_type": "code", "execution_count": null, "id": "a7a14d9b", "metadata": {}, "outputs": [], "source": [ "with conn.cursor() as cur:\n", " cur.execute(\"SELECT COUNT(*) FROM RESEARCH_PAPERS\")\n", " print(\"Row count:\", cur.fetchone()[0])\n", "\n", " cur.execute(\"\"\"\n", " SELECT arxiv_id, title, abstract, text FROM RESEARCH_PAPERS\n", " FETCH FIRST 3 ROWS ONLY\n", " \"\"\")\n", " for row in cur.fetchall():\n", " print(row)" ] }, { "cell_type": "markdown", "id": "f83ce7ad", "metadata": {}, "source": [ "## 3.4 Build and Register Graph Relationships\n" ] }, { "cell_type": "markdown", "id": 
"5de4e73b", "metadata": {}, "source": [ "### 3.4.1 Load Author Edges (`Author -> WROTE -> Paper`)\n", "\n", "Here we normalize the dataset `authors` field and populate `AUTHORS` + `PAPER_AUTHORS` so graph traversal can connect papers through shared authors.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8d79b490", "metadata": {}, "outputs": [], "source": [ "def normalize_authors(raw_authors):\n", " if raw_authors is None:\n", " return []\n", " if isinstance(raw_authors, str):\n", " return [a.strip() for a in raw_authors.split(\",\") if a.strip()]\n", " if isinstance(raw_authors, list):\n", " cleaned = []\n", " for author in raw_authors:\n", " text = str(author).strip()\n", " if text:\n", " cleaned.append(text)\n", " return cleaned\n", " return []\n", "\n", "author_rows = set()\n", "paper_author_rows = set()\n", "\n", "for _, row in dataset_df.iterrows():\n", " arxiv_id = row.get(\"arxiv_id\")\n", " if not arxiv_id:\n", " continue\n", "\n", " for author_name in normalize_authors(row.get(\"authors\")):\n", " author_rows.add((author_name,))\n", " paper_author_rows.add((arxiv_id, author_name))\n", "\n", "with conn.cursor() as cur:\n", " cur.execute(\"TRUNCATE TABLE paper_authors\")\n", " cur.execute(\"TRUNCATE TABLE authors\")\n", "\n", " if author_rows:\n", " cur.executemany(\n", " \"INSERT INTO authors (author_name) VALUES (:1)\",\n", " sorted(author_rows)\n", " )\n", "\n", " if paper_author_rows:\n", " cur.executemany(\n", " \"INSERT INTO paper_authors (arxiv_id, author_name) VALUES (:1, :2)\",\n", " sorted(paper_author_rows)\n", " )\n", "\n", "conn.commit()\n", "print(f\"✅ Loaded {len(author_rows)} authors and {len(paper_author_rows)} Author->WROTE->Paper edges.\")\n" ] }, { "cell_type": "markdown", "id": "235952fd", "metadata": {}, "source": [ "### 3.4.2 Load Similarity Edges (`Paper -> SIMILAR_TO -> Paper`)\n", "\n", "Using the normalized embeddings already computed for vector search, we add top-k nearest-neighbor paper links and persist them as 
directed similarity edges.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8da9ea5f", "metadata": {}, "outputs": [], "source": [ "TOP_SIM_NEIGHBORS = 10\n", "\n", "paper_ids = dataset_df[\"arxiv_id\"].tolist()\n", "emb_matrix = np.vstack(dataset_df[\"embedding\"].values).astype(np.float32)\n", "\n", "neighbor_k = min(TOP_SIM_NEIGHBORS, max(len(paper_ids) - 1, 0))\n", "similarity_rows = []\n", "\n", "if neighbor_k > 0:\n", " similarity_matrix = emb_matrix @ emb_matrix.T\n", " np.fill_diagonal(similarity_matrix, -np.inf)\n", "\n", " for i, source_arxiv_id in enumerate(paper_ids):\n", " nn_idx = np.argpartition(similarity_matrix[i], -neighbor_k)[-neighbor_k:]\n", " nn_idx = nn_idx[np.argsort(similarity_matrix[i][nn_idx])[::-1]]\n", "\n", " for rank_no, j in enumerate(nn_idx, start=1):\n", " similarity_rows.append((\n", " source_arxiv_id,\n", " paper_ids[j],\n", " float(similarity_matrix[i][j]),\n", " rank_no\n", " ))\n", "\n", "with conn.cursor() as cur:\n", " cur.execute(\"TRUNCATE TABLE paper_similarities\")\n", " if similarity_rows:\n", " cur.executemany(\n", " \"\"\"\n", " INSERT INTO paper_similarities (source_arxiv_id, target_arxiv_id, sim_score, rank_no)\n", " VALUES (:1, :2, :3, :4)\n", " \"\"\",\n", " similarity_rows\n", " )\n", "\n", "conn.commit()\n", "print(f\"✅ Loaded {len(similarity_rows)} Paper->SIMILAR_TO->Paper edges (top {neighbor_k} per paper).\")\n" ] }, { "cell_type": "markdown", "id": "d8fdf463", "metadata": {}, "source": [ "### 3.4.3 Register the Oracle SQL Property Graph\n", "\n", "This cell creates a SQL Property Graph (`RESEARCH_GRAPH`) over `RESEARCH_PAPERS`, `AUTHORS`, `PAPER_AUTHORS`, and `PAPER_SIMILARITIES`, so we can use `GRAPH_TABLE` pattern matching for graph-native retrieval.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6ca2ff21", "metadata": {}, "outputs": [], "source": [ "with conn.cursor() as cur:\n", " cur.execute(\"\"\"\n", " SELECT COUNT(*)\n", " FROM user_property_graphs\n", " WHERE graph_name 
= 'RESEARCH_GRAPH'\n", " \"\"\")\n", " graph_exists = cur.fetchone()[0] > 0\n", "\n", " if graph_exists:\n", " cur.execute(\"DROP PROPERTY GRAPH research_graph\")\n", "\n", " cur.execute(\"\"\"\n", " CREATE PROPERTY GRAPH research_graph\n", " VERTEX TABLES (\n", " research_papers\n", " KEY (arxiv_id)\n", " LABEL paper\n", " PROPERTIES (arxiv_id, title, abstract),\n", " authors\n", " KEY (author_name)\n", " LABEL author\n", " PROPERTIES (author_name)\n", " )\n", " EDGE TABLES (\n", " paper_authors\n", " KEY (arxiv_id, author_name)\n", " SOURCE KEY (author_name) REFERENCES authors (author_name)\n", " DESTINATION KEY (arxiv_id) REFERENCES research_papers (arxiv_id)\n", " LABEL wrote,\n", " paper_similarities\n", " KEY (source_arxiv_id, target_arxiv_id)\n", " SOURCE KEY (source_arxiv_id) REFERENCES research_papers (arxiv_id)\n", " DESTINATION KEY (target_arxiv_id) REFERENCES research_papers (arxiv_id)\n", " LABEL similar_to\n", " PROPERTIES (sim_score, rank_no)\n", " )\n", " \"\"\")\n", "\n", "conn.commit()\n", "print(\"✅ SQL Property Graph RESEARCH_GRAPH created.\")\n" ] }, { "cell_type": "markdown", "id": "3d9788fc", "metadata": {}, "source": [ "# Part 4. 
Retrieval Mechanisms" ] }, { "cell_type": "code", "execution_count": null, "id": "ab38912e", "metadata": {}, "outputs": [], "source": [ "SEARCH_TEXT_KEYWORDS = \"optimization\"" ] }, { "cell_type": "markdown", "id": "fa03c53c", "metadata": {}, "source": [ "## 4.1 Text Based Retrieval" ] }, { "cell_type": "markdown", "id": "154d31cd", "metadata": {}, "source": [ "The code below performs a full-text search over the text column of the research_papers table, using the Oracle Text index we just created earlier\n", "\n", "- CONTAINS(text, :keyword, 1) uses the Oracle Text index on the text column to find documents containing the given keyword or phrase.\n", "- SCORE(1) assigns a relevance score based on how well each document matches the search term.\n", "- SUBSTR(text, 1, 200) returns the first 200 characters as a short preview snippet.\n", "- FETCH FIRST 10 ROWS ONLY limits the results to the top 10 most relevant matches." ] }, { "cell_type": "code", "execution_count": null, "id": "c7fbb1a9", "metadata": {}, "outputs": [], "source": [ "def keyword_search_research_papers(conn, keyword: str):\n", " \"\"\"\n", " Perform a full-text keyword search on the 'text' column \n", " using the Oracle Text index (rp_text_idx).\n", "\n", " Args:\n", " conn: Oracle database connection object.\n", " keyword (str): Keyword or phrase to search for.\n", "\n", " Returns:\n", " tuple: (rows, columns)\n", " \"\"\"\n", " query = \"\"\"\n", " SELECT \n", " arxiv_id, \n", " title, \n", " SUBSTR(text, 1, 200) AS text_snippet,\n", " SCORE(1) AS relevance_score\n", " FROM research_papers\n", " WHERE CONTAINS(text, :keyword, 1) > 0\n", " ORDER BY SCORE(1) DESC\n", " FETCH FIRST 10 ROWS ONLY\n", " \"\"\"\n", "\n", " with conn.cursor() as cur:\n", " cur.execute(query, keyword=keyword)\n", " rows = cur.fetchall()\n", " columns = [desc[0] for desc in cur.description]\n", "\n", " return rows, columns\n" ] }, { "cell_type": "code", "execution_count": null, "id": "bbbe04fe", "metadata": {}, "outputs": [], 
"source": [ "rows, columns = keyword_search_research_papers(conn, SEARCH_TEXT_KEYWORDS)\n", "\n", "results_df = pd.DataFrame(rows, columns=columns)\n", "\n", "print(f\"🔍 Keyword Search: '{SEARCH_TEXT_KEYWORDS}'\")\n", "print(f\"📊 Results: {len(results_df)}\\n\")\n", "\n", "results_df\n" ] }, { "cell_type": "markdown", "id": "6c352afa", "metadata": {}, "source": [ "As shown above, the returned rows represent documents whose indexed `text` content matched the full-text search query **\"Optimization\"**, as determined by the Oracle Text `CONTAINS()` operator using the `rp_text_idx` index.\n" ] }, { "cell_type": "markdown", "id": "22ff4f6b", "metadata": {}, "source": [ "## 4.2 Vector Based Retrieval" ] }, { "cell_type": "code", "execution_count": null, "id": "8d0cc6e5", "metadata": {}, "outputs": [], "source": [ "SEARCH_QUERY = \"Get me papers related to planetary exploration\"" ] }, { "cell_type": "markdown", "id": "023c96d4", "metadata": {}, "source": [ "**Vector Similarity Search in Oracle**\n", "\n", "This function below `vector_search_research_papers` performs a **semantic vector search** on the `research_papers` table, retrieving papers most similar to a given query using **Oracle’s native VECTOR search** feature.\n", "\n", "\n", "Step-by-step overview\n", "\n", "1. Encode the search query\n", "```python\n", "query_embedding = embedding_model.encode(\n", " [f\"search_query: {search_query}\"],\n", " convert_to_numpy=True,\n", " normalize_embeddings=True\n", ")[0].astype(np.float32).tolist()\n", "```\n", "- The query is embedded using the same model that generated the document embeddings. \n", "- The prefix `\"search_query:\"` ensures embeddings align with `\"search_document:\"` vectors. \n", "- Normalized and cast to `float32` for Oracle’s `VECTOR` type.\n", "\n", "\n", "2. 
Prepare for database binding\n", "```python\n", "query_embedding_array = array.array('f', query_embedding)\n", "```\n", "- Converts the embedding list into a binary float array (`array('f')`), required for Oracle’s vector binding.\n", "\n", "\n", "3. Perform vector search\n", "```sql\n", "SELECT arxiv_id, title, abstract,\n", "       SUBSTR(text, 1, 200) AS text_snippet,\n", "       ROUND(1 - VECTOR_DISTANCE(embedding, :q, COSINE), 4) AS similarity_score\n", "FROM research_papers\n", "ORDER BY similarity_score DESC\n", "FETCH APPROX FIRST {top_k} ROWS ONLY WITH TARGET ACCURACY 90\n", "```\n", "- Computes **cosine distance** between the query vector and stored embeddings. \n", "- Orders results by descending similarity, returning the closest matches first. \n", "- Uses **Approximate Nearest Neighbor (ANN)** search for fast retrieval at 90% target accuracy." ] }, { "cell_type": "code", "execution_count": null, "id": "b46a3f34", "metadata": {}, "outputs": [], "source": [ "def vector_search_research_papers(conn, embedding_model, search_query: str, top_k: int = 5):\n", "    \"\"\"\n", "    Perform a vector similarity search on the research_papers table using a query embedding.\n", "    Returns cosine similarity scores (higher = more similar).\n", "    \"\"\"\n", "\n", "    # 1️⃣ Encode the query into a vector\n", "    query_embedding = embedding_model.encode(\n", "        [f\"search_query: {search_query}\"],\n", "        convert_to_numpy=True,\n", "        normalize_embeddings=True\n", "    )[0].astype(np.float32).tolist()\n", "\n", "    # 2️⃣ Prepare the vector for Oracle binding\n", "    query_embedding_array = array.array('f', query_embedding)\n", "\n", "    # 3️⃣ Run a vector similarity search using cosine similarity\n", "    query = f\"\"\"\n", "        SELECT\n", "            arxiv_id,\n", "            title,\n", "            abstract,\n", "            SUBSTR(text, 1, 200) AS text_snippet,\n", "            ROUND(1 - VECTOR_DISTANCE(embedding, :q, COSINE), 4) AS similarity_score\n", "        FROM research_papers\n", "        ORDER BY similarity_score DESC\n", "        FETCH APPROX FIRST {top_k} ROWS ONLY WITH TARGET ACCURACY 
90\n", "    \"\"\"\n", "\n", "    # 4️⃣ Execute and return results\n", "    with conn.cursor() as cur:\n", "        cur.execute(query, q=query_embedding_array)\n", "        rows = cur.fetchall()\n", "        columns = [desc[0] for desc in cur.description]\n", "\n", "    return rows, columns\n" ] }, { "cell_type": "code", "execution_count": null, "id": "dca16bb6", "metadata": {}, "outputs": [], "source": [ "rows, columns = vector_search_research_papers(conn, embedding_model, SEARCH_QUERY, top_k=5)\n", "\n", "results_df = pd.DataFrame(rows, columns=columns)\n", "\n", "print(f\"🔍 Vector Search: '{SEARCH_QUERY}'\")\n", "print(f\"📊 Results: {len(results_df)}\\n\")\n", "\n", "results_df\n" ] }, { "cell_type": "markdown", "id": "b8c8bf3e", "metadata": {}, "source": [ "In this step, we used Oracle Database’s native vector search capabilities to perform a semantic similarity search over the research_papers dataset.\n", "\n", "The text query — “Get me papers related to planetary exploration” — was first transformed into a high-dimensional embedding vector using the same model that generated our document embeddings (nomic-embed-text-v1.5).\n", "This query vector captures the semantic meaning of the phrase rather than just the exact words.\n", "\n", "Using Oracle’s VECTOR_DISTANCE(..., COSINE) function (converted to 1 - distance), we then compared this query vector against the stored embeddings for each paper in the database.\n", "The result is a ranked list of research papers ordered by cosine similarity, where higher scores indicate stronger semantic alignment with the search query."
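, "\n", "Since the embeddings were normalized at encoding time, this cosine similarity is equivalent to a plain dot product. As a quick sanity check (a standalone NumPy sketch, independent of the database), the `1 - distance` conversion can be reproduced on two toy vectors:\n", "\n", "```python\n", "import numpy as np\n", "\n", "def cosine_distance(a, b):\n", "    # Cosine distance in the same sense as VECTOR_DISTANCE(..., COSINE):\n", "    # 1 - (a . b) / (|a| * |b|)\n", "    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))\n", "\n", "# Two toy 'embeddings', unit-normalized like the model output\n", "a = np.array([3.0, 4.0]); a = a / np.linalg.norm(a)\n", "b = np.array([4.0, 3.0]); b = b / np.linalg.norm(b)\n", "\n", "similarity = 1.0 - cosine_distance(a, b)   # the score the query reports\n", "print(round(similarity, 4))  # prints 0.96\n", "```\n"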
] }, { "cell_type": "markdown", "id": "a93fd0c7", "metadata": {}, "source": [ "## 4.3 Hybrid Retrieval (Vector + Text Combined)" ] }, { "cell_type": "markdown", "id": "c1972f31", "metadata": {}, "source": [ "#### 4.3.1 Pre Filtering" ] }, { "cell_type": "code", "execution_count": null, "id": "e07236a8", "metadata": {}, "outputs": [], "source": [ "import array\n", "import numpy as np\n", "\n", "def hybrid_search_research_papers_pre_filter(\n", " conn,\n", " embedding_model,\n", " search_phrase: str,\n", " top_k: int = 10,\n", " show_explain: bool = False\n", "):\n", " \"\"\"\n", " Perform a hybrid search using Oracle Text + Vector Search.\n", " Combines lexical filtering (CONTAINS) with semantic re-ranking via cosine similarity.\n", "\n", " Args:\n", " conn: Oracle database connection object.\n", " embedding_model: Model with `.encode()` method (e.g., SentenceTransformer).\n", " search_phrase (str): User search phrase used for both text filtering and embedding.\n", " top_k (int): Number of results to return (default = 10).\n", " show_explain (bool): If True, prints the execution plan.\n", "\n", " Returns:\n", " tuple: (rows, columns, exec_plan_text or None)\n", " \"\"\"\n", "\n", " # --- Step 1: Encode search phrase into normalized vector ---\n", " query_embedding = embedding_model.encode(\n", " [f\"search_query: {search_phrase}\"],\n", " convert_to_numpy=True,\n", " normalize_embeddings=True\n", " )[0].astype(np.float32).tolist()\n", " query_embedding_array = array.array('f', query_embedding)\n", "\n", " with conn.cursor() as cur:\n", " # Enable runtime stats if needed\n", " if show_explain:\n", " cur.execute(\"ALTER SESSION SET statistics_level = ALL\")\n", "\n", " # --- Step 2: Hybrid query (Oracle Text + Vector) ---\n", " sql = f\"\"\"\n", " SELECT {\"/*+ GATHER_PLAN_STATISTICS */\" if show_explain else \"\"}\n", " arxiv_id,\n", " title,\n", " abstract,\n", " SUBSTR(text, 1, 200) AS text_snippet,\n", " ROUND(1 - VECTOR_DISTANCE(embedding, :q, COSINE), 4) AS 
similarity_score\n", " FROM research_papers\n", " WHERE CONTAINS(text, :kw, 1) > 0\n", " ORDER BY similarity_score DESC\n", " FETCH APPROX FIRST {top_k} ROWS ONLY WITH TARGET ACCURACY 90\n", " \"\"\"\n", "\n", " cur.execute(sql, q=query_embedding_array, kw=search_phrase)\n", " rows = cur.fetchall()\n", " columns = [desc[0] for desc in cur.description]\n", "\n", " # --- Step 3: Execution plan (optional) ---\n", " exec_plan_text = None\n", " if show_explain:\n", " with conn.cursor() as cur_plan:\n", " cur_plan.execute(\"\"\"\n", " SELECT plan_table_output\n", " FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST +PREDICATE'))\n", " \"\"\")\n", " exec_plan_text = \"\\n\".join(r[0] for r in cur_plan.fetchall())\n", "\n", " print(\"\\n====== Execution Plan (DBMS_XPLAN.DISPLAY_CURSOR) ======\")\n", " print(exec_plan_text)\n", " print(\"========================================================\\n\")\n", "\n", " return rows, columns, exec_plan_text\n" ] }, { "cell_type": "code", "execution_count": null, "id": "91ba38db", "metadata": {}, "outputs": [], "source": [ "rows, columns, exec_plan = hybrid_search_research_papers_pre_filter(\n", " conn,\n", " embedding_model,\n", " search_phrase=SEARCH_TEXT_KEYWORDS,\n", " top_k=10,\n", " show_explain=False\n", ")\n", "\n", "pre_filter_results_df = pd.DataFrame(rows, columns=columns)\n", "\n", "print(f\"🔍 Hybrid Search: '{SEARCH_TEXT_KEYWORDS}'\")\n", "print(f\"📊 Found {len(pre_filter_results_df)} results\\n\")\n", "\n", "pre_filter_results_df\n" ] }, { "cell_type": "markdown", "id": "a1e5fda2", "metadata": {}, "source": [ "#### 4.3.2 Post Filtering" ] }, { "cell_type": "code", "execution_count": null, "id": "ae572091", "metadata": {}, "outputs": [], "source": [ "import array\n", "import numpy as np\n", "\n", "def hybrid_search_research_papers_postfilter(\n", " conn,\n", " embedding_model,\n", " search_phrase: str,\n", " top_k: int = 10,\n", " candidate_k: int = 200,\n", " show_explain: bool = False\n", "):\n", " 
\"\"\"\n", " Perform a hybrid search using Vector Search first, then Oracle Text filtering.\n", " Returns top results ranked by semantic similarity but filtered by lexical match.\n", "\n", " Args:\n", " conn: Oracle database connection object.\n", " embedding_model: Model with `.encode()` method (e.g., SentenceTransformer).\n", " search_phrase (str): Search phrase used for both embedding and text filtering.\n", " top_k (int): Number of top results to return (default = 10).\n", " candidate_k (int): Number of initial vector candidates (default = 200).\n", " show_explain (bool): If True, prints the execution plan.\n", "\n", " Returns:\n", " tuple: (rows, columns, exec_plan_text or None)\n", " \"\"\"\n", "\n", " # --- Step 1: Encode search phrase into a normalized query vector ---\n", " query_embedding = embedding_model.encode(\n", " [f\"search_query: {search_phrase}\"],\n", " convert_to_numpy=True,\n", " normalize_embeddings=True\n", " )[0].astype(np.float32).tolist()\n", " query_embedding_array = array.array('f', query_embedding)\n", "\n", " with conn.cursor() as cur:\n", " # Enable runtime statistics if requested\n", " if show_explain:\n", " cur.execute(\"ALTER SESSION SET statistics_level = ALL\")\n", "\n", " # --- Step 2: Hybrid query (Vector first → Text filter) ---\n", " sql = f\"\"\"\n", " WITH vec_candidates AS (\n", " SELECT\n", " arxiv_id,\n", " title,\n", " abstract,\n", " text,\n", " 1 - VECTOR_DISTANCE(embedding, :q, COSINE) AS similarity_score\n", " FROM research_papers\n", " ORDER BY similarity_score DESC\n", " FETCH APPROX FIRST {candidate_k} ROWS ONLY WITH TARGET ACCURACY 90\n", " )\n", " SELECT {\"/*+ GATHER_PLAN_STATISTICS */\" if show_explain else \"\"}\n", " arxiv_id,\n", " title,\n", " SUBSTR(text, 1, 200) AS text_snippet,\n", " ROUND(similarity_score, 4) AS similarity_score\n", " FROM vec_candidates\n", " WHERE CONTAINS(text, :kw, 1) > 0\n", " ORDER BY similarity_score DESC\n", " FETCH FIRST {top_k} ROWS ONLY\n", " \"\"\"\n", "\n", " 
cur.execute(sql, q=query_embedding_array, kw=search_phrase)\n", "        rows = cur.fetchall()\n", "        columns = [desc[0] for desc in cur.description]\n", "\n", "    # --- Step 3: Fetch and display execution plan (optional) ---\n", "    exec_plan_text = None\n", "    if show_explain:\n", "        with conn.cursor() as cur_plan:\n", "            cur_plan.execute(\"\"\"\n", "                SELECT plan_table_output\n", "                FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST +PREDICATE'))\n", "            \"\"\")\n", "            exec_plan_text = \"\\n\".join(r[0] for r in cur_plan.fetchall())\n", "\n", "        print(\"\\n====== Execution Plan (DBMS_XPLAN.DISPLAY_CURSOR) ======\")\n", "        print(exec_plan_text)\n", "        print(\"========================================================\\n\")\n", "\n", "    return rows, columns, exec_plan_text\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d4048cf0", "metadata": {}, "outputs": [], "source": [ "rows, columns, exec_plan = hybrid_search_research_papers_postfilter(\n", "    conn,\n", "    embedding_model,\n", "    search_phrase=SEARCH_TEXT_KEYWORDS,\n", "    top_k=10,\n", "    show_explain=False\n", ")\n", "\n", "post_filter_results_df = pd.DataFrame(rows, columns=columns)\n", "\n", "print(f\"🔍 Hybrid Search: '{SEARCH_TEXT_KEYWORDS}'\")\n", "print(f\"📊 Found {len(post_filter_results_df)} results\\n\")\n", "\n", "post_filter_results_df\n" ] }, { "cell_type": "markdown", "id": "a7ba61b5", "metadata": {}, "source": [ "Compare the pre-filtering and post-filtering techniques side by side in the table below." ] }, { "cell_type": "code", "execution_count": null, "id": "a5e521f7", "metadata": {}, "outputs": [], "source": [ "# Show the results side by side\n", "pd.concat([pre_filter_results_df, post_filter_results_df], axis=1)\n" ] }, { "cell_type": "markdown", "id": "b041cd8b", "metadata": {}, "source": [ "| Approach | Strength | Best For |\n", "| --------------------------------- | ----------------------------------- | --------------------------------------- |\n", "| **Pre-filter** (`CONTAINS` first) | Fast, keyword-strict | Narrow 
keyword search |\n", "| **Post-filter** (this one) | Semantically rich but still precise | Broader exploratory or research queries |\n" ] }, { "cell_type": "markdown", "id": "8a008720", "metadata": {}, "source": [ "#### 4.3.3 Reciprocal Rank Fusion" ] }, { "cell_type": "code", "execution_count": null, "id": "7aefbf4e", "metadata": {}, "outputs": [], "source": [ "import array\n", "import numpy as np\n", "import oracledb\n", "\n", "def hybrid_rrf_search(\n", "    conn,\n", "    embedding_model,\n", "    search_phrase: str,\n", "    top_k: int = 10,\n", "    per_list: int = 120,   # candidates from each list before fusion (>= 10x top_k is a good rule)\n", "    k: int = 60,           # RRF smoothing constant (60 is standard)\n", "    phrase_safe: bool = True,\n", "    show_explain: bool = False\n", "):\n", "    \"\"\"\n", "    Local-friendly RRF fusion of Vector + Oracle Text results on research_papers(text, embedding).\n", "\n", "    Prereqs (local/Docker Free OK):\n", "    - VECTOR column/index on research_papers(embedding) -- HNSW\n", "    - Oracle Text index on research_papers(text) -- e.g. 
CREATE SEARCH INDEX rp_text_idx ON research_papers(text);\n", "\n", "    RRF = 1/(k + r_vec) + 1/(k + r_txt), where r_vec and r_txt are ranks (1 = best).\n", "    \"\"\"\n", "\n", "    # 1) Encode query for vector modality (align with your doc prefixing scheme)\n", "    qv = embedding_model.encode(\n", "        [f\"search_query: {search_phrase}\"],\n", "        convert_to_numpy=True,\n", "        normalize_embeddings=True\n", "    )[0].astype(np.float32).tolist()\n", "    qv = array.array('f', qv)\n", "\n", "    # 2) Phrase-safe text query for Oracle Text (optional)\n", "    kw = f\"\\\"{search_phrase}\\\"\" if (phrase_safe and \" \" in search_phrase.strip()) else search_phrase\n", "\n", "    with conn.cursor() as cur:\n", "        if show_explain:\n", "            cur.execute(\"ALTER SESSION SET statistics_level = ALL\")\n", "\n", "        sql = f\"\"\"\n", "        WITH\n", "        /* Vector top-N with ranks (higher similarity first) */\n", "        vec AS (\n", "            SELECT\n", "                arxiv_id,\n", "                title,\n", "                SUBSTR(text, 1, 200) AS text_snippet,\n", "                1 - VECTOR_DISTANCE(embedding, :q, COSINE) AS sim_vec,\n", "                ROW_NUMBER() OVER (ORDER BY 1 - VECTOR_DISTANCE(embedding, :q, COSINE) DESC) AS r_vec\n", "            FROM research_papers\n", "            ORDER BY VECTOR_DISTANCE(embedding, :q, COSINE)\n", "            FETCH APPROX FIRST {per_list} ROWS ONLY WITH TARGET ACCURACY 90\n", "        ),\n", "        /* Oracle Text top-N with ranks (higher SCORE(1) first) */\n", "        txt AS (\n", "            SELECT\n", "                arxiv_id,\n", "                title,\n", "                SUBSTR(text, 1, 200) AS text_snippet,\n", "                SCORE(1) AS score_txt,\n", "                ROW_NUMBER() OVER (ORDER BY SCORE(1) DESC) AS r_txt\n", "            FROM research_papers\n", "            WHERE CONTAINS(text, :kw, 1) > 0\n", "            ORDER BY SCORE(1) DESC\n", "            FETCH FIRST {per_list} ROWS ONLY\n", "        ),\n", "        /* Fuse by arxiv_id; keep docs present in either list */\n", "        fused AS (\n", "            SELECT\n", "                COALESCE(v.arxiv_id, t.arxiv_id) AS arxiv_id,\n", "                COALESCE(v.title, t.title) AS title,\n", "                COALESCE(v.text_snippet, t.text_snippet) AS text_snippet,\n", "                NVL(v.r_vec, 999999) AS r_vec,\n", "                NVL(t.r_txt, 999999) AS r_txt,\n", "                NVL(v.sim_vec, 0) AS sim_vec,\n", "                NVL(t.score_txt, 0) AS score_txt\n", 
" FROM vec v\n", " FULL OUTER JOIN txt t\n", " ON t.arxiv_id = v.arxiv_id\n", " )\n", " SELECT {\"/*+ GATHER_PLAN_STATISTICS */\" if show_explain else \"\"}\n", " arxiv_id,\n", " title,\n", " text_snippet,\n", " ROUND( (1.0/(:k + r_vec)) + (1.0/(:k + r_txt)), 6 ) AS rrf_score,\n", " r_vec,\n", " r_txt,\n", " ROUND(sim_vec, 4) AS sim_vec,\n", " ROUND(score_txt,4) AS score_txt\n", " FROM fused\n", " ORDER BY rrf_score DESC, title\n", " FETCH FIRST {top_k} ROWS ONLY\n", " \"\"\"\n", "\n", " cur.execute(sql, q=qv, kw=kw, k=k)\n", " rows = cur.fetchall()\n", " columns = [d[0] for d in cur.description]\n", "\n", " exec_plan_text = None\n", " if show_explain:\n", " with conn.cursor() as cur_plan:\n", " cur_plan.execute(\"\"\"\n", " SELECT plan_table_output\n", " FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST +PREDICATE'))\n", " \"\"\")\n", " exec_plan_text = \"\\n\".join(r[0] for r in cur_plan.fetchall())\n", " print(\"\\n====== Execution Plan (DBMS_XPLAN.DISPLAY_CURSOR) ======\")\n", " print(exec_plan_text)\n", " print(\"========================================================\\n\")\n", "\n", " return rows, columns, exec_plan_text\n" ] }, { "cell_type": "code", "execution_count": null, "id": "128e687c", "metadata": {}, "outputs": [], "source": [ "rows, columns, exec_plan = hybrid_rrf_search(\n", " conn,\n", " embedding_model,\n", " search_phrase=SEARCH_TEXT_KEYWORDS,\n", " top_k=3,\n", " show_explain=False\n", ")\n", "\n", "rrf_results_df = pd.DataFrame(rows, columns=columns)\n", "\n", "print(f\"🔍 Hybrid Search: '{SEARCH_TEXT_KEYWORDS}'\")\n", "print(f\"📊 Found {len(rrf_results_df)} results\\n\")\n", "\n", "rrf_results_df\n" ] }, { "cell_type": "markdown", "id": "2a80ec43", "metadata": {}, "source": [ "## 4.4 Graph-Based Retrieval (Oracle SQL Property Graph)\n", "\n", "In this retrieval mode, we seed with vector similarity, then expand candidates through graph paths:\n", "- `Paper -> SIMILAR_TO -> Paper`\n", "- `Paper <- WROTE - Author - WROTE -> 
Paper`\n", "\n", "The final score blends seed vector relevance and graph-path evidence.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "250ef601", "metadata": {}, "outputs": [], "source": [ "def graph_search_research_papers(\n", " conn,\n", " embedding_model,\n", " search_query: str,\n", " top_k: int = 10,\n", " seed_k: int = 25\n", "):\n", " \"\"\"\n", " Graph retrieval using Oracle SQL Property Graph + GRAPH_TABLE.\n", " Seeds from vector similarity, then expands via SIMILAR_TO and shared-author paths.\n", " \"\"\"\n", "\n", " seed_k = max(seed_k, top_k)\n", "\n", " query_embedding = embedding_model.encode(\n", " [f\"search_query: {search_query}\"],\n", " convert_to_numpy=True,\n", " normalize_embeddings=True\n", " )[0].astype(np.float32).tolist()\n", " \n", " query_embedding_array = array.array('f', query_embedding)\n", "\n", " sql = f\"\"\"\n", " WITH seed AS (\n", " SELECT\n", " arxiv_id,\n", " 1 - VECTOR_DISTANCE(embedding, :q, COSINE) AS seed_score\n", " FROM research_papers\n", " ORDER BY seed_score DESC\n", " FETCH APPROX FIRST {seed_k} ROWS ONLY WITH TARGET ACCURACY 90\n", " ),\n", " seed_hits AS (\n", " SELECT\n", " arxiv_id AS source_arxiv_id,\n", " arxiv_id AS candidate_arxiv_id,\n", " seed_score,\n", " 'seed' AS relation_type,\n", " seed_score AS edge_score\n", " FROM seed\n", " ),\n", " sim_hops AS (\n", " SELECT\n", " s.arxiv_id AS source_arxiv_id,\n", " gt.target_arxiv_id AS candidate_arxiv_id,\n", " s.seed_score,\n", " 'similar_to' AS relation_type,\n", " gt.edge_score AS edge_score\n", " FROM seed s\n", " JOIN GRAPH_TABLE(\n", " research_graph\n", " MATCH (src IS paper)-[e IS similar_to]->(dst IS paper)\n", " COLUMNS (\n", " src.arxiv_id AS source_arxiv_id,\n", " dst.arxiv_id AS target_arxiv_id,\n", " e.sim_score AS edge_score\n", " )\n", " ) gt\n", " ON gt.source_arxiv_id = s.arxiv_id\n", " ),\n", " author_hops AS (\n", " SELECT\n", " s.arxiv_id AS source_arxiv_id,\n", " gt.target_arxiv_id AS candidate_arxiv_id,\n", " 
s.seed_score,\n", " 'shared_author' AS relation_type,\n", " 1.0 AS edge_score\n", " FROM seed s\n", " JOIN GRAPH_TABLE(\n", " research_graph\n", " MATCH (src IS paper)<-[w1 IS wrote]-(a IS author)-[w2 IS wrote]->(dst IS paper)\n", " COLUMNS (\n", " src.arxiv_id AS source_arxiv_id,\n", " dst.arxiv_id AS target_arxiv_id\n", " )\n", " ) gt\n", " ON gt.source_arxiv_id = s.arxiv_id\n", " WHERE gt.target_arxiv_id <> s.arxiv_id\n", " ),\n", " candidates AS (\n", " SELECT * FROM seed_hits\n", " UNION ALL\n", " SELECT * FROM sim_hops\n", " UNION ALL\n", " SELECT * FROM author_hops\n", " ),\n", " scored AS (\n", " SELECT\n", " candidate_arxiv_id AS arxiv_id,\n", " MAX(\n", " CASE relation_type\n", " WHEN 'seed' THEN seed_score\n", " WHEN 'similar_to' THEN (0.70 * seed_score) + (0.30 * edge_score)\n", " WHEN 'shared_author' THEN (0.85 * seed_score) + (0.15 * edge_score)\n", " ELSE seed_score\n", " END\n", " ) AS graph_score\n", " FROM candidates\n", " GROUP BY candidate_arxiv_id\n", " )\n", " SELECT\n", " rp.arxiv_id,\n", " rp.title,\n", " rp.abstract,\n", " SUBSTR(rp.text, 1, 200) AS text_snippet,\n", " ROUND(sc.graph_score, 4) AS graph_score\n", " FROM scored sc\n", " JOIN research_papers rp\n", " ON rp.arxiv_id = sc.arxiv_id\n", " ORDER BY graph_score DESC\n", " FETCH FIRST {top_k} ROWS ONLY\n", " \"\"\"\n", "\n", " with conn.cursor() as cur:\n", " cur.execute(sql, q=query_embedding_array)\n", " rows = cur.fetchall()\n", " columns = [desc[0] for desc in cur.description]\n", "\n", " return rows, columns\n" ] }, { "cell_type": "markdown", "id": "a69a4d16", "metadata": {}, "source": [ "This quick check runs graph retrieval on the same sample query so you can compare graph-ranked results side-by-side with keyword/vector/hybrid modes.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0256de57", "metadata": {}, "outputs": [], "source": [ "rows, columns = graph_search_research_papers(\n", " conn,\n", " embedding_model,\n", " search_query=SEARCH_TEXT_KEYWORDS,\n", " 
top_k=5,\n", " seed_k=20\n", ")\n", "\n", "graph_results_df = pd.DataFrame(rows, columns=columns)\n", "\n", "print(f\"🔍 Graph Search: '{SEARCH_TEXT_KEYWORDS}'\")\n", "print(f\"📊 Results: {len(graph_results_df)}\\n\")\n", "\n", "graph_results_df\n" ] }, { "cell_type": "markdown", "id": "b1f0ef74", "metadata": {}, "source": [ "### 4.5 Compare Top Results Across Retrieval Strategies\n", "\n", "To make the retrieval behavior easy to inspect, the next cell builds a side-by-side table of titles.\n", "\n", "Each row is a retrieval strategy, and the columns `Top_1` to `Top_5` show the highest-ranked paper titles returned for the same query.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "86f9d8a3", "metadata": {}, "outputs": [], "source": [ "def extract_top_titles(rows, columns, k=5):\n", " \"\"\"Return top-k titles from retrieval results.\"\"\"\n", " titles = []\n", " for row in rows[:k]:\n", " row_data = dict(zip(columns, row))\n", " titles.append(row_data.get(\"TITLE\", \"Untitled\"))\n", " return titles\n", "\n", "\n", "comparison_specs = [\n", " (\"keyword\", lambda: keyword_search_research_papers(conn, SEARCH_TEXT_KEYWORDS)),\n", " (\"vector\", lambda: vector_search_research_papers(conn, embedding_model, SEARCH_TEXT_KEYWORDS, top_k=5)),\n", " (\n", " \"hybrid_pre_filter\",\n", " lambda: hybrid_search_research_papers_pre_filter(\n", " conn=conn,\n", " embedding_model=embedding_model,\n", " search_phrase=SEARCH_TEXT_KEYWORDS,\n", " top_k=5,\n", " show_explain=False,\n", " ),\n", " ),\n", " (\n", " \"hybrid_postfilter\",\n", " lambda: hybrid_search_research_papers_postfilter(\n", " conn=conn,\n", " embedding_model=embedding_model,\n", " search_phrase=SEARCH_TEXT_KEYWORDS,\n", " top_k=5,\n", " candidate_k=200,\n", " show_explain=False,\n", " ),\n", " ),\n", " (\n", " \"hybrid_rrf\",\n", " lambda: hybrid_rrf_search(\n", " conn=conn,\n", " embedding_model=embedding_model,\n", " search_phrase=SEARCH_TEXT_KEYWORDS,\n", " top_k=5,\n", " per_list=60,\n", " 
k=60,\n", " show_explain=False,\n", " ),\n", " ),\n", " (\n", " \"graph\",\n", " lambda: graph_search_research_papers(\n", " conn=conn,\n", " embedding_model=embedding_model,\n", " search_query=SEARCH_TEXT_KEYWORDS,\n", " top_k=5,\n", " seed_k=20,\n", " ),\n", " ),\n", "]\n", "\n", "comparison_rows = []\n", "for strategy_name, runner in comparison_specs:\n", " result = runner()\n", "\n", " if isinstance(result, tuple) and len(result) >= 2:\n", " rows, columns = result[0], result[1]\n", " else:\n", " rows, columns = [], []\n", "\n", " titles = extract_top_titles(rows, columns, k=5)\n", "\n", " record = {\"retrieval_strategy\": strategy_name}\n", " for i in range(5):\n", " record[f\"Top_{i+1}\"] = titles[i] if i < len(titles) else \"\"\n", "\n", " comparison_rows.append(record)\n", "\n", "retrieval_strategy_comparison_df = pd.DataFrame(comparison_rows)\n", "retrieval_strategy_comparison_df\n" ] }, { "cell_type": "markdown", "id": "ed0832de", "metadata": {}, "source": [ "# Part 5: Building a RAG Pipeline\n", "\n", "In this section, we connect retrieval to generation end-to-end.\n", "\n", "Flow for this part:\n", "1. Configure API access and initialize the OpenAI client.\n", "2. Define a reusable RAG function that supports multiple retrieval modes.\n", "3. Retrieve evidence from Oracle and synthesize a grounded answer.\n", "4. Run a sample query to validate the pipeline behavior.\n" ] }, { "cell_type": "markdown", "id": "d4c1c9a2", "metadata": {}, "source": [ "### 5.1 Configure API Access\n", "\n", "The next two cells define a small helper to securely set environment variables and prompt for `OPENAI_API_KEY`. 
This keeps credentials out of source code and notebook outputs.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "741e36c7", "metadata": {}, "outputs": [], "source": [ "import getpass\n", "import os\n", "\n", "# Function to securely get and set environment variables\n", "def set_env_securely(var_name, prompt):\n", " value = getpass.getpass(prompt)\n", " os.environ[var_name] = value" ] }, { "cell_type": "code", "execution_count": null, "id": "e1cf5406", "metadata": {}, "outputs": [], "source": [ "set_env_securely(\"OPENAI_API_KEY\", \"Enter your OpenAI API key: \")" ] }, { "cell_type": "markdown", "id": "1a90d90f", "metadata": {}, "source": [ "### 5.2 Initialize and Smoke-Test the OpenAI Client\n", "\n", "Before building the full pipeline, we perform a lightweight client call to confirm model access and credential configuration.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8f41960f", "metadata": {}, "outputs": [], "source": [ "# Import the OpenAI Python client library\n", "from openai import OpenAI\n", "import os\n", "\n", "# Initialize the OpenAI client (API key read from env var OPENAI_API_KEY)\n", "openai_client = OpenAI(\n", " api_key=os.environ.get(\"OPENAI_API_KEY\"),\n", ")\n", "\n", "# Use the Responses API\n", "response = openai_client.responses.create(\n", " model=\"gpt-5\",\n", " input=\"Hello! I’m a user!\",\n", " instructions=\"You are a research paper assistant.\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "27805348", "metadata": {}, "outputs": [], "source": [ "# Print the output text\n", "print(response.output_text)" ] }, { "cell_type": "markdown", "id": "3d97587f", "metadata": {}, "source": [ "### 5.3 Define the Reusable RAG Function\n", "\n", "This function is the core integration point between retrieval and generation. 
It:\n", "- selects retrieval strategy (`keyword`, `vector`, `hybrid`, or `graph`),\n", "- formats retrieved rows into citation-ready context,\n", "- calls the Responses API with grounding instructions.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "005a01c6", "metadata": {}, "outputs": [], "source": [ "def research_paper_assistant_rag_pipeline(\n", " conn,\n", " embedding_model,\n", " user_query: str,\n", " top_k: int = 10,\n", " retrieval_mode: str = \"hybrid\",\n", " show_explain: bool = False\n", "):\n", " \"\"\"\n", " Research Paper Assistant — Retrieval-Augmented Generation (RAG) pipeline\n", " built on SQL-based retrieval functions and powered by the OpenAI Responses API.\n", "\n", " Retrieval techniques available:\n", " - 'keyword' → uses keyword_search_research_papers()\n", " - 'vector' → uses vector_search_research_papers()\n", " - 'hybrid' → uses hybrid_search_research_papers_pre_filter() [default]\n", " - 'graph' → uses graph_search_research_papers() via SQL Property Graph\n", "\n", " Args:\n", " conn: Oracle database connection.\n", " embedding_model: Embedding model (e.g., SentenceTransformer, Voyage).\n", " user_query (str): Research question from the user.\n", " top_k (int): Number of top documents to retrieve.\n", " retrieval_mode (str): Retrieval method ('keyword', 'vector', 'hybrid', 'graph').\n", " show_explain (bool): Whether to show the SQL execution plan.\n", "\n", " Returns:\n", " str: LLM-generated research synthesis with citations.\n", " \"\"\"\n", "\n", " # ----------------------------------------------------------------------\n", " # 1. 
Retrieve relevant research papers using the selected retrieval mode\n", " # ----------------------------------------------------------------------\n", " if retrieval_mode == \"keyword\":\n", " rows, columns = keyword_search_research_papers(conn, user_query)\n", " exec_plan_text = None\n", "\n", " elif retrieval_mode == \"vector\":\n", " rows, columns = vector_search_research_papers(conn, embedding_model, user_query, top_k)\n", " exec_plan_text = None\n", "\n", " elif retrieval_mode == \"graph\":\n", " rows, columns = graph_search_research_papers(conn, embedding_model, user_query, top_k=top_k)\n", " exec_plan_text = None\n", "\n", " else: # default: hybrid retrieval\n", " rows, columns, exec_plan_text = hybrid_search_research_papers_pre_filter(\n", " conn=conn,\n", " embedding_model=embedding_model,\n", " search_phrase=user_query,\n", " top_k=top_k,\n", " show_explain=show_explain\n", " )\n", "\n", " retrieved_count = len(rows) if rows else 0\n", " print(f\"📊 Retrieved {retrieved_count} papers using {retrieval_mode.upper()} retrieval.\")\n", "\n", "\n", " # ----------------------------------------------------------------------\n", " # 2. 
Convert retrieved rows to formatted LLM context\n", " # ----------------------------------------------------------------------\n", " formatted_context = \"\"\n", " if retrieved_count > 0:\n", " formatted_context += f\"\\n\\n📚 {retrieved_count} relevant research papers retrieved:\\n\\n\"\n", " for i, row in enumerate(rows):\n", " row_data = dict(zip(columns, row))\n", " title = row_data.get(\"TITLE\", \"Untitled Paper\")\n", " abstract = row_data.get(\"ABSTRACT\", \"No abstract available.\")\n", " snippet = row_data.get(\"TEXT_SNIPPET\", \"\")\n", " score = (\n", " row_data.get(\"GRAPH_SCORE\")\n", " or row_data.get(\"SIMILARITY_SCORE\")\n", " or row_data.get(\"RELEVANCE_SCORE\")\n", " or row_data.get(\"TEXT_RELEVANCE_SCORE\")\n", " or \"N/A\"\n", " )\n", " formatted_context += (\n", " f\"[{i+1}] **{title}**\\n\"\n", " f\"Abstract: {abstract}\\n\"\n", " f\"Snippet: {snippet}\\n\"\n", " f\"Relevance Score: {score}\\n\\n\"\n", " )\n", " else:\n", " formatted_context = \"\\n\\n⚠️ No relevant papers were retrieved from the database.\\n\"\n", "\n", " # ----------------------------------------------------------------------\n", " # 3. Construct the prompt for the Responses API\n", " # ----------------------------------------------------------------------\n", " prompt = f\"\"\"\n", " You are a **Research Paper Assistant** that synthesizes academic literature to help answer user questions.\n", "\n", " User Query: {user_query}\n", "\n", " Number of retrieved papers: {retrieved_count}\n", " {formatted_context}\n", "\n", " Please:\n", " - Summarize the findings most relevant to the query.\n", " - Use citation numbers [X] to support claims.\n", " - Highlight consensus, innovation, or research gaps.\n", " - If there is insufficient context, clearly say so.\n", " \"\"\"\n", "\n", " # ----------------------------------------------------------------------\n", " # 4. 
Call the OpenAI Responses API\n", " # ----------------------------------------------------------------------\n", " response = openai_client.responses.create(\n", " model=\"gpt-4o\",\n", " input=prompt,\n", " instructions=\"You are a scientific research assistant. Use only the provided context to answer. Always cite papers [1], [2], etc.\",\n", " temperature=0.3,\n", " )\n", "\n", " # ----------------------------------------------------------------------\n", " # 5. Optionally print SQL execution plan (if hybrid)\n", " # ----------------------------------------------------------------------\n", " if show_explain and exec_plan_text:\n", " print(\"\\n====== SQL Execution Plan ======\")\n", " print(exec_plan_text)\n", " print(\"================================\\n\")\n", "\n", " # ----------------------------------------------------------------------\n", " # 6. Return the LLM’s output text\n", " # ----------------------------------------------------------------------\n", " return response.output_text\n" ] }, { "cell_type": "markdown", "id": "a9e8a3df", "metadata": {}, "source": [ "### 5.4 Run an End-to-End RAG Example\n", "\n", "This invocation exercises the full path from user query to retrieval to synthesized answer, so you can validate retrieval quality and response grounding.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4f4b00b4", "metadata": {}, "outputs": [], "source": [ "summary = research_paper_assistant_rag_pipeline(\n", " conn=conn,\n", " embedding_model=embedding_model,\n", " user_query=SEARCH_TEXT_KEYWORDS,\n", " top_k=5,\n", " retrieval_mode=\"hybrid\", # options: 'keyword', 'vector', 'hybrid', 'graph'\n", " show_explain=False\n", ")\n", "\n", "print(summary)\n" ] }, { "cell_type": "markdown", "id": "104b31b8", "metadata": {}, "source": [ "# Part 6: AI Agents with OpenAI and Oracle AI Database\n", "\n", "In this section, we move from a single RAG call to an agentic system.\n", "\n", "You will build the flow incrementally:\n", "1. 
Start with a base agent.\n", "2. Add retrieval tools backed by Oracle SQL.\n", "3. Upgrade to multi-tool routing and orchestration.\n", "4. Add conversation history and persistent session memory.\n" ] }, { "cell_type": "markdown", "id": "41c4bd7f", "metadata": {}, "source": [ "### 6.1 Install Agent Runtime and Verify Package Compatibility\n", "\n", "Use the notebook-native `%pip` command so installation happens in the **active kernel environment**.\n", "\n", "We install `openai` and `openai-agents` together to avoid version mismatches, then print resolved versions before importing `agents`.\n", "\n", "If you still see import errors after install, restart the kernel and run this section again.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "74defe4e", "metadata": {}, "outputs": [], "source": [ "%pip install -Uq --no-cache-dir openai openai-agents" ] }, { "cell_type": "code", "execution_count": null, "id": "0d87a8fa", "metadata": {}, "outputs": [], "source": [ "import sys\n", "import importlib.metadata as md\n", "\n", "print(\"Python executable:\", sys.executable)\n", "print(\"openai version:\", md.version(\"openai\"))\n", "print(\"openai-agents version:\", md.version(\"openai-agents\"))\n", "\n", "OPENAI_MODEL = \"gpt-5\"\n" ] }, { "cell_type": "markdown", "id": "f63cabf2", "metadata": {}, "source": [ "### 6.2 Create a Baseline Agent (No Tools Yet)\n", "\n", "Start simple: define a research-focused assistant and run one direct query. 
This establishes baseline behavior before enabling external tool use.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1222a7bd", "metadata": {}, "outputs": [], "source": [ "from agents import Agent, Runner\n", "\n", "research_paper_assistant = Agent(\n", " name=\"Research Paper Assistant\",\n", " model=OPENAI_MODEL,\n", " instructions=\"\"\"\n", " You are a Research Paper Assistant focused on helping users explore, analyze, and summarize\n", " academic research.\n", "\n", " Maintain a professional, concise, and scholarly tone appropriate for research discussions.\n", " \"\"\",\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a1ab0778", "metadata": {}, "outputs": [], "source": [ "run_result = await Runner.run(\n", " starting_agent=research_paper_assistant,\n", " input=\"Summarize recent research on optimization techniques for planetary exploration.\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "c2eda5fb", "metadata": {}, "outputs": [], "source": [ "print(run_result.final_output)\n" ] }, { "cell_type": "markdown", "id": "bc63ef7c", "metadata": {}, "source": [ "### 6.3 Expose Oracle Retrieval as a Callable Tool\n", "\n", "Here we wrap SQL retrieval in a `@function_tool`, attach it to the assistant, and run a query that should trigger tool usage. 
Inspecting raw responses helps verify tool-call arguments and execution order.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "dd18f19a", "metadata": {}, "outputs": [], "source": [ "from agents.tool import function_tool\n", "\n", "@function_tool\n", "def get_research_papers(user_query: str, retrieval_mode: str = \"hybrid\", top_k: int = 5) -> str:\n", " \"\"\"\n", " Retrieves academic research papers relevant to the user's query.\n", "\n", " This tool queries the research_papers SQL table using one of four retrieval techniques:\n", " - 'keyword' → lexical search via Oracle Text\n", " - 'vector' → semantic similarity search\n", " - 'hybrid' → combines keyword prefiltering + vector similarity (default)\n", " - 'graph' → graph expansion via SQL Property Graph + GRAPH_TABLE\n", "\n", " Use this tool when analyzing or summarizing scientific literature.\n", "\n", " Args:\n", " user_query (str): Research topic or question to search for.\n", " retrieval_mode (str): 'keyword', 'vector', 'hybrid', or 'graph'. 
Default is 'hybrid'.\n", " top_k (int): Number of top papers to retrieve (default=5).\n", "\n", " Returns:\n", " str: A formatted summary of the most relevant research papers.\n", " \"\"\"\n", "\n", " # ------------------------------------------------------------------\n", " # Perform retrieval using SQL-based functions (defined earlier)\n", " # ------------------------------------------------------------------\n", " if retrieval_mode == \"keyword\":\n", " rows, columns = keyword_search_research_papers(conn, user_query)\n", " elif retrieval_mode == \"vector\":\n", " rows, columns = vector_search_research_papers(conn, embedding_model, user_query, top_k)\n", " elif retrieval_mode == \"graph\":\n", " rows, columns = graph_search_research_papers(conn, embedding_model, user_query, top_k=top_k)\n", " else:\n", " rows, columns, _ = hybrid_search_research_papers_pre_filter(\n", " conn=conn,\n", " embedding_model=embedding_model,\n", " search_phrase=user_query,\n", " top_k=top_k,\n", " show_explain=False\n", " )\n", "\n", " retrieved_count = len(rows) if rows else 0\n", "\n", " # ------------------------------------------------------------------\n", " # Format the output into a readable string\n", " # ------------------------------------------------------------------\n", " if retrieved_count == 0:\n", " return f\"No research papers found related to '{user_query}'.\"\n", "\n", " formatted_results = [f\"📚 {retrieved_count} papers retrieved for query: '{user_query}'\\n\"]\n", " for i, row in enumerate(rows):\n", " row_data = dict(zip(columns, row))\n", " title = row_data.get(\"TITLE\", \"Untitled Paper\")\n", " abstract = row_data.get(\"ABSTRACT\", \"No abstract available.\")\n", " score = (\n", " row_data.get(\"GRAPH_SCORE\")\n", " or row_data.get(\"SIMILARITY_SCORE\")\n", " or row_data.get(\"RELEVANCE_SCORE\")\n", " or row_data.get(\"TEXT_RELEVANCE_SCORE\")\n", " or \"N/A\"\n", " )\n", " formatted_results.append(\n", " f\"[{i+1}] {title}\\n\"\n", " f\"Abstract: 
{abstract}\\n\"\n", " f\"Relevance Score: {score}\\n\"\n", " )\n", "\n", " return \"\\n\".join(formatted_results)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "625c3b81", "metadata": {}, "outputs": [], "source": [ "research_paper_assistant.tools.append(get_research_papers)" ] }, { "cell_type": "code", "execution_count": null, "id": "3bdca1fe", "metadata": {}, "outputs": [], "source": [ "run_result_with_tool = await Runner.run(\n", " starting_agent=research_paper_assistant,\n", " input=\"Get me information on rover navigation, planetary data collection, mission planning, resource allocation, or other related fields\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "92279dac", "metadata": {}, "outputs": [], "source": [ "print(run_result_with_tool.final_output)" ] }, { "cell_type": "code", "execution_count": null, "id": "9afe21b9", "metadata": {}, "outputs": [], "source": [ "import pprint\n", "pprint.pprint(run_result_with_tool.raw_responses)" ] }, { "cell_type": "markdown", "id": "c31a451b", "metadata": {}, "source": [ "### 6.4 Add a Second Tool and Enable Multi-Tool Access\n", "\n", "The first tool retrieves current literature. 
In this block, we add a second tool for past research conversations so the assistant can combine:\n", "- fresh evidence from the paper corpus, and\n", "- prior analytical context from earlier runs.\n", "\n", "This is the point where the agent starts making tool-selection decisions based on intent.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "785c0266", "metadata": {}, "outputs": [], "source": [ "from agents.tool import function_tool\n", "\n", "@function_tool\n", "def get_past_research_conversations(user_query: str, top_k: int = 5) -> str:\n", " \"\"\"\n", " Retrieves relevant past research-related conversations or analyses related to the query.\n", "\n", " This tool searches a SQL database of prior research assistant conversations, \n", " literature discussions, or synthesis sessions to find relevant context. \n", " It allows the research assistant to recall previous analyses or summaries \n", " that addressed similar topics, providing continuity and richer insights.\n", "\n", " Note: in this demo the tool reuses the hybrid search over the research_papers\n", " table as a stand-in for a dedicated conversation store.\n", "\n", " Args:\n", " user_query (str): The research topic, concept, or question to search for.\n", " top_k (int): Number of top past discussions to retrieve (default=5).\n", "\n", " Returns:\n", " str: Formatted examples of relevant past research discussions.\n", " \"\"\"\n", "\n", " # ------------------------------------------------------------------\n", " # Perform retrieval using the SQL-based hybrid search (vector + keyword);\n", " # demo stand-in: this queries the research_papers corpus, not a\n", " # dedicated conversation store\n", " # ------------------------------------------------------------------\n", " rows, columns, _ = hybrid_search_research_papers_pre_filter(\n", " conn=conn,\n", " embedding_model=embedding_model,\n", " search_phrase=user_query,\n", " top_k=top_k,\n", " show_explain=False\n", " )\n", "\n", " retrieved_count = len(rows) if rows else 0\n", "\n", " # ------------------------------------------------------------------\n", " # Format results for readability\n", " # ------------------------------------------------------------------\n", " if retrieved_count == 
0:\n", " return f\"No past research discussions found related to '{user_query}'.\"\n", "\n", " formatted_results = [f\"🧠 {retrieved_count} past research discussions retrieved for query: '{user_query}'\\n\"]\n", " for i, row in enumerate(rows):\n", " row_data = dict(zip(columns, row))\n", " title = row_data.get(\"TITLE\", \"Untitled Discussion\")\n", " abstract = row_data.get(\"ABSTRACT\", \"No summary available.\")\n", " snippet = row_data.get(\"TEXT_SNIPPET\", \"\")\n", " score = (\n", " row_data.get(\"SIMILARITY_SCORE\")\n", " or row_data.get(\"TEXT_RELEVANCE_SCORE\")\n", " or \"N/A\"\n", " )\n", " formatted_results.append(\n", " f\"[{i+1}] **{title}**\\n\"\n", " f\"Summary: {abstract}\\n\"\n", " f\"Snippet: {snippet}\\n\"\n", " f\"Relevance Score: {score}\\n\"\n", " )\n", "\n", " return \"\\n\".join(formatted_results)\n" ] }, { "cell_type": "markdown", "id": "89db0e96", "metadata": {}, "source": [ "### 6.5 Strengthen Agent Instructions for Tool Routing\n", "\n", "Tool availability alone is not enough. 
The instruction policy below makes routing explicit so the assistant knows:\n", "- when to call paper retrieval,\n", "- when to call conversation retrieval,\n", "- and when to combine both for broader synthesis.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6f602932", "metadata": {}, "outputs": [], "source": [ "upgraded_research_paper_assistant = Agent(\n", " name=\"Research Paper Assistant\",\n", " model=OPENAI_MODEL,\n", " instructions=\"\"\"\n", " Use the get_research_papers tool when the user asks about papers, studies, methods, or findings.\n", " Use the get_past_research_conversations tool when the user asks about previous discussions, analyses, or summaries.\n", " When a query calls for both new literature and prior context, call both tools and combine their results.\n", "\n", " Always maintain an academic, evidence-based tone.\n", " Your purpose is to help users explore, synthesize, and connect research insights —\n", " not to speculate or fabricate information.\n", " \"\"\",\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fecf68a5", "metadata": {}, "outputs": [], "source": [ "# Attach research retrieval tools to the upgraded research assistant\n", "upgraded_research_paper_assistant.tools.append(get_research_papers)\n", "upgraded_research_paper_assistant.tools.append(get_past_research_conversations)" ] }, { "cell_type": "code", "execution_count": null, "id": "a66d1b64", "metadata": {}, "outputs": [], "source": [ "pprint.pprint(upgraded_research_paper_assistant.tools)" ] }, { "cell_type": "code", "execution_count": null, "id": "b54a82ea", "metadata": {}, "outputs": [], "source": [ "run_result_with_tools = await Runner.run(\n", " starting_agent=upgraded_research_paper_assistant,\n", " input=(\n", " \"Get me information on rover navigation, planetary data collection, mission planning, resource allocation, or other related fields \"\n", " ),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "9b61f988", "metadata": {}, "outputs": [], "source": [ "print(run_result_with_tools.raw_responses)" ] }, { "cell_type": "markdown", "id": "62a57b24", "metadata": {}, "source": [ "## Agent-as-Tools Orchestration\n", "\n", "Now we compose specialists:\n", "- a research-paper retrieval specialist,\n", "- a past-conversation retrieval specialist,\n", "- an 
orchestrator that delegates,\n", "- and a final synthesizer that produces a cohesive response.\n", "\n", "This mirrors a production pattern where narrow agents perform focused work and a coordinator merges outputs.\n" ] }, { "cell_type": "markdown", "id": "66ef7841", "metadata": {}, "source": [ "### Orchestration Flow\n", "\n", "Execution sequence in the next cells:\n", "1. Define specialized agents with clear scope.\n", "2. Register them as tools on an orchestrator agent.\n", "3. Run the orchestrator to gather relevant evidence.\n", "4. Pass gathered evidence to a synthesizer for final answer generation.\n", "\n", "The benefit is separation of concerns: retrieval logic is modular, and synthesis is centralized.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1a3dafca", "metadata": {}, "outputs": [], "source": [ "# Define specialized agents for different research retrieval tasks\n", "research_paper_agent = Agent(\n", " name=\"research_paper_agent\",\n", " instructions=\"\"\"\n", " You specialize in retrieving and summarizing academic research papers.\n", " Use the get_research_papers tool to find relevant literature based on the user's query.\n", " Always cite sources using [1], [2], etc., and focus on summarizing key findings,\n", " methodologies, and implications of the studies retrieved.\n", " \"\"\",\n", " handoff_description=\"A research retrieval specialist with access to academic papers and literature databases.\",\n", " tools=[get_research_papers],\n", ")\n", "\n", "research_conversation_agent = Agent(\n", " name=\"research_conversation_agent\",\n", " instructions=\"\"\"\n", " You specialize in retrieving and summarizing past research discussions and analyses.\n", " Use the get_past_research_conversations tool to surface relevant prior sessions\n", " or summaries that relate to the user's current topic of inquiry.\n", " Present these as context and examples of prior analytical reasoning.\n", " \"\"\",\n", " handoff_description=\"A research 
memory specialist with access to prior academic discussions and analyses.\",\n", " tools=[get_past_research_conversations],\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "304cac6f", "metadata": {}, "outputs": [], "source": [ "# Create an orchestrator agent that can coordinate both research retrieval agents\n", "orchestrator_agent = Agent(\n", " name=\"research_assistant_orchestrator\",\n", " instructions=(\n", " \"You are a Research Orchestrator Assistant responsible for coordinating information retrieval \"\n", " \"across multiple specialized research tools.\\n\\n\"\n", " \"Your role is to help users explore, analyze, and synthesize academic research efficiently.\\n\\n\"\n", " \"IMPORTANT RULES:\\n\"\n", " \"1. ALWAYS use translate_to_research_papers when a query mentions research papers, studies, or findings.\\n\"\n", " \"2. ALWAYS use translate_to_research_conversations when a query mentions previous discussions, analyses, or summaries.\\n\"\n", " \"3. If a query requests BOTH new research and past discussions, use BOTH tools in sequence.\\n\"\n", " \"4. NEVER attempt to provide research summaries without using your tools.\\n\"\n", " \"5. 
Each tool provides complementary context — use all appropriate tools for a comprehensive academic response.\\n\\n\"\n", " \"After retrieving relevant results, synthesize them into a cohesive summary:\\n\"\n", " \"- Clearly distinguish between newly retrieved research and recalled past discussions.\\n\"\n", " \"- Cite sources using [1], [2], etc.\\n\"\n", " \"- Identify key insights, trends, and research gaps.\\n\"\n", " \"- Maintain an academic and objective tone.\"\n", " ),\n", " tools=[\n", " research_paper_agent.as_tool(\n", " tool_name=\"translate_to_research_papers\",\n", " tool_description=\"Retrieve and summarize relevant academic research papers and literature findings.\",\n", " ),\n", " research_conversation_agent.as_tool(\n", " tool_name=\"translate_to_research_conversations\",\n", " tool_description=\"Retrieve and summarize past research discussions or analyses related to the topic.\",\n", " ),\n", " ],\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "62aab9de", "metadata": {}, "outputs": [], "source": [ "# Final agent to synthesize information from all sources (Research use case)\n", "synthesizer_agent = Agent(\n", " name=\"research_response_synthesizer\",\n", " instructions=(\n", " \"You create comprehensive, well-organized research summaries by combining information from multiple sources.\\n\\n\"\n", " \"When organizing your response:\\n\"\n", " \"1) Start with a concise abstract-style overview (3–5 sentences) highlighting key findings and takeaways.\\n\"\n", " \"2) Clearly separate NEW LITERATURE FINDINGS from PAST RESEARCH DISCUSSIONS.\\n\"\n", " \"3) Cite sources using bracketed numbers [1], [2], etc., aligned with the retrieved items.\\n\"\n", " \"4) Emphasize methods, evidence strength, and limitations; avoid speculation beyond the provided context.\\n\"\n", " \"5) Use clear, scannable formatting (short paragraphs, bullet points where appropriate).\\n\"\n", " \"6) Conclude with open questions, gaps, or future work suggested 
by the literature.\\n\"\n", " \"7) If evidence is sparse, state this explicitly and avoid overgeneralization.\\n\"\n", " \"Tone: academic, objective, and precise.\"\n", " ),\n", ")\n" ] }, { "cell_type": "markdown", "id": "a2784d60", "metadata": {}, "source": [ "### 6.6 Run the Asynchronous Orchestration Workflow\n", "\n", "The next cells define async workflow execution, patch the notebook event loop, and provide a synchronous wrapper for interactive use.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9f02800b", "metadata": {}, "outputs": [], "source": [ "from agents import ItemHelpers, MessageOutputItem, trace\n", "from agents import Runner # assuming Runner is imported elsewhere; include here for clarity\n", "\n", "\n", "async def research_assistant_workflow(user_query: str):\n", " \"\"\"Run the complete research assistant workflow (orchestrate retrieval + synthesize).\"\"\"\n", " # 1) Have the research orchestrator decide which tools to invoke\n", " with trace(\"Research Orchestrator\"):\n", " orchestrator_result = await Runner.run(orchestrator_agent, user_query)\n", "\n", " # Debug/transparency: print intermediate orchestration steps\n", " print(\"\\n--- Research Orchestration Steps ---\")\n", " for item in orchestrator_result.new_items:\n", " if isinstance(item, MessageOutputItem):\n", " text = ItemHelpers.text_message_output(item)\n", " if text:\n", " print(f\" - Retrieval step: {text}\")\n", "\n", " # 2) Synthesize all gathered information into a cohesive research summary\n", " synthesizer_result = await Runner.run(\n", " synthesizer_agent, orchestrator_result.to_input_list()\n", " )\n", "\n", " print(f\"\\n\\n--- Final Research Synthesis ---\\n{synthesizer_result.final_output}\\n\")\n", "\n", " return synthesizer_result.final_output\n" ] }, { "cell_type": "code", "execution_count": null, "id": "cc193e35", "metadata": {}, "outputs": [], "source": [ "import asyncio\n", "import nest_asyncio\n", "\n", "# Apply nest_asyncio to patch the event 
loop\n", "nest_asyncio.apply()" ] }, { "cell_type": "code", "execution_count": null, "id": "8bc86489", "metadata": {}, "outputs": [], "source": [ "def run_virtual_research_assistant(query):\n", " # Create a new event loop\n", " loop = asyncio.new_event_loop()\n", " asyncio.set_event_loop(loop)\n", "\n", " # Run the async function and get the result\n", " result = loop.run_until_complete(research_assistant_workflow(query))\n", "\n", " # Clean up\n", " loop.close()\n", "\n", " return result" ] }, { "cell_type": "code", "execution_count": null, "id": "b705b630", "metadata": {}, "outputs": [], "source": [ "# Now call the function this way\n", "query = input(\"What research topic can I help you with today? \")\n", "run_virtual_research_assistant(query)" ] }, { "cell_type": "markdown", "id": "b67e6c26", "metadata": {}, "source": [ "## Agentic Chat System\n", "\n", "The next block adds thread-aware conversation handling on top of orchestration.\n", "\n", "For each user turn, the system:\n", "1. Stores the new message in Oracle.\n", "2. Reconstructs conversation context by `thread_id`.\n", "3. Runs orchestration + synthesis with that context.\n", "4. 
Saves the assistant response back to Oracle.\n", "\n", "This pattern supports long-running research sessions with continuity.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "05741d83", "metadata": {}, "outputs": [], "source": [ "import datetime\n", "import uuid\n", "\n", "# Create chat_history table in Oracle\n", "with conn.cursor() as cur:\n", " # Drop table if exists (for development)\n", " cur.execute(\"\"\"\n", " BEGIN\n", " EXECUTE IMMEDIATE 'DROP TABLE chat_history';\n", " EXCEPTION WHEN OTHERS THEN\n", " IF SQLCODE != -942 THEN RAISE; END IF;\n", " END;\n", " \"\"\")\n", " \n", " # Create chat_history table\n", " cur.execute(\"\"\"\n", " CREATE TABLE chat_history (\n", " id VARCHAR2(100) PRIMARY KEY,\n", " thread_id VARCHAR2(100) NOT NULL,\n", " role VARCHAR2(20) NOT NULL,\n", " message CLOB NOT NULL,\n", " timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP\n", " )\n", " TABLESPACE USERS\n", " \"\"\")\n", " \n", " # Create index on thread_id and timestamp for efficient retrieval\n", " cur.execute(\"\"\"\n", " CREATE INDEX idx_thread_timestamp \n", " ON chat_history(thread_id, timestamp)\n", " TABLESPACE USERS\n", " \"\"\")\n", " \n", " conn.commit()\n", " print(\"✅ Table chat_history created successfully with index.\")" ] }, { "cell_type": "code", "execution_count": null, "id": "34c470ad", "metadata": {}, "outputs": [], "source": [ "async def research_assistant_chat(user_query, thread_id=None):\n", " \"\"\"\n", " Run the complete research assistant workflow with conversation history.\n", " For each conversation turn:\n", " - Stores the user's input and the assistant's output in Oracle along with a timestamp and thread_id.\n", " - Retrieves and appends previous conversation history (ordered by timestamp) to the agent's input.\n", " \n", " If no thread_id is provided, a new conversation session is started.\n", " \n", " Returns:\n", " tuple: (final_output, thread_id) where thread_id is the session identifier.\n", " \"\"\"\n", " # Generate a new thread 
id if not provided\n", " if thread_id is None:\n", " thread_id = str(uuid.uuid4())\n", " print(f\"📝 New research conversation started with thread ID: {thread_id}\")\n", " else:\n", " print(f\"📝 Continuing research conversation with thread ID: {thread_id}\")\n", " \n", " # --- Step 1: Store the new user query in Oracle ---\n", " message_id = str(uuid.uuid4())\n", " \n", " with conn.cursor() as cur:\n", " cur.execute(\"\"\"\n", " INSERT INTO chat_history (id, thread_id, role, message, timestamp)\n", " VALUES (:id, :thread_id, :role, :message, CURRENT_TIMESTAMP)\n", " \"\"\", {\n", " 'id': message_id,\n", " 'thread_id': thread_id,\n", " 'role': 'user',\n", " 'message': user_query\n", " })\n", " conn.commit()\n", " \n", " # --- Step 2: Retrieve full conversation history for context ---\n", " with conn.cursor() as cur:\n", " cur.execute(\"\"\"\n", " SELECT role, message, timestamp\n", " FROM chat_history\n", " WHERE thread_id = :thread_id\n", " ORDER BY timestamp ASC\n", " \"\"\", {'thread_id': thread_id})\n", " \n", " chat_history = cur.fetchall()\n", " \n", " conversation_context = \"\"\n", " for entry in chat_history:\n", " role, message, timestamp = entry\n", " if role == \"user\":\n", " conversation_context += f\"User: {message}\\n\"\n", " else:\n", " conversation_context += f\"Assistant: {message}\\n\"\n", " \n", " # --- Step 3: Run the orchestrator agent with the conversation context ---\n", " with trace(\"Research Orchestrator\"):\n", " orchestrator_result = await Runner.run(orchestrator_agent, conversation_context)\n", " \n", " # Print intermediate processing steps for debugging/transparency\n", " print(\"\\n--- Research Orchestrator Processing Steps ---\")\n", " for item in orchestrator_result.new_items:\n", " if isinstance(item, MessageOutputItem):\n", " text = ItemHelpers.text_message_output(item)\n", " if text:\n", " print(f\" - Information gathering step: {text}\")\n", " \n", " # --- Step 4: Run the synthesizer agent to produce a cohesive response ---\n", 
" synthesizer_result = await Runner.run(\n", " synthesizer_agent, orchestrator_result.to_input_list()\n", " )\n", " \n", " # --- Step 5: Store the assistant's final output in Oracle ---\n", " response_id = str(uuid.uuid4())\n", " \n", " with conn.cursor() as cur:\n", " cur.execute(\"\"\"\n", " INSERT INTO chat_history (id, thread_id, role, message, timestamp)\n", " VALUES (:id, :thread_id, :role, :message, CURRENT_TIMESTAMP)\n", " \"\"\", {\n", " 'id': response_id,\n", " 'thread_id': thread_id,\n", " 'role': 'assistant',\n", " 'message': synthesizer_result.final_output\n", " })\n", " conn.commit()\n", " \n", " print(f\"\\n\\n--- Final Research Response ---\\n{synthesizer_result.final_output}\\n\")\n", " \n", " return synthesizer_result.final_output, thread_id" ] }, { "cell_type": "code", "execution_count": null, "id": "b6250b66", "metadata": {}, "outputs": [], "source": [ "import asyncio\n", "\n", "def run_research_assistant_chat(query, thread_id=None):\n", " \"\"\"\n", " Run the research assistant synchronously.\n", " Optionally, a thread_id can be provided to continue an existing conversation.\n", " Returns a tuple (final_output, thread_id).\n", " \"\"\"\n", " # Create a new event loop\n", " # (If an event loop is already running, e.g. in Jupyter, this may raise a RuntimeError;\n", " # in that case, await research_assistant_chat(...) directly instead.)\n", " loop = asyncio.new_event_loop()\n", " asyncio.set_event_loop(loop)\n", " \n", " # Run the async function and get the result\n", " result, thread_id = loop.run_until_complete(\n", " research_assistant_chat(query, thread_id=thread_id)\n", " )\n", " \n", " # Clean up the loop\n", " loop.close()\n", " \n", " return result, thread_id" ] }, { "cell_type": "code", "execution_count": null, "id": "e7cc560f", "metadata": {}, "outputs": [], "source": [ "def research_chat_session():\n", " \"\"\"\n", " Launches a research chat session that continues until the user enters 'q', 'exit', or 'quit'.\n", " The session uses a persistent thread_id to preserve conversation history.\n", " \"\"\"\n", " print(\"🔬 Starting Research Paper Assistant Chat\")\n", " print(\"Type 'q', 'exit', or 'quit' to exit.\\n\")\n", " \n", " 
session_thread_id = None\n", " \n", " while True:\n", " query = input(\"What research topic can I help you with today? \")\n", " \n", " if query.lower() in [\"q\", \"exit\", \"quit\"]:\n", " print(\"Exiting research chat session.\")\n", " break\n", " \n", " response, session_thread_id = run_research_assistant_chat(\n", " query, thread_id=session_thread_id\n", " )\n", " \n", " print(f\"\\n📚 Assistant: {response}\\n\")" ] }, { "cell_type": "code", "execution_count": null, "id": "cf336fe5", "metadata": {}, "outputs": [], "source": [ "# Start the research chat session\n", "research_chat_session()" ] }, { "cell_type": "markdown", "id": "a5af5bfe", "metadata": {}, "source": [ "## Session Memory with Oracle AI Database\n", "\n", "This section implements a custom `OracleSession` adapter compatible with agent session APIs.\n", "\n", "Key capabilities:\n", "- persistent storage of message items,\n", "- retrieval in chronological order,\n", "- controlled trimming (`pop_item`) to manage memory budget,\n", "- full session reset (`clear_session`) when needed.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e20956c9", "metadata": {}, "outputs": [], "source": [ "from typing import List, Optional, Union\n", "from datetime import datetime\n", "import oracledb\n", "import json\n", "import uuid\n", "\n", "class OracleSession:\n", " \"\"\"Custom Oracle session implementation following the Session protocol\"\"\"\n", " \n", " def __init__(\n", " self, \n", " session_id: str, \n", " connection,\n", " table_name: str = \"chat_history\"\n", " ):\n", " \"\"\"\n", " Initialize Oracle session storage.\n", " \n", " Args:\n", " session_id: Unique identifier for this conversation session\n", " connection: Active oracledb connection object\n", " table_name: Name of the Oracle table storing session data\n", " \"\"\"\n", " self.session_id = session_id\n", " self.conn = connection\n", " self.table_name = table_name\n", " \n", " async def get_items(self, limit: Optional[int] = None) -> 
List[dict]:\n", " \"\"\"Retrieve conversation history for this session\"\"\"\n", " try:\n", " with self.conn.cursor() as cur:\n", " if limit:\n", " cur.execute(f\"\"\"\n", " SELECT message\n", " FROM {self.table_name}\n", " WHERE thread_id = :session_id\n", " ORDER BY timestamp ASC\n", " FETCH FIRST :limit ROWS ONLY\n", " \"\"\", {'session_id': self.session_id, 'limit': limit})\n", " else:\n", " cur.execute(f\"\"\"\n", " SELECT message\n", " FROM {self.table_name}\n", " WHERE thread_id = :session_id\n", " ORDER BY timestamp ASC\n", " \"\"\", {'session_id': self.session_id})\n", " \n", " rows = cur.fetchall()\n", " \n", " items = []\n", " for row in rows:\n", " # Deserialize JSON from CLOB\n", " message_clob = row[0]\n", " if message_clob:\n", " message_str = message_clob.read() if hasattr(message_clob, 'read') else str(message_clob)\n", " items.append(json.loads(message_str))\n", " \n", " return items\n", " \n", " except Exception as e:\n", " print(f\"Error retrieving items: {e}\")\n", " return []\n", " \n", " async def add_items(self, items: List[dict]) -> None:\n", " \"\"\"Store new items for this session\"\"\"\n", " try:\n", " with self.conn.cursor() as cur:\n", " for item in items:\n", " item_id = str(uuid.uuid4())\n", " \n", " # Serialize the entire item as JSON\n", " message_json = json.dumps(item)\n", " \n", " # Extract role if available, otherwise default to 'system'\n", " role = item.get('role', 'system')\n", " \n", " cur.execute(f\"\"\"\n", " INSERT INTO {self.table_name} (id, thread_id, role, message, timestamp)\n", " VALUES (:id, :session_id, :role, :message, CURRENT_TIMESTAMP)\n", " \"\"\", {\n", " 'id': item_id,\n", " 'session_id': self.session_id,\n", " 'role': role,\n", " 'message': message_json\n", " })\n", " \n", " self.conn.commit()\n", " \n", " except Exception as e:\n", " print(f\"Error adding items: {e}\")\n", " self.conn.rollback()\n", " \n", " async def pop_item(self, limit: Optional[int] = None) -> Optional[Union[dict, List[dict]]]:\n", " 
\"\"\"\n", " Remove and return the most recent item(s) for this session.\n", " \"\"\"\n", " try:\n", " with self.conn.cursor() as cur:\n", " # Pop a single most-recent item\n", " if not limit or limit <= 1:\n", " cur.execute(f\"\"\"\n", " SELECT id, message\n", " FROM {self.table_name}\n", " WHERE thread_id = :session_id\n", " ORDER BY timestamp DESC\n", " FETCH FIRST 1 ROW ONLY\n", " \"\"\", {'session_id': self.session_id})\n", " \n", " row = cur.fetchone()\n", " \n", " if row:\n", " item_id, message_clob = row\n", " message_str = message_clob.read() if hasattr(message_clob, 'read') else str(message_clob)\n", " item = json.loads(message_str)\n", " \n", " # Delete the item\n", " cur.execute(f\"\"\"\n", " DELETE FROM {self.table_name}\n", " WHERE id = :id\n", " \"\"\", {'id': item_id})\n", " \n", " self.conn.commit()\n", " return item\n", " \n", " return None\n", " \n", " # Pop multiple most-recent items\n", " cur.execute(f\"\"\"\n", " SELECT id, message\n", " FROM {self.table_name}\n", " WHERE thread_id = :session_id\n", " ORDER BY timestamp DESC\n", " FETCH FIRST :limit ROWS ONLY\n", " \"\"\", {'session_id': self.session_id, 'limit': limit})\n", " \n", " rows = cur.fetchall()\n", " \n", " if not rows:\n", " return []\n", " \n", " items = []\n", " ids_to_delete = []\n", " \n", " for row in rows:\n", " item_id, message_clob = row\n", " message_str = message_clob.read() if hasattr(message_clob, 'read') else str(message_clob)\n", " items.append(json.loads(message_str))\n", " ids_to_delete.append(item_id)\n", " \n", " # Delete all items\n", " for item_id in ids_to_delete:\n", " cur.execute(f\"\"\"\n", " DELETE FROM {self.table_name}\n", " WHERE id = :id\n", " \"\"\", {'id': item_id})\n", " \n", " self.conn.commit()\n", " return items\n", " \n", " except Exception as e:\n", " print(f\"Error popping item(s): {e}\")\n", " self.conn.rollback()\n", " return None if (not limit or limit <= 1) else []\n", " \n", " async def clear_session(self) -> None:\n", " \"\"\"Clear all 
items for this session\"\"\"\n", " try:\n", " with self.conn.cursor() as cur:\n", " cur.execute(f\"\"\"\n", " DELETE FROM {self.table_name}\n", " WHERE thread_id = :session_id\n", " \"\"\", {'session_id': self.session_id})\n", " \n", " self.conn.commit()\n", " print(f\"✅ Session {self.session_id} cleared successfully.\")\n", " \n", " except Exception as e:\n", " print(f\"Error clearing session: {e}\")\n", " self.conn.rollback()\n", " \n", " def close(self) -> None:\n", " \"\"\"\n", " Note: Connection is managed externally, so we don't close it here.\n", " \"\"\"\n", " pass" ] }, { "cell_type": "markdown", "id": "079aa0f1", "metadata": {}, "source": [ "### Basic Example: Agent with Persistent Session Memory\n", "\n", "These turns demonstrate memory behavior explicitly:\n", "1. Introduce a user identity and topic.\n", "2. Ask follow-up questions that depend on prior context.\n", "3. Remove selected items to test forgetting behavior.\n", "4. Clear the session and verify memory reset.\n", "\n", "This gives a practical template for evaluating memory quality in your own assistants.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ef7a559a", "metadata": {}, "outputs": [], "source": [ "# Create an agent\n", "research_agent = Agent(\n", " name=\"Assistant\",\n", " instructions=\"Research the topic and return the most relevant information.\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "fb1433a6", "metadata": {}, "outputs": [], "source": [ "# Create an Oracle session instance\n", "session = OracleSession(\n", " session_id=\"conversation_123\", \n", " connection=conn,\n", " table_name=\"chat_history\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "f4cc72f7", "metadata": {}, "outputs": [], "source": [ "# First turn\n", "result_from_research_agent = await Runner.run(\n", " starting_agent=research_agent,\n", " input=\"Hi, my name is Richmond, and I am an AI Memory Engineer researching LLMs and Agent Memory\",\n", 
session=session\n", ")\n", "\n", "print(f\"Assistant: {result_from_research_agent.final_output}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "f1ea86e4", "metadata": {}, "outputs": [], "source": [ "# Second turn\n", "result_from_research_agent = await Runner.run(\n", " starting_agent=research_agent,\n", " input=\"What is a paper that introduces the attention mechanism in LLMs?\",\n", " session=session\n", ")\n", "\n", "print(f\"Assistant: {result_from_research_agent.final_output}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "144c2920", "metadata": {}, "outputs": [], "source": [ "# Third turn: the agent will remember the previous conversation\n", "result_from_research_agent = await Runner.run(\n", " starting_agent=research_agent,\n", " input=\"Who were the authors of the paper?\",\n", " session=session\n", ")\n", "\n", "print(f\"Assistant: {result_from_research_agent.final_output}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "22f955bd", "metadata": {}, "outputs": [], "source": [ "# Fourth turn - continuing the conversation\n", "result_from_research_agent = await Runner.run(\n", " starting_agent=research_agent,\n", " input=\"What was the year of publication?\",\n", " session=session\n", ")\n", "\n", "print(f\"Assistant: {result_from_research_agent.final_output}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "e1d64efa", "metadata": {}, "outputs": [], "source": [ "# Change the conversation subject and ensure the agent doesn't remember the previous conversation\n", "# pop_item without a limit removes only the most recent item; here limit=7 removes the last 7 items\n", "await session.pop_item(limit=7)" ] }, { "cell_type": "code", "execution_count": null, "id": "3a7d9dcb", "metadata": {}, "outputs": [], "source": [ "# Fifth turn: The agent should not remember the conversations about the paper\n", "result_from_research_agent = await Runner.run(\n", " starting_agent=research_agent,\n", " input=\"What paper have we been talking 
about?\",\n", " session=session\n", ")\n", "\n", "print(f\"Assistant: {result_from_research_agent.final_output}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "c06a4705", "metadata": {}, "outputs": [], "source": [ "# Because we only popped the most recent items, the agent should still remember our name from the introduction\n", "result_from_research_agent = await Runner.run(\n", " starting_agent=research_agent,\n", " input=\"Do you still remember my name?\",\n", " session=session\n", ")\n", "\n", "print(f\"Assistant: {result_from_research_agent.final_output}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "3694ed8c", "metadata": {}, "outputs": [], "source": [ "# Clear the session\n", "await session.clear_session()" ] }, { "cell_type": "code", "execution_count": null, "id": "cd2169c8", "metadata": {}, "outputs": [], "source": [ "# After clearing the session, the agent should no longer remember our name\n", "result_from_research_agent = await Runner.run(\n", " starting_agent=research_agent,\n", " input=\"Do you still remember my name?\",\n", " session=session\n", ")\n", "\n", "print(f\"Assistant: {result_from_research_agent.final_output}\")" ] } ], "metadata": { "kernelspec": { "display_name": "playground", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 5 }