{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Build a RAG pipeline\n", "\n", "Create a retrieval-augmented generation system that answers questions using your documents as context." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem\n", "\n", "You want an LLM to answer questions using your specific documents—not just its training data. You need to retrieve relevant context and include it in the prompt.\n", "\n", "| Use case | Documents | Questions |\n", "|----------|-----------|-----------|\n", "| Customer support | Help articles | \"How do I reset my password?\" |\n", "| Internal wiki | Company docs | \"What's our vacation policy?\" |\n", "| Research | Papers | \"What did the study find about X?\" |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution\n", "\n", "**What's in this recipe:**\n", "\n", "- Embed and index documents for retrieval\n", "- Create a query function that retrieves context\n", "- Generate answers grounded in your documents\n", "\n", "You build a pipeline that: (1) embeds documents, (2) finds relevant chunks for a query, and (3) generates an answer using those chunks as context." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -qU pixeltable openai" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import getpass\n", "import os\n", "\n", "if 'OPENAI_API_KEY' not in os.environ:\n", " os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata\n", "Created directory 'rag_demo'.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pixeltable as pxt\n", "from pixeltable.functions.openai import chat_completions, embeddings\n", "\n", "# Create a fresh directory\n", "pxt.drop_dir('rag_demo', force=True)\n", "pxt.create_dir('rag_demo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: create document store with embeddings" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created table 'chunks'.\n" ] } ], "source": [ "# Create table for document chunks\n", "chunks = pxt.create_table(\n", " 'rag_demo/chunks', {'doc_id': pxt.String, 'chunk_text': pxt.String}\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Add embedding index for semantic search\n", "chunks.add_embedding_index(\n", " column='chunk_text',\n", " string_embed=embeddings.using(model='text-embedding-3-small'),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: load documents" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "Inserting rows into `chunks`: 5 rows [00:00, 345.31 rows/s]\n", "Inserted 5 rows with 0 errors.\n" ] }, { "data": { "text/plain": [ "5 rows inserted, 15 values computed." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sample knowledge base (in production, load from files/database)\n", "documents = [\n", " {\n", " 'doc_id': 'password-reset',\n", " 'chunk_text': 'To reset your password, go to the login page and click \"Forgot Password\". Enter your email address and you will receive a reset link within 5 minutes. The link expires after 24 hours.',\n", " },\n", " {\n", " 'doc_id': 'password-reset',\n", " 'chunk_text': 'Password requirements: minimum 8 characters, at least one uppercase letter, one number, and one special character. Passwords expire every 90 days for security.',\n", " },\n", " {\n", " 'doc_id': 'account-settings',\n", " 'chunk_text': 'To update your profile, navigate to Settings > Account. You can change your display name, email address, and notification preferences. Changes take effect immediately.',\n", " },\n", " {\n", " 'doc_id': 'billing',\n", " 'chunk_text': 'Billing occurs on the first of each month. You can view invoices under Settings > Billing. To change your payment method, click \"Update Payment\" and enter your new card details.',\n", " },\n", " {\n", " 'doc_id': 'api-access',\n", " 'chunk_text': 'API keys can be generated in Settings > Developer. Each key has configurable permissions. 
Rate limits are 1000 requests per minute for standard plans, 10000 for enterprise.',\n", " },\n", "]\n", "\n", "chunks.insert(documents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: create the RAG query function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define a query function that retrieves context\n", "@pxt.query\n", "def retrieve_context(query: str, top_k: int = 3):\n", "    \"\"\"Retrieve the most relevant chunks for a query.\"\"\"\n", "    sim = chunks.chunk_text.similarity(string=query)\n", "    return (\n", "        chunks.where(sim > 0.5)\n", "        .order_by(sim, asc=False)\n", "        .limit(top_k)\n", "        .select(doc_id=chunks.doc_id, text=chunks.chunk_text)\n", "    )" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "retrieve_context('What are the key features?')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calling the query function directly returns an unevaluated expression;\n", "# retrieval runs once it backs a computed column (Step 4)\n", "query = 'What are the key features?'\n", "context_chunks = retrieve_context(query)\n", "context_chunks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4: generate answers with context" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created table 'qa'.\n" ] } ], "source": [ "# Create a table for questions/answers\n", "qa = pxt.create_table('rag_demo/qa', {'question': pxt.String})" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 0 column values with 0 errors.\n" ] }, { "data": { "text/plain": [ "No rows affected." 
] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add retrieval step\n", "qa.add_computed_column(context=retrieve_context(qa.question, top_k=3))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 0 column values with 0 errors.\n" ] }, { "data": { "text/plain": [ "No rows affected." ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Build the RAG prompt\n", "@pxt.udf\n", "def build_rag_prompt(question: str, context: list[dict]) -> str:\n", " context_text = '\\n\\n'.join(\n", " [f'[{c[\"doc_id\"]}]: {c[\"text\"]}' for c in context]\n", " )\n", " return f\"\"\"Answer the question based only on the provided context. If the context doesn't contain the answer, say \"I don't have information about that.\"\n", "\n", "Context:\n", "{context_text}\n", "\n", "Question: {question}\n", "\n", "Answer:\"\"\"\n", "\n", "\n", "qa.add_computed_column(prompt=build_rag_prompt(qa.question, qa.context))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 0 column values with 0 errors.\n", "Added 0 column values with 0 errors.\n" ] }, { "data": { "text/plain": [ "No rows affected." 
] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Generate answer\n", "qa.add_computed_column(\n", " response=chat_completions(\n", " messages=[{'role': 'user', 'content': qa.prompt}],\n", " model='gpt-4o-mini',\n", " )\n", ")\n", "qa.add_computed_column(answer=qa.response.choices[0].message.content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ask questions" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserting rows into `qa`: 3 rows [00:00, 872.12 rows/s]\n", "Inserted 3 rows with 0 errors.\n" ] }, { "data": { "text/plain": [ "3 rows inserted, 18 values computed." ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Insert questions\n", "questions = [\n", " {'question': 'How do I reset my password?'},\n", " {'question': 'What are the API rate limits?'},\n", " {'question': 'When am I billed?'},\n", "]\n", "\n", "qa.insert(questions)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table>
<thead><tr><th>question</th><th>answer</th></tr></thead>
<tbody>
<tr><td>When am I billed?</td><td>You are billed on the first of each month.</td></tr>
<tr><td>What are the API rate limits?</td><td>The API rate limits are 1000 requests per minute for standard plans and 10000 requests per minute for enterprise plans.</td></tr>
<tr><td>How do I reset my password?</td><td>To reset your password, go to the login page and click &quot;Forgot Password&quot;. Enter your email address, and you will receive a reset link within 5 minutes. The link expires after 24 hours.</td></tr>
</tbody>
</table>
" ], "text/plain": [ " question \\\n", "0 When am I billed? \n", "1 What are the API rate limits? \n", "2 How do I reset my password? \n", "\n", " answer \n", "0 You are billed on the first of each month. \n", "1 The API rate limits are 1000 requests per minu... \n", "2 To reset your password, go to the login page a... " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View answers\n", "qa.select(qa.question, qa.answer).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explanation\n", "\n", "**RAG pipeline flow:**\n", "\n", "```\n", "Question → Embed → Retrieve similar chunks → Build prompt with context → Generate answer\n", "```\n", "\n", "**Key components:**\n", "\n", "| Component | Purpose |\n", "|-----------|---------|\n", "| Embedding index | Fast similarity search |\n", "| `@pxt.query` | Retrieve context from the database |\n", "| `@pxt.udf` | Build the augmented prompt |\n", "| Computed columns | Chain the pipeline together |\n", "\n", "**Scaling tips:**\n", "\n", "- Use the `doc-chunk-for-rag` recipe to split long documents into chunks\n", "- Adjust `top_k` and the `sim > 0.5` similarity threshold to balance context size against relevance\n", "- Consider filtering on metadata such as `doc_id` for large knowledge bases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## See also\n", "\n", "- [Chunk documents for RAG](https://docs.pixeltable.com/howto/cookbooks/text/doc-chunk-for-rag) - Split documents into chunks\n", "- [Create text embeddings](https://docs.pixeltable.com/howto/cookbooks/search/embed-text-openai) - Embedding fundamentals\n", "- [Semantic text search](https://docs.pixeltable.com/howto/cookbooks/search/search-semantic-text) - Search patterns" ] } ], "metadata": { "kernelspec": { "display_name": "pixeltable", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 2 }