{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Build semantic search for text\n", "\n", "Create a searchable knowledge base that finds content by meaning, not just keywords." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem\n", "\n", "You have a collection of text content (articles, notes, documentation) and need to find relevant items based on meaning.\n", "\n", "Keyword search fails when users phrase queries differently from the source text:\n", "\n", "| Query | Keyword match | Semantic match |\n", "|-------|---------------|----------------|\n", "| \"how to fix bugs\" | ❌ No results | ✓ \"Debugging best practices\" |\n", "| \"ML training\" | ❌ No results | ✓ \"Machine learning model optimization\" |\n", "| \"deploy to cloud\" | ❌ No results | ✓ \"Production infrastructure setup\" |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution\n", "\n", "**What's in this recipe:**\n", "\n", "- Create a text table with embeddings\n", "- Search by semantic similarity\n", "- Combine with metadata filters\n", "\n", "You add an embedding index to your text column. Pixeltable automatically generates embeddings for each row and enables similarity search." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:41:30.588182Z", "iopub.status.busy": "2025-12-12T02:41:30.588005Z", "iopub.status.idle": "2025-12-12T02:41:33.451278Z", "shell.execute_reply": "2025-12-12T02:41:33.450676Z" } }, "outputs": [], "source": [ "%pip install -qU pixeltable sentence-transformers" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:41:33.467968Z", "iopub.status.busy": "2025-12-12T02:41:33.467749Z", "iopub.status.idle": "2025-12-12T02:41:34.730340Z", "shell.execute_reply": "2025-12-12T02:41:34.729837Z" } }, "outputs": [], "source": [ "import pixeltable as pxt\n", "from pixeltable.functions.huggingface import sentence_transformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create knowledge base" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:41:34.733205Z", "iopub.status.busy": "2025-12-12T02:41:34.732969Z", "iopub.status.idle": "2025-12-12T02:41:34.924547Z", "shell.execute_reply": "2025-12-12T02:41:34.924224Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata\n", "Created directory 'search_demo'.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a fresh directory\n", "pxt.drop_dir('search_demo', force=True)\n", "pxt.create_dir('search_demo')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:41:34.927702Z", "iopub.status.busy": "2025-12-12T02:41:34.927564Z", "iopub.status.idle": "2025-12-12T02:41:35.021834Z", "shell.execute_reply": 
"2025-12-12T02:41:35.021568Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created table 'articles'.\n" ] } ], "source": [ "# Create table with content and metadata\n", "kb = pxt.create_table(\n", " 'search_demo/articles',\n", " {'title': pxt.String, 'content': pxt.String, 'category': pxt.String},\n", ")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:41:35.024303Z", "iopub.status.busy": "2025-12-12T02:41:35.024140Z", "iopub.status.idle": "2025-12-12T02:41:35.852965Z", "shell.execute_reply": "2025-12-12T02:41:35.852416Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserting rows into `articles`: 4 rows [00:00, 577.69 rows/s]\n", "Inserted 4 rows with 0 errors.\n" ] }, { "data": { "text/plain": [ "4 rows inserted, 12 values computed." ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Insert sample content\n", "kb.insert(\n", " [\n", " {\n", " 'title': 'Debugging best practices',\n", " 'content': 'Use logging, breakpoints, and unit tests to identify and fix issues in your code.',\n", " 'category': 'engineering',\n", " },\n", " {\n", " 'title': 'Machine learning model optimization',\n", " 'content': 'Improve training efficiency with batch normalization, learning rate schedules, and early stopping.',\n", " 'category': 'ml',\n", " },\n", " {\n", " 'title': 'Production infrastructure setup',\n", " 'content': 'Deploy applications using containers, load balancers, and automated scaling.',\n", " 'category': 'devops',\n", " },\n", " {\n", " 'title': 'API design principles',\n", " 'content': 'Create RESTful endpoints with proper versioning, authentication, and error handling.',\n", " 'category': 'engineering',\n", " },\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add semantic search\n", "\n", "Create an embedding index on the content column:" ] }, { "cell_type": "code", 
"execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:41:35.856120Z", "iopub.status.busy": "2025-12-12T02:41:35.855554Z", "iopub.status.idle": "2025-12-12T02:41:40.700457Z", "shell.execute_reply": "2025-12-12T02:41:40.699877Z" } }, "outputs": [], "source": [ "# Add embedding index\n", "kb.add_embedding_index(\n", " column='content',\n", " string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2'),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Search by meaning\n", "\n", "Find content semantically similar to your query:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:41:40.704273Z", "iopub.status.busy": "2025-12-12T02:41:40.703813Z", "iopub.status.idle": "2025-12-12T02:41:40.832226Z", "shell.execute_reply": "2025-12-12T02:41:40.831777Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
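<table><thead><tr><th>title</th><th>content</th><th>score</th></tr></thead><tbody>
<tr><td>Debugging best practices</td><td>Use logging, breakpoints, and unit tests to identify and fix issues in your code.</td><td>0.391</td></tr>
<tr><td>API design principles</td><td>Create RESTful endpoints with proper versioning, authentication, and error handling.</td><td>0.186</td></tr></tbody></table>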
" ], "text/plain": [ " title \\\n", "0 Debugging best practices \n", "1 API design principles \n", "\n", " content score \n", "0 Use logging, breakpoints, and unit tests to id... 0.391208 \n", "1 Create RESTful endpoints with proper versionin... 0.186153 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Search by meaning\n", "query = 'how to fix bugs'\n", "sim = kb.content.similarity(string=query)\n", "\n", "results = (\n", " kb.order_by(sim, asc=False)\n", " .select(kb.title, kb.content, score=sim)\n", " .limit(2)\n", ")\n", "results.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filter by metadata\n", "\n", "Combine semantic search with metadata filters:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:41:40.834273Z", "iopub.status.busy": "2025-12-12T02:41:40.834039Z", "iopub.status.idle": "2025-12-12T02:41:40.985506Z", "shell.execute_reply": "2025-12-12T02:41:40.985088Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
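<table><thead><tr><th>title</th><th>category</th><th>score</th></tr></thead><tbody>
<tr><td>API design principles</td><td>engineering</td><td>0.238</td></tr>
<tr><td>Debugging best practices</td><td>engineering</td><td>0.157</td></tr></tbody></table>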
" ], "text/plain": [ " title category score\n", "0 API design principles engineering 0.238239\n", "1 Debugging best practices engineering 0.157270" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Search within a specific category\n", "query = 'best practices'\n", "sim = kb.content.similarity(string=query)\n", "\n", "results = (\n", " kb.where(kb.category == 'engineering') # Filter first\n", " .order_by(sim, asc=False)\n", " .select(kb.title, kb.category, score=sim)\n", " .limit(2)\n", ")\n", "results.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explanation\n", "\n", "**How similarity search works:**\n", "\n", "1. Your query is converted to an embedding vector using the same model as the index\n", "1. Pixeltable finds the most similar vectors in the index\n", "1. Results are ranked by cosine similarity (range -1 to 1; higher means more similar)\n", "\n", "**Embedding models:**\n", "\n", "| Model | Speed | Quality | Use case |\n", "|-------|-------|---------|----------|\n", "| `all-MiniLM-L6-v2` | Fast | Good | General text |\n", "| `all-mpnet-base-v2` | Medium | Better | Higher accuracy |\n", "| OpenAI `text-embedding-3-small` | Network-bound (API call) | Best | Production apps |\n", "\n", "**New content is indexed automatically:**\n", "\n", "When you insert new rows, Pixeltable computes their embeddings and updates the index; no extra code is needed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## See also\n", "\n", "- [Vector database documentation](https://docs.pixeltable.com/platform/embedding-indexes)\n", "- [Split documents for RAG](https://docs.pixeltable.com/howto/cookbooks/text/doc-chunk-for-rag)" ] } ], "metadata": { "kernelspec": { "display_name": "pixeltable", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 2 }