{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Split data into multiple rows with iterators\n", "\n", "Transform a single document, video, image, or audio file into multiple rows for granular processing.\n", "\n", "**What's in this recipe:**\n", "\n", "- Split documents into text chunks for RAG\n", "- Extract frames or segments from videos\n", "- Tile images for high-resolution analysis\n", "- Chunk audio files for transcription" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem\n", "\n", "You have documents, videos, or text that you need to break into smaller pieces for processing. A PDF needs to be split into chunks for retrieval-augmented generation. A video needs individual frames for analysis. Text needs to be divided into sentences or sliding windows.\n", "\n", "You need a way to transform one source row into multiple output rows automatically." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution\n", "\n", "You create views with iterator functions that split source data into multiple rows. Pixeltable provides built-in iterators for documents, videos, images, audio, and strings.\n", "\n", "### Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -qU pixeltable spacy tiktoken\n", "!python -m spacy download en_core_web_sm -q" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split documents into chunks\n", "\n", "Use `document_splitter` to break documents (PDF, HTML, Markdown, TXT) into text chunks." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 1 row with 0 errors in 0.13 s (7.68 rows/s)\n" ] }, { "data": { "text/plain": [ "1 row inserted." 
] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pixeltable as pxt\n", "from pixeltable.functions.document import document_splitter\n", "\n", "pxt.drop_dir('split_demo', force=True)\n", "pxt.create_dir('split_demo')\n", "\n", "docs = pxt.create_table('split_demo/docs', {'doc': pxt.Document})\n", "docs.insert(\n", " [\n", " {\n", " 'doc': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Jefferson-Amazon.pdf'\n", " }\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| text |
|---|
| FINANCIAL SONAR™: REALITY RADAR ON COMPANY PERFORMANCE NASDAQGSAMZN AMAZON.COM |
| INC. REGION NORTH AMERICA INDUSTRY INTERNET AND DIRECT MARKETING RETAIL SELL OVERALL RATING FOR 1ST QUARTER 2024 www.jeffersonresearch.com |
| © 2024 Jefferson Research & Management Report prepared on June 21, 2024 OUR EVALUATION OF AMZN Amazon.com Inc. is showing strong Earnings Quality and Balance Sheet Quality, but Valuation suggests a higher amount of price risk, and Cash Flow Quality and Operating Efficiency are both weak. |
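
To make the one-row-to-many-rows idea concrete, here is a minimal plain-Python sketch of chunking text with a size limit and an optional overlap. It is an illustration only, not Pixeltable's implementation: `chunk_words`, `limit`, `overlap`, and the `pos` field are hypothetical names, and it counts whitespace-separated words rather than real tokens.

```python
from typing import Iterator


def chunk_words(text: str, limit: int, overlap: int = 0) -> Iterator[dict]:
    """Yield one row (a dict) per chunk of at most `limit` whitespace-separated words.

    Hypothetical helper for illustration; not part of Pixeltable.
    """
    if limit <= 0:
        raise ValueError('limit must be positive')
    if not 0 <= overlap < limit:
        raise ValueError('overlap must satisfy 0 <= overlap < limit')

    words = text.split()
    step = limit - overlap  # how far the window advances between chunks
    pos = 0
    for start in range(0, len(words), step):
        yield {'pos': pos, 'text': ' '.join(words[start:start + limit])}
        pos += 1
        if start + limit >= len(words):
            break  # the last window already reached the end of the text


# One input string fans out into several output rows.
rows = list(chunk_words('one two three four five six seven', limit=3, overlap=1))
for row in rows:
    print(row)
```

Each yielded dict plays the role of one output row, which is why a single source document can produce many rows in the view.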
| frame | frame_attrs |
|---|---|
| (frame image) | {"dts": 0, "pts": 0, "time": 0., "index": 0, "key_frame": true, "pict_type": 1, "is_corrupt": false, "interlaced_frame": false} |
| (frame image) | {"dts": 25, "pts": 25, "time": 1., "index": 25, "key_frame": false, "pict_type": 3, "is_corrupt": false, "interlaced_frame": false} |
| (frame image) | {"dts": 50, "pts": 50, "time": 2., "index": 50, "key_frame": false, "pict_type": 2, "is_corrupt": false, "interlaced_frame": false} |
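
The `index` and `time` values above (0, 25, 50 at 0, 1, and 2 seconds) are consistent with a 25 fps source sampled at one frame per second. That reading is an assumption, since the cell that produced the output is not shown here. The plain-Python sketch below shows the arithmetic; `sampled_frame_indices` and its parameters are hypothetical names, and the code does not describe how Pixeltable selects frames internally.

```python
def sampled_frame_indices(
    num_frames: int, source_fps: float, target_fps: float
) -> list[tuple[int, float]]:
    """Return (frame index, timestamp in seconds) pairs for sampling at `target_fps`.

    Hypothetical helper for illustration; not part of Pixeltable.
    """
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError('frame rates must be positive')

    picked = []
    k = 0
    while True:
        t = k / target_fps             # timestamp of the k-th sample
        idx = round(t * source_fps)    # nearest source frame to that timestamp
        if idx >= num_frames:
            break                      # past the end of the video
        picked.append((idx, t))
        k += 1
    return picked


# Assumed numbers: a 3-second clip at 25 fps, sampled once per second.
picked = sampled_frame_indices(num_frames=75, source_fps=25.0, target_fps=1.0)
print(picked)
```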
| text |
|---|
| AI data infrastructure simplifies ML workflows. |
| Declarative pipelines update incrementally. |
| This makes development faster and more maintainable. |
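
For intuition, the same three rows can be reproduced with a naive regular-expression splitter. This is a stand-in for illustration, not the method used by the notebook: the setup cell installs spaCy, which handles abbreviations and other edge cases that this pattern does not. `split_sentences` is a hypothetical name.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
# Naive: abbreviations such as "e.g. this" would be split incorrectly.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def split_sentences(text: str) -> list[dict]:
    """Return one row (a dict) per sentence found in `text`.

    Hypothetical helper for illustration; not part of Pixeltable.
    """
    parts = [part.strip() for part in _SENTENCE_END.split(text.strip())]
    return [{'text': part} for part in parts if part]


rows = split_sentences(
    'AI data infrastructure simplifies ML workflows. '
    'Declarative pipelines update incrementally. '
    'This makes development faster and more maintainable.'
)
for row in rows:
    print(row['text'])
```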
| tile_coord | tile |
|---|---|
| [14, 0] | (tile image) |
| [1, 7] | (tile image) |
| [19, 7] | (tile image) |
| [1, 1] | (tile image) |
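
The sketch below shows one way a grid of tile coordinates like those above can be derived from an image size and a tile size. It is a plain-Python illustration under stated assumptions: `tile_boxes`, `tile_box`, and the overlap parameters are hypothetical names, the `[x, y]` ordering of `tile_coord` is assumed rather than confirmed by the output, and none of this describes Pixeltable's internal logic.

```python
from typing import Iterator


def tile_boxes(
    width: int,
    height: int,
    tile_w: int,
    tile_h: int,
    overlap_x: int = 0,
    overlap_y: int = 0,
) -> Iterator[dict]:
    """Yield one row per tile with its grid coordinate and pixel bounding box.

    Hypothetical helper for illustration; not part of Pixeltable.
    """
    step_x = tile_w - overlap_x
    step_y = tile_h - overlap_y
    if step_x <= 0 or step_y <= 0:
        raise ValueError('overlap must be smaller than the tile size')

    for yi, top in enumerate(range(0, height, step_y)):
        for xi, left in enumerate(range(0, width, step_x)):
            # Clip tiles on the right and bottom edges to the image bounds.
            box = (left, top, min(left + tile_w, width), min(top + tile_h, height))
            yield {'tile_coord': [xi, yi], 'tile_box': box}


# Assumed numbers: a 500x300 image cut into 200x200 tiles without overlap.
tiles = list(tile_boxes(500, 300, 200, 200))
for tile in tiles:
    print(tile)
```

Each box uses the `(left, top, right, bottom)` convention, so it could be passed to an image-cropping call to produce the pixels for that tile.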