{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Split data into multiple rows with iterators\n", "\n", "Transform a single document, video, image, audio file, or string into multiple rows for granular processing.\n", "\n", "**What's in this recipe:**\n", "\n", "- Split documents into text chunks for RAG\n", "- Extract frames or segments from videos\n", "- Split strings into sentences\n", "- Tile images for high-resolution analysis\n", "- Chunk audio files for transcription" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem\n", "\n", "You have documents, videos, or text that you need to break into smaller pieces for processing. A PDF needs to be split into chunks for retrieval-augmented generation. A video needs individual frames for analysis. Text needs to be divided into sentences or sliding windows.\n", "\n", "You need a way to transform one source row into multiple output rows automatically." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution\n", "\n", "Create a view with an iterator function. Each source row expands into multiple view rows, one per chunk, frame, tile, or segment. Pixeltable provides built-in iterators for documents, videos, images, audio, and strings.\n", "\n", "### Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -qU pixeltable spacy tiktoken\n", "!python -m spacy download en_core_web_sm -q" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split documents into chunks\n", "\n", "Use `document_splitter` to break documents (PDF, HTML, Markdown, TXT) into text chunks." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 1 row with 0 errors in 0.13 s (7.68 rows/s)\n" ] }, { "data": { "text/plain": [ "1 row inserted." 
] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pixeltable as pxt\n", "from pixeltable.functions.document import document_splitter\n", "\n", "pxt.drop_dir('split_demo', force=True)\n", "pxt.create_dir('split_demo')\n", "\n", "docs = pxt.create_table('split_demo/docs', {'doc': pxt.Document})\n", "docs.insert(\n", " [\n", " {\n", " 'doc': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Jefferson-Amazon.pdf'\n", " }\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
text
FINANCIAL SONAR™: REALITY RADAR ON COMPANY PERFORMANCE\n", "NASDAQGSAMZN AMAZON.COM
INC.\n", "REGION NORTH AMERICA\n", "INDUSTRY INTERNET AND DIRECT MARKETING RETAIL\n", "SELL OVERALL RATING FOR 1ST QUARTER 2024\n", "www.jeffersonresearch.com
© 2024 Jefferson Research & Management Report prepared on June 21, 2024\n", "OUR EVALUATION OF AMZN\n", "Amazon.com Inc. is showing strong Earnings Quality and Balance Sheet Quality, but\n", "Valuation suggests a higher amount of price risk, and Cash Flow Quality and Operating\n", "Efficiency are both weak.
" ], "text/plain": [ " text\n", "0 FINANCIAL SONAR™: REALITY RADAR ON COMPANY PER...\n", "1 INC.\\nREGION NORTH AMERICA\\nINDUSTRY INTERNET ...\n", "2 © 2024 Jefferson Research & Management Report ..." ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chunks = pxt.create_view(\n", " 'split_demo/doc_chunks',\n", " docs,\n", " iterator=document_splitter(\n", " docs.doc, separators='sentence,token_limit', limit=300\n", " ),\n", ")\n", "chunks.select(chunks.text).limit(3).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Available separators:**\n", "\n", "- `heading` — Split on HTML/Markdown headings\n", "- `sentence` — Split on sentence boundaries (requires spaCy)\n", "- `token_limit` — Split by token count, at most `limit` tokens per chunk (requires tiktoken)\n", "- `char_limit` — Split by character count, at most `limit` characters per chunk\n", "- `page` — Split by page (PDF only)\n", "\n", "[SDK Reference: document_splitter](https://docs.pixeltable.com/sdk/latest/document)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract frames from videos\n", "\n", "Use `frame_iterator` to extract frames at specified intervals." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 1 row with 0 errors in 1.28 s (0.78 rows/s)\n" ] }, { "data": { "text/plain": [ "1 row inserted." 
] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pixeltable.functions.video import frame_iterator\n", "\n", "videos = pxt.create_table('split_demo/videos', {'video': pxt.Video})\n", "videos.insert(\n", " [\n", " {\n", " 'video': 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/bangkok.mp4'\n", " }\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
frameframe_attrs
\n", " \n", "
{"dts": 0, "pts": 0, "time": 0., "index": 0, "key_frame": true, "pict_type": 1, "is_corrupt": false, "interlaced_frame": false}
\n", " \n", "
{"dts": 25, "pts": 25, "time": 1., "index": 25, "key_frame": false, "pict_type": 3, "is_corrupt": false, "interlaced_frame": false}
\n", " \n", "
{"dts": 50, "pts": 50, "time": 2., "index": 50, "key_frame": false, "pict_type": 2, "is_corrupt": false, "interlaced_frame": false}
" ], "text/plain": [ " frame \\\n", "0 \n", " \n", " \n", " segment_start\n", " segment_end\n", " video_segment\n", " \n", " \n", " \n", " \n", " 0.\n", " 5.\n", "
\n", " \n", "
\n", " \n", " \n", " 5.\n", " 10.\n", "
\n", " \n", "
\n", " \n", " \n", " 10.\n", " 15.\n", "
\n", " \n", "
\n", " \n", " \n", "" ], "text/plain": [ " segment_start segment_end \\\n", "0 0.0 5.0 \n", "1 5.0 10.0 \n", "2 10.0 15.0 \n", "\n", " video_segment \n", "0 /Users/asiegel/.pixeltable/media/3c7074a768504... \n", "1 /Users/asiegel/.pixeltable/media/3c7074a768504... \n", "2 /Users/asiegel/.pixeltable/media/3c7074a768504... " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pixeltable.functions.video import video_splitter\n", "\n", "segments = pxt.create_view(\n", " 'split_demo/segments',\n", " videos,\n", " iterator=video_splitter(\n", " videos.video, duration=5.0, min_segment_duration=1.0\n", " ),\n", ")\n", "segments.select(\n", " segments.segment_start, segments.segment_end, segments.video_segment\n", ").limit(3).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**video_splitter options:**\n", "\n", "- `duration` — Duration of each segment in seconds\n", "- `overlap` — Overlap between segments in seconds\n", "- `min_segment_duration` — Drop last segment if shorter than this\n", "\n", "[SDK Reference: video_splitter](https://docs.pixeltable.com/sdk/latest/video)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split strings into sentences\n", "\n", "Use `string_splitter` to divide text into sentences." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 1 row with 0 errors in 0.03 s (38.38 rows/s)\n" ] }, { "data": { "text/plain": [ "1 row inserted." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pixeltable.functions.string import string_splitter\n", "\n", "texts = pxt.create_table('split_demo/texts', {'content': pxt.String})\n", "texts.insert(\n", " [\n", " {\n", " 'content': 'AI data infrastructure simplifies ML workflows. Declarative pipelines update incrementally. 
This makes development faster and more maintainable.'\n", " }\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
text
AI data infrastructure simplifies ML workflows.
Declarative pipelines update incrementally.
This makes development faster and more maintainable.
" ], "text/plain": [ " text\n", "0 AI data infrastructure simplifies ML workflows.\n", "1 Declarative pipelines update incrementally.\n", "2 This makes development faster and more maintai..." ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentences = pxt.create_view(\n", " 'split_demo/sentences',\n", " texts,\n", " iterator=string_splitter(texts.content, separators='sentence'),\n", ")\n", "sentences.select(sentences.text).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[SDK Reference: string_splitter](https://docs.pixeltable.com/sdk/latest/string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tile images for analysis\n", "\n", "Use `tile_iterator` to divide large images into a grid of smaller tiles. This is useful for processing high-resolution images that are too large to analyze at once, or for running object detection on different regions." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 1 row with 0 errors in 0.09 s (11.69 rows/s)\n" ] }, { "data": { "text/plain": [ "1 row inserted." 
] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pixeltable.functions.image import tile_iterator\n", "\n", "images = pxt.create_table('split_demo/images', {'image': pxt.Image})\n", "images.insert(\n", " [\n", " {\n", " 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/pixeltable-logo-large.png'\n", " }\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "tiles = pxt.create_view(\n", " 'split_demo/tiles',\n", " images,\n", " iterator=tile_iterator(images.image, tile_size=(100, 100)),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**tile_iterator options:**\n", "\n", "- `tile_size` — Size of each tile as `(width, height)`\n", "- `overlap` — Overlap between adjacent tiles as `(width, height)`\n", "\n", "[SDK Reference: tile_iterator](https://docs.pixeltable.com/sdk/latest/image)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tile_coordtile
[14, 0]
\n", " \n", "
[1, 7]
\n", " \n", "
[19, 7]
\n", " \n", "
[1, 1]
\n", " \n", "
" ], "text/plain": [ " tile_coord tile\n", "0 [14, 0] \n", " \n", " \n", " segment_start\n", " segment_end\n", " \n", " \n", " \n", " \n", " 0.\n", " 29.989\n", " \n", " \n", " 28.003\n", " 57.992\n", " \n", " \n", " 56.007\n", " 85.995\n", " \n", " \n", " 84.01\n", " 113.998\n", " \n", " \n", " 112.013\n", " 141.976\n", " \n", " \n", "" ], "text/plain": [ " segment_start segment_end\n", "0 0.0000 29.9887\n", "1 28.0033 57.9919\n", "2 56.0065 85.9952\n", "3 84.0098 113.9984\n", "4 112.0131 141.9756" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "audio_segments = pxt.create_view(\n", " 'split_demo/audio_chunks',\n", " audio,\n", " iterator=audio_splitter(audio.audio, duration=30.0, overlap=2.0),\n", ")\n", "audio_segments.select(\n", " audio_segments.segment_start, audio_segments.segment_end\n", ").limit(5).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**audio_splitter options:**\n", "\n", "- `duration` — Duration of each chunk in seconds\n", "- `overlap` — Overlap between chunks in seconds\n", "- `min_segment_duration` — Drop last chunk if shorter than this\n", "\n", "[SDK Reference: audio_splitter](https://docs.pixeltable.com/sdk/latest/audio)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## See also\n", "\n", "- [Split documents for RAG](https://docs.pixeltable.com/howto/cookbooks/text/doc-chunk-for-rag)\n", "- [Extract frames from videos](https://docs.pixeltable.com/howto/cookbooks/video/video-extract-frames)\n", "- [Transcribe audio files](https://docs.pixeltable.com/howto/cookbooks/audio/audio-transcribe)" ] } ], "metadata": { "kernelspec": { "display_name": "pxt", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.19" } }, "nbformat": 4, "nbformat_minor": 2 }