{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Get fast feedback on transformations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem\n", "\n", "You need to iterate on transformation logic before running it on your entire dataset—especially for expensive operations like API calls or model inference." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution\n", "\n", "**What's in this recipe:**\n", "\n", "- Test transformations on sample rows before applying to your full dataset\n", "- Save expressions as variables to guarantee consistent logic\n", "- Apply the iterate-then-add workflow with built-in functions, expressions, and custom UDFs\n", "\n", "You test transformation logic on sample rows before processing your entire dataset using the iterate-then-add workflow. This lets you validate logic on a few rows before committing to your full table.\n", "\n", "You use `.select()` with `.collect()` to preview transformations—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Once you're satisfied with the results, use `.add_computed_column()` with the same expression to persist the transformation across your full table.\n", "\n", "This workflow applies to any data type in Pixeltable: images, videos, audio files, documents, and structured tabular data. This recipe uses text data and shows three examples:\n", "\n", "1. Testing built-in functions on sample data\n", "1. Saving expressions as variables to ensure consistency\n", "1. Iterating with custom user-defined functions (UDFs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -qU pixeltable" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pixeltable as pxt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create sample data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata\n", "Created directory 'demo_project'.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a fresh directory (drop existing if present)\n", "pxt.drop_dir('demo_project', force=True)\n", "pxt.create_dir('demo_project')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created table 'lyrics'.\n" ] } ], "source": [ "t = pxt.create_table('demo_project/lyrics', {'text': pxt.String})" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserting rows into `lyrics`: 6 rows [00:00, 867.67 rows/s]\n", "Inserted 6 rows with 0 errors.\n" ] }, { "data": { "text/plain": [ "6 rows inserted, 6 values computed." ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.insert(\n", " [\n", " {'text': 'Tumble out of bed and I stumble to the kitchen'},\n", " {'text': 'Pour myself a cup of ambition'},\n", " {'text': 'And yawn and stretch and try to come to life'},\n", " {'text': \"Jump in the shower and the blood starts pumpin'\"},\n", " {'text': \"Out on the street, the traffic starts jumpin'\"},\n", " {'text': 'With folks like me on the job from nine to five'},\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example 1: built-in functions\n", "\n", "Iterate with built-in functions, then add to the table." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textuppercase
Tumble out of bed and I stumble to the kitchenTUMBLE OUT OF BED AND I STUMBLE TO THE KITCHEN
Pour myself a cup of ambitionPOUR MYSELF A CUP OF AMBITION
" ], "text/plain": [ " text \\\n", "0 Tumble out of bed and I stumble to the kitchen \n", "1 Pour myself a cup of ambition \n", "\n", " uppercase \n", "0 TUMBLE OUT OF BED AND I STUMBLE TO THE KITCHEN \n", "1 POUR MYSELF A CUP OF AMBITION " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Test uppercase transformation on subset\n", "t.select(t.text, uppercase=t.text.upper()).head(2)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
text
Tumble out of bed and I stumble to the kitchen
Pour myself a cup of ambition
" ], "text/plain": [ " text\n", "0 Tumble out of bed and I stumble to the kitchen\n", "1 Pour myself a cup of ambition" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Confirm the transformation was only in memory—table unchanged\n", "t.head(2)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 6 column values with 0 errors.\n" ] }, { "data": { "text/plain": [ "6 rows updated, 12 values computed." ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Apply to all rows (same expression)\n", "t.add_computed_column(uppercase=t.text.upper())" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textuppercase
Tumble out of bed and I stumble to the kitchenTUMBLE OUT OF BED AND I STUMBLE TO THE KITCHEN
Pour myself a cup of ambitionPOUR MYSELF A CUP OF AMBITION
And yawn and stretch and try to come to lifeAND YAWN AND STRETCH AND TRY TO COME TO LIFE
Jump in the shower and the blood starts pumpin'JUMP IN THE SHOWER AND THE BLOOD STARTS PUMPIN'
Out on the street, the traffic starts jumpin'OUT ON THE STREET, THE TRAFFIC STARTS JUMPIN'
With folks like me on the job from nine to fiveWITH FOLKS LIKE ME ON THE JOB FROM NINE TO FIVE
" ], "text/plain": [ " text \\\n", "0 Tumble out of bed and I stumble to the kitchen \n", "1 Pour myself a cup of ambition \n", "2 And yawn and stretch and try to come to life \n", "3 Jump in the shower and the blood starts pumpin' \n", "4 Out on the street, the traffic starts jumpin' \n", "5 With folks like me on the job from nine to five \n", "\n", " uppercase \n", "0 TUMBLE OUT OF BED AND I STUMBLE TO THE KITCHEN \n", "1 POUR MYSELF A CUP OF AMBITION \n", "2 AND YAWN AND STRETCH AND TRY TO COME TO LIFE \n", "3 JUMP IN THE SHOWER AND THE BLOOD STARTS PUMPIN' \n", "4 OUT ON THE STREET, THE TRAFFIC STARTS JUMPIN' \n", "5 WITH FOLKS LIKE ME ON THE JOB FROM NINE TO FIVE " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View text with uppercase column\n", "t.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example 2: save and reuse expressions\n", "\n", "Save an expression as a variable to guarantee the same logic in both iterate and add steps." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textchar_count
Tumble out of bed and I stumble to the kitchen46
Pour myself a cup of ambition29
" ], "text/plain": [ " text char_count\n", "0 Tumble out of bed and I stumble to the kitchen 46\n", "1 Pour myself a cup of ambition 29" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define the expression once - no duplication\n", "char_count_expr = t.text.len()\n", "\n", "# Iterate: Test on subset\n", "t.select(t.text, char_count=char_count_expr).head(2)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textuppercase
Tumble out of bed and I stumble to the kitchenTUMBLE OUT OF BED AND I STUMBLE TO THE KITCHEN
Pour myself a cup of ambitionPOUR MYSELF A CUP OF AMBITION
" ], "text/plain": [ " text \\\n", "0 Tumble out of bed and I stumble to the kitchen \n", "1 Pour myself a cup of ambition \n", "\n", " uppercase \n", "0 TUMBLE OUT OF BED AND I STUMBLE TO THE KITCHEN \n", "1 POUR MYSELF A CUP OF AMBITION " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Confirm the transformation was only in memory—table unchanged\n", "t.head(2)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 6 column values with 0 errors.\n" ] }, { "data": { "text/plain": [ "6 rows updated, 12 values computed." ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add: Use the SAME expression to persist\n", "t.add_computed_column(char_count=char_count_expr)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textuppercasechar_count
Tumble out of bed and I stumble to the kitchenTUMBLE OUT OF BED AND I STUMBLE TO THE KITCHEN46
Pour myself a cup of ambitionPOUR MYSELF A CUP OF AMBITION29
And yawn and stretch and try to come to lifeAND YAWN AND STRETCH AND TRY TO COME TO LIFE44
Jump in the shower and the blood starts pumpin'JUMP IN THE SHOWER AND THE BLOOD STARTS PUMPIN'47
Out on the street, the traffic starts jumpin'OUT ON THE STREET, THE TRAFFIC STARTS JUMPIN'45
With folks like me on the job from nine to fiveWITH FOLKS LIKE ME ON THE JOB FROM NINE TO FIVE47
" ], "text/plain": [ " text \\\n", "0 Tumble out of bed and I stumble to the kitchen \n", "1 Pour myself a cup of ambition \n", "2 And yawn and stretch and try to come to life \n", "3 Jump in the shower and the blood starts pumpin' \n", "4 Out on the street, the traffic starts jumpin' \n", "5 With folks like me on the job from nine to five \n", "\n", " uppercase char_count \n", "0 TUMBLE OUT OF BED AND I STUMBLE TO THE KITCHEN 46 \n", "1 POUR MYSELF A CUP OF AMBITION 29 \n", "2 AND YAWN AND STRETCH AND TRY TO COME TO LIFE 44 \n", "3 JUMP IN THE SHOWER AND THE BLOOD STARTS PUMPIN' 47 \n", "4 OUT ON THE STREET, THE TRAFFIC STARTS JUMPIN' 45 \n", "5 WITH FOLKS LIKE ME ON THE JOB FROM NINE TO FIVE 47 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View text with char_count column\n", "t.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This pattern works with any expression:\n", "\n", "- Built-in functions: `resize_expr = t.image.resize((224, 224))`\n", "- UDFs: `watermark_expr = add_watermark(t.image, '© 2024')`\n", "- Chained operations: `processed_expr = t.image.resize((224, 224)).rotate(90)`\n", "\n", "Benefits:\n", "\n", "- Write the expression once, use it twice\n", "- No copy-paste—reuse the same logic\n", "- Easy to iterate: change in one place, test again" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example 3: custom UDF\n", "\n", "Iterate with a user-defined function, then add to the table." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Define a custom transformation\n", "@pxt.udf\n", "def word_count(text: str) -> int:\n", " return len(text.split())" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textword_count
Tumble out of bed and I stumble to the kitchen10
Pour myself a cup of ambition6
" ], "text/plain": [ " text word_count\n", "0 Tumble out of bed and I stumble to the kitchen 10\n", "1 Pour myself a cup of ambition 6" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Iterate: Test UDF on subset\n", "t.select(t.text, word_count=word_count(t.text)).head(2)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textuppercasechar_count
Tumble out of bed and I stumble to the kitchenTUMBLE OUT OF BED AND I STUMBLE TO THE KITCHEN46
Pour myself a cup of ambitionPOUR MYSELF A CUP OF AMBITION29
" ], "text/plain": [ " text \\\n", "0 Tumble out of bed and I stumble to the kitchen \n", "1 Pour myself a cup of ambition \n", "\n", " uppercase char_count \n", "0 TUMBLE OUT OF BED AND I STUMBLE TO THE KITCHEN 46 \n", "1 POUR MYSELF A CUP OF AMBITION 29 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Confirm the transformation was only in memory—table unchanged\n", "t.head(2)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 6 column values with 0 errors.\n" ] }, { "data": { "text/plain": [ "6 rows updated, 12 values computed." ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add: Apply to all rows (same expression)\n", "t.add_computed_column(word_count=word_count(t.text))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textuppercasechar_countword_count
Tumble out of bed and I stumble to the kitchenTUMBLE OUT OF BED AND I STUMBLE TO THE KITCHEN4610
Pour myself a cup of ambitionPOUR MYSELF A CUP OF AMBITION296
And yawn and stretch and try to come to lifeAND YAWN AND STRETCH AND TRY TO COME TO LIFE4410
Jump in the shower and the blood starts pumpin'JUMP IN THE SHOWER AND THE BLOOD STARTS PUMPIN'479
Out on the street, the traffic starts jumpin'OUT ON THE STREET, THE TRAFFIC STARTS JUMPIN'458
With folks like me on the job from nine to fiveWITH FOLKS LIKE ME ON THE JOB FROM NINE TO FIVE4711
" ], "text/plain": [ " text \\\n", "0 Tumble out of bed and I stumble to the kitchen \n", "1 Pour myself a cup of ambition \n", "2 And yawn and stretch and try to come to life \n", "3 Jump in the shower and the blood starts pumpin' \n", "4 Out on the street, the traffic starts jumpin' \n", "5 With folks like me on the job from nine to five \n", "\n", " uppercase char_count word_count \n", "0 TUMBLE OUT OF BED AND I STUMBLE TO THE KITCHEN 46 10 \n", "1 POUR MYSELF A CUP OF AMBITION 29 6 \n", "2 AND YAWN AND STRETCH AND TRY TO COME TO LIFE 44 10 \n", "3 JUMP IN THE SHOWER AND THE BLOOD STARTS PUMPIN' 47 9 \n", "4 OUT ON THE STREET, THE TRAFFIC STARTS JUMPIN' 45 8 \n", "5 WITH FOLKS LIKE ME ON THE JOB FROM NINE TO FIVE 47 11 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View text with word_count column\n", "t.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explanation\n", "\n", "**How the iterate-then-add workflow works:**\n", "\n", "Queries and computed columns serve different purposes. Queries let you test transformations on sample rows without storing anything. Once you're satisfied with the results, you use the exact same expression with `.add_computed_column()` to persist it across your entire table.\n", "\n", "This workflow is especially valuable for expensive operations—API calls, model inference, complex image processing—where you want to validate logic before processing your full dataset. Test on 2-3 rows to catch errors early, then commit once.\n", "\n", "**To customize this workflow:**\n", "\n", "- **Sample size**: Use `.head(n)` to collect only the first n rows—`.head(1)` for single-row testing, `.head(10)` for broader validation, or `.collect()` to collect all rows\n", "- **Save expressions**: Store transformations as variables (Example 2) to guarantee identical logic in both iterate and add steps\n", "- **Chain transformations**: Test multiple operations together—`.select(t.text.upper().split())` works just like single operations\n", "- **Use with any data type**: This pattern works with images, videos, audio, documents—not just text. For multimodal data, visual inspection during iteration is especially valuable\n", "\n", "**The Pixeltable workflow:**\n", "\n", "In traditional databases, `.select()` just picks which columns to view. In Pixeltable, `.select()` also lets you compute new transformations on the fly—define new columns without storing them. This makes `.select()` perfect for testing transformations before you commit them.\n", "\n", "When you use `.select()`, you're creating a query. Queries are temporary operations that retrieve and transform data from tables—they don't store anything. Queries use lazy evaluation, meaning they don't execute until you call `.collect()`. You must use `.collect()` to execute the query and return results. `.head(n)` is a convenience method that collects only the first n rows instead of all rows. Use `.head(n)` when iterating to get fast feedback without processing your entire dataset.\n", "\n", "Nothing is stored in your table when you run queries. You can test different approaches quickly without affecting your data. You can store query results in a Python variable to work with them in your session.\n", "\n", "```python\n", "# Store query results as a variable (in memory only)\n", "results = t.select(\n", " t.text,\n", " uppercase=t.text.upper() # Label the transformed column\n", ").head(3)\n", "```\n", "\n", "These results are stored in memory and will not persist across sessions—only `.add_computed_column()` persists data to your table.\n", "\n", "Once you're satisfied, `.add_computed_column()` uses the same expression but adds it as a persistent column in your table. Now the transformation runs on all rows and results are stored permanently." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## See also\n", "\n", "- [Transform images with PIL operations](https://docs.pixeltable.com/howto/cookbooks/images/img-pil-transforms)\n", "- [Convert RGB images to grayscale](https://docs.pixeltable.com/howto/cookbooks/images/img-rgb-to-grayscale)\n", "- [Apply filters to images](https://docs.pixeltable.com/howto/cookbooks/images/img-apply-filters)" ] } ], "metadata": { "kernelspec": { "display_name": "pixeltable", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 2 }