{ "cells": [ { "cell_type": "markdown", "id": "ba79d33d-7ae2-4f0a-94c3-783f799e1ff4", "metadata": {}, "source": [ "# Queries and Expressions\n", "\n", "Expressions are the basic building blocks of Pixeltable. This guide explores how to use queries and expressions, including:\n", "\n", "- Different types of Pixeltable expressions\n", "- Column references and arithmetic operations\n", "- Function calls and media operations\n", "- The Pixeltable type system\n", "\n", "## Prerequisites\n", "\n", "This guide assumes you're familiar with:\n", "\n", "- Creating and managing tables\n", "- Basic table operations and queries\n", "- Computed columns\n", "\n", "If you're new to these concepts, start with:\n", "\n", "- [Tables and Data Operations](https://docs.pixeltable.com/tutorials/tables-and-data-operations)\n", "- [Computed Columns](https://docs.pixeltable.com/tutorials/computed-columns)\n", "\n", "## Understanding Expressions\n", "\n", "You can use Pixeltable expressions in queries:\n", "\n", "```python\n", "pop_t.select(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022)).collect()\n", "```\n", "\n", "Or as computed columns that update automatically:\n", "\n", "```python\n", "pop_t.add_column(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022))\n", "```\n", "\n", "Both examples use the expression `pop_t.pop_2023 - pop_t.pop_2022`. You can also chain operations:\n", "\n", "```python\n", "t.source.convert('RGBA').rotate(10)\n", "```\n", "\n", "Or invoke models:\n", "\n", "```python\n", "detr_for_object_detection(\n", " t.source,\n", " model_id='facebook/detr-resnet-50',\n", " threshold=0.8\n", ")\n", "```\n", "\n", "You can include an expression in a `select()` statement to evaluate it dynamically, or in an `add_column()` statement to add it to the table schema as a computed column.\n", "\n", "To get started, let's import the necessary libraries and set up a demo directory." 
] }, { "cell_type": "code", "execution_count": null, "id": "af95c4aa-7905-44e9-b387-af512c347b68", "metadata": {}, "outputs": [], "source": [ "%pip install -qU pixeltable datasets torch transformers" ] }, { "cell_type": "code", "execution_count": 1, "id": "7c14c1fe-83fb-48dd-a766-a2b22c9d214a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata\n", "Created directory 'demo'.\n" ] } ], "source": [ "import pixeltable as pxt\n", "\n", "pxt.drop_dir('demo', force=True)\n", "pxt.create_dir('demo')" ] }, { "cell_type": "markdown", "id": "ba3ff2d8-aafd-45af-b3e0-a86716ccb63a", "metadata": {}, "source": [ "In this guide we'll work with a subset of the MNIST dataset, a classic reference database of hand-drawn digits. A copy of the MNIST dataset is hosted on the Hugging Face datasets repository, so we can use `create_table()` with the `source` parameter to load it into a Pixeltable table." ] }, { "cell_type": "code", "execution_count": 2, "id": "c515c6ba-959f-4da0-9bba-872acf2a845c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created table 'mnist'.\n", "Inserting rows into `mnist`: 50 rows [00:00, 7516.67 rows/s]\n", "Inserted 50 rows with 0 errors.\n" ] } ], "source": [ "import datasets\n", "\n", "# Download the first 50 images of the MNIST dataset\n", "ds = datasets.load_dataset('ylecun/mnist', split='train[:50]')\n", "\n", "# Import them into a Pixeltable table\n", "t = pxt.create_table('demo/mnist', source=ds)" ] }, { "cell_type": "code", "execution_count": 3, "id": "57119aee-d513-4489-a3b5-44f5e2988636", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
imagelabel
\n", " \n", "
5
\n", " \n", "
0
\n", " \n", "
4
\n", " \n", "
1
\n", " \n", "
9
" ], "text/plain": [ " image label\n", "0 \n", "#T_2d71b_row0_col0 {\n", " white-space: pre-wrap;\n", " text-align: left;\n", " font-weight: bold;\n", "}\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column\n", "'image'\n", "(of table 'demo/mnist')
\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column NameTypeComputed With
imageImage
\n" ], "text/plain": [ "Column\n", "'image'\n", "(of table 'demo/mnist')\n", "\n", " Column Name Type Computed With\n", " image Image " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.image" ] }, { "cell_type": "markdown", "id": "7bc642b7-22f9-43de-acb2-bde0242e024c", "metadata": {}, "source": [ "This is true of all Pixeltable expressions: we can freely create them and manipulate them in various ways, but no actual data will be loaded until we use them in a query." ] }, { "cell_type": "markdown", "id": "cfd530a1-96ad-4798-a3db-d6c0de05f532", "metadata": {}, "source": [ "### JSON Collections (Dicts and Lists)" ] }, { "cell_type": "markdown", "id": "8b20135d-7479-40aa-bf58-b6a7668d9538", "metadata": {}, "source": [ "Data is commonly presented in JSON format: for example, API responses and model output often take the shape of JSON dictionaries or lists of dictionaries. Pixeltable has native support for JSON accessors. To demonstrate this, let's add a computed column that runs an image classification model against the images in our dataset." ] }, { "cell_type": "code", "execution_count": 5, "id": "212b5bb4-af25-42a8-b936-209372c9f4bd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 50 column values with 0 errors.\n" ] }, { "data": { "text/plain": [ "50 rows updated, 50 values computed." 
] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pixeltable.functions.huggingface import vit_for_image_classification\n", "\n", "t.add_computed_column(\n", " classification=vit_for_image_classification(\n", " t.image, model_id='farleyknight-org-username/vit-base-mnist'\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "id": "d3521b63-dfc6-4ca8-aadf-2f48aca5525c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
imageclassification
\n", " \n", "
{"labels": [5, 3, 2, 8, 7], "scores": [0.981, 0.013, 0.002, 0.001, 0.001], "label_text": ["5", "3", "2", "8", "7"]}
\n", " \n", "
{"labels": [0, 6, 9, 8, 1], "scores": [0.997, 0., 0., 0., 0.], "label_text": ["0", "6", "9", "8", "1"]}
\n", " \n", "
{"labels": [4, 1, 9, 7, 0], "scores": [0.997, 0.001, 0., 0., 0.], "label_text": ["4", "1", "9", "7", "0"]}
" ], "text/plain": [ " image \\\n", "0 \n", "#T_16187_row0_col0 {\n", " white-space: pre-wrap;\n", " text-align: left;\n", " font-weight: bold;\n", "}\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
table 'demo/mnist'
\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column NameTypeComputed With
imageImage
labelString
classificationJsonvit_for_image_classification(image, model_id='farleyknight-org-username/vit-base-mnist')
\n" ], "text/plain": [ "table 'demo/mnist'\n", "\n", " Column Name Type Computed With\n", " image Image \n", " label String \n", " classification Json vit_for_image_classification(image, model_id='..." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t" ] }, { "cell_type": "markdown", "id": "9bfc6d8e-9a61-4399-8689-974caef19b24", "metadata": {}, "source": [ "Pixeltable provides a range of operators on `Json`-typed output that behave just as you'd expect. To look up a key in a dictionary, use the syntax `t.classification['labels']`:" ] }, { "cell_type": "code", "execution_count": 8, "id": "56f8cc75-42f2-4b03-be14-aadda41da31e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
classification_labels
[5, 3, 2, 8, 7]
[0, 6, 9, 8, 1]
[4, 1, 9, 7, 0]
" ], "text/plain": [ " classification_labels\n", "0 [5, 3, 2, 8, 7]\n", "1 [0, 6, 9, 8, 1]\n", "2 [4, 1, 9, 7, 0]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.select(t.classification['labels']).head(3)" ] }, { "cell_type": "markdown", "id": "2798836c-1e35-42ba-8559-550f825eeeb5", "metadata": {}, "source": [ "You can also use a convenient \"attribute\" syntax for dictionary lookups. This follows the standard [JSONPath](https://en.wikipedia.org/wiki/JSONPath) expression syntax." ] }, { "cell_type": "code", "execution_count": 9, "id": "be58ac2e-648e-4922-be20-82b5168de813", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
classification_labels
[5, 3, 2, 8, 7]
[0, 6, 9, 8, 1]
[4, 1, 9, 7, 0]
" ], "text/plain": [ " classification_labels\n", "0 [5, 3, 2, 8, 7]\n", "1 [0, 6, 9, 8, 1]\n", "2 [4, 1, 9, 7, 0]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.select(t.classification.labels).head(3)" ] }, { "cell_type": "markdown", "id": "73d8eb1c-d449-4990-b643-4e4e729fd29f", "metadata": {}, "source": [ "The \"attribute\" syntax isn't fully general (it won't work for dictionary keys that are not valid Python identifiers), but it's handy when it works.\n", "\n", "`t.classification.labels` is another Pixeltable expression; you can think of it as saying, \"do the `'labels'` lookup from every dictionary in the column `t.classification`, and return the result as a new column.\" As before, the expression by itself contains no data; it's the query that does the actual work of retrieving data. Here's what we see if we just give the expression by itself, without a query:" ] }, { "cell_type": "code", "execution_count": 10, "id": "eb2653ad-c389-486a-a103-e9af543fde90", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "classification.labels" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.classification.labels" ] }, { "cell_type": "markdown", "id": "a3047a37-f138-4780-a529-009f92a3073b", "metadata": {}, "source": [ "Similarly, one can pull out a specific item in a list (for this model, we're probably mostly interested in the first item anyway):" ] }, { "cell_type": "code", "execution_count": 11, "id": "e1628434-8810-4ef9-af4f-b92388229e3c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
classification_labels0
5
0
4
" ], "text/plain": [ " classification_labels0\n", "0 5\n", "1 0\n", "2 4" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.select(t.classification.labels[0]).head(3)" ] }, { "cell_type": "markdown", "id": "0c90d66d-9013-44ea-9e16-490bccbe15d1", "metadata": {}, "source": [ "Or slice a list in the usual manner:" ] }, { "cell_type": "code", "execution_count": 12, "id": "e7e8138d-95b2-48bb-8f5d-f427128e22e2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
classification_labels2
[5, 3]
[0, 6]
[4, 1]
" ], "text/plain": [ " classification_labels2\n", "0 [5, 3]\n", "1 [0, 6]\n", "2 [4, 1]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.select(t.classification.labels[:2]).head(3)" ] }, { "cell_type": "markdown", "id": "fe2da061-9485-4f79-9fa8-1d294cde3c62", "metadata": {}, "source": [ "Pixeltable is resilient against out-of-bounds indices or dictionary keys. If an index or key doesn't exist for a particular row, you'll get a `None` output for that row." ] }, { "cell_type": "code", "execution_count": 13, "id": "fcf6932e-b9e7-420c-8c9a-9c72a3b0f1b1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
classification_notakey
None
None
None
" ], "text/plain": [ " classification_notakey\n", "0 None\n", "1 None\n", "2 None" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.select(t.classification.not_a_key).head(3)" ] }, { "cell_type": "markdown", "id": "35e60ade-dcc5-4638-b6bf-3ede6c85353f", "metadata": {}, "source": [ "As always, any expression can be used to create a computed column." ] }, { "cell_type": "code", "execution_count": 14, "id": "bb69e6a9-49cc-4fd9-9493-27b193003950", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 50 column values with 0 errors.\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
table 'demo/mnist'
\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column NameTypeComputed With
imageImage
labelString
classificationJsonvit_for_image_classification(image, model_id='farleyknight-org-username/vit-base-mnist')
pred_labelJsonclassification.label_text[0]
\n" ], "text/plain": [ "table 'demo/mnist'\n", "\n", " Column Name Type Computed With\n", " image Image \n", " label String \n", " classification Json vit_for_image_classification(image, model_id='...\n", " pred_label Json classification.label_text[0]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use label_text to be consistent with t.label, which was given\n", "# to us as a string\n", "\n", "t.add_computed_column(pred_label=t.classification.label_text[0])\n", "t" ] }, { "cell_type": "markdown", "id": "388db168-f1b1-400e-958e-b56281c939ea", "metadata": {}, "source": [ "Finally, just as it's possible to extract items from lists and dictionaries using Pixeltable expressions, you can also construct new lists and dictionaries: just package them up in the usual way." ] }, { "cell_type": "code", "execution_count": 15, "id": "0ea19aab-bd3f-42e2-8b1c-be07a366159e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
imagecol_1
\n", " \n", "
{"ground_truth": "5", "prediction": "5", "is_correct": true, "engine": "pixeltable"}
\n", " \n", "
{"ground_truth": "0", "prediction": "0", "is_correct": true, "engine": "pixeltable"}
\n", " \n", "
{"ground_truth": "4", "prediction": "4", "is_correct": true, "engine": "pixeltable"}
\n", " \n", "
{"ground_truth": "1", "prediction": "1", "is_correct": true, "engine": "pixeltable"}
\n", " \n", "
{"ground_truth": "9", "prediction": "9", "is_correct": true, "engine": "pixeltable"}
" ], "text/plain": [ " image \\\n", "0 \n", " \n", " \n", " image\n", " rot_label\n", " \n", " \n", " \n", " \n", "
\n", " \n", "
\n", " 4\n", " \n", " \n", "
\n", " \n", "
\n", " 0\n", " \n", " \n", "
\n", " \n", "
\n", " 5\n", " \n", " \n", "
\n", " \n", "
\n", " 1\n", " \n", " \n", "
\n", " \n", "
\n", " 8\n", " \n", " \n", "" ], "text/plain": [ " image rot_label\n", "0 \n", "Note that we employed a useful trick here: we assigned an expression to the variable rot_model_result for later reuse. Every Pixeltable expression is a Python object, so you can freely assign them to variables, reuse them, compose them, and so on. Remember that nothing actually happens until the expression is used in a query - so in this example, setting the variable rot_model_result doesn't itself result in any data being retrieved; that only happens later, when we actually use it in the select() query.\n", "\n", "\n", "There are a large number of built-in UDFs that ship with Pixeltable; you can always refer back to the [SDK Documentation](https://docs.pixeltable.com/sdk/latest/) for details." ] }, { "cell_type": "markdown", "id": "a9bdce5c-f774-4e87-9143-5383081060ff", "metadata": {}, "source": [ "### Method Calls\n", "\n", "Many built-in UDFs allow a convenient alternate syntax. The following two expressions are exactly equivalent:" ] }, { "cell_type": "code", "execution_count": 17, "id": "e4d9189b-39f8-47ff-807b-c4f7393a474c", "metadata": {}, "outputs": [], "source": [ "a = t.image.rotate(90)\n", "b = pxt.functions.image.rotate(t.image, 90)" ] }, { "cell_type": "markdown", "id": "827f0033-7b39-4d09-8db6-09326d1764d0", "metadata": {}, "source": [ "`a` and `b` can always be used interchangeably in queries, with identical results. Just like in standard Python classes, whenever Pixeltable sees the __method call__ `t.image.rotate(90)`, it interprets it as a __function call__ `pxt.functions.image.rotate(self, 90)`, with (in this case) `self` equal to `t.image`.\n", "\n", "Any method call can also be written as a function call, but (just like in standard Python) not every function call can be written as a method call. 
For example, the following won't work:\n", "\n", "```python\n", "t.image.vit_for_image_classification(\n", " model_id='farleyknight-org-username/vit-base-mnist'\n", ")\n", "```\n", "\n", "That's because `vit_for_image_classification` is part of the `pxt.functions.huggingface` module, not the core module `pxt.functions.image`. Most Pixeltable types have a corresponding __core module__ of UDFs that can be used as method calls (`pxt.functions.image` for `Image`; `pxt.functions.string` for `String`; and so on), described fully in the [SDK Documentation](https://docs.pixeltable.com/sdk/latest/)." ] }, { "cell_type": "markdown", "id": "f5f4c17a-61fb-4b89-b05a-20b79edebc91", "metadata": {}, "source": [ "### Arithmetic and Boolean Operations" ] }, { "cell_type": "markdown", "id": "7bd67623-5031-4eae-990f-18577ba8dc19", "metadata": {}, "source": [ "Expressions can also be combined using standard arithmetic and boolean operators. As with everything else, arithmetic and boolean expressions are operations on columns that (when used in a query) are applied to every row." ] }, { "cell_type": "code", "execution_count": 18, "id": "b603d361-eddf-4d53-87df-ab2cc30eb42b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
imagelabelcol_2col_3
\n", " \n", "
5FalseFalse
\n", " \n", "
0FalseTrue
\n", " \n", "
4TrueTrue
\n", " \n", "
1FalseTrue
\n", " \n", "
9FalseFalse
" ], "text/plain": [ " image label col_2 col_3\n", "0 \n", " \n", " \n", " image\n", " \n", " \n", " \n", " \n", "
\n", " \n", "
\n", " \n", " \n", "
\n", " \n", "
\n", " \n", " \n", "
\n", " \n", "
\n", " \n", " \n", "
\n", " \n", "
\n", " \n", " \n", "" ], "text/plain": [ " image\n", "0 \n", " \n", " \n", " image\n", " label\n", " rot_label\n", " \n", " \n", " \n", " \n", "
\n", " \n", "
\n", " 5\n", " 4\n", " \n", " \n", "" ], "text/plain": [ " image label rot_label\n", "0 \n", " \n", " \n", " image\n", " clip\n", " \n", " \n", " \n", " \n", "
\n", " \n", "
\n", " [-0.052 -0.259 -0.481 -0.628 0.053 -0.032 ... 0.079 0.087 -0.346 0.904 0.037 -0.161]\n", " \n", " \n", "
\n", " \n", "
\n", " [ 0.063 -0.174 -0.263 -0.596 0.218 -0.332 ... 0.295 0.028 0.001 0.823 -0.149 -0.163]\n", " \n", " \n", "
\n", " \n", "
\n", " [-0.059 -0.222 -0.251 -0.503 -0.112 -0.233 ... 0.038 0.107 -0.329 0.765 -0.363 0.206]\n", " \n", " \n", "
\n", " \n", "
\n", " [ 0.098 -0.196 -0.42 -0.589 0.111 -0.107 ... 0.266 0.179 -0.143 1.023 -0.136 0.092]\n", " \n", " \n", "
\n", " \n", "
\n", " [-0.046 -0.147 -0.455 -0.576 0.228 -0.115 ... 0.521 0.086 -0.24 0.786 0.098 0.171]\n", " \n", " \n", "" ], "text/plain": [ " image \\\n", "0 \n", " \n", " \n", " col_0\n", " col_1\n", " col_2\n", " \n", " \n", " \n", " \n", " -0.052\n", " [-0.032 0.07 1.117 -0.037 -0.158]\n", " [ 0.904 0.037 -0.161]\n", " \n", " \n", " 0.063\n", " [-0.332 -0.201 1.023 -0.003 -0.094]\n", " [ 0.823 -0.149 -0.163]\n", " \n", " \n", " -0.059\n", " [-0.233 0.238 0.864 -0.135 -0.017]\n", " [ 0.765 -0.363 0.206]\n", " \n", " \n", " 0.098\n", " [-0.107 0.038 1.081 -0.111 0.172]\n", " [ 1.023 -0.136 0.092]\n", " \n", " \n", " -0.046\n", " [-0.115 0.066 1.089 -0.017 0.105]\n", " [0.786 0.098 0.171]\n", " \n", " \n", "" ], "text/plain": [ " col_0 col_1 \\\n", "0 -0.052472 [-0.03184929, 0.06963743, 1.1165005, -0.037148... \n", "1 0.062565 [-0.3323701, -0.20095566, 1.0229394, -0.002611... \n", "2 -0.059083 [-0.23313646, 0.23840791, 0.8635649, -0.135253... \n", "3 0.098077 [-0.106710635, 0.038182013, 1.0814098, -0.1106... \n", "4 -0.046089 [-0.11528104, 0.06613226, 1.0892408, -0.016736... \n", "\n", " col_2 \n", "0 [0.90351915, 0.036779515, -0.16105242] \n", "1 [0.82265085, -0.14863057, -0.16314176] \n", "2 [0.76470613, -0.36314437, 0.20581008] \n", "3 [1.0230694, -0.1362458, 0.09243362] \n", "4 [0.7860199, 0.09811185, 0.17112087] " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.select(t.clip[0], t.clip[5:10], t.clip[-3:]).head(5)" ] }, { "cell_type": "markdown", "id": "efba11df-cad4-4487-b572-cf92a0c08017", "metadata": {}, "source": [ "### Ad hoc UDFs with `apply`\n", "\n", "We've now seen the most commonly encountered Pixeltable expression types. There are a few other less commonly encountered expressions that are occasionally useful.\n", "\n", "You can use `apply` to map any Python function onto a column of data. You can think of `apply` as a quick way of constructing an \"on-the-fly\" UDF for one-off use." 
] }, { "cell_type": "code", "execution_count": null, "id": "7c558028-433d-4efa-8832-a033299bb17f", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "t.select(t.clip.apply(np.ndarray.dumps, col_type=pxt.String)).head(2)" ] }, { "cell_type": "markdown", "id": "d6cc7bc8-7b81-4ee3-96df-0fb11be8617c", "metadata": {}, "source": [ "Note, however, that if the function you're `apply`ing doesn't have type hints (as in the example here), you'll need to specify the output column type explicitly." ] }, { "cell_type": "markdown", "id": "1684f6e3-bf87-4a6f-9ae9-a16703678ce9", "metadata": {}, "source": [ "### Type Conversion with `astype`\n", "\n", "Sometimes it's useful to transform an expression of one type into a different type. For example, you can use `astype` to turn an expression of type `pxt.Json` into one of type `pxt.String`. This assumes that the value being converted is actually a string; otherwise, you'll get an exception. Here's an example:" ] }, { "cell_type": "code", "execution_count": 24, "id": "0ce91390-f1b0-4b63-ac3a-5c19952b2c01", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Optional[Json]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the text in position 0 of `t.classification.label_text`; since\n", "# `t.classification.label_text` has type `pxt.Json`, so does\n", "# `t.classification.label_text[0]`\n", "\n", "t.classification.label_text[0].col_type" ] }, { "cell_type": "code", "execution_count": 25, "id": "44df18c5-8211-4c5d-bb14-6f6408c0e1b5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Optional[String]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the text in position 0 of `t.classification.label_text`, this time\n", "# cast as a `pxt.String`\n", "\n", "t.classification.label_text[0].astype(pxt.String).col_type" ] }, { "cell_type": "markdown", "id": "cff7dc93-a553-4fe1-b664-0d82a08df372", 
"metadata": {}, "source": [ "### Column Properties\n", "\n", "Some `ColumnRef` expressions have additional useful properties. A media column (image, video, audio, or document) has the following two properties:\n", "\n", "- `localpath`: the media location on the local filesystem\n", "- `fileurl`: the original URL where the media resides (could be the same as `localpath`)" ] }, { "cell_type": "code", "execution_count": 26, "id": "1c85868e-d648-4133-9568-8151a122dff9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
imageimage_localpath
\n", " \n", "
/Users/asiegel/.pixeltable/media/99266a70fa5141bc9b5d1c9fddcc7a16/92/9276/99266a70fa5141bc9b5d1c9fddcc7a16_0_1_927616cf87bf45ea9a3ebabd1da9bc01.jpeg
\n", " \n", "
/Users/asiegel/.pixeltable/media/99266a70fa5141bc9b5d1c9fddcc7a16/d3/d360/99266a70fa5141bc9b5d1c9fddcc7a16_0_1_d360c94ae7624066a723e7e1ccd4cf37.jpeg
\n", " \n", "
/Users/asiegel/.pixeltable/media/99266a70fa5141bc9b5d1c9fddcc7a16/4f/4f38/99266a70fa5141bc9b5d1c9fddcc7a16_0_1_4f38ed14b91445b2b03f1d4772e7f2e7.jpeg
\n", " \n", "
/Users/asiegel/.pixeltable/media/99266a70fa5141bc9b5d1c9fddcc7a16/0c/0ca1/99266a70fa5141bc9b5d1c9fddcc7a16_0_1_0ca1cf9c60cc4c90abfd0d3d4e6ef4c3.jpeg
\n", " \n", "
/Users/asiegel/.pixeltable/media/99266a70fa5141bc9b5d1c9fddcc7a16/65/657d/99266a70fa5141bc9b5d1c9fddcc7a16_0_1_657dcf92103f409a9203817b2565d05a.jpeg
" ], "text/plain": [ " image \\\n", "0 225\u001b[0m item\u001b[38;5;241m.\u001b[39mrow[\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfn_call\u001b[38;5;241m.\u001b[39mslot_idx] \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mscalar_py_fn\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mitem\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mitem\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 226\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m exc:\n", "File \u001b[0;32m/opt/miniconda3/envs/pxt/lib/python3.10/site-packages/PIL/Image.py:2682\u001b[0m, in \u001b[0;36mImage.getchannel\u001b[0;34m(self, channel)\u001b[0m\n\u001b[1;32m 2680\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(msg) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01me\u001b[39;00m\n\u001b[0;32m-> 2682\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_new(\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mim\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mgetband\u001b[49m\u001b[43m(\u001b[49m\u001b[43mchannel\u001b[49m\u001b[43m)\u001b[49m)\n", "\u001b[0;31mValueError\u001b[0m: band index out of range", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[0;31mError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[27], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m 
\u001b[43mt\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43madd_computed_column\u001b[49m\u001b[43m(\u001b[49m\u001b[43mchannel\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mt\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mimage\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mgetchannel\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m1\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/catalog/table.py:697\u001b[0m, in \u001b[0;36mTable.add_computed_column\u001b[0;34m(self, stored, destination, print_stats, on_error, if_exists, **kwargs)\u001b[0m\n\u001b[1;32m 695\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_verify_column(new_col)\n\u001b[1;32m 696\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_tbl_version \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[0;32m--> 697\u001b[0m result \u001b[38;5;241m+\u001b[39m\u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_tbl_version\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43madd_columns\u001b[49m\u001b[43m(\u001b[49m\u001b[43m[\u001b[49m\u001b[43mnew_col\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mprint_stats\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mprint_stats\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mon_error\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mon_error\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 698\u001b[0m FileCache\u001b[38;5;241m.\u001b[39mget()\u001b[38;5;241m.\u001b[39memit_eviction_warnings()\n\u001b[1;32m 699\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m result\n", "File 
\u001b[0;32m~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/catalog/table_version.py:666\u001b[0m, in \u001b[0;36mTableVersion.add_columns\u001b[0;34m(self, cols, print_stats, on_error)\u001b[0m\n\u001b[1;32m 664\u001b[0m all_cols\u001b[38;5;241m.\u001b[39mappend(undo_col)\n\u001b[1;32m 665\u001b[0m \u001b[38;5;66;03m# Add all columns\u001b[39;00m\n\u001b[0;32m--> 666\u001b[0m status \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_add_columns\u001b[49m\u001b[43m(\u001b[49m\u001b[43mall_cols\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mprint_stats\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mprint_stats\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mon_error\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mon_error\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 667\u001b[0m \u001b[38;5;66;03m# Create indices and their md records\u001b[39;00m\n\u001b[1;32m 668\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m col, (idx, val_col, undo_col) \u001b[38;5;129;01min\u001b[39;00m index_cols\u001b[38;5;241m.\u001b[39mitems():\n", "File \u001b[0;32m~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/catalog/table_version.py:732\u001b[0m, in \u001b[0;36mTableVersion._add_columns\u001b[0;34m(self, cols, print_stats, on_error)\u001b[0m\n\u001b[1;32m 730\u001b[0m plan\u001b[38;5;241m.\u001b[39mopen()\n\u001b[1;32m 731\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 732\u001b[0m excs_per_col \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mstore_tbl\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mload_column\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcol\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mplan\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mon_error\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m==\u001b[39;49m\u001b[43m 
\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mabort\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m 733\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m sql_exc\u001b[38;5;241m.\u001b[39mDBAPIError \u001b[38;5;28;01mas\u001b[39;00m exc:\n\u001b[1;32m 734\u001b[0m Catalog\u001b[38;5;241m.\u001b[39mget()\u001b[38;5;241m.\u001b[39mconvert_sql_exc(exc, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mid, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandle, convert_db_excs\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m)\n", "File \u001b[0;32m~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/store.py:247\u001b[0m, in \u001b[0;36mStoreBase.load_column\u001b[0;34m(self, col, exec_plan, abort_on_exc)\u001b[0m\n\u001b[1;32m 245\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m abort_on_exc \u001b[38;5;129;01mand\u001b[39;00m row\u001b[38;5;241m.\u001b[39mhas_exc():\n\u001b[1;32m 246\u001b[0m exc \u001b[38;5;241m=\u001b[39m row\u001b[38;5;241m.\u001b[39mget_first_exc()\n\u001b[0;32m--> 247\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m excs\u001b[38;5;241m.\u001b[39mError(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mError while evaluating computed column \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mcol\u001b[38;5;241m.\u001b[39mname\u001b[38;5;132;01m!r}\u001b[39;00m\u001b[38;5;124m:\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;132;01m{\u001b[39;00mexc\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mexc\u001b[39;00m\n\u001b[1;32m 248\u001b[0m table_row, num_row_exc \u001b[38;5;241m=\u001b[39m row_builder\u001b[38;5;241m.\u001b[39mcreate_store_table_row(row, \u001b[38;5;28;01mNone\u001b[39;00m, row\u001b[38;5;241m.\u001b[39mpk)\n\u001b[1;32m 249\u001b[0m num_excs \u001b[38;5;241m+\u001b[39m\u001b[38;5;241m=\u001b[39m num_row_exc\n", "\u001b[0;31mError\u001b[0m: Error while evaluating computed column 
'channel':\nband index out of range" ] } ], "source": [ "t.add_computed_column(channel=t.image.getchannel(1))" ] }, { "cell_type": "markdown", "id": "96c325cb-aff9-42d4-9a37-91d0c1dc83dc", "metadata": {}, "source": [ "But if we use `on_error='ignore'`, the exception will be logged in the column properties instead." ] }, { "cell_type": "code", "execution_count": 28, "id": "245b5ac9-83b5-4a1e-866d-b5ecd0305e7c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 50 column values with 50 errors.\n" ] }, { "data": { "text/plain": [ "50 rows updated, 50 values computed, 50 exceptions." ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.add_computed_column(channel=t.image.getchannel(1), on_error='ignore')" ] }, { "cell_type": "markdown", "id": "548b5a1c-4dc0-4736-9f90-05d9b656b16b", "metadata": {}, "source": [ "Notice that the update status informs us that there were 50 errors. If we query the table, we see that the column contains only `None` values, but the `errortype` and `errormsg` fields contain details of the error." ] }, { "cell_type": "code", "execution_count": 29, "id": "b3b548fe-f8bd-438b-90fd-cab8d455b37d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
image                  channel  channel_errortype  channel_errormsg
(28x28 digit image)    None     ValueError         band index out of range
(28x28 digit image)    None     ValueError         band index out of range
(28x28 digit image)    None     ValueError         band index out of range
(28x28 digit image)    None     ValueError         band index out of range
(28x28 digit image)    None     ValueError         band index out of range
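The `ValueError` above makes sense: MNIST images are single-channel (`'L'` mode), so channel index 1 is out of range. The same error can be reproduced with PIL alone; the following sketch uses a blank 28x28 `'L'` image as a stand-in for an MNIST digit:

```python
from PIL import Image

img = Image.new('L', (28, 28))  # a 1-channel image, like an MNIST digit

try:
    img.getchannel(1)  # 'L' images have only channel 0
except ValueError as e:
    print(e)  # band index out of range

print(img.getchannel(0).mode)  # channel 0 exists, and is itself an 'L' image
```

This is why `t.image.getchannel(0)` would have succeeded where `getchannel(1)` failed on every row.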
" ], "text/plain": [ " image channel \\\n", "0 \n", "pxt.Audio, pxt.Video, and pxt.Document all correspond to the Python type str. This is because those types are represented by file paths that reference the media in question. When you query for, say, t.select(t.video_col), you're guaranteed to get a file path on the local filesystem (Pixeltable will download and cache a local copy of the video if necessary to ensure this). If you want the original URL, use t.video_col.fileurl instead.\n", "" ] }, { "cell_type": "markdown", "id": "ff20bf54-f4a0-435b-9874-13a4e8471dd3", "metadata": {}, "source": [ "Several types can be __specialized__ to constrain the allowable data in a column.\n", "\n", "- `pxt.Image` can be specialized with a resolution and/or an image mode:\n", " - `pxt.Image[(300,200)]` - images with width 300 and height 200\n", " - `pxt.Image['RGB']` - images with mode `'RGB'`; see the [PIL Documentation](https://pillow.readthedocs.io/en/stable/handbook/concepts.html) for the full list\n", " - `pxt.Image[(300,200), 'RGB']` - combines the above constraints\n", "- `pxt.Array` can be specialized with a shape and/or a dtype:\n", " - `pxt.Array[pxt.Float]` - arrays with dtype `pxt.Float`\n", " - `pxt.Array[(64,64,3), pxt.Float]` - 3-dimensional arrays with dtype `pxt.Float` and 64x64x3 shape\n", "\n", "If we look at the structure of our table now, we see examples of specialized image and array types." ] }, { "cell_type": "code", "execution_count": 30, "id": "0014d0c7-680c-4096-a599-50a788b1d840", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
table 'demo/mnist'

Column Name     Type                  Computed With
image           Image
label           String
classification  Json                  vit_for_image_classification(image, model_id='farleyknight-org-username/vit-base-mnist')
pred_label      Json                  classification.label_text[0]
clip            Array[(512,), Float]  clip(image, model_id='openai/clip-vit-base-patch32')
channel         Image['L']            image.getchannel(1)
\n" ], "text/plain": [ "table 'demo/mnist'\n", "\n", " Column Name Type Computed With\n", " image Image \n", " label String \n", " classification Json vit_for_image_classification(image, model_id='...\n", " pred_label Json classification.label_text[0]\n", " clip Array[(512,), Float] clip(image, model_id='openai/clip-vit-base-pat...\n", " channel Image['L'] image.getchannel(1)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t" ] }, { "cell_type": "markdown", "id": "f7a11c3b-7907-4964-b8bb-0c5ea3d52612", "metadata": {}, "source": [ "`t.clip` has type `pxt.Array[(512,), pxt.Float]`, since the output of the embedding is always a 1x512 array. `t.channel` has type `Image['L']`, since it's always an `'L'` mode (1-channel) image." ] }, { "cell_type": "markdown", "id": "d39b5ed0-5859-4992-9e64-cf0776537d88", "metadata": {}, "source": [ "
\n", "You can freely use pxt.Image by itself to mean \"any image, without constraints\", but numerical arrays must always specify a shape and a dtype; pxt.Array by itself will raise an error.\n", "
" ] }, { "cell_type": "markdown", "id": "57c4ba4f-b2f9-455c-b816-691db79a7eba", "metadata": {}, "source": [ "
\n", "Array shapes follow standard numpy conventions: a shape is a tuple of integers, such as (512,) or (64,64,3). A None may be used in place of an integer to indicate an unconstrained size for that dimension, as in (None,None,3) (3-dimensional array with two unconstrained dimensions), or simply (None,) (unconstrained 1-dimensional array).\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.19" } }, "nbformat": 4, "nbformat_minor": 5 }