{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "iu6P8Pks-Y72" }, "source": [ "# Tables and Data Operations\n", "\n", "This guide shows you how to:\n", "\n", "- Create and manage tables: Understand Pixeltable's table structure, create and modify tables, and work with table schemas\n", "- Manipulate data: Insert, update, and delete data within tables, and retrieve data from tables into Python variables\n", "- Filter and select data: Use `where()`, `select()`, and `order_by()` to query for specific rows and columns\n", "- Import data from CSV files and other file types\n", "\n", "First, let's ensure the Pixeltable library is installed in your environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GwjjornhycHU" }, "outputs": [], "source": [ "%pip install -qU pixeltable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tables\n", "\n", "All data in Pixeltable is stored in tables. At a high level, a Pixeltable table behaves similarly to an ordinary SQL database table, but with many additional capabilities to support complex AI workflows. We'll introduce those advanced capabilities gradually throughout this tutorial; in this section, the focus is on basic table and data operations.\n", "\n", "Tables in Pixeltable are grouped into __directories__, which are simply user-defined namespaces. The following command creates a new directory, `fundamentals`, which we'll use to store the tables in our tutorial." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EIBhrFB30cbF", "outputId": "9c30c33d-e16b-4aeb-cfc5-9a5770416d59" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata\n", "Created directory 'fundamentals'.\n" ] } ], "source": [ "import pixeltable as pxt\n", "\n", "# First we delete the `fundamentals` directory and all its contents (if\n", "# it exists), in order to ensure a clean environment for the tutorial.\n", "pxt.drop_dir('fundamentals', force=True)\n", "\n", "# Now we create the directory.\n", "pxt.create_dir('fundamentals')" ] }, { "cell_type": "markdown", "metadata": { "id": "7StQcO-i-7NF" }, "source": [ "Now let's create our first table. To create a table, we must give it a name and a __schema__ that describes the table structure. Note that prefacing the name with `fundamentals` causes it to be placed in our newly-created directory." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created table 'films'.\n" ] } ], "source": [ "films_t = pxt.create_table('fundamentals.films', {\n", " 'film_name': pxt.String,\n", " 'year': pxt.Int,\n", " 'revenue': pxt.Float\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To insert data into a table, we use the `insert()` method, passing it a list of Python dicts." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserting rows into `films`: 3 rows [00:00, 572.84 rows/s]\n", "Inserted 3 rows with 0 errors.\n" ] }, { "data": { "text/plain": [ "3 rows inserted, 3 values computed." ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "films_t.insert([\n", " {'film_name': 'Jurassic Park', 'year': 1993, 'revenue': 1037.5},\n", " {'film_name': 'Titanic', 'year': 1997, 'revenue': 2257.8},\n", " {'film_name': 'Avengers: Endgame', 'year': 2019, 'revenue': 2797.5}\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you're inserting just a single row, you can use an alternate syntax that is sometimes more convenient." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserting rows into `films`: 1 rows [00:00, 318.76 rows/s]\n", "Inserted 1 row with 0 errors.\n" ] }, { "data": { "text/plain": [ "1 row inserted, 1 value computed." ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "films_t.insert([{'film_name': 'Inside Out 2', 'year': 2024, 'revenue': 1462.7}])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can peek at the data in our table with the `collect()` method, which retrieves all the rows in the table." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
film_nameyearrevenue
Jurassic Park19931037.5
Titanic19972257.8
Avengers: Endgame20192797.5
Inside Out 220241462.7
" ], "text/plain": [ " film_name year revenue\n", "0 Jurassic Park 1993 1037.5\n", "1 Titanic 1997 2257.8\n", "2 Avengers: Endgame 2019 2797.5\n", "3 Inside Out 2 2024 1462.7" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "films_t.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pixeltable also provides `update()` and `delete()` methods for modifying and removing data from a table; we'll see examples of them shortly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering and Selecting Data\n", "\n", "Often you want to select only certain rows and/or certain columns in a table. You can do this with the `where()` and `select()` methods." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
film_nameyearrevenue
Titanic19972257.8
Avengers: Endgame20192797.5
" ], "text/plain": [ " film_name year revenue\n", "0 Titanic 1997 2257.8\n", "1 Avengers: Endgame 2019 2797.5" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "films_t.where(films_t.revenue >= 2000.0).collect()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
film_nameyear
Jurassic Park1993
Titanic1997
Avengers: Endgame2019
Inside Out 22024
" ], "text/plain": [ " film_name year\n", "0 Jurassic Park 1993\n", "1 Titanic 1997\n", "2 Avengers: Endgame 2019\n", "3 Inside Out 2 2024" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "films_t.select(films_t.film_name, films_t.year).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the expressions that appear inside the calls to `where()` and `select()`, such as `films_t.year`. These are __column references__ that point to specific columns within a table. In place of `films_t.year`, you can also use dictionary syntax and type `films_t['year']`, which means exactly the same thing but is sometimes more convenient." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
film_nameyear
Jurassic Park1993
Titanic1997
Avengers: Endgame2019
Inside Out 22024
" ], "text/plain": [ " film_name year\n", "0 Jurassic Park 1993\n", "1 Titanic 1997\n", "2 Avengers: Endgame 2019\n", "3 Inside Out 2 2024" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "films_t.select(films_t['film_name'], films_t['year']).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to selecting columns directly, you can use column references inside various kinds of expressions. For example, our `revenue` numbers are given in millions of dollars. Let's say we wanted to select revenue in thousands of dollars instead; we could do that as follows:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
film_namecol_1
Jurassic Park1037500.
Titanic2257800.
Avengers: Endgame2797500.
Inside Out 21462700.
" ], "text/plain": [ " film_name col_1\n", "0 Jurassic Park 1037500.0\n", "1 Titanic 2257800.0\n", "2 Avengers: Endgame 2797500.0\n", "3 Inside Out 2 1462700.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "films_t.select(films_t.film_name, films_t.revenue * 1000).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that since we selected an abstract expression rather than a specific column, Pixeltable gave it the generic name `col_1`. You can assign it a more informative name with Python keyword syntax:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
film_namerevenue_thousands
Jurassic Park1037500.
Titanic2257800.
Avengers: Endgame2797500.
Inside Out 21462700.
" ], "text/plain": [ " film_name revenue_thousands\n", "0 Jurassic Park 1037500.0\n", "1 Titanic 2257800.0\n", "2 Avengers: Endgame 2797500.0\n", "3 Inside Out 2 1462700.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "films_t.select(films_t.film_name, revenue_thousands=films_t.revenue * 1000).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tables are Persistent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a good time to mention a few key differences between Pixeltable tables and other familiar datastructures, such as Python dicts or Pandas dataframes.\n", "\n", "First, **Pixeltable is persistent. Unlike in-memory Python libraries such as Pandas, Pixeltable is a database**. When you reset a notebook kernel or start a new Python session, you'll have access to all the data you've stored previously in Pixeltable. Let's demonstrate this by using the IPython `%reset -f` command to clear out all our notebook variables, so that `films_t` is no longer defined." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "" }, "tags": [ "raises-exception" ] }, "outputs": [ { "ename": "NameError", "evalue": "name 'films_t' is not defined", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[11], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m get_ipython()\u001b[38;5;241m.\u001b[39mrun_line_magic(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mreset\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m-f\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[0;32m----> 2\u001b[0m \u001b[43mfilms_t\u001b[49m\u001b[38;5;241m.\u001b[39mcollect() \u001b[38;5;66;03m# Throws an exception now\u001b[39;00m\n", "\u001b[0;31mNameError\u001b[0m: name 'films_t' is not defined" ] } ], "source": [ "%reset -f\n", "films_t.collect() # Throws an exception now" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `films_t` variable (along with all other variables in our Python session) has been cleared out - but that's ok, because it wasn't the source of record for our data. The `films_t` variable is just a reference to the underlying database table. We can recover it with the `get_table` command, referencing the `films` table by name." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
film_nameyearrevenue
Jurassic Park19931037.5
Titanic19972257.8
Avengers: Endgame20192797.5
Inside Out 220241462.7
" ], "text/plain": [ " film_name year revenue\n", "0 Jurassic Park 1993 1037.5\n", "1 Titanic 1997 2257.8\n", "2 Avengers: Endgame 2019 2797.5\n", "3 Inside Out 2 2024 1462.7" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pixeltable as pxt\n", "\n", "films_t = pxt.get_table('fundamentals.films')\n", "films_t.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can always get a list of existing tables with the Pixeltable `pxt.ls()` command. Let's use it to see the contents of the `fundamentals` directory." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameKindVersionBase
filmstable2
\n", "
" ], "text/plain": [ " Name Kind Version Base\n", " films table 2 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pxt.ls(path='fundamentals')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Note that if you're running Pixeltable on colab or kaggle, the database will persist only for as long as your colab/kaggle session remains active. If you're running it locally or on your own server, then your database will persist indefinitely (until you actively delete it).\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tables are Typed\n", "\n", "The second major difference is that **Pixeltable is strongly typed**. Because Pixeltable is a database, every column has a data type: that's why we specified `String`, `Int`, and `Float` for the three columns when we created the table. These __type specifiers__ are _mandatory_ when creating tables, and they become part of the table schema. You can always see the table schema with the `describe()` method." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
table 'fundamentals.films'
\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column NameTypeComputed With
film_nameString
yearInt
revenueFloat
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "films_t.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a notebook, you can also just type `films_t` to see the schema; its output is identical to `films_t.describe()`." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
table 'fundamentals.films'
\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column NameTypeComputed With
film_nameString
yearInt
revenueFloat
\n" ], "text/plain": [ "table 'fundamentals.films'\n", "\n", " Column Name Type Computed With\n", " film_name String \n", " year Int \n", " revenue Float " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "films_t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "In addition to String, Int, and Float, Pixeltable provides several additional data types:\n", "\n", "We'll see examples of each of these types later in this guide.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides the column names and types, there's a third element to the schema, `Computed With`. To learn more about this, see the [Computed Columns](https://docs.pixeltable.com/tutorials/computed-columns) guide." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the methods we've discussed so far, such as `insert()` and `get_table()`, are documented in the [Pixeltable SDK](https://docs.pixeltable.com/sdk/latest/) Documentation. The following pages are particularly relevant:\n", "\n", "- [pixeltable](https://docs.pixeltable.com/sdk/latest/pixeltable) package reference\n", "- [pxt.Table](https://docs.pixeltable.com/sdk/latest/table) class reference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A Real-World Example: Earthquake Data\n", "\n", "Now let's dive a little deeper into Pixeltable's data operations. To showcase all the features, it'll be helpful to have a real-world dataset, rather than our toy dataset with four movies. The dataset we'll be using consists of Earthquake data drawn from the US Geological Survey: all recorded Earthquakes that occurred within 100 km of Seattle, Washington, between January 1, 2023 and June 30, 2024.\n", "\n", "The dataset is in CSV format, and we can load it into Pixeltable by using `create_table()` with the `source` parameter, which creates a new Pixeltable table from the contents of a CSV file." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created table 'earthquakes'.\n", "Inserting rows into `earthquakes`: 1823 rows [00:00, 19554.24 rows/s]\n", "Inserted 1823 rows with 0 errors.\n" ] } ], "source": [ "eq_t = pxt.create_table(\n", " 'fundamentals.earthquakes', # Name for the new table\n", " source='https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/earthquakes.csv',\n", " primary_key='id', # Column 'id' is the primary key\n", " schema_overrides={'timestamp': pxt.Timestamp} # Interpret column 3 as a timestamp\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "In Pixeltable, you can always import external data by giving a URL instead of a local file path. This applies to CSV datasets, media files (such images and video), and other types of content. The URL will often be an http:// URL, but it can also be an s3:// URL referencing an S3 bucket.\n", "
\n", "\n", "
\n", "Pixeltable's create_table() function with the source parameter can import data from various formats including CSV, Excel, and Hugging Face datasets. You can also use source to import from a Pandas dataframe. For more details, see the pixeltable.io package reference.\n", "
\n", "\n", "Let's have a peek at our new dataset. The dataset contains 1823 rows, and we probably don't want to display them all at once. We can limit our query to fewer rows with the `limit()` method." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idmagnitudelocationtimestamplongitudelatitude
01.1510 km NW of Belfair, Washington2023-01-01 08:10:37.050000-08:00-122.9347.51
10.2923 km ENE of Ashford, Washington2023-01-02 01:02:43.950000-08:00-121.7646.85
20.223 km ENE of Ashford, Washington2023-01-02 12:05:01.420000-08:00-121.7546.86
30.5215 km NNE of Ashford, Washington2023-01-02 12:45:14.220000-08:00-121.9546.89
41.560 km WSW of Esperance, Washington2023-01-02 13:19:27.200000-08:00-122.3647.79
" ], "text/plain": [ " id magnitude location \\\n", "0 0 1.15 10 km NW of Belfair, Washington \n", "1 1 0.29 23 km ENE of Ashford, Washington \n", "2 2 0.20 23 km ENE of Ashford, Washington \n", "3 3 0.52 15 km NNE of Ashford, Washington \n", "4 4 1.56 0 km WSW of Esperance, Washington \n", "\n", " timestamp longitude latitude \n", "0 2023-01-01 08:10:37.050000-08:00 -122.93 47.51 \n", "1 2023-01-02 01:02:43.950000-08:00 -121.76 46.85 \n", "2 2023-01-02 12:05:01.420000-08:00 -121.75 46.86 \n", "3 2023-01-02 12:45:14.220000-08:00 -121.95 46.89 \n", "4 2023-01-02 13:19:27.200000-08:00 -122.36 47.79 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.limit(5).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A different way of achieving something similar is to use the `head()` and `tail()` methods. Pixeltable keeps track of the insertion order of all its data, and `head()` and `tail()` will always return the _earliest inserted_ and _most recently inserted_ rows in a table, respectively." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idmagnitudelocationtimestamplongitudelatitude
01.1510 km NW of Belfair, Washington2023-01-01 08:10:37.050000-08:00-122.9347.51
10.2923 km ENE of Ashford, Washington2023-01-02 01:02:43.950000-08:00-121.7646.85
20.223 km ENE of Ashford, Washington2023-01-02 12:05:01.420000-08:00-121.7546.86
30.5215 km NNE of Ashford, Washington2023-01-02 12:45:14.220000-08:00-121.9546.89
41.560 km WSW of Esperance, Washington2023-01-02 13:19:27.200000-08:00-122.3647.79
" ], "text/plain": [ " id magnitude location \\\n", "0 0 1.15 10 km NW of Belfair, Washington \n", "1 1 0.29 23 km ENE of Ashford, Washington \n", "2 2 0.20 23 km ENE of Ashford, Washington \n", "3 3 0.52 15 km NNE of Ashford, Washington \n", "4 4 1.56 0 km WSW of Esperance, Washington \n", "\n", " timestamp longitude latitude \n", "0 2023-01-01 08:10:37.050000-08:00 -122.93 47.51 \n", "1 2023-01-02 01:02:43.950000-08:00 -121.76 46.85 \n", "2 2023-01-02 12:05:01.420000-08:00 -121.75 46.86 \n", "3 2023-01-02 12:45:14.220000-08:00 -121.95 46.89 \n", "4 2023-01-02 13:19:27.200000-08:00 -122.36 47.79 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.head(5)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idmagnitudelocationtimestamplongitudelatitude
18181.714 km W of Skokomish, Washington2024-06-29 08:55:50.030000-07:00-123.3547.32
18191.067 km E of Lake McMurray, Washington2024-06-29 12:15:19.130000-07:00-122.1348.31
18200.484 km E of Duvall, Washington2024-06-30 09:15:43.020000-07:00-121.9347.75
18210.4612 km NE of Ashford, Washington2024-06-30 10:05:15.410000-07:00-121.9346.84
18220.726 km ENE of Oso, Washington2024-06-30 11:12:41.900000-07:00-121.8448.28
" ], "text/plain": [ " id magnitude location \\\n", "0 1818 1.70 14 km W of Skokomish, Washington \n", "1 1819 1.06 7 km E of Lake McMurray, Washington \n", "2 1820 0.48 4 km E of Duvall, Washington \n", "3 1821 0.46 12 km NE of Ashford, Washington \n", "4 1822 0.72 6 km ENE of Oso, Washington \n", "\n", " timestamp longitude latitude \n", "0 2024-06-29 08:55:50.030000-07:00 -123.35 47.32 \n", "1 2024-06-29 12:15:19.130000-07:00 -122.13 48.31 \n", "2 2024-06-30 09:15:43.020000-07:00 -121.93 47.75 \n", "3 2024-06-30 10:05:15.410000-07:00 -121.93 46.84 \n", "4 2024-06-30 11:12:41.900000-07:00 -121.84 48.28 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.tail(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "head(n) and limit(n).collect() appear similar in this example. But head() always returns the earliest rows in a table, whereas limit() makes no promises about the ordering of its results (unless you specify an order_by() clause - more on this below).\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's also peek at the schema:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
table 'fundamentals.earthquakes'
\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column NameTypeComputed With
idRequired[Int]
magnitudeFloat
locationString
timestampTimestamp
longitudeFloat
latitudeFloat
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "eq_t.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that while specifying a schema is mandatory when _creating_ a table, it's not always required when _importing_ data. This is because Pixeltable uses the structure of the imported data to infer the column types, when feasible. You can always override the inferred column types with the `schema_overrides` parameter of `import_csv()`.\n", "\n", "The following examples showcase some common data operations." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1823" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.count() # Number of rows in the table" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idmagnitudelocationtimestamplongitudelatitude
10024.3Port Townsend, WA2023-10-09 02:21:08.960000-07:00-122.7348.04
12264.046 km W of Quilcene, Washington2023-12-24 15:14:04.220000-08:00-122.9647.82
6993.919 km NNE of Snoqualmie, Washington2023-08-08 10:17:23.910000-07:00-121.7747.6
12813.487 km SSW of River Road, Washington2024-01-15 07:25:05.920000-08:00-123.1748.
13553.4217 km WSW of Brinnon, Washington2024-02-16 16:30:18.830000-08:00-123.0947.59
" ], "text/plain": [ " id magnitude location \\\n", "0 1002 4.30 Port Townsend, WA \n", "1 1226 4.04 6 km W of Quilcene, Washington \n", "2 699 3.91 9 km NNE of Snoqualmie, Washington \n", "3 1281 3.48 7 km SSW of River Road, Washington \n", "4 1355 3.42 17 km WSW of Brinnon, Washington \n", "\n", " timestamp longitude latitude \n", "0 2023-10-09 02:21:08.960000-07:00 -122.73 48.04 \n", "1 2023-12-24 15:14:04.220000-08:00 -122.96 47.82 \n", "2 2023-08-08 10:17:23.910000-07:00 -121.77 47.60 \n", "3 2024-01-15 07:25:05.920000-08:00 -123.17 48.00 \n", "4 2024-02-16 16:30:18.830000-08:00 -123.09 47.59 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 5 highest-magnitude earthquakes\n", "\n", "eq_t.order_by(eq_t.magnitude, asc=False).limit(5).collect()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idmagnitudelocationtimestamplongitudelatitude
6993.919 km NNE of Snoqualmie, Washington2023-08-08 10:17:23.910000-07:00-121.7747.6
7992.865 km E of Ashford, Washington2023-08-27 10:10:23.770000-07:00-121.9646.77
7102.848 km ENE of Fall City, Washington2023-08-08 11:51:12.750000-07:00-121.7947.6
5772.790 km NE of Maple Valley, Washington2023-07-04 15:52:54.430000-07:00-122.0447.4
7692.7316 km NE of Ashford, Washington2023-08-22 23:44:12.250000-07:00-121.8846.87
" ], "text/plain": [ " id magnitude location \\\n", "0 699 3.91 9 km NNE of Snoqualmie, Washington \n", "1 799 2.86 5 km E of Ashford, Washington \n", "2 710 2.84 8 km ENE of Fall City, Washington \n", "3 577 2.79 0 km NE of Maple Valley, Washington \n", "4 769 2.73 16 km NE of Ashford, Washington \n", "\n", " timestamp longitude latitude \n", "0 2023-08-08 10:17:23.910000-07:00 -121.77 47.60 \n", "1 2023-08-27 10:10:23.770000-07:00 -121.96 46.77 \n", "2 2023-08-08 11:51:12.750000-07:00 -121.79 47.60 \n", "3 2023-07-04 15:52:54.430000-07:00 -122.04 47.40 \n", "4 2023-08-22 23:44:12.250000-07:00 -121.88 46.87 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from datetime import datetime\n", "\n", "# 5 highest-magnitude earthquakes in Q3 2023\n", "\n", "eq_t.where((eq_t.timestamp >= datetime(2023, 6, 1)) & (eq_t.timestamp < datetime(2023, 10, 1))) \\\n", " .order_by(eq_t.magnitude, asc=False).limit(5).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that Pixeltable uses Pandas-like operators for filtering data: the expression\n", "\n", "```python\n", "(eq_t.timestamp >= datetime(2023, 6, 1)) & (eq_t.timestamp < datetime(2023, 10, 1))\n", "```\n", "\n", "means _both_ conditions must be true; similarly (say),\n", "\n", "```python\n", "(eq_t.timestamp < datetime(2023, 6, 1)) | (eq_t.timestamp >= datetime(2023, 10, 1))\n", "```\n", "\n", "would mean _either_ condition must be true.\n", "\n", "You can also use the special `isin` operator to select just those values that appear within a particular list:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idmagnitudelocationtimestamplongitudelatitude
1231.237 km SW of Rainier, Washington2023-02-17 00:28:25.460000-08:00-122.7546.84
4560.23Washington2023-05-23 08:49:02.450000-07:00-121.9846.87
7891.67Puget Sound region, Washington2023-08-26 04:04:11.200000-07:00-122.5747.6
" ], "text/plain": [ " id magnitude location \\\n", "0 123 1.23 7 km SW of Rainier, Washington \n", "1 456 0.23 Washington \n", "2 789 1.67 Puget Sound region, Washington \n", "\n", " timestamp longitude latitude \n", "0 2023-02-17 00:28:25.460000-08:00 -122.75 46.84 \n", "1 2023-05-23 08:49:02.450000-07:00 -121.98 46.87 \n", "2 2023-08-26 04:04:11.200000-07:00 -122.57 47.60 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Earthquakes with specific ids\n", "\n", "eq_t.where(eq_t.id.isin([123,456,789])).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to basic operators like `>=` and `isin`, a Pixeltable `where` clause can also contain more complex operations. For example, the `location` column in our dataset is a string that contains a lot of information, but in a relatively unstructured way. Suppose we wanted to see all Earthquakes in the vicinity of Rainier, Washington; one way to do this is with the `contains()` method:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idmagnitudelocationtimestamplongitudelatitude
401.2211 km SSE of Rainier, Washington2023-01-19 21:52:29.910000-08:00-122.6546.79
851.4510 km SSE of Rainier, Washington2023-02-02 20:08:27.810000-08:00-122.6546.79
1231.237 km SW of Rainier, Washington2023-02-17 00:28:25.460000-08:00-122.7546.84
4671.0910 km SSE of Rainier, Washington2023-05-26 19:39:44.120000-07:00-122.6546.8
13991.085 km SW of Rainier, Washington2024-03-04 22:34:25.210000-08:00-122.7446.85
17091.1610 km S of Rainier, Washington2024-05-22 18:28:38.130000-07:00-122.6846.79
17761.1712 km S of Rainier, Washington2024-06-17 18:25:33.400000-07:00-122.6646.77
" ], "text/plain": [ " id magnitude location \\\n", "0 40 1.22 11 km SSE of Rainier, Washington \n", "1 85 1.45 10 km SSE of Rainier, Washington \n", "2 123 1.23 7 km SW of Rainier, Washington \n", "3 467 1.09 10 km SSE of Rainier, Washington \n", "4 1399 1.08 5 km SW of Rainier, Washington \n", "5 1709 1.16 10 km S of Rainier, Washington \n", "6 1776 1.17 12 km S of Rainier, Washington \n", "\n", " timestamp longitude latitude \n", "0 2023-01-19 21:52:29.910000-08:00 -122.65 46.79 \n", "1 2023-02-02 20:08:27.810000-08:00 -122.65 46.79 \n", "2 2023-02-17 00:28:25.460000-08:00 -122.75 46.84 \n", "3 2023-05-26 19:39:44.120000-07:00 -122.65 46.80 \n", "4 2024-03-04 22:34:25.210000-08:00 -122.74 46.85 \n", "5 2024-05-22 18:28:38.130000-07:00 -122.68 46.79 \n", "6 2024-06-17 18:25:33.400000-07:00 -122.66 46.77 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# All earthquakes in the vicinity of Rainier\n", "\n", "eq_t.where(eq_t.location.contains('Rainier')).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pixeltable also supports various __aggregators__; here's an example showcasing two fairly simple ones, `max()` and `min()`:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
minmax
01822
" ], "text/plain": [ " min max\n", "0 0 1822" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Min and max ids\n", "\n", "eq_t.select(min=pxt.functions.min(eq_t.id), max=pxt.functions.max(eq_t.id)).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To learn more about Pixeltable functions and expressions, see the [Computed Columns](https://docs.pixeltable.com/tutorials/computed-columns) guide. They're also exhaustively documented in the [Pixeltable SDK Documentation](https://docs.pixeltable.com/sdk/latest)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting Data from Tables into Python/Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes it's handy to pull out data from a table into a Python object. We've actually already done this; the call to `collect()` returns an in-memory result set, which we can then dereference in various ways. For example:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'id': 0,\n", " 'magnitude': 1.15,\n", " 'location': '10 km NW of Belfair, Washington',\n", " 'timestamp': datetime.datetime(2023, 1, 1, 8, 10, 37, 50000, tzinfo=zoneinfo.ZoneInfo(key='America/Los_Angeles')),\n", " 'longitude': -122.93,\n", " 'latitude': 47.51}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = eq_t.limit(5).collect()\n", "result[0] # Get the first row of the results as a dict" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[datetime.datetime(2023, 1, 1, 8, 10, 37, 50000, tzinfo=zoneinfo.ZoneInfo(key='America/Los_Angeles')),\n", " datetime.datetime(2023, 1, 2, 1, 2, 43, 950000, tzinfo=zoneinfo.ZoneInfo(key='America/Los_Angeles')),\n", " datetime.datetime(2023, 1, 2, 12, 5, 1, 420000, tzinfo=zoneinfo.ZoneInfo(key='America/Los_Angeles')),\n", " datetime.datetime(2023, 1, 2, 12, 45, 14, 220000, tzinfo=zoneinfo.ZoneInfo(key='America/Los_Angeles')),\n", " datetime.datetime(2023, 1, 2, 13, 19, 27, 200000, tzinfo=zoneinfo.ZoneInfo(key='America/Los_Angeles'))]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result['timestamp'] # Get a list of the `timestamp` field of all the rows that were queried" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 5.000000\n", "mean 0.744000\n", "std 0.587988\n", "min 0.200000\n", "25% 0.290000\n", "50% 0.520000\n", "75% 1.150000\n", "max 1.560000\n", "Name: magnitude, dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = result.to_pandas() # Convert the result set into a Pandas dataframe\n", "df['magnitude'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`collect()` without a preceding `limit()` returns the entire contents of a query or table. Be careful! For very large tables, this could result in out-of-memory errors. In this example, the 1823 rows in the table fit comfortably into a dataframe." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 1823.000000\n", "mean 0.900378\n", "std 0.625492\n", "min -0.830000\n", "25% 0.420000\n", "50% 0.850000\n", "75% 1.310000\n", "max 4.300000\n", "Name: magnitude, dtype: float64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = eq_t.collect().to_pandas()\n", "df['magnitude'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like other database tables, Pixeltable tables aren't fixed entities: they're meant to evolve over time. Suppose we want to add a new column to hold user-specified comments about particular earthquake events. We can do this with the `add_column()` method:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 1823 column values with 0 errors.\n" ] }, { "data": { "text/plain": [ "1823 rows updated, 1823 values computed." ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.add_column(note=pxt.String)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, `note` is the column name, and `pxt.String` specifies the type of the new column." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 1823 column values with 0 errors.\n" ] }, { "data": { "text/plain": [ "1823 rows updated, 1823 values computed." ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.add_column(contact_email=pxt.String)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the revised schema." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
table 'fundamentals.earthquakes'
\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column NameTypeComputed With
idRequired[Int]
magnitudeFloat
locationString
timestampTimestamp
longitudeFloat
latitudeFloat
noteString
contact_emailString
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "eq_t.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Updating Rows in a Table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Table rows can be modified and deleted with the SQL-like `update()` and `delete()` commands." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserting rows into `earthquakes`: 2 rows [00:00, 366.84 rows/s]\n" ] }, { "data": { "text/plain": [ "2 rows updated, 4 values computed." ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add a comment to records with IDs 123 and 127\n", "\n", "(\n", " eq_t\n", " .where(eq_t.id.isin([121,123]))\n", " .update({'note': 'Still investigating.', 'contact_email': 'contact@pixeltable.com'})\n", ")" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idmagnitudenotecontact_email
1201.17NoneNone
1211.87Still investigating.contact@pixeltable.com
1220.34NoneNone
1231.23Still investigating.contact@pixeltable.com
1240.13NoneNone
" ], "text/plain": [ " id magnitude note contact_email\n", "0 120 1.17 None None\n", "1 121 1.87 Still investigating. contact@pixeltable.com\n", "2 122 0.34 None None\n", "3 123 1.23 Still investigating. contact@pixeltable.com\n", "4 124 0.13 None None" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.where(eq_t.id >= 120).select(eq_t.id, eq_t.magnitude, eq_t.note, eq_t.contact_email).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`update()` can also accept an expression, rather than a constant value. For example, suppose we wanted to shorten the location strings by replacing every occurrence of `Washington` with `WA`. One way to do this is with an `update()` clause, using a Pixeltable expression with the `replace()` method." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserting rows into `earthquakes`: 1823 rows [00:00, 21494.07 rows/s]\n" ] }, { "data": { "text/plain": [ "1823 rows updated, 1823 values computed." ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.update({'location': eq_t.location.replace('Washington', 'WA')})" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idmagnitudelocationtimestamplongitudelatitudenotecontact_email
01.1510 km NW of Belfair, WA2023-01-01 08:10:37.050000-08:00-122.9347.51NoneNone
10.2923 km ENE of Ashford, WA2023-01-02 01:02:43.950000-08:00-121.7646.85NoneNone
20.223 km ENE of Ashford, WA2023-01-02 12:05:01.420000-08:00-121.7546.86NoneNone
30.5215 km NNE of Ashford, WA2023-01-02 12:45:14.220000-08:00-121.9546.89NoneNone
41.560 km WSW of Esperance, WA2023-01-02 13:19:27.200000-08:00-122.3647.79NoneNone
" ], "text/plain": [ " id magnitude location timestamp \\\n", "0 0 1.15 10 km NW of Belfair, WA 2023-01-01 08:10:37.050000-08:00 \n", "1 1 0.29 23 km ENE of Ashford, WA 2023-01-02 01:02:43.950000-08:00 \n", "2 2 0.20 23 km ENE of Ashford, WA 2023-01-02 12:05:01.420000-08:00 \n", "3 3 0.52 15 km NNE of Ashford, WA 2023-01-02 12:45:14.220000-08:00 \n", "4 4 1.56 0 km WSW of Esperance, WA 2023-01-02 13:19:27.200000-08:00 \n", "\n", " longitude latitude note contact_email \n", "0 -122.93 47.51 None None \n", "1 -121.76 46.85 None None \n", "2 -121.75 46.86 None None \n", "3 -121.95 46.89 None None \n", "4 -122.36 47.79 None None " ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that in all cases, the `update()` clause takes a Python dictionary, but its values can be either constants such as `'contact@pixeltable.com'`, or more complex expressions such as `eq_t.location.replace('Washington', 'WA')`. Also notice that if `update()` appears without a `where()` clause, then every row in the table will be updated, as in the preceding example.\n", "\n", "### Batch Updates\n", "\n", "The `batch_update()` method provides an alternative way to update multiple rows with different values. With a `batch_update()`, the contents of each row are specified by individual `dict`s, rather than according to a formula. Here's a toy example that shows `batch_update()` in action." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserting rows into `earthquakes`: 3 rows [00:00, 984.58 rows/s]\n" ] }, { "data": { "text/plain": [ "3 rows updated, 3 values computed." ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "updates = [\n", " {'id': 500, 'note': 'This is an example note.'},\n", " {'id': 501, 'note': 'This is a different note.'},\n", " {'id': 502, 'note': 'A third note, unrelated to the others.'}\n", "]\n", "eq_t.batch_update(updates)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idmagnitudenotecontact_email
5000.75This is an example note.None
5010.23This is a different note.None
5020.43A third note, unrelated to the others.None
5030.31NoneNone
5040.35NoneNone
" ], "text/plain": [ " id magnitude note contact_email\n", "0 500 0.75 This is an example note. None\n", "1 501 0.23 This is a different note. None\n", "2 502 0.43 A third note, unrelated to the others. None\n", "3 503 0.31 None None\n", "4 504 0.35 None None" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.where(eq_t.id >= 500).select(eq_t.id, eq_t.magnitude, eq_t.note, eq_t.contact_email).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting Rows\n", "\n", "To delete rows from a table, use the `delete()` method." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "587 rows deleted." ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Delete all rows in 2024\n", "\n", "eq_t.where(eq_t.timestamp >= datetime(2024, 1, 1)).delete()" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1236" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.count() # How many are left after deleting?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don't forget to specify a `where()` clause when using `delete()`! If you run `delete()` without a `where()` clause, the entire contents of the table will be deleted." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1236 rows deleted." ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.delete()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Table Versioning\n", "\n", "Every table in Pixeltable is versioned: some or all of its modification history is preserved. We've seen a reference to this already; `pxt.ls()` will show the most recent version along with each table it lists." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameKindVersionBase
earthquakestable8
filmstable2
\n", "
" ], "text/plain": [ " Name Kind Version Base\n", " earthquakes table 8 \n", " films table 2 " ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pxt.ls('fundamentals')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see the version history of a particular table:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
versioncreated_atuserchange_typeinsertsupdatesdeleteserrorscomputedschema_change
082025-10-10 18:18:17.357601+00:00Nonedata00123600None
172025-10-10 18:18:16.323425+00:00Nonedata0058700None
262025-10-10 18:18:15.740511+00:00Nonedata03003None
352025-10-10 18:18:15.200479+00:00Nonedata01823001823None
442025-10-10 18:18:14.646669+00:00Nonedata02004None
532025-10-10 18:18:13.675782+00:00Noneschema01823001823Added: contact_email
622025-10-10 18:18:13.336046+00:00Noneschema01823001823Added: note
712025-10-10 18:18:08.559563+00:00Nonedata18230001823None
802025-10-10 18:18:08.511119+00:00Noneschema00000Initial Version
\n", "
" ], "text/plain": [ " version created_at user change_type inserts \\\n", "0 8 2025-10-10 18:18:17.357601+00:00 None data 0 \n", "1 7 2025-10-10 18:18:16.323425+00:00 None data 0 \n", "2 6 2025-10-10 18:18:15.740511+00:00 None data 0 \n", "3 5 2025-10-10 18:18:15.200479+00:00 None data 0 \n", "4 4 2025-10-10 18:18:14.646669+00:00 None data 0 \n", "5 3 2025-10-10 18:18:13.675782+00:00 None schema 0 \n", "6 2 2025-10-10 18:18:13.336046+00:00 None schema 0 \n", "7 1 2025-10-10 18:18:08.559563+00:00 None data 1823 \n", "8 0 2025-10-10 18:18:08.511119+00:00 None schema 0 \n", "\n", " updates deletes errors computed schema_change \n", "0 0 1236 0 0 None \n", "1 0 587 0 0 None \n", "2 3 0 0 3 None \n", "3 1823 0 0 1823 None \n", "4 2 0 0 4 None \n", "5 1823 0 0 1823 Added: contact_email \n", "6 1823 0 0 1823 Added: note \n", "7 0 0 0 1823 None \n", "8 0 0 0 0 Initial Version " ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.history()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you ever make a mistake, you can always call `revert()` to undo the most recent change to a table and roll back to the previous version. Let's try it out: we'll use it to revert the successive `delete()` calls that we just executed." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "eq_t.revert()" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1236" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.count()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "eq_t.revert()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1823" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eq_t.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Be aware: calling revert() cannot be undone!\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multimodal Data\n", "\n", "In addition to the structured data we've been exploring so far, Pixeltable has native support for __media types__: images, video, audio, and unstructured documents such as pdfs. Media support is one of Pixeltable's core capabilities. Here's an example showing how media data lives side-by-side with structured data in Pixeltable." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 1823 column values with 0 errors.\n" ] }, { "data": { "text/markdown": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
table 'fundamentals.earthquakes'
\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column NameTypeComputed With
idRequired[Int]
magnitudeFloat
locationString
timestampTimestamp
longitudeFloat
latitudeFloat
noteString
contact_emailString
map_imageImage
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Add a new column of type `Image`\n", "eq_t.add_column(map_image=pxt.Image)\n", "eq_t.describe()" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserting rows into `earthquakes`: 1 rows [00:00, 192.79 rows/s]\n" ] }, { "data": { "text/plain": [ "1 row updated, 1 value computed." ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Update the row with id == 1002, adding an image to the `map_image` column\n", "\n", "eq_t.where(eq_t.id == 1002).update(\n", " {'map_image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/port-townsend-map.jpeg'}\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Note that in Pixeltable, you can always insert images into a table by giving the file path or URL of the image (as a string). It's not necessary to load the image first; Pixeltable will manage the loading and caching of images in the background. The same applies to other media data such as documents and videos.\n", "
\n", "\n", "Pixeltable will also embed image thumbnails in your notebook when you do a query:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idmagnitudelocationmap_image
1000-0.0217 km SSE of Carbonado, WANone
10010.8222 km ENE of Ashford, WANone
10024.3Port Townsend, WA
\n", " \n", "
10031.04WANone
10040.7924 km ENE of Ashford, WANone
" ], "text/plain": [ " id magnitude location \\\n", "0 1000 -0.02 17 km SSE of Carbonado, WA \n", "1 1001 0.82 22 km ENE of Ashford, WA \n", "2 1002 4.30 Port Townsend, WA \n", "3 1003 1.04 WA \n", "4 1004 0.79 24 km ENE of Ashford, WA \n", "\n", " map_image \n", "0 None \n", "1 None \n", "2 = 1000).select(eq_t.id, eq_t.magnitude, eq_t.location, eq_t.map_image).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Directory Hierarchies\n", "\n", "So far we've only seen an example of a single directory with a table inside it, but one can also put directories inside other directories, in whatever fashion makes the most sense for a given application." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created directory 'fundamentals.subdir'.\n", "Created directory 'fundamentals.subdir.subsubdir'.\n", "Created table 'my_table'.\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
table 'fundamentals.subdir.subsubdir.my_table'
\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column NameTypeComputed With
my_colString
\n" ], "text/plain": [ "table 'fundamentals.subdir.subsubdir.my_table'\n", "\n", " Column Name Type Computed With\n", " my_col String " ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pxt.create_dir('fundamentals.subdir')\n", "pxt.create_dir('fundamentals.subdir.subsubdir')\n", "pxt.create_table('fundamentals.subdir.subsubdir.my_table', {'my_col': pxt.String})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting Columns, Tables, and Directories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`drop_column()`, `drop_table()`, and `drop_dir()` are used to delete columns, tables, and directories, respectively." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "# Delete the `contact_email` column\n", "\n", "eq_t.drop_column('contact_email')" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
table 'fundamentals.earthquakes'
\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column NameTypeComputed With
idRequired[Int]
magnitudeFloat
locationString
timestampTimestamp
longitudeFloat
latitudeFloat
noteString
map_imageImage
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "eq_t.describe()" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "# Delete the entire table (cannot be reverted!)\n", "\n", "pxt.drop_table('fundamentals.earthquakes')" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "# Delete the entire directory and all its contents, including any nested\n", "# subdirectories (cannot be reverted)\n", "\n", "pxt.drop_dir('fundamentals', force=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Steps\n", "\n", "Learn more about working with Pixeltable:\n", "- [Computed Columns](https://docs.pixeltable.com/tutorials/computed-columns)\n", "- [Queries and Expressions](https://docs.pixeltable.com/tutorials/queries-and-expressions)" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.19" } }, "nbformat": 4, "nbformat_minor": 4 }