{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using `perspective.Table`\n", "\n", "This notebook provides an overview of `perspective.Table`, Perspective's core component that allows for lightning-fast data loading, query, update, and transformation.\n", "\n", "Tables can be used alone to manage datasets, or connected for data to flow quickly between multiple tables. Outside of a Jupyter context, Perspective tables can be used to create efficient [servers](https://github.com/finos/perspective/tree/master/examples/tornado-python) which allow data to be hosted and viewed by clients in the browser using Perspective's Javascript library." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from perspective import Table\n", "from datetime import date, datetime\n", "import numpy as np\n", "import pandas as pd\n", "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Supported Data Formats\n", "\n", "Perspective supports 6 core data types: `int`, `float`, `str`, `bool`, `datetime`, and `date`, and several data formats:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pandas DataFrames\n", "df = pd.DataFrame({\n", " \"a\": np.arange(0, 2),\n", " \"b\": np.array([\"a\", \"b\"], dtype=object),\n", " \"nullable\": [1.5, np.nan], # perspective handles `None` and `np.nan` values\n", " \"mixed\": [None, 1]\n", "})\n", "\n", "# Column-oriented\n", "data = {\n", " \"int\": [i for i in range(4)],\n", " \"float\": [i * 1.25 for i in range(4)],\n", " \"str\": [\"a\", \"b\", \"c\", \"d\"],\n", " \"bool\": [True, False, True, False],\n", " \"date\": [date.today() for i in range(4)],\n", " \"datetime\": [datetime.now() for i in range(4)]\n", "}\n", "\n", "# Row-oriented\n", "rows = [{\"a\": 1, \"b\": True}, {\"a\": 2, \"b\": False}]\n", "\n", "# CSV strings\n", "csv = df.to_csv()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Schemas\n", "\n", "To explicitly specify data types for columns, create a schema (a `dict` of `str` column names to data types):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = {\n", " \"int\": float,\n", " \"float\": int,\n", " \"str\": str,\n", " \"bool\": bool,\n", " \"date\": datetime,\n", " \"datetime\": datetime\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a Table\n", "\n", "A Table can be created by passing in a dataset or a schema, like the ones created above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# From a dataset\n", "table = Table(data)\n", "\n", "# Or a dataframe\n", "df_table = Table(df)\n", "\n", "# Or a CSV\n", "csv_table = Table(csv)\n", "\n", "# tables can be created from schema\n", "table2 = Table(schema)\n", "assert table2.size() == 0\n", "\n", "# constructing a table with an index, which is a column name to be used as the primary key\n", "indexed = Table(data, index=\"str\")\n", "\n", "# or a limit, which is a total cap on the number of rows in the table - updates past `limit` overwite at row 0\n", "limited = Table(data, limit=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using the Table\n", "\n", "A Table has several queryable properties:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# schema() returns a mapping of column names to data types\n", "display(\"Table schema:\", table.schema())\n", "\n", "# size() returns the number of rows in the table\n", "display(\"Table has {} rows\".format(table.size()))\n", "\n", "# columns() returns a List of the table's column names\n", "display(\"Table columns:\", table.columns())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Updating with new data\n", "\n", "To update or stream new data into the Table, call `table.update(data)`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# you can update all columns\n", "table.update(data)\n", "print(\"after update:\", table.size())\n", "\n", "# or however many you'd like\n", "table.update({\n", " \"int\": [5, 6, 7],\n", " \"str\": [\"x\", \"y\", \"z\"]\n", "})\n", "\n", "# but you cannot add new columns through updating - create a new Table instead\n", "try:\n", " table.update({\n", " \"abcd\": [1]\n", " })\n", "except:\n", " pass\n", "\n", "# updates on unindexed tables always append\n", "print(\"after append:\", table.size())\n", "\n", "# updates on indexed tables should include the primary key - the new data overwrites at the row specified by the primary key\n", "indexed.update([{\"str\": \"b\", \"int\": 100}])\n", "print(\"after indexed partial update:\", indexed.size())\n", "\n", "# without a primary key, the update appends to the end of the dataset\n", "indexed.update([{\"int\": 101}])\n", "print(\"after indexed append:\", indexed.size())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Queries and transformations using `View`\n", "\n", "`table.view()` allows you to apply various pivots, aggregates, sorts, filters, column selections, and expression computations on the Table, as well as return the results in a variety of output data formats.\n", "\n", "To create a view, call the `view()` method on an existing table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "view = table.view() # a view with zero transformations - returns the dataset as passed in\n", "\n", "# view metadata\n", "print(\"View has {} rows and {} columns\".format(view.num_rows(), view.num_columns()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Applying transformations\n", "\n", "To apply transformations, pass in the relevant `kwargs` into the constructor:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pivoted = table.view(row_pivots=[\"int\"], column_pivots=[\"str\"]) # group and split the underlying dataset\n", "\n", "aggregated = table.view(row_pivots=[\"int\"], aggregates={\"float\": \"avg\"}) # specify aggregations for individual columns\n", "\n", "subset = table.view(columns=[\"float\"]) # show only the columns you're interested in\n", "\n", "sorted_view = table.view(sort=[[\"str\", \"desc\"], [\"int\", \"asc\"]]) # sort on a specific column, or multiple columns\n", "\n", "filtered = table.view(filter=[[\"int\", \">\", 2]]) # filter the dataset on a specific value\n", "\n", "expressions = table.view(expressions=['\"int\" + \"float\" / 100']) # calculate arbitrary expressions over the dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Serializing Data\n", "\n", "Views are used to serialize data to the user in several formats:\n", "- `to_records`: outputs a list of dictionaries, each of which is a single row\n", "- `to_dict`: outputs a dictionary of lists, each string key the name of a column\n", "- `to_numpy`: outputs a dictionary of numpy arrays\n", "- `to_df`: outputs a `pandas.DataFrame`\n", "- `to_arrow`: outputs an Apache Arrow binary, which can be passed into another `perspective.Table` to create a copy of the first table" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rows = view.to_records()\n", "\n", "columnar = view.to_dict()\n", "\n", "np_out = view.to_numpy()\n", "\n", "df_out = view.to_df()\n", "\n", "arrow_out = view.to_arrow()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data from pivoted or otherwise transformed views reflect the state of the transformed dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filtered_df = filtered.to_df()\n", "filtered_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the table is updated with data, views are automatically notified of the updates:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "v1 = table.view()\n", "print(\"v1 has {} rows and {} columns\".format(v1.num_rows(), v1.num_columns()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "table.update({\"int\": [100, 200, 300, 400]})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"v1 has {} rows and {} columns\".format(v1.num_rows(), v1.num_columns()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using callbacks to connect `Table` instances\n", "\n", "Custom callback functions can be applied on the `Table` and `View` instances.\n", "\n", "The most useful is `View.on_update`, which triggers a callback after the Table has been updated:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The `delta` property is an Arrow binary containing updated rows\n", "def callback(port_id, delta):\n", " new_table = Table(delta)\n", " display(new_table.view().to_dict())\n", " \n", "table = Table(data)\n", "view = table.view()\n", "\n", "# Register the callback with `mode=\"row\"` to enable pushing back updated data\n", "view.on_update(callback, mode=\"row\")\n", "\n", "# Update will trigger the callback\n", "table.update({\n", " \"int\": [1, 3],\n", " \"str\": [\"abc\", \"def\"]\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the callback can be triggered with a _copy_ of the updated data, `on_update` allows you to connect together multiple tables that all share state quickly and dependably:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a table and a view\n", "t1 = Table(data)\n", "v1 = t1.view()\n", "\n", "# And a new table that feeds from `t1`\n", "t2 = Table(t1.schema())\n", "\n", "# And a callback that updates `t2` whenever `t1` updates\n", "def cb(port_id, delta):\n", " t2.update(delta)\n", " \n", "# register the callback\n", "v1.on_update(cb, mode=\"row\")\n", " \n", "# update t1, which updates t2 automatically\n", "t1.update(data)\n", "\n", "# t2 now has data after t1 is updated\n", "t2.view().to_df()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 4 }