{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataframes: The Basics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial will cover the following topics:\n", "\n", "* Storing a dataframe as a TileDB 1D dense array to allow fast (out-of-core) slicing on rows\n", "* Storing a dataframe as a TileDB ND sparse array to allow fast (out-of-core) execution of column range predicates\n", "* Interoperating with Pandas and [Apache Arrow](https://arrow.apache.org/)\n", "* Fast subselection on columns\n", "* Running SQL queries on the stored dataframes\n", "* Measuring performance in TileDB\n", "* Running on different storage backends\n", "* Some basic virtual file system (VFS) operations with TileDB\n", "\n", "[TileDB can model dataframes](https://docs.tiledb.com/main/basic-concepts/data-model#dataframes-can-be-modeled-as-dense-or-sparse-arrays) either as **dense** or **sparse arrays**. Storing a dataframe as a (1D) dense array allows for rapid slicing on row indices. On the other hand, storing the dataframe as an ND sparse array, specifying any subset of the columns to act as the *dimensions*, allows for rapid slicing with range predicates on those column dimensions.\n", "\n", "In either case, and in addition to the slicing predicate, TileDB allows for very fast subselection of columns. This is because it implements a \"columnar\" format and, therefore, fetches from persistent storage only the data of the requested columns.\n", "\n", "This notebook was run on a **2.3 GHz Intel Core i9, 8 cores, 16GB RAM, running macOS Mojave**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Started\n", "\n", "### Dataset\n", "\n", "We will use the [NYC Taxi Trip dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) and specifically the **yellow taxi trip records**, which follow [this schema](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf). 
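Since the points above lean on the idea of a "columnar" format, here is a minimal pure-Python sketch of why column subselection is cheap in such a layout. This is only an illustration of the idea — the in-memory dict and the `read_columns` helper are hypothetical, not TileDB's actual on-disk format or API:

```python
# Toy columnar layout: each column lives in its own buffer, so a query
# asking for 2 of the 18 taxi columns only ever touches 2 buffers.
columns = {
    "tpep_dropoff_datetime": ["2020-01-01 00:33:03", "2020-01-01 00:43:04"],
    "fare_amount": [6.0, 7.0],
    "tip_amount": [1.47, 1.50],
}

def read_columns(storage, names):
    # Fetch only the requested columns; all others are never read.
    return {name: storage[name] for name in names}

result = read_columns(columns, ["tpep_dropoff_datetime", "fare_amount"])
```

In a row-oriented file (like the CSV itself), the same query would have to scan every field of every row.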
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will focus on ingesting the data from [January 2020](https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv), namely file `yellow_tripdata_2020-01.csv`. The file is about 560MB." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-01.csv" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw---- 1 stavros staff 566M Jul 30 00:07 yellow_tripdata_2020-01.csv\n" ] } ], "source": [ "!ls -alh yellow_tripdata_2020-01.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Installation\n", "\n", "You need to install [TileDB-Py](https://github.com/TileDB-Inc/TileDB-Py), the Python wrapper of [TileDB Embedded](https://github.com/TileDB-Inc/TileDB), as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```bash\n", "# Pip:\n", "$ pip install tiledb\n", "\n", "# Or Conda:\n", "$ conda install -c conda-forge tiledb-py\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The notebook was run using **Pandas 1.1.0**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the TileDB core is a C++ library. To boost performance when integrating with pandas, we use Apache Arrow to achieve zero-copy when returning results from TileDB into pandas dataframes. You need to **install pyarrow** to take advantage of this optimization:\n", "\n", "```bash\n", "# Pip:\n", "$ pip install pyarrow\n", "\n", "# Or Conda:\n", "$ conda install -c conda-forge pyarrow\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup\n", "\n", "Import TileDB and check the versions of the C++ core and TileDB-Py respectively." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import tiledb, numpy as np" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2, 1, 3)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Version of TileDB core (C++ library)\n", "tiledb.libtiledb.version()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.7.4'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Version of TileDB-Py (Python wrapper)\n", "tiledb.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we start, we create the TileDB context, passing a **configuration parameter** that controls memory allocation during read queries (explained in a later tutorial). This must be set at the *very beginning* of the code, before any other TileDB function is called." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tiledb.Ctx() [see Ctx.confg() for configuration]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cfg = tiledb.Ctx().config()\n", "cfg.update(\n", " {\n", " 'py.init_buffer_bytes': 1024**2 * 50\n", " }\n", ")\n", "tiledb.default_ctx(cfg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also enable the TileDB **stats** so that we can get some insight into performance." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "tiledb.stats_enable()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Dense Case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We ingest the `yellow_tripdata_2020-01.csv` CSV file into a TileDB dense array as shown below. 
The command takes the taxi CSV file and ingests it into a 1D dense array called `taxi_dense_array`. It sets the tile extent to 100K, which means that groups of 100K rows each across every column will comprise the atomic unit of compression and IO (i.e., a [data tile](https://docs.tiledb.com/main/basic-concepts/terminology#data-tile)). Two of the columns (`tpep_dropoff_datetime` and `tpep_pickup_datetime`) are dates, so we make sure to parse them as such. Finally, one of the columns (`store_and_fwd_flag`) may have nulls, so we explicitly set some null value." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/miniconda3/envs/tiledb/lib/python3.8/site-packages/IPython/core/magic.py:187: DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False.\n", " call = lambda f, *a, **k: f(*a, **k)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "TileDB Embedded Version: (2, 1, 3)\n", "TileDB-Py Version: 0.7.4\n", "==== WRITE ====\n", "\n", "- Number of write queries: 1\n", "\n", "- Number of attributes written: 18\n", " * Number of fixed-sized attributes written: 17\n", " * Number of var-sized attributes written: 1\n", "\n", "- Number of bytes written: 141115071 bytes (0.131424 GB) \n", "- Number of write operations: 1362\n", "- Number of bytes filtered: 942532872 bytes (0.877802 GB) \n", "- Filtering deflation factor: 6.67918x\n", "\n", "- Total metadata written: 13304 bytes (1.23903e-05 GB) \n", " * Array schema: 1129 bytes (1.05146e-06 GB) \n", " * Fragment metadata footer: 846 bytes (7.87899e-07 GB) \n", " * R-tree: 99 bytes (9.22009e-08 GB) \n", " * Fixed-sized tile offsets: 6838 bytes (6.36838e-06 GB) \n", " * Var-sized tile offsets: 2307 bytes (2.14856e-06 GB) \n", " * Var-sized tile sizes: 2085 bytes (1.94181e-06 GB) \n", "\n", "- Time to write array metadata: 0.00435116 secs\n", " * Array metadata size: 55 bytes (5.12227e-08 
GB) \n", "\n", "\n", "- Write time: 0.711282 secs\n", " * Time to split the coordinates buffer: 9.6e-08 secs\n", " * Time to check out-of-bounds coordinates: 3.8e-08 secs\n", " * Time to initialize dense cell range iterators: 3.4535e-05 secs\n", " * Time to compute cell ranges: 5.2313e-05 secs\n", " * Time to prepare and filter tiles: 0.624271 secs\n", " * Time to write tiles: 0.0603356 secs\n", " * Time to write fragment metadata: 0.0119379 secs\n", "\n", "- Time to finalize write query: 1.37e-07 secs\n", "CPU times: user 18.1 s, sys: 4.42 s, total: 22.6 s\n", "Wall time: 16 s\n" ] } ], "source": [ "%%time\n", "tiledb.stats_reset()\n", "tiledb.from_csv(\"taxi_dense_array\", \"yellow_tripdata_2020-01.csv\", \n", " tile = 100000, \n", " parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime'], \n", " fillna={'store_and_fwd_flag': ''})\n", "tiledb.stats_dump()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the stats, the actual write time in TileDB took under 1 second (the rest was mostly parsing the CSV in Pandas). The raw uncompressed CSV data was about 870 MB in binary format, and those got compressed down to about 131 MB in TileDB. There are 18 columns written as attributes, one of which is var-sized (of string type, as we will see in the schema below)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's open the written array and inspect the TileDB schema." 
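Two numbers in the write stats above are easy to sanity-check by hand: the 6,405,008 ingested rows split into ceil(6405008 / 100000) = 65 logical data tiles given the 100K tile extent, and the reported deflation factor is simply bytes filtered divided by bytes written. A quick sketch, with the byte counts copied from the stats dump:

```python
import math

total_rows = 6_405_008        # rows in yellow_tripdata_2020-01.csv
tile_extent = 100_000         # rows per data tile (the `tile` argument)
bytes_written = 141_115_071   # compressed bytes on disk (from the stats)
bytes_filtered = 942_532_872  # raw uncompressed bytes (from the stats)

num_tiles = math.ceil(total_rows / tile_extent)  # 65 logical tiles
deflation = bytes_filtered / bytes_written       # ~6.679x, as reported
```

The same 65 tiles show up later in the read stats as the number of logical tiles overlapping a full-array query.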
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ArraySchema(\n", "  domain=Domain(*[\n", "    Dim(name='__tiledb_rows', domain=(0, 6405007), tile=100000, dtype='uint64'),\n", "  ]),\n", "  attrs=[\n", "    Attr(name='VendorID', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", "    Attr(name='tpep_pickup_datetime', dtype='datetime64[ns]', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", "    Attr(name='tpep_dropoff_datetime', dtype='datetime64[ns]', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", "    Attr(name='passenger_count', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", "    Attr(name='trip_distance', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", "    Attr(name='RatecodeID', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", "    Attr(name='store_and_fwd_flag', dtype=' [...remaining schema output truncated]" ] } ], "source": [ "A = tiledb.open(\"taxi_dense_array\")\n", "print(A.schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's read the entire array into a pandas dataframe." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "(HTML table output omitted)
\n", "" ], "text/plain": [ " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n", "0 1.0 2020-01-01 00:28:15 2020-01-01 00:33:03 1.0 \n", "1 1.0 2020-01-01 00:35:39 2020-01-01 00:43:04 1.0 \n", "2 1.0 2020-01-01 00:47:41 2020-01-01 00:53:52 1.0 \n", "3 1.0 2020-01-01 00:55:23 2020-01-01 01:00:14 1.0 \n", "4 2.0 2020-01-01 00:01:58 2020-01-01 00:04:16 1.0 \n", "... ... ... ... ... \n", "6405003 NaN 2020-01-31 22:51:00 2020-01-31 23:22:00 NaN \n", "6405004 NaN 2020-01-31 22:10:00 2020-01-31 23:26:00 NaN \n", "6405005 NaN 2020-01-31 22:50:07 2020-01-31 23:17:57 NaN \n", "6405006 NaN 2020-01-31 22:25:53 2020-01-31 22:48:32 NaN \n", "6405007 NaN 2020-01-31 22:44:00 2020-01-31 23:06:00 NaN \n", "\n", " trip_distance RatecodeID store_and_fwd_flag PULocationID \\\n", "0 1.20 1.0 N 238 \n", "1 1.20 1.0 N 239 \n", "2 0.60 1.0 N 238 \n", "3 0.80 1.0 N 238 \n", "4 0.00 1.0 N 193 \n", "... ... ... ... ... \n", "6405003 3.24 NaN \u0000 237 \n", "6405004 22.13 NaN \u0000 259 \n", "6405005 10.51 NaN \u0000 137 \n", "6405006 5.49 NaN \u0000 50 \n", "6405007 11.60 NaN \u0000 179 \n", "\n", " DOLocationID payment_type fare_amount extra mta_tax tip_amount \\\n", "0 239 1.0 6.00 3.00 0.5 1.47 \n", "1 238 1.0 7.00 3.00 0.5 1.50 \n", "2 238 1.0 6.00 3.00 0.5 1.00 \n", "3 151 1.0 5.50 0.50 0.5 1.36 \n", "4 193 2.0 3.50 0.50 0.5 0.00 \n", "... ... ... ... ... ... ... \n", "6405003 234 NaN 17.59 2.75 0.5 0.00 \n", "6405004 45 NaN 46.67 2.75 0.5 0.00 \n", "6405005 169 NaN 48.85 2.75 0.0 0.00 \n", "6405006 42 NaN 27.17 2.75 0.0 0.00 \n", "6405007 205 NaN 54.56 2.75 0.5 0.00 \n", "\n", " tolls_amount improvement_surcharge total_amount \\\n", "0 0.00 0.3 11.27 \n", "1 0.00 0.3 12.30 \n", "2 0.00 0.3 10.80 \n", "3 0.00 0.3 8.16 \n", "4 0.00 0.3 4.80 \n", "... ... ... ... 
\n", "6405003 0.00 0.3 21.14 \n", "6405004 12.24 0.3 62.46 \n", "6405005 0.00 0.3 51.90 \n", "6405006 0.00 0.3 30.22 \n", "6405007 0.00 0.3 58.11 \n", "\n", " congestion_surcharge \n", "0 2.5 \n", "1 2.5 \n", "2 2.5 \n", "3 0.0 \n", "4 0.0 \n", "... ... \n", "6405003 0.0 \n", "6405004 0.0 \n", "6405005 0.0 \n", "6405006 0.0 \n", "6405007 0.0 \n", "\n", "[6405008 rows x 18 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "tiledb.stats_reset()\n", "df = A.df[0:6405007]\n", "df" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TileDB Embedded Version: (2, 1, 3)\n", "TileDB-Py Version: 0.7.4\n", "==== READ ====\n", "\n", "- Number of read queries: 1\n", "- Number of attempts until results are found: 1\n", "\n", "- Number of attributes read: 18\n", " * Number of fixed-sized attributes read: 17\n", " * Number of var-sized attributes read: 1\n", "- Number of dimensions read: 1\n", " * Number of fixed-sized dimensions read: 1\n", "\n", "- Number of logical tiles overlapping the query: 65\n", "- Number of physical tiles read: 1300\n", " * Number of physical fixed-sized tiles read: 1170\n", " * Number of physical var-sized tiles read: 130\n", "- Number of cells read: 6500000\n", "- Number of result cells: 6405008\n", "- Percentage of useful cells read: 98.5386%\n", "\n", "- Number of bytes read: 141109657 bytes (0.131419 GB) \n", "- Number of read operations: 87\n", "- Number of bytes unfiltered: 942510623 bytes (0.877781 GB) \n", "- Unfiltering inflation factor: 6.67928x\n", "\n", "- Time to compute estimated result size: 0.00081386 secs\n", " * Time to compute tile overlap: 0.000311342 secs\n", " > Time to compute relevant fragments: 1.8264e-05 secs\n", " > Time to load relevant fragment R-trees: 0.000177244 secs\n", " > Time to compute relevant fragment tile overlap: 0.000106599 secs\n", "\n", "- Total metadata read: 
10568 bytes (9.84222e-06 GB) \n", " * R-tree: 8 bytes (7.45058e-09 GB) \n", " * Fixed-sized tile offsets: 9504 bytes (8.85129e-06 GB) \n", " * Var-sized tile offsets: 528 bytes (4.91738e-07 GB) \n", " * Var-sized tile sizes: 528 bytes (4.91738e-07 GB) \n", "\n", "- Time to load array metadata: 0.00107096 secs\n", " * Array metadata size: 55 bytes (5.12227e-08 GB) \n", "\n", "- Time to initialize the read state: 6.5081e-05 secs\n", "\n", "- Read time: 0.853632 secs\n", " * Time to compute next partition: 0.000322538 secs\n", " * Time to compute tile coordinates: 3.8468e-05 secs\n", " * Time to compute result coordinates: 8.112e-06 secs\n", " > Time to compute sparse result tiles: 4.263e-06 secs\n", " * Time to compute dense result cell slabs: 0.000260793 secs\n", " * Time to copy result attribute values: 0.838509 secs\n", " > Time to read attribute tiles: 0.132942 secs\n", " > Time to unfilter attribute tiles: 0.209192 secs\n", " > Time to copy fixed-sized attribute values: 0.352938 secs\n", " > Time to copy var-sized attribute values: 0.0787757 secs\n", " * Time to fill dense coordinates: 0.0144067 secs\n", "\n", "- Total read query time (array open + init state + read): 0.853698 secs\n", "==== Python Stats ====\n", "\n", "- TileDB-Py Indexing Time: 1.49114\n", " * TileDB-Py query execution time: 0.861424\n", " > TileDB C++ Core initial query submit time: 0.853751\n", " * TileDB-Py buffer conversion time: 0.627318\n", "\n" ] } ], "source": [ "tiledb.stats_dump()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This operation fetches the entire array / dataframe from disk, decompresses all tiles and creates a pandas dataframe with the result. The whole process takes about 0.85 seconds in the TileDB core (C++) and about 0.63 seconds on the Python wrapper side for buffer conversion, i.e., roughly 1.5 seconds in total. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The stats are quite informative. 
They break down how long it took to read from storage and unfilter (i.e., decompress) the data, how many cells were fetched, the percentage of useful cells read, etc. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, note that you do not need to read the entire dataframe into main memory in order to process it. You can efficiently slice any subset of rows directly from storage as follows, and TileDB makes very lightweight use of main memory to process the result. Note that `df[]` works with [multi-index semantics](https://docs.tiledb.com/main/solutions/tiledb-embedded/api-usage/reading-arrays/multi-range-subarrays) and thus can take [multi-range subarrays](https://docs.tiledb.com/main/basic-concepts/terminology#subarray) as well." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 19.1 ms, sys: 137 ms, total: 156 ms\n", "Wall time: 74.2 ms\n" ] }, { "data": { "text/html": [ "
(HTML table output omitted)
" ], "text/plain": [ " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n", "0 1.0 2020-01-01 00:28:15 2020-01-01 00:33:03 1.0 \n", "1 1.0 2020-01-01 00:35:39 2020-01-01 00:43:04 1.0 \n", "2 1.0 2020-01-01 00:47:41 2020-01-01 00:53:52 1.0 \n", "3 1.0 2020-01-01 00:55:23 2020-01-01 01:00:14 1.0 \n", "4 2.0 2020-01-01 00:01:58 2020-01-01 00:04:16 1.0 \n", ".. ... ... ... ... \n", "995 2.0 2020-01-01 00:17:02 2020-01-01 00:22:02 2.0 \n", "996 2.0 2020-01-01 00:26:22 2020-01-01 00:52:31 2.0 \n", "997 2.0 2020-01-01 00:54:56 2020-01-01 01:10:51 2.0 \n", "998 1.0 2020-01-01 00:34:17 2020-01-01 00:42:20 1.0 \n", "999 1.0 2020-01-01 00:44:24 2020-01-01 01:16:17 2.0 \n", "\n", " trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID \\\n", "0 1.20 1.0 N 238 239 \n", "1 1.20 1.0 N 239 238 \n", "2 0.60 1.0 N 238 238 \n", "3 0.80 1.0 N 238 151 \n", "4 0.00 1.0 N 193 193 \n", ".. ... ... ... ... ... \n", "995 0.62 1.0 N 237 141 \n", "996 7.09 1.0 N 140 244 \n", "997 2.59 1.0 N 244 42 \n", "998 0.80 1.0 N 48 142 \n", "999 4.00 1.0 N 142 224 \n", "\n", " payment_type fare_amount extra mta_tax tip_amount tolls_amount \\\n", "0 1.0 6.0 3.0 0.5 1.47 0.0 \n", "1 1.0 7.0 3.0 0.5 1.50 0.0 \n", "2 1.0 6.0 3.0 0.5 1.00 0.0 \n", "3 1.0 5.5 0.5 0.5 1.36 0.0 \n", "4 2.0 3.5 0.5 0.5 0.00 0.0 \n", ".. ... ... ... ... ... ... \n", "995 1.0 5.0 0.5 0.5 2.20 0.0 \n", "996 2.0 25.5 0.5 0.5 0.00 0.0 \n", "997 1.0 12.5 0.5 0.5 0.00 0.0 \n", "998 1.0 6.5 3.0 0.5 2.05 0.0 \n", "999 2.0 21.0 3.0 0.5 0.00 0.0 \n", "\n", " improvement_surcharge total_amount congestion_surcharge \n", "0 0.3 11.27 2.5 \n", "1 0.3 12.30 2.5 \n", "2 0.3 10.80 2.5 \n", "3 0.3 8.16 0.0 \n", "4 0.3 4.80 0.0 \n", ".. ... ... ... 
\n", "995                     0.3         11.00                  2.5  \n", "996                     0.3         29.30                  2.5  \n", "997                     0.3         13.80                  0.0  \n", "998                     0.3         12.35                  2.5  \n", "999                     0.3         24.80                  2.5  \n", "\n", "[1000 rows x 18 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "df = A.df[0:999]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how much faster that operation was, taking only a few tens of milliseconds." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, you can slice any **subset of columns**, without first fetching all of the columns into a pandas dataframe. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 423 ms, sys: 614 ms, total: 1.04 s\n", "Wall time: 176 ms\n" ] }, { "data": { "text/html": [ "
(HTML table output omitted)
" ], "text/plain": [ " tpep_dropoff_datetime fare_amount\n", "0 2020-01-01 00:33:03 6.00\n", "1 2020-01-01 00:43:04 7.00\n", "2 2020-01-01 00:53:52 6.00\n", "3 2020-01-01 01:00:14 5.50\n", "4 2020-01-01 00:04:16 3.50\n", "... ... ...\n", "6405003 2020-01-31 23:22:00 17.59\n", "6405004 2020-01-31 23:26:00 46.67\n", "6405005 2020-01-31 23:17:57 48.85\n", "6405006 2020-01-31 22:48:32 27.17\n", "6405007 2020-01-31 23:06:00 54.56\n", "\n", "[6405008 rows x 2 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "df = A.query(attrs=['tpep_dropoff_datetime', 'fare_amount']).df[0:6405007]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once again, that operation was much faster than fetching the entire dataframe in main memory. The stats also inform you about how many attributes (i.e., columns) were retrieved, which is two in this example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember to close the array when you are done." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "A.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Sparse Case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Storing the dataframe as a 1D dense array allowed us to rapidly slice on row indexes. *But what if we wished to slice fast on predicates applied to column values*, such as dropoff time and fare amount? For such scenarios and if you know for a fact that the majority of your workloads involve applying a range (or equality) predicate on a specific subset of columns, you can create a sparse array with those columns set as the dimensions. \n", "\n", "This can be done as follows. 
Instead of the `tile` argument we used in dense arrays, we use `capacity` to determine how many rows to group in a data tile (read about [the difference between dense and sparse data tiles](https://docs.tiledb.com/main/basic-concepts/terminology#data-tile)). Also `index_col` determines which columns will act as dimensions." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/miniconda3/envs/tiledb/lib/python3.8/site-packages/IPython/core/magic.py:187: DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False.\n", " call = lambda f, *a, **k: f(*a, **k)\n", "/opt/miniconda3/envs/tiledb/lib/python3.8/site-packages/numpy/lib/arraysetops.py:580: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison\n", " mask |= (ar1 == a)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "TileDB Embedded Version: (2, 1, 3)\n", "TileDB-Py Version: 0.7.4\n", "==== WRITE ====\n", "\n", "- Number of write queries: 1\n", "\n", "- Number of attributes written: 16\n", " * Number of fixed-sized attributes written: 15\n", " * Number of var-sized attributes written: 1\n", "- Number of dimensions written: 2\n", " * Number of fixed-sized dimensions written: 2\n", "\n", "- Number of bytes written: 117477982 bytes (0.10941 GB) \n", "- Number of write operations: 1356\n", "- Number of bytes filtered: 928759802 bytes (0.864975 GB) \n", "- Filtering deflation factor: 7.90582x\n", "\n", "- Total metadata written: 14124 bytes (1.3154e-05 GB) \n", " * Array schema: 1080 bytes (1.00583e-06 GB) \n", " * Fragment metadata footer: 822 bytes (7.65547e-07 GB) \n", " * R-tree: 1261 bytes (1.1744e-06 GB) \n", " * Fixed-sized tile offsets: 6770 bytes (6.30505e-06 GB) \n", " * Var-sized tile offsets: 2205 bytes (2.05357e-06 GB) \n", " * Var-sized tile sizes: 1986 bytes (1.84961e-06 GB) \n", "\n", "- Time 
to write array metadata: 0.000813311 secs\n", " * Array metadata size: 98 bytes (9.12696e-08 GB) \n", "\n", "- Number of logical cells written: 6405008\n", "- Number of logical tiles written: 65\n", " * Number of fixed-sized physical tiles written: 1105\n", " * Number of var-sized physical tiles written: 130\n", "\n", "- Write time: 1.87926 secs\n", " * Time to split the coordinates buffer: 1.02e-07 secs\n", " * Time to check out-of-bounds coordinates: 0.0182705 secs\n", " * Time to sort coordinates: 0.821675 secs\n", " * Time to check coordinate duplicates: 1.3e-07 secs\n", " * Time to prepare tiles: 0.336857 secs\n", " * Time to compute coordinate metadata (e.g., MBRs): 0.00437902 secs\n", " * Time to filter tiles: 0.602146 secs\n", " * Time to write tiles: 0.0639578 secs\n", " * Time to write fragment metadata: 0.010609 secs\n", "\n", "- Time to finalize write query: 4.93e-07 secs\n", "CPU times: user 28.1 s, sys: 4.63 s, total: 32.8 s\n", "Wall time: 17.4 s\n" ] } ], "source": [ "%%time\n", "tiledb.stats_reset()\n", "tiledb.from_csv(\"taxi_sparse_array\", \"yellow_tripdata_2020-01.csv\", \n", " capacity=100000, \n", " sparse=True, \n", " index_col=['tpep_dropoff_datetime', 'fare_amount'], \n", " parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime'], \n", " fillna={'store_and_fwd_flag': ''})\n", "tiledb.stats_dump()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once again, most of the total ingestion time is spent parsing the CSV on the pandas side. Notice that the R-tree (which is 2D here) is slightly larger this time, as it is the main indexing method in sparse arrays. It is still tiny, though, relative to the entire array size of ~110 MB. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that you can choose **any** subset of columns as the dimensions (any number of them, with different types, even strings)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's open the array and print the schema." 
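For intuition about why that R-tree stays so small: conceptually, the sorted rows are packed into tiles of `capacity` cells, and only one minimum bounding rectangle (MBR) per tile is indexed, so a range query can prune whole tiles without touching their data. The following is a toy pure-Python sketch of this pruning under simplified assumptions (2D float coordinates standing in for the dropoff-time and fare-amount dimensions; not TileDB's actual implementation):

```python
# Sort rows on the dimensions, pack them into tiles of `capacity` cells,
# and keep only each tile's MBR (min/max per dimension) as the index.
capacity = 3
points = sorted([
    (1.0, 6.0), (1.5, 7.0), (2.0, 3.5), (2.5, 5.5),
    (3.0, 6.0), (3.5, 25.5), (4.0, 12.5), (5.0, 21.0),
])

tiles = [points[i:i + capacity] for i in range(0, len(points), capacity)]
mbrs = [((t[0][0], t[-1][0]),                              # range on dim 1
         (min(p[1] for p in t), max(p[1] for p in t)))     # range on dim 2
        for t in tiles]

def overlapping_tiles(d1_range, d2_range):
    # Keep only tiles whose MBR intersects the query range on both dims.
    return [i for i, ((d1_lo, d1_hi), (d2_lo, d2_hi)) in enumerate(mbrs)
            if d1_lo <= d1_range[1] and d1_hi >= d1_range[0]
            and d2_lo <= d2_range[1] and d2_hi >= d2_range[0]]

hits = overlapping_tiles((3.0, 5.0), (10.0, 30.0))
```

Here only tiles 1 and 2 survive the pruning, so tile 0 is never read from storage.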
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ArraySchema(\n", " domain=Domain(*[\n", " Dim(name='tpep_dropoff_datetime', domain=(numpy.datetime64('2003-01-01T14:16:59.000000000'), numpy.datetime64('2021-01-02T01:25:01.000000000')), tile=1000 nanoseconds, dtype='datetime64[ns]'),\n", " Dim(name='fare_amount', domain=(-1238.0, 4265.0), tile=1000.0, dtype='float64'),\n", " ]),\n", " attrs=[\n", " Attr(name='VendorID', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='tpep_pickup_datetime', dtype='datetime64[ns]', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='passenger_count', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='trip_distance', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='RatecodeID', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='store_and_fwd_flag', dtype='<U0', var=True, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='PULocationID', dtype='int64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='DOLocationID', dtype='int64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='payment_type', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='extra', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='mta_tax', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='tip_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='tolls_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='improvement_surcharge', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='total_amount', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='congestion_surcharge', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " ],\n", " cell_order='row-major',\n", " tile_order='row-major',\n", " capacity=100000,\n", " sparse=True,\n", " allows_duplicates=True,\n", " coords_filters=FilterList([ZstdFilter(level=-1), ])\n", ")\n" ] } ], "source": [ "A = tiledb.open(\"taxi_sparse_array\")\n", "print(A.schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now read the entire array into a pandas dataframe." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "(HTML table output omitted)
\n", "" ], "text/plain": [ " VendorID tpep_pickup_datetime \\\n", "tpep_dropoff_datetime fare_amount \n", "2003-01-01 14:16:59 0.0 2.0 2003-01-01 00:07:17 \n", "2008-12-31 23:32:32 9.5 2.0 2008-12-31 23:17:15 \n", "2008-12-31 23:36:53 3.5 2.0 2008-12-31 23:35:00 \n", "2008-12-31 23:37:14 16.5 2.0 2008-12-31 23:06:13 \n", "2008-12-31 23:55:35 12.5 2.0 2008-12-31 23:39:37 \n", "... ... ... \n", "2020-07-10 11:42:41 7.0 2.0 2020-07-10 11:34:11 \n", "2020-07-31 18:54:12 5.0 2.0 2020-07-31 18:50:41 \n", "2021-01-02 00:36:50 10.5 2.0 2021-01-02 00:22:00 \n", "2021-01-02 00:58:56 11.5 2.0 2021-01-02 00:44:08 \n", "2021-01-02 01:25:01 10.0 2.0 2021-01-02 01:12:10 \n", "\n", " passenger_count trip_distance RatecodeID \\\n", "tpep_dropoff_datetime fare_amount \n", "2003-01-01 14:16:59 0.0 1.0 0.00 1.0 \n", "2008-12-31 23:32:32 9.5 1.0 2.34 1.0 \n", "2008-12-31 23:36:53 3.5 1.0 0.42 1.0 \n", "2008-12-31 23:37:14 16.5 1.0 3.48 1.0 \n", "2008-12-31 23:55:35 12.5 1.0 2.69 1.0 \n", "... ... ... ... \n", "2020-07-10 11:42:41 7.0 1.0 1.07 1.0 \n", "2020-07-31 18:54:12 5.0 1.0 0.72 1.0 \n", "2021-01-02 00:36:50 10.5 1.0 1.56 1.0 \n", "2021-01-02 00:58:56 11.5 1.0 2.32 1.0 \n", "2021-01-02 01:25:01 10.0 1.0 1.87 1.0 \n", "\n", " store_and_fwd_flag PULocationID \\\n", "tpep_dropoff_datetime fare_amount \n", "2003-01-01 14:16:59 0.0 N 193 \n", "2008-12-31 23:32:32 9.5 N 162 \n", "2008-12-31 23:36:53 3.5 N 263 \n", "2008-12-31 23:37:14 16.5 N 144 \n", "2008-12-31 23:55:35 12.5 N 263 \n", "... ... ... 
\n", "2020-07-10 11:42:41 7.0 N 236 \n", "2020-07-31 18:54:12 5.0 N 236 \n", "2021-01-02 00:36:50 10.5 N 142 \n", "2021-01-02 00:58:56 11.5 N 170 \n", "2021-01-02 01:25:01 10.0 N 90 \n", "\n", " DOLocationID payment_type extra mta_tax \\\n", "tpep_dropoff_datetime fare_amount \n", "2003-01-01 14:16:59 0.0 193 2.0 0.0 0.0 \n", "2008-12-31 23:32:32 9.5 263 2.0 0.5 0.5 \n", "2008-12-31 23:36:53 3.5 263 2.0 0.5 0.5 \n", "2008-12-31 23:37:14 16.5 230 2.0 0.5 0.5 \n", "2008-12-31 23:55:35 12.5 143 2.0 0.5 0.5 \n", "... ... ... ... ... \n", "2020-07-10 11:42:41 7.0 262 1.0 1.0 0.5 \n", "2020-07-31 18:54:12 5.0 43 2.0 1.0 0.5 \n", "2021-01-02 00:36:50 10.5 161 2.0 1.0 0.5 \n", "2021-01-02 00:58:56 11.5 148 2.0 1.0 0.5 \n", "2021-01-02 01:25:01 10.0 148 1.0 1.0 0.5 \n", "\n", " tip_amount tolls_amount \\\n", "tpep_dropoff_datetime fare_amount \n", "2003-01-01 14:16:59 0.0 0.00 0.0 \n", "2008-12-31 23:32:32 9.5 0.00 0.0 \n", "2008-12-31 23:36:53 3.5 0.00 0.0 \n", "2008-12-31 23:37:14 16.5 0.00 0.0 \n", "2008-12-31 23:55:35 12.5 0.00 0.0 \n", "... ... ... \n", "2020-07-10 11:42:41 7.0 2.26 0.0 \n", "2020-07-31 18:54:12 5.0 0.00 0.0 \n", "2021-01-02 00:36:50 10.5 0.00 0.0 \n", "2021-01-02 00:58:56 11.5 0.00 0.0 \n", "2021-01-02 01:25:01 10.0 2.86 0.0 \n", "\n", " improvement_surcharge total_amount \\\n", "tpep_dropoff_datetime fare_amount \n", "2003-01-01 14:16:59 0.0 0.0 0.00 \n", "2008-12-31 23:32:32 9.5 0.3 13.30 \n", "2008-12-31 23:36:53 3.5 0.3 7.30 \n", "2008-12-31 23:37:14 16.5 0.3 20.30 \n", "2008-12-31 23:55:35 12.5 0.3 16.30 \n", "... ... ... 
\n", "2020-07-10 11:42:41 7.0 0.3 13.56 \n", "2020-07-31 18:54:12 5.0 0.3 9.30 \n", "2021-01-02 00:36:50 10.5 0.3 14.80 \n", "2021-01-02 00:58:56 11.5 0.3 15.80 \n", "2021-01-02 01:25:01 10.0 0.3 17.16 \n", "\n", " congestion_surcharge \n", "tpep_dropoff_datetime fare_amount \n", "2003-01-01 14:16:59 0.0 0.0 \n", "2008-12-31 23:32:32 9.5 2.5 \n", "2008-12-31 23:36:53 3.5 2.5 \n", "2008-12-31 23:37:14 16.5 2.5 \n", "2008-12-31 23:55:35 12.5 2.5 \n", "... ... \n", "2020-07-10 11:42:41 7.0 2.5 \n", "2020-07-31 18:54:12 5.0 2.5 \n", "2021-01-02 00:36:50 10.5 2.5 \n", "2021-01-02 00:58:56 11.5 2.5 \n", "2021-01-02 01:25:01 10.0 2.5 \n", "\n", "[6405008 rows x 16 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "tiledb.stats_reset()\n", "df = A.query().df[:]\n", "df" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TileDB Embedded Version: (2, 1, 3)\n", "TileDB-Py Version: 0.7.4\n", "==== READ ====\n", "\n", "- Number of read queries: 1\n", "- Number of attempts until results are found: 1\n", "\n", "- Number of attributes read: 16\n", " * Number of fixed-sized attributes read: 15\n", " * Number of var-sized attributes read: 1\n", "- Number of dimensions read: 2\n", " * Number of fixed-sized dimensions read: 2\n", "\n", "- Number of logical tiles overlapping the query: 65\n", "- Number of physical tiles read: 1235\n", " * Number of physical fixed-sized tiles read: 1105\n", " * Number of physical var-sized tiles read: 130\n", "- Number of cells read: 6405008\n", "- Number of result cells: 6405008\n", "- Percentage of useful cells read: 100%\n", "\n", "- Number of bytes read: 117472852 bytes (0.109405 GB) \n", "- Number of read operations: 86\n", "- Number of bytes unfiltered: 928739186 bytes (0.864956 GB) \n", "- Unfiltering inflation factor: 7.90599x\n", "\n", "- Time to compute estimated result size: 0.000917319 secs\n", 
" * Time to compute tile overlap: 0.000401086 secs\n", " > Time to compute relevant fragments: 1.2216e-05 secs\n", " > Time to load relevant fragment R-trees: 0.00025591 secs\n", " > Time to compute relevant fragment tile overlap: 0.000130426 secs\n", "\n", "- Total metadata read: 12928 bytes (1.20401e-05 GB) \n", " * R-tree: 2368 bytes (2.20537e-06 GB) \n", " * Fixed-sized tile offsets: 9504 bytes (8.85129e-06 GB) \n", " * Var-sized tile offsets: 528 bytes (4.91738e-07 GB) \n", " * Var-sized tile sizes: 528 bytes (4.91738e-07 GB) \n", "\n", "- Time to load array metadata: 0.000244082 secs\n", " * Array metadata size: 98 bytes (9.12696e-08 GB) \n", "\n", "- Time to initialize the read state: 6.9947e-05 secs\n", "\n", "- Read time: 1.36381 secs\n", " * Time to compute next partition: 0.000467399 secs\n", " * Time to compute result coordinates: 0.537777 secs\n", " > Time to compute sparse result tiles: 0.000178318 secs\n", " > Time to read coordinate tiles: 0.0154381 secs\n", " > Time to unfilter coordinate tiles: 0.0532034 secs\n", " > Time to compute range result coordinates: 0.319006 secs\n", " * Time to compute sparse result cell slabs: 0.0198344 secs\n", " * Time to copy result attribute values: 0.749709 secs\n", " > Time to read attribute tiles: 0.0899356 secs\n", " > Time to unfilter attribute tiles: 0.168391 secs\n", " > Time to copy fixed-sized attribute values: 0.352634 secs\n", " > Time to copy var-sized attribute values: 0.0858673 secs\n", " * Time to copy result coordinates: 0.0470755 secs\n", " > Time to copy fixed-sized coordinates: 0.0293159 secs\n", "\n", "- Total read query time (array open + init state + read): 1.36388 secs\n", "==== Python Stats ====\n", "\n", "- TileDB-Py Indexing Time: 2.94222\n", " * TileDB-Py query execution time: 1.37205\n", " > TileDB C++ Core initial query submit time: 1.36393\n", " * TileDB-Py buffer conversion time: 1.56823\n", "\n" ] } ], "source": [ "tiledb.stats_dump()" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "Notice that this takes longer than the dense case. This is because the sparse case involves more advanced indexing and copying operations than the dense one. However, the real benefit of sparse dataframe modeling is the ability to **slice rapidly with range conditions on the indexed dimensions**, without having to fetch the entire dataframe into main memory." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 14.7 ms, sys: 83.8 ms, total: 98.4 ms\n", "Wall time: 92.2 ms\n" ] }, { "data": { "text/html": [ "
\n", "(HTML table output omitted)
" ], "text/plain": [ " VendorID tpep_pickup_datetime \\\n", "tpep_dropoff_datetime fare_amount \n", "2020-07-10 10:52:46 5.5 2.0 2020-07-10 10:46:19 \n", "2020-07-10 11:02:42 5.5 2.0 2020-07-10 10:56:46 \n", "2020-07-10 11:21:14 12.5 2.0 2020-07-10 11:03:49 \n", "2020-07-10 11:42:41 7.0 2.0 2020-07-10 11:34:11 \n", "\n", " passenger_count trip_distance RatecodeID \\\n", "tpep_dropoff_datetime fare_amount \n", "2020-07-10 10:52:46 5.5 1.0 0.64 1.0 \n", "2020-07-10 11:02:42 5.5 1.0 0.93 1.0 \n", "2020-07-10 11:21:14 12.5 1.0 2.41 1.0 \n", "2020-07-10 11:42:41 7.0 1.0 1.07 1.0 \n", "\n", " store_and_fwd_flag PULocationID \\\n", "tpep_dropoff_datetime fare_amount \n", "2020-07-10 10:52:46 5.5 N 68 \n", "2020-07-10 11:02:42 5.5 N 170 \n", "2020-07-10 11:21:14 12.5 N 163 \n", "2020-07-10 11:42:41 7.0 N 236 \n", "\n", " DOLocationID payment_type extra mta_tax \\\n", "tpep_dropoff_datetime fare_amount \n", "2020-07-10 10:52:46 5.5 164 1.0 1.0 0.5 \n", "2020-07-10 11:02:42 5.5 163 1.0 1.0 0.5 \n", "2020-07-10 11:21:14 12.5 262 1.0 1.0 0.5 \n", "2020-07-10 11:42:41 7.0 262 1.0 1.0 0.5 \n", "\n", " tip_amount tolls_amount \\\n", "tpep_dropoff_datetime fare_amount \n", "2020-07-10 10:52:46 5.5 1.96 0.0 \n", "2020-07-10 11:02:42 5.5 1.96 0.0 \n", "2020-07-10 11:21:14 12.5 3.36 0.0 \n", "2020-07-10 11:42:41 7.0 2.26 0.0 \n", "\n", " improvement_surcharge total_amount \\\n", "tpep_dropoff_datetime fare_amount \n", "2020-07-10 10:52:46 5.5 0.3 11.76 \n", "2020-07-10 11:02:42 5.5 0.3 11.76 \n", "2020-07-10 11:21:14 12.5 0.3 20.16 \n", "2020-07-10 11:42:41 7.0 0.3 13.56 \n", "\n", " congestion_surcharge \n", "tpep_dropoff_datetime fare_amount \n", "2020-07-10 10:52:46 5.5 2.5 \n", "2020-07-10 11:02:42 5.5 2.5 \n", "2020-07-10 11:21:14 12.5 2.5 \n", "2020-07-10 11:42:41 7.0 2.5 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "df = A.df[np.datetime64(\"2020-07-01\"):np.datetime64(\"2020-10-01\"), 5.5:12.5]\n", "df" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "This is truly rapid. In the dense case, you would have to load the whole dataframe into main memory and then slice using pandas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can subselect on attributes as follows." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.65 s, sys: 798 ms, total: 2.45 s\n", "Wall time: 1.61 s\n" ] }, { "data": { "text/html": [ "
\n", "(HTML table output omitted)
" ], "text/plain": [ " trip_distance\n", "tpep_dropoff_datetime fare_amount \n", "2003-01-01 14:16:59 0.0 0.00\n", "2008-12-31 23:32:32 9.5 2.34\n", "2008-12-31 23:36:53 3.5 0.42\n", "2008-12-31 23:37:14 16.5 3.48\n", "2008-12-31 23:55:35 12.5 2.69\n", "... ...\n", "2020-07-10 11:42:41 7.0 1.07\n", "2020-07-31 18:54:12 5.0 0.72\n", "2021-01-02 00:36:50 10.5 1.56\n", "2021-01-02 00:58:56 11.5 2.32\n", "2021-01-02 01:25:01 10.0 1.87\n", "\n", "[6405008 rows x 1 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "df = A.query(attrs=['trip_distance']).df[:]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, TileDB also fetches the coordinate values and sets them as pandas indices. To disable this, you can run:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 787 ms, sys: 533 ms, total: 1.32 s\n", "Wall time: 655 ms\n" ] }, { "data": { "text/html": [ "
\n", "(HTML table output omitted)
" ], "text/plain": [ " trip_distance\n", "0 0.00\n", "1 2.34\n", "2 0.42\n", "3 3.48\n", "4 2.69\n", "... ...\n", "6405003 1.07\n", "6405004 0.72\n", "6405005 1.56\n", "6405006 2.32\n", "6405007 1.87\n", "\n", "[6405008 rows x 1 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "df = A.query(dims=False, attrs=['trip_distance']).df[:]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also subselect on dimensions:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 822 ms, sys: 690 ms, total: 1.51 s\n", "Wall time: 662 ms\n" ] }, { "data": { "text/html": [ "
\n", "(HTML table output omitted)
" ], "text/plain": [ " trip_distance\n", "tpep_dropoff_datetime \n", "2003-01-01 14:16:59 0.00\n", "2008-12-31 23:32:32 2.34\n", "2008-12-31 23:36:53 0.42\n", "2008-12-31 23:37:14 3.48\n", "2008-12-31 23:55:35 2.69\n", "... ...\n", "2020-07-10 11:42:41 1.07\n", "2020-07-31 18:54:12 0.72\n", "2021-01-02 00:36:50 1.56\n", "2021-01-02 00:58:56 2.32\n", "2021-01-02 01:25:01 1.87\n", "\n", "[6405008 rows x 1 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "df = A.query(dims=['tpep_dropoff_datetime'], attrs=['trip_distance']).df[:]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, you can even choose attributes to act as dataframe indices, using the `index_col` argument." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.02 s, sys: 1.3 s, total: 2.32 s\n", "Wall time: 811 ms\n" ] }, { "data": { "text/html": [ "
\n", "(HTML table output omitted)
" ], "text/plain": [ " tpep_dropoff_datetime fare_amount passenger_count\n", "trip_distance \n", "0.00 2003-01-01 14:16:59 0.0 1.0\n", "2.34 2008-12-31 23:32:32 9.5 1.0\n", "0.42 2008-12-31 23:36:53 3.5 1.0\n", "3.48 2008-12-31 23:37:14 16.5 1.0\n", "2.69 2008-12-31 23:55:35 12.5 1.0\n", "... ... ... ...\n", "1.07 2020-07-10 11:42:41 7.0 1.0\n", "0.72 2020-07-31 18:54:12 5.0 1.0\n", "1.56 2021-01-02 00:36:50 10.5 1.0\n", "2.32 2021-01-02 00:58:56 11.5 1.0\n", "1.87 2021-01-02 01:25:01 10.0 1.0\n", "\n", "[6405008 rows x 3 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "df = A.query(index_col=['trip_distance'], attrs=['passenger_count', 'trip_distance']).df[:]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience, TileDB can also return dataframe results as an **Arrow Table** as follows:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1 s, sys: 972 ms, total: 1.97 s\n", "Wall time: 742 ms\n" ] }, { "data": { "text/plain": [ "pyarrow.Table\n", "tpep_dropoff_datetime: timestamp[ns]\n", "fare_amount: double\n", "passenger_count: double\n", "trip_distance: double" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "df = A.query(return_arrow=True, index_col=['trip_distance'], attrs=['passenger_count', 'trip_distance']).df[:]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we are done, we can close the array." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "A.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Storing Pandas Dataframes in TileDB Arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also store a pandas dataframe you already created in main memory into a TileDB array. 
The following will create a new TileDB array and write the contents of a pandas dataframe." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# First read some data into a pandas dataframe\n", "A = tiledb.open(\"taxi_sparse_array\")\n", "df = A.query(attrs=['passenger_count', 'trip_distance']).df[:]\n", "df\n", "\n", "# Create and write into a TileDB array\n", "tiledb.from_pandas(\"sliced_taxi_sparse_array\", df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's inspect the schema." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ArraySchema(\n", " domain=Domain(*[\n", " Dim(name='tpep_dropoff_datetime', domain=(numpy.datetime64('2003-01-01T14:16:59.000000000'), numpy.datetime64('2021-01-02T01:25:01.000000000')), tile=1000 nanoseconds, dtype='datetime64[ns]'),\n", " Dim(name='fare_amount', domain=(-1238.0, 4265.0), tile=1000.0, dtype='float64'),\n", " ]),\n", " attrs=[\n", " Attr(name='passenger_count', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " Attr(name='trip_distance', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),\n", " ],\n", " cell_order='row-major',\n", " tile_order='row-major',\n", " capacity=10000,\n", " sparse=True,\n", " allows_duplicates=True,\n", " coords_filters=FilterList([ZstdFilter(level=-1), ])\n", ")" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A2 = tiledb.open(\"sliced_taxi_sparse_array\")\n", "A2.schema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading the array back:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "(HTML table output omitted)
" ], "text/plain": [ " passenger_count trip_distance\n", "tpep_dropoff_datetime fare_amount \n", "2003-01-01 14:16:59 0.0 1.0 0.00\n", "2008-12-31 23:32:32 9.5 1.0 2.34\n", "2008-12-31 23:36:53 3.5 1.0 0.42\n", "2008-12-31 23:37:14 16.5 1.0 3.48\n", "2008-12-31 23:55:35 12.5 1.0 2.69\n", "... ... ...\n", "2020-07-10 11:42:41 7.0 1.0 1.07\n", "2020-07-31 18:54:12 5.0 1.0 0.72\n", "2021-01-02 00:36:50 10.5 1.0 1.56\n", "2021-01-02 00:58:56 11.5 1.0 2.32\n", "2021-01-02 01:25:01 10.0 1.0 1.87\n", "\n", "[6405008 rows x 2 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A2.df[:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, we close the opened arrays." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "A.close()\n", "A2.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running SQL Queries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the cool things about TileDB is that it offers a powerful integration with [embedded MariaDB](https://docs.tiledb.com/main/solutions/tiledb-embedded/api-usage/embedded-sql). This allows for execution of arbitrary SQL queries directly on TileDB arrays (both dense and sparse). We took appropriate care to push the fast slicing and attribute subsetting portions of the query down to TileDB, leaving the rest of the SQL execution to MariaDB. In other words, we made MariaDB take advantage of the multi-dimensional indexing and columnar format of TileDB!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To install this capability, run:\n", "```bash\n", "conda install -c conda-forge libtiledb-sql-py\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The usage is very simple and intuitive. All results are returned directly as pandas dataframes."
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "import tiledb.sql, pandas as pd" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "db = tiledb.sql.connect()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 10.5 ms, sys: 21.6 ms, total: 32.1 ms\n", "Wall time: 24.7 ms\n" ] }, { "data": { "text/html": [ "
\n", "(HTML table output omitted)
" ], "text/plain": [ " AVG(trip_distance)\n", "0 2.81232" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "pd.read_sql(sql=\"SELECT AVG(trip_distance) FROM taxi_dense_array WHERE __tiledb_rows >= 0 AND __tiledb_rows <1000\", con=db)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 14.4 ms, sys: 106 ms, total: 121 ms\n", "Wall time: 47.6 ms\n" ] }, { "data": { "text/html": [ "
\n", "(HTML table output omitted)
" ], "text/plain": [ " AVG(trip_distance)\n", "0 0.448333" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "pd.read_sql(sql=\"SELECT AVG(trip_distance) FROM taxi_sparse_array WHERE tpep_dropoff_datetime <= '2019-07-31' AND fare_amount < 5.5\", con=db)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other Backends" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far we have explained how to store TileDB arrays on the local disk. TileDB is optimized for [numerous storage backends](https://docs.tiledb.com/main/solutions/tiledb-embedded/backends), including AWS S3, Azure Blob Storage and more. The entire functionality shown above (including SQL queries with embedded MariaDB) \"just works\" by replacing the array names `taxi_dense_array` and `taxi_sparse_array` with a URI that points to another backend, e.g., `s3://<bucket>/<path>/array_name`. The TileDB data format is **cloud-native** (based on immutable objects for fast updates and time traveling, to be covered in later tutorials) and the storage engine goes to great lengths to implement **parallel IO** while **minimizing the communication** with the backend wherever possible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simple VFS Operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to support numerous storage backends, we abstracted all IO operations (e.g., read, write, remove, move, list, etc.) behind a Virtual Filesystem (VFS) class, which [we exposed in our APIs](https://docs.tiledb.com/main/solutions/tiledb-embedded/api-usage/virtual-filesystem) as it is useful beyond the array internals. Everything we describe below in this section \"just works\" for any other storage backend URI."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, you can use the VFS functionality to list the contents of an array folder:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "vfs = tiledb.VFS()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['file:///Users/stavros/docs-tutorials/dataframes/taxi_sparse_array/__1608324905947_1608324905947_176244b5e6f342fb941ca3b6f519349c_6',\n", " 'file:///Users/stavros/docs-tutorials/dataframes/taxi_sparse_array/__1608324905947_1608324905947_176244b5e6f342fb941ca3b6f519349c_6.ok',\n", " 'file:///Users/stavros/docs-tutorials/dataframes/taxi_sparse_array/__array_schema.tdb',\n", " 'file:///Users/stavros/docs-tutorials/dataframes/taxi_sparse_array/__lock.tdb',\n", " 'file:///Users/stavros/docs-tutorials/dataframes/taxi_sparse_array/__meta']" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vfs.ls(\"taxi_sparse_array\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or remove the arrays we created:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "vfs.remove_dir(\"taxi_dense_array\")" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "vfs.remove_dir(\"taxi_sparse_array\")" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "vfs.remove_dir(\"sliced_taxi_sparse_array\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also remove the CSV file as follows."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vfs.remove_file('yellow_tripdata_2020-01.csv')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }