{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Datashader is designed to make it simple to work with even very large\n", "datasets. To get good performance, it is essential that each step in the\n", "overall processing pipeline be set up appropriately. Below we share some\n", "of our suggestions based on our own [benchmarking](https://github.com/holoviz/datashader/issues/313) and optimization\n", "experience, which should help you obtain suitable performance in your\n", "own work.\n", "\n", "## File formats\n", "\n", "Based on our [testing with various file formats](https://github.com/holoviz/datashader/issues/129), we recommend storing\n", "any large columnar datasets in the [Apache Parquet](https://parquet.apache.org/) format when\n", "possible, using the [fastparquet](https://github.com/dask/fastparquet) library with [Snappy](https://github.com/andrix/python-snappy) compression:\n", "\n", "```\n", ">>> import dask.dataframe as dd\n", ">>> dd.to_parquet(filename, df, compression=\"SNAPPY\")\n", "```\n", "\n", "If your data includes categorical values that take on a limited, fixed\n", "number of possible values (e.g. \"Male\", \"Female\"),\n", "Parquet's categorical columns use a more memory-efficient data representation and\n", "are optimized for common operations such as sorting and finding uniques.\n", "Before saving, just convert the column as follows:\n", "\n", "```\n", ">>> df[colname] = df[colname].astype('category')\n", "```\n", "\n", "By default, numerical datasets typically use 64-bit floats, but many\n", "applications do not require 64-bit precision when aggregating over a\n", "very large number of datapoints to show a distribution. Using 32-bit\n", "floats reduces storage and memory requirements in half, and also\n", "typically greatly speeds up computations because only half as much data\n", "needs to be accessed in memory. 
If applicable to your particular\n", "situation, just convert the data type before generating the file:\n", "\n", "```\n", ">>> import numpy\n", ">>> df[colname] = df[colname].astype(numpy.float32)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data objects\n", "\n", "Datashader performance will vary significantly depending on the library and specific data object type used to represent the data in Python, because different libraries and data objects have very different abilities to use the available processing power and memory. Moreover, different libraries and objects are appropriate for different types of data, due to how they organize and store the data internally as well as the operations they provide for working with the data. The various data container objects available from the supported libraries all fall into one of the following three types of data structures:\n", "- **[Columnar (tabular) data](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html)**: Relational, table-like data consisting of arbitrarily many rows, each with data for a fixed number of columns. For example, if you track the location of a particular cell phone over time, each time sampled would be a row, and for each time there could be columns for the latitude and longitude for the location at that time.\n", "- **[n-D arrays (multidimensional data)](http://xarray.pydata.org/en/stable/why-xarray.html)**: Data laid out in _n_ dimensions, where _n_ is typically >1. For example, you might have the precipitation measured on a latitude and longitude grid covering the whole world, for every time at which precipitation was measured. 
Such data could be stored columnarly, but it would be very inefficient; instead it is stored as a three dimensional array of precipitation values, indexed with time, latitude, and longitude.\n", "- **[Ragged arrays](https://en.wikipedia.org/wiki/Jagged_array)**: Relational/columnar data where the value of at least one column is a list of values that could vary in length for each row. For example, you may have a table with one row per US state and columns for population, land area, and the geographic shape of that state. Here the shape would be stored as a polygon consisting of an arbitrarily long list of latitude and longitude coordinates, which does not fit efficiently into a standard columnar data structure due to its ragged (variable length) nature.\n", "\n", "As you can see, all three examples include latitude and longitude values, but they are very different data structures that need to be stored differently for them to be processed efficiently. \n", "\n", "Apart from the data structure involved, the data libraries and objects differ by how they handle computation:\n", "\n", "- **Single-core CPU**: All processing is done serially on a single core of a single CPU on one machine. This is the most common and straightforward implementation, but the slowest, as there are generally other processing resources available that could be used.\n", "- **Multi-core CPU**: All processing is done on a single CPU, but using multiple threads running on multiple separate cores. This approach is able to make use of all of the CPU resources available on a given machine, but cannot use multiple machines.\n", "- **Distributed CPU**: Processing is distributed across multiple cores that may be on multiple CPUs in a cluster or a set of cloud-based machines. This approach can be much faster than single-CPU approaches when many CPUs are available. 
Distributed approaches also normally support multi-core usage, utilizing multiple cores on a single or on multiple CPUs.\n", "- **GPU**: Processing is done not on the CPU but on a separate general-purpose graphics-processing unit (GP-GPU). The GPU typically has far more (though individually less powerful) cores available than a CPU does, and for highly parallelizable computations like those in Datashader a GPU can typically achieve much faster performance at a given price point than a CPU or distributed set of CPUs can. However, not all machines have a supported GPU, memory available on the GPU is often limited, and it takes special effort to support a GPU (e.g. to avoid expensive copying of data between CPU and GPU memory), and so not all CPU code has been rewritten appropriately. \n", "- **Distributed GPUs**: When there are multiple GPUs available, processing can be distributed across all of the GPUs to further speed it up or to fit large problems into the larger total amount of GPU memory available across all of them.\n", "\n", "Finally, libraries differ by whether they can handle datasets larger than memory:\n", "\n", "- **In-core processing**: All data must fit into the directly addressable memory space available (e.g. RAM for a CPU); larger datasets have to be explicitly split up and processed separately.\n", "- **Out-of-core processing**: The data library can process data in chunks that it reads in as needed, allowing it to work with data much larger than the available memory (at the cost of having to read in those chunks each time they are needed, instead of simply referring to the data in memory).\n", "\n", "Given all of these options, the data objects currently supported by Datashader are:\n", "\n", "\n", "
Data object | Structure | Compute | Memory | Description | points | line | area | trimesh | raster | quadmesh | polygons\n",
"---|---|---|---|---|---|---|---|---|---|---|---\n",
"Pandas DF | columnar | 1-core CPU | in-core | Standard dataframes | Yes | Yes | Yes | Yes | - | - | -\n",
"DaskDF + PandasDF | columnar | distributed CPU | out-of-core | Distributed DataFrames | Yes | Yes | Yes | Yes | - | - | -\n",
"cuDF | columnar | single GPU | in-core | NVIDIA GPU DataFrames | Yes | Yes | Yes | No | - | - | -\n",
"DaskDF + cuDF | columnar | distributed GPU | out-of-core | Distributed NVIDIA GPUs | Yes | Yes | Yes | No | - | - | -\n",
"SpatialPandasDF | ragged | 1-core CPU | in-core | Ragged + spatial indexing | Yes | Yes | - | - | - | - | Yes\n",
"Dask + SpatialPandasDF | ragged | distributed CPU | out-of-core | Distributed ragged arrays | Yes | Yes | - | - | - | - | Yes\n",
"Xarray + NumPy | n-D | 1-core CPU | in-core | n-D CPU arrays | No | No | No | No | Yes | Yes | -\n",
"Xarray+DaskArray | n-D | distributed CPU | out-of-core | Distributed n-D arrays | No | No | No | No | Yes | Yes | -\n",
"Xarray+CuPy | n-D | single GPU | in-core | Single-GPU n-D arrays | No | No | No | No | No | Yes | -\n", "