{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Xarray with Dask Arrays\n", "\n", "\n", " \n", "**[Xarray](http://xarray.pydata.org/en/stable/)** is an open source project and Python package that extends the labeled data functionality of [Pandas](https://pandas.pydata.org/) to N-dimensional array-like datasets. It shares a similar API to [NumPy](http://www.numpy.org/) and [Pandas](https://pandas.pydata.org/) and supports both [Dask](https://dask.org/) and [NumPy](http://www.numpy.org/) arrays under the hood." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "from dask.distributed import Client\n", "import xarray as xr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start Dask Client for Dashboard\n", "\n", "Starting the Dask Client is optional. It will provide a dashboard which \n", "is useful to gain insight on the computation. \n", "\n", "The link to the dashboard will become visible when you create the client below. We recommend having it open on one side of your screen while using your notebook on the other side. This can take some effort to arrange your windows, but seeing them both at the same is very useful when learning." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client = Client(n_workers=2, threads_per_worker=2, memory_limit='1GB')\n", "client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Open a sample dataset\n", "\n", "We will use some of xarray's tutorial data for this example. By specifying the chunk shape, xarray will automatically create Dask arrays for each data variable in the `Dataset`. In xarray, `Datasets` are dict-like container of labeled arrays, analogous to the `pandas.DataFrame`. Note that we're taking advantage of xarray's dimension labels when specifying chunk shapes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds = xr.tutorial.open_dataset('air_temperature',\n", " chunks={'lat': 25, 'lon': 25, 'time': -1})\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Quickly inspecting the `Dataset` above, we'll note that this `Dataset` has three _dimensions_ akin to axes in NumPy (`lat`, `lon`, and `time`), three _coordinate variables_ akin to `pandas.Index` objects (also named `lat`, `lon`, and `time`), and one data variable (`air`). Xarray also holds Dataset specific metadata as _attributes_." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "da = ds['air']\n", "da" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each data variable in xarray is called a `DataArray`. These are the fundamental labeled array objects in xarray. Much like the `Dataset`, `DataArrays` also have _dimensions_ and _coordinates_ that support many of its label-based opperations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "da.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing the underlying array of data is done via the `data` property. Here we can see that we have a Dask array. If this array were to be backed by a NumPy array, this property would point to the actual values in the array." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use Standard Xarray Operations\n", "\n", "In almost all cases, operations using xarray objects are identical, regardless if the underlying data is stored as a Dask array or a NumPy array." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "da2 = da.groupby('time.month').mean('time')\n", "da3 = da - da2\n", "da3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Call `.compute()` or `.load()` when you want your result as a `xarray.DataArray` with data stored as NumPy arrays. \n", "\n", "