{
"cells": [
{
"cell_type": "markdown",
"id": "990cfa4c-2117-4435-9806-ff9048890398",
"metadata": {
"tags": []
},
"source": [
"\n",
"\n",
"# Parallelizing Xarray with Dask\n",
"\n",
"### In this tutorial, you will learn how to:\n",
"\n",
"* Use Dask with Xarray\n",
"* Read and write netCDF files with Dask\n",
"* Work with Dask-backed Xarray objects and operations\n",
"* Extract Dask arrays from Xarray objects and use them directly\n",
"* Rely on Xarray's built-in operations, which transparently use Dask\n",
"\n",
"### Prerequisites\n",
"| Concepts | Importance | Notes |\n",
"| --- | --- | --- |\n",
"| [Intro to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro) | Necessary | |\n",
"| Dask Arrays | Necessary | |\n",
"| Dask DataFrames | Necessary | |\n",
"\n",
"- **Time to learn**: 40 minutes\n",
"---------\n",
"\n",
"## Introduction\n",
"\n",
"### Xarray Quick Overview\n",
"\n",
"\n",
" \n",
"Xarray is an open-source Python library designed for working with *labelled multi-dimensional* data. By *multi-dimensional* data (also often called *N-dimensional*), we mean data that has many independent dimensions or axes (e.g. latitude, longitude, time). By *labelled*, we mean that these axes or dimensions are associated with coordinate names (e.g. \"latitude\") and coordinate labels (e.g. \"30 degrees North\").\n",
"\n",
"Xarray provides pandas-level convenience for working with this type of data.\n",
"\n",
"\n",
"*[Figure: an Xarray dataset with its data variables and coordinates]*\n",
"\n",
"*Image credit: Xarray Contributors*\n",
"\n",
"The dataset illustrated has two variables (`temperature` and `precipitation`) that have three dimensions. Coordinate vectors (e.g., latitude, longitude, time) that describe the data are also included.\n",
"\n",
" \n",
"#### Xarray Data Structures\n",
"\n",
"Xarray has two fundamental data structures:\n",
"\n",
"* `DataArray` : holds a single multi-dimensional variable and its coordinates\n",
"* `Dataset` : holds multiple DataArrays that potentially share the same coordinates\n",
"\n",
"\n",
"**Xarray DataArray**\n",
"\n",
"A `DataArray` has four essential attributes:\n",
"* `data`: a `numpy.ndarray` or NumPy-like array (such as a Dask array) holding the values.\n",
"* `dims`: dimension names for each axis (e.g., latitude, longitude, time).\n",
"* `coords`: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings).\n",
"* `attrs`: a dictionary to hold arbitrary metadata (attributes).\n",
"\n",
"**Xarray Dataset**\n",
"\n",
"A `Dataset` is a dict-like container of multiple `DataArray` objects, indexed by variable name."
]
},
{
"cell_type": "markdown",
"id": "598055e0-3bac-491b-8b7f-d7313a306bc8",
"metadata": {},
"source": [
"### Xarray can wrap many array types, such as NumPy and Dask arrays\n",
"\n",
"Let's start with a random 2D NumPy array. For example, it could hold sea-surface temperature (SST) values on a 300x450 grid:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b6d9a1d-6520-4374-a178-ad91af454628",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import numpy as np\n",
"import dask.array as da\n",
"import xarray as xr\n",
"\n",
"xr.set_options(display_expand_data=False);"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5185246-289d-4bc3-a355-4c5101bd6ddd",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# -- NumPy array of random values on a 300x450 grid\n",
"sst_np = np.random.rand(300, 450)\n",
"type(sst_np)"
]
},
{
"cell_type": "markdown",
"id": "c17adcd3-672d-4a46-8b16-1f579aa29e8b",
"metadata": {},
"source": [
"As we saw in the previous tutorial, we can convert it to a Dask array:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "744e09ca-7a23-428b-9032-1808610c19b0",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"sst_da = da.from_array(sst_np)\n",
"sst_da"
]
},
{
"cell_type": "markdown",
"id": "62d098b3-a7fc-4562-bc7f-41c19b3c9280",
"metadata": {},
"source": [
"This is great and fast! But:\n",
"* What if we want to attach coordinate values to this array?\n",
"* What if we want to add metadata (e.g. units) to this array?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2006822d-c6db-4e67-995e-732d92ff10b6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# similarly, we can wrap the Dask array in an Xarray DataArray\n",
"sst_xr = xr.DataArray(sst_da)\n",
"sst_xr"
]
},
{
"cell_type": "markdown",
"id": "d1eb3ef7-8413-4578-bd1f-93f488d6b344",
"metadata": {},
"source": [
"A simple DataArray without named dimensions or coordinates isn't much use."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81aad904-f133-4a28-b1fc-aecc3fab7a11",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# we can add dimension names to this:\n",
"sst_xr = xr.DataArray(sst_da, dims=['lat', 'lon'])\n",
"\n",
"sst_xr.dims"
]
},
{
"cell_type": "markdown",
"id": "5dd6bea8-35ff-4f7f-91aa-cb808d30621c",
"metadata": {},
"source": [
"We can also attach coordinate values to it:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9525a4b-e99b-46e5-9154-f0badf205ff9",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# -- create some dummy values for lat and lon dimensions\n",
"lat = np.random.uniform(low=-90, high=90, size=300)\n",
"lon = np.random.uniform(low=-180, high=180, size=450)\n",
"\n",
"sst_xr = xr.DataArray(\n",
"    sst_da,\n",
"    dims=['lat', 'lon'],\n",
"    coords={'lat': lat, 'lon': lon},\n",
"    attrs=dict(description=\"Sea Surface Temperature.\", units=\"degC\"),\n",
")\n",
"sst_xr"
]
},
{
"cell_type": "markdown",
"id": "c58938ac-372a-42dc-8168-a59e4b45294b",
"metadata": {},
"source": [
"Xarray data structures are powerful tools that let us use metadata to express different analysis patterns (slicing, selecting, grouping, averaging, and more)."
]
},
{
"cell_type": "markdown",
"id": "a9be8aad-9135-45f5-a0e7-3d1d26e77869",
"metadata": {},
"source": [
"