{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Xarray Tutorial\n", "\n", "
\n", "\n", "### Overview\n", " \n", "* **teaching:** 30 minutes\n", "* **exercises:** 0\n", "* **questions:**\n", " * What is xarray designed to do?\n", " * How do I create an xarray Dataset?\n", " * How does indexing work with xarray?\n", " * How do I run computations on xarray datasets?\n", " * What are ways to ploy xarray datasets?\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Table of contents\n", "1. [**Xarray primer**](#Xarray-primer)\n", "1. [**Creating data**](#Creating-data)\n", "1. [**Loading data**](#Loading-data)\n", "1. [**Selecting data**](#Selecting-data)\n", "1. [**Basic computations**](#Basic-Computations)\n", "1. [**Advanced computations**](#Advanced-Computations)\n", "1. [**ENSO excercise**](#ENSO-exercise)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Xarray primer\n", "We've seen that [Pandas](https://pandas.pydata.org/pandas-docs/stable/) and [Geopandas](http://geopandas.org) are excellent libraries for analyzing tabular \"labeled data\". [Xarray](http://xarray.pydata.org/en/stable/) is designed to make it easier to work with with _labeled multidimensional data_. By _multidimensional data_ (also often called _N-dimensional_), we mean data with many independent dimensions or axes. For example, we might represent Earth's surface temperature $T$ as a three dimensional variable\n", "\n", "$$ T(x, y, t) $$\n", "\n", "where $x$ and $y$ are spatial dimensions and and $t$ is time. By _labeled_, we mean data that has metadata associated with it describing the names and relationships between the variables. The cartoon below shows a \"data cube\" schematic dataset with temperature and preciptation sharing the same three dimensions, plus longitude and latitude as auxilliary coordinates.\n", "\n", "![xarray data model](https://github.com/pydata/xarray/raw/master/doc/_static/dataset-diagram.png)\n", "\n", "### Xarray data structures\n", "\n", "Like Pandas, xarray has two fundamental data structures:\n", "* a `DataArray`, which holds a single multi-dimensional variable and its coordinates\n", "* a `Dataset`, which holds multiple variables that potentially share the same coordinates\n", "\n", "#### DataArray\n", "\n", "A `DataArray` has four essential attributes:\n", "* `values`: a `numpy.ndarray` holding the array’s values\n", "* `dims`: dimension names for each axis (e.g., `('x', 'y', 'z')`)\n", "* `coords`: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)\n", "* `attrs`: an `OrderedDict` to hold arbitrary metadata (attributes)\n", "\n", "#### DataSet\n", "\n", "A dataset is simply an object containing multiple DataArrays indexed by variable name\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating data\n", "\n", "Let's start by constructing some DataArrays manually " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import xarray as xr" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Numpy version: ', np.__version__)\n", "print('Xarray version: ', xr.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we model the simple function\n", "\n", "$$f(x) = sin(x)$$\n", "\n", "on the interval $-\\pi$ to $\\pi$. We start by creating the data as numpy arrays." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.linspace(-np.pi, np.pi, 19)\n", "f = np.sin(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are going to put this into an xarray DataArray.\n", "\n", "A simple DataArray without dimensions or coordinates isn't much use." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "da_f = xr.DataArray(f)\n", "da_f" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can add a dimension name..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "da_f = xr.DataArray(f, dims=['x'])\n", "da_f" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But things get most interesting when we add a coordinate:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "da_f = xr.DataArray(f, dims=['x'], coords={'x': x})\n", "da_f" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Xarray has built-in plotting, like pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "da_f.plot(marker='o')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting Data\n", "\n", "We can always use regular numpy indexing and slicing on DataArrays to get the data back out." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the 10th item\n", "da_f[10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the first 10 items\n", "da_f[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, it is often much more powerful to use xarray's `.sel()` method to use label-based indexing. This allows us to fetch values based on the value of the coordinate, not the numerical index." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "da_f.sel(x=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "da_f.sel(x=slice(0, np.pi)).plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic Computations\n", "\n", "When we perform mathematical manipulations of xarray DataArrays, the coordinates come along for the ride.\n", "Imagine we want to calcuate\n", "\n", "$$ g = f^2 + 1 $$\n", "\n", "We can apply familiar numpy operations to xarray objects.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "da_g = da_f**2 + 1\n", "da_g" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "da_g.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "- Multipy the DataArrays `da_f` and `da_g` together.\n", "- Select the range $-1 < x < 1$\n", "- Plot the result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(da_f * da_g).sel(x=slice(-1, 1)).plot(marker='o')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multidimensional Data\n", "\n", "If we are just dealing with 1D data, Pandas and Xarray have very similar capabilities. Xarray's real potential comes with multidimensional data.\n", "\n", "At this point we will load data from a netCDF file into an xarray dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "git clone https://github.com/pangeo-data/tutorial-data.git" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds = xr.open_dataset('./tutorial-data/sst/NOAA_NCDC_ERSST_v3b_SST-1960.nc')\n", "ds" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Xarray > v0.14.1 has a new HTML output type!\n", "xr.set_options(display_style=\"html\")\n", "ds" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# both do the exact same thing\n", "\n", "# dictionary syntax\n", "sst = ds['sst']\n", "\n", "# attribute syntax\n", "sst = ds.sst\n", "\n", "sst" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multidimensional Indexing\n", "\n", "In this example, we take advantage of the fact that xarray understands time to select a particular date" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sst.sel(time='1960-06-15').plot(vmin=-2, vmax=30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But we can select along any axis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sst.sel(lon=180).transpose().plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sst.sel(lon=180, lat=40).plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Label-Based Reduction Operations\n", "\n", "Usually the process of data analysis involves going from a big, multidimensional dataset to a few concise figures.\n", "Inevitably, the data must be \"reduced\" somehow. Examples of simple reduction operations include:\n", "\n", "- Mean\n", "- Standard Deviation\n", "- Minimum\n", "- Maximum\n", "\n", "etc. Xarray supports all of these and more, via a familiar numpy-like syntax. But with xarray, you can specify the reductions by dimension.\n", "\n", "First we start with the default, reduction over all dimensions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sst.mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sst_time_mean = sst.mean(dim='time')\n", "sst_time_mean.plot(vmin=-2, vmax=30)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sst_zonal_mean = sst.mean(dim='lon')\n", "sst_zonal_mean.transpose().plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sst_time_and_zonal_mean = sst.mean(dim=('time', 'lon'))\n", "sst_time_and_zonal_mean.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# some might prefer to have lat on the y axis\n", "sst_time_and_zonal_mean.plot(y='lat')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More Complicated Example: Weighted Mean\n", "\n", "The means we calculated above were \"naive\"; they were straightforward numerical means over the different dimensions of the dataset. They did not account, for example, for spherical geometry of the globe and the necessary weighting factors. Although xarray is very useful for geospatial analysis, **it has no built-in understanding of geography**.\n", "\n", "Below we show how to create a proper weighted mean by using the formula for the area element in spherical coordinates. This is a good illustration of several xarray concepts.\n", "\n", "The [area element for lat-lon coordinates](https://en.wikipedia.org/wiki/Spherical_coordinate_system#Integration_and_differentiation_in_spherical_coordinates) is\n", "\n", "$$ \\delta A = R^2 \\delta \\phi \\delta \\lambda \\cos(\\phi) $$\n", "\n", "where $\\phi$ is latitude, $\\delta \\phi$ is the spacing of the points in latitude, $\\delta \\lambda$ is the spacing of the points in longitude, and $R$ is Earth's radius. (In this formula, $\\phi$ and $\\lambda$ are measured in radians.) Let's use xarray to create the weight factor." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "R = 6.37e6\n", "# we know already that the spacing of the points is one degree latitude\n", "dϕ = np.deg2rad(1.)\n", "dλ = np.deg2rad(1.)\n", "dA = R**2 * dϕ * dλ * np.cos(np.deg2rad(ds.lat))\n", "dA.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dA.where(sst[0].notnull())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pixel_area = dA.where(sst[0].notnull())\n", "pixel_area.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "total_ocean_area = pixel_area.sum(dim=('lon', 'lat'))\n", "sst_weighted_mean = (sst * pixel_area).sum(dim=('lon', 'lat')) / total_ocean_area\n", "sst_weighted_mean.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Maps\n", "\n", "Xarray integrates with cartopy to enable you to plot your data on a map" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import cartopy.crs as ccrs\n", "\n", "plt.figure(figsize=(12, 8))\n", "ax = plt.axes(projection=ccrs.InterruptedGoodeHomolosine())\n", "ax.coastlines()\n", "\n", "sst[0].plot(transform=ccrs.PlateCarree(), vmin=-2, vmax=30,\n", " cbar_kwargs={'shrink': 0.4})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## IV. Opening Many Files\n", "\n", "One of the killer features of xarray is its ability to open many files into a single dataset. We do this with the `open_mfdataset` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')\n", "ds_all" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have 57 years of data instead of one!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## V. Groupby\n", "\n", "Now that we have a bigger dataset, this is a good time to check out xarray's groupby capabilities." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sst_clim = ds_all.sst.groupby('time.month').mean(dim='time')\n", "sst_clim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the data has dimension `month` instead of time!\n", "Each value represents the average among all of the Januaries, Februaries, etc. in the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(sst_clim[6] - sst_clim[0]).plot()\n", "plt.title('June minus July SST Climatology')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## VI. Resample and Rolling\n", "\n", "Resample is meant specifically to work with time data (data with a `datetime64` variable as a dimension).\n", "It allows you to change the time-sampling frequency of your data.\n", "\n", "Let's illustrate by selecting a single point." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sst_ts = ds_all.sst.sel(lon=300, lat=10)\n", "sst_ts_annual = sst_ts.resample(time='A').mean(dim='time')\n", "sst_ts_annual" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sst_ts.plot()\n", "sst_ts_annual.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An alternative approach is a \"running mean\" over the time dimension.\n", "This can be accomplished with xarray's `.rolling` operation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sst_ts_rolling = sst_ts.rolling(time=24, center=True).mean()\n", "sst_ts_annual.plot(marker='o')\n", "sst_ts_rolling.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finale: Calculate the ENSO Index\n", "\n", "[This page from NOAA](https://www.ncdc.noaa.gov/teleconnections/enso/indicators/sst) explains how the El Niño Southern Oscillation index is calculated.\n", "\n", "\n", "- The Nino 3.4 region is defined as the region between +/- 5 deg. lat, 170 W - 120 W lon.\n", "- Warm or cold phases of the Oceanic Nino Index are defined by a five consecutive 3-month running mean of sea surface temperature (SST) anomalies in the Niño 3.4 region that is above (below) the threshold of +0.5°C (-0.5°C). This is known as the Oceanic Niño Index (ONI).\n", "\n", "(Note that \"anomaly\" means that the seasonal cycle is removed.)\n", "\n", "_Try working on this on your own for 5 minutes._" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you're done, try comparing the ENSO Index you calculated with the NINO3.4 index published by [NOAA](https://www.ncdc.noaa.gov/teleconnections/enso/indicators/sst/). The pandas snippet below will load the official time series for comparison." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "noaa_nino34 = pd.read_csv('https://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices',\n", " sep=r\" \", skipinitialspace=True,\n", " parse_dates={'time': ['YR','MON']},\n", " index_col='time')['NINO3.4']\n", "noaa_nino34.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Help with Xarray\n", "\n", "Here are some important resources for learning more about xarray and getting help.\n", "\n", "- [Xarray Documentation](http://xarray.pydata.org/en/latest/)\n", "- [Xarray GitHub Issue Tracker](https://github.com/pydata/xarray/issues)\n", "- [Xarray questions on StackOverflow](https://stackoverflow.com/questions/tagged/python-xarray)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }