{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "\n", "
\n", "\"Unidata\n", "
\n", "\n", "

xarray: Indexing and Selecting Data

\n", "

Unidata AMS 2021 Student Conference

\n", "\n", "
\n", "
\n", "\n", "---\n", "\n", "Using [xarray](http://xarray.pydata.org/en/stable/), it is easy to select out subsets of or indexing into your data because of the coordinate information available on xarray's data structures. Unlike simply NumPy arrays, which only allow posititional dimension lookup by integer indexes, xarray also allows specifying dimensions by name (e.g., x=..., y=...) and indexes by labels ('2015-06-02T00:00'). This notebook will give a brief overview of the available options.\n", "\n", "For a full description of indexing and selecting data in xarray, see [the relevant page in xarray's documentation](http://xarray.pydata.org/en/stable/indexing.html)\n", "\n", "
\"HTML
\n", "\n", "\n", "### Focuses\n", "\n", "- Learn the basic forms of indexing with xarray (positional and name based dimensions, integer and label based indexing)\n", "- Learn additional useful methods relating to indexing/selection in xarray such as nearest neighbor lookups and dropping/masking\n", "- Discover more advanced options for vectorized indexing that are also available in xarray\n", "\n", "### Objectives\n", "\n", "1. [Positional (NumPy-Like) Indexing](#1.-Positional-(NumPy-like)-Indexing)\n", "1. [Indexing with Dimension Names](#2.-Indexing-with-Dimension-Names)\n", "1. [Nearest Neighbor Lookups](#3.-Nearest-Neighbor-Lookups)\n", "1. [Dropping and Masking](#4.-Droping-and-Masking)\n", "1. [Vectorized Indexing](#5.-Vectorized-Indexing)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import xarray as xr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Positional (NumPy-Like) Indexing\n", "\n", "First, let's open up a test dataset to work with" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = xr.open_dataset(\n", " \"https://thredds.unidata.ucar.edu/thredds/dodsC/grib/NCEP/GFS/Global_onedeg/GFS_Global_onedeg_20210110_0600.grib2\"\n", ")[[\n", " \"LatLon_Projection\",\n", " \"Geopotential_height_isobaric\",\n", " \"Geopotential_height_surface\",\n", " \"Temperature_isobaric\",\n", " \"u-component_of_wind_isobaric\",\n", " \"v-component_of_wind_isobaric\"\n", "]]\n", "\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And also pull out a single data variable to work with in detail" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "heights = data['Geopotential_height_isobaric']\n", "heights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(if any components of these data representations, with dimensions, coordinates, data variables, and attributes are unfamilar to you, be sure to check out the [xarray terminology document](http://xarray.pydata.org/en/stable/terminology.html)).\n", "\n", "Now, at the most basic level, we can index into heights like we would for a basic NumPy array. For example, if we wanted the data at the final time (which is the first dimension by position)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "heights[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or if we wanted the 5th isobaric (vertical) level" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "heights[:, 4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we don't know the integer index of the data we want, and instead know its *label* or *coordinate value*? xarray can handle that with `.loc`, as in the example below. But, for now, we still need to keep track of the exact dimension order.\n", "\n", "To select the vertical profile of data at latitude 50 and longitude 260 at time 2021-01-13T00:00:00" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "heights.loc['2021-01-13T00:00:00', :, 50.0, 260.0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pay attention to the square brackets `[` and `]` here...this is different than the methods used below that are actually functions which use `(` and `)`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Indexing with Dimension Names\n", "\n", "So that's great, we can do what we can with NumPy, and even extend it to index our data using coordinate labels. But, it's inconvenient to keep track of which dimension goes in which order, especially when we already have names for all those dimensions...so let's just use them!\n", "\n", "### Integer Indexes\n", "\n", "To use integer indexes with dimension names, the `.isel` method is our friend.\n", "\n", "For example, what if we want all data between the first and 24th time on the 5th from last vertical level?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "heights.isel(isobaric6=-5, time=slice(0, 24))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how the order didn't matter, we just needed the right names! Also, take a look at the use of `slice`...this is the best way to specify index ranges (both here with integers and next up with labels) with xarray.\n", "\n", "### Label Indexes\n", "\n", "What if we want to leverage the full power of xarray's coordinate information to select by label along named dimensions? We use `.sel`!\n", "\n", "Let's do that same example again, times between '2021-01-10T06:00:00' and '2021-01-13T06:00:00' and a level of 90000 Pa, but with labels this time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "intial_heights_900 = heights.sel(time=slice('2021-01-10T06:00:00', '2021-01-13T06:00:00'), isobaric6=90000)\n", "intial_heights_900" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So that worked nicely! But do you notice something actually different about this subset? It has 25 times instead of 24. This is one little weird thing to keep in mind with xarray's label-based selection. Unlike most indexing in Python where the lower bound is inclusive and the upper bound is exclusive, here, xarray has decided that it makes more sense to also include the upper bound (since you're giving it that label directly)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Nearest Neighbor Lookups\n", "\n", "In case you don't have the exact coordinate labels of your data points, xarray also has you covered with `.sel` using the `method=` argument.\n", "\n", "Say we wanted to get a lowest-level time series for New Orleans (29.94, -90.06)? Also, let's show this time that `.sel` also works on Datasets, not just DataArrays." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wind_data = data[[\"u-component_of_wind_isobaric\", \"v-component_of_wind_isobaric\"]]\n", "\n", "wind_data.isel(isobaric=-1).sel(lat=29.94, lon=360-90.06, method='nearest')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many more interesting things you can do with these nearest neighbor lookups which can be found in [the xarray documentation](http://xarray.pydata.org/en/stable/indexing.html#nearest-neighbor-lookups), such as using methods of 'pad' or 'backfill' to round up or down your lookup. For full interpolation of data, take a look at the [interpolation training notebook](https://nbviewer.jupyter.org/github/Unidata/pyaos-ams-2021/blob/master/notebooks/analysis/xarray_interpolation.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Top\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Dropping and Masking\n", "\n", "So far we've covered extracting data from the full collection if you know where in the dataset the data you want are. But what if you instead knew the data you wanted to filter out or get rid of? Or what if you wanted to perform operations based on the data values themselves? This is where dropping and masking comes in handy.\n", "\n", "First, let's subset further from some of our earlier data to get something manageable to work with" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "subset = intial_heights_900.isel(time=0)\n", "\n", "subset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "subset.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, to show the power of masking, let's also get the geopotential heights at the ground/water surface, and filter out all the values where the 900 hPa surface dips below that level" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ground_surface = data['Geopotential_height_surface'].isel(time=0)\n", "\n", "subset_filtered = subset.where(subset >= ground_surface)\n", "\n", "subset_filtered" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "subset_filtered.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we have a substantial region where all of the values are filtered out and we want to get tighten our coordinate bounds to exclude them, we can also include the `drop=True` keyword argument" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "subset.where(subset >= ground_surface, drop=True).plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(see how the coordinates over Antarctica are clipped).\n", "\n", "What if we know the coordinate levels of the data we want to drop (like we would with `sel`, but reversed to remove rather than select)? We can use `.drop_sel`.\n", "\n", "For example, to drop the data from the tropics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "subset.drop_sel(lat=subset.lat.loc[slice(30, -30)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(notice how we did `subset.lat.loc[slice(30, -30)]`? This obtains the actual values of latitude rather than just the slice...this operation is a bit persnickety in that it requires hashable types to do its work, so the simple slice doesn't cut it here)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Vectorized Indexing\n", "\n", "What if you need something more flexible than just this orthogonal selection we've talked about so far...perhaps like selecting data along some custom path or set of points? For this, xarray offers vectorized indexing. While a full description is beyond the scope of this training notebook (for that, check out [the page in xarray's documentation](http://xarray.pydata.org/en/stable/indexing.html#vectorized-indexing), here's a quick demo below.\n", "\n", "The key here is to provide `DataArrays` as your indexers with shared common dimensions that you want in your output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "station_list = [\"A\", \"B\", \"C\"]\n", "station_lat = xr.DataArray([40., 30., 42.], dims=('station',), coords={'station': station_list})\n", "station_lon = xr.DataArray([260., 270., 280.], dims=('station',), coords={'station': station_list})\n", "\n", "data_at_stations = data.isel(time=0).sel(lat=station_lat, lon=station_lon)\n", "data_at_stations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## See also\n", "\n", "While indexing and selecting data is a key feature of xarray, there's a lot more to it than just that! Be sure to take a look at the other xarray training notebooks linked below:\n", "\n", "- [Xarray data access](https://nbviewer.jupyter.org/github/Unidata/pyaos-ams-2021/blob/master/notebooks/dataAccess/xarray_data_access.ipynb)\n", "- [Interpolation and regridding with xarray](https://nbviewer.jupyter.org/github/Unidata/pyaos-ams-2021/blob/master/notebooks/analysis/xarray_interpolation.ipynb)\n", "- [Xarray aggregation operations](https://nbviewer.jupyter.org/github/Unidata/pyaos-ams-2021/blob/master/notebooks/analysis/xarray_aggregations.ipynb)\n", "- [Calculations in xarray](https://nbviewer.jupyter.org/github/Unidata/pyaos-ams-2021/blob/master/notebooks/analysis/xarray_calculations.ipynb)\n", "- [MetPy with xarray](https://nbviewer.jupyter.org/github/Unidata/pyaos-ams-2021/blob/master/notebooks/analysis/metpy_and_xarray.ipynb)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:pyaos-ams-2021]", "language": "python", "name": "conda-env-pyaos-ams-2021-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.1" } }, "nbformat": 4, "nbformat_minor": 4 }