{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Analysis of Gridded Ensemble Precipitation and Temperature Estimates over the Contiguous United States\n", "====\n", "\n", "For this example, we'll open a 9 members of a 100 member ensemble of precipitation and temperature data. The analysis we do below is quite simple but the problem is a good illustration of an IO bound task. \n", "\n", "Link to dataset: https://www.earthsystemgrid.org/dataset/gridded_precip_and_temp.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import numpy as np\n", "import xarray as xr\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connect to Dask Distributed Cluster" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask.distributed import Client, progress\n", "# HPC\n", "# client = Client(scheduler_file='/glade/scratch/jhamman/scheduler.json')\n", "# client\n", "\n", "from dask_kubernetes import KubeCluster\n", "cluster = KubeCluster(n_workers=10)\n", "cluster" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client = Client(cluster)\n", "client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Open Dataset\n", "\n", "We try storing a traditional atmospheric dataset on the cloud with two approaches\n", "\n", "1. Hosting raw netCDF files from FUSE mounted file system based on the [gcsfs](gcsfs.readthedocs.io/en/latest/) Python project\n", "2. Storing the data in an easier-to-read but new format, [Zarr](http://zarr.readthedocs.io/en/stable/)\n", "3. Loading an [Intake](https://intake.readthedocs.io/en/latest/) catalog file from GCS, specifying the location of the zarr data.\n", "\n", "The dataset has dimensions of time, latitude, longitude, and ensmemble member. Each format is self-describing." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Load with FUSE File system\n", "# ds = xr.open_mfdataset('/gcs/newmann-met-ensemble-netcdf/conus_ens_00*.nc',\n", "# engine='netcdf4', concat_dim='ensemble', chunks={'time': 50})\n", "\n", "### Load with Cloud object storage\n", "import gcsfs\n", "\n", "fs = gcsfs.GCSFileSystem(project='pangeo-181919', token='anon', access='read_only')\n", "gcsmap = gcsfs.mapping.GCSMap('pangeo-data/gmet_v1.zarr',\n", " gcs=fs, check=False, create=False)\n", "ds = xr.open_zarr(gcsmap)\n", "\n", "### Load with intake catalog service\n", "# import intake\n", "# cat = intake.Catalog('https://raw.githubusercontent.com/pangeo-data/pangeo/master/gce/catalog.yaml')\n", "# ds = cat.gmet_v1.read_chunked()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print dataset\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Figure: Elevation and domain mask\n", "\n", "A quick plot of the mask to give us an idea of our spatial domain" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if 'ensemble' in ds['elevation'].dims:\n", " elevation = ds['elevation'].isel(ensemble=0).persist()\n", "else:\n", " elevation = ds['elevation'].persist()\n", "elevation = elevation.where(elevation > 0)\n", "elevation.plot(figsize=(10, 6))\n", "plt.title('Domain Elevation')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Intra-ensemble range\n", "\n", "We calculate the intra-ensemble range for all the mean daily temperature in this dataset. This gives us a sense of uncertainty." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp_mean = ds['t_mean'].mean(dim='time')\n", "spread = (temp_mean.max(dim='ensemble')\n", " - temp_mean.min(dim='ensemble'))\n", "spread" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calling compute\n", "The expressions above didn't actually compute anything. They just build the task graph. To do the computations, we call the `compute` or `persist` methods:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "spread = spread.persist()\n", "progress(spread)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Figure: Intra-ensemble range\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "spread\n", "spread.attrs['units'] = 'degC'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "spread.plot(robust=True, figsize=(10, 6))\n", "plt.title('Intra-ensemble range in mean annual temperature')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Average seasonal snowfall\n", "\n", "We can compute a crude estimate of average seasonal snowfall using the temperature and precipitation variables in our dataset. Here, we'll look at the first 4 ensemble members and make some maps of the seasonal total snowfall in each ensemble member." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "da_snow = ds['pcp'].where(ds['t_mean'] < 0.).resample(time='QS-Mar').sum('time')\n", "seasonal_snow = da_snow.isel(ensemble=slice(0, 4)).groupby('time.season').mean('time').persist()\n", "progress(seasonal_snow)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# properly sort the seasons\n", "seasonal_snow = seasonal_snow.sel(season=['DJF', 'MAM','JJA', 'SON'])\n", "seasonal_snow.attrs['units'] = 'mm/season'\n", "seasonal_snow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Figure: Average seasonal snowfall totals " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "seasonal_snow.plot.pcolormesh(col='season', row='ensemble', cmap='Blues', robust=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract a time series of annual maximum precipitation events over a region\n", "\n", "In the previous two examples, we've mostly reduced the time and/or ensemble dimension. Here, we'll do a reduction operation on the spatial dimension to look at a time series of extreme precipitation events near Austin, TX (30.2672° N, 97.7431° W)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "buf = 0.25 # look at Austin +/- 0.25 deg\n", "\n", "ds_tx = ds.sel(lon=slice(-97.7431-buf, -97.7431+buf), lat=slice(30.2672-buf, 30.2672+buf))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pcp_ann_max = ds_tx['pcp'].resample(time='AS').max('time')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pcp_ann_max_ts = pcp_ann_max.max(('lat', 'lon')).persist()\n", "progress(pcp_ann_max_ts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Figure: Timeseries of maximum precipitation near Austin, Tx." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = pcp_ann_max_ts.transpose().to_pandas().plot(title='Maximum precipitation near Austin, Tx', legend=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }