{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true }, "source": [ "

Working with big data: xarray and dask (DEMO)

\n", "\n", "\n", "> *DS Python for GIS and Geoscience* \n", "> *October, 2020*\n", ">\n", "> *© 2020, Joris Van den Bossche and Stijn Van Hoey. Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*\n", "\n", "---\n", "\n", "Throughout the course, we worked with small, often simplified or subsampled data. The tools we have seen work well as long as the data fit easily in memory. But many of these familiar tools can also be used for data that are larger than memory (e.g. large or high-resolution climate data).\n", "\n", "This notebook includes a brief showcase of using xarray with dask, a package to scale Python workflows (https://dask.org/). Dask integrates very well with xarray, providing a familiar xarray workflow for working with large datasets in parallel or on clusters." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
\n", "

Client

\n", "\n", "
\n", "

Cluster

\n", "
    \n", "
  • Workers: 1
  • \n", "
  • Cores: 8
  • \n", "
  • Memory: 16.46 GB
  • \n", "
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from dask.distributed import Client, LocalCluster\n", "client = Client(LocalCluster(processes=False)) # set up local cluster on your laptop\n", "client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *Multi-Scale Ultra High Resolution (MUR) Sea Surface Temperature (SST)* dataset (https://registry.opendata.aws/mur/) provides freely available, global, gap-free, gridded, daily, 1 km data on sea surface temperature for the last 20 years. I downloaded a tiny part of this dataset (8 days of 2020) to my laptop, and stored a subset of the variables (only the \"sst\" itself) in the zarr format (https://zarr.readthedocs.io/en/stable/), so we can efficiently read it with xarray and dask:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import xarray as xr" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "ds = xr.open_zarr(\"data/mur_sst_zarr/\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset>\n",
       "Dimensions:       (lat: 17999, lon: 36000, time: 8)\n",
       "Coordinates:\n",
       "  * lat           (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99\n",
       "  * lon           (lon) float32 -179.99 -179.98 -179.97 ... 179.98 179.99 180.0\n",
       "  * time          (time) datetime64[ns] 2020-01-01T09:00:00 ... 2020-01-08T09...\n",
       "Data variables:\n",
       "    analysed_sst  (time, lat, lon) float32 dask.array<chunksize=(1, 5000, 5000), meta=np.ndarray>\n",
       "Attributes:\n",
       "    Conventions:                CF-1.7\n",
       "    Metadata_Conventions:       Unidata Observation Dataset v1.0\n",
       "    acknowledgment:             Please acknowledge the use of these data with...\n",
       "    cdm_data_type:              grid\n",
       "    comment:                    MUR = "Multi-scale Ultra-high Resolution"\n",
       "    creator_email:              ghrsst@podaac.jpl.nasa.gov\n",
       "    creator_name:               JPL MUR SST project\n",
       "    creator_url:                http://mur.jpl.nasa.gov\n",
       "    date_created:               20200124T151031Z\n",
       "    easternmost_longitude:      180.0\n",
       "    file_quality_level:         3\n",
       "    gds_version_id:             2.0\n",
       "    geospatial_lat_resolution:  0.009999999776482582\n",
       "    geospatial_lat_units:       degrees north\n",
       "    geospatial_lon_resolution:  0.009999999776482582\n",
       "    geospatial_lon_units:       degrees east\n",
       "    history:                    created at nominal 4-day latency; replaced nr...\n",
       "    id:                         MUR-JPL-L4-GLOB-v04.1\n",
       "    institution:                Jet Propulsion Laboratory\n",
       "    keywords:                   Oceans > Ocean Temperature > Sea Surface Temp...\n",
       "    keywords_vocabulary:        NASA Global Change Master Directory (GCMD) Sc...\n",
       "    license:                    These data are available free of charge under...\n",
       "    metadata_link:              http://podaac.jpl.nasa.gov/ws/metadata/datase...\n",
       "    naming_authority:           org.ghrsst\n",
       "    netcdf_version_id:          4.1\n",
       "    northernmost_latitude:      90.0\n",
       "    platform:                   Terra, Aqua, GCOM-W, MetOp-A, MetOp-B, Buoys/...\n",
       "    processing_level:           L4\n",
       "    product_version:            04.1\n",
       "    project:                    NASA Making Earth Science Data Records for Us...\n",
       "    publisher_email:            ghrsst-po@nceo.ac.uk\n",
       "    publisher_name:             GHRSST Project Office\n",
       "    publisher_url:              http://www.ghrsst.org\n",
       "    references:                 http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...\n",
       "    sensor:                     MODIS, AMSR2, AVHRR, in-situ\n",
       "    source:                     MODIS_T-JPL, MODIS_A-JPL, AMSR2-REMSS, AVHRRM...\n",
       "    southernmost_latitude:      -90.0\n",
       "    spatial_resolution:         0.01 degrees\n",
       "    standard_name_vocabulary:   NetCDF Climate and Forecast (CF) Metadata Con...\n",
       "    start_time:                 20200108T090000Z\n",
       "    stop_time:                  20200108T090000Z\n",
       "    summary:                    A merged, multi-sensor L4 Foundation SST anal...\n",
       "    time_coverage_end:          20200108T210000Z\n",
       "    time_coverage_start:        20200107T210000Z\n",
       "    title:                      Daily MUR SST, Final product\n",
       "    uuid:                       27665bc0-d5fc-11e1-9b23-0800200c9a66\n",
       "    westernmost_longitude:      -180.0
" ], "text/plain": [ "\n", "Dimensions: (lat: 17999, lon: 36000, time: 8)\n", "Coordinates:\n", " * lat (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99\n", " * lon (lon) float32 -179.99 -179.98 -179.97 ... 179.98 179.99 180.0\n", " * time (time) datetime64[ns] 2020-01-01T09:00:00 ... 2020-01-08T09...\n", "Data variables:\n", " analysed_sst (time, lat, lon) float32 dask.array\n", "Attributes:\n", " Conventions: CF-1.7\n", " Metadata_Conventions: Unidata Observation Dataset v1.0\n", " acknowledgment: Please acknowledge the use of these data with...\n", " cdm_data_type: grid\n", " comment: MUR = \"Multi-scale Ultra-high Resolution\"\n", " creator_email: ghrsst@podaac.jpl.nasa.gov\n", " creator_name: JPL MUR SST project\n", " creator_url: http://mur.jpl.nasa.gov\n", " date_created: 20200124T151031Z\n", " easternmost_longitude: 180.0\n", " file_quality_level: 3\n", " gds_version_id: 2.0\n", " geospatial_lat_resolution: 0.009999999776482582\n", " geospatial_lat_units: degrees north\n", " geospatial_lon_resolution: 0.009999999776482582\n", " geospatial_lon_units: degrees east\n", " history: created at nominal 4-day latency; replaced nr...\n", " id: MUR-JPL-L4-GLOB-v04.1\n", " institution: Jet Propulsion Laboratory\n", " keywords: Oceans > Ocean Temperature > Sea Surface Temp...\n", " keywords_vocabulary: NASA Global Change Master Directory (GCMD) Sc...\n", " license: These data are available free of charge under...\n", " metadata_link: http://podaac.jpl.nasa.gov/ws/metadata/datase...\n", " naming_authority: org.ghrsst\n", " netcdf_version_id: 4.1\n", " northernmost_latitude: 90.0\n", " platform: Terra, Aqua, GCOM-W, MetOp-A, MetOp-B, Buoys/...\n", " processing_level: L4\n", " product_version: 04.1\n", " project: NASA Making Earth Science Data Records for Us...\n", " publisher_email: ghrsst-po@nceo.ac.uk\n", " publisher_name: GHRSST Project Office\n", " publisher_url: http://www.ghrsst.org\n", " references: http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...\n", " 
sensor: MODIS, AMSR2, AVHRR, in-situ\n", " source: MODIS_T-JPL, MODIS_A-JPL, AMSR2-REMSS, AVHRRM...\n", " southernmost_latitude: -90.0\n", " spatial_resolution: 0.01 degrees\n", " standard_name_vocabulary: NetCDF Climate and Forecast (CF) Metadata Con...\n", " start_time: 20200108T090000Z\n", " stop_time: 20200108T090000Z\n", " summary: A merged, multi-sensor L4 Foundation SST anal...\n", " time_coverage_end: 20200108T210000Z\n", " time_coverage_start: 20200107T210000Z\n", " title: Daily MUR SST, Final product\n", " uuid: 27665bc0-d5fc-11e1-9b23-0800200c9a66\n", " westernmost_longitude: -180.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the actual sea surface temperature DataArray:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'analysed_sst' (time: 8, lat: 17999, lon: 36000)>\n",
       "dask.array<zarr, shape=(8, 17999, 36000), dtype=float32, chunksize=(1, 5000, 5000), chunktype=numpy.ndarray>\n",
       "Coordinates:\n",
       "  * lat      (lat) float32 -89.99 -89.98 -89.97 -89.96 ... 89.97 89.98 89.99\n",
       "  * lon      (lon) float32 -179.99 -179.98 -179.97 ... 179.98 179.99 180.0\n",
       "  * time     (time) datetime64[ns] 2020-01-01T09:00:00 ... 2020-01-08T09:00:00\n",
       "Attributes:\n",
       "    comment:        "Final" version using Multi-Resolution Variational Analys...\n",
       "    long_name:      analysed sea surface temperature\n",
       "    source:         MODIS_T-JPL, MODIS_A-JPL, AMSR2-REMSS, AVHRRMTA_G-NAVO, A...\n",
       "    standard_name:  sea_surface_foundation_temperature\n",
       "    units:          kelvin\n",
       "    valid_max:      32767\n",
       "    valid_min:      -32767
" ], "text/plain": [ "\n", "dask.array\n", "Coordinates:\n", " * lat (lat) float32 -89.99 -89.98 -89.97 -89.96 ... 89.97 89.98 89.99\n", " * lon (lon) float32 -179.99 -179.98 -179.97 ... 179.98 179.99 180.0\n", " * time (time) datetime64[ns] 2020-01-01T09:00:00 ... 2020-01-08T09:00:00\n", "Attributes:\n", " comment: \"Final\" version using Multi-Resolution Variational Analys...\n", " long_name: analysed sea surface temperature\n", " source: MODIS_T-JPL, MODIS_A-JPL, AMSR2-REMSS, AVHRRMTA_G-NAVO, A...\n", " standard_name: sea_surface_foundation_temperature\n", " units: kelvin\n", " valid_max: 32767\n", " valid_min: -32767" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.analysed_sst" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The representation already indicates that this DataArray (although only a tiny part of the full dataset) is quite big: 20.7 GB if loaded fully into memory at once, which would not fit in the memory of my laptop.\n", "\n", "The xarray.DataArray is now backed by a dask array instead of a numpy array. This allows us to do computations on the large data in a *chunked* way: dask processes the data chunk by chunk (here in chunks of 1 x 5000 x 5000 values), so the full array never has to be in memory at once." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, let's compute the overall average temperature for the full globe for each timestep:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'analysed_sst' (time: 8)>\n",
       "dask.array<mean_agg-aggregate, shape=(8,), dtype=float32, chunksize=(1,), chunktype=numpy.ndarray>\n",
       "Coordinates:\n",
       "  * time     (time) datetime64[ns] 2020-01-01T09:00:00 ... 2020-01-08T09:00:00
" ], "text/plain": [ "\n", "dask.array\n", "Coordinates:\n", " * time (time) datetime64[ns] 2020-01-01T09:00:00 ... 2020-01-08T09:00:00" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "overall_mean = ds.analysed_sst.mean(dim=(\"lon\", \"lat\"))\n", "overall_mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returned a lazy object: the actual average has not been computed yet. Let's explicitly compute it:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 56s, sys: 46.3 s, total: 2min 42s\n", "Wall time: 31.5 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'analysed_sst' (time: 8)>\n",
       "array([287.08176, 287.08545, 287.0962 , 287.09042, 287.08246, 287.07053,\n",
       "       287.08984, 287.1125 ], dtype=float32)\n",
       "Coordinates:\n",
       "  * time     (time) datetime64[ns] 2020-01-01T09:00:00 ... 2020-01-08T09:00:00
" ], "text/plain": [ "\n", "array([287.08176, 287.08545, 287.0962 , 287.09042, 287.08246, 287.07053,\n", " 287.08984, 287.1125 ], dtype=float32)\n", "Coordinates:\n", " * time (time) datetime64[ns] 2020-01-01T09:00:00 ... 2020-01-08T09:00:00" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time \n", "overall_mean.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This takes some time, but it *did* run on my laptop even though the dataset does not fit in its memory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By integrating with hvplot and datashader, we can still interactively plot and explore the large dataset:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import hvplot.xarray" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "ds.analysed_sst.isel(time=-1).hvplot.quadmesh(\n", " 'lon', 'lat', rasterize=True, dynamic=True,\n", " width=800, height=450, cmap='RdBu_r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When zooming in on this figure, the subset we are viewing is re-read and rasterized to provide a higher-resolution image." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**In summary**, using dask with xarray allows us to:\n", "\n", "- use the familiar xarray workflows for larger data as well\n", "- use the same code to work on our laptop or on a big cluster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "# PANGEO: A community platform for Big Data geoscience\n", "\n", "\n", "
\n", "\n", "Website: https://pangeo.io/index.html\n", "\n", "They have a gallery with many interesting examples, many of them using this combination of xarray and dask.\n", "\n", "Pangeo focuses primarily on *cloud computing* (storing the big datasets in cloud-native file formats and also doing the computations in the cloud), but all the tools like xarray and dask developed by this community and shown in the examples also work on your laptop or university's cluster.\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }