{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true }, "source": [ "
Working with big data: xarray and dask (DEMO)
\n", "\n", "\n", "> *DS Python for GIS and Geoscience* \n", "> *October, 2020*\n", ">\n", "> *© 2020, Joris Van den Bossche and Stijn Van Hoey. Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*\n", "\n", "---\n", "\n", "Throughout the course, we worked with small, often simplified or subsampled data. The tools we have seen work well as long as the data fit comfortably in memory. For data larger than memory (e.g. large or high-resolution climate data), we can still use many of the same tools.\n", "\n", "This notebook gives a brief showcase of using xarray with dask, a package for scaling Python workflows (https://dask.org/). Dask integrates well with xarray, so the familiar xarray workflow carries over to large datasets processed in parallel or on a cluster." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n",
"Client\n", "
<xarray.Dataset>\n", "Dimensions: (lat: 17999, lon: 36000, time: 8)\n", "Coordinates:\n", " * lat (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99\n", " * lon (lon) float32 -179.99 -179.98 -179.97 ... 179.98 179.99 180.0\n", " * time (time) datetime64[ns] 2020-01-01T09:00:00 ... 2020-01-08T09...\n", "Data variables:\n", " analysed_sst (time, lat, lon) float32 dask.array<chunksize=(1, 5000, 5000), meta=np.ndarray>\n", "Attributes:\n", " Conventions: CF-1.7\n", " Metadata_Conventions: Unidata Observation Dataset v1.0\n", " acknowledgment: Please acknowledge the use of these data with...\n", " cdm_data_type: grid\n", " comment: MUR = "Multi-scale Ultra-high Resolution"\n", " creator_email: ghrsst@podaac.jpl.nasa.gov\n", " creator_name: JPL MUR SST project\n", " creator_url: http://mur.jpl.nasa.gov\n", " date_created: 20200124T151031Z\n", " easternmost_longitude: 180.0\n", " file_quality_level: 3\n", " gds_version_id: 2.0\n", " geospatial_lat_resolution: 0.009999999776482582\n", " geospatial_lat_units: degrees north\n", " geospatial_lon_resolution: 0.009999999776482582\n", " geospatial_lon_units: degrees east\n", " history: created at nominal 4-day latency; replaced nr...\n", " id: MUR-JPL-L4-GLOB-v04.1\n", " institution: Jet Propulsion Laboratory\n", " keywords: Oceans > Ocean Temperature > Sea Surface Temp...\n", " keywords_vocabulary: NASA Global Change Master Directory (GCMD) Sc...\n", " license: These data are available free of charge under...\n", " metadata_link: http://podaac.jpl.nasa.gov/ws/metadata/datase...\n", " naming_authority: org.ghrsst\n", " netcdf_version_id: 4.1\n", " northernmost_latitude: 90.0\n", " platform: Terra, Aqua, GCOM-W, MetOp-A, MetOp-B, Buoys/...\n", " processing_level: L4\n", " product_version: 04.1\n", " project: NASA Making Earth Science Data Records for Us...\n", " publisher_email: ghrsst-po@nceo.ac.uk\n", " publisher_name: GHRSST Project Office\n", " publisher_url: http://www.ghrsst.org\n", " references: 
http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...\n", " sensor: MODIS, AMSR2, AVHRR, in-situ\n", " source: MODIS_T-JPL, MODIS_A-JPL, AMSR2-REMSS, AVHRRM...\n", " southernmost_latitude: -90.0\n", " spatial_resolution: 0.01 degrees\n", " standard_name_vocabulary: NetCDF Climate and Forecast (CF) Metadata Con...\n", " start_time: 20200108T090000Z\n", " stop_time: 20200108T090000Z\n", " summary: A merged, multi-sensor L4 Foundation SST anal...\n", " time_coverage_end: 20200108T210000Z\n", " time_coverage_start: 20200107T210000Z\n", " title: Daily MUR SST, Final product\n", " uuid: 27665bc0-d5fc-11e1-9b23-0800200c9a66\n", " westernmost_longitude: -180.0
array([-89.99, -89.98, -89.97, ..., 89.97, 89.98, 89.99], dtype=float32)
array([-179.99, -179.98, -179.97, ..., 179.98, 179.99, 180. ],\n", " dtype=float32)
array(['2020-01-01T09:00:00.000000000', '2020-01-02T09:00:00.000000000',\n", " '2020-01-03T09:00:00.000000000', '2020-01-04T09:00:00.000000000',\n", " '2020-01-05T09:00:00.000000000', '2020-01-06T09:00:00.000000000',\n", " '2020-01-07T09:00:00.000000000', '2020-01-08T09:00:00.000000000'],\n", " dtype='datetime64[ns]')
<xarray.DataArray 'analysed_sst' (time: 8, lat: 17999, lon: 36000)>\n", "dask.array<zarr, shape=(8, 17999, 36000), dtype=float32, chunksize=(1, 5000, 5000), chunktype=numpy.ndarray>\n", "Coordinates:\n", " * lat (lat) float32 -89.99 -89.98 -89.97 -89.96 ... 89.97 89.98 89.99\n", " * lon (lon) float32 -179.99 -179.98 -179.97 ... 179.98 179.99 180.0\n", " * time (time) datetime64[ns] 2020-01-01T09:00:00 ... 2020-01-08T09:00:00\n", "Attributes:\n", " comment: "Final" version using Multi-Resolution Variational Analys...\n", " long_name: analysed sea surface temperature\n", " source: MODIS_T-JPL, MODIS_A-JPL, AMSR2-REMSS, AVHRRMTA_G-NAVO, A...\n", " standard_name: sea_surface_foundation_temperature\n", " units: kelvin\n", " valid_max: 32767\n", " valid_min: -32767
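The repr above already shows why this array does not fit comfortably in memory. As a rough back-of-the-envelope sketch (plain Python; it uses only the shape, dtype and chunk size printed in the repr, and the variable names are illustrative), the total size and the number of dask chunks can be worked out as follows. The full array comes to about 20.7 GB, while a single full-size chunk is 100 MB:

```python
import math

# Values copied from the DataArray repr above
shape = (8, 17999, 36000)    # (time, lat, lon)
chunks = (1, 5000, 5000)     # dask chunk size along each dimension
itemsize = 4                 # float32 takes 4 bytes per value

# Total size of the array if it were loaded into memory at once
n_elements = math.prod(shape)
total_bytes = n_elements * itemsize

# Size of one full chunk (chunks at the array edges are smaller)
chunk_bytes = math.prod(chunks) * itemsize

# Number of chunks along each dimension, rounding up for partial edge chunks
chunks_per_dim = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
n_chunks = math.prod(chunks_per_dim)

print(f"total size      : {total_bytes / 1e9:.2f} GB")
print(f"one full chunk  : {chunk_bytes / 1e6:.0f} MB")
print(f"chunks per dim  : {chunks_per_dim}")
print(f"number of chunks: {n_chunks}")
```

Each chunk is small enough to be processed on its own, which is what lets dask work through the array piece by piece instead of loading it whole.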
<xarray.DataArray 'analysed_sst' (time: 8)>\n", "dask.array<mean_agg-aggregate, shape=(8,), dtype=float32, chunksize=(1,), chunktype=numpy.ndarray>\n", "Coordinates:\n", " * time (time) datetime64[ns] 2020-01-01T09:00:00 ... 2020-01-08T09:00:00
<xarray.DataArray 'analysed_sst' (time: 8)>\n", "array([287.08176, 287.08545, 287.0962 , 287.09042, 287.08246, 287.07053,\n", " 287.08984, 287.1125 ], dtype=float32)\n", "Coordinates:\n", " * time (time) datetime64[ns] 2020-01-01T09:00:00 ... 2020-01-08T09:00:00
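The two outputs above differ in one important way: the first mean is still a lazy `dask.array`, and the second holds actual numbers. The cells that produced them are not shown here, so the following is only a minimal sketch of that lazy-then-compute pattern. It uses a small synthetic array as a stand-in for the real sea surface temperature data (the random values, the small shape and the chunk sizes are assumptions for illustration), and it assumes xarray and dask are installed:

```python
import dask.array as da
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the SST data: 8 daily time steps on a tiny grid
rng = np.random.default_rng(0)
values = rng.normal(287.0, 1.0, size=(8, 40, 80)).astype("float32")
time = pd.date_range("2020-01-01 09:00", periods=8, freq="D")
sst = xr.DataArray(
    values, dims=("time", "lat", "lon"), coords={"time": time}, name="analysed_sst"
)

# Split the array into dask chunks: one time step per chunk, tiled in space
sst_chunked = sst.chunk({"time": 1, "lat": 20, "lon": 20})

# Building the reduction is lazy: this only records a task graph
lazy_mean = sst_chunked.mean(dim=("lat", "lon"))
print(type(lazy_mean.data))   # a dask array, nothing has been computed yet

# compute() runs the graph chunk by chunk and returns a NumPy-backed result
result = lazy_mean.compute()
print(result.values)
```

On the real dataset the pattern is the same, but `compute()` is the step where the chunks are actually read and reduced, so that is where the time is spent.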