{ "cells": [ { "cell_type": "markdown", "id": "d36845bd-4c34-4703-ada7-edb5f55ba1de", "metadata": {}, "source": [ "## S3 access for cloud-enabled datasets: HDF5 example\n", "\n", "If a dataset is hosted in the cloud, a set of libraries lets us access it directly. If the data is in a cloud-friendly format, we can efficiently load only the parts we need. If not, we may need to read entire files.\n", "\n", "Dependencies:\n", "\n", "* [Valid `.netrc`](https://github.com/NASA-Openscapes/2021-Cloud-Hackathon/blob/main/tutorials/04_NASA_Earthdata_Authentication.ipynb) file in the home directory\n", "* Running this code in AWS **us-west-2** (as we are in the Openscapes hub)\n", "\n", "Glossary\n", "\n", "* Collection: a dataset in NASA's Common Metadata Repository (CMR), identified by a concept ID\n", "* Granule: a single data file within a collection\n", "* S3: Amazon Simple Storage Service, the AWS object storage service\n", "* S3 bucket: a named container that holds objects (files) in S3\n", "\n", "Workflow\n", "\n", "* Search for a cloud-hosted dataset stored in HDF5 files (not in this notebook: there is no easy way to do this with CMR)\n", "* Get credentials for the DAAC that hosts the data\n", "* Search for granules with specified parameters\n", "* Set up the query to CMR and explore the metadata to find direct links\n", "* Get the URLs of the data\n", "* Open one granule (file) on S3 to explore the data\n", "* Explore the data file hierarchy and choose a variable of interest\n", "* Get the variables we want out of the h5 file\n", "* Make plots\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "54f325cc-792d-49a7-8a66-6622ebba7e77", "metadata": {}, "outputs": [], "source": [ "import requests\n", "from pprint import pprint\n", "from pathlib import Path\n", "import s3fs\n", "import fsspec\n", "\n", "# added these two for the h5 example\n", "import numpy\n", "import h5py\n", "\n", "import xarray as xr\n", "\n", "import matplotlib.pyplot as plt\n", "import cartopy.crs as ccrs" ] }, { "cell_type": "code", "execution_count": 2, "id": "1826a8d6-2f80-4a77-8813-4a3a091c8001", "metadata": {}, "outputs": [], "source": [ "# Set the endpoints and credentials\n", "# Here we already know which dataset we want and its concept 
ID\n", "\n", "# This endpoint is specific to a DAAC; if we want cloud data from a different DAAC\n", "# we need to change it. See: https://raw.githubusercontent.com/betolink/earthdata/main/earthdata/daac.py\n", "s3_cred_endpoint = 'https://data.ornldaac.earthdata.nasa.gov/s3credentials'\n", "\n", "cmr_search_url = 'https://cmr.earthdata.nasa.gov/search'\n", "cmr_granule_url = f'{cmr_search_url}/granules'\n", "\n", "\n", "def get_temp_creds():\n", "    # Request temporary S3 credentials from the DAAC endpoint\n", "    temp_creds_url = s3_cred_endpoint\n", "    return requests.get(temp_creds_url).json()\n", "\n", "temp_creds_req = get_temp_creds()\n", "\n", "s3_fs = s3fs.S3FileSystem(\n", "    key=temp_creds_req['accessKeyId'],\n", "    secret=temp_creds_req['secretAccessKey'],\n", "    token=temp_creds_req['sessionToken'],\n", "    client_kwargs={'region_name': 'us-west-2'},\n", ")" ] }, { "cell_type": "code", "execution_count": 3, "id": "1dc7aeb6-bba0-4a1a-91aa-3b989461c7e4", "metadata": {}, "outputs": [], "source": [ "# Search for granules with specified parameters\n", "\n", "# In this case we know the dataset ID in advance, but we could use CMR to look for one\n", "concept_id = 'C2114031882-ORNL_CLOUD'\n", "# Bounding box spatial parameter in decimal degrees, in 'W,S,E,N' format.\n", "# If the granules are global, the bbox has no effect because it matches every granule.\n", "# Note: the west value (-105) is greater than the east value (-125);\n", "# swap them if a box between those two longitudes is intended.\n", "bounding_box = '-105,21,-125,32'\n", "# Each date in yyyy-MM-ddTHH:mm:ssZ format; date range in start,end format\n", "temporal = '2019-06-22T00:00:00Z,2019-06-23T23:59:59Z'" ] }, { "cell_type": "code", "execution_count": 4, "id": "0ac89aea-1c90-4a97-bac0-9fe3cc9f9d05", "metadata": {}, "outputs": [], "source": [ "# Set up the query to CMR and explore the metadata to find direct links\n", "\n", "\n", "response = requests.get(\n", "    cmr_granule_url,\n", "    params={\n", "        'concept_id': concept_id,\n", "        'temporal': temporal,\n", "        'bounding_box': bounding_box,\n", "        'page_size': 200,\n", "    },\n", "    headers={\n", "        'Accept': 'application/json'\n", "    }\n", ")\n", "\n", "# These are the metadata records 
for each granule; this is where we get our links to the data\n", "# If the data is in the cloud, the link will start with s3://\n", "granules = response.json()['feed']['entry']\n", "#urls = []\n", "\n", "# Uncomment to inspect the links and find the ones that point to the S3 data\n", "#for granule in granules:\n", "#    print(granule['links'])\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "ed3ad6c4-63ee-45f3-9d45-5669036ef5f7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['s3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173011100_O02969_T02656_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173024347_O02970_T01081_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173041633_O02971_T01082_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173054919_O02972_T05352_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173072205_O02973_T05353_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173085451_O02974_T02508_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173102737_O02975_T03779_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173120023_O02976_T00934_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173133309_O02977_T00935_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173150556_O02978_T05205_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173181128_O02980_T02361_02_001_01.h5',\n", " 
's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173194414_O02981_T00786_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173211700_O02982_T00787_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019173224946_O02983_T05057_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174002232_O02984_T02212_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174015518_O02985_T03483_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174032805_O02986_T03484_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174050051_O02987_T00639_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174063337_O02988_T04909_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174080623_O02989_T02064_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174093909_O02990_T02065_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174111155_O02991_T03336_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174124441_O02992_T03337_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174141727_O02993_T00492_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174172300_O02995_T03187_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174185546_O02996_T03188_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174202832_O02997_T00343_02_001_01.h5',\n", " 
's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174220118_O02998_T00344_02_001_01.h5',\n", " 's3://ornl-cumulus-prod-protected/gedi/GEDI_L4A_AGB_Density/data/GEDI04_A_2019174233404_O02999_T04614_02_001_01.h5']\n" ] } ], "source": [ "# From inspecting the links, we know the S3 links have a 'rel' value ending in '/s3#'\n", "# Get the URLs of the data\n", "\n", "urls = []\n", "for granule in granules:\n", "    for link in granule['links']:\n", "        if link['rel'].endswith('/s3#'):\n", "            # keep the first S3 link of each granule\n", "            urls.append(link['href'])\n", "            break\n", "pprint(urls)" ] }, { "cell_type": "code", "execution_count": 6, "id": "7a60481d-507e-44a1-b472-99b31a5183c4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2 µs, sys: 0 ns, total: 2 µs\n", "Wall time: 6.91 µs\n" ] }, { "data": { "text/html": [ "
<xarray.Dataset>\n", "Dimensions: ()\n", "Data variables:\n", " *empty*\n", "Attributes:\n", " short_name: GEDI_L4A
<xarray.Dataset>\n", "Dimensions: (segment: 364708)\n", "Dimensions without coordinates: segment\n", "Data variables:\n", " lat_lowestmode (segment) float64 -43.12 -43.12 -43.12 ... -51.81 -51.81\n", " lon_lowestmode (segment) float64 0.1851 0.1857 0.1863 ... -63.85 -63.85\n", " agbd (segment) float32 -9.999e+03 -9.999e+03 ... -9.999e+03
array([-43.11873616, -43.11845488, -43.11817365, ..., -51.805011 ,\n", " -51.80500994, -51.80500883])
array([ 0.18511495, 0.18569463, 0.18627412, ..., -63.84809997,\n", " -63.84727722, -63.84645444])
array([-9999., -9999., -9999., ..., -9999., -9999., -9999.], dtype=float32)