{ "cells": [ { "cell_type": "markdown", "id": "serious-stanford", "metadata": {}, "source": [ "# Topic 3: Data Discovery: STAC & CMR-STAC API" ] }, { "cell_type": "markdown", "id": "civil-portsmouth", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "norwegian-packing", "metadata": {}, "source": [ "## Summary\n", "\n", "In this example we will discover and access NASA's Harmonized Landsat Sentinel-2 (HLS) version 2 assets, which are archived in Cloud Optimized GeoTIFF (COG) format in the LP DAAC Cumulus cloud space. COGs can be used like any other GeoTIFF file, but have added features that make them more efficient within the cloud data access paradigm, namely overviews and internal tiling. Below we will demonstrate how to leverage these features.\n", "\n", "First we need to find the data. Specifically, we want to find data that intersects our region of interest within our desired date range. We will use [SpatioTemporal Asset Catalog (STAC)](https://stacspec.org/) and the CMR-STAC API to identify the data assets that meet our search criteria.\n", "\n", "### What is STAC? \n", "\n", "[SpatioTemporal Asset Catalog (STAC)](https://stacspec.org/) is a specification that provides a common language for interpreting geospatial information in order to standardize indexing and discovering data. 
\n", "\n", "The [STAC specification](https://stacspec.org/core.html) is made up of a collection of related, yet independent specifications that, when used together, provide search and discovery capabilities for remote assets.\n", "\n", "#### Four STAC Specifications \n", "\n", "- [STAC Item](https://github.com/radiantearth/stac-spec/blob/master/item-spec/item-spec.md)\n", "- [STAC Catalog](https://github.com/radiantearth/stac-spec/blob/master/catalog-spec/catalog-spec.md)\n", "- [STAC Collection](https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md)\n", "- [STAC API](https://github.com/radiantearth/stac-api-spec)\n", "\n", "In the following sections, we will explore each STAC element using NASA's Common Metadata Repository (CMR) STAC application programming interface (API), or [CMR-STAC API](https://github.com/nasa/cmr-stac) for short. \n", "\n", "### CMR-STAC API \n", "\n", "The [CMR-STAC](https://github.com/nasa/cmr-stac) API is NASA's implementation of the STAC API specification for all NASA data holdings within EOSDIS. The current implementation does not allow for queries across the entire NASA catalog. Users must execute searches within provider catalogs (e.g., LPCLOUD) to find the STAC Items they are searching for. All the providers can be found at the CMR-STAC endpoint here: <https://cmr.earthdata.nasa.gov/stac>. \n", "\n", "In this exercise, we will query the **LPCLOUD** provider to identify STAC Items from the Harmonized Landsat Sentinel-2 (HLS) collection that fall within our region of interest (ROI) and within our specified time range." 
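, "\n", "\n",
"As a minimal sketch of what searching within a provider catalog means in practice, the provider name is appended to the CMR-STAC root to form that provider's catalog URL. The variable names are illustrative, and the `/search` path is the item-search path from the STAC API specification; it is assumed here, not verified, to apply to every provider.\n", "\n",
"```python\n",
"# Root of the CMR-STAC API; it lists the available provider catalogs\n",
"stac_root = 'https://cmr.earthdata.nasa.gov/stac'\n",
"provider = 'LPCLOUD'\n",
"\n",
"# Each provider has its own catalog URL beneath the root\n",
"provider_catalog = f'{stac_root}/{provider}'\n",
"\n",
"# Assumed item-search path, following the STAC API specification\n",
"search_endpoint = f'{provider_catalog}/search'\n",
"\n",
"print(provider_catalog)\n",
"print(search_endpoint)\n",
"```"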
] }, { "cell_type": "markdown", "id": "altered-lucas", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "seven-touch", "metadata": {}, "source": [ "## Exercise" ] }, { "cell_type": "markdown", "id": "racial-crawford", "metadata": {}, "source": [ "### Import Required Packages" ] }, { "cell_type": "code", "execution_count": null, "id": "collective-arthritis", "metadata": {}, "outputs": [], "source": [ "import requests\n", "from pystac_client import Client # https://pystac-client.readthedocs.io/en/latest/index.html \n", "from collections import defaultdict \n", "import json\n", "import geopandas\n", "import geoviews as gv\n", "from cartopy import crs\n", "gv.extension('bokeh', 'matplotlib')" ] }, { "cell_type": "markdown", "id": "6162ca89-5899-4a41-99a7-e13f0c7f5658", "metadata": {}, "source": [ "### Submit `GET` request to the CMR STAC API\n", "\n", "Use the `requests` package to submit a `GET` request to the CMR STAC API. We'll parse the response and extract the information we need to navigate the STAC Catalog." ] }, { "cell_type": "code", "execution_count": null, "id": "d5bf762e-a9db-4efe-ae22-b5c3180bcbe0", "metadata": {}, "outputs": [], "source": [ "stac_url = 'https://cmr.earthdata.nasa.gov/stac'" ] }, { "cell_type": "code", "execution_count": null, "id": "1589df27-c851-451a-9f31-56b772c6a532", "metadata": {}, "outputs": [], "source": [ "provider_cat = requests.get(stac_url)" ] }, { "cell_type": "markdown", "id": "6bc30baa-3068-4885-8fb1-b94af3ebf143", "metadata": {}, "source": [ "The CMR STAC API endpoint lists the available providers. Each **provider** is a separate STAC Catalog endpoint against which spatiotemporal queries can be submitted." 
] }, { "cell_type": "code", "execution_count": null, "id": "87bba091-fac5-43cb-abdc-69bbcaeebd2e", "metadata": {}, "outputs": [], "source": [ "providers = [p['title'] for p in provider_cat.json()['links'] if 'child' in p['rel']]\n", "providers" ] }, { "cell_type": "markdown", "id": "0403d3e7-64ae-4e75-a53e-de6c7204083a", "metadata": {}, "source": [ "### Explore the `LPCLOUD` provider" ] }, { "cell_type": "code", "execution_count": null, "id": "7d5dc819-89f7-408f-8ab7-c35cd7103e03", "metadata": {}, "outputs": [], "source": [ "provider = 'LPCLOUD'" ] }, { "cell_type": "code", "execution_count": null, "id": "0aed3b74-fc0f-4519-8e07-c0c2fe1f2c3f", "metadata": {}, "outputs": [], "source": [ "provider_url = f'{stac_url}/{provider}'\n", "provider_url" ] }, { "cell_type": "code", "execution_count": null, "id": "a8589e46-e7da-4d27-9554-d474f7f0192b", "metadata": { "tags": [] }, "outputs": [], "source": [ "cat = requests.get(provider_url)\n", "cat.json()" ] }, { "cell_type": "markdown", "id": "d67e44c8-d270-4724-8aaf-a6d2f17ea3b7", "metadata": {}, "source": [ "### List the STAC Collections within the `LPCLOUD` Catalog" ] }, { "cell_type": "code", "execution_count": null, "id": "e639c1de-1fd9-4bf3-ad36-430cdeffd0c7", "metadata": { "tags": [] }, "outputs": [], "source": [ "cols = [{l['href'].split('/')[-1]: l['href']} for l in cat.json() ['links'] if 'child' in l['rel']]\n", "for c in cols:\n", " print(c)" ] }, { "cell_type": "markdown", "id": "1bbf6365-71fc-4499-8a14-ce07ae1351b8", "metadata": {}, "source": [ "**Notice** that only 10 collections are returned here, but the `LPCLOUD` provider has over 100 data products available in Earthdata Cloud. This is because the CMR STAC API returns 10 collections by default. A `limit` parameter can be added to the end of the `LPCLOUD` STAC API endpoint to increase the number of collections returned at one time, but this can sometimes be time consuming. 
Here we'll follow each `next` page link, making a separate request for every additional page of 10 collections." ] }, { "cell_type": "code", "execution_count": null, "id": "99a517eb-e7fd-44c2-b9fd-e6f35d81e46e", "metadata": {}, "outputs": [], "source": [ "try:\n", "    print(\"Requesting page:\")\n", "    # Indexing [0] raises IndexError once a page has no 'next' link, which ends the loop\n", "    while nxt_pg := [l for l in cat.json()['links'] if 'next' in l['rel']][0]:\n", "        print(f\"{nxt_pg['href'].split('=')[-1]}...\", end = ' ')\n", "        cat = requests.get(nxt_pg['href'])\n", "        cols.extend([{l['href'].split('/')[-1]: l['href']} for l in cat.json()['links'] if 'child' in l['rel']])\n", "except IndexError:\n", "    print('No additional pages')" ] }, { "cell_type": "markdown", "id": "420a85ff-f60d-4bdb-a54d-5564962a60f5", "metadata": {}, "source": [ "### Print all available STAC Collections within the `LPCLOUD` Catalog" ] }, { "cell_type": "code", "execution_count": null, "id": "d9d1295b-d620-4702-8786-b5025071b5b6", "metadata": {}, "outputs": [], "source": [ "print(f'LPCLOUD has {len(cols)} Collections')" ] }, { "cell_type": "code", "execution_count": null, "id": "5e9cdd05-6350-4790-95e5-38dd050ac3b2", "metadata": {}, "outputs": [], "source": [ "for c in cols:\n", "    print(c)" ] }, { "cell_type": "markdown", "id": "c8cfa860-010f-4c3b-8689-6544db7fd063", "metadata": {}, "source": [ "### Get information about an individual Collection\n", "\n", "Below we'll specify a STAC Collection, `HLSL30.v2.0`, and request its STAC metadata." ] }, { "cell_type": "code", "execution_count": null, "id": "9806342a-5959-492a-9c12-1b48315a91dc", "metadata": {}, "outputs": [], "source": [ "collection = 'HLSL30.v2.0'" ] }, { "cell_type": "code", "execution_count": null, "id": "fe4630ed-c622-4463-8431-ae363c4eb0aa", "metadata": {}, "outputs": [], "source": [ "collection_link = list(filter(lambda c: collection == list(c.keys())[0], cols))[0]\n", "collection_link" ] }, { "cell_type": "code", "execution_count": null, "id": "e58c4b49-72d2-42d1-8b6d-c879deaa7435", "metadata": {}, "outputs": 
[], "source": [ "requests.get(collection_link[collection]).json()" ] }, { "cell_type": "markdown", "id": "packed-tournament", "metadata": {}, "source": [ "### Set up query parameters to submit to the CMR-STAC API\n", "\n", "For this next section we'll use a Python package called `pystac_client` to submit a spatiotemporal query for data assets across multiple Collections. We will define our area of interest using the GeoJSON file from the previous exercise, while also specifying the data collections and time range needed for our example." ] }, { "cell_type": "code", "execution_count": null, "id": "fitted-mixture", "metadata": {}, "outputs": [], "source": [ "field = geopandas.read_file('../data/ne_w_agfields.geojson')\n", "field" ] }, { "cell_type": "code", "execution_count": null, "id": "conservative-corpus", "metadata": {}, "outputs": [], "source": [ "fieldShape = field['geometry'][0]\n", "fieldShape" ] }, { "cell_type": "code", "execution_count": null, "id": "antique-event", "metadata": {}, "outputs": [], "source": [ "base = gv.tile_sources.EsriImagery.opts(width=650, height=500)\n", "farmField = gv.Polygons(fieldShape).opts(line_color='yellow', line_width=10, color=None)\n", "base * farmField" ] }, { "cell_type": "markdown", "id": "tamil-gossip", "metadata": {}, "source": [ "We will now specify the search criteria we are interested in, i.e., the **date range**, the **region of interest** (roi), and the **data collections**, to pass to the STAC API." ] }, { "cell_type": "markdown", "id": "mobile-arnold", "metadata": {}, "source": [ "#### Specify the region of interest" ] }, { "cell_type": "code", "execution_count": null, "id": "human-breathing", "metadata": {}, "outputs": [], "source": [ "roi = json.loads(field.to_json())['features'][0]['geometry']\n", "roi" ] }, { "cell_type": "markdown", "id": "informal-introduction", "metadata": {}, "source": [ "#### Specify date range" ] }, { "cell_type": "code", "execution_count": null, "id": "intensive-russell", 
"metadata": {}, "outputs": [], "source": [ "#date_range = \"2021-05-01T00:00:00Z/2021-08-30T23:59:59Z\" # closed interval\n", "#date_range = \"2021-05-01T00:00:00Z/..\" # open interval - does not currently work with the CMR-STAC API\n", "date_range = \"2021-05/2021-08\"" ] }, { "cell_type": "markdown", "id": "filled-security", "metadata": {}, "source": [ "#### Specify the STAC Collections\n", "\n", "**Note:** a STAC Collection is synonymous with what we usually consider a data product." ] }, { "cell_type": "code", "execution_count": null, "id": "forced-copying", "metadata": {}, "outputs": [], "source": [ "collections = ['HLSL30.v2.0', 'HLSS30.v2.0']\n", "collections" ] }, { "cell_type": "markdown", "id": "proper-generic", "metadata": {}, "source": [ "### Perform Search Against the CMR-STAC API" ] }, { "cell_type": "code", "execution_count": null, "id": "ee17694e-1b02-45ab-9b79-c077c6614d50", "metadata": {}, "outputs": [], "source": [ "catalog = Client.open(provider_url)" ] }, { "cell_type": "code", "execution_count": null, "id": "unknown-prairie", "metadata": {}, "outputs": [], "source": [ "search = catalog.search(\n", "    collections=collections,\n", "    intersects=roi,\n", "    datetime=date_range\n", ")" ] }, { "cell_type": "markdown", "id": "superior-pastor", "metadata": {}, "source": [ "#### Print out how many STAC Items match our search query" ] }, { "cell_type": "code", "execution_count": null, "id": "norwegian-blair", "metadata": {}, "outputs": [], "source": [ "search.matched()" ] }, { "cell_type": "markdown", "id": "through-whole", "metadata": {}, "source": [ "We now have a search object containing the STAC records that matched our query. 
Now, let's pull out all of the STAC Items (as a Python list of PySTAC Item objects) and explore the contents." ] }, { "cell_type": "code", "execution_count": null, "id": "turned-shelf", "metadata": { "tags": [] }, "outputs": [], "source": [ "item_collection = list(search.get_items())" ] }, { "cell_type": "code", "execution_count": null, "id": "likely-rolling", "metadata": {}, "outputs": [], "source": [ "item_collection[0:5]" ] }, { "cell_type": "markdown", "id": "original-enemy", "metadata": {}, "source": [ "#### Grab the first Item and print it out as a dictionary" ] }, { "cell_type": "code", "execution_count": null, "id": "legitimate-eligibility", "metadata": {}, "outputs": [], "source": [ "item_collection[0].to_dict()" ] }, { "cell_type": "markdown", "id": "animal-facility", "metadata": {}, "source": [ "### Filtering STAC Items\n", "\n", "While the CMR-STAC API is a powerful search and discovery utility, it is still maturing and currently does not have the full gamut of filtering capabilities that the STAC API specification allows for. Hence, additional filtering is required if we want to filter by a property such as cloud cover. Below we will loop through `item_collection`, keep only the Items that satisfy a specified cloud cover threshold, and extract the bands we need for an Enhanced Vegetation Index (EVI) calculation later." ] }, { "cell_type": "markdown", "id": "raised-desire", "metadata": {}, "source": [ "Now we will set the maximum allowable cloud cover and extract the band links for those Items whose cloud cover is at or below that maximum." 
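, "\n", "\n",
"As a minimal sketch of that rule, assuming each Item carries an `eo:cloud_cover` property (as the filtering code in this section does), the comparison can be tried on plain dictionaries first. The `sample_items` list and its values are hypothetical stand-ins, not results from the CMR-STAC API.\n", "\n",
"```python\n",
"# Hypothetical stand-ins for STAC Item records (not real search results)\n",
"sample_items = [\n",
"    {'id': 'item-a', 'properties': {'eo:cloud_cover': 12}},\n",
"    {'id': 'item-b', 'properties': {'eo:cloud_cover': 25}},\n",
"    {'id': 'item-c', 'properties': {'eo:cloud_cover': 60}},\n",
"]\n",
"\n",
"max_cloud_cover = 25  # percent; Items at or below this value are kept\n",
"\n",
"# Keep the ids of Items whose cloud cover does not exceed the threshold\n",
"kept = [i['id'] for i in sample_items if i['properties']['eo:cloud_cover'] <= max_cloud_cover]\n",
"print(kept)\n",
"```"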
] }, { "cell_type": "code", "execution_count": null, "id": "tracked-passing", "metadata": {}, "outputs": [], "source": [ "cloudcover = 25" ] }, { "cell_type": "markdown", "id": "hidden-nothing", "metadata": {}, "source": [ "We will also specify the STAC Assets (i.e., bands/layers) of interest for both the S30 and L30 collections" ] }, { "cell_type": "code", "execution_count": null, "id": "crude-convert", "metadata": {}, "outputs": [], "source": [ "s30_bands = ['B8A', 'B04', 'B02', 'Fmask'] # S30 bands for EVI calculation and quality filtering -> NIR, RED, BLUE, Quality \n", "l30_bands = ['B05', 'B04', 'B02', 'Fmask'] # L30 bands for EVI calculation and quality filtering -> NIR, RED, BLUE, Quality " ] }, { "cell_type": "code", "execution_count": null, "id": "early-letters", "metadata": { "tags": [] }, "outputs": [], "source": [ "evi_band_links = []\n", "\n", "for i in item_collection:\n", "    if i.properties['eo:cloud_cover'] <= cloudcover:\n", "        if i.collection_id == 'HLSS30.v2.0':\n", "            #print(i.properties['eo:cloud_cover'])\n", "            evi_bands = s30_bands\n", "        elif i.collection_id == 'HLSL30.v2.0':\n", "            #print(i.properties['eo:cloud_cover'])\n", "            evi_bands = l30_bands\n", "\n", "        for a in i.assets:\n", "            if any(b==a for b in evi_bands):\n", "                evi_band_links.append(i.assets[a].href)" ] }, { "cell_type": "code", "execution_count": null, "id": "other-thunder", "metadata": {}, "outputs": [], "source": [ "#len(evi_band_links)/4 # Print the number of Items that match our cloud criteria " ] }, { "cell_type": "markdown", "id": "infectious-premiere", "metadata": {}, "source": [ "The filtering done in the previous steps produces a list of links to STAC Assets. Let's print out the first ten links." 
] }, { "cell_type": "code", "execution_count": null, "id": "japanese-force", "metadata": {}, "outputs": [], "source": [ "evi_band_links[:10]" ] }, { "cell_type": "markdown", "id": "instructional-prime", "metadata": {}, "source": [ "**NOTICE** that the list of links includes multiple tiles, i.e., **T14TKL** & **T13TGF**, that intersect our region of interest. These tiles represent neighboring UTM zones. We will split the list of links into separate lists for each tile." ] }, { "cell_type": "markdown", "id": "killing-shareware", "metadata": {}, "source": [ "We now have a list of links to data assets that meet our search and filtering criteria. The commands that follow will split this list into logical groupings using Python routines." ] }, { "cell_type": "markdown", "id": "approximate-massachusetts", "metadata": {}, "source": [ "### Split Data Links List into Logical Groupings" ] }, { "cell_type": "markdown", "id": "finite-doubt", "metadata": {}, "source": [ "Split by Universal Transverse Mercator (UTM) tile specified in the file name (e.g., T14TKL & T13TGF)" ] }, { "cell_type": "code", "execution_count": null, "id": "billion-chicago", "metadata": {}, "outputs": [], "source": [ "tile_dicts = defaultdict(list) # https://stackoverflow.com/questions/26367812/appending-to-list-in-python-dictionary" ] }, { "cell_type": "code", "execution_count": null, "id": "checked-microphone", "metadata": {}, "outputs": [], "source": [ "for l in evi_band_links:\n", "    tile = l.split('.')[-6]\n", "    tile_dicts[tile].append(l)" ] }, { "cell_type": "markdown", "id": "acceptable-feedback", "metadata": {}, "source": [ "#### Print dictionary keys and values, i.e. 
the data links" ] }, { "cell_type": "code", "execution_count": null, "id": "suited-setup", "metadata": {}, "outputs": [], "source": [ "tile_dicts.keys()" ] }, { "cell_type": "code", "execution_count": null, "id": "north-frank", "metadata": {}, "outputs": [], "source": [ "tile_dicts['T13TGF'][:5]" ] }, { "cell_type": "markdown", "id": "imported-restriction", "metadata": {}, "source": [ "Now we will create a separate list of data links for each tile" ] }, { "cell_type": "code", "execution_count": null, "id": "absent-small", "metadata": {}, "outputs": [], "source": [ "tile_links_T14TKL = tile_dicts['T14TKL']\n", "tile_links_T13TGF = tile_dicts['T13TGF']" ] }, { "cell_type": "markdown", "id": "communist-utility", "metadata": {}, "source": [ "#### Print band/layer links for HLS tile T13TGF" ] }, { "cell_type": "code", "execution_count": null, "id": "mental-pontiac", "metadata": {}, "outputs": [], "source": [ "tile_links_T13TGF[:10]" ] }, { "cell_type": "markdown", "id": "immediate-corporation", "metadata": {}, "source": [ "#### Split the links by band" ] }, { "cell_type": "code", "execution_count": null, "id": "blind-hudson", "metadata": {}, "outputs": [], "source": [ "bands_dicts = defaultdict(list)" ] }, { "cell_type": "code", "execution_count": null, "id": "suffering-picture", "metadata": {}, "outputs": [], "source": [ "for b in tile_links_T13TGF:\n", "    band = b.split('.')[-2]\n", "    bands_dicts[band].append(b)" ] }, { "cell_type": "code", "execution_count": null, "id": "olive-transport", "metadata": {}, "outputs": [], "source": [ "bands_dicts.keys()" ] }, { "cell_type": "code", "execution_count": null, "id": "adjustable-income", "metadata": {}, "outputs": [], "source": [ "bands_dicts['B04']" ] }, { "cell_type": "markdown", "id": "ceramic-justice", "metadata": {}, "source": [ "### Save links to a text file\n", "\n", "To finish off this exercise, we will save the individual link lists as separate text files with descriptive names." 
] }, { "cell_type": "markdown", "id": "phantom-indonesia", "metadata": {}, "source": [ "#### Write links from CMR-STAC API to a file" ] }, { "cell_type": "code", "execution_count": null, "id": "special-berlin", "metadata": {}, "outputs": [], "source": [ "for k, v in bands_dicts.items():\n", "    name = (f'HTTPS_T13TGF_{k}_Links.txt')\n", "    with open(f'../data/{name}', 'w') as f:\n", "        for l in v:\n", "            f.write(f\"{l}\" + '\\n')" ] }, { "cell_type": "markdown", "id": "trained-brush", "metadata": {}, "source": [ "#### Write links to file for S3 access" ] }, { "cell_type": "code", "execution_count": null, "id": "olympic-cause", "metadata": {}, "outputs": [], "source": [ "for k, v in bands_dicts.items():\n", "    name = (f'S3_T13TGF_{k}_Links.txt')\n", "    with open(f'../data/{name}', 'w') as f:\n", "        for l in v:\n", "            s3l = l.replace('https://data.lpdaac.earthdatacloud.nasa.gov/', 's3://')\n", "            f.write(f\"{s3l}\" + '\\n')" ] }, { "cell_type": "markdown", "id": "assured-money", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "necessary-subscriber", "metadata": {}, "source": [ "## Resources\n", "\n", "- https://github.com/nasa/cmr-stac\n", "- https://stacspec.org/\n", "- https://stackoverflow.com/questions/26367812/appending-to-list-in-python-dictionary\n", "- https://pystac-client.readthedocs.io/en/latest/index.html\n", "- https://pystac.readthedocs.io/en/1.0/" ] }, { "cell_type": "markdown", "id": "f0ebf970-a928-4b6a-af09-59ba0e519493", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "3fd34b74-24c2-46b4-a5ae-8fce19915003", "metadata": {}, "source": [ "## [Next: Topic 4 - Data Access - Direct S3 Access](Topic_4__Data_Proximate_Compute.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "py39", "language": "python", "name": "py39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", 
"pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 5 }