{ "cells": [ { "cell_type": "markdown", "id": "standing-configuration", "metadata": {}, "source": [ "# Topic 1: AWS Data Access" ] }, { "cell_type": "markdown", "id": "excessive-london", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "alleged-practice", "metadata": {}, "source": [ "AWS buckets can be configured to control access to the data they contain. Buckets can be configured to be **completely free and open** to users, like the [Sentinel-2 Cloud Optimized GeoTIFF](https://registry.opendata.aws/sentinel-2-l2a-cogs/) data within [Open Data on AWS](https://registry.opendata.aws/). Buckets can also be **free and open, but require authentication**, like NASA assets stored in the NASA Cumulus cloud space. Others may require users to confirm that they will pay for data that is pulled out of an AWS Region (i.e., **requester pays**). Below we will walk through how to access buckets with different configurations, and show how to access assets via HTTPS and directly via S3 (when that is an option)." ] }, { "cell_type": "markdown", "id": "green-arrival", "metadata": {}, "source": [ "## Import Required Packages" ] }, { "cell_type": "code", "execution_count": null, "id": "automotive-celebrity", "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import requests\n", "import boto3\n", "import rasterio as rio # https://rasterio.readthedocs.io/en/latest/\n", "from rasterio.plot import show\n", "from rasterio.session import AWSSession\n", "from osgeo import gdal\n", "import rioxarray # https://corteva.github.io/rioxarray/stable/index.html" ] }, { "cell_type": "markdown", "id": "statistical-enforcement", "metadata": {}, "source": [ "## Quick GDAL Detour \n", "\n", "#### GDAL environment variables \n", "\n", "GDAL is a foundational piece of geospatial software that is leveraged by many popular geospatial software packages, both open source and closed source. The `rasterio` package is no exception. 
Rasterio leverages GDAL to, among other things, read and write raster data files, e.g., GeoTIFFs/Cloud Optimized GeoTIFFs. To read remote files, i.e., files/objects stored in the cloud, GDAL uses its [Virtual File System](https://gdal.org/user/virtual_file_systems.html) API. In a perfect world, one would be able to point a Virtual File System (there are several) at a remote data asset and have the asset retrieved, but that is not always the case. GDAL has a host of [configuration options](https://gdal.org/user/configoptions.html)/environment variables that adjust its behavior to, for example, make a request more performant or to pass AWS credentials to the distribution system. Below, we'll identify the environment variables that will help us get our data from the cloud." ] }, { "cell_type": "markdown", "id": "informed-rehabilitation", "metadata": {}, "source": [ "### GDAL configuration options (https://gdal.org/user/configoptions.html)\n", "\n", "- GDAL_DISABLE_READDIR_ON_OPEN\n", "- CPL_VSIL_CURL_ALLOWED_EXTENSIONS\n", "- AWS_NO_SIGN_REQUEST\n", "- GDAL_MAX_RAW_BLOCK_CACHE_SIZE\n", "- GDAL_SWATH_SIZE\n", "- VSI_CURL_CACHE_SIZE\n", "- GDAL_HTTP_COOKIEFILE\n", "- GDAL_HTTP_COOKIEJAR\n", "- GDAL_HTTP_UNSAFESSL # When using VPN" ] }, { "cell_type": "markdown", "id": "hybrid-reunion", "metadata": {}, "source": [ "## Free & Open\n", "- No AWS account required\n", "- Access anytime and anywhere" ] }, { "cell_type": "code", "execution_count": null, "id": "pretty-sigma", "metadata": {}, "outputs": [], "source": [ "fo_url = 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/18/S/UH/2020/7/S2A_18SUH_20200729_0_L2A/B04.tif'" ] }, { "cell_type": "code", "execution_count": null, "id": "adequate-paste", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "with rio.Env(GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR', # https://rasterio.readthedocs.io/en/latest/topics/configuration.html\n", "             AWS_NO_SIGN_REQUEST='YES',\n", " 
CPL_VSIL_CURL_ALLOWED_EXTENSIONS='tif'):\n", "    with rio.open(fo_url) as src:\n", "        print(src.profile)\n", "        print(f'Overviews levels: {src.overviews(1)}')\n", "        print(type(src))\n", "\n", "        fig, ax = plt.subplots(1, figsize=(12, 12))\n", "        show(src)" ] }, { "cell_type": "markdown", "id": "activated-strand", "metadata": {}, "source": [ "## Free & Open with Authentication\n", "- No AWS account required\n", "- Access anytime and anywhere\n", "- For NASA data, authentication is done through Earthdata Login\n", "- Must have a netrc file with proper credentials in your user directory" ] }, { "cell_type": "code", "execution_count": null, "id": "hollywood-device", "metadata": {}, "outputs": [], "source": [ "foa_url = \"https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSS30.020/HLS.S30.T13TGF.2020274T174141.v2.0/HLS.S30.T13TGF.2020274T174141.v2.0.B04.tif\"" ] }, { "cell_type": "markdown", "id": "magnetic-traveler", "metadata": {}, "source": [ "**If you do not have a [netrc](https://git.earthdata.nasa.gov/projects/LPDUR/repos/daac_data_download_python/browse) file with proper credentials in your user directory and do not set the proper GDAL environment variables, you will get a fun/misleading error.**\n" ] }, { "cell_type": "markdown", "id": "republican-chapter", "metadata": {}, "source": [ "**In this example we need to set `GDAL_HTTP_COOKIEFILE` and `GDAL_HTTP_COOKIEJAR`. 
These two configuration settings, along with the options we set above, will allow us to access these data.**" ] }, { "cell_type": "code", "execution_count": null, "id": "celtic-wrestling", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "with rio.Env(GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR',\n", "             CPL_VSIL_CURL_ALLOWED_EXTENSIONS='tif',\n", "             GDAL_HTTP_COOKIEFILE='~/cookies.txt',\n", "             GDAL_HTTP_COOKIEJAR='~/cookies.txt'):\n", "    with rio.open(foa_url) as src:\n", "        print(src.profile)\n", "        print(f'Overviews levels: {src.overviews(1)}')\n", "        print(type(src))\n", "\n", "        hls_ov_levels = src.overviews(1) # We'll use this later on...\n", "        hls_proj = src.crs.to_string() # We'll use this later on too...\n", "        fig, ax = plt.subplots(1, figsize=(12, 12))\n", "        show(src, cmap='Reds')" ] }, { "cell_type": "markdown", "id": "nominated-strip", "metadata": {}, "source": [ "## Requester Pays\n", "- AWS account required\n", "- No egress cost if your instance is running in **US West (Oregon) us-west-2**" ] }, { "cell_type": "code", "execution_count": null, "id": "gentle-diameter", "metadata": {}, "outputs": [], "source": [ "session = boto3.Session()" ] }, { "cell_type": "code", "execution_count": null, "id": "similar-frost", "metadata": {}, "outputs": [], "source": [ "rp_url = 's3://usgs-landsat/collection02/level-2/standard/oli-tirs/2020/032/031/LC08_L2SP_032031_20200704_20210330_02_T1/LC08_L2SP_032031_20200704_20210330_02_T1_SR_B4.TIF'\n", "#rp_url = 's3://usgs-landsat/collection02/level-2/standard/oli-tirs/2020/032/032/LC08_L2SP_032032_20200704_20210330_02_T1/LC08_L2SP_032032_20200704_20210330_02_T1_SR_B4.TIF'" ] }, { "cell_type": "code", "execution_count": null, "id": "essential-lyric", "metadata": {}, "outputs": [], "source": [ "with rio.Env(AWSSession(session, requester_pays=True),\n", "             AWS_NO_SIGN_REQUEST='NO',\n", "             GDAL_DISABLE_READDIR_ON_OPEN='TRUE'):\n", "    with rio.open(rp_url) as src:\n", "        print(src.profile)\n", "        print(f'Overviews levels: {src.overviews(1)}')\n", "        print(type(src))\n", "\n", "        fig, ax = plt.subplots(1, figsize=(12, 12))\n", "        show(src, cmap='Reds')" ] },
{ "cell_type": "markdown", "id": "living-executive", "metadata": {}, "source": [ "## Direct S3 Bucket Access vs Access via HTTPS" ] }, { "cell_type": "markdown", "id": "celtic-casting", "metadata": {}, "source": [ "Temporary S3 bucket access credentials: https://lpdaac.earthdata.nasa.gov/s3credentials" ] }, { "cell_type": "code", "execution_count": null, "id": "successful-plastic", "metadata": {}, "outputs": [], "source": [ "nasa_hls_s3_url = 's3://lp-prod-protected/HLSS30.020/HLS.S30.T13TGF.2020274T174141.v2.0/HLS.S30.T13TGF.2020274T174141.v2.0.B04.tif'\n" ] }, { "cell_type": "code", "execution_count": null, "id": "unusual-glossary", "metadata": {}, "outputs": [], "source": [ "def get_temp_creds():\n", "    temp_creds_url = 'https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials'\n", "    return requests.get(temp_creds_url).json()" ] }, { "cell_type": "code", "execution_count": null, "id": "separate-tomato", "metadata": {}, "outputs": [], "source": [ "temp_creds_req = get_temp_creds()\n", "#temp_creds_req" ] }, { "cell_type": "code", "execution_count": null, "id": "developing-revelation", "metadata": {}, "outputs": [], "source": [ "session = boto3.Session(aws_access_key_id=temp_creds_req['accessKeyId'],\n", "                        aws_secret_access_key=temp_creds_req['secretAccessKey'],\n", "                        aws_session_token=temp_creds_req['sessionToken'],\n", "                        region_name='us-west-2')" ] }, { "cell_type": "code", "execution_count": null, "id": "2cd14b2f-b3f8-4b7b-bfe5-82db635e6889", "metadata": {}, "outputs": [], "source": [ "with rio.Env(AWSSession(session),\n", "             GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR',\n", "             CPL_VSIL_CURL_ALLOWED_EXTENSIONS='tif',\n", "             VSI_CACHE=True,\n", "             region_name='us-west-2'):\n", "    with rio.open(nasa_hls_s3_url, 'r') as src:\n", "        print(src.profile)\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "forbidden-atlas", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "with rio.Env(AWSSession(session),\n", "             GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR',\n", "             CPL_VSIL_CURL_ALLOWED_EXTENSIONS='tif',\n", "             VSI_CACHE=True,\n", "             region_name='us-west-2'):\n", "    with rio.open(nasa_hls_s3_url, 'r') as src:\n", "        print(src.profile)\n", "        print(f'Overviews levels: {src.overviews(1)}')\n", "        print(type(src))\n", "\n", "        fig, ax = plt.subplots(1, figsize=(12, 12))\n", "        show(src, cmap='Reds')" ] }, { "cell_type": "markdown", "id": "systematic-lottery", "metadata": {}, "source": [ "The image above is, in fact, the same NASA HLS asset we requested using HTTPS. Notice how much faster the rendering is when direct S3 access is used." ] }, { "cell_type": "markdown", "id": "final-creativity", "metadata": {}, "source": [ "### Using rasterio and rioxarray\n", "\n", "All the examples above use `rasterio` directly to open the cloud assets. Reading data from a dataset opened this way yields a `numpy.ndarray` object. We can also use a package called `rioxarray` to read our cloud assets in as an xarray object (optionally backed by dask)." 
] }, { "cell_type": "markdown", "id": "instant-lotus", "metadata": {}, "source": [ "**Access via `rioxarray`**" ] }, { "cell_type": "code", "execution_count": null, "id": "directed-muslim", "metadata": {}, "outputs": [], "source": [ "with rio.Env(AWSSession(session),\n", "             GDAL_DISABLE_READDIR_ON_OPEN='TRUE',\n", "             CPL_VSIL_CURL_ALLOWED_EXTENSIONS='tif',\n", "             VSI_CACHE=True,\n", "             region_name='us-west-2'):\n", "    with rioxarray.open_rasterio(nasa_hls_s3_url) as src:\n", "        ds = src.squeeze('band', drop=True)\n", "        print(ds)" ] }, { "cell_type": "markdown", "id": "touched-indonesia", "metadata": {}, "source": [ "**Note about context managers/environments:** `rioxarray` loads the data values lazily, so they must be read while the `rasterio` environment (and the credentials it carries) is still active. The next two cells use `ds` outside of that environment." ] }, { "cell_type": "code", "execution_count": null, "id": "floppy-stuart", "metadata": {}, "outputs": [], "source": [ "ds" ] }, { "cell_type": "code", "execution_count": null, "id": "important-civilization", "metadata": {}, "outputs": [], "source": [ "ds.values # Throws an error...on purpose" ] }, { "cell_type": "markdown", "id": "introductory-panel", "metadata": {}, "source": [ "**Access data values within the `rasterio` environment context manager (re-execute the cell if you get an error the first time)**" ] }, { "cell_type": "code", "execution_count": null, "id": "diagnostic-spread", "metadata": {}, "outputs": [], "source": [ "with rio.Env(AWSSession(session),\n", "             GDAL_DISABLE_READDIR_ON_OPEN='TRUE',\n", "             CPL_VSIL_CURL_ALLOWED_EXTENSIONS='tif',\n", "             VSI_CACHE=True,\n", "             region_name='us-west-2'):\n", "    with rioxarray.open_rasterio(nasa_hls_s3_url) as src:\n", "        ds = src.squeeze('band', drop=True)\n", "        print(ds.values)" ] }, { "cell_type": "markdown", "id": "thrown-advantage", "metadata": {}, "source": [ "## Resources\n", "\n", "- https://github.com/pangeo-data/cog-best-practices\n", "- https://geohackweek.github.io/raster/04-workingwithrasters/\n", "- https://www.usgs.gov/core-science-systems/nli/landsat/landsat-commercial-cloud-data-access\n", "- https://rasterio.readthedocs.io/en/latest/topics/configuration.html" ] }, { "cell_type": "markdown", "id": "formal-liquid", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "stuck-order", "metadata": {}, "source": [ "## [Next: Topic 2 - Cloud Optimized Data](Topic_2__Cloud_Optimized_Data.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 5 }