{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Making Zarr data from NetCDF files\n", "\n", "- Funding: Interagency Implementation and Advanced Concepts Team [IMPACT](https://earthdata.nasa.gov/esds/impact) for the Earth Science Data Systems (ESDS) program and AWS Public Dataset Program\n", "- Software developed during [OceanHackWeek 2020](https://github.com/oceanhackweek) \n", " \n", "### Credits: Tutorial development\n", "* [Dr. Chelle Gentemann](mailto:gentemann@faralloninstitute.org) - [Twitter](https://twitter.com/ChelleGentemann) - Farallon Institute\n", "* [Patrick Gray](mailto:patrick.c.gray@duke.edu) - [Twitter](https://twitter.com/clifgray) - Duke University\n", "* [Phoebe Hudson](mailto:pahdsn@outlook.com) - University of Southampton\n", "\n", "## Why data format matters\n", "- NetCDF sprinkles metadata throughout files, making them slow to access and read data\n", "- Zarr consolidates the metadata, making them FAST for access and reading\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# filter some warning messages\n", "import warnings \n", "warnings.filterwarnings(\"ignore\") \n", "\n", "#libraries\n", "import datetime as dt\n", "import xarray as xr\n", "import fsspec\n", "import s3fs\n", "from matplotlib import pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "# make datasets display nicely\n", "xr.set_options(display_style=\"html\") \n", "import os.path\n", "\n", "#magic fncts #put static images of your plot embedded in the notebook\n", "%matplotlib inline \n", "plt.rcParams['figure.figsize'] = 12, 6\n", "%config InlineBackend.figure_format = 'retina' \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_geo_data(sat,lyr,idyjl):\n", " # arguments\n", " # sat goes-east,goes-west,himawari\n", " # lyr year\n", " # idyjl day of year\n", " \n", " ds,iexist=[],False\n", " \n", " d = dt.datetime(lyr,1,1) + dt.timedelta(days=idyjl)\n", " fs = s3fs.S3FileSystem(anon=True) #connect to s3 bucket!\n", "\n", " #create strings for the year and julian day\n", " imon,idym=d.month,d.day\n", " syr,sjdy,smon,sdym = str(lyr).zfill(4),str(idyjl).zfill(3),str(imon).zfill(2),str(idym).zfill(2)\n", " \n", " #use glob to list all the files in the directory\n", " if sat=='goes-east':\n", " file_location,var = fs.glob('s3://noaa-goes16/ABI-L2-SSTF/'+syr+'/'+sjdy+'/*/*.nc'),'SST'\n", " if sat=='goes-west':\n", " file_location,var = fs.glob('s3://noaa-goes17/ABI-L2-SSTF/'+syr+'/'+sjdy+'/*/*.nc'),'SST'\n", " if sat=='himawari':\n", " file_location,var = fs.glob('s3://noaa-himawari8/AHI-L2-FLDK-SST/'+syr+'/'+smon+'/'+sdym+'/*/*L2P*.nc'),'sea_surface_temperature'\n", " \n", " #make a list of links to the file keys\n", " if len(file_location)<1:\n", " return file_ob\n", " file_ob = [fs.open(file) for file in file_location] #open connection to files\n", " \n", " #open all the day's data\n", " with xr.open_mfdataset(file_ob,combine='nested',concat_dim='time') as ds:\n", "\n", " iexist = True #file exists\n", "\n", " #clean up coordinates which are a MESS in GOES\n", " #rename one of the coordinates that doesn't match a dim & should\n", " if not sat=='himawari':\n", " ds = ds.rename({'t':'time'})\n", " ds = ds.reset_coords()\n", " else:\n", " ds = ds.rename({'ni':'x','nj':'y'})\n", " \n", " #put in to Celsius\n", " #ds[var] -= 273.15 #nice python shortcut to +- from itself a-=273.15 is the same as a=a-273.15\n", " #ds[var].attrs['units'] = '$^\\circ$C'\n", " \n", " return ds,iexist\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Open GOES-16 (East Coast) Data\n", "- Careful of what you ask for.... each day is about 3 min to access" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "lyr = 2020\n", "\n", "satlist = ['goes-east','goes-west','himawari']\n", "\n", "for sat in satlist:\n", "\n", " init = 0 #reset new data store\n", "\n", " for idyjl in range(180,201): #6/28/2020-7/18/2020\n", "\n", " print('starting ', idyjl)\n", "\n", " ds,iexist = get_geo_data(sat,lyr,idyjl)\n", " \n", " if not iexist:\n", " continue\n", "\n", " print('writing zarr store')\n", "\n", " if init == 0:\n", " ds.to_zarr(sat)\n", " init = 1\n", " else:\n", " ds.to_zarr(sat,append_dim='time')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Now write this to our shared AWS S3 bucket\n", "\n", "Note that in order to do this you need the aws command line tools which can be installed by running from the command line \n", "\n", "`pip install awscli`\n", "\n", "`aws s3 sync ./goes_east s3://ohw-bucket/goes_east`\n", "\n", "`aws s3 sync ./goes_west s3://ohw-bucket/goes_west`\n", "\n", "`aws s3 sync ./goes_west s3://ohw-bucket/himawari`\n", "\n", "#### note that putting the ! in front of a command in jupyter send it to the terminal so you could run it here with\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install awscli\n", "! aws s3 sync ./goes_east s3://ohw-bucket/goes_east\n", "! aws s3 sync ./goes_west s3://ohw-bucket/goes_west\n", "! aws s3 sync ./goes_west s3://ohw-bucket/himawari" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test reading the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "file_location = 's3://ohw-bucket/goes_east'\n", "\n", "ds = xr.open_zarr(fsspec.get_mapper(file_location,anon=False))\n", "\n", "ds" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }