{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../../thumbnail.png\" width=250 alt=\"CESM LENS image\"></img>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Enhanced Intake-ESM Catalog Demo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "This notebook compares one [Intake-ESM](https://intake-esm.readthedocs.io/en/stable/) catalog with an enhanced version that includes additional attributes. Both catalogs are an inventory of the [NCAR Community Earth System Model (CESM) Large Ensemble (LENS) data hosted on AWS S3](https://doi.org/10.26024/wt24-5j82)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "| Concepts | Importance | Notes |\n",
    "| --- | --- | --- |\n",
    "| [Intro to Pandas](https://foundations.projectpythia.org/core/pandas/pandas.html) | Necessary | |\n",
    "\n",
    "- **Time to learn**: 10 minutes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import intake\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Original Intake-ESM Catalog"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At import time, the `intake-esm` plugin is available in `intake`’s registry as `esm_datastore` and can be accessed with `intake.open_esm_datastore()` function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_url_orig = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json'\n",
    "coll_orig = intake.open_esm_datastore(cat_url_orig)\n",
    "coll_orig"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a summary representation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(coll_orig)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In an Intake-ESM catalog object, the `esmcat` class provides many useful attributes and functions. For example, we can get the collection's description:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coll_orig.esmcat.description"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also get the URL pointing to the catalog's underlying tabular representation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coll_orig.esmcat.catalog_file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's a CSV file ... let's take a peek."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_orig = pd.read_csv(coll_orig.esmcat.catalog_file)\n",
    "df_orig"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, we can save a step since an ESM catalog object provides a `df` instance which returns a dataframe too:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_orig = coll_orig.df\n",
    "df_orig"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print out a sorted list of the unique values of selected columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for col in ['component', 'frequency', 'experiment', 'variable']:\n",
    "    unique_vals = coll_orig.unique()[col]\n",
    "    unique_vals.sort()\n",
    "    count = len(unique_vals)\n",
    "    print (col + ': ' ,unique_vals, \" count: \", count, '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you happen to know the meaning of the variable names, you can find what data are available for that variable. For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = coll_orig.search(variable='FLNS').df\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can narrow the filter to specific frequency and experiment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = coll_orig.search(variable='FLNS', frequency='daily', experiment='RCP85').df\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\"><b>The problem: </b>Do all potential users know that `FLNS` is a CESM-specific abbreviation for \"net longwave flux at surface”? How would a novice user find out, other than by finding separate documentation, or by opening a Zarr store in the hopes that the long name might be recorded there? How do we address the fact that every climate model code seems to have a different, non-standard name for all the variables, thus making multi-source research needlessly difficult?</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\"><b>The solution: </b><br>\n",
    "    <h2> <i>Enhanced</i> Intake-ESM Catalog!</h2></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By adding additional columns to the Intake-ESM catalog, we should be able to improve semantic interoperability and provide potentially useful information to the users. Let's now open the *enhanced* collection description file:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\"><b>Note: </b>The URL for the enhanced catalog differs from the original only in that it has <code>-enhanced</code> appended to <code>aws-cesm1-le</code></div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_url = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le-enhanced.json'\n",
    "coll = intake.open_esm_datastore(cat_url)\n",
    "coll"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we did for the first catalog, let's obtain the `description` and `catalog_file` attributes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(coll.esmcat.description) # Description of collection\n",
    "print(\"Catalog file:\", coll.esmcat.catalog_file)\n",
    "print(coll) # Summary of collection structure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Long names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the catalog's representation above, note the addition of additional elements: `long_name`, `start`, `end`, and `dim`. Here are the first/last few lines of the enhanced catalog:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_enh = coll.df\n",
    "df_enh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition alert alert-warning\">\n",
    "    <p class=\"admonition-title\" style=\"font-weight:bold\">Warning</p>\n",
    "    The <code>long_name</code>s are <em>not</em> <a href=\"https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html\">CF Standard Names</a>, but rather are those documented at \n",
    "<a href=\"https://www.cesm.ucar.edu/projects/community-projects/LENS/data-sets.html\">the NCAR LENS website</a>. For interoperability, the <code>long_name</code> column should be replaced by a <code>cf_name</code> column and possibly an <code>attribute</code> column to disambiguate if needed.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "List all available variables by long name, sorted alphabetically:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nameList = coll.unique()['long_name']\n",
    "nameList.sort()\n",
    "print(*nameList, sep='\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Search capabilities"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use an `intake-esm` catalog object's `search` function in several ways: "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Show all available data for a specific variable based on long name:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "myName = 'Salinity'\n",
    "df = coll.search(long_name=myName).df\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Search based on multiple criteria:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = coll.search(experiment=['20C','RCP85'], dim='3D', variable=['T','Q']).df\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Substring matches"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In some cases, you may not know the exact term to look for. For such cases, inkake-esm supports searching for substring matches. With use of wildcards and/or regular expressions, we can find all items with a particular substring in a given column. Let’s search for:\n",
    "- entries from experiment = ‘20C’\n",
    "- all entries whose variable long name contains wind"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coll_subset = coll.search(experiment=\"20C\", long_name=\"Wind*\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coll_subset.df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we wanted to search for Wind and wind, we can take advantage of [regular expression](https://docs.python.org/3/library/re.html) syntax to do so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coll_subset = coll.search(experiment=\"20C\" , long_name=\"[Ww]ind*\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coll_subset.df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Other attributes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other columns in the enhanced catalog may be useful. For example, the dimensionality column enables us to list all data from the ocean component that is 3D."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = coll.search(dim=\"3D\",component=\"ocn\").df\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Spatiotemporal filtering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If there were both regional and global data available (e.g., LENS and NA-CORDEX data for the same variable, both listed in same catalog), some type of coverage indicator (or columns for bounding box edges) could be listed.\n",
    "\n",
    "Temporal extent in LENS is conveyed by the experiment (HIST, 20C, etc) but this is imprecise and requires external documentation. We have added start/end columns to the catalog, but Intake-ESM currently does not have built-in functionality to filter based on time.\n",
    "\n",
    "We can do a simple search that exactly matches a temporal value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = coll.search(dim=\"3D\",component=\"ocn\", end='2100-12').df\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "In this notebook, we used Intake-ESM to explore a catalog of CESM LENS data. We then worked through some helpful features of the enhanced catalog.\n",
    "\n",
    "### What's next?\n",
    "We will use this data to recreate some figures from a paper published in [BAMS](https://www.ametsoc.org/index.cfm/ams/publications/bulletin-of-the-american-meteorological-society-bams/) that describes the CESM LENS project: {cite:t}`Kay:2015a`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Resources and references\n",
    "\n",
    "1. [Original notebook in the (no longer maintained) Pangeo Gallery](https://gallery.pangeo.io/repos/NCAR/cesm-lens-aws/notebooks/EnhancedIntakeCatalogDemo.html)\n",
    "2. [Intake-esm documentation](https://intake-esm.readthedocs.io)\n",
    "3. [Kay et al. (2015): _The Community Earth System Model (CESM) Large Ensemble Project_, BAMS, doi:10.1175/BAMS-D-13-00255.1](https://doi.org/10.1175/BAMS-D-13-00255.1)\n",
    "4. [Community Earth System Model Large Ensemble (CESM LENS) Data Sets on AWS](https://doi.org/10.26024/wt24-5j82)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  },
  "nbdime-conflicts": {
   "local_diff": [
    {
     "diff": [
      {
       "diff": [
        {
         "key": 0,
         "op": "addrange",
         "valuelist": [
          "Python 3"
         ]
        },
        {
         "key": 0,
         "length": 1,
         "op": "removerange"
        }
       ],
       "key": "display_name",
       "op": "patch"
      }
     ],
     "key": "kernelspec",
     "op": "patch"
    }
   ],
   "remote_diff": [
    {
     "diff": [
      {
       "diff": [
        {
         "key": 0,
         "op": "addrange",
         "valuelist": [
          "Python3"
         ]
        },
        {
         "key": 0,
         "length": 1,
         "op": "removerange"
        }
       ],
       "key": "display_name",
       "op": "patch"
      }
     ],
     "key": "kernelspec",
     "op": "patch"
    }
   ]
  },
  "toc-autonumbering": false
 },
 "nbformat": 4,
 "nbformat_minor": 4
}