{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pattern as Path - CSV\n",
    "\n",
    "Here we are demonstrating a new functionality within [Intake](https://intake.readthedocs.io/), which can parse and make use of information stored in the filenames of a given dataset. This notebook demonstrates the functionality from the point of view of the catalog-author/data-engineer: you can create a catalog entry for a data source that allows users to get all the data they need in one or two lines. Link to the [blog post](https://www.anaconda.com/intake-parsing-data-from-filenames-and-paths/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import intake"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we'll open the catalog and take a look at the data sources defined within:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat = intake.open_catalog('catalog.yml')\n",
    "list(cat)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this example we will be using `southern_rockies`. We can learn more about the data source by reading the `description`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "southern_rockies = cat.southern_rockies()\n",
    "print(southern_rockies.description)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also inspect the  `pattern` property of the data source:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "southern_rockies.pattern"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use the `read` method to load the data into memory, as a `pandas.dataframe`, in one shot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = southern_rockies.read()\n",
    "df.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The values for `emissions` and `model` are parsed from the filenames and added to the data. By inspecting more closely we can see that these new columns have categorical datatypes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a highly efficient representation of the data and takes up minimal memory. It is also more performant for select and groupby operations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualization\n",
    "Intake provides a plotting API which uses `hvplot` to declare plots in the catalog itself. [Hvplot](https://hvplot.pyviz.org/) allows you to easily create interactive plots that are actually `holoviews` objects making it an incredibly powerful tool for rapid data visualization. This API can be used to set default values for a particular data source and to declare specific plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import hvplot.intake\n",
    "\n",
    "intake.output_notebook()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case specify some defaults to make it easy to produce plots quickly. You'll find these lines in the catalog file:\n",
    "```\n",
    "metadata:\n",
    "  plot:\n",
    "    x: time\n",
    "    y: precip\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "southern_rockies.plot(groupby='emissions', by='model')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition to declaring defaults, the catalog author can specify complete plots:\n",
    "```\n",
    "metadata:\n",
    "  plots:\n",
    "    model_emissions_grid:\n",
    "      col: model\n",
    "      row: emissions\n",
    "      width: 300\n",
    "      height: 200\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "southern_rockies.plot.model_emissions_grid()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that these plots are one-liners. Providing default plots for users is a great way to give them a quick sense of the data, and the interactivity makes it straightforward to zoom into the area of interest and derive real meaning. We can use the `pandas.dataframe` directly to do computations and make more visualizations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import hvplot.pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unit = southern_rockies.metadata['fields']['precip']['unit']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "thresh = 3\n",
    "years = 20\n",
    "label = f'Months per {years} years with precip ({unit}) greater than {thresh}'\n",
    "\n",
    "(df[df['precip'] > thresh]\n",
    "    .groupby('emissions').resample(f'{years}y', on='time').sum()\n",
    "    .rename(columns={'precip': 'count'}) \n",
    "    .hvplot.bar(by='emissions', x='time') \n",
    "    .relabel(label))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using a list of paths \n",
    "\n",
    "When you are starting to build a catalog it is sometimes helpful to use `intake.open_csv` to figure out the best way to load your data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "paths = ['./data/SRLCC_b1_Precip_ECHAM5-MPI.csv', './data/SRLCC_b1_Precip_MIROC3.2(medres).csv']\n",
    "\n",
    "southern_rockies_list = intake.open_csv(urlpath=paths,\n",
    "                path_as_pattern='SRLCC_{emissions}_Precip_{model}.csv',\n",
    "                csv_kwargs=dict(\n",
    "                    skiprows=3,\n",
    "                    names=['time', 'precip'],\n",
    "                    parse_dates=['time']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = southern_rockies_list.read()\n",
    "df.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case since we are using inline loading rather than the catalog, we need declare ever feature of out plot inline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "southern_rockies_list.hvplot(x='time', y='precip', col='model', row='emissions', width=300, height=200)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you are ready to save your catalog, use `.yaml` to generate an approximate version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(southern_rockies_list.yaml())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have a catalog, you have a single reference of truth for the data source, no need for copy/paste, and the end-user can get on with their work. You can find an example from the data user's perspective in the [landsat](./landsat.ipynb) example."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}