{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pattern as Path - CSV\n", "\n", "This notebook demonstrates a new feature of [Intake](https://intake.readthedocs.io/): parsing and making use of information stored in the filenames of a given dataset. It takes the point of view of the catalog author or data engineer: you can create a catalog entry for a data source that allows users to get all the data they need in one or two lines. For background, see the [blog post](https://www.anaconda.com/intake-parsing-data-from-filenames-and-paths/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import intake" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we'll open the catalog and take a look at the data sources defined within:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cat = intake.open_catalog('catalog.yml')\n", "list(cat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example we will be using `southern_rockies`. We can learn more about the data source by reading the `description`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "southern_rockies = cat.southern_rockies()\n", "print(southern_rockies.description)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also inspect the `pattern` property of the data source:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "southern_rockies.pattern" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the `read` method to load the data into memory, as a `pandas.DataFrame`, in one shot." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = southern_rockies.read()\n", "df.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The values for `emissions` and `model` are parsed from the filenames and added to the data. Inspecting the dtypes shows that these new columns are categorical:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an efficient representation of the data: a categorical column stores each distinct value once and represents the rows with integer codes, so it takes far less memory than repeated strings. It also speeds up selection and groupby operations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization\n", "Intake provides a plotting API which uses `hvplot` to declare plots in the catalog itself. [hvPlot](https://hvplot.pyviz.org/) makes it easy to create interactive plots that are themselves `holoviews` objects, which makes it a powerful tool for rapid data visualization. This API can be used to set default values for a particular data source and to declare specific plots." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import hvplot.intake\n", "\n", "intake.output_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case we specify some defaults to make it easy to produce plots quickly. 
You'll find these lines in the catalog file:\n", "```\n", "metadata:\n", "  plot:\n", "    x: time\n", "    y: precip\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "southern_rockies.plot(groupby='emissions', by='model')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to declaring defaults, the catalog author can specify complete plots:\n", "```\n", "metadata:\n", "  plots:\n", "    model_emissions_grid:\n", "      col: model\n", "      row: emissions\n", "      width: 300\n", "      height: 200\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "southern_rockies.plot.model_emissions_grid()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that these plots are one-liners. Providing default plots for users is a great way to give them a quick sense of the data, and the interactivity makes it straightforward to zoom into an area of interest and examine it in detail. We can also use the `pandas.DataFrame` directly to do computations and make more visualizations." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import hvplot.pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "unit = southern_rockies.metadata['fields']['precip']['unit']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "thresh = 3\n", "years = 20\n", "label = f'Months per {years} years with precip ({unit}) greater than {thresh}'\n", "\n", "(df[df['precip'] > thresh]\n", " .groupby('emissions').resample(f'{years}y', on='time').sum()\n", " .rename(columns={'precip': 'count'})\n", " .hvplot.bar(by='emissions', x='time')\n", " .relabel(label))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using a list of paths\n", "\n", "When you are starting to build a catalog, it is sometimes helpful to use `intake.open_csv` to figure out the best way to load your data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "paths = ['./data/SRLCC_b1_Precip_ECHAM5-MPI.csv', './data/SRLCC_b1_Precip_MIROC3.2(medres).csv']\n", "\n", "southern_rockies_list = intake.open_csv(urlpath=paths,\n", " path_as_pattern='SRLCC_{emissions}_Precip_{model}.csv',\n", " csv_kwargs=dict(\n", " skiprows=3,\n", " names=['time', 'precip'],\n", " parse_dates=['time']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = southern_rockies_list.read()\n", "df.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, since we are loading the data inline rather than from the catalog, we need to declare every feature of our plot inline." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "southern_rockies_list.hvplot(x='time', y='precip', col='model', row='emissions', width=300, height=200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you are ready to save your catalog, call the `.yaml()` method to generate an approximate version of the catalog entry." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(southern_rockies_list.yaml())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have a catalog, you have a single source of truth for the data source, there is no need for copy/paste, and the end user can get on with their work. You can find an example from the data user's perspective in the [landsat](./landsat.ipynb) example." ] } ], "metadata": { "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }