{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Engineering with Intake\n", "\n", "Intake philosophy contains a clear separation of concerns between the provider of data and\n", "the consumer of data. This tutorial concerns the former: someone who cares about where a\n", "particular dataset it stored and the right format and options for best retrieval. It is\n", "their task to make these choices, and then expose the data to end-users (such as data scientists),\n", "so that they have a clear path to finding and accessing their data. There is no need to\n", "train users in how to investigate or load a particular dataset, those details are encoded\n", "in the catalog.\n", "\n", "Intake catalogs act as a single source of truth about the data in question.\n", "The principal job of a data scientist, while interacting with Intake, is to find the\n", "best representation of data-sets (as they would have to do in any case) and to author\n", "catalogs as a means of both codifying the data-sets in versionable files and exposing\n", "them to users with a clear contract.\n", "\n", "In this tutorial we will show the work-flow for writing a catalog, and thereby providing\n", "data to your users." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import intake\n", "import hvplot.pandas\n", "intake.output_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Intake datasets are loaded by data *drivers*, which are class definitions, some of which\n", "are included with the main Intake package, but many more of which are available as extra\n", "installs (see the [plugins directory](https://intake.readthedocs.io/en/latest/plugin-directory.html).\n", "The currently installed set of drivers are listed by name in the registry:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "list(intake.registry)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each of these is associated with an `intake.open_*` function. The registry, and these functions,\n", "are built at import time by scanning all installed packaged with names starting \"intake_\", and\n", "containing `DataSource` subclasses at the top level. It is also possible to refer to drivers\n", "by the fully-qualified name (e.g., \"package.submodule.DriverClass\"), but in this example we\n", "will concentrate on the \"csv\" driver, which is included in the default Intake install.\n", "\n", "Commonly, the first step to writing a catalog entry might be to use the relevant `open_`\n", "function to create a DataSource object. The functions are documented with text from the\n", "driver class. Getting the *right* set of parameters can take some iterations, and most\n", "of the drivers can take many optional parameters - some domain knowledge of the format\n", "in question may be required at this point. The CSV format is fairly simple, and in this\n", "particular case, it loads without extra arguments:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "source = intake.open_csv('https://timeseries.weebly.com/uploads/'\n", " '2/1/0/8/21086414/sea_ice.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That line just created the DataSource object, which may have verified that the arguments\n", "were reasonable, but did not actually access the data (i.e., it is a lazy process). To\n", "test if the loading is successful, we need to interrogate the source, such as looking\n", "at the details, or some (or all) of the loaded data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# access the data basic details\n", "source.discover()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load and use data\n", "df = source.read()\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We find that the data loaded fine (although, we may have wished to parse those Time values into timestamps).\n", "Having got here, we might attempt to investigate some plots which best show the characteristics of the data.\n", "The purpose of this is to come up with plot specs that might be stored along with the arguments required\n", "to load the data, so that users can get a quick overview of the contents. This is normally an iterative\n", "process, but we just show a particular invocation that happens to show a figure that might be informative:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# try some plotting\n", "df.hvplot(kind='line', x='Time', y= ['Arctic', 'Antarctica'],\n", " width=700, height=500)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we plotted on the (in-memory) Pandas data-frame here, but the identical output can\n", "be had from the DataSource instance too. This loads the data every time, which may or may\n", "not be desired. It is possible, but not included in this tutorial, to specify a dataset\n", "as cache-on-first-access, so that the speed of such output is greatly improved." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "source.hvplot(kind='line', x='Time', y= ['Arctic', 'Antarctica'],\n", " width=700, height=500)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, we have a source which correctly loads the data, and we have graphical output too.\n", "We can get the YAML-syntax prescription of the first by directly calling a method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# presciption for the source we made\n", "print(source.yaml())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can create a YAML-syntax catalog file containing that prescription, with\n", "some extra description, and the addition of the plot we trialled above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile sea.yaml\n", "sources:\n", " sea_ice:\n", " args:\n", " urlpath: \"https://timeseries.weebly.com/uploads/2/1/0/8/21086414/sea_ice.csv\"\n", " description: \"Polar sea ice cover\"\n", " driver: csv\n", " metadata:\n", " plots:\n", " basic:\n", " kind: line\n", " x: Time\n", " y: [Arctic, Antarctica]\n", " width: 700\n", " height: 500" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To test that this catalog does the right thing, we can load it in again, and try to\n", "work with it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load that prescription\n", "cat = intake.open_catalog('sea.yaml')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot is automatic\n", "cat.sea_ice.plot.basic()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK! The catalog is functional, and ready to be shared. The very easiest way to share an Intake\n", "catalog is simply to put it in a location where it can be read by your target audience. For this\n", "tutorial, which is stored in a github repo, that can be the URL of the file in the repo.\n", "\n", "So all that you need to share with your users is the URL of the catalog. You can try it for yourself,\n", "but the following line is what you would expect the data users to execute. Of course, it is good practice\n", "to verify that this works too." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# put catalog in public place for sharing\n", "cat = intake.open_catalog('https://raw.githubusercontent.com/intake/'\n", " 'intake-examples/master/tutorial/sea.yaml')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load data as before\n", "cat.sea_ice.read().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One interesting note, is that the catalog is *also* a DataSource instance. This means that\n", "you can refer to it from other catalogs, and thus build up a hierarchy of data sources. For instance,\n", "you might have a root or master catalog, which points to several other catalogs, which each contain\n", "entries of a particular type; and the whole thing is searchable with either code or the Intake GUI. \n", "This way, the data collection as a whole has a structure that will make\n", "navigating to the right data-set simpler. You can even have separate hierarchies pointing to the\n", "same data, for alternative ways to split up the set of data-sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the cat as a data source\n", "print(cat.yaml())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }