{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Intake for a Data Science workflow\n", "\n", "A data scientist wishes to find data for a given task, and to introspect\n", "the attributes of that data to make sure it is right for that task.\n", "Next, the data should be loaded, ready for analysis; the less the scientist needs to know\n", "about the specifics of the loading package, format and storage backend, the\n", "better.\n", "\n", "Intake provides an easy way to find your data, locally, in a cloud service, or\n", "on an Intake server. In this tutorial, we show an overview of working with Intake\n", "from the point-of-view of the data scientist, or data end-user: someone who wants\n", "a package which can fetch the data and then \"get out of the way\".\n", "\n", "The common starting point for finding and inspecting data-sets is a *catalog*, which\n", "is a collection of entries, each of which corresponds to a specific data-set. The\n", "entries have names, descriptions and metadata, to allow for searching and filtering\n", "of the entries, to find the specific data which solves a particular problem.\n", "\n", "There are two specific starting points that a data scientist might usually use:\n", "- built-in data, which may have been conda-installed, or otherwise appear on\n", " a predetermined search-path\n", "- specific URLs containing catalog specifications.\n", "\n", "Both of these can be accessed programmatically or via the GUI, and we'll demonstrate\n", "each." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import intake\n", "intake.output_notebook() # enables source.plot API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this environment, we have \"conda-installed data\", set up with a command which could\n", "look as follows:\n", "```\n", "conda install -c intake us_crime\n", "```\n", "This action puts a special catalog spec file in the YAML format\n", "in a location which Intake searches at import time. The effect is\n", "to make some datasets automatically available in the special\n", "catalog `intake.cat`. In the following cell,\n", "the entry \"us_crime\" corresponds to the dataset which is installed\n", "with the conda command above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "intake.cat" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# find the entries in the catalog\n", "list(intake.cat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any entry of any catalog is easy to investigate. The standard repr\n", "already gives a lot of information about the data. Note that you\n", "can use attribute access to grab a specific entry, if it has a name\n", "which can be a python identifier, or with getitem syntax `[\"name\"]`,\n", "which always works." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select an entry\n", "# same as intake.cat[\"us_crime\"]\n", "intake.cat.us_crime" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We found that this represents a dataframe, to be read by the driver \"csv\".\n", "\n", "A catalog entry is just a definition of how Intake should go about grabbing the data, together with useful descriptions. 
To go deeper, we can instantiate the *data source*, which is a concrete representation of the data, and can be used for loading metadata bound in the storage itself, and to actually read all of the data. The\n", "following cell touches the first part of the data, to infer the types of the columns of the dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# detailed info - touches data\n", "s = intake.cat.us_crime()\n", "s.discover()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So what should we do with this data? The most obvious option, hinted at in the output above, is\n", "to view the pre-specified plot that was included in the spec. There is one named plot,\n", "called \"example\":" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.plots" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can view it with the following one-liner. Note that this plot is\n", "interactive." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the included quick-plot\n", "intake.cat.us_crime.plot.example()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even more commonly, we may just load the data into memory. In this case, the in-memory\n", "representation of the data is a Pandas data-frame, which will be familiar to most\n", "data scientists. For every data-set, there will be only one particular type of container,\n", "like the Pandas data-frame, which is the right way to represent that data. 
Often, though,\n", "there may be multiple ways to load this data; for example, the `.to_dask()` method would\n", "have created a `dask.dataframe.DataFrame` instead, which can be important for data-sets that\n", "are very large.\n", "\n", "At this point, Intake is essentially done with this dataset: it has\n", "been delivered to the scientist, so that they can get on with whatever analysis they needed\n", "this data for." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load all data into memory\n", "df = intake.cat.us_crime.read()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# now we have a pandas dataframe\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Analysis:\n", "# ratio of motor vehicle thefts to larceny-thefts, summed across all years\n", "df['Motor vehicle theft'].sum() / df['Larceny-theft'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This workflow could also have been achieved via the Intake GUI. Initially, the\n", "GUI will show the built-in datasets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "intake.gui" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how only one catalog is loaded, and that if you select the source \"us_crime\", the same information is\n", "displayed about it as at the top of this notebook. 
Also, the \"plot\" button opens a panel below the\n", "main interface, containing the one prescribed plot for this data-set.\n", "\n", "The (+) button in the interface allows you to use the file-browser or any URL to add catalogs to the interface;\n", "but you can also do this in code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "intake.gui.add('sea.yaml')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now a new catalog \"sea\" has appeared, and its one item is selected - which happens to be another csv data-set,\n", "but this time loaded from a remote location. Note that the *catalog file* is local, but the *data* is remote.\n", "It is perfectly possible for the catalog file to also be remote." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = intake.gui.item()\n", "s.discover() # access information on the selected item." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.plot.basic()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }