{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# An Introduction to Analysing LibCrowds Results Data Using Python\n",
    "\n",
    "The purpose of this notebook is to introduce a key Python library, [pandas](https://pandas.pydata.org/), that can be used to manipulate and analyse LibCrowds results data.\n",
    "\n",
    "The pandas library provides access to high-performance data analysis tools via an accessible Python interface. We will use the library to load all of our *In the Spotlight* results into a structure called a dataframe. A dataframe is a two-dimensional data structure, similar to a spreadsheet, that accepts many different kinds of input. As everything is stored in memory, rather than on disk, the only limitation to this type of data structure is going to be the amount of RAM installed on the computer. However, for any modern computer this is unlikely to be an issue until we reach tens of millions of results.\n",
    "\n",
    "We begin by importing pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The dataset\n",
    "\n",
    "For this notebook, our input will be all of the performance data collected so far via the crowdsourcing projects presented on [*In the Spotlight*](https://www.libcrowds.com/collections/playbills). In a previous notebook we saw how these results are modelled in their raw form. However, for the purposes of this notebook we have converted this raw data into a table of performances, where each row contains the known data for a specific performance (e.g. title, date, genre and theatre). The way in which this was achieved is slightly too complex to introduce here but for those interested the scripts can be found in [this repository](https://github.com/LibCrowds/data).\n",
    "\n",
    "All we currently need to know about the code block below is that it loads our dataframe of performance data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "module_path = os.path.abspath(os.path.join('..', 'data', 'scripts'))\n",
    "if module_path not in sys.path:\n",
    "    sys.path.append(module_path)\n",
    "from get_its_performances import get_performances_df\n",
    "df = get_performances_df()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [head](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) function returns the first *n* rows of a dataset (defaults to 5); we can use this function to take a first glance at our dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>date</th>\n",
       "      <th>genre</th>\n",
       "      <th>link</th>\n",
       "      <th>theatre</th>\n",
       "      <th>city</th>\n",
       "      <th>source</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Pageantry</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>http://access.bl.uk/item/viewer/ark:/81055/vdc...</td>\n",
       "      <td>Theatre Royal, Margate</td>\n",
       "      <td>Margate</td>\n",
       "      <td>https://api.bl.uk/metadata/iiif/ark:/81055/vdc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>The Hypocrite</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Comedy</td>\n",
       "      <td>http://access.bl.uk/item/viewer/ark:/81055/vdc...</td>\n",
       "      <td>Theatre Royal, Margate</td>\n",
       "      <td>Margate</td>\n",
       "      <td>https://api.bl.uk/metadata/iiif/ark:/81055/vdc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>The Padlock</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Musical Farce</td>\n",
       "      <td>http://access.bl.uk/item/viewer/ark:/81055/vdc...</td>\n",
       "      <td>Theatre Royal, Margate</td>\n",
       "      <td>Margate</td>\n",
       "      <td>https://api.bl.uk/metadata/iiif/ark:/81055/vdc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>The Village Lawyer</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Farce</td>\n",
       "      <td>http://access.bl.uk/item/viewer/ark:/81055/vdc...</td>\n",
       "      <td>Theatre Royal, Margate</td>\n",
       "      <td>Margate</td>\n",
       "      <td>https://api.bl.uk/metadata/iiif/ark:/81055/vdc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Death of Gen. Wolfe</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Ballet</td>\n",
       "      <td>http://access.bl.uk/item/viewer/ark:/81055/vdc...</td>\n",
       "      <td>Theatre Royal, Margate</td>\n",
       "      <td>Margate</td>\n",
       "      <td>https://api.bl.uk/metadata/iiif/ark:/81055/vdc...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 title date          genre  \\\n",
       "0            Pageantry  NaN            NaN   \n",
       "1        The Hypocrite  NaN         Comedy   \n",
       "2          The Padlock  NaN  Musical Farce   \n",
       "3   The Village Lawyer  NaN          Farce   \n",
       "4  Death of Gen. Wolfe  NaN         Ballet   \n",
       "\n",
       "                                                link                 theatre  \\\n",
       "0  http://access.bl.uk/item/viewer/ark:/81055/vdc...  Theatre Royal, Margate   \n",
       "1  http://access.bl.uk/item/viewer/ark:/81055/vdc...  Theatre Royal, Margate   \n",
       "2  http://access.bl.uk/item/viewer/ark:/81055/vdc...  Theatre Royal, Margate   \n",
       "3  http://access.bl.uk/item/viewer/ark:/81055/vdc...  Theatre Royal, Margate   \n",
       "4  http://access.bl.uk/item/viewer/ark:/81055/vdc...  Theatre Royal, Margate   \n",
       "\n",
       "      city                                             source  \n",
       "0  Margate  https://api.bl.uk/metadata/iiif/ark:/81055/vdc...  \n",
       "1  Margate  https://api.bl.uk/metadata/iiif/ark:/81055/vdc...  \n",
       "2  Margate  https://api.bl.uk/metadata/iiif/ark:/81055/vdc...  \n",
       "3  Margate  https://api.bl.uk/metadata/iiif/ark:/81055/vdc...  \n",
       "4  Margate  https://api.bl.uk/metadata/iiif/ark:/81055/vdc...  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The remainder of this notebook will introduce a few basic functions that we can use to begin analysing and manipulating our dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summarising dataframes\n",
    "\n",
    "The [describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) function generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>date</th>\n",
       "      <th>genre</th>\n",
       "      <th>link</th>\n",
       "      <th>theatre</th>\n",
       "      <th>city</th>\n",
       "      <th>source</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>2317</td>\n",
       "      <td>989</td>\n",
       "      <td>1298</td>\n",
       "      <td>2317</td>\n",
       "      <td>2317</td>\n",
       "      <td>2317</td>\n",
       "      <td>2317</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>1305</td>\n",
       "      <td>438</td>\n",
       "      <td>147</td>\n",
       "      <td>1076</td>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>1076</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>Rosina</td>\n",
       "      <td>1830-11-23</td>\n",
       "      <td>Farce</td>\n",
       "      <td>http://access.bl.uk/item/viewer/ark:/81055/vdc...</td>\n",
       "      <td>Miscellaneous Plymouth theatres</td>\n",
       "      <td>Plymouth</td>\n",
       "      <td>https://api.bl.uk/metadata/iiif/ark:/81055/vdc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>13</td>\n",
       "      <td>7</td>\n",
       "      <td>221</td>\n",
       "      <td>12</td>\n",
       "      <td>1230</td>\n",
       "      <td>1230</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         title        date  genre  \\\n",
       "count     2317         989   1298   \n",
       "unique    1305         438    147   \n",
       "top     Rosina  1830-11-23  Farce   \n",
       "freq        13           7    221   \n",
       "\n",
       "                                                     link  \\\n",
       "count                                                2317   \n",
       "unique                                               1076   \n",
       "top     http://access.bl.uk/item/viewer/ark:/81055/vdc...   \n",
       "freq                                                   12   \n",
       "\n",
       "                                theatre      city  \\\n",
       "count                              2317      2317   \n",
       "unique                                6         6   \n",
       "top     Miscellaneous Plymouth theatres  Plymouth   \n",
       "freq                               1230      1230   \n",
       "\n",
       "                                                   source  \n",
       "count                                                2317  \n",
       "unique                                               1076  \n",
       "top     https://api.bl.uk/metadata/iiif/ark:/81055/vdc...  \n",
       "freq                                                   12  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This simple function already presents some interesting results. At the time of writing, we can see that we have over 140 unique genres. \n",
    "\n",
    "We might be curious about what some of the more unusual genres are. To find them, we can use the [value_counts](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) function, which returns an object containing counts of unique values. Below, we call this function with the argument `ascending=True`, to sort the output in ascending order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "counts = df.genre.value_counts(ascending=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can then use then run the following command to display the first ten rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Fairy Spectacle                      1\n",
       "Historical Melodrama                 1\n",
       "National Play                        1\n",
       "Drawing Room Entertainment           1\n",
       "Grand National Military Spectacle    1\n",
       "Masquerade                           1\n",
       "Sketch                               1\n",
       "Comic Drama                          1\n",
       "Grand Comic Pantomime                1\n",
       "Petite Farce                         1\n",
       "Name: genre, dtype: int64"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "counts[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To display the top ten genres we could just change the `ascending` argument above to `False` (or remove it, as `False` is the default).\n",
    "\n",
    "Note that if we call the `describe()` function with purely numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. We can demonstrate this by counting and describing the unique `source` values. The `source` contains the canvas ID that identifies an individual playbill, so the following function will produce a simple numeric analysis of the number of performances recorded on each playbill."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    1076.000000\n",
       "mean        2.153346\n",
       "std         0.991920\n",
       "min         1.000000\n",
       "25%         2.000000\n",
       "50%         2.000000\n",
       "75%         3.000000\n",
       "max        12.000000\n",
       "Name: source, dtype: float64"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.source.value_counts().describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At the time of writing we have a minimum of 1 and a maximum of 12 performances recorded on a playbill, with a mean of 2.15."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this notebook, we found out how to load all of our performance data from the *In the Spotlight* crowdsourcing projects into a pandas dataframe. We then run some functions to perform a basic anaysis of this dataframe.\n",
    "\n",
    "For an introduction producing visualisations of this data using Python see [*An Introduction to Visualising In the Spotlight Data Using Python*](intro_to_visualising_its_data_using_python.ipynb)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}