# An Introduction to Analysing LibCrowds Results Data Using Python

The purpose of this notebook is to introduce a key Python library, [pandas](https://pandas.pydata.org/), that can be used to manipulate and analyse LibCrowds results data.

The pandas library provides high-performance data analysis tools through an accessible Python interface. We will use the library to load all of our *In the Spotlight* results into a structure called a dataframe. A dataframe is a two-dimensional data structure, similar to a spreadsheet, that accepts many different kinds of input. Because a dataframe is held in memory, rather than on disk, its size is limited mainly by the amount of RAM available on the computer. However, for any modern computer this is unlikely to be an issue until we reach tens of millions of results.

We begin by importing pandas.

In [23]:
import pandas
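
As a minimal sketch of the dataframe idea described above, the following builds a small dataframe from a list of dictionaries and checks how much memory it occupies. The rows and the name `example_df` are invented for illustration only; they are not part of the LibCrowds data scripts.

```python
import pandas

# Hypothetical rows, loosely modelled on the performance table used later on.
rows = [
    {"title": "The Padlock", "genre": "Musical Farce", "city": "Margate"},
    {"title": "The Hypocrite", "genre": "Comedy", "city": "Margate"},
    {"title": "Pageantry", "genre": None, "city": "Margate"},
]

# A list of dictionaries is one of the many kinds of input a dataframe accepts.
example_df = pandas.DataFrame(rows)

# shape is a tuple of (number of rows, number of columns).
print(example_df.shape)

# memory_usage(deep=True) reports the bytes used by each column, including
# the contents of string values; summing gives a total for the dataframe.
total_bytes = int(example_df.memory_usage(deep=True).sum())
print(total_bytes)
```

Checking `memory_usage` on a small sample like this is one rough way to estimate how much RAM a much larger dataframe of the same shape might need.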

## The dataset

For this notebook, our input will be all of the performance data collected so far via the crowdsourcing projects presented on [*In the Spotlight*](https://www.libcrowds.com/collections/playbills). In a previous notebook we saw how these results are modelled in their raw form. For the purposes of this notebook, however, we have converted this raw data into a table of performances, where each row contains the known data for a specific performance (e.g. title, date, genre and theatre). The way in which this was achieved is slightly too complex to introduce here but, for those interested, the scripts can be found in [this repository](https://github.com/LibCrowds/data).

All we currently need to know about the code block below is that it loads our dataframe of performance data.

In [24]:
import os
import sys
module_path = os.path.abspath(os.path.join('..', 'data', 'scripts'))
if module_path not in sys.path:
    sys.path.append(module_path)
from get_its_performances import get_performances_df
df = get_performances_df()

The [head](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method returns the first *n* rows of a dataframe (*n* defaults to 5); we can use it to take a first glance at our data.

In [25]:
df.head()

| | title | date | genre | link | theatre | city | source |
|---|---|---|---|---|---|---|---|
| 0 | Pageantry | | | http://access.bl.uk/item/viewer/ark:/81055/vdc... | Theatre Royal, Margate | Margate | https://api.bl.uk/metadata/iiif/ark:/81055/vdc... |
| 1 | The Hypocrite | | Comedy | http://access.bl.uk/item/viewer/ark:/81055/vdc... | Theatre Royal, Margate | Margate | https://api.bl.uk/metadata/iiif/ark:/81055/vdc... |
| 2 | The Padlock | | Musical Farce | http://access.bl.uk/item/viewer/ark:/81055/vdc... | Theatre Royal, Margate | Margate | https://api.bl.uk/metadata/iiif/ark:/81055/vdc... |
| 3 | The Village Lawyer | | Farce | http://access.bl.uk/item/viewer/ark:/81055/vdc... | Theatre Royal, Margate | Margate | https://api.bl.uk/metadata/iiif/ark:/81055/vdc... |
| 4 | Death of Gen. Wolfe | | Ballet | http://access.bl.uk/item/viewer/ark:/81055/vdc... | Theatre Royal, Margate | Margate | https://api.bl.uk/metadata/iiif/ark:/81055/vdc... |
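
To show how the *n* argument of `head` behaves, here is a small self-contained sketch. It uses a hypothetical six-row dataframe called `sample_df` rather than the real performance data, so the values (including the missing genres) are illustrative only.

```python
import pandas

# Hypothetical stand-in for the performance dataframe (six rows).
sample_df = pandas.DataFrame({
    "title": [
        "Pageantry", "The Hypocrite", "The Padlock",
        "The Village Lawyer", "Death of Gen. Wolfe", "Rosina",
    ],
    "genre": [None, "Comedy", "Musical Farce", "Farce", "Ballet", None],
})

# With no argument, head() returns the first five rows.
default_head = sample_df.head()

# Passing n returns the first n rows instead.
first_two = sample_df.head(2)

print(len(default_head), len(first_two))
print(first_two)
```

The same call works on the real dataframe, for example `df.head(20)` to look at the first twenty performances.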


The remainder of this notebook will introduce a few basic functions that we can use to begin analysing and manipulating our dataset.

## Summarising dataframes

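The [describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method generates descriptive statistics that summarise the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. For non-numeric columns, such as those in our dataframe, it reports the number of non-missing values (count), the number of distinct values (unique), the most common value (top) and the number of times that value occurs (freq).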

In [26]:
df.describe()

| | title | date | genre | link | theatre | city | source |
|---|---|---|---|---|---|---|---|
| count | 2317 | 989 | 1298 | 2317 | 2317 | 2317 | 2317 |
| unique | 1305 | 438 | 147 | 1076 | 6 | 6 | 1076 |
| top | Rosina | 1830-11-23 | Farce | http://access.bl.uk/item/viewer/ark:/81055/vdc... | Miscellaneous Plymouth theatres | Plymouth | https://api.bl.uk/metadata/iiif/ark:/81055/vdc... |
| freq | 13 | 7 | 221 | 12 | 1230 | 1230 | 12 |


This simple function already presents some interesting results. At the time of writing, we can see that we have over 140 unique genres. 
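
The counts in the summary above differ between columns because missing values are left out. The following sketch, which uses a tiny invented `genre` column rather than the real data, shows how a single missing value affects each row of the `describe` output.

```python
import pandas

# Hypothetical column with four rows, one of which has no genre recorded.
sample_df = pandas.DataFrame({"genre": ["Farce", "Farce", "Comedy", None]})

summary = sample_df.describe()

# count ignores the missing value, so it is 3 rather than 4.
print(summary.loc["count", "genre"])
# unique is the number of distinct non-missing values.
print(summary.loc["unique", "genre"])
# top is the most common value and freq is how often it occurs.
print(summary.loc["top", "genre"], summary.loc["freq", "genre"])
```

Comparing `count` with the total number of rows (`len(sample_df)`) is a quick way to see how complete a column is.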

We might be curious about what some of the more unusual genres are. To find them, we can use the [value_counts](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) function, which returns an object containing counts of unique values. Below, we call this function with the argument `ascending=True`, to sort the output in ascending order.

In [27]:
counts = df.genre.value_counts(ascending=True)

We can then run the following command to display the first ten rows.

In [28]:
counts[:10]

Fairy Spectacle                      1
Historical Melodrama                 1
National Play                        1
Drawing Room Entertainment           1
Grand National Military Spectacle    1
Masquerade                           1
Sketch                               1
Comic Drama                          1
Grand Comic Pantomime                1
Petite Farce                         1
Name: genre, dtype: int64
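
For readers who want to try this without the full dataset, here is a minimal sketch of the same steps on an invented series of genres. The values are hypothetical, and `.iloc[:2]` is used here as an explicitly positional alternative to the `counts[:10]` slice above.

```python
import pandas

# Hypothetical genre values, including one missing entry.
genres = pandas.Series(
    ["Farce", "Comedy", "Farce", "Ballet", "Farce", "Comedy", None]
)

# Count each distinct value, rarest first. Missing values are
# excluded by default (dropna=True).
counts = genres.value_counts(ascending=True)
print(counts)

# Take the first two rows of the result by position.
rarest_two = counts.iloc[:2]
print(rarest_two)
```

The result is itself a pandas Series, indexed by genre, so the usual Series methods can be applied to it.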

To display the top ten genres we could just change the `ascending` argument above to `False` (or remove it, as `False` is the default).
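
As a rough illustration of the descending case, the sketch below relies on the default sort order and uses `head` to keep the most frequent values. The genre values are again invented for the example.

```python
import pandas

# Hypothetical genre values.
genres = pandas.Series(["Farce", "Comedy", "Farce", "Ballet", "Farce", "Comedy"])

# With no arguments, value_counts() sorts from most to least frequent.
top_counts = genres.value_counts()

# head(n) also works on a Series, so this keeps the two most common genres.
top_two = top_counts.head(2)
print(top_two)
```

On the real dataframe, the equivalent would be `df.genre.value_counts().head(10)`.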

Note that if we call `describe()` on purely numeric data, the result’s index will instead include count, mean, std, min and max, as well as the 25th, 50th and 75th percentiles. We can demonstrate this by counting the occurrences of each unique `source` value and describing the result. The `source` column contains the canvas ID that identifies an individual playbill, so the following expression will produce a simple numeric analysis of the number of performances recorded on each playbill.

In [29]:
df.source.value_counts().describe()

count    1076.000000
mean        2.153346
std         0.991920
min         1.000000
25%         2.000000
50%         2.000000
75%         3.000000
max        12.000000
Name: source, dtype: float64

At the time of writing we have a minimum of 1 and a maximum of 12 performances recorded on a playbill, with a mean of 2.15.
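
The same chain of calls can be followed step by step on a small invented example. The identifiers `bill-a`, `bill-b` and `bill-c` below are hypothetical placeholders for canvas IDs, not real values from the dataset.

```python
import pandas

# Hypothetical source column: one entry per recorded performance.
sources = pandas.Series(
    ["bill-a", "bill-a", "bill-a", "bill-b", "bill-c", "bill-c"]
)

# Step 1: count the performances recorded on each playbill.
per_bill = sources.value_counts()

# Step 2: describe those counts. As the counts are numeric, the result
# contains count, mean, std, min, the quartiles and max.
stats = per_bill.describe()
print(stats)

# Individual statistics can be looked up by label.
print(stats["mean"], stats["max"])
```

Here three playbills carry 3, 1 and 2 performances, so the mean is 2.0, the minimum is 1 and the maximum is 3.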

## Summary

In this notebook, we found out how to load all of our performance data from the *In the Spotlight* crowdsourcing projects into a pandas dataframe. We then ran some functions to perform a basic analysis of this dataframe.

For an introduction to producing visualisations of this data using Python, see [*An Introduction to Visualising In the Spotlight Data Using Python*](intro_to_visualising_its_data_using_python.ipynb).