{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# An Introduction to Analysing LibCrowds Results Data Using Python\n", "\n", "The purpose of this notebook is to introduce a key Python library, [pandas](https://pandas.pydata.org/), that can be used to manipulate and analyse LibCrowds results data.\n", "\n", "The pandas library provides access to high-performance data analysis tools via an accessible Python interface. We will use the library to load all of our *In the Spotlight* results into a structure called a dataframe. A dataframe is a two-dimensional data structure, similar to a spreadsheet, that accepts many different kinds of input. As everything is stored in memory, rather than on disk, the only limitation to this type of data structure is going to be the amount of RAM installed on the computer. However, for any modern computer this is unlikely to be an issue until we reach tens of millions of results.\n", "\n", "We begin by importing pandas." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "import pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The dataset\n", "\n", "For this notebook, our input will be all of the performance data collected so far via the crowdsourcing projects presented on [*In the Spotlight*](https://www.libcrowds.com/collections/playbills). In a previous notebook we saw how these results are modelled in their raw form. However, for the purposes of this notebook we have converted this raw data into a table of performances, where each row contains the known data for a specific performance (e.g. title, date, genre and theatre). The way in which this was achieved is slightly too complex to introduce here but for those interested the scripts can be found in [this repository](https://github.com/LibCrowds/data).\n", "\n", "All we currently need to know about the code block below is that it loads our dataframe of performance data." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "module_path = os.path.abspath(os.path.join('..', 'data', 'scripts'))\n", "if module_path not in sys.path:\n", " sys.path.append(module_path)\n", "from get_its_performances import get_performances_df\n", "df = get_performances_df()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [head](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) function returns the first *n* rows of a dataset (defaults to 5); we can use this function to take a first glance at our dataframe." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>title</th><th>date</th><th>genre</th><th>link</th><th>theatre</th><th>city</th><th>source</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Pageantry</td><td>NaN</td><td>NaN</td><td>http://access.bl.uk/item/viewer/ark:/81055/vdc...</td><td>Theatre Royal, Margate</td><td>Margate</td><td>https://api.bl.uk/metadata/iiif/ark:/81055/vdc...</td></tr>\n",
" <tr><th>1</th><td>The Hypocrite</td><td>NaN</td><td>Comedy</td><td>http://access.bl.uk/item/viewer/ark:/81055/vdc...</td><td>Theatre Royal, Margate</td><td>Margate</td><td>https://api.bl.uk/metadata/iiif/ark:/81055/vdc...</td></tr>\n",
" <tr><th>2</th><td>The Padlock</td><td>NaN</td><td>Musical Farce</td><td>http://access.bl.uk/item/viewer/ark:/81055/vdc...</td><td>Theatre Royal, Margate</td><td>Margate</td><td>https://api.bl.uk/metadata/iiif/ark:/81055/vdc...</td></tr>\n",
" <tr><th>3</th><td>The Village Lawyer</td><td>NaN</td><td>Farce</td><td>http://access.bl.uk/item/viewer/ark:/81055/vdc...</td><td>Theatre Royal, Margate</td><td>Margate</td><td>https://api.bl.uk/metadata/iiif/ark:/81055/vdc...</td></tr>\n",
" <tr><th>4</th><td>Death of Gen. Wolfe</td><td>NaN</td><td>Ballet</td><td>http://access.bl.uk/item/viewer/ark:/81055/vdc...</td><td>Theatre Royal, Margate</td><td>Margate</td><td>https://api.bl.uk/metadata/iiif/ark:/81055/vdc...</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " title date genre \\\n", "0 Pageantry NaN NaN \n", "1 The Hypocrite NaN Comedy \n", "2 The Padlock NaN Musical Farce \n", "3 The Village Lawyer NaN Farce \n", "4 Death of Gen. Wolfe NaN Ballet \n", "\n", " link theatre \\\n", "0 http://access.bl.uk/item/viewer/ark:/81055/vdc... Theatre Royal, Margate \n", "1 http://access.bl.uk/item/viewer/ark:/81055/vdc... Theatre Royal, Margate \n", "2 http://access.bl.uk/item/viewer/ark:/81055/vdc... Theatre Royal, Margate \n", "3 http://access.bl.uk/item/viewer/ark:/81055/vdc... Theatre Royal, Margate \n", "4 http://access.bl.uk/item/viewer/ark:/81055/vdc... Theatre Royal, Margate \n", "\n", " city source \n", "0 Margate https://api.bl.uk/metadata/iiif/ark:/81055/vdc... \n", "1 Margate https://api.bl.uk/metadata/iiif/ark:/81055/vdc... \n", "2 Margate https://api.bl.uk/metadata/iiif/ark:/81055/vdc... \n", "3 Margate https://api.bl.uk/metadata/iiif/ark:/81055/vdc... \n", "4 Margate https://api.bl.uk/metadata/iiif/ark:/81055/vdc... " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The remainder of this notebook will introduce a few basic functions that we can use to begin analysing and manipulating our dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarising dataframes\n", "\n", "The [describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) function generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>title</th><th>date</th><th>genre</th><th>link</th><th>theatre</th><th>city</th><th>source</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>count</th><td>2317</td><td>989</td><td>1298</td><td>2317</td><td>2317</td><td>2317</td><td>2317</td></tr>\n",
" <tr><th>unique</th><td>1305</td><td>438</td><td>147</td><td>1076</td><td>6</td><td>6</td><td>1076</td></tr>\n",
" <tr><th>top</th><td>Rosina</td><td>1830-11-23</td><td>Farce</td><td>http://access.bl.uk/item/viewer/ark:/81055/vdc...</td><td>Miscellaneous Plymouth theatres</td><td>Plymouth</td><td>https://api.bl.uk/metadata/iiif/ark:/81055/vdc...</td></tr>\n",
" <tr><th>freq</th><td>13</td><td>7</td><td>221</td><td>12</td><td>1230</td><td>1230</td><td>12</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " title date genre \\\n", "count 2317 989 1298 \n", "unique 1305 438 147 \n", "top Rosina 1830-11-23 Farce \n", "freq 13 7 221 \n", "\n", " link \\\n", "count 2317 \n", "unique 1076 \n", "top http://access.bl.uk/item/viewer/ark:/81055/vdc... \n", "freq 12 \n", "\n", " theatre city \\\n", "count 2317 2317 \n", "unique 6 6 \n", "top Miscellaneous Plymouth theatres Plymouth \n", "freq 1230 1230 \n", "\n", " source \n", "count 2317 \n", "unique 1076 \n", "top https://api.bl.uk/metadata/iiif/ark:/81055/vdc... \n", "freq 12 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This simple function already presents some interesting results. At the time of writing, we can see that we have over 140 unique genres. \n", "\n", "We might be curious about what some of the more unusual genres are. To find them, we can use the [value_counts](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) function, which returns an object containing counts of unique values. Below, we call this function with the argument `ascending=True`, to sort the output in ascending order." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "counts = df.genre.value_counts(ascending=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then run the following command to display the first ten rows."
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Fairy Spectacle 1\n", "Historical Melodrama 1\n", "National Play 1\n", "Drawing Room Entertainment 1\n", "Grand National Military Spectacle 1\n", "Masquerade 1\n", "Sketch 1\n", "Comic Drama 1\n", "Grand Comic Pantomime 1\n", "Petite Farce 1\n", "Name: genre, dtype: int64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To display the top ten genres we could just change the `ascending` argument above to `False` (or remove it, as `False` is the default).\n", "\n", "Note that if we call the `describe()` function with purely numeric data, the result’s index will include count, mean, std, min and max, as well as the 25th, 50th and 75th percentiles. We can demonstrate this by counting and describing the unique `source` values. The `source` column contains the canvas ID that identifies an individual playbill, so the following command will produce a simple numeric analysis of the number of performances recorded on each playbill." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 1076.000000\n", "mean 2.153346\n", "std 0.991920\n", "min 1.000000\n", "25% 2.000000\n", "50% 2.000000\n", "75% 3.000000\n", "max 12.000000\n", "Name: source, dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.source.value_counts().describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At the time of writing we have a minimum of 1 and a maximum of 12 performances recorded on a playbill, with a mean of 2.15." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In this notebook, we found out how to load all of our performance data from the *In the Spotlight* crowdsourcing projects into a pandas dataframe. 
We then ran some functions to perform a basic analysis of this dataframe.\n", "\n", "For an introduction to producing visualisations of this data using Python see [*An Introduction to Visualising In the Spotlight Data Using Python*](intro_to_visualising_its_data_using_python.ipynb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }