{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Analyse a series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "from IPython.display import Image as DImage\n",
    "from IPython.core.display import display, HTML\n",
    "import series_details\n",
    "\n",
    "# Plotly helps us make pretty charts\n",
    "import plotly.offline as py\n",
    "import plotly.graph_objs as go\n",
    "\n",
    "# This lets Plotly draw charts in cells\n",
    "py.init_notebook_mode()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook is for analysing a series that you've already harvested. If you haven't harvested any data yet, then you need to go back to the ['Harvesting a series' notebook](Harvesting series.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# What series do you want to analyse?\n",
    "# Insert the series id between the quotes.\n",
    "series = 'A6122'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the CSV data for the specified series into a dataframe. Parse the dates as dates!\n",
    "df = pd.read_csv(os.path.join('data', '{}.csv'.format(series.replace('/', '-'))), parse_dates=['start_date', 'end_date'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get some summary data\n",
    "\n",
    "We're going to create a simple summary of some of the main characteristics of the series, as reflected in the harvested files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We're going to assemble some summary data about the series in a 'summary' dictionary\n",
    "# Let's create the dictionary and add the series identifier\n",
    "summary = {'series': series}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The 'shape' property returns the number of rows and columns. So 'shape[0]' gives us the number of items harvested.\n",
    "summary['total_items'] = df.shape[0]\n",
    "print(summary['total_items'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the frequency of the different access status categories\n",
    "summary['access_counts'] = df['access_status'].value_counts().to_dict()\n",
    "print(summary['access_counts'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the number of files that have been digitised\n",
    "summary['digitised_files'] = len(df.loc[df['digitised_status'] == True])\n",
    "print(summary['digitised_files'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the number of individual pages that have been digitised\n",
    "summary['digitised_pages'] = df['digitised_pages'].sum()\n",
    "print(summary['digitised_pages'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the earliest start date\n",
    "start = df['start_date'].min()\n",
    "try:\n",
    "    summary['date_from'] = start.year\n",
    "except AttributeError:\n",
    "    summary['date_from'] = None\n",
    "print(summary['date_from'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the latest end date\n",
    "end = df['end_date'].max()\n",
    "try:\n",
    "    summary['date_to'] = end.year\n",
    "except AttributeError:\n",
    "    summary['date_to'] = None\n",
    "print(summary['date_to'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's display all the summary data\n",
    "print('SERIES: {}'.format(summary['series']))\n",
    "print('Number of items: {:,}'.format(summary['total_items']))\n",
    "print('Access status:')\n",
    "for status, total in summary['access_counts'].items():\n",
    "    print('    {}: {:,}'.format(status, total))\n",
    "print('Contents dates: {} to {}'.format(summary['date_from'], summary['date_to']))\n",
    "print('Digitised files: {:,}'.format(summary['digitised_files']))\n",
    "print('Digitised pages: {:,}'.format(summary['digitised_pages']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that a slightly enhanced version of the code above is available in the `series_details` module that you can import into any notebook. So to create a summary of a series you can just:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the module\n",
    "import series_details\n",
    "\n",
    "# Call display_series() providing the series name and the dataframe\n",
    "series_details.display_summary(series, df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plot the contents dates\n",
    "\n",
    "Plotting the dates is a bit tricky. Each file can have both a start date and an end date. So if we want to plot the years covered by a file, we need to include all the years between the start and end dates. Also dates can be recorded at different levels of granularity, for specific days to just years. And sometimes there are no end dates recorded at all – what does this mean?\n",
    "\n",
    "The code in the cell below does a few things:\n",
    "\n",
    "* It fills any empty end dates with the start date from the same item. This probably means some content years will be missed, but it's the only date we can be certain of.\n",
    "* It loops through all the rows in the dataframe, then for each row it extracts the years between the start and end date. Currently this looks to see if the 1 January is covered by the date range, so if there's an exact start date after 1 January I don't think it will be captured. I need to investigate this further.\n",
    "* It combines all of the years into one big series and then totals up the frquency of each year.\n",
    "\n",
    "I'm sure this is not perfect, but it seems to produce useful results.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fill any blank end dates with start dates\n",
    "df['end_date'] = df[['end_date']].apply(lambda x: x.fillna(value=df['start_date']))\n",
    "\n",
    "# This is a bit tricky.\n",
    "# For each item we want to find the years that it has content from -- ie start_year <= year <= end_year.\n",
    "# Then we want to put all the years from all the items together and look at their frequency\n",
    "years = []\n",
    "for row in df.itertuples(index=False):\n",
    "    try:\n",
    "        years_in_range = pd.date_range(start=row.start_date, end=row.end_date, freq='AS').year.to_series()\n",
    "    except ValueError:\n",
    "        # No start date\n",
    "        pass\n",
    "    else:\n",
    "        years.append(years_in_range)\n",
    "year_counts = pd.concat(years).value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Put the resulting series in a dataframe so it looks pretty.\n",
    "year_totals = pd.DataFrame(year_counts)\n",
    "\n",
    "# Sort results by year\n",
    "year_totals.sort_index(inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Display the results\n",
    "year_totals.style.format({0: '{:,}'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's graph the frequency of content years\n",
    "plotly_data = [go.Bar(\n",
    "            x=year_totals.index.values, # The years are the index\n",
    "            y=year_totals[0]\n",
    "    )]\n",
    "\n",
    "# Add some labels\n",
    "layout = go.Layout(\n",
    "    title='Content dates',\n",
    "    xaxis=dict(\n",
    "        title='Year'\n",
    "    ),\n",
    "    yaxis=dict(\n",
    "        title='Number of items'\n",
    "    )\n",
    ")\n",
    "\n",
    "# Create a chart \n",
    "fig = go.Figure(data=plotly_data, layout=layout)\n",
    "py.iplot(fig, filename='series-dates-bar')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that a slightly enhanced version of the code above is available in the series_details module that you can import into any notebook. So to create a summary of a series you can just:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the module\n",
    "import series_details\n",
    "\n",
    "# Call plot_series() providing the series name and the dataframe\n",
    "fig = series_details.plot_dates(df)\n",
    "py.iplot(fig)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter by words in file titles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find titles containing a particular phrase -- in this case 'wife'\n",
    "# This creates a new dataframe\n",
    "# Try changing this to filter for other words\n",
    "\n",
    "search_term = 'wife'\n",
    "df_filtered = df.loc[df['title'].str.contains(search_term, case=False)].copy()\n",
    "df_filtered"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can plot this filtered dataframe just like the series\n",
    "fig = series_details.plot_dates(df)\n",
    "py.iplot(fig)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the new dataframe as a csv\n",
    "df_filtered.to_csv(os.path.join('data', '{}-{}.csv'.format(series.replace('/', '-'), search_term)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find titles containing one of two words -- ie an OR statement\n",
    "# Try changing this to filter for other words\n",
    "\n",
    "df_filtered = df.loc[df['title'].str.contains('chinese', case=False) | df['title'].str.contains(r'\\bah\\b', case=False)].copy()\n",
    "df_filtered"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter by date range"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_year = '1920'\n",
    "end_year = '1930'\n",
    "df_filtered = df[(df['start_date'] >= start_year) & (df['end_date'] <= end_year)]\n",
    "df_filtered"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## N-gram frequencies in file titles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import TextBlob for text analysis\n",
    "from textblob import TextBlob\n",
    "import nltk\n",
    "stopwords = nltk.corpus.stopwords.words('english')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Combine all of the file titles into a single string\n",
    "title_text = a = df['title'].str.lower().str.cat(sep=' ')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "blob = TextBlob(title_text)\n",
    "words = [[word, count] for word, count in blob.lower().word_counts.items() if word not in stopwords]\n",
    "word_counts = pd.DataFrame(words).rename({0: 'word', 1: 'count'}, axis=1).sort_values(by='count', ascending=False)\n",
    "word_counts[:25].style.format({'count': '{:,}'}).bar(subset=['count'], color='#d65f5f').set_properties(subset=['count'], **{'width': '300px'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_ngram_counts(text, size):\n",
    "    blob = TextBlob(text)\n",
    "    # Extract n-grams as WordLists, then convert to a list of strings\n",
    "    ngrams = [' '.join(ngram).lower() for ngram in blob.lower().ngrams(size)]\n",
    "    # Convert to dataframe then count values and rename columns\n",
    "    ngram_counts = pd.DataFrame(ngrams)[0].value_counts().rename_axis('ngram').reset_index(name='count')\n",
    "    return ngram_counts\n",
    "    \n",
    "def display_top_ngrams(text, size):\n",
    "    ngram_counts = get_ngram_counts(text, size)\n",
    "    # Display top 25 results as a bar chart\n",
    "    display(ngram_counts[:25].style.format({'count': '{:,}'}).bar(subset=['count'], color='#d65f5f').set_properties(subset=['count'], **{'width': '300px'}))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display_top_ngrams(title_text, 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display_top_ngrams(title_text, 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}