{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Corrections of OCRd text in Trove's newspapers\n",
    "\n",
    "The full text of newspaper articles in Trove is extracted from page images using Optical Character Recognition (OCR). The accuracy of the OCR process is influenced by a range of factors including the font and the quality of the images. Many errors slip through. Volunteers have done a remarkable job in correcting these errors, but it's a huge task. This notebook explores the scale of OCR correction in Trove.\n",
    "\n",
    "There are two ways of getting data about OCR corrections using the Trove API. To get aggregate data you can include `has:corrections` in your query to limit the results to articles that have at least one OCR correction.\n",
    "\n",
    "To get information about the number of corrections made to the articles in your results, you can add the `reclevel=full` parameter to include the number of corrections and details of the most recent correction to the article record. For example, note the `correctionCount` and `lastCorrection` values in the record below:\n",
    "\n",
    "``` json\n",
    "{\n",
    "    \"article\": {\n",
    "        \"id\": \"41697877\",\n",
    "        \"url\": \"/newspaper/41697877\",\n",
    "        \"heading\": \"WRAGGE AND WEATHER CYCLES.\",\n",
    "        \"category\": \"Article\",\n",
    "        \"title\": {\n",
    "            \"id\": \"101\",\n",
    "            \"value\": \"Western Mail (Perth, WA : 1885 - 1954)\"\n",
    "        },\n",
    "        \"date\": \"1922-11-23\",\n",
    "        \"page\": 4,\n",
    "        \"pageSequence\": 4,\n",
    "        \"troveUrl\": \"https://trove.nla.gov.au/ndp/del/article/41697877\",\n",
    "        \"illustrated\": \"N\",\n",
    "        \"wordCount\": 1054,\n",
    "        \"correctionCount\": 1,\n",
    "        \"listCount\": 0,\n",
    "        \"tagCount\": 0,\n",
    "        \"commentCount\": 0,\n",
    "        \"lastCorrection\": {\n",
    "            \"by\": \"*anon*\",\n",
    "            \"lastupdated\": \"2016-09-12T07:08:57Z\"\n",
    "        },\n",
    "        \"identifier\": \"https://nla.gov.au/nla.news-article41697877\",\n",
    "        \"trovePageUrl\": \"https://trove.nla.gov.au/ndp/del/page/3522839\",\n",
    "        \"pdf\": \"https://trove.nla.gov.au/ndp/imageservice/nla.news-page3522839/print\"\n",
    "    }\n",
    "}\n",
    "```"
   ]
  },
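  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of how you might read these values once you have a record, the hypothetical helper below (`summarise_corrections` is not part of the Trove API or this notebook) pulls the correction details out of an article record. It assumes that `correctionCount` and `lastCorrection` may be missing from some records, so it falls back to `0` and `None`.\n",
    "\n",
    "```python\n",
    "def summarise_corrections(record):\n",
    "    '''\n",
    "    Hypothetical helper: extract correction details from an article record\n",
    "    that was requested with reclevel=full.\n",
    "    '''\n",
    "    article = record['article']\n",
    "    # Assumption: these fields might be absent, so use defaults\n",
    "    last = article.get('lastCorrection', {})\n",
    "    return {\n",
    "        'id': article['id'],\n",
    "        'corrections': int(article.get('correctionCount', 0)),\n",
    "        'last_corrected_by': last.get('by'),\n",
    "        'last_corrected_on': last.get('lastupdated')\n",
    "    }\n",
    "\n",
    "# A trimmed copy of the example record shown above\n",
    "sample = {\n",
    "    'article': {\n",
    "        'id': '41697877',\n",
    "        'correctionCount': 1,\n",
    "        'lastCorrection': {'by': '*anon*', 'lastupdated': '2016-09-12T07:08:57Z'}\n",
    "    }\n",
    "}\n",
    "\n",
    "summary = summarise_corrections(sample)\n",
    "print(summary)\n",
    "```\n",
    "\n",
    "Run against the trimmed example record, this should report one correction, last made by `*anon*`."
   ]
  },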
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting things up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import os\n",
    "import ipywidgets as widgets\n",
    "from operator import itemgetter # used for sorting\n",
    "import pandas as pd # makes manipulating the data easier\n",
    "import altair as alt\n",
    "from requests.adapters import HTTPAdapter\n",
    "from requests.packages.urllib3.util.retry import Retry\n",
    "from tqdm.auto import tqdm\n",
    "from IPython.display import display, HTML, FileLink, clear_output\n",
    "import math\n",
    "from collections import OrderedDict\n",
    "import time\n",
    "\n",
    "# Make sure data directory exists\n",
    "os.makedirs('data', exist_ok=True)\n",
    "\n",
    "# Create a session that will automatically retry on server errors\n",
    "s = requests.Session()\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])\n",
    "s.mount('http://', HTTPAdapter(max_retries=retries))\n",
    "s.mount('https://', HTTPAdapter(max_retries=retries))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "api_key = 'YOUR API KEY'\n",
    "print('Your API key is: {}'.format(api_key))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basic parameters for Trove API\n",
    "params = {\n",
    "    'facet': 'year', # Get the data aggregated by year.\n",
    "    'zone': 'newspaper',\n",
    "    'key': api_key,\n",
    "    'encoding': 'json',\n",
    "    'n': 0 # We don't need any records, just the facets!\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_results(params):\n",
    "    '''\n",
    "    Get JSON response data from the Trove API.\n",
    "    Parameters:\n",
    "        params\n",
    "    Returns:\n",
    "        JSON formatted response data from Trove API \n",
    "    '''\n",
    "    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params, timeout=30)\n",
    "    response.raise_for_status()\n",
    "    # print(response.url) # This shows us the url that's sent to the API\n",
    "    data = response.json()\n",
    "    return data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How many newspaper articles have corrections?\n",
    "\n",
    "Let's find out what proportion of newspaper articles have at least one OCR correction.\n",
    "\n",
    "First we'll get to the total number of newspaper articles in Trove."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "232,202,146\n"
     ]
    }
   ],
   "source": [
    "# Set the q parameter to a single space to get everything\n",
    "params['q'] = ' '\n",
    "\n",
    "# Get the data from the API\n",
    "data = get_results(params)\n",
    "\n",
    "# Extract the total number of results\n",
    "total = int(data['response']['zone'][0]['records']['total'])\n",
    "print('{:,}'.format(total))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll set the `q` parameter to `has:corrections` to limit the results to newspaper articles that have at least one correction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "12,743,782\n"
     ]
    }
   ],
   "source": [
    "# Set the q parameter to 'has:corrections' to limit results to articles with corrections\n",
    "params['q'] = 'has:corrections'\n",
    "\n",
    "# Get the data from the API\n",
    "data = get_results(params)\n",
    "\n",
    "# Extract the total number of results\n",
    "corrected = int(data['response']['zone'][0]['records']['total'])\n",
    "print('{:,}'.format(corrected))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculate the proportion of articles with corrections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5.49% of articles have at least one correction\n"
     ]
    }
   ],
   "source": [
    "print('{:.2%} of articles have at least one correction'.format(corrected/total))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You might be thinking that these figures don't seem to match the number of corrections by individuals displayed on the digitised newspapers home page. Remember that these figures show the **number of articles** that include corrections, while the individual scores show the **number of lines** corrected by each volunteer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Number of corrections by year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_facets(data):\n",
    "    '''\n",
    "    Loop through facets in Trove API response, saving terms and counts.\n",
    "    Parameters:\n",
    "        data  - JSON formatted response data from Trove API  \n",
    "    Returns:\n",
    "        A list of dictionaries containing: 'term', 'total_results'\n",
    "    '''\n",
    "    facets = []\n",
    "    try:\n",
    "        # The facets are buried a fair way down in the results\n",
    "        # Note that if you ask for more than one facet, you'll have use the facet['name'] param to find the one you want\n",
    "        # In this case there's only one facet, so we can just grab the list of terms (which are in fact the results by year)\n",
    "        for term in data['response']['zone'][0]['facets']['facet']['term']:\n",
    "            \n",
    "            # Get the year and the number of results, and convert them to integers, before adding to our results\n",
    "            facets.append({'term': term['search'], 'total_results': int(term['count'])})\n",
    "            \n",
    "        # Sort facets by year\n",
    "        facets.sort(key=itemgetter('term'))\n",
    "    except TypeError:\n",
    "        pass\n",
    "    return facets\n",
    "\n",
    "def get_facet_data(params, start_decade=180, end_decade=201):\n",
    "    '''\n",
    "    Loop throught the decades from 'start_decade' to 'end_decade',\n",
    "    getting the number of search results for each year from the year facet.\n",
    "    Combine all the results into a single list.\n",
    "    Parameters:\n",
    "        params - parameters to send to the API\n",
    "        start_decade\n",
    "        end_decade\n",
    "    Returns:\n",
    "        A list of dictionaries containing 'year', 'total_results' for the complete \n",
    "        period between the start and end decades.\n",
    "    '''\n",
    "    # Create a list to hold the facets data\n",
    "    facet_data = []\n",
    "    \n",
    "    # Loop through the decades\n",
    "    for decade in tqdm(range(start_decade, end_decade + 1)):\n",
    "        \n",
    "        #print(params)\n",
    "        # Avoid confusion by copying the params before we change anything.\n",
    "        search_params = params.copy()\n",
    "        \n",
    "        # Add decade value to params\n",
    "        search_params['l-decade'] = decade\n",
    "        \n",
    "        # Get the data from the API\n",
    "        data = get_results(search_params)\n",
    "        \n",
    "        # Get the facets from the data and add to facets_data\n",
    "        facet_data += get_facets(data)\n",
    "        \n",
    "    # Reomve the progress bar (you can also set leave=False in tqdm, but that still leaves white space in Jupyter Lab)\n",
    "    clear_output()\n",
    "    return facet_data"
   ]
  },
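  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the nesting that `get_facets()` walks through easier to see, here is a small sketch that runs the same lookup on a hand-built dictionary instead of a live API response. The structure of `mock_data` is an assumption based on the keys used in the function above, and it contains only two terms, using the 1803 and 1804 counts that appear in this notebook's results. Here both the year and the count are converted to integers.\n",
    "\n",
    "```python\n",
    "from operator import itemgetter\n",
    "\n",
    "# A stand-in for an API response, trimmed to the keys that get_facets() reads\n",
    "mock_data = {\n",
    "    'response': {\n",
    "        'zone': [\n",
    "            {\n",
    "                'facets': {\n",
    "                    'facet': {\n",
    "                        'term': [\n",
    "                            {'search': '1804', 'count': '619'},\n",
    "                            {'search': '1803', 'count': '526'}\n",
    "                        ]\n",
    "                    }\n",
    "                }\n",
    "            }\n",
    "        ]\n",
    "    }\n",
    "}\n",
    "\n",
    "# Follow the same path through the response as get_facets()\n",
    "terms = mock_data['response']['zone'][0]['facets']['facet']['term']\n",
    "\n",
    "# Convert the string values to integers and sort by year\n",
    "facets = [{'term': int(t['search']), 'total_results': int(t['count'])} for t in terms]\n",
    "facets.sort(key=itemgetter('term'))\n",
    "print(facets)\n",
    "```\n",
    "\n",
    "The result is a list of dictionaries, one per year, which is the shape that gets passed to `pd.DataFrame()` later on."
   ]
  },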
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "facet_data = get_facet_data(params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert our data to a dataframe called df\n",
    "df = pd.DataFrame(facet_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>total_results</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1803</td>\n",
       "      <td>526</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1804</td>\n",
       "      <td>619</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1805</td>\n",
       "      <td>430</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1806</td>\n",
       "      <td>367</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1807</td>\n",
       "      <td>134</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   term  total_results\n",
       "0  1803            526\n",
       "1  1804            619\n",
       "2  1805            430\n",
       "3  1806            367\n",
       "4  1807            134"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So which year has the most corrections?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "term               1915\n",
       "total_results    256092\n",
       "Name: 112, dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[df['total_results'].idxmax()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The fact that there's more corrections in newspaper articles from 1915, might make you think that people have been more motivated to correct articles relating to WWI. But if you look at the [total number of articles per year](visualise-total-newspaper-articles-by-state-year.ipynb), you'll see that there's been more articles digitised from 1915! The raw number of corrections is probably not very useful, so let's look instead at the *proportion* of articles each year that have at least one correction.\n",
    "\n",
    "To do that we'll re-harvest the facet data, but this time with a blank, or empty search, to get the total number of articles available from each year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reset the 'q' parameter\n",
    "# Use a an empty search (a single space) to get ALL THE ARTICLES\n",
    "params['q'] = ' '\n",
    "\n",
    "# Get facet data for all articles\n",
    "all_facet_data = get_facet_data(params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert the results to a dataframe\n",
    "df_total = pd.DataFrame(all_facet_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No we'll merge the number of articles by year with corrections with the total number of articles. Then we'll calculate the proportion with corrections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "def merge_df_with_total(df, df_total, how='left'):\n",
    "    '''\n",
    "    Merge dataframes containing search results with the total number of articles by year.\n",
    "    This is a left join on the year column. The total number of articles will be added as a column to \n",
    "    the existing results.\n",
    "    Once merged, do some reorganisation and calculate the proportion of search results.\n",
    "    Parameters:\n",
    "        df - the search results in a dataframe\n",
    "        df_total - total number of articles per year in a dataframe\n",
    "    Returns:\n",
    "        A dataframe with the following columns - 'year', 'total_results', 'total_articles', 'proportion' \n",
    "        (plus any other columns that are in the search results dataframe).\n",
    "    '''\n",
    "    # Merge the two dataframes on year\n",
    "    # Note that we're joining the two dataframes on the year column\n",
    "    df_merged = pd.merge(df, df_total, how=how, on='term')\n",
    "\n",
    "    # Rename the columns for convenience\n",
    "    df_merged.rename({'total_results_y': 'total_articles'}, inplace=True, axis='columns')\n",
    "    df_merged.rename({'total_results_x': 'total_results'}, inplace=True, axis='columns')\n",
    "\n",
    "    # Set blank values to zero to avoid problems\n",
    "    df_merged['total_results'] = df_merged['total_results'].fillna(0).astype(int)\n",
    "\n",
    "    # Calculate proportion by dividing the search results by the total articles\n",
    "    df_merged['proportion'] = df_merged['total_results'] / df_merged['total_articles']\n",
    "    return df_merged"
   ]
  },
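  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a small, self-contained sketch of the same merge using just two years, 1803 and 1915, with figures taken from this notebook's own results. The dataframe names `df_corrected` and `df_all` are made up for the illustration. It shows where the `_x` and `_y` suffixes come from, and why the proportion tells a different story from the raw count.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Articles with corrections (figures from this notebook's results)\n",
    "df_corrected = pd.DataFrame([\n",
    "    {'term': 1803, 'total_results': 526},\n",
    "    {'term': 1915, 'total_results': 256092}\n",
    "])\n",
    "\n",
    "# All articles for the same years\n",
    "df_all = pd.DataFrame([\n",
    "    {'term': 1803, 'total_results': 526},\n",
    "    {'term': 1915, 'total_results': 4734766}\n",
    "])\n",
    "\n",
    "# Both frames have a 'total_results' column, so pandas adds _x and _y suffixes\n",
    "merged = pd.merge(df_corrected, df_all, how='left', on='term')\n",
    "merged = merged.rename(columns={\n",
    "    'total_results_x': 'total_results',\n",
    "    'total_results_y': 'total_articles'\n",
    "})\n",
    "\n",
    "# Divide corrected articles by all articles to get the proportion\n",
    "merged['proportion'] = merged['total_results'] / merged['total_articles']\n",
    "print(merged)\n",
    "```\n",
    "\n",
    "Although 1915 has far more corrected articles than 1803, only about 5% of its articles have corrections, while every article from 1803 has at least one."
   ]
  },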
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>total_results</th>\n",
       "      <th>total_articles</th>\n",
       "      <th>proportion</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1803</td>\n",
       "      <td>526</td>\n",
       "      <td>526</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1804</td>\n",
       "      <td>619</td>\n",
       "      <td>619</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1805</td>\n",
       "      <td>430</td>\n",
       "      <td>430</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1806</td>\n",
       "      <td>367</td>\n",
       "      <td>367</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1807</td>\n",
       "      <td>134</td>\n",
       "      <td>134</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   term  total_results  total_articles  proportion\n",
       "0  1803            526             526         1.0\n",
       "1  1804            619             619         1.0\n",
       "2  1805            430             430         1.0\n",
       "3  1806            367             367         1.0\n",
       "4  1807            134             134         1.0"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Merge the search results with the total articles\n",
    "df_merged = merge_df_with_total(df, df_total)\n",
    "df_merged.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's visualise the results, showing both the number of articles with corrections each year, and the proportion of articles each year with corrections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<div id=\"altair-viz-ed6215e0f7854265b4edecfea832da96\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "  (function(spec, embedOpt){\n",
       "    let outputDiv = document.currentScript.previousElementSibling;\n",
       "    if (outputDiv.id !== \"altair-viz-ed6215e0f7854265b4edecfea832da96\") {\n",
       "      outputDiv = document.getElementById(\"altair-viz-ed6215e0f7854265b4edecfea832da96\");\n",
       "    }\n",
       "    const paths = {\n",
       "      \"vega\": \"https://cdn.jsdelivr.net/npm//vega@5?noext\",\n",
       "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm//vega-lib?noext\",\n",
       "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm//vega-lite@4.8.1?noext\",\n",
       "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm//vega-embed@6?noext\",\n",
       "    };\n",
       "\n",
       "    function loadScript(lib) {\n",
       "      return new Promise(function(resolve, reject) {\n",
       "        var s = document.createElement('script');\n",
       "        s.src = paths[lib];\n",
       "        s.async = true;\n",
       "        s.onload = () => resolve(paths[lib]);\n",
       "        s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
       "        document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "      });\n",
       "    }\n",
       "\n",
       "    function showError(err) {\n",
       "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
       "      throw err;\n",
       "    }\n",
       "\n",
       "    function displayChart(vegaEmbed) {\n",
       "      vegaEmbed(outputDiv, spec, embedOpt)\n",
       "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
       "    }\n",
       "\n",
       "    if(typeof define === \"function\" && define.amd) {\n",
       "      requirejs.config({paths});\n",
       "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
       "    } else if (typeof vegaEmbed === \"function\") {\n",
       "      displayChart(vegaEmbed);\n",
       "    } else {\n",
       "      loadScript(\"vega\")\n",
       "        .then(() => loadScript(\"vega-lite\"))\n",
       "        .then(() => loadScript(\"vega-embed\"))\n",
       "        .catch(showError)\n",
       "        .then(() => displayChart(vegaEmbed));\n",
       "    }\n",
       "  })({\"config\": {\"view\": {\"continuousWidth\": 400, \"continuousHeight\": 300}}, \"vconcat\": [{\"mark\": {\"type\": \"line\", \"point\": true}, \"encoding\": {\"tooltip\": [{\"type\": \"quantitative\", \"field\": \"term\", \"title\": \"Year\"}, {\"type\": \"quantitative\", \"field\": \"total_results\", \"format\": \",\", \"title\": \"Articles\"}], \"x\": {\"type\": \"quantitative\", \"axis\": {\"format\": \"c\", \"title\": \"Year\"}, \"field\": \"term\"}, \"y\": {\"type\": \"quantitative\", \"axis\": {\"format\": \",d\", \"title\": \"Number of articles with corrections\"}, \"field\": \"total_results\"}}, \"height\": 250, \"width\": 700}, {\"mark\": {\"type\": \"line\", \"color\": \"red\", \"point\": true}, \"encoding\": {\"color\": {\"value\": \"orange\"}, \"tooltip\": [{\"type\": \"quantitative\", \"field\": \"term\", \"title\": \"Year\"}, {\"type\": \"quantitative\", \"field\": \"proportion\", \"format\": \"%\", \"title\": \"Proportion\"}], \"x\": {\"type\": \"quantitative\", \"axis\": {\"format\": \"c\", \"title\": \"Year\"}, \"field\": \"term\"}, \"y\": {\"type\": \"quantitative\", \"axis\": {\"format\": \"%\", \"title\": \"Proportion of articles with corrections\"}, \"field\": \"proportion\"}}, \"height\": 250, \"width\": 700}], \"data\": {\"name\": \"data-dff7c4e8a360bd1667ee5e570716c2e6\"}, \"$schema\": \"https://vega.github.io/schema/vega-lite/v4.8.1.json\", \"datasets\": {\"data-dff7c4e8a360bd1667ee5e570716c2e6\": [{\"term\": 1803, \"total_results\": 526, \"total_articles\": 526, \"proportion\": 1.0}, {\"term\": 1804, \"total_results\": 619, \"total_articles\": 619, \"proportion\": 1.0}, {\"term\": 1805, \"total_results\": 430, \"total_articles\": 430, \"proportion\": 1.0}, {\"term\": 1806, \"total_results\": 367, \"total_articles\": 367, \"proportion\": 1.0}, {\"term\": 1807, \"total_results\": 134, \"total_articles\": 134, \"proportion\": 1.0}, {\"term\": 1808, \"total_results\": 155, \"total_articles\": 155, \"proportion\": 1.0}, {\"term\": 
1809, \"total_results\": 237, \"total_articles\": 237, \"proportion\": 1.0}, {\"term\": 1810, \"total_results\": 274, \"total_articles\": 274, \"proportion\": 1.0}, {\"term\": 1811, \"total_results\": 228, \"total_articles\": 228, \"proportion\": 1.0}, {\"term\": 1812, \"total_results\": 215, \"total_articles\": 215, \"proportion\": 1.0}, {\"term\": 1813, \"total_results\": 234, \"total_articles\": 234, \"proportion\": 1.0}, {\"term\": 1814, \"total_results\": 239, \"total_articles\": 239, \"proportion\": 1.0}, {\"term\": 1815, \"total_results\": 226, \"total_articles\": 226, \"proportion\": 1.0}, {\"term\": 1816, \"total_results\": 644, \"total_articles\": 644, \"proportion\": 1.0}, {\"term\": 1817, \"total_results\": 1067, \"total_articles\": 1068, \"proportion\": 0.9990636704119851}, {\"term\": 1818, \"total_results\": 1249, \"total_articles\": 1249, \"proportion\": 1.0}, {\"term\": 1819, \"total_results\": 1152, \"total_articles\": 1152, \"proportion\": 1.0}, {\"term\": 1820, \"total_results\": 1275, \"total_articles\": 1275, \"proportion\": 1.0}, {\"term\": 1821, \"total_results\": 1005, \"total_articles\": 1005, \"proportion\": 1.0}, {\"term\": 1822, \"total_results\": 1033, \"total_articles\": 1033, \"proportion\": 1.0}, {\"term\": 1823, \"total_results\": 1156, \"total_articles\": 1156, \"proportion\": 1.0}, {\"term\": 1824, \"total_results\": 1646, \"total_articles\": 1646, \"proportion\": 1.0}, {\"term\": 1825, \"total_results\": 3311, \"total_articles\": 3366, \"proportion\": 0.9836601307189542}, {\"term\": 1826, \"total_results\": 4337, \"total_articles\": 5201, \"proportion\": 0.8338781003653144}, {\"term\": 1827, \"total_results\": 4447, \"total_articles\": 6537, \"proportion\": 0.6802814746825762}, {\"term\": 1828, \"total_results\": 4047, \"total_articles\": 6726, \"proportion\": 0.6016949152542372}, {\"term\": 1829, \"total_results\": 4873, \"total_articles\": 8320, \"proportion\": 0.5856971153846153}, {\"term\": 1830, \"total_results\": 4610, 
\"total_articles\": 8369, \"proportion\": 0.5508423945513203}, {\"term\": 1831, \"total_results\": 5240, \"total_articles\": 10311, \"proportion\": 0.5081951314130541}, {\"term\": 1832, \"total_results\": 6061, \"total_articles\": 13732, \"proportion\": 0.44137780367025925}, {\"term\": 1833, \"total_results\": 6769, \"total_articles\": 15438, \"proportion\": 0.43846353154553697}, {\"term\": 1834, \"total_results\": 7432, \"total_articles\": 18704, \"proportion\": 0.3973481608212147}, {\"term\": 1835, \"total_results\": 9827, \"total_articles\": 20389, \"proportion\": 0.481975575064986}, {\"term\": 1836, \"total_results\": 9116, \"total_articles\": 20553, \"proportion\": 0.4435362234223714}, {\"term\": 1837, \"total_results\": 9338, \"total_articles\": 21697, \"proportion\": 0.43038208047195464}, {\"term\": 1838, \"total_results\": 10469, \"total_articles\": 25191, \"proportion\": 0.4155849311261959}, {\"term\": 1839, \"total_results\": 13323, \"total_articles\": 31159, \"proportion\": 0.4275811162104047}, {\"term\": 1840, \"total_results\": 16023, \"total_articles\": 40450, \"proportion\": 0.3961186650185414}, {\"term\": 1841, \"total_results\": 15690, \"total_articles\": 43583, \"proportion\": 0.36000275336713855}, {\"term\": 1842, \"total_results\": 16892, \"total_articles\": 44264, \"proportion\": 0.38161937466112417}, {\"term\": 1843, \"total_results\": 14316, \"total_articles\": 45968, \"proportion\": 0.31143404107205014}, {\"term\": 1844, \"total_results\": 13122, \"total_articles\": 51377, \"proportion\": 0.25540611557700915}, {\"term\": 1845, \"total_results\": 14566, \"total_articles\": 60228, \"proportion\": 0.241847645613336}, {\"term\": 1846, \"total_results\": 14629, \"total_articles\": 58794, \"proportion\": 0.24881790658910774}, {\"term\": 1847, \"total_results\": 15457, \"total_articles\": 59503, \"proportion\": 0.25976841503789727}, {\"term\": 1848, \"total_results\": 16680, \"total_articles\": 60302, \"proportion\": 0.27660774103678154}, 
{\"term\": 1849, \"total_results\": 19501, \"total_articles\": 59698, \"proportion\": 0.3266608596602901}, {\"term\": 1850, \"total_results\": 20429, \"total_articles\": 78726, \"proportion\": 0.25949495719330334}, {\"term\": 1851, \"total_results\": 22286, \"total_articles\": 93422, \"proportion\": 0.23855194707884653}, {\"term\": 1852, \"total_results\": 20551, \"total_articles\": 77328, \"proportion\": 0.26576401820815226}, {\"term\": 1853, \"total_results\": 25426, \"total_articles\": 89716, \"proportion\": 0.28340541263542735}, {\"term\": 1854, \"total_results\": 28992, \"total_articles\": 111809, \"proportion\": 0.2592993408401828}, {\"term\": 1855, \"total_results\": 32995, \"total_articles\": 152862, \"proportion\": 0.21584828145647708}, {\"term\": 1856, \"total_results\": 35817, \"total_articles\": 165260, \"proportion\": 0.21673121142442212}, {\"term\": 1857, \"total_results\": 37099, \"total_articles\": 167962, \"proportion\": 0.22087734130339004}, {\"term\": 1858, \"total_results\": 39362, \"total_articles\": 170465, \"proportion\": 0.23090957087965272}, {\"term\": 1859, \"total_results\": 44802, \"total_articles\": 192157, \"proportion\": 0.23315309876819476}, {\"term\": 1860, \"total_results\": 44029, \"total_articles\": 203666, \"proportion\": 0.21618237702905738}, {\"term\": 1861, \"total_results\": 44888, \"total_articles\": 222964, \"proportion\": 0.20132398055291437}, {\"term\": 1862, \"total_results\": 41461, \"total_articles\": 231360, \"proportion\": 0.17920556708160443}, {\"term\": 1863, \"total_results\": 44823, \"total_articles\": 236359, \"proportion\": 0.18963948908228584}, {\"term\": 1864, \"total_results\": 45558, \"total_articles\": 250078, \"proportion\": 0.18217516134965892}, {\"term\": 1865, \"total_results\": 48453, \"total_articles\": 270535, \"proportion\": 0.17910067089286044}, {\"term\": 1866, \"total_results\": 47554, \"total_articles\": 286531, \"proportion\": 0.1659645902188594}, {\"term\": 1867, \"total_results\": 47752, 
\"total_articles\": 294045, \"proportion\": 0.16239691203727322}, {\"term\": 1868, \"total_results\": 46846, \"total_articles\": 301851, \"proportion\": 0.15519577539912074}, {\"term\": 1869, \"total_results\": 48635, \"total_articles\": 324128, \"proportion\": 0.15004874617435088}, {\"term\": 1870, \"total_results\": 54727, \"total_articles\": 369786, \"proportion\": 0.14799640873370004}, {\"term\": 1871, \"total_results\": 47227, \"total_articles\": 357135, \"proportion\": 0.13223850924720343}, {\"term\": 1872, \"total_results\": 49756, \"total_articles\": 381901, \"proportion\": 0.13028507388040356}, {\"term\": 1873, \"total_results\": 54608, \"total_articles\": 419226, \"proportion\": 0.13025909652550177}, {\"term\": 1874, \"total_results\": 55932, \"total_articles\": 434720, \"proportion\": 0.12866212734633786}, {\"term\": 1875, \"total_results\": 57634, \"total_articles\": 465013, \"proportion\": 0.12394062101489636}, {\"term\": 1876, \"total_results\": 58990, \"total_articles\": 476043, \"proportion\": 0.12391737721172247}, {\"term\": 1877, \"total_results\": 57466, \"total_articles\": 497516, \"proportion\": 0.11550583297823587}, {\"term\": 1878, \"total_results\": 63363, \"total_articles\": 567221, \"proportion\": 0.11170778232822832}, {\"term\": 1879, \"total_results\": 63847, \"total_articles\": 605369, \"proportion\": 0.10546790469944778}, {\"term\": 1880, \"total_results\": 68413, \"total_articles\": 620926, \"proportion\": 0.11017899073319526}, {\"term\": 1881, \"total_results\": 66456, \"total_articles\": 663699, \"proportion\": 0.10012972748188562}, {\"term\": 1882, \"total_results\": 71668, \"total_articles\": 743779, \"proportion\": 0.09635657903759047}, {\"term\": 1883, \"total_results\": 74377, \"total_articles\": 777112, \"proportion\": 0.09570949875950957}, {\"term\": 1884, \"total_results\": 79263, \"total_articles\": 855504, \"proportion\": 0.09265064803905067}, {\"term\": 1885, \"total_results\": 80910, \"total_articles\": 908552, 
\"proportion\": 0.08905379108735659}, {\"term\": 1886, \"total_results\": 84160, \"total_articles\": 992953, \"proportion\": 0.08475728458446674}, {\"term\": 1887, \"total_results\": 86484, \"total_articles\": 1041609, \"proportion\": 0.08302923649853255}, {\"term\": 1888, \"total_results\": 93190, \"total_articles\": 1160131, \"proportion\": 0.08032713547004605}, {\"term\": 1889, \"total_results\": 102965, \"total_articles\": 1250597, \"proportion\": 0.08233267791302874}, {\"term\": 1890, \"total_results\": 103984, \"total_articles\": 1356077, \"proportion\": 0.0766800115332684}, {\"term\": 1891, \"total_results\": 103307, \"total_articles\": 1427624, \"proportion\": 0.0723628910693572}, {\"term\": 1892, \"total_results\": 103681, \"total_articles\": 1484823, \"proportion\": 0.06982717805421926}, {\"term\": 1893, \"total_results\": 101535, \"total_articles\": 1508559, \"proportion\": 0.06730595223653832}, {\"term\": 1894, \"total_results\": 101197, \"total_articles\": 1552223, \"proportion\": 0.06519488501330029}, {\"term\": 1895, \"total_results\": 104842, \"total_articles\": 1669993, \"proportion\": 0.06277990386786052}, {\"term\": 1896, \"total_results\": 108402, \"total_articles\": 1787172, \"proportion\": 0.060655605616023525}, {\"term\": 1897, \"total_results\": 109394, \"total_articles\": 1905923, \"proportion\": 0.05739686230765881}, {\"term\": 1898, \"total_results\": 116426, \"total_articles\": 2124671, \"proportion\": 0.05479718977667601}, {\"term\": 1899, \"total_results\": 131504, \"total_articles\": 2291285, \"proportion\": 0.05739312219998822}, {\"term\": 1900, \"total_results\": 147132, \"total_articles\": 2526021, \"proportion\": 0.05824654664391151}, {\"term\": 1901, \"total_results\": 147388, \"total_articles\": 2602286, \"proportion\": 0.05663789452811874}, {\"term\": 1902, \"total_results\": 142278, \"total_articles\": 2663426, \"proportion\": 0.05341916764347874}, {\"term\": 1903, \"total_results\": 136156, \"total_articles\": 2702366, 
\"proportion\": 0.05038399683832612}, {\"term\": 1904, \"total_results\": 134108, \"total_articles\": 2866915, \"proportion\": 0.04677780820149882}, {\"term\": 1905, \"total_results\": 138514, \"total_articles\": 3000008, \"proportion\": 0.0461712102101061}, {\"term\": 1906, \"total_results\": 144598, \"total_articles\": 3053141, \"proportion\": 0.04736040687279101}, {\"term\": 1907, \"total_results\": 160663, \"total_articles\": 3204029, \"proportion\": 0.050144053003265576}, {\"term\": 1908, \"total_results\": 161042, \"total_articles\": 3299942, \"proportion\": 0.048801463783302856}, {\"term\": 1909, \"total_results\": 159095, \"total_articles\": 3365867, \"proportion\": 0.04726716771637144}, {\"term\": 1910, \"total_results\": 169585, \"total_articles\": 3599953, \"proportion\": 0.047107559459804056}, {\"term\": 1911, \"total_results\": 165261, \"total_articles\": 3726874, \"proportion\": 0.044343060699127475}, {\"term\": 1912, \"total_results\": 172239, \"total_articles\": 3877702, \"proportion\": 0.04441780209000073}, {\"term\": 1913, \"total_results\": 173117, \"total_articles\": 3941035, \"proportion\": 0.04392678573014449}, {\"term\": 1914, \"total_results\": 230004, \"total_articles\": 4637240, \"proportion\": 0.04959933063632678}, {\"term\": 1915, \"total_results\": 256092, \"total_articles\": 4734766, \"proportion\": 0.0540875726487856}, {\"term\": 1916, \"total_results\": 234811, \"total_articles\": 4329234, \"proportion\": 0.05423846343256105}, {\"term\": 1917, \"total_results\": 216736, \"total_articles\": 4222288, \"proportion\": 0.05133141083696802}, {\"term\": 1918, \"total_results\": 214507, \"total_articles\": 4000486, \"proportion\": 0.053620235141430314}, {\"term\": 1919, \"total_results\": 172753, \"total_articles\": 3469713, \"proportion\": 0.04978884420699925}, {\"term\": 1920, \"total_results\": 149014, \"total_articles\": 3232280, \"proportion\": 0.046101822861880776}, {\"term\": 1921, \"total_results\": 138256, \"total_articles\": 
3424854, \"proportion\": 0.040368436143555314}, {\"term\": 1922, \"total_results\": 141086, \"total_articles\": 3552184, \"proportion\": 0.03971810018850375}, {\"term\": 1923, \"total_results\": 160246, \"total_articles\": 3825199, \"proportion\": 0.041892199595367455}, {\"term\": 1924, \"total_results\": 168622, \"total_articles\": 4114212, \"proportion\": 0.04098524820791928}, {\"term\": 1925, \"total_results\": 167465, \"total_articles\": 4075498, \"proportion\": 0.04109068388697529}, {\"term\": 1926, \"total_results\": 166825, \"total_articles\": 4057541, \"proportion\": 0.04111480327617145}, {\"term\": 1927, \"total_results\": 177836, \"total_articles\": 4094761, \"proportion\": 0.043430129377514344}, {\"term\": 1928, \"total_results\": 176537, \"total_articles\": 4177998, \"proportion\": 0.04225396948490641}, {\"term\": 1929, \"total_results\": 200171, \"total_articles\": 4400731, \"proportion\": 0.04548585223682156}, {\"term\": 1930, \"total_results\": 184798, \"total_articles\": 4253542, \"proportion\": 0.04344567421692321}, {\"term\": 1931, \"total_results\": 160265, \"total_articles\": 3881774, \"proportion\": 0.04128653548609476}, {\"term\": 1932, \"total_results\": 160669, \"total_articles\": 3870855, \"proportion\": 0.04150736723540407}, {\"term\": 1933, \"total_results\": 172074, \"total_articles\": 4052182, \"proportion\": 0.0424645289870988}, {\"term\": 1934, \"total_results\": 184255, \"total_articles\": 4166726, \"proportion\": 0.04422057029907894}, {\"term\": 1935, \"total_results\": 177757, \"total_articles\": 4257870, \"proportion\": 0.04174786923978421}, {\"term\": 1936, \"total_results\": 174635, \"total_articles\": 4366927, \"proportion\": 0.039990363933264744}, {\"term\": 1937, \"total_results\": 168481, \"total_articles\": 4352108, \"proportion\": 0.03871250437718917}, {\"term\": 1938, \"total_results\": 174589, \"total_articles\": 4346573, \"proportion\": 0.04016704654448459}, {\"term\": 1939, \"total_results\": 179529, 
\"total_articles\": 4115148, \"proportion\": 0.04362637747172155}, {\"term\": 1940, \"total_results\": 166537, \"total_articles\": 3486062, \"proportion\": 0.04777224271972214}, {\"term\": 1941, \"total_results\": 141482, \"total_articles\": 3072953, \"proportion\": 0.046041055623044023}, {\"term\": 1942, \"total_results\": 138325, \"total_articles\": 2463771, \"proportion\": 0.056143610749538005}, {\"term\": 1943, \"total_results\": 125143, \"total_articles\": 2181583, \"proportion\": 0.05736339162892267}, {\"term\": 1944, \"total_results\": 119428, \"total_articles\": 2224588, \"proportion\": 0.05368544647368412}, {\"term\": 1945, \"total_results\": 151360, \"total_articles\": 2402976, \"proportion\": 0.06298856085121117}, {\"term\": 1946, \"total_results\": 129226, \"total_articles\": 2702599, \"proportion\": 0.04781545467899603}, {\"term\": 1947, \"total_results\": 125621, \"total_articles\": 2781436, \"proportion\": 0.04516408071226517}, {\"term\": 1948, \"total_results\": 126144, \"total_articles\": 2547357, \"proportion\": 0.04951956086249395}, {\"term\": 1949, \"total_results\": 139161, \"total_articles\": 2801116, \"proportion\": 0.04968055589272276}, {\"term\": 1950, \"total_results\": 152717, \"total_articles\": 2796460, \"proportion\": 0.0546108294057487}, {\"term\": 1951, \"total_results\": 122718, \"total_articles\": 2554760, \"proportion\": 0.04803504047346913}, {\"term\": 1952, \"total_results\": 126866, \"total_articles\": 2608754, \"proportion\": 0.04863087895600735}, {\"term\": 1953, \"total_results\": 129470, \"total_articles\": 2752367, \"proportion\": 0.04703951180928997}, {\"term\": 1954, \"total_results\": 164069, \"total_articles\": 2776461, \"proportion\": 0.0590928523757402}, {\"term\": 1955, \"total_results\": 21463, \"total_articles\": 279899, \"proportion\": 0.07668123144419951}, {\"term\": 1956, \"total_results\": 17537, \"total_articles\": 196851, \"proportion\": 0.08908768560992832}, {\"term\": 1957, \"total_results\": 9516, 
\"total_articles\": 123224, \"proportion\": 0.07722521586703888}, {\"term\": 1958, \"total_results\": 8187, \"total_articles\": 122829, \"proportion\": 0.066653640426935}, {\"term\": 1959, \"total_results\": 9682, \"total_articles\": 117623, \"proportion\": 0.08231383317888508}, {\"term\": 1960, \"total_results\": 7675, \"total_articles\": 111398, \"proportion\": 0.06889710766800122}, {\"term\": 1961, \"total_results\": 6521, \"total_articles\": 106039, \"proportion\": 0.06149624194871698}, {\"term\": 1962, \"total_results\": 6945, \"total_articles\": 105267, \"proportion\": 0.06597509190914531}, {\"term\": 1963, \"total_results\": 5754, \"total_articles\": 115187, \"proportion\": 0.049953553786451596}, {\"term\": 1964, \"total_results\": 5962, \"total_articles\": 116889, \"proportion\": 0.051005654937590364}, {\"term\": 1965, \"total_results\": 6105, \"total_articles\": 122965, \"proportion\": 0.04964827389907697}, {\"term\": 1966, \"total_results\": 6259, \"total_articles\": 120050, \"proportion\": 0.05213660974593919}, {\"term\": 1967, \"total_results\": 5976, \"total_articles\": 115512, \"proportion\": 0.051734884687305215}, {\"term\": 1968, \"total_results\": 5826, \"total_articles\": 121820, \"proportion\": 0.047824659333442786}, {\"term\": 1969, \"total_results\": 7346, \"total_articles\": 147908, \"proportion\": 0.049666008599940505}, {\"term\": 1970, \"total_results\": 8378, \"total_articles\": 163211, \"proportion\": 0.05133232441440834}, {\"term\": 1971, \"total_results\": 6777, \"total_articles\": 161232, \"proportion\": 0.04203259898779398}, {\"term\": 1972, \"total_results\": 6290, \"total_articles\": 160887, \"proportion\": 0.039095762864619264}, {\"term\": 1973, \"total_results\": 6000, \"total_articles\": 159593, \"proportion\": 0.037595633893717145}, {\"term\": 1974, \"total_results\": 6244, \"total_articles\": 158949, \"proportion\": 0.03928304047210111}, {\"term\": 1975, \"total_results\": 6763, \"total_articles\": 148081, \"proportion\": 
0.045670950358249876}, {\"term\": 1976, \"total_results\": 5615, \"total_articles\": 148816, \"proportion\": 0.03773115794000645}, {\"term\": 1977, \"total_results\": 5777, \"total_articles\": 151194, \"proportion\": 0.038209188195298754}, {\"term\": 1978, \"total_results\": 5459, \"total_articles\": 156115, \"proportion\": 0.034967812189731926}, {\"term\": 1979, \"total_results\": 6054, \"total_articles\": 163573, \"proportion\": 0.03701099814761605}, {\"term\": 1980, \"total_results\": 6818, \"total_articles\": 162249, \"proportion\": 0.04202183064302399}, {\"term\": 1981, \"total_results\": 5270, \"total_articles\": 139953, \"proportion\": 0.03765549863168349}, {\"term\": 1982, \"total_results\": 9136, \"total_articles\": 127408, \"proportion\": 0.07170664322491524}, {\"term\": 1983, \"total_results\": 4401, \"total_articles\": 114763, \"proportion\": 0.03834859667314378}, {\"term\": 1984, \"total_results\": 4868, \"total_articles\": 118605, \"proportion\": 0.04104380085156612}, {\"term\": 1985, \"total_results\": 5108, \"total_articles\": 128562, \"proportion\": 0.03973180255440954}, {\"term\": 1986, \"total_results\": 5043, \"total_articles\": 128312, \"proportion\": 0.03930263732152877}, {\"term\": 1987, \"total_results\": 5540, \"total_articles\": 127581, \"proportion\": 0.04342339376552935}, {\"term\": 1988, \"total_results\": 7102, \"total_articles\": 136944, \"proportion\": 0.05186061455777544}, {\"term\": 1989, \"total_results\": 7708, \"total_articles\": 140197, \"proportion\": 0.05497977845460317}, {\"term\": 1990, \"total_results\": 7978, \"total_articles\": 132568, \"proportion\": 0.0601804357009233}, {\"term\": 1991, \"total_results\": 8225, \"total_articles\": 131719, \"proportion\": 0.06244353510123824}, {\"term\": 1992, \"total_results\": 9154, \"total_articles\": 137408, \"proportion\": 0.06661911970190965}, {\"term\": 1993, \"total_results\": 9612, \"total_articles\": 137916, \"proportion\": 0.0696945967110415}, {\"term\": 1994, 
\"total_results\": 10195, \"total_articles\": 133386, \"proportion\": 0.07643230923785105}, {\"term\": 1995, \"total_results\": 11197, \"total_articles\": 132401, \"proportion\": 0.08456884766731369}, {\"term\": 1996, \"total_results\": 825, \"total_articles\": 47101, \"proportion\": 0.017515551686800704}, {\"term\": 1997, \"total_results\": 1134, \"total_articles\": 49822, \"proportion\": 0.02276102926418048}, {\"term\": 1998, \"total_results\": 1448, \"total_articles\": 56947, \"proportion\": 0.02542715156197868}, {\"term\": 1999, \"total_results\": 2726, \"total_articles\": 67005, \"proportion\": 0.040683531079770165}, {\"term\": 2000, \"total_results\": 1900, \"total_articles\": 44770, \"proportion\": 0.04243913334822426}, {\"term\": 2001, \"total_results\": 1685, \"total_articles\": 42081, \"proportion\": 0.04004182410113828}, {\"term\": 2002, \"total_results\": 1695, \"total_articles\": 42631, \"proportion\": 0.039759799207149726}, {\"term\": 2003, \"total_results\": 1630, \"total_articles\": 24075, \"proportion\": 0.06770508826583593}, {\"term\": 2004, \"total_results\": 1030, \"total_articles\": 22072, \"proportion\": 0.04666545849945632}, {\"term\": 2005, \"total_results\": 416, \"total_articles\": 18071, \"proportion\": 0.02302030878202645}, {\"term\": 2006, \"total_results\": 339, \"total_articles\": 17708, \"proportion\": 0.019143889767336796}, {\"term\": 2007, \"total_results\": 451, \"total_articles\": 18719, \"proportion\": 0.024093167370051818}, {\"term\": 2008, \"total_results\": 280, \"total_articles\": 17059, \"proportion\": 0.016413623307345096}, {\"term\": 2009, \"total_results\": 257, \"total_articles\": 5762, \"proportion\": 0.04460256855258591}, {\"term\": 2010, \"total_results\": 260, \"total_articles\": 5794, \"proportion\": 0.04487400759406282}, {\"term\": 2011, \"total_results\": 259, \"total_articles\": 5386, \"proportion\": 0.04808763460824359}, {\"term\": 2012, \"total_results\": 147, \"total_articles\": 4832, \"proportion\": 
0.030422185430463575}, {\"term\": 2013, \"total_results\": 225, \"total_articles\": 4904, \"proportion\": 0.04588091353996737}, {\"term\": 2014, \"total_results\": 144, \"total_articles\": 4718, \"proportion\": 0.03052140737600678}, {\"term\": 2015, \"total_results\": 303, \"total_articles\": 4819, \"proportion\": 0.06287611537663415}, {\"term\": 2016, \"total_results\": 8, \"total_articles\": 1422, \"proportion\": 0.005625879043600563}, {\"term\": 2017, \"total_results\": 14, \"total_articles\": 1094, \"proportion\": 0.012797074954296161}, {\"term\": 2018, \"total_results\": 3, \"total_articles\": 1100, \"proportion\": 0.0027272727272727275}, {\"term\": 2019, \"total_results\": 14, \"total_articles\": 1153, \"proportion\": 0.012142237640936688}]}}, {\"mode\": \"vega-lite\"});\n",
       "</script>"
      ],
      "text/plain": [
       "alt.VConcatChart(...)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Number of articles with corrections\n",
    "chart1 = alt.Chart(df_merged).mark_line(point=True).encode(\n",
    "        x=alt.X('term:Q', axis=alt.Axis(format='c', title='Year')),\n",
    "        y=alt.Y('total_results:Q', axis=alt.Axis(format=',d', title='Number of articles with corrections')),\n",
    "        tooltip=[alt.Tooltip('term:Q', title='Year'), alt.Tooltip('total_results:Q', title='Articles', format=',')]\n",
    "    ).properties(width=700, height=250)\n",
    "\n",
    "# Proportion of articles with corrections\n",
    "chart2 = alt.Chart(df_merged).mark_line(point=True, color='red').encode(\n",
    "        x=alt.X('term:Q', axis=alt.Axis(format='c', title='Year')),\n",
    "    \n",
    "        # This time we're showing the proportion (formatted as a percentage) on the Y axis\n",
    "        y=alt.Y('proportion:Q', axis=alt.Axis(format='%', title='Proportion of articles with corrections')),\n",
    "        tooltip=[alt.Tooltip('term:Q', title='Year'), alt.Tooltip('proportion:Q', title='Proportion', format='%')],\n",
    "        \n",
    "        # Make the charts different colors\n",
    "        color=alt.value('orange')\n",
    "    ).properties(width=700, height=250)\n",
    "\n",
    "# This is a shorthand way of stacking the charts on top of each other\n",
    "chart1 & chart2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is really interesting – it seems there's been a deliberate effort to get the earliest newspapers corrected."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Number of corrections by category\n",
    "\n",
    "Let's see how the number of corrections varies across categories. This time we'll use the `category` facet instead of `year`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "params['q'] = 'has:corrections'\n",
    "params['facet'] = 'category'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = get_results(params)\n",
    "facets = []\n",
    "for term in data['response']['zone'][0]['facets']['facet']['term']:\n",
    "    # Get the state and the number of results, and convert it to integers, before adding to our results\n",
    "    facets.append({'term': term['search'], 'total_results': int(term['count'])})\n",
    "df_categories = pd.DataFrame(facets)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>total_results</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Article</td>\n",
       "      <td>9707996</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Family Notices</td>\n",
       "      <td>1324644</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Advertising</td>\n",
       "      <td>1231444</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Detailed Lists, Results, Guides</td>\n",
       "      <td>485544</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Literature</td>\n",
       "      <td>9371</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              term  total_results\n",
       "0                          Article        9707996\n",
       "1                   Family Notices        1324644\n",
       "2                      Advertising        1231444\n",
       "3  Detailed Lists, Results, Guides         485544\n",
       "4                       Literature           9371"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_categories.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once again, the raw numbers are probably not all that useful, so let's get the total number of articles in each category and calculate the proportion that have at least one correction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Blank query\n",
    "params['q'] = ' '\n",
    "data = get_results(params)\n",
    "facets = []\n",
    "for term in data['response']['zone'][0]['facets']['facet']['term']:\n",
    "    # Get the state and the number of results, and convert it to integers, before adding to our results\n",
    "    facets.append({'term': term['search'], 'total_results': int(term['count'])})\n",
    "df_total_categories = pd.DataFrame(facets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll merge the two corrections by category data with the total articles per category and calculate the proportion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>total_results</th>\n",
       "      <th>total_articles</th>\n",
       "      <th>proportion</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Article</td>\n",
       "      <td>9707996</td>\n",
       "      <td>161358361</td>\n",
       "      <td>0.060164</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Family Notices</td>\n",
       "      <td>1324644</td>\n",
       "      <td>1913143</td>\n",
       "      <td>0.692392</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Advertising</td>\n",
       "      <td>1231444</td>\n",
       "      <td>42882886</td>\n",
       "      <td>0.028716</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Detailed Lists, Results, Guides</td>\n",
       "      <td>485544</td>\n",
       "      <td>26049761</td>\n",
       "      <td>0.018639</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Literature</td>\n",
       "      <td>9371</td>\n",
       "      <td>32539</td>\n",
       "      <td>0.287993</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Obituaries</td>\n",
       "      <td>6626</td>\n",
       "      <td>7004</td>\n",
       "      <td>0.946031</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Humour</td>\n",
       "      <td>6313</td>\n",
       "      <td>22693</td>\n",
       "      <td>0.278192</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>News</td>\n",
       "      <td>6035</td>\n",
       "      <td>7439</td>\n",
       "      <td>0.811265</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Law, Courts, And Crime</td>\n",
       "      <td>5244</td>\n",
       "      <td>6445</td>\n",
       "      <td>0.813654</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Sport And Games</td>\n",
       "      <td>4501</td>\n",
       "      <td>8982</td>\n",
       "      <td>0.501113</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Letters</td>\n",
       "      <td>2511</td>\n",
       "      <td>9137</td>\n",
       "      <td>0.274817</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Arts And Culture</td>\n",
       "      <td>1579</td>\n",
       "      <td>2241</td>\n",
       "      <td>0.704596</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Editorial</td>\n",
       "      <td>1480</td>\n",
       "      <td>9274</td>\n",
       "      <td>0.159586</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Puzzles</td>\n",
       "      <td>1403</td>\n",
       "      <td>29650</td>\n",
       "      <td>0.047319</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Classified Advertisements And Notices</td>\n",
       "      <td>1129</td>\n",
       "      <td>1291</td>\n",
       "      <td>0.874516</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Shipping Notices</td>\n",
       "      <td>1056</td>\n",
       "      <td>1164</td>\n",
       "      <td>0.907216</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Official Appointments And Notices</td>\n",
       "      <td>815</td>\n",
       "      <td>838</td>\n",
       "      <td>0.972554</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Weather</td>\n",
       "      <td>743</td>\n",
       "      <td>5223</td>\n",
       "      <td>0.142255</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Commerce And Business</td>\n",
       "      <td>666</td>\n",
       "      <td>1038</td>\n",
       "      <td>0.641618</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Reviews</td>\n",
       "      <td>594</td>\n",
       "      <td>898</td>\n",
       "      <td>0.661470</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Display Advertisement</td>\n",
       "      <td>247</td>\n",
       "      <td>282</td>\n",
       "      <td>0.875887</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     term  total_results  total_articles  \\\n",
       "0                                 Article        9707996       161358361   \n",
       "1                          Family Notices        1324644         1913143   \n",
       "2                             Advertising        1231444        42882886   \n",
       "3         Detailed Lists, Results, Guides         485544        26049761   \n",
       "4                              Literature           9371           32539   \n",
       "5                              Obituaries           6626            7004   \n",
       "6                                  Humour           6313           22693   \n",
       "7                                    News           6035            7439   \n",
       "8                  Law, Courts, And Crime           5244            6445   \n",
       "9                         Sport And Games           4501            8982   \n",
       "10                                Letters           2511            9137   \n",
       "11                       Arts And Culture           1579            2241   \n",
       "12                              Editorial           1480            9274   \n",
       "13                                Puzzles           1403           29650   \n",
       "14  Classified Advertisements And Notices           1129            1291   \n",
       "15                       Shipping Notices           1056            1164   \n",
       "16      Official Appointments And Notices            815             838   \n",
       "17                                Weather            743            5223   \n",
       "18                  Commerce And Business            666            1038   \n",
       "19                                Reviews            594             898   \n",
       "20                  Display Advertisement            247             282   \n",
       "\n",
       "    proportion  \n",
       "0     0.060164  \n",
       "1     0.692392  \n",
       "2     0.028716  \n",
       "3     0.018639  \n",
       "4     0.287993  \n",
       "5     0.946031  \n",
       "6     0.278192  \n",
       "7     0.811265  \n",
       "8     0.813654  \n",
       "9     0.501113  \n",
       "10    0.274817  \n",
       "11    0.704596  \n",
       "12    0.159586  \n",
       "13    0.047319  \n",
       "14    0.874516  \n",
       "15    0.907216  \n",
       "16    0.972554  \n",
       "17    0.142255  \n",
       "18    0.641618  \n",
       "19    0.661470  \n",
       "20    0.875887  "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_categories_merged = merge_df_with_total(df_categories, df_total_categories)\n",
    "df_categories_merged"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A lot of the categories have been added recently and don't contain a lot of articles. Some of these have a very high proportion of articles with corrections – 'Obituaries' for example. This suggests users are systematically categorising and correcting certain types of article.\n",
    "\n",
    "Let's focus on the main categories by filtering out those with less than 30,000 articles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>total_results</th>\n",
       "      <th>total_articles</th>\n",
       "      <th>proportion</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Article</td>\n",
       "      <td>9707996</td>\n",
       "      <td>161358361</td>\n",
       "      <td>0.060164</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Family Notices</td>\n",
       "      <td>1324644</td>\n",
       "      <td>1913143</td>\n",
       "      <td>0.692392</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Advertising</td>\n",
       "      <td>1231444</td>\n",
       "      <td>42882886</td>\n",
       "      <td>0.028716</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Detailed Lists, Results, Guides</td>\n",
       "      <td>485544</td>\n",
       "      <td>26049761</td>\n",
       "      <td>0.018639</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Literature</td>\n",
       "      <td>9371</td>\n",
       "      <td>32539</td>\n",
       "      <td>0.287993</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              term  total_results  total_articles  proportion\n",
       "0                          Article        9707996       161358361    0.060164\n",
       "1                   Family Notices        1324644         1913143    0.692392\n",
       "2                      Advertising        1231444        42882886    0.028716\n",
       "3  Detailed Lists, Results, Guides         485544        26049761    0.018639\n",
       "4                       Literature           9371           32539    0.287993"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_categories_filtered = df_categories_merged.loc[df_categories_merged['total_articles'] > 30000]\n",
    "df_categories_filtered"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we can visualise the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<div id=\"altair-viz-b169f68d955e4a959a34c2c8e9fade99\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "  (function(spec, embedOpt){\n",
       "    let outputDiv = document.currentScript.previousElementSibling;\n",
       "    if (outputDiv.id !== \"altair-viz-b169f68d955e4a959a34c2c8e9fade99\") {\n",
       "      outputDiv = document.getElementById(\"altair-viz-b169f68d955e4a959a34c2c8e9fade99\");\n",
       "    }\n",
       "    const paths = {\n",
       "      \"vega\": \"https://cdn.jsdelivr.net/npm//vega@5?noext\",\n",
       "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm//vega-lib?noext\",\n",
       "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm//vega-lite@4.8.1?noext\",\n",
       "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm//vega-embed@6?noext\",\n",
       "    };\n",
       "\n",
       "    function loadScript(lib) {\n",
       "      return new Promise(function(resolve, reject) {\n",
       "        var s = document.createElement('script');\n",
       "        s.src = paths[lib];\n",
       "        s.async = true;\n",
       "        s.onload = () => resolve(paths[lib]);\n",
       "        s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
       "        document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "      });\n",
       "    }\n",
       "\n",
       "    function showError(err) {\n",
       "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
       "      throw err;\n",
       "    }\n",
       "\n",
       "    function displayChart(vegaEmbed) {\n",
       "      vegaEmbed(outputDiv, spec, embedOpt)\n",
       "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
       "    }\n",
       "\n",
       "    if(typeof define === \"function\" && define.amd) {\n",
       "      requirejs.config({paths});\n",
       "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
       "    } else if (typeof vegaEmbed === \"function\") {\n",
       "      displayChart(vegaEmbed);\n",
       "    } else {\n",
       "      loadScript(\"vega\")\n",
       "        .then(() => loadScript(\"vega-lite\"))\n",
       "        .then(() => loadScript(\"vega-embed\"))\n",
       "        .catch(showError)\n",
       "        .then(() => displayChart(vegaEmbed));\n",
       "    }\n",
       "  })({\"config\": {\"view\": {\"continuousWidth\": 400, \"continuousHeight\": 300}}, \"hconcat\": [{\"mark\": \"bar\", \"encoding\": {\"x\": {\"type\": \"nominal\", \"field\": \"term\", \"title\": \"Category\"}, \"y\": {\"type\": \"quantitative\", \"field\": \"total_results\", \"title\": \"Articles with corrections\"}}}, {\"mark\": \"bar\", \"encoding\": {\"color\": {\"value\": \"orange\"}, \"x\": {\"type\": \"nominal\", \"field\": \"term\", \"title\": \"Category\"}, \"y\": {\"type\": \"quantitative\", \"axis\": {\"format\": \"%\", \"title\": \"Proportion of articles with corrections\"}, \"field\": \"proportion\"}}}], \"data\": {\"name\": \"data-cceac99486e6b6cf84046fc5854bfc68\"}, \"$schema\": \"https://vega.github.io/schema/vega-lite/v4.8.1.json\", \"datasets\": {\"data-cceac99486e6b6cf84046fc5854bfc68\": [{\"term\": \"Article\", \"total_results\": 9707996, \"total_articles\": 161358361, \"proportion\": 0.06016419564400508}, {\"term\": \"Family Notices\", \"total_results\": 1324644, \"total_articles\": 1913143, \"proportion\": 0.6923915253590558}, {\"term\": \"Advertising\", \"total_results\": 1231444, \"total_articles\": 42882886, \"proportion\": 0.028716444131115616}, {\"term\": \"Detailed Lists, Results, Guides\", \"total_results\": 485544, \"total_articles\": 26049761, \"proportion\": 0.01863909615140039}, {\"term\": \"Literature\", \"total_results\": 9371, \"total_articles\": 32539, \"proportion\": 0.2879928700943483}]}}, {\"mode\": \"vega-lite\"});\n",
       "</script>"
      ],
      "text/plain": [
       "alt.HConcatChart(...)"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_chart1 = alt.Chart(df_categories_filtered).mark_bar().encode(\n",
    "    x=alt.X('term:N', title='Category'),\n",
    "    y=alt.Y('total_results:Q', title='Articles with corrections')\n",
    ")\n",
    "\n",
    "cat_chart2 = alt.Chart(df_categories_filtered).mark_bar().encode(\n",
    "    x=alt.X('term:N', title='Category'),\n",
    "    y=alt.Y('proportion:Q', axis=alt.Axis(format='%', title='Proportion of articles with corrections')),\n",
    "    color=alt.value('orange')\n",
    ")\n",
    "\n",
    "cat_chart1 | cat_chart2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the rate of corrections is much higher in the 'Family Notices' category than any other. This probably reflects the work of family historians and others searching for, and correcting, articles containing particular names."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Number of corrections by newspaper\n",
    "\n",
    "How do rates of correction vary across newspapers? We can use the `title` facet to find out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "params['q'] = 'has:corrections'\n",
    "params['facet'] = 'title'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = get_results(params)\n",
    "facets = []\n",
    "for term in data['response']['zone'][0]['facets']['facet']['term']:\n",
    "    # Get the state and the number of results, and convert it to integers, before adding to our results\n",
    "    facets.append({'term': term['search'], 'total_results': int(term['count'])})\n",
    "df_newspapers = pd.DataFrame(facets)"
   ]
  },
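  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loop above walks a nested path in the API response to reach the facet terms. As a minimal sketch, the same traversal can be tried on a small hand-built dictionary. The `sample` fragment and the `facets_to_df` helper are illustrative assumptions, not part of the Trove API or this notebook; the sketch also assumes counts arrive as strings, which is why they are passed through `int()`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hand-built fragment shaped like the response the loop above expects (an assumption)\n",
    "sample = {'response': {'zone': [{'facets': {'facet': {'term': [\n",
    "    {'search': '35', 'count': '801023'},\n",
    "    {'search': '13', 'count': '757930'},\n",
    "]}}}]}}\n",
    "\n",
    "\n",
    "def facets_to_df(data):\n",
    "    # Hypothetical helper: collect each facet term and its count into a dataframe\n",
    "    terms = data['response']['zone'][0]['facets']['facet']['term']\n",
    "    rows = [{'term': t['search'], 'total_results': int(t['count'])} for t in terms]\n",
    "    return pd.DataFrame(rows)\n",
    "\n",
    "\n",
    "df_sample = facets_to_df(sample)\n",
    "```\n",
    "\n",
    "Wrapping the traversal in a function like this would also avoid repeating the same loop for the corrected and total queries."
   ]
  },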
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>total_results</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>35</td>\n",
       "      <td>801023</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>13</td>\n",
       "      <td>757930</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>11</td>\n",
       "      <td>347365</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>16</td>\n",
       "      <td>335237</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>30</td>\n",
       "      <td>304692</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   term  total_results\n",
       "0    35         801023\n",
       "1    13         757930\n",
       "2    11         347365\n",
       "3    16         335237\n",
       "4    30         304692"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_newspapers.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once again we'll calculate the proportion of articles corrected for each newspaper by getting the total number of articles for each newspaper on Trove."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "params['q'] = ' '"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = get_results(params)\n",
    "facets = []\n",
    "for term in data['response']['zone'][0]['facets']['facet']['term']:\n",
    "    # Get the state and the number of results, and convert it to integers, before adding to our results\n",
    "    facets.append({'term': term['search'], 'total_results': int(term['count'])})\n",
    "df_newspapers_total = pd.DataFrame(facets)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_newspapers_merged = merge_df_with_total(df_newspapers, df_newspapers_total, how='right')"
   ]
  },
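  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the merge step concrete, here is a rough sketch using made-up numbers. The function `merge_with_total_sketch` is a hypothetical stand-in: it only illustrates the kind of join and division that a helper such as `merge_df_with_total` might perform, and may differ from the real definition earlier in this notebook.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy data (made-up numbers, not from Trove)\n",
    "df_corrected = pd.DataFrame({'term': ['1', '2'], 'total_results': [5, 10]})\n",
    "df_all = pd.DataFrame({'term': ['1', '2', '3'], 'total_results': [10, 40, 7]})\n",
    "\n",
    "\n",
    "def merge_with_total_sketch(df, df_total, how='left'):\n",
    "    # Rename the totals column so it doesn't clash with the corrected counts\n",
    "    df_total = df_total.rename(columns={'total_results': 'total_articles'})\n",
    "    merged = pd.merge(df, df_total, how=how, on='term')\n",
    "    # Proportion of articles with at least one correction\n",
    "    merged['proportion'] = merged['total_results'] / merged['total_articles']\n",
    "    return merged\n",
    "\n",
    "\n",
    "df_sketch = merge_with_total_sketch(df_corrected, df_all, how='right')\n",
    "```\n",
    "\n",
    "With `how='right'`, every row in the totals dataframe is kept, so a title with no corrected articles still appears; its `total_results` and `proportion` values are `NaN` until they are filled in."
   ]
  },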
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_newspapers_merged.sort_values(by='proportion', ascending=False, inplace=True)\n",
    "df_newspapers_merged.rename(columns={'term': 'id'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>total_results</th>\n",
       "      <th>total_articles</th>\n",
       "      <th>proportion</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1628</th>\n",
       "      <td>729</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1614</th>\n",
       "      <td>154</td>\n",
       "      <td>21</td>\n",
       "      <td>21</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1338</th>\n",
       "      <td>5</td>\n",
       "      <td>1556</td>\n",
       "      <td>1556</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1522</th>\n",
       "      <td>1028</td>\n",
       "      <td>286</td>\n",
       "      <td>286</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1540</th>\n",
       "      <td>273</td>\n",
       "      <td>193</td>\n",
       "      <td>193</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id  total_results  total_articles  proportion\n",
       "1628   729              3               3         1.0\n",
       "1614   154             21              21         1.0\n",
       "1338     5           1556            1556         1.0\n",
       "1522  1028            286             286         1.0\n",
       "1540   273            193             193         1.0"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_newspapers_merged.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `title` facet only gives us the `id` number for each newspaper, not its title. Let's get all the titles and then merge them with the facet data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get all the newspaper titles\n",
    "title_params = {\n",
    "    'key': api_key,\n",
    "    'encoding': 'json',\n",
    "}\n",
    "\n",
    "title_data = s.get('https://api.trove.nla.gov.au/v2/newspaper/titles', params=params).json()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "titles = []\n",
    "for newspaper in title_data['response']['records']['newspaper']:\n",
    "    titles.append({'title': newspaper['title'], 'id': int(newspaper['id'])})\n",
    "df_titles = pd.DataFrame(titles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Canberra Community News (ACT : 1925 - 1927)</td>\n",
       "      <td>166</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Canberra Illustrated: A Quarterly Magazine (AC...</td>\n",
       "      <td>165</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Federal Capital Pioneer (Canberra, ACT : 1924 ...</td>\n",
       "      <td>69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Good Neighbour (ACT : 1950 - 1969)</td>\n",
       "      <td>871</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Student Notes/Canberra University College Stud...</td>\n",
       "      <td>665</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               title   id\n",
       "0        Canberra Community News (ACT : 1925 - 1927)  166\n",
       "1  Canberra Illustrated: A Quarterly Magazine (AC...  165\n",
       "2  Federal Capital Pioneer (Canberra, ACT : 1924 ...   69\n",
       "3                 Good Neighbour (ACT : 1950 - 1969)  871\n",
       "4  Student Notes/Canberra University College Stud...  665"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_titles.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1666, 2)"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_titles.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One problem with this list is that it also includes the titles of the Government Gazettes (this seems to be a bug in the API). Let's get the gazette titles and then subtract them from the complete list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get gazette titles\n",
    "gazette_data = s.get('https://api.trove.nla.gov.au/v2/gazette/titles', params=params).json()\n",
    "gazettes = []\n",
    "for gaz in gazette_data['response']['records']['newspaper']:\n",
    "    gazettes.append({'title': gaz['title'], 'id': int(gaz['id'])})\n",
    "df_gazettes = pd.DataFrame(gazettes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(38, 2)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_gazettes.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Subtract the gazettes from the list of titles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_titles_not_gazettes = df_titles[~df_titles['id'].isin(df_gazettes['id'])]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can merge the newspaper titles with the facet data using the `id` to link the two datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_newspapers_with_titles = pd.merge(df_titles_not_gazettes, df_newspapers_merged, how='left', on='id').fillna(0).sort_values(by='proportion', ascending=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert the totals back to integers\n",
    "df_newspapers_with_titles[['total_results', 'total_articles']] = df_newspapers_with_titles[['total_results', 'total_articles']].astype(int)"
   ]
  },
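  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why do the totals need converting back to integers? A left merge leaves `NaN` in rows with no match, and pandas stores a column containing `NaN` as floats. The sketch below shows this with made-up values; the dataframe names are illustrative only.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy data (made-up values): title 2 has no matching facet row\n",
    "df_left = pd.DataFrame({'id': [1, 2], 'title': ['A', 'B']})\n",
    "df_right = pd.DataFrame({'id': [1], 'total_results': [3]})\n",
    "\n",
    "merged = pd.merge(df_left, df_right, how='left', on='id')\n",
    "# The unmatched row holds NaN, so the integer column becomes float64\n",
    "print(merged['total_results'].dtype)\n",
    "\n",
    "# Fill the gaps with zero, then restore the integer type\n",
    "fixed = merged.fillna(0)\n",
    "fixed['total_results'] = fixed['total_results'].astype(int)\n",
    "print(fixed['total_results'].tolist())\n",
    "```\n",
    "\n",
    "Filling with zero is reasonable here because a missing facet row means no articles were found for that title."
   ]
  },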
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can display the newspapers with the highest rates of correction. Remember, that a `proportion` of 1.00 means that every available article has at least one correction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>id</th>\n",
       "      <th>total_results</th>\n",
       "      <th>total_articles</th>\n",
       "      <th>proportion</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>191</th>\n",
       "      <td>Party (Sydney, NSW : 1942)</td>\n",
       "      <td>1000</td>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>The Australian Abo Call (National : 1938)</td>\n",
       "      <td>51</td>\n",
       "      <td>78</td>\n",
       "      <td>78</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416</th>\n",
       "      <td>The Satirist and Sporting Chronicle (Sydney, N...</td>\n",
       "      <td>1028</td>\n",
       "      <td>286</td>\n",
       "      <td>286</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>146</th>\n",
       "      <td>Justice (Narrabri, NSW : 1891)</td>\n",
       "      <td>885</td>\n",
       "      <td>45</td>\n",
       "      <td>45</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>463</th>\n",
       "      <td>The Temora Telegraph and Mining Advocate (NSW ...</td>\n",
       "      <td>729</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>467</th>\n",
       "      <td>The True Sun and New South Wales Independent P...</td>\n",
       "      <td>1038</td>\n",
       "      <td>20</td>\n",
       "      <td>20</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>533</th>\n",
       "      <td>Moonta Herald and Northern Territory Gazette (...</td>\n",
       "      <td>118</td>\n",
       "      <td>56</td>\n",
       "      <td>56</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>735</th>\n",
       "      <td>Suedaustralische Zeitung (Adelaide, SA : 1850 ...</td>\n",
       "      <td>314</td>\n",
       "      <td>47</td>\n",
       "      <td>47</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>816</th>\n",
       "      <td>Hobart Town Gazette and Van Diemen's Land Adve...</td>\n",
       "      <td>5</td>\n",
       "      <td>1556</td>\n",
       "      <td>1556</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>829</th>\n",
       "      <td>Tasmanian and Port Dalrymple Advertiser (Launc...</td>\n",
       "      <td>273</td>\n",
       "      <td>193</td>\n",
       "      <td>193</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>851</th>\n",
       "      <td>The Derwent Star and Van Diemen's Land Intelli...</td>\n",
       "      <td>1046</td>\n",
       "      <td>12</td>\n",
       "      <td>12</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>903</th>\n",
       "      <td>Alexandra and Yea Standard, Thornton, Gobur an...</td>\n",
       "      <td>154</td>\n",
       "      <td>21</td>\n",
       "      <td>21</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>961</th>\n",
       "      <td>Elsternwick Leader and East Brighton, ... (Vic...</td>\n",
       "      <td>201</td>\n",
       "      <td>17</td>\n",
       "      <td>17</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>280</th>\n",
       "      <td>The Branxton Advocate: Greta and Rothbury Reco...</td>\n",
       "      <td>686</td>\n",
       "      <td>53</td>\n",
       "      <td>53</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>212</th>\n",
       "      <td>Society (Sydney, NSW : 1887)</td>\n",
       "      <td>1042</td>\n",
       "      <td>21</td>\n",
       "      <td>21</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1442</th>\n",
       "      <td>Swan River Guardian (WA : 1836 - 1838)</td>\n",
       "      <td>1142</td>\n",
       "      <td>437</td>\n",
       "      <td>437</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>887</th>\n",
       "      <td>The Van Diemen's Land Gazette and General Adve...</td>\n",
       "      <td>1047</td>\n",
       "      <td>38</td>\n",
       "      <td>38</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>857</th>\n",
       "      <td>The Hobart Town Gazette and Southern Reporter ...</td>\n",
       "      <td>4</td>\n",
       "      <td>1922</td>\n",
       "      <td>1923</td>\n",
       "      <td>0.999480</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Federal Capital Pioneer (Canberra, ACT : 1924 ...</td>\n",
       "      <td>69</td>\n",
       "      <td>542</td>\n",
       "      <td>545</td>\n",
       "      <td>0.994495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1211</th>\n",
       "      <td>The Melbourne Advertiser (Vic. : 1838)</td>\n",
       "      <td>935</td>\n",
       "      <td>120</td>\n",
       "      <td>121</td>\n",
       "      <td>0.991736</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>721</th>\n",
       "      <td>South Australian Gazette and Colonial Register...</td>\n",
       "      <td>40</td>\n",
       "      <td>1051</td>\n",
       "      <td>1065</td>\n",
       "      <td>0.986854</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>140</th>\n",
       "      <td>Intelligence (Bowral, NSW : 1884)</td>\n",
       "      <td>624</td>\n",
       "      <td>117</td>\n",
       "      <td>119</td>\n",
       "      <td>0.983193</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1625</th>\n",
       "      <td>York Advocate (WA : 1915)</td>\n",
       "      <td>1131</td>\n",
       "      <td>236</td>\n",
       "      <td>241</td>\n",
       "      <td>0.979253</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>558</th>\n",
       "      <td>Logan and Albert Advocate (Qld. : 1893 - 1900)</td>\n",
       "      <td>842</td>\n",
       "      <td>82</td>\n",
       "      <td>84</td>\n",
       "      <td>0.976190</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>383</th>\n",
       "      <td>The Newcastle Argus and District Advertiser (N...</td>\n",
       "      <td>513</td>\n",
       "      <td>29</td>\n",
       "      <td>30</td>\n",
       "      <td>0.966667</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                  title    id  total_results  \\\n",
       "191                          Party (Sydney, NSW : 1942)  1000              6   \n",
       "20            The Australian Abo Call (National : 1938)    51             78   \n",
       "416   The Satirist and Sporting Chronicle (Sydney, N...  1028            286   \n",
       "146                      Justice (Narrabri, NSW : 1891)   885             45   \n",
       "463   The Temora Telegraph and Mining Advocate (NSW ...   729              3   \n",
       "467   The True Sun and New South Wales Independent P...  1038             20   \n",
       "533   Moonta Herald and Northern Territory Gazette (...   118             56   \n",
       "735   Suedaustralische Zeitung (Adelaide, SA : 1850 ...   314             47   \n",
       "816   Hobart Town Gazette and Van Diemen's Land Adve...     5           1556   \n",
       "829   Tasmanian and Port Dalrymple Advertiser (Launc...   273            193   \n",
       "851   The Derwent Star and Van Diemen's Land Intelli...  1046             12   \n",
       "903   Alexandra and Yea Standard, Thornton, Gobur an...   154             21   \n",
       "961   Elsternwick Leader and East Brighton, ... (Vic...   201             17   \n",
       "280   The Branxton Advocate: Greta and Rothbury Reco...   686             53   \n",
       "212                        Society (Sydney, NSW : 1887)  1042             21   \n",
       "1442             Swan River Guardian (WA : 1836 - 1838)  1142            437   \n",
       "887   The Van Diemen's Land Gazette and General Adve...  1047             38   \n",
       "857   The Hobart Town Gazette and Southern Reporter ...     4           1922   \n",
       "2     Federal Capital Pioneer (Canberra, ACT : 1924 ...    69            542   \n",
       "1211             The Melbourne Advertiser (Vic. : 1838)   935            120   \n",
       "721   South Australian Gazette and Colonial Register...    40           1051   \n",
       "140                   Intelligence (Bowral, NSW : 1884)   624            117   \n",
       "1625                          York Advocate (WA : 1915)  1131            236   \n",
       "558      Logan and Albert Advocate (Qld. : 1893 - 1900)   842             82   \n",
       "383   The Newcastle Argus and District Advertiser (N...   513             29   \n",
       "\n",
       "      total_articles  proportion  \n",
       "191                6    1.000000  \n",
       "20                78    1.000000  \n",
       "416              286    1.000000  \n",
       "146               45    1.000000  \n",
       "463                3    1.000000  \n",
       "467               20    1.000000  \n",
       "533               56    1.000000  \n",
       "735               47    1.000000  \n",
       "816             1556    1.000000  \n",
       "829              193    1.000000  \n",
       "851               12    1.000000  \n",
       "903               21    1.000000  \n",
       "961               17    1.000000  \n",
       "280               53    1.000000  \n",
       "212               21    1.000000  \n",
       "1442             437    1.000000  \n",
       "887               38    1.000000  \n",
       "857             1923    0.999480  \n",
       "2                545    0.994495  \n",
       "1211             121    0.991736  \n",
       "721             1065    0.986854  \n",
       "140              119    0.983193  \n",
       "1625             241    0.979253  \n",
       "558               84    0.976190  \n",
       "383               30    0.966667  "
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_newspapers_with_titles[:25]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At the other end, we can see the newspapers with the smallest rates of correction. Note that some newspapers have no corrections at all."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>id</th>\n",
       "      <th>total_results</th>\n",
       "      <th>total_articles</th>\n",
       "      <th>proportion</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1125</th>\n",
       "      <td>Seamen's Strike Bulletin (Melbourne, Vic. : 1919)</td>\n",
       "      <td>1043</td>\n",
       "      <td>0</td>\n",
       "      <td>14</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1521</th>\n",
       "      <td>The Miner's Right (Perth, WA : 1894)</td>\n",
       "      <td>1729</td>\n",
       "      <td>0</td>\n",
       "      <td>426</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59</th>\n",
       "      <td>Campbelltown Ingleburn News (NSW : 1953 - 1954)</td>\n",
       "      <td>1699</td>\n",
       "      <td>0</td>\n",
       "      <td>6248</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104</th>\n",
       "      <td>Progress (North Fitzroy, Vic. : 1889 - 1890)</td>\n",
       "      <td>1574</td>\n",
       "      <td>0</td>\n",
       "      <td>254</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1439</th>\n",
       "      <td>Sunday Figaro (Kalgoorlie, WA : 1904)</td>\n",
       "      <td>1664</td>\n",
       "      <td>0</td>\n",
       "      <td>362</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1476</th>\n",
       "      <td>The Derby News (WA : 1887)</td>\n",
       "      <td>1617</td>\n",
       "      <td>0</td>\n",
       "      <td>9</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1527</th>\n",
       "      <td>The Mount Margaret Mercury (WA : 1897)</td>\n",
       "      <td>1641</td>\n",
       "      <td>0</td>\n",
       "      <td>24</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>495</th>\n",
       "      <td>To Ethnico Vema = Greek National Tribune (Arnc...</td>\n",
       "      <td>1592</td>\n",
       "      <td>7</td>\n",
       "      <td>62861</td>\n",
       "      <td>0.000111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>249</th>\n",
       "      <td>The Australian Jewish Times (Sydney, NSW : 195...</td>\n",
       "      <td>1694</td>\n",
       "      <td>43</td>\n",
       "      <td>268379</td>\n",
       "      <td>0.000160</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>509</th>\n",
       "      <td>Vil'na Dumka = Free Thought (Sydney, NSW : 194...</td>\n",
       "      <td>1593</td>\n",
       "      <td>2</td>\n",
       "      <td>11607</td>\n",
       "      <td>0.000172</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>743</th>\n",
       "      <td>The Coromandel Times (Blackwood, SA : 1970 - 1...</td>\n",
       "      <td>1681</td>\n",
       "      <td>2</td>\n",
       "      <td>9900</td>\n",
       "      <td>0.000202</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>787</th>\n",
       "      <td>West Coast Recorder (Port Lincoln, SA : 1909 -...</td>\n",
       "      <td>1702</td>\n",
       "      <td>23</td>\n",
       "      <td>104481</td>\n",
       "      <td>0.000220</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1588</th>\n",
       "      <td>The W.A. Sportsman (Kalgoorlie, WA : 1901 - 1902)</td>\n",
       "      <td>1666</td>\n",
       "      <td>1</td>\n",
       "      <td>4129</td>\n",
       "      <td>0.000242</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>447</th>\n",
       "      <td>The Sydney Jewish News (Sydney, N.S.W : 1939 -...</td>\n",
       "      <td>1693</td>\n",
       "      <td>18</td>\n",
       "      <td>71686</td>\n",
       "      <td>0.000251</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1201</th>\n",
       "      <td>The Jewish Weekly News (Melbourne, Vic. : 1933...</td>\n",
       "      <td>1707</td>\n",
       "      <td>3</td>\n",
       "      <td>11865</td>\n",
       "      <td>0.000253</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>742</th>\n",
       "      <td>The Coromandel (Blackwood, SA : 1945 - 1970)</td>\n",
       "      <td>1680</td>\n",
       "      <td>16</td>\n",
       "      <td>55691</td>\n",
       "      <td>0.000287</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>175</th>\n",
       "      <td>Mu̇sų Pastogė = Our Haven (Sydney, NSW : 195...</td>\n",
       "      <td>1594</td>\n",
       "      <td>3</td>\n",
       "      <td>9060</td>\n",
       "      <td>0.000331</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1418</th>\n",
       "      <td>Northam Advertiser and Toodyay Times (WA : 1954)</td>\n",
       "      <td>1652</td>\n",
       "      <td>1</td>\n",
       "      <td>2619</td>\n",
       "      <td>0.000382</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1483</th>\n",
       "      <td>The Evening News (Boulder, WA : 1921 - 1922)</td>\n",
       "      <td>1621</td>\n",
       "      <td>4</td>\n",
       "      <td>8310</td>\n",
       "      <td>0.000481</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1436</th>\n",
       "      <td>Sporting Life : Dryblower's Journal (Kalgoorli...</td>\n",
       "      <td>1663</td>\n",
       "      <td>4</td>\n",
       "      <td>8242</td>\n",
       "      <td>0.000485</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>152</th>\n",
       "      <td>L'Italo-Australiano = The Italo-Australian (Sy...</td>\n",
       "      <td>1597</td>\n",
       "      <td>3</td>\n",
       "      <td>6106</td>\n",
       "      <td>0.000491</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>84</th>\n",
       "      <td>Cowra Guardian and Lachlan Agricultural Record...</td>\n",
       "      <td>1697</td>\n",
       "      <td>20</td>\n",
       "      <td>36453</td>\n",
       "      <td>0.000549</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1377</th>\n",
       "      <td>Kulin Advocate and Dudinin-Jitarning Harrismit...</td>\n",
       "      <td>1632</td>\n",
       "      <td>6</td>\n",
       "      <td>10856</td>\n",
       "      <td>0.000553</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>143</th>\n",
       "      <td>Italian Bulletin of Commerce (Sydney, NSW : 19...</td>\n",
       "      <td>1603</td>\n",
       "      <td>1</td>\n",
       "      <td>1775</td>\n",
       "      <td>0.000563</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>705</th>\n",
       "      <td>Port Lincoln, Tumby and West Coast Recorder (S...</td>\n",
       "      <td>1700</td>\n",
       "      <td>7</td>\n",
       "      <td>11597</td>\n",
       "      <td>0.000604</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                  title    id  total_results  \\\n",
       "1125  Seamen's Strike Bulletin (Melbourne, Vic. : 1919)  1043              0   \n",
       "1521               The Miner's Right (Perth, WA : 1894)  1729              0   \n",
       "59      Campbelltown Ingleburn News (NSW : 1953 - 1954)  1699              0   \n",
       "1104       Progress (North Fitzroy, Vic. : 1889 - 1890)  1574              0   \n",
       "1439              Sunday Figaro (Kalgoorlie, WA : 1904)  1664              0   \n",
       "1476                         The Derby News (WA : 1887)  1617              0   \n",
       "1527             The Mount Margaret Mercury (WA : 1897)  1641              0   \n",
       "495   To Ethnico Vema = Greek National Tribune (Arnc...  1592              7   \n",
       "249   The Australian Jewish Times (Sydney, NSW : 195...  1694             43   \n",
       "509   Vil'na Dumka = Free Thought (Sydney, NSW : 194...  1593              2   \n",
       "743   The Coromandel Times (Blackwood, SA : 1970 - 1...  1681              2   \n",
       "787   West Coast Recorder (Port Lincoln, SA : 1909 -...  1702             23   \n",
       "1588  The W.A. Sportsman (Kalgoorlie, WA : 1901 - 1902)  1666              1   \n",
       "447   The Sydney Jewish News (Sydney, N.S.W : 1939 -...  1693             18   \n",
       "1201  The Jewish Weekly News (Melbourne, Vic. : 1933...  1707              3   \n",
       "742        The Coromandel (Blackwood, SA : 1945 - 1970)  1680             16   \n",
       "175   Mu̇sų Pastogė = Our Haven (Sydney, NSW : 195...  1594              3   \n",
       "1418   Northam Advertiser and Toodyay Times (WA : 1954)  1652              1   \n",
       "1483       The Evening News (Boulder, WA : 1921 - 1922)  1621              4   \n",
       "1436  Sporting Life : Dryblower's Journal (Kalgoorli...  1663              4   \n",
       "152   L'Italo-Australiano = The Italo-Australian (Sy...  1597              3   \n",
       "84    Cowra Guardian and Lachlan Agricultural Record...  1697             20   \n",
       "1377  Kulin Advocate and Dudinin-Jitarning Harrismit...  1632              6   \n",
       "143   Italian Bulletin of Commerce (Sydney, NSW : 19...  1603              1   \n",
       "705   Port Lincoln, Tumby and West Coast Recorder (S...  1700              7   \n",
       "\n",
       "      total_articles  proportion  \n",
       "1125              14    0.000000  \n",
       "1521             426    0.000000  \n",
       "59              6248    0.000000  \n",
       "1104             254    0.000000  \n",
       "1439             362    0.000000  \n",
       "1476               9    0.000000  \n",
       "1527              24    0.000000  \n",
       "495            62861    0.000111  \n",
       "249           268379    0.000160  \n",
       "509            11607    0.000172  \n",
       "743             9900    0.000202  \n",
       "787           104481    0.000220  \n",
       "1588            4129    0.000242  \n",
       "447            71686    0.000251  \n",
       "1201           11865    0.000253  \n",
       "742            55691    0.000287  \n",
       "175             9060    0.000331  \n",
       "1418            2619    0.000382  \n",
       "1483            8310    0.000481  \n",
       "1436            8242    0.000485  \n",
       "152             6106    0.000491  \n",
       "84             36453    0.000549  \n",
       "1377           10856    0.000553  \n",
       "143             1775    0.000563  \n",
       "705            11597    0.000604  "
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_newspapers_with_titles.sort_values(by='proportion')[:25]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll save the full list of newspapers as a CSV file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_newspapers_with_titles_csv = df_newspapers_with_titles.copy()\n",
    "df_newspapers_with_titles_csv.rename({'total_results': 'articles_with_corrections'}, axis=1, inplace=True)\n",
    "df_newspapers_with_titles_csv['percentage_with_corrections'] = df_newspapers_with_titles_csv['proportion'] * 100\n",
    "df_newspapers_with_titles_csv.sort_values(by=['percentage_with_corrections'], inplace=True)\n",
    "df_newspapers_with_titles_csv[['id', 'title', 'articles_with_corrections', 'total_articles', 'percentage_with_corrections']].to_csv('titles_corrected.csv', index=False)\n",
    "df_newspapers_with_titles_csv['title_url'] = df_newspapers_with_titles_csv['id'].apply(lambda x: f'http://nla.gov.au/nla.news-title{x}')\n",
    "df_newspapers_with_titles_csv.to_csv('titles_corrected.csv', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<a href='titles_corrected.csv' target='_blank'>titles_corrected.csv</a><br>"
      ],
      "text/plain": [
       "/Volumes/Workspace/mycode/glam-workbench/trove-newspapers/notebooks/titles_corrected.csv"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(FileLink('titles_corrected.csv'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Neediest newspapers\n",
    "\n",
    "Let's see if we can combine some guesses about OCR error rates with the correction data to find the newspapers most in need of help.\n",
    "\n",
    "To make a guesstimate of error rates, we'll use the occurance of 'tbe' – ie a common OCR error for 'the'. I don't know how valid this is, but it's a place to start!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Search for 'tbe' to get an indication of errors by newspaper\n",
    "params['q'] = 'text:\"tbe\"~0'\n",
    "params['facet'] = 'title'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = get_results(params)\n",
    "facets = []\n",
    "for term in data['response']['zone'][0]['facets']['facet']['term']:\n",
    "    # Get the state and the number of results, and convert it to integers, before adding to our results\n",
    "    facets.append({'term': term['search'], 'total_results': int(term['count'])})\n",
    "df_errors = pd.DataFrame(facets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Merge the error data with the total articles per newspaper to calculate the proportion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_errors_merged = merge_df_with_total(df_errors, df_newspapers_total, how='right')\n",
    "df_errors_merged.sort_values(by='proportion', ascending=False, inplace=True)\n",
    "df_errors_merged.rename(columns={'term': 'id'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>total_results</th>\n",
       "      <th>total_articles</th>\n",
       "      <th>proportion</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1235</th>\n",
       "      <td>1316</td>\n",
       "      <td>2005</td>\n",
       "      <td>2954</td>\n",
       "      <td>0.678741</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1034</th>\n",
       "      <td>758</td>\n",
       "      <td>5250</td>\n",
       "      <td>8078</td>\n",
       "      <td>0.649913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>812</th>\n",
       "      <td>927</td>\n",
       "      <td>9450</td>\n",
       "      <td>17227</td>\n",
       "      <td>0.548557</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>902</th>\n",
       "      <td>382</td>\n",
       "      <td>6966</td>\n",
       "      <td>12744</td>\n",
       "      <td>0.546610</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>927</th>\n",
       "      <td>262</td>\n",
       "      <td>6279</td>\n",
       "      <td>11527</td>\n",
       "      <td>0.544721</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id  total_results  total_articles  proportion\n",
       "1235  1316           2005            2954    0.678741\n",
       "1034   758           5250            8078    0.649913\n",
       "812    927           9450           17227    0.548557\n",
       "902    382           6966           12744    0.546610\n",
       "927    262           6279           11527    0.544721"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_errors_merged.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add the title names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_errors_with_titles = pd.merge(df_titles_not_gazettes, df_errors_merged, how='left', on='id').fillna(0).sort_values(by='proportion', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So this is a list of the newspapers with the highest rate of OCR error (by our rather dodgy measure)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>id</th>\n",
       "      <th>total_results</th>\n",
       "      <th>total_articles</th>\n",
       "      <th>proportion</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>482</th>\n",
       "      <td>The Weekly Advance (Granville, NSW : 1892 - 1893)</td>\n",
       "      <td>1316</td>\n",
       "      <td>2005</td>\n",
       "      <td>2954</td>\n",
       "      <td>0.678741</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>959</th>\n",
       "      <td>Dunolly and Betbetshire Express and County of ...</td>\n",
       "      <td>758</td>\n",
       "      <td>5250</td>\n",
       "      <td>8078</td>\n",
       "      <td>0.649913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1001</th>\n",
       "      <td>Hamilton Spectator and Grange District Adverti...</td>\n",
       "      <td>927</td>\n",
       "      <td>9450</td>\n",
       "      <td>17227</td>\n",
       "      <td>0.548557</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>514</th>\n",
       "      <td>Wagga Wagga Express and Murrumbidgee District ...</td>\n",
       "      <td>382</td>\n",
       "      <td>6966</td>\n",
       "      <td>12744</td>\n",
       "      <td>0.546610</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>615</th>\n",
       "      <td>The North Australian, Ipswich and General Adve...</td>\n",
       "      <td>262</td>\n",
       "      <td>6279</td>\n",
       "      <td>11527</td>\n",
       "      <td>0.544721</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>614</th>\n",
       "      <td>The North Australian (Brisbane, Qld. : 1863 - ...</td>\n",
       "      <td>264</td>\n",
       "      <td>2875</td>\n",
       "      <td>5314</td>\n",
       "      <td>0.541024</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>338</th>\n",
       "      <td>The Hay Standard and Advertiser for Balranald,...</td>\n",
       "      <td>725</td>\n",
       "      <td>21698</td>\n",
       "      <td>42068</td>\n",
       "      <td>0.515784</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>206</th>\n",
       "      <td>Robertson Advocate (NSW : 1894 - 1923)</td>\n",
       "      <td>530</td>\n",
       "      <td>37007</td>\n",
       "      <td>72376</td>\n",
       "      <td>0.511316</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>232</th>\n",
       "      <td>Temora Herald and Mining Journal (NSW : 1882 -...</td>\n",
       "      <td>728</td>\n",
       "      <td>640</td>\n",
       "      <td>1253</td>\n",
       "      <td>0.510774</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>831</th>\n",
       "      <td>Tasmanian Morning Herald (Hobart, Tas. : 1865 ...</td>\n",
       "      <td>865</td>\n",
       "      <td>4857</td>\n",
       "      <td>9559</td>\n",
       "      <td>0.508108</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>226</th>\n",
       "      <td>Sydney Mail (NSW : 1860 - 1871)</td>\n",
       "      <td>697</td>\n",
       "      <td>24593</td>\n",
       "      <td>48535</td>\n",
       "      <td>0.506707</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1096</th>\n",
       "      <td>Port Phillip Gazette and Settler's Journal (Vi...</td>\n",
       "      <td>1138</td>\n",
       "      <td>6116</td>\n",
       "      <td>12127</td>\n",
       "      <td>0.504329</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>827</th>\n",
       "      <td>Morning Star and Commercial Advertiser (Hobart...</td>\n",
       "      <td>1242</td>\n",
       "      <td>855</td>\n",
       "      <td>1703</td>\n",
       "      <td>0.502055</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>166</th>\n",
       "      <td>Molong Argus (NSW : 1896 - 1921)</td>\n",
       "      <td>424</td>\n",
       "      <td>52111</td>\n",
       "      <td>104984</td>\n",
       "      <td>0.496371</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1095</th>\n",
       "      <td>Port Phillip Gazette (Vic. : 1851)</td>\n",
       "      <td>1139</td>\n",
       "      <td>243</td>\n",
       "      <td>491</td>\n",
       "      <td>0.494908</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>837</th>\n",
       "      <td>Telegraph (Hobart Town, Tas. : 1867)</td>\n",
       "      <td>1250</td>\n",
       "      <td>68</td>\n",
       "      <td>140</td>\n",
       "      <td>0.485714</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>890</th>\n",
       "      <td>Trumpeter General (Hobart, Tas. : 1833 - 1834)</td>\n",
       "      <td>869</td>\n",
       "      <td>701</td>\n",
       "      <td>1482</td>\n",
       "      <td>0.473009</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>306</th>\n",
       "      <td>The Cumberland Free Press (Parramatta, NSW : 1...</td>\n",
       "      <td>724</td>\n",
       "      <td>6238</td>\n",
       "      <td>13247</td>\n",
       "      <td>0.470899</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>560</th>\n",
       "      <td>Logan Witness (Beenleigh, Qld. : 1878 - 1893)</td>\n",
       "      <td>850</td>\n",
       "      <td>6845</td>\n",
       "      <td>14654</td>\n",
       "      <td>0.467108</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>645</th>\n",
       "      <td>Adelaide Chronicle and South Australian Litera...</td>\n",
       "      <td>986</td>\n",
       "      <td>901</td>\n",
       "      <td>1937</td>\n",
       "      <td>0.465152</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>607</th>\n",
       "      <td>The Darling Downs Gazette and General Advertis...</td>\n",
       "      <td>257</td>\n",
       "      <td>29514</td>\n",
       "      <td>65268</td>\n",
       "      <td>0.452197</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>388</th>\n",
       "      <td>The News, Shoalhaven and Southern Coast Distri...</td>\n",
       "      <td>1588</td>\n",
       "      <td>2473</td>\n",
       "      <td>5495</td>\n",
       "      <td>0.450045</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>848</th>\n",
       "      <td>The Cornwall Chronicle (Launceston, Tas. : 183...</td>\n",
       "      <td>170</td>\n",
       "      <td>72730</td>\n",
       "      <td>163791</td>\n",
       "      <td>0.444041</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>940</th>\n",
       "      <td>Chronicle, South Yarra Gazette, Toorak Times a...</td>\n",
       "      <td>847</td>\n",
       "      <td>1639</td>\n",
       "      <td>3720</td>\n",
       "      <td>0.440591</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>865</th>\n",
       "      <td>The Mount Lyell Standard and Strahan Gazette (...</td>\n",
       "      <td>1251</td>\n",
       "      <td>36450</td>\n",
       "      <td>83363</td>\n",
       "      <td>0.437244</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                  title    id  total_results  \\\n",
       "482   The Weekly Advance (Granville, NSW : 1892 - 1893)  1316           2005   \n",
       "959   Dunolly and Betbetshire Express and County of ...   758           5250   \n",
       "1001  Hamilton Spectator and Grange District Adverti...   927           9450   \n",
       "514   Wagga Wagga Express and Murrumbidgee District ...   382           6966   \n",
       "615   The North Australian, Ipswich and General Adve...   262           6279   \n",
       "614   The North Australian (Brisbane, Qld. : 1863 - ...   264           2875   \n",
       "338   The Hay Standard and Advertiser for Balranald,...   725          21698   \n",
       "206              Robertson Advocate (NSW : 1894 - 1923)   530          37007   \n",
       "232   Temora Herald and Mining Journal (NSW : 1882 -...   728            640   \n",
       "831   Tasmanian Morning Herald (Hobart, Tas. : 1865 ...   865           4857   \n",
       "226                     Sydney Mail (NSW : 1860 - 1871)   697          24593   \n",
       "1096  Port Phillip Gazette and Settler's Journal (Vi...  1138           6116   \n",
       "827   Morning Star and Commercial Advertiser (Hobart...  1242            855   \n",
       "166                    Molong Argus (NSW : 1896 - 1921)   424          52111   \n",
       "1095                 Port Phillip Gazette (Vic. : 1851)  1139            243   \n",
       "837                Telegraph (Hobart Town, Tas. : 1867)  1250             68   \n",
       "890      Trumpeter General (Hobart, Tas. : 1833 - 1834)   869            701   \n",
       "306   The Cumberland Free Press (Parramatta, NSW : 1...   724           6238   \n",
       "560       Logan Witness (Beenleigh, Qld. : 1878 - 1893)   850           6845   \n",
       "645   Adelaide Chronicle and South Australian Litera...   986            901   \n",
       "607   The Darling Downs Gazette and General Advertis...   257          29514   \n",
       "388   The News, Shoalhaven and Southern Coast Distri...  1588           2473   \n",
       "848   The Cornwall Chronicle (Launceston, Tas. : 183...   170          72730   \n",
       "940   Chronicle, South Yarra Gazette, Toorak Times a...   847           1639   \n",
       "865   The Mount Lyell Standard and Strahan Gazette (...  1251          36450   \n",
       "\n",
       "      total_articles  proportion  \n",
       "482             2954    0.678741  \n",
       "959             8078    0.649913  \n",
       "1001           17227    0.548557  \n",
       "514            12744    0.546610  \n",
       "615            11527    0.544721  \n",
       "614             5314    0.541024  \n",
       "338            42068    0.515784  \n",
       "206            72376    0.511316  \n",
       "232             1253    0.510774  \n",
       "831             9559    0.508108  \n",
       "226            48535    0.506707  \n",
       "1096           12127    0.504329  \n",
       "827             1703    0.502055  \n",
       "166           104984    0.496371  \n",
       "1095             491    0.494908  \n",
       "837              140    0.485714  \n",
       "890             1482    0.473009  \n",
       "306            13247    0.470899  \n",
       "560            14654    0.467108  \n",
       "645             1937    0.465152  \n",
       "607            65268    0.452197  \n",
       "388             5495    0.450045  \n",
       "848           163791    0.444041  \n",
       "940             3720    0.440591  \n",
       "865            83363    0.437244  "
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_errors_with_titles[:25]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And those with the lowest rate of errors. Note the number of non-English newspapers in this list – of course our measure of accuracy fails completely in newspapers that don't use the word 'the'!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>id</th>\n",
       "      <th>total_results</th>\n",
       "      <th>total_articles</th>\n",
       "      <th>proportion</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1175</th>\n",
       "      <td>The Chinese Advertiser (Ballarat, Vic. : 1856)</td>\n",
       "      <td>706</td>\n",
       "      <td>0</td>\n",
       "      <td>15</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1437</th>\n",
       "      <td>Stampa Italiana = The Italian Press (Perth, WA...</td>\n",
       "      <td>1380</td>\n",
       "      <td>0</td>\n",
       "      <td>2493</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>224</th>\n",
       "      <td>Sydney General Trade List, Mercantile Chronicl...</td>\n",
       "      <td>696</td>\n",
       "      <td>0</td>\n",
       "      <td>22</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1476</th>\n",
       "      <td>The Derby News (WA : 1887)</td>\n",
       "      <td>1617</td>\n",
       "      <td>0</td>\n",
       "      <td>9</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Canberra Illustrated: A Quarterly Magazine (AC...</td>\n",
       "      <td>165</td>\n",
       "      <td>0</td>\n",
       "      <td>57</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>533</th>\n",
       "      <td>Moonta Herald and Northern Territory Gazette (...</td>\n",
       "      <td>118</td>\n",
       "      <td>0</td>\n",
       "      <td>56</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>Auburn and District News (NSW : 1929)</td>\n",
       "      <td>1320</td>\n",
       "      <td>0</td>\n",
       "      <td>25</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>762</th>\n",
       "      <td>The Port Adelaide Post Shipping Gazette, Farme...</td>\n",
       "      <td>719</td>\n",
       "      <td>0</td>\n",
       "      <td>18</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>212</th>\n",
       "      <td>Society (Sydney, NSW : 1887)</td>\n",
       "      <td>1042</td>\n",
       "      <td>0</td>\n",
       "      <td>21</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>151</th>\n",
       "      <td>L'Italo-Australiano = The Italo-Australian (Su...</td>\n",
       "      <td>1596</td>\n",
       "      <td>0</td>\n",
       "      <td>197</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>978</th>\n",
       "      <td>Frankston Standard (Frankston, Vic. : 1949)</td>\n",
       "      <td>233</td>\n",
       "      <td>0</td>\n",
       "      <td>1997</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1305</th>\n",
       "      <td>Chung Wah News (Perth, WA : 1981 - 1987)</td>\n",
       "      <td>1383</td>\n",
       "      <td>0</td>\n",
       "      <td>860</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46</th>\n",
       "      <td>Blayney West Macquarie (NSW : 1949)</td>\n",
       "      <td>802</td>\n",
       "      <td>0</td>\n",
       "      <td>110</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1125</th>\n",
       "      <td>Seamen's Strike Bulletin (Melbourne, Vic. : 1919)</td>\n",
       "      <td>1043</td>\n",
       "      <td>0</td>\n",
       "      <td>14</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1294</th>\n",
       "      <td>Bullfinch Miner and Yilgarn Advocate (WA : 1910)</td>\n",
       "      <td>1460</td>\n",
       "      <td>0</td>\n",
       "      <td>27</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>741</th>\n",
       "      <td>The Citizen (Port Adelaide, SA : 1938-1940)</td>\n",
       "      <td>1305</td>\n",
       "      <td>0</td>\n",
       "      <td>1284</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1388</th>\n",
       "      <td>Mediterranean Voice (Perth, WA : 1971 - 1972)</td>\n",
       "      <td>1390</td>\n",
       "      <td>0</td>\n",
       "      <td>431</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1565</th>\n",
       "      <td>The Southern Cross (Perth, WA : 1893)</td>\n",
       "      <td>1660</td>\n",
       "      <td>0</td>\n",
       "      <td>59</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1188</th>\n",
       "      <td>The Elsternwick Leader and Caulfield and Balac...</td>\n",
       "      <td>200</td>\n",
       "      <td>0</td>\n",
       "      <td>47</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>66</th>\n",
       "      <td>Citizen Soldier (Sydney, NSW : 1942)</td>\n",
       "      <td>996</td>\n",
       "      <td>0</td>\n",
       "      <td>60</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>343</th>\n",
       "      <td>The Hospital Saturday News (Katoomba, NSW : 1930)</td>\n",
       "      <td>915</td>\n",
       "      <td>0</td>\n",
       "      <td>54</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1557</th>\n",
       "      <td>The Possum (Fremantle, WA : 1890)</td>\n",
       "      <td>1201</td>\n",
       "      <td>0</td>\n",
       "      <td>105</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>67</th>\n",
       "      <td>Clarence and Richmond Examiner (Grafton, NSW :...</td>\n",
       "      <td>104</td>\n",
       "      <td>0</td>\n",
       "      <td>111</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>961</th>\n",
       "      <td>Elsternwick Leader and East Brighton, ... (Vic...</td>\n",
       "      <td>201</td>\n",
       "      <td>0</td>\n",
       "      <td>17</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1378</th>\n",
       "      <td>La Rondine (Perth, WA : 1969 - 1994)</td>\n",
       "      <td>1388</td>\n",
       "      <td>0</td>\n",
       "      <td>1383</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                  title    id  total_results  \\\n",
       "1175     The Chinese Advertiser (Ballarat, Vic. : 1856)   706              0   \n",
       "1437  Stampa Italiana = The Italian Press (Perth, WA...  1380              0   \n",
       "224   Sydney General Trade List, Mercantile Chronicl...   696              0   \n",
       "1476                         The Derby News (WA : 1887)  1617              0   \n",
       "1     Canberra Illustrated: A Quarterly Magazine (AC...   165              0   \n",
       "533   Moonta Herald and Northern Territory Gazette (...   118              0   \n",
       "31                Auburn and District News (NSW : 1929)  1320              0   \n",
       "762   The Port Adelaide Post Shipping Gazette, Farme...   719              0   \n",
       "212                        Society (Sydney, NSW : 1887)  1042              0   \n",
       "151   L'Italo-Australiano = The Italo-Australian (Su...  1596              0   \n",
       "978         Frankston Standard (Frankston, Vic. : 1949)   233              0   \n",
       "1305           Chung Wah News (Perth, WA : 1981 - 1987)  1383              0   \n",
       "46                  Blayney West Macquarie (NSW : 1949)   802              0   \n",
       "1125  Seamen's Strike Bulletin (Melbourne, Vic. : 1919)  1043              0   \n",
       "1294   Bullfinch Miner and Yilgarn Advocate (WA : 1910)  1460              0   \n",
       "741         The Citizen (Port Adelaide, SA : 1938-1940)  1305              0   \n",
       "1388      Mediterranean Voice (Perth, WA : 1971 - 1972)  1390              0   \n",
       "1565              The Southern Cross (Perth, WA : 1893)  1660              0   \n",
       "1188  The Elsternwick Leader and Caulfield and Balac...   200              0   \n",
       "66                 Citizen Soldier (Sydney, NSW : 1942)   996              0   \n",
       "343   The Hospital Saturday News (Katoomba, NSW : 1930)   915              0   \n",
       "1557                  The Possum (Fremantle, WA : 1890)  1201              0   \n",
       "67    Clarence and Richmond Examiner (Grafton, NSW :...   104              0   \n",
       "961   Elsternwick Leader and East Brighton, ... (Vic...   201              0   \n",
       "1378               La Rondine (Perth, WA : 1969 - 1994)  1388              0   \n",
       "\n",
       "      total_articles  proportion  \n",
       "1175              15         0.0  \n",
       "1437            2493         0.0  \n",
       "224               22         0.0  \n",
       "1476               9         0.0  \n",
       "1                 57         0.0  \n",
       "533               56         0.0  \n",
       "31                25         0.0  \n",
       "762               18         0.0  \n",
       "212               21         0.0  \n",
       "151              197         0.0  \n",
       "978             1997         0.0  \n",
       "1305             860         0.0  \n",
       "46               110         0.0  \n",
       "1125              14         0.0  \n",
       "1294              27         0.0  \n",
       "741             1284         0.0  \n",
       "1388             431         0.0  \n",
       "1565              59         0.0  \n",
       "1188              47         0.0  \n",
       "66                60         0.0  \n",
       "343               54         0.0  \n",
       "1557             105         0.0  \n",
       "67               111         0.0  \n",
       "961               17         0.0  \n",
       "1378            1383         0.0  "
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_errors_with_titles[-25:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
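    "Now let's merge the error data with the correction data, joining the two dataframes on the newspaper `id`. Because both dataframes share the same column names, pandas adds an `_x` suffix to the columns from the correction data and a `_y` suffix to the columns from the error data."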
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "corrections_errors_merged_df = pd.merge(df_newspapers_with_titles, df_errors_with_titles, how='left', on='id')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title_x</th>\n",
       "      <th>id</th>\n",
       "      <th>total_results_x</th>\n",
       "      <th>total_articles_x</th>\n",
       "      <th>proportion_x</th>\n",
       "      <th>title_y</th>\n",
       "      <th>total_results_y</th>\n",
       "      <th>total_articles_y</th>\n",
       "      <th>proportion_y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Party (Sydney, NSW : 1942)</td>\n",
       "      <td>1000</td>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>1.0</td>\n",
       "      <td>Party (Sydney, NSW : 1942)</td>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>The Australian Abo Call (National : 1938)</td>\n",
       "      <td>51</td>\n",
       "      <td>78</td>\n",
       "      <td>78</td>\n",
       "      <td>1.0</td>\n",
       "      <td>The Australian Abo Call (National : 1938)</td>\n",
       "      <td>0</td>\n",
       "      <td>78</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>The Satirist and Sporting Chronicle (Sydney, N...</td>\n",
       "      <td>1028</td>\n",
       "      <td>286</td>\n",
       "      <td>286</td>\n",
       "      <td>1.0</td>\n",
       "      <td>The Satirist and Sporting Chronicle (Sydney, N...</td>\n",
       "      <td>0</td>\n",
       "      <td>286</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Justice (Narrabri, NSW : 1891)</td>\n",
       "      <td>885</td>\n",
       "      <td>45</td>\n",
       "      <td>45</td>\n",
       "      <td>1.0</td>\n",
       "      <td>Justice (Narrabri, NSW : 1891)</td>\n",
       "      <td>1</td>\n",
       "      <td>45</td>\n",
       "      <td>0.022222</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>The Temora Telegraph and Mining Advocate (NSW ...</td>\n",
       "      <td>729</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>1.0</td>\n",
       "      <td>The Temora Telegraph and Mining Advocate (NSW ...</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             title_x    id  total_results_x  \\\n",
       "0                         Party (Sydney, NSW : 1942)  1000                6   \n",
       "1          The Australian Abo Call (National : 1938)    51               78   \n",
       "2  The Satirist and Sporting Chronicle (Sydney, N...  1028              286   \n",
       "3                     Justice (Narrabri, NSW : 1891)   885               45   \n",
       "4  The Temora Telegraph and Mining Advocate (NSW ...   729                3   \n",
       "\n",
       "   total_articles_x  proportion_x  \\\n",
       "0                 6           1.0   \n",
       "1                78           1.0   \n",
       "2               286           1.0   \n",
       "3                45           1.0   \n",
       "4                 3           1.0   \n",
       "\n",
       "                                             title_y  total_results_y  \\\n",
       "0                         Party (Sydney, NSW : 1942)                0   \n",
       "1          The Australian Abo Call (National : 1938)                0   \n",
       "2  The Satirist and Sporting Chronicle (Sydney, N...                0   \n",
       "3                     Justice (Narrabri, NSW : 1891)                1   \n",
       "4  The Temora Telegraph and Mining Advocate (NSW ...                0   \n",
       "\n",
       "   total_articles_y  proportion_y  \n",
       "0                 6      0.000000  \n",
       "1                78      0.000000  \n",
       "2               286      0.000000  \n",
       "3                45      0.022222  \n",
       "4                 3      0.000000  "
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corrections_errors_merged_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 'proportion_x' comes from the correction data, so its complement is the proportion left uncorrected\n",
    "corrections_errors_merged_df['proportion_uncorrected'] = 1 - corrections_errors_merged_df['proportion_x']\n",
    "corrections_errors_merged_df.rename(columns={'title_x': 'title', 'proportion_x': 'proportion_corrected', 'proportion_y': 'proportion_with_errors'}, inplace=True)\n",
    "# Sort so that the titles with the most errors and the fewest corrections come first\n",
    "corrections_errors_merged_df.sort_values(by=['proportion_with_errors', 'proportion_uncorrected'], ascending=False, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, for what it's worth, here's a list of the neediest newspapers – those with high error rates and low correction rates! The list is sorted by `proportion_with_errors` first, then by `proportion_uncorrected`. As I've said, this is a pretty dodgy method, but it's interesting nonetheless."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>proportion_with_errors</th>\n",
       "      <th>proportion_uncorrected</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1185</th>\n",
       "      <td>The Weekly Advance (Granville, NSW : 1892 - 1893)</td>\n",
       "      <td>0.678741</td>\n",
       "      <td>0.974272</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>599</th>\n",
       "      <td>Dunolly and Betbetshire Express and County of ...</td>\n",
       "      <td>0.649913</td>\n",
       "      <td>0.933028</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>381</th>\n",
       "      <td>Hamilton Spectator and Grange District Adverti...</td>\n",
       "      <td>0.548557</td>\n",
       "      <td>0.893655</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>440</th>\n",
       "      <td>Wagga Wagga Express and Murrumbidgee District ...</td>\n",
       "      <td>0.546610</td>\n",
       "      <td>0.907250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179</th>\n",
       "      <td>The North Australian, Ipswich and General Adve...</td>\n",
       "      <td>0.544721</td>\n",
       "      <td>0.767936</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>255</th>\n",
       "      <td>The North Australian (Brisbane, Qld. : 1863 - ...</td>\n",
       "      <td>0.541024</td>\n",
       "      <td>0.835341</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999</th>\n",
       "      <td>The Hay Standard and Advertiser for Balranald,...</td>\n",
       "      <td>0.515784</td>\n",
       "      <td>0.962941</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>820</th>\n",
       "      <td>Robertson Advocate (NSW : 1894 - 1923)</td>\n",
       "      <td>0.511316</td>\n",
       "      <td>0.952553</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>588</th>\n",
       "      <td>Temora Herald and Mining Journal (NSW : 1882 -...</td>\n",
       "      <td>0.510774</td>\n",
       "      <td>0.931365</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>474</th>\n",
       "      <td>Tasmanian Morning Herald (Hobart, Tas. : 1865 ...</td>\n",
       "      <td>0.508108</td>\n",
       "      <td>0.911915</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>336</th>\n",
       "      <td>Sydney Mail (NSW : 1860 - 1871)</td>\n",
       "      <td>0.506707</td>\n",
       "      <td>0.879922</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>226</th>\n",
       "      <td>Port Phillip Gazette and Settler's Journal (Vi...</td>\n",
       "      <td>0.504329</td>\n",
       "      <td>0.821225</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>146</th>\n",
       "      <td>Morning Star and Commercial Advertiser (Hobart...</td>\n",
       "      <td>0.502055</td>\n",
       "      <td>0.724016</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>649</th>\n",
       "      <td>Molong Argus (NSW : 1896 - 1921)</td>\n",
       "      <td>0.496371</td>\n",
       "      <td>0.937495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>247</th>\n",
       "      <td>Port Phillip Gazette (Vic. : 1851)</td>\n",
       "      <td>0.494908</td>\n",
       "      <td>0.830957</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>180</th>\n",
       "      <td>Telegraph (Hobart Town, Tas. : 1867)</td>\n",
       "      <td>0.485714</td>\n",
       "      <td>0.771429</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>134</th>\n",
       "      <td>Trumpeter General (Hobart, Tas. : 1833 - 1834)</td>\n",
       "      <td>0.473009</td>\n",
       "      <td>0.695007</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>335</th>\n",
       "      <td>The Cumberland Free Press (Parramatta, NSW : 1...</td>\n",
       "      <td>0.470899</td>\n",
       "      <td>0.879293</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>269</th>\n",
       "      <td>Logan Witness (Beenleigh, Qld. : 1878 - 1893)</td>\n",
       "      <td>0.467108</td>\n",
       "      <td>0.849188</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>125</th>\n",
       "      <td>Adelaide Chronicle and South Australian Litera...</td>\n",
       "      <td>0.465152</td>\n",
       "      <td>0.681982</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>266</th>\n",
       "      <td>The Darling Downs Gazette and General Advertis...</td>\n",
       "      <td>0.452197</td>\n",
       "      <td>0.846740</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1372</th>\n",
       "      <td>The News, Shoalhaven and Southern Coast Distri...</td>\n",
       "      <td>0.450045</td>\n",
       "      <td>0.985623</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>229</th>\n",
       "      <td>The Cornwall Chronicle (Launceston, Tas. : 183...</td>\n",
       "      <td>0.444041</td>\n",
       "      <td>0.823372</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>922</th>\n",
       "      <td>Chronicle, South Yarra Gazette, Toorak Times a...</td>\n",
       "      <td>0.440591</td>\n",
       "      <td>0.958602</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1362</th>\n",
       "      <td>The Mount Lyell Standard and Strahan Gazette (...</td>\n",
       "      <td>0.437244</td>\n",
       "      <td>0.985113</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                  title  \\\n",
       "1185  The Weekly Advance (Granville, NSW : 1892 - 1893)   \n",
       "599   Dunolly and Betbetshire Express and County of ...   \n",
       "381   Hamilton Spectator and Grange District Adverti...   \n",
       "440   Wagga Wagga Express and Murrumbidgee District ...   \n",
       "179   The North Australian, Ipswich and General Adve...   \n",
       "255   The North Australian (Brisbane, Qld. : 1863 - ...   \n",
       "999   The Hay Standard and Advertiser for Balranald,...   \n",
       "820              Robertson Advocate (NSW : 1894 - 1923)   \n",
       "588   Temora Herald and Mining Journal (NSW : 1882 -...   \n",
       "474   Tasmanian Morning Herald (Hobart, Tas. : 1865 ...   \n",
       "336                     Sydney Mail (NSW : 1860 - 1871)   \n",
       "226   Port Phillip Gazette and Settler's Journal (Vi...   \n",
       "146   Morning Star and Commercial Advertiser (Hobart...   \n",
       "649                    Molong Argus (NSW : 1896 - 1921)   \n",
       "247                  Port Phillip Gazette (Vic. : 1851)   \n",
       "180                Telegraph (Hobart Town, Tas. : 1867)   \n",
       "134      Trumpeter General (Hobart, Tas. : 1833 - 1834)   \n",
       "335   The Cumberland Free Press (Parramatta, NSW : 1...   \n",
       "269       Logan Witness (Beenleigh, Qld. : 1878 - 1893)   \n",
       "125   Adelaide Chronicle and South Australian Litera...   \n",
       "266   The Darling Downs Gazette and General Advertis...   \n",
       "1372  The News, Shoalhaven and Southern Coast Distri...   \n",
       "229   The Cornwall Chronicle (Launceston, Tas. : 183...   \n",
       "922   Chronicle, South Yarra Gazette, Toorak Times a...   \n",
       "1362  The Mount Lyell Standard and Strahan Gazette (...   \n",
       "\n",
       "      proportion_with_errors  proportion_uncorrected  \n",
       "1185                0.678741                0.974272  \n",
       "599                 0.649913                0.933028  \n",
       "381                 0.548557                0.893655  \n",
       "440                 0.546610                0.907250  \n",
       "179                 0.544721                0.767936  \n",
       "255                 0.541024                0.835341  \n",
       "999                 0.515784                0.962941  \n",
       "820                 0.511316                0.952553  \n",
       "588                 0.510774                0.931365  \n",
       "474                 0.508108                0.911915  \n",
       "336                 0.506707                0.879922  \n",
       "226                 0.504329                0.821225  \n",
       "146                 0.502055                0.724016  \n",
       "649                 0.496371                0.937495  \n",
       "247                 0.494908                0.830957  \n",
       "180                 0.485714                0.771429  \n",
       "134                 0.473009                0.695007  \n",
       "335                 0.470899                0.879293  \n",
       "269                 0.467108                0.849188  \n",
       "125                 0.465152                0.681982  \n",
       "266                 0.452197                0.846740  \n",
       "1372                0.450045                0.985623  \n",
       "229                 0.444041                0.823372  \n",
       "922                 0.440591                0.958602  \n",
       "1362                0.437244                0.985113  "
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corrections_errors_merged_df[['title', 'proportion_with_errors', 'proportion_uncorrected']][:25]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).  \n",
    "Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}