{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Corrections of OCRd text in Trove's newspapers\n", "\n", "The full text of newspaper articles in Trove is extracted from page images using Optical Character Recognition (OCR). The accuracy of the OCR process is influenced by a range of factors including the font and the quality of the images. Many errors slip through. Volunteers have done a remarkable job in correcting these errors, but it's a huge task. This notebook explores the scale of OCR correction in Trove.\n", "\n", "There are two ways of getting data about OCR corrections using the Trove API. To get aggregate data you can include `has:corrections` in your query to limit the results to articles that have at least one OCR correction.\n", "\n", "To get information about the number of corrections made to the articles in your results, you can add the `reclevel=full` parameter to include the number of corrections and details of the most recent correction to the article record. 
For example, note the `correctionCount` and `lastCorrection` values in the record below:\n", "\n", "``` json\n", "{\n", " \"article\": {\n", " \"id\": \"41697877\",\n", " \"url\": \"/newspaper/41697877\",\n", " \"heading\": \"WRAGGE AND WEATHER CYCLES.\",\n", " \"category\": \"Article\",\n", " \"title\": {\n", " \"id\": \"101\",\n", " \"value\": \"Western Mail (Perth, WA : 1885 - 1954)\"\n", " },\n", " \"date\": \"1922-11-23\",\n", " \"page\": 4,\n", " \"pageSequence\": 4,\n", " \"troveUrl\": \"https://trove.nla.gov.au/ndp/del/article/41697877\",\n", " \"illustrated\": \"N\",\n", " \"wordCount\": 1054,\n", " \"correctionCount\": 1,\n", " \"listCount\": 0,\n", " \"tagCount\": 0,\n", " \"commentCount\": 0,\n", " \"lastCorrection\": {\n", " \"by\": \"*anon*\",\n", " \"lastupdated\": \"2016-09-12T07:08:57Z\"\n", " },\n", " \"identifier\": \"https://nla.gov.au/nla.news-article41697877\",\n", " \"trovePageUrl\": \"https://trove.nla.gov.au/ndp/del/page/3522839\",\n", " \"pdf\": \"https://trove.nla.gov.au/ndp/imageservice/nla.news-page3522839/print\"\n", " }\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting things up" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import os\n", "import ipywidgets as widgets\n", "from operator import itemgetter # used for sorting\n", "import pandas as pd # makes manipulating the data easier\n", "import altair as alt\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from tqdm.auto import tqdm\n", "from IPython.display import display, HTML, FileLink, clear_output\n", "import math\n", "from collections import OrderedDict\n", "import time\n", "\n", "# Make sure data directory exists\n", "os.makedirs('data', exist_ok=True)\n", "\n", "# Create a session that will automatically retry on server errors\n", "s = requests.Session()\n", "retries = Retry(total=5, backoff_factor=1, 
status_forcelist=[ 502, 503, 504 ])\n", "s.mount('http://', HTTPAdapter(max_retries=retries))\n", "s.mount('https://', HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "api_key = 'YOUR API KEY'\n", "print('Your API key is: {}'.format(api_key))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Basic parameters for Trove API\n", "params = {\n", "    'facet': 'year', # Get the data aggregated by year.\n", "    'zone': 'newspaper',\n", "    'key': api_key,\n", "    'encoding': 'json',\n", "    'n': 0 # We don't need any records, just the facets!\n", "}" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def get_results(params):\n", "    '''\n", "    Get JSON response data from the Trove API.\n", "    Parameters:\n", "        params\n", "    Returns:\n", "        JSON formatted response data from Trove API\n", "    '''\n", "    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params, timeout=30)\n", "    response.raise_for_status()\n", "    # print(response.url) # This shows us the url that's sent to the API\n", "    data = response.json()\n", "    return data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How many newspaper articles have corrections?\n", "\n", "Let's find out what proportion of newspaper articles have at least one OCR correction.\n", "\n", "First we'll get the total number of newspaper articles in Trove." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "232,202,146\n" ] } ], "source": [ "# Set the q parameter to a single space to get everything\n", "params['q'] = ' '\n", "\n", "# Get the data from the API\n", "data = get_results(params)\n", "\n", "# Extract the total number of results\n", "total = int(data['response']['zone'][0]['records']['total'])\n", "print('{:,}'.format(total))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll set the `q` parameter to `has:corrections` to limit the results to newspaper articles that have at least one correction." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "12,743,782\n" ] } ], "source": [ "# Set the q parameter to 'has:corrections' to limit results to articles with corrections\n", "params['q'] = 'has:corrections'\n", "\n", "# Get the data from the API\n", "data = get_results(params)\n", "\n", "# Extract the total number of results\n", "corrected = int(data['response']['zone'][0]['records']['total'])\n", "print('{:,}'.format(corrected))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the proportion of articles with corrections." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5.49% of articles have at least one correction\n" ] } ], "source": [ "print('{:.2%} of articles have at least one correction'.format(corrected/total))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might be thinking that these figures don't seem to match the number of corrections by individuals displayed on the digitised newspapers home page. Remember that these figures show the **number of articles** that include corrections, while the individual scores show the **number of lines** corrected by each volunteer." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number of corrections by year" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def get_facets(data):\n", "    '''\n", "    Loop through facets in Trove API response, saving terms and counts.\n", "    Parameters:\n", "        data - JSON formatted response data from Trove API\n", "    Returns:\n", "        A list of dictionaries containing: 'term', 'total_results'\n", "    '''\n", "    facets = []\n", "    try:\n", "        # The facets are buried a fair way down in the results\n", "        # Note that if you ask for more than one facet, you'll have to use the facet['name'] param to find the one you want\n", "        # In this case there's only one facet, so we can just grab the list of terms (which are in fact the results by year)\n", "        for term in data['response']['zone'][0]['facets']['facet']['term']:\n", "\n", "            # Get the year and the number of results, converting the count to an integer, before adding to our results\n", "            facets.append({'term': term['search'], 'total_results': int(term['count'])})\n", "\n", "        # Sort facets by year\n", "        facets.sort(key=itemgetter('term'))\n", "    except TypeError:\n", "        pass\n", "    return facets\n", "\n", "def get_facet_data(params, start_decade=180, end_decade=201):\n", "    '''\n", "    Loop through the decades from 'start_decade' to 'end_decade',\n", "    getting the number of search results for each year from the year facet.\n", "    Combine all the results into a single list.\n", "    Parameters:\n", "        params - parameters to send to the API\n", "        start_decade\n", "        end_decade\n", "    Returns:\n", "        A list of dictionaries containing 'term', 'total_results' for the complete\n", "        period between the start and end decades.\n", "    '''\n", "    # Create a list to hold the facets data\n", "    facet_data = []\n", "\n", "    # Loop through the decades\n", "    for decade in tqdm(range(start_decade, end_decade + 1)):\n", "\n", "        #print(params)\n", "        # Avoid confusion by copying the params before we change anything.\n", "        search_params = params.copy()\n", "\n", "        # Add decade value to params\n", "        search_params['l-decade'] = decade\n", "\n", "        # Get the data from the API\n", "        data = get_results(search_params)\n", "\n", "        # Get the facets from the data and add to facet_data\n", "        facet_data += get_facets(data)\n", "\n", "    # Remove the progress bar (you can also set leave=False in tqdm, but that still leaves white space in Jupyter Lab)\n", "    clear_output()\n", "    return facet_data" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "facet_data = get_facet_data(params)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Convert our data to a dataframe called df\n", "df = pd.DataFrame(facet_data)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "" ], "text/plain": [ "alt.VConcatChart(...)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of articles with corrections\n", "chart1 = alt.Chart(df_merged).mark_line(point=True).encode(\n", "    x=alt.X('term:Q', axis=alt.Axis(format='c', title='Year')),\n", "    y=alt.Y('total_results:Q', axis=alt.Axis(format=',d', title='Number of articles with corrections')),\n", "    tooltip=[alt.Tooltip('term:Q', title='Year'), alt.Tooltip('total_results:Q', title='Articles', format=',')]\n", "    ).properties(width=700, height=250)\n", "\n", "# Proportion of articles with corrections\n", "chart2 = alt.Chart(df_merged).mark_line(point=True).encode(\n", "    x=alt.X('term:Q', axis=alt.Axis(format='c', title='Year')),\n", "\n", "    # This time we're showing the proportion (formatted as a percentage) on the Y axis\n", "    y=alt.Y('proportion:Q', axis=alt.Axis(format='%', title='Proportion of articles with corrections')),\n", "    tooltip=[alt.Tooltip('term:Q', title='Year'), alt.Tooltip('proportion:Q', title='Proportion', format='%')],\n", "\n", "    # Make the charts different colors\n", "    color=alt.value('orange')\n", "    ).properties(width=700, height=250)\n", "\n", "# This is a shorthand way of stacking the charts on top of each other\n", "chart1 & chart2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is really interesting – it seems there's been a deliberate effort to get the earliest newspapers corrected." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number of corrections by category\n", "\n", "Let's see how the number of corrections varies across categories. This time we'll use the `category` facet instead of `year`." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "params['q'] = 'has:corrections'\n", "params['facet'] = 'category'" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "data = get_results(params)\n", "facets = []\n", "for term in data['response']['zone'][0]['facets']['facet']['term']:\n", "    # Get the category and the number of results, converting the count to an integer, before adding to our results\n", "    facets.append({'term': term['search'], 'total_results': int(term['count'])})\n", "df_categories = pd.DataFrame(facets)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "" ], "text/plain": [ "alt.HConcatChart(...)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_chart1 = alt.Chart(df_categories_filtered).mark_bar().encode(\n", "    x=alt.X('term:N', title='Category'),\n", "    y=alt.Y('total_results:Q', title='Articles with corrections')\n", ")\n", "\n", "cat_chart2 = alt.Chart(df_categories_filtered).mark_bar().encode(\n", "    x=alt.X('term:N', title='Category'),\n", "    y=alt.Y('proportion:Q', axis=alt.Axis(format='%', title='Proportion of articles with corrections')),\n", "    color=alt.value('orange')\n", ")\n", "\n", "cat_chart1 | cat_chart2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the rate of corrections is much higher in the 'Family Notices' category than in any other. This probably reflects the work of family historians and others searching for, and correcting, articles containing particular names." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number of corrections by newspaper\n", "\n", "How do rates of correction vary across newspapers? We can use the `title` facet to find out." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "params['q'] = 'has:corrections'\n", "params['facet'] = 'title'" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "data = get_results(params)\n", "facets = []\n", "for term in data['response']['zone'][0]['facets']['facet']['term']:\n", "    # Get the newspaper id and the number of results, converting the count to an integer, before adding to our results\n", "    facets.append({'term': term['search'], 'total_results': int(term['count'])})\n", "df_newspapers = pd.DataFrame(facets)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "/Volumes/Workspace/mycode/glam-workbench/trove-newspapers/notebooks/titles_corrected.csv" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(FileLink('titles_corrected.csv'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Neediest newspapers\n", "\n", "Let's see if we can combine some guesses about OCR error rates with the correction data to find the newspapers most in need of help.\n", "\n", "To make a guesstimate of error rates, we'll use the occurrence of 'tbe' – i.e. a common OCR error for 'the'. I don't know how valid this is, but it's a place to start!" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "# Search for 'tbe' to get an indication of errors by newspaper\n", "params['q'] = 'text:\"tbe\"~0'\n", "params['facet'] = 'title'" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "data = get_results(params)\n", "facets = []\n", "for term in data['response']['zone'][0]['facets']['facet']['term']:\n", "    # Get the newspaper id and the number of results, converting the count to an integer, before adding to our results\n", "    facets.append({'term': term['search'], 'total_results': int(term['count'])})\n", "df_errors = pd.DataFrame(facets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Merge the error data with the total articles per newspaper to calculate the proportion." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "df_errors_merged = merge_df_with_total(df_errors, df_newspapers_total, how='right')\n", "df_errors_merged.sort_values(by='proportion', ascending=False, inplace=True)\n", "df_errors_merged.rename(columns={'term': 'id'}, inplace=True)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
