{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Corrections of OCRd text in Trove's newspapers\n", "\n", "The full text of newspaper articles in Trove is extracted from page images using Optical Character Recognition (OCR). The accuracy of the OCR process is influenced by a range of factors including the font and the quality of the images. Many errors slip through. Volunteers have done a remarkable job in correcting these errors, but it's a huge task. This notebook explores the scale of OCR correction in Trove.\n", "\n", "There are two ways of getting data about OCR corrections using the Trove API. To get aggregate data you can include `has:corrections` in your query to limit the results to articles that have at least one OCR correction.\n", "\n", "To get information about the number of corrections made to the articles in your results, you can add the `reclevel=full` parameter to include the number of corrections and details of the most recent correction to the article record. 
For example, note the `correctionCount` and `lastCorrection` values in the record below:\n", "\n", "``` json\n", "{\n", " \"article\": {\n", " \"id\": \"41697877\",\n", " \"url\": \"/newspaper/41697877\",\n", " \"heading\": \"WRAGGE AND WEATHER CYCLES.\",\n", " \"category\": \"Article\",\n", " \"title\": {\n", " \"id\": \"101\",\n", " \"value\": \"Western Mail (Perth, WA : 1885 - 1954)\"\n", " },\n", " \"date\": \"1922-11-23\",\n", " \"page\": 4,\n", " \"pageSequence\": 4,\n", " \"troveUrl\": \"https://trove.nla.gov.au/ndp/del/article/41697877\",\n", " \"illustrated\": \"N\",\n", " \"wordCount\": 1054,\n", " \"correctionCount\": 1,\n", " \"listCount\": 0,\n", " \"tagCount\": 0,\n", " \"commentCount\": 0,\n", " \"lastCorrection\": {\n", " \"by\": \"*anon*\",\n", " \"lastupdated\": \"2016-09-12T07:08:57Z\"\n", " },\n", " \"identifier\": \"https://nla.gov.au/nla.news-article41697877\",\n", " \"trovePageUrl\": \"https://trove.nla.gov.au/ndp/del/page/3522839\",\n", " \"pdf\": \"https://trove.nla.gov.au/ndp/imageservice/nla.news-page3522839/print\"\n", " }\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting things up" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import os\n", "import ipywidgets as widgets\n", "from operator import itemgetter # used for sorting\n", "import pandas as pd # makes manipulating the data easier\n", "import altair as alt\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from tqdm.auto import tqdm\n", "from IPython.display import display, HTML, FileLink, clear_output\n", "import math\n", "from collections import OrderedDict\n", "import time\n", "\n", "# Make sure data directory exists\n", "os.makedirs('data', exist_ok=True)\n", "\n", "# Create a session that will automatically retry on server errors\n", "s = requests.Session()\n", "retries = Retry(total=5, backoff_factor=1, 
status_forcelist=[ 502, 503, 504 ])\n", "s.mount('http://', HTTPAdapter(max_retries=retries))\n", "s.mount('https://', HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "api_key = 'YOUR API KEY'\n", "print('Your API key is: {}'.format(api_key))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Basic parameters for Trove API\n", "params = {\n", " 'facet': 'year', # Get the data aggregated by year.\n", " 'zone': 'newspaper',\n", " 'key': api_key,\n", " 'encoding': 'json',\n", " 'n': 0 # We don't need any records, just the facets!\n", "}" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def get_results(params):\n", " '''\n", " Get JSON response data from the Trove API.\n", " Parameters:\n", " params - parameters to send to the API\n", " Returns:\n", " JSON formatted response data from Trove API \n", " '''\n", " response = s.get('https://api.trove.nla.gov.au/v2/result', params=params, timeout=30)\n", " response.raise_for_status()\n", " # print(response.url) # This shows us the url that's sent to the API\n", " data = response.json()\n", " return data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How many newspaper articles have corrections?\n", "\n", "Let's find out what proportion of newspaper articles have at least one OCR correction.\n", "\n", "First we'll get the total number of newspaper articles in Trove." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "232,202,146\n" ] } ], "source": [ "# Set the q parameter to a single space to get everything\n", "params['q'] = ' '\n", "\n", "# Get the data from the API\n", "data = get_results(params)\n", "\n", "# Extract the total number of results\n", "total = int(data['response']['zone'][0]['records']['total'])\n", "print('{:,}'.format(total))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll set the `q` parameter to `has:corrections` to limit the results to newspaper articles that have at least one correction." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "12,743,782\n" ] } ], "source": [ "# Set the q parameter to 'has:corrections' to limit results to articles with corrections\n", "params['q'] = 'has:corrections'\n", "\n", "# Get the data from the API\n", "data = get_results(params)\n", "\n", "# Extract the total number of results\n", "corrected = int(data['response']['zone'][0]['records']['total'])\n", "print('{:,}'.format(corrected))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the proportion of articles with corrections." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5.49% of articles have at least one correction\n" ] } ], "source": [ "print('{:.2%} of articles have at least one correction'.format(corrected/total))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might be thinking that these figures don't seem to match the number of corrections by individuals displayed on the digitised newspapers home page. Remember that these figures show the **number of articles** that include corrections, while the individual scores show the **number of lines** corrected by each volunteer." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number of corrections by year" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def get_facets(data):\n", " '''\n", " Loop through facets in Trove API response, saving terms and counts.\n", " Parameters:\n", " data - JSON formatted response data from Trove API \n", " Returns:\n", " A list of dictionaries containing: 'term', 'total_results'\n", " '''\n", " facets = []\n", " try:\n", " # The facets are buried a fair way down in the results\n", " # Note that if you ask for more than one facet, you'll have to use the facet['name'] param to find the one you want\n", " # In this case there's only one facet, so we can just grab the list of terms (which are in fact the results by year)\n", " for term in data['response']['zone'][0]['facets']['facet']['term']:\n", " \n", " # Get the year and the number of results, and convert them to integers, before adding to our results\n", " facets.append({'term': term['search'], 'total_results': int(term['count'])})\n", " \n", " # Sort facets by year\n", " facets.sort(key=itemgetter('term'))\n", " except TypeError:\n", " pass\n", " return facets\n", "\n", "def get_facet_data(params, start_decade=180, end_decade=201):\n", " '''\n", " Loop through the decades from 'start_decade' to 'end_decade',\n", " getting the number of search results for each year from the year facet.\n", " Combine all the results into a single list.\n", " Parameters:\n", " params - parameters to send to the API\n", " start_decade\n", " end_decade\n", " Returns:\n", " A list of dictionaries containing 'term', 'total_results' for the complete \n", " period between the start and end decades.\n", " '''\n", " # Create a list to hold the facets data\n", " facet_data = []\n", " \n", " # Loop through the decades\n", " for decade in tqdm(range(start_decade, end_decade + 1)):\n", " \n", " #print(params)\n", " # Avoid confusion by copying the params before we change 
anything.\n", " search_params = params.copy()\n", " \n", " # Add decade value to params\n", " search_params['l-decade'] = decade\n", " \n", " # Get the data from the API\n", " data = get_results(search_params)\n", " \n", " # Get the facets from the data and add to facet_data\n", " facet_data += get_facets(data)\n", " \n", " # Remove the progress bar (you can also set leave=False in tqdm, but that still leaves white space in Jupyter Lab)\n", " clear_output()\n", " return facet_data" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "facet_data = get_facet_data(params)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Convert our data to a dataframe called df\n", "df = pd.DataFrame(facet_data)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
termtotal_results
01803526
11804619
21805430
31806367
41807134
\n", "
" ], "text/plain": [ " term total_results\n", "0 1803 526\n", "1 1804 619\n", "2 1805 430\n", "3 1806 367\n", "4 1807 134" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So which year has the most corrections?" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "term 1915\n", "total_results 256092\n", "Name: 112, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[df['total_results'].idxmax()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The fact that there's more corrections in newspaper articles from 1915, might make you think that people have been more motivated to correct articles relating to WWI. But if you look at the [total number of articles per year](visualise-total-newspaper-articles-by-state-year.ipynb), you'll see that there's been more articles digitised from 1915! The raw number of corrections is probably not very useful, so let's look instead at the *proportion* of articles each year that have at least one correction.\n", "\n", "To do that we'll re-harvest the facet data, but this time with a blank, or empty search, to get the total number of articles available from each year." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Reset the 'q' parameter\n", "# Use an empty search (a single space) to get ALL THE ARTICLES\n", "params['q'] = ' '\n", "\n", "# Get facet data for all articles\n", "all_facet_data = get_facet_data(params)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Convert the results to a dataframe\n", "df_total = pd.DataFrame(all_facet_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll merge the number of articles with corrections by year with the total number of articles. Then we'll calculate the proportion with corrections." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def merge_df_with_total(df, df_total, how='left'):\n", " '''\n", " Merge dataframes containing search results with the total number of articles by year.\n", " This is a left join on the year column. The total number of articles will be added as a column to \n", " the existing results.\n", " Once merged, do some reorganisation and calculate the proportion of search results.\n", " Parameters:\n", " df - the search results in a dataframe\n", " df_total - total number of articles per year in a dataframe\n", " Returns:\n", " A dataframe with the following columns - 'year', 'total_results', 'total_articles', 'proportion' \n", " (plus any other columns that are in the search results dataframe).\n", " '''\n", " # Merge the two dataframes on year\n", " # Note that we're joining the two dataframes on the year column\n", " df_merged = pd.merge(df, df_total, how=how, on='term')\n", "\n", " # Rename the columns for convenience\n", " df_merged.rename({'total_results_y': 'total_articles'}, inplace=True, axis='columns')\n", " df_merged.rename({'total_results_x': 'total_results'}, inplace=True, axis='columns')\n", "\n", " # Set blank values to zero to avoid problems\n", " df_merged['total_results'] = 
df_merged['total_results'].fillna(0).astype(int)\n", "\n", " # Calculate proportion by dividing the search results by the total articles\n", " df_merged['proportion'] = df_merged['total_results'] / df_merged['total_articles']\n", " return df_merged" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
termtotal_resultstotal_articlesproportion
018035265261.0
118046196191.0
218054304301.0
318063673671.0
418071341341.0
\n", "
" ], "text/plain": [ " term total_results total_articles proportion\n", "0 1803 526 526 1.0\n", "1 1804 619 619 1.0\n", "2 1805 430 430 1.0\n", "3 1806 367 367 1.0\n", "4 1807 134 134 1.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Merge the search results with the total articles\n", "df_merged = merge_df_with_total(df, df_total)\n", "df_merged.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualise the results, showing both the number of articles with corrections each year, and the proportion of articles each year with corrections." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.VConcatChart(...)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of articles with corrections\n", "chart1 = alt.Chart(df_merged).mark_line(point=True).encode(\n", " x=alt.X('term:Q', axis=alt.Axis(format='c', title='Year')),\n", " y=alt.Y('total_results:Q', axis=alt.Axis(format=',d', title='Number of articles with corrections')),\n", " tooltip=[alt.Tooltip('term:Q', title='Year'), alt.Tooltip('total_results:Q', title='Articles', format=',')]\n", " ).properties(width=700, height=250)\n", "\n", "# Proportion of articles with corrections\n", "chart2 = alt.Chart(df_merged).mark_line(point=True, color='red').encode(\n", " x=alt.X('term:Q', axis=alt.Axis(format='c', title='Year')),\n", " \n", " # This time we're showing the proportion (formatted as a percentage) on the Y axis\n", " y=alt.Y('proportion:Q', axis=alt.Axis(format='%', title='Proportion of articles with corrections')),\n", " tooltip=[alt.Tooltip('term:Q', title='Year'), alt.Tooltip('proportion:Q', title='Proportion', format='%')],\n", " \n", " # Make the charts different colors\n", " color=alt.value('orange')\n", " ).properties(width=700, height=250)\n", "\n", "# This is a shorthand way of stacking the charts on top of each other\n", "chart1 & chart2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is really interesting – it seems there's been a deliberate effort to get the earliest newspapers corrected." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number of corrections by category\n", "\n", "Let's see how the number of corrections varies across categories. This time we'll use the `category` facet instead of `year`." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "params['q'] = 'has:corrections'\n", "params['facet'] = 'category'" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "data = get_results(params)\n", "facets = []\n", "for term in data['response']['zone'][0]['facets']['facet']['term']:\n", " # Get the category and the number of results, and convert the count to an integer, before adding to our results\n", " facets.append({'term': term['search'], 'total_results': int(term['count'])})\n", "df_categories = pd.DataFrame(facets)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
termtotal_results
0Article9707996
1Family Notices1324644
2Advertising1231444
3Detailed Lists, Results, Guides485544
4Literature9371
\n", "
" ], "text/plain": [ " term total_results\n", "0 Article 9707996\n", "1 Family Notices 1324644\n", "2 Advertising 1231444\n", "3 Detailed Lists, Results, Guides 485544\n", "4 Literature 9371" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_categories.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once again, the raw numbers are probably not all that useful, so let's get the total number of articles in each category and calculate the proportion that have at least one correction." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Blank query\n", "params['q'] = ' '\n", "data = get_results(params)\n", "facets = []\n", "for term in data['response']['zone'][0]['facets']['facet']['term']:\n", " # Get the state and the number of results, and convert it to integers, before adding to our results\n", " facets.append({'term': term['search'], 'total_results': int(term['count'])})\n", "df_total_categories = pd.DataFrame(facets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll merge the two corrections by category data with the total articles per category and calculate the proportion." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
termtotal_resultstotal_articlesproportion
0Article97079961613583610.060164
1Family Notices132464419131430.692392
2Advertising1231444428828860.028716
3Detailed Lists, Results, Guides485544260497610.018639
4Literature9371325390.287993
5Obituaries662670040.946031
6Humour6313226930.278192
7News603574390.811265
8Law, Courts, And Crime524464450.813654
9Sport And Games450189820.501113
10Letters251191370.274817
11Arts And Culture157922410.704596
12Editorial148092740.159586
13Puzzles1403296500.047319
14Classified Advertisements And Notices112912910.874516
15Shipping Notices105611640.907216
16Official Appointments And Notices8158380.972554
17Weather74352230.142255
18Commerce And Business66610380.641618
19Reviews5948980.661470
20Display Advertisement2472820.875887
\n", "
" ], "text/plain": [ " term total_results total_articles \\\n", "0 Article 9707996 161358361 \n", "1 Family Notices 1324644 1913143 \n", "2 Advertising 1231444 42882886 \n", "3 Detailed Lists, Results, Guides 485544 26049761 \n", "4 Literature 9371 32539 \n", "5 Obituaries 6626 7004 \n", "6 Humour 6313 22693 \n", "7 News 6035 7439 \n", "8 Law, Courts, And Crime 5244 6445 \n", "9 Sport And Games 4501 8982 \n", "10 Letters 2511 9137 \n", "11 Arts And Culture 1579 2241 \n", "12 Editorial 1480 9274 \n", "13 Puzzles 1403 29650 \n", "14 Classified Advertisements And Notices 1129 1291 \n", "15 Shipping Notices 1056 1164 \n", "16 Official Appointments And Notices 815 838 \n", "17 Weather 743 5223 \n", "18 Commerce And Business 666 1038 \n", "19 Reviews 594 898 \n", "20 Display Advertisement 247 282 \n", "\n", " proportion \n", "0 0.060164 \n", "1 0.692392 \n", "2 0.028716 \n", "3 0.018639 \n", "4 0.287993 \n", "5 0.946031 \n", "6 0.278192 \n", "7 0.811265 \n", "8 0.813654 \n", "9 0.501113 \n", "10 0.274817 \n", "11 0.704596 \n", "12 0.159586 \n", "13 0.047319 \n", "14 0.874516 \n", "15 0.907216 \n", "16 0.972554 \n", "17 0.142255 \n", "18 0.641618 \n", "19 0.661470 \n", "20 0.875887 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_categories_merged = merge_df_with_total(df_categories, df_total_categories)\n", "df_categories_merged" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A lot of the categories have been added recently and don't contain a lot of articles. Some of these have a very high proportion of articles with corrections – 'Obituaries' for example. This suggests users are systematically categorising and correcting certain types of article.\n", "\n", "Let's focus on the main categories by filtering out those with less than 30,000 articles." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
termtotal_resultstotal_articlesproportion
0Article97079961613583610.060164
1Family Notices132464419131430.692392
2Advertising1231444428828860.028716
3Detailed Lists, Results, Guides485544260497610.018639
4Literature9371325390.287993
\n", "
" ], "text/plain": [ " term total_results total_articles proportion\n", "0 Article 9707996 161358361 0.060164\n", "1 Family Notices 1324644 1913143 0.692392\n", "2 Advertising 1231444 42882886 0.028716\n", "3 Detailed Lists, Results, Guides 485544 26049761 0.018639\n", "4 Literature 9371 32539 0.287993" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_categories_filtered = df_categories_merged.loc[df_categories_merged['total_articles'] > 30000]\n", "df_categories_filtered" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can visualise the results." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.HConcatChart(...)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_chart1 = alt.Chart(df_categories_filtered).mark_bar().encode(\n", " x=alt.X('term:N', title='Category'),\n", " y=alt.Y('total_results:Q', title='Articles with corrections')\n", ")\n", "\n", "cat_chart2 = alt.Chart(df_categories_filtered).mark_bar().encode(\n", " x=alt.X('term:N', title='Category'),\n", " y=alt.Y('proportion:Q', axis=alt.Axis(format='%', title='Proportion of articles with corrections')),\n", " color=alt.value('orange')\n", ")\n", "\n", "cat_chart1 | cat_chart2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the rate of corrections is much higher in the 'Family Notices' category than any other. This probably reflects the work of family historians and others searching for, and correcting, articles containing particular names." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number of corrections by newspaper\n", "\n", "How do rates of correction vary across newspapers? We can use the `title` facet to find out." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "params['q'] = 'has:corrections'\n", "params['facet'] = 'title'" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "data = get_results(params)\n", "facets = []\n", "for term in data['response']['zone'][0]['facets']['facet']['term']:\n", " # Get the newspaper id and the number of results, and convert the count to an integer, before adding to our results\n", " facets.append({'term': term['search'], 'total_results': int(term['count'])})\n", "df_newspapers = pd.DataFrame(facets)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
termtotal_results
035801023
113757930
211347365
316335237
430304692
\n", "
" ], "text/plain": [ " term total_results\n", "0 35 801023\n", "1 13 757930\n", "2 11 347365\n", "3 16 335237\n", "4 30 304692" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_newspapers.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once again we'll calculate the proportion of articles corrected for each newspaper by getting the total number of articles for each newspaper on Trove." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "params['q'] = ' '" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "data = get_results(params)\n", "facets = []\n", "for term in data['response']['zone'][0]['facets']['facet']['term']:\n", " # Get the state and the number of results, and convert it to integers, before adding to our results\n", " facets.append({'term': term['search'], 'total_results': int(term['count'])})\n", "df_newspapers_total = pd.DataFrame(facets)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "df_newspapers_merged = merge_df_with_total(df_newspapers, df_newspapers_total, how='right')" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "df_newspapers_merged.sort_values(by='proportion', ascending=False, inplace=True)\n", "df_newspapers_merged.rename(columns={'term': 'id'}, inplace=True)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idtotal_resultstotal_articlesproportion
1628729331.0
161415421211.0
13385155615561.0
152210282862861.0
15402731931931.0
\n", "
" ], "text/plain": [ " id total_results total_articles proportion\n", "1628 729 3 3 1.0\n", "1614 154 21 21 1.0\n", "1338 5 1556 1556 1.0\n", "1522 1028 286 286 1.0\n", "1540 273 193 193 1.0" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_newspapers_merged.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `title` facet only gives us the `id` number for each newspaper, not its title. Let's get all the titles and then merge them with the facet data." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "# Get all the newspaper titles\n", "title_params = {\n", " 'key': api_key,\n", " 'encoding': 'json',\n", "}\n", "\n", "title_data = s.get('https://api.trove.nla.gov.au/v2/newspaper/titles', params=params).json()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "titles = []\n", "for newspaper in title_data['response']['records']['newspaper']:\n", " titles.append({'title': newspaper['title'], 'id': int(newspaper['id'])})\n", "df_titles = pd.DataFrame(titles)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titleid
0Canberra Community News (ACT : 1925 - 1927)166
1Canberra Illustrated: A Quarterly Magazine (AC...165
2Federal Capital Pioneer (Canberra, ACT : 1924 ...69
3Good Neighbour (ACT : 1950 - 1969)871
4Student Notes/Canberra University College Stud...665
\n", "
" ], "text/plain": [ " title id\n", "0 Canberra Community News (ACT : 1925 - 1927) 166\n", "1 Canberra Illustrated: A Quarterly Magazine (AC... 165\n", "2 Federal Capital Pioneer (Canberra, ACT : 1924 ... 69\n", "3 Good Neighbour (ACT : 1950 - 1969) 871\n", "4 Student Notes/Canberra University College Stud... 665" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_titles.head()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1666, 2)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_titles.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One problem with this list is that it also includes the titles of the Government Gazettes (this seems to be a bug in the API). Let's get the gazette titles and then subtract them from the complete list." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# Get gazette titles\n", "gazette_data = s.get('https://api.trove.nla.gov.au/v2/gazette/titles', params=params).json()\n", "gazettes = []\n", "for gaz in gazette_data['response']['records']['newspaper']:\n", " gazettes.append({'title': gaz['title'], 'id': int(gaz['id'])})\n", "df_gazettes = pd.DataFrame(gazettes)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(38, 2)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_gazettes.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Subtract the gazettes from the list of titles." 
] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "df_titles_not_gazettes = df_titles[~df_titles['id'].isin(df_gazettes['id'])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can merge the newspaper titles with the facet data using the `id` to link the two datasets." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "df_newspapers_with_titles = pd.merge(df_titles_not_gazettes, df_newspapers_merged, how='left', on='id').fillna(0).sort_values(by='proportion', ascending=False)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "# Convert the totals back to integers\n", "df_newspapers_with_titles[['total_results', 'total_articles']] = df_newspapers_with_titles[['total_results', 'total_articles']].astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can display the newspapers with the highest rates of correction. Remember that a `proportion` of 1.00 means that every available article has at least one correction." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titleidtotal_resultstotal_articlesproportion
191Party (Sydney, NSW : 1942)1000661.000000
20The Australian Abo Call (National : 1938)5178781.000000
416The Satirist and Sporting Chronicle (Sydney, N...10282862861.000000
146Justice (Narrabri, NSW : 1891)88545451.000000
463The Temora Telegraph and Mining Advocate (NSW ...729331.000000
467The True Sun and New South Wales Independent P...103820201.000000
533Moonta Herald and Northern Territory Gazette (...11856561.000000
735Suedaustralische Zeitung (Adelaide, SA : 1850 ...31447471.000000
816Hobart Town Gazette and Van Diemen's Land Adve...5155615561.000000
829Tasmanian and Port Dalrymple Advertiser (Launc...2731931931.000000
851The Derwent Star and Van Diemen's Land Intelli...104612121.000000
903Alexandra and Yea Standard, Thornton, Gobur an...15421211.000000
961Elsternwick Leader and East Brighton, ... (Vic...20117171.000000
280The Branxton Advocate: Greta and Rothbury Reco...68653531.000000
212Society (Sydney, NSW : 1887)104221211.000000
1442Swan River Guardian (WA : 1836 - 1838)11424374371.000000
887The Van Diemen's Land Gazette and General Adve...104738381.000000
857The Hobart Town Gazette and Southern Reporter ...4192219230.999480
2Federal Capital Pioneer (Canberra, ACT : 1924 ...695425450.994495
1211The Melbourne Advertiser (Vic. : 1838)9351201210.991736
721South Australian Gazette and Colonial Register...40105110650.986854
140Intelligence (Bowral, NSW : 1884)6241171190.983193
1625York Advocate (WA : 1915)11312362410.979253
558Logan and Albert Advocate (Qld. : 1893 - 1900)84282840.976190
383The Newcastle Argus and District Advertiser (N...51329300.966667
\n", "
" ], "text/plain": [ " title id total_results \\\n", "191 Party (Sydney, NSW : 1942) 1000 6 \n", "20 The Australian Abo Call (National : 1938) 51 78 \n", "416 The Satirist and Sporting Chronicle (Sydney, N... 1028 286 \n", "146 Justice (Narrabri, NSW : 1891) 885 45 \n", "463 The Temora Telegraph and Mining Advocate (NSW ... 729 3 \n", "467 The True Sun and New South Wales Independent P... 1038 20 \n", "533 Moonta Herald and Northern Territory Gazette (... 118 56 \n", "735 Suedaustralische Zeitung (Adelaide, SA : 1850 ... 314 47 \n", "816 Hobart Town Gazette and Van Diemen's Land Adve... 5 1556 \n", "829 Tasmanian and Port Dalrymple Advertiser (Launc... 273 193 \n", "851 The Derwent Star and Van Diemen's Land Intelli... 1046 12 \n", "903 Alexandra and Yea Standard, Thornton, Gobur an... 154 21 \n", "961 Elsternwick Leader and East Brighton, ... (Vic... 201 17 \n", "280 The Branxton Advocate: Greta and Rothbury Reco... 686 53 \n", "212 Society (Sydney, NSW : 1887) 1042 21 \n", "1442 Swan River Guardian (WA : 1836 - 1838) 1142 437 \n", "887 The Van Diemen's Land Gazette and General Adve... 1047 38 \n", "857 The Hobart Town Gazette and Southern Reporter ... 4 1922 \n", "2 Federal Capital Pioneer (Canberra, ACT : 1924 ... 69 542 \n", "1211 The Melbourne Advertiser (Vic. : 1838) 935 120 \n", "721 South Australian Gazette and Colonial Register... 40 1051 \n", "140 Intelligence (Bowral, NSW : 1884) 624 117 \n", "1625 York Advocate (WA : 1915) 1131 236 \n", "558 Logan and Albert Advocate (Qld. : 1893 - 1900) 842 82 \n", "383 The Newcastle Argus and District Advertiser (N... 
513 29 \n", "\n", " total_articles proportion \n", "191 6 1.000000 \n", "20 78 1.000000 \n", "416 286 1.000000 \n", "146 45 1.000000 \n", "463 3 1.000000 \n", "467 20 1.000000 \n", "533 56 1.000000 \n", "735 47 1.000000 \n", "816 1556 1.000000 \n", "829 193 1.000000 \n", "851 12 1.000000 \n", "903 21 1.000000 \n", "961 17 1.000000 \n", "280 53 1.000000 \n", "212 21 1.000000 \n", "1442 437 1.000000 \n", "887 38 1.000000 \n", "857 1923 0.999480 \n", "2 545 0.994495 \n", "1211 121 0.991736 \n", "721 1065 0.986854 \n", "140 119 0.983193 \n", "1625 241 0.979253 \n", "558 84 0.976190 \n", "383 30 0.966667 " ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_newspapers_with_titles[:25]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At the other end, we can see the newspapers with the smallest rates of correction. Note that some newspapers have no corrections at all." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titleidtotal_resultstotal_articlesproportion
1125Seamen's Strike Bulletin (Melbourne, Vic. : 1919)10430140.000000
1521The Miner's Right (Perth, WA : 1894)172904260.000000
59Campbelltown Ingleburn News (NSW : 1953 - 1954)1699062480.000000
1104Progress (North Fitzroy, Vic. : 1889 - 1890)157402540.000000
1439Sunday Figaro (Kalgoorlie, WA : 1904)166403620.000000
1476The Derby News (WA : 1887)1617090.000000
1527The Mount Margaret Mercury (WA : 1897)16410240.000000
495To Ethnico Vema = Greek National Tribune (Arnc...15927628610.000111
249The Australian Jewish Times (Sydney, NSW : 195...1694432683790.000160
509Vil'na Dumka = Free Thought (Sydney, NSW : 194...15932116070.000172
743The Coromandel Times (Blackwood, SA : 1970 - 1...1681299000.000202
787West Coast Recorder (Port Lincoln, SA : 1909 -...1702231044810.000220
1588The W.A. Sportsman (Kalgoorlie, WA : 1901 - 1902)1666141290.000242
447The Sydney Jewish News (Sydney, N.S.W : 1939 -...169318716860.000251
1201The Jewish Weekly News (Melbourne, Vic. : 1933...17073118650.000253
742The Coromandel (Blackwood, SA : 1945 - 1970)168016556910.000287
175Mu̇sų Pastogė = Our Haven (Sydney, NSW : 195...1594390600.000331
1418Northam Advertiser and Toodyay Times (WA : 1954)1652126190.000382
1483The Evening News (Boulder, WA : 1921 - 1922)1621483100.000481
1436Sporting Life : Dryblower's Journal (Kalgoorli...1663482420.000485
152L'Italo-Australiano = The Italo-Australian (Sy...1597361060.000491
84Cowra Guardian and Lachlan Agricultural Record...169720364530.000549
1377Kulin Advocate and Dudinin-Jitarning Harrismit...16326108560.000553
143Italian Bulletin of Commerce (Sydney, NSW : 19...1603117750.000563
705Port Lincoln, Tumby and West Coast Recorder (S...17007115970.000604
\n", "
" ], "text/plain": [ " title id total_results \\\n", "1125 Seamen's Strike Bulletin (Melbourne, Vic. : 1919) 1043 0 \n", "1521 The Miner's Right (Perth, WA : 1894) 1729 0 \n", "59 Campbelltown Ingleburn News (NSW : 1953 - 1954) 1699 0 \n", "1104 Progress (North Fitzroy, Vic. : 1889 - 1890) 1574 0 \n", "1439 Sunday Figaro (Kalgoorlie, WA : 1904) 1664 0 \n", "1476 The Derby News (WA : 1887) 1617 0 \n", "1527 The Mount Margaret Mercury (WA : 1897) 1641 0 \n", "495 To Ethnico Vema = Greek National Tribune (Arnc... 1592 7 \n", "249 The Australian Jewish Times (Sydney, NSW : 195... 1694 43 \n", "509 Vil'na Dumka = Free Thought (Sydney, NSW : 194... 1593 2 \n", "743 The Coromandel Times (Blackwood, SA : 1970 - 1... 1681 2 \n", "787 West Coast Recorder (Port Lincoln, SA : 1909 -... 1702 23 \n", "1588 The W.A. Sportsman (Kalgoorlie, WA : 1901 - 1902) 1666 1 \n", "447 The Sydney Jewish News (Sydney, N.S.W : 1939 -... 1693 18 \n", "1201 The Jewish Weekly News (Melbourne, Vic. : 1933... 1707 3 \n", "742 The Coromandel (Blackwood, SA : 1945 - 1970) 1680 16 \n", "175 Mu̇sų Pastogė = Our Haven (Sydney, NSW : 195... 1594 3 \n", "1418 Northam Advertiser and Toodyay Times (WA : 1954) 1652 1 \n", "1483 The Evening News (Boulder, WA : 1921 - 1922) 1621 4 \n", "1436 Sporting Life : Dryblower's Journal (Kalgoorli... 1663 4 \n", "152 L'Italo-Australiano = The Italo-Australian (Sy... 1597 3 \n", "84 Cowra Guardian and Lachlan Agricultural Record... 1697 20 \n", "1377 Kulin Advocate and Dudinin-Jitarning Harrismit... 1632 6 \n", "143 Italian Bulletin of Commerce (Sydney, NSW : 19... 1603 1 \n", "705 Port Lincoln, Tumby and West Coast Recorder (S... 
1700 7 \n", "\n", " total_articles proportion \n", "1125 14 0.000000 \n", "1521 426 0.000000 \n", "59 6248 0.000000 \n", "1104 254 0.000000 \n", "1439 362 0.000000 \n", "1476 9 0.000000 \n", "1527 24 0.000000 \n", "495 62861 0.000111 \n", "249 268379 0.000160 \n", "509 11607 0.000172 \n", "743 9900 0.000202 \n", "787 104481 0.000220 \n", "1588 4129 0.000242 \n", "447 71686 0.000251 \n", "1201 11865 0.000253 \n", "742 55691 0.000287 \n", "175 9060 0.000331 \n", "1418 2619 0.000382 \n", "1483 8310 0.000481 \n", "1436 8242 0.000485 \n", "152 6106 0.000491 \n", "84 36453 0.000549 \n", "1377 10856 0.000553 \n", "143 1775 0.000563 \n", "705 11597 0.000604 " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_newspapers_with_titles.sort_values(by='proportion')[:25]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll save the full list of newspapers as a CSV file." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "df_newspapers_with_titles_csv = df_newspapers_with_titles.copy()\n", "df_newspapers_with_titles_csv.rename({'total_results': 'articles_with_corrections'}, axis=1, inplace=True)\n", "df_newspapers_with_titles_csv['percentage_with_corrections'] = df_newspapers_with_titles_csv['proportion'] * 100\n", "df_newspapers_with_titles_csv.sort_values(by=['percentage_with_corrections'], inplace=True)\n", "df_newspapers_with_titles_csv[['id', 'title', 'articles_with_corrections', 'total_articles', 'percentage_with_corrections']].to_csv('titles_corrected.csv', index=False)\n", "df_newspapers_with_titles_csv['title_url'] = df_newspapers_with_titles_csv['id'].apply(lambda x: f'http://nla.gov.au/nla.news-title{x}')\n", "df_newspapers_with_titles_csv.to_csv('titles_corrected.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "titles_corrected.csv
" ], "text/plain": [ "/Volumes/Workspace/mycode/glam-workbench/trove-newspapers/notebooks/titles_corrected.csv" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(FileLink('titles_corrected.csv'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Neediest newspapers\n", "\n", "Let's see if we can combine some guesses about OCR error rates with the correction data to find the newspapers most in need of help.\n", "\n", "To make a guesstimate of error rates, we'll use the occurrence of 'tbe', a common OCR error for 'the'. I don't know how valid this is, but it's a place to start!" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "# Search for 'tbe' to get an indication of errors by newspaper\n", "params['q'] = 'text:\"tbe\"~0'\n", "params['facet'] = 'title'" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "data = get_results(params)\n", "facets = []\n", "for term in data['response']['zone'][0]['facets']['facet']['term']:\n", " # Get the newspaper title id and the number of results, converting the count to an integer, before adding to our results\n", " facets.append({'term': term['search'], 'total_results': int(term['count'])})\n", "df_errors = pd.DataFrame(facets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Merge the error data with the total articles per newspaper to calculate the proportion." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "df_errors_merged = merge_df_with_total(df_errors, df_newspapers_total, how='right')\n", "df_errors_merged.sort_values(by='proportion', ascending=False, inplace=True)\n", "df_errors_merged.rename(columns={'term': 'id'}, inplace=True)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idtotal_resultstotal_articlesproportion
12351316200529540.678741
1034758525080780.649913
8129279450172270.548557
9023826966127440.546610
9272626279115270.544721
\n", "
" ], "text/plain": [ " id total_results total_articles proportion\n", "1235 1316 2005 2954 0.678741\n", "1034 758 5250 8078 0.649913\n", "812 927 9450 17227 0.548557\n", "902 382 6966 12744 0.546610\n", "927 262 6279 11527 0.544721" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_errors_merged.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add the title names." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "df_errors_with_titles = pd.merge(df_titles_not_gazettes, df_errors_merged, how='left', on='id').fillna(0).sort_values(by='proportion', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So this is a list of the newspapers with the highest rate of OCR error (by our rather dodgy measure)." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titleidtotal_resultstotal_articlesproportion
482The Weekly Advance (Granville, NSW : 1892 - 1893)1316200529540.678741
959Dunolly and Betbetshire Express and County of ...758525080780.649913
1001Hamilton Spectator and Grange District Adverti...9279450172270.548557
514Wagga Wagga Express and Murrumbidgee District ...3826966127440.546610
615The North Australian, Ipswich and General Adve...2626279115270.544721
614The North Australian (Brisbane, Qld. : 1863 - ...264287553140.541024
338The Hay Standard and Advertiser for Balranald,...72521698420680.515784
206Robertson Advocate (NSW : 1894 - 1923)53037007723760.511316
232Temora Herald and Mining Journal (NSW : 1882 -...72864012530.510774
831Tasmanian Morning Herald (Hobart, Tas. : 1865 ...865485795590.508108
226Sydney Mail (NSW : 1860 - 1871)69724593485350.506707
1096Port Phillip Gazette and Settler's Journal (Vi...11386116121270.504329
827Morning Star and Commercial Advertiser (Hobart...124285517030.502055
166Molong Argus (NSW : 1896 - 1921)424521111049840.496371
1095Port Phillip Gazette (Vic. : 1851)11392434910.494908
837Telegraph (Hobart Town, Tas. : 1867)1250681400.485714
890Trumpeter General (Hobart, Tas. : 1833 - 1834)86970114820.473009
306The Cumberland Free Press (Parramatta, NSW : 1...7246238132470.470899
560Logan Witness (Beenleigh, Qld. : 1878 - 1893)8506845146540.467108
645Adelaide Chronicle and South Australian Litera...98690119370.465152
607The Darling Downs Gazette and General Advertis...25729514652680.452197
388The News, Shoalhaven and Southern Coast Distri...1588247354950.450045
848The Cornwall Chronicle (Launceston, Tas. : 183...170727301637910.444041
940Chronicle, South Yarra Gazette, Toorak Times a...847163937200.440591
865The Mount Lyell Standard and Strahan Gazette (...125136450833630.437244
\n", "
" ], "text/plain": [ " title id total_results \\\n", "482 The Weekly Advance (Granville, NSW : 1892 - 1893) 1316 2005 \n", "959 Dunolly and Betbetshire Express and County of ... 758 5250 \n", "1001 Hamilton Spectator and Grange District Adverti... 927 9450 \n", "514 Wagga Wagga Express and Murrumbidgee District ... 382 6966 \n", "615 The North Australian, Ipswich and General Adve... 262 6279 \n", "614 The North Australian (Brisbane, Qld. : 1863 - ... 264 2875 \n", "338 The Hay Standard and Advertiser for Balranald,... 725 21698 \n", "206 Robertson Advocate (NSW : 1894 - 1923) 530 37007 \n", "232 Temora Herald and Mining Journal (NSW : 1882 -... 728 640 \n", "831 Tasmanian Morning Herald (Hobart, Tas. : 1865 ... 865 4857 \n", "226 Sydney Mail (NSW : 1860 - 1871) 697 24593 \n", "1096 Port Phillip Gazette and Settler's Journal (Vi... 1138 6116 \n", "827 Morning Star and Commercial Advertiser (Hobart... 1242 855 \n", "166 Molong Argus (NSW : 1896 - 1921) 424 52111 \n", "1095 Port Phillip Gazette (Vic. : 1851) 1139 243 \n", "837 Telegraph (Hobart Town, Tas. : 1867) 1250 68 \n", "890 Trumpeter General (Hobart, Tas. : 1833 - 1834) 869 701 \n", "306 The Cumberland Free Press (Parramatta, NSW : 1... 724 6238 \n", "560 Logan Witness (Beenleigh, Qld. : 1878 - 1893) 850 6845 \n", "645 Adelaide Chronicle and South Australian Litera... 986 901 \n", "607 The Darling Downs Gazette and General Advertis... 257 29514 \n", "388 The News, Shoalhaven and Southern Coast Distri... 1588 2473 \n", "848 The Cornwall Chronicle (Launceston, Tas. : 183... 170 72730 \n", "940 Chronicle, South Yarra Gazette, Toorak Times a... 847 1639 \n", "865 The Mount Lyell Standard and Strahan Gazette (... 
1251 36450 \n", "\n", " total_articles proportion \n", "482 2954 0.678741 \n", "959 8078 0.649913 \n", "1001 17227 0.548557 \n", "514 12744 0.546610 \n", "615 11527 0.544721 \n", "614 5314 0.541024 \n", "338 42068 0.515784 \n", "206 72376 0.511316 \n", "232 1253 0.510774 \n", "831 9559 0.508108 \n", "226 48535 0.506707 \n", "1096 12127 0.504329 \n", "827 1703 0.502055 \n", "166 104984 0.496371 \n", "1095 491 0.494908 \n", "837 140 0.485714 \n", "890 1482 0.473009 \n", "306 13247 0.470899 \n", "560 14654 0.467108 \n", "645 1937 0.465152 \n", "607 65268 0.452197 \n", "388 5495 0.450045 \n", "848 163791 0.444041 \n", "940 3720 0.440591 \n", "865 83363 0.437244 " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_errors_with_titles[:25]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And those with the lowest rate of errors. Note the number of non-English newspapers in this list – of course our measure of accuracy fails completely in newspapers that don't use the word 'the'!" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titleidtotal_resultstotal_articlesproportion
1175The Chinese Advertiser (Ballarat, Vic. : 1856)7060150.0
1437Stampa Italiana = The Italian Press (Perth, WA...1380024930.0
224Sydney General Trade List, Mercantile Chronicl...6960220.0
1476The Derby News (WA : 1887)1617090.0
1Canberra Illustrated: A Quarterly Magazine (AC...1650570.0
533Moonta Herald and Northern Territory Gazette (...1180560.0
31Auburn and District News (NSW : 1929)13200250.0
762The Port Adelaide Post Shipping Gazette, Farme...7190180.0
212Society (Sydney, NSW : 1887)10420210.0
151L'Italo-Australiano = The Italo-Australian (Su...159601970.0
978Frankston Standard (Frankston, Vic. : 1949)233019970.0
1305Chung Wah News (Perth, WA : 1981 - 1987)138308600.0
46Blayney West Macquarie (NSW : 1949)80201100.0
1125Seamen's Strike Bulletin (Melbourne, Vic. : 1919)10430140.0
1294Bullfinch Miner and Yilgarn Advocate (WA : 1910)14600270.0
741The Citizen (Port Adelaide, SA : 1938-1940)1305012840.0
1388Mediterranean Voice (Perth, WA : 1971 - 1972)139004310.0
1565The Southern Cross (Perth, WA : 1893)16600590.0
1188The Elsternwick Leader and Caulfield and Balac...2000470.0
66Citizen Soldier (Sydney, NSW : 1942)9960600.0
343The Hospital Saturday News (Katoomba, NSW : 1930)9150540.0
1557The Possum (Fremantle, WA : 1890)120101050.0
67Clarence and Richmond Examiner (Grafton, NSW :...10401110.0
961Elsternwick Leader and East Brighton, ... (Vic...2010170.0
1378La Rondine (Perth, WA : 1969 - 1994)1388013830.0
\n", "
" ], "text/plain": [ " title id total_results \\\n", "1175 The Chinese Advertiser (Ballarat, Vic. : 1856) 706 0 \n", "1437 Stampa Italiana = The Italian Press (Perth, WA... 1380 0 \n", "224 Sydney General Trade List, Mercantile Chronicl... 696 0 \n", "1476 The Derby News (WA : 1887) 1617 0 \n", "1 Canberra Illustrated: A Quarterly Magazine (AC... 165 0 \n", "533 Moonta Herald and Northern Territory Gazette (... 118 0 \n", "31 Auburn and District News (NSW : 1929) 1320 0 \n", "762 The Port Adelaide Post Shipping Gazette, Farme... 719 0 \n", "212 Society (Sydney, NSW : 1887) 1042 0 \n", "151 L'Italo-Australiano = The Italo-Australian (Su... 1596 0 \n", "978 Frankston Standard (Frankston, Vic. : 1949) 233 0 \n", "1305 Chung Wah News (Perth, WA : 1981 - 1987) 1383 0 \n", "46 Blayney West Macquarie (NSW : 1949) 802 0 \n", "1125 Seamen's Strike Bulletin (Melbourne, Vic. : 1919) 1043 0 \n", "1294 Bullfinch Miner and Yilgarn Advocate (WA : 1910) 1460 0 \n", "741 The Citizen (Port Adelaide, SA : 1938-1940) 1305 0 \n", "1388 Mediterranean Voice (Perth, WA : 1971 - 1972) 1390 0 \n", "1565 The Southern Cross (Perth, WA : 1893) 1660 0 \n", "1188 The Elsternwick Leader and Caulfield and Balac... 200 0 \n", "66 Citizen Soldier (Sydney, NSW : 1942) 996 0 \n", "343 The Hospital Saturday News (Katoomba, NSW : 1930) 915 0 \n", "1557 The Possum (Fremantle, WA : 1890) 1201 0 \n", "67 Clarence and Richmond Examiner (Grafton, NSW :... 104 0 \n", "961 Elsternwick Leader and East Brighton, ... (Vic... 
201 0 \n", "1378 La Rondine (Perth, WA : 1969 - 1994) 1388 0 \n", "\n", " total_articles proportion \n", "1175 15 0.0 \n", "1437 2493 0.0 \n", "224 22 0.0 \n", "1476 9 0.0 \n", "1 57 0.0 \n", "533 56 0.0 \n", "31 25 0.0 \n", "762 18 0.0 \n", "212 21 0.0 \n", "151 197 0.0 \n", "978 1997 0.0 \n", "1305 860 0.0 \n", "46 110 0.0 \n", "1125 14 0.0 \n", "1294 27 0.0 \n", "741 1284 0.0 \n", "1388 431 0.0 \n", "1565 59 0.0 \n", "1188 47 0.0 \n", "66 60 0.0 \n", "343 54 0.0 \n", "1557 105 0.0 \n", "67 111 0.0 \n", "961 17 0.0 \n", "1378 1383 0.0 " ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_errors_with_titles[-25:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's merge the error data with the correction data." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "corrections_errors_merged_df = pd.merge(df_newspapers_with_titles, df_errors_with_titles, how='left', on='id')" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
title_xidtotal_results_xtotal_articles_xproportion_xtitle_ytotal_results_ytotal_articles_yproportion_y
0Party (Sydney, NSW : 1942)1000661.0Party (Sydney, NSW : 1942)060.000000
1The Australian Abo Call (National : 1938)5178781.0The Australian Abo Call (National : 1938)0780.000000
2The Satirist and Sporting Chronicle (Sydney, N...10282862861.0The Satirist and Sporting Chronicle (Sydney, N...02860.000000
3Justice (Narrabri, NSW : 1891)88545451.0Justice (Narrabri, NSW : 1891)1450.022222
4The Temora Telegraph and Mining Advocate (NSW ...729331.0The Temora Telegraph and Mining Advocate (NSW ...030.000000
\n", "
" ], "text/plain": [ " title_x id total_results_x \\\n", "0 Party (Sydney, NSW : 1942) 1000 6 \n", "1 The Australian Abo Call (National : 1938) 51 78 \n", "2 The Satirist and Sporting Chronicle (Sydney, N... 1028 286 \n", "3 Justice (Narrabri, NSW : 1891) 885 45 \n", "4 The Temora Telegraph and Mining Advocate (NSW ... 729 3 \n", "\n", " total_articles_x proportion_x \\\n", "0 6 1.0 \n", "1 78 1.0 \n", "2 286 1.0 \n", "3 45 1.0 \n", "4 3 1.0 \n", "\n", " title_y total_results_y \\\n", "0 Party (Sydney, NSW : 1942) 0 \n", "1 The Australian Abo Call (National : 1938) 0 \n", "2 The Satirist and Sporting Chronicle (Sydney, N... 0 \n", "3 Justice (Narrabri, NSW : 1891) 1 \n", "4 The Temora Telegraph and Mining Advocate (NSW ... 0 \n", "\n", " total_articles_y proportion_y \n", "0 6 0.000000 \n", "1 78 0.000000 \n", "2 286 0.000000 \n", "3 45 0.022222 \n", "4 3 0.000000 " ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corrections_errors_merged_df.head()" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "corrections_errors_merged_df['proportion_uncorrected'] = corrections_errors_merged_df['proportion_x'].apply(lambda x: 1 - x)\n", "corrections_errors_merged_df.rename(columns={'title_x': 'title', 'proportion_x': 'proportion_corrected', 'proportion_y': 'proportion_with_errors'}, inplace=True)\n", "corrections_errors_merged_df.sort_values(by=['proportion_with_errors', 'proportion_uncorrected'], ascending=False, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, for what it's worth, here's a list of the neediest newspapers – those with high error rates and low correction rates! As I've said, this is a pretty dodgy method, but interesting nonetheless." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
 "<table>\n",
 "  <thead>\n",
 "    <tr><th></th><th>title</th><th>proportion_with_errors</th><th>proportion_uncorrected</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>1185</th><td>The Weekly Advance (Granville, NSW : 1892 - 1893)</td><td>0.678741</td><td>0.974272</td></tr>\n",
 "    <tr><th>599</th><td>Dunolly and Betbetshire Express and County of ...</td><td>0.649913</td><td>0.933028</td></tr>\n",
 "    <tr><th>381</th><td>Hamilton Spectator and Grange District Adverti...</td><td>0.548557</td><td>0.893655</td></tr>\n",
 "    <tr><th>440</th><td>Wagga Wagga Express and Murrumbidgee District ...</td><td>0.546610</td><td>0.907250</td></tr>\n",
 "    <tr><th>179</th><td>The North Australian, Ipswich and General Adve...</td><td>0.544721</td><td>0.767936</td></tr>\n",
 "    <tr><th>255</th><td>The North Australian (Brisbane, Qld. : 1863 - ...</td><td>0.541024</td><td>0.835341</td></tr>\n",
 "    <tr><th>999</th><td>The Hay Standard and Advertiser for Balranald,...</td><td>0.515784</td><td>0.962941</td></tr>\n",
 "    <tr><th>820</th><td>Robertson Advocate (NSW : 1894 - 1923)</td><td>0.511316</td><td>0.952553</td></tr>\n",
 "    <tr><th>588</th><td>Temora Herald and Mining Journal (NSW : 1882 -...</td><td>0.510774</td><td>0.931365</td></tr>\n",
 "    <tr><th>474</th><td>Tasmanian Morning Herald (Hobart, Tas. : 1865 ...</td><td>0.508108</td><td>0.911915</td></tr>\n",
 "    <tr><th>336</th><td>Sydney Mail (NSW : 1860 - 1871)</td><td>0.506707</td><td>0.879922</td></tr>\n",
 "    <tr><th>226</th><td>Port Phillip Gazette and Settler's Journal (Vi...</td><td>0.504329</td><td>0.821225</td></tr>\n",
 "    <tr><th>146</th><td>Morning Star and Commercial Advertiser (Hobart...</td><td>0.502055</td><td>0.724016</td></tr>\n",
 "    <tr><th>649</th><td>Molong Argus (NSW : 1896 - 1921)</td><td>0.496371</td><td>0.937495</td></tr>\n",
 "    <tr><th>247</th><td>Port Phillip Gazette (Vic. : 1851)</td><td>0.494908</td><td>0.830957</td></tr>\n",
 "    <tr><th>180</th><td>Telegraph (Hobart Town, Tas. : 1867)</td><td>0.485714</td><td>0.771429</td></tr>\n",
 "    <tr><th>134</th><td>Trumpeter General (Hobart, Tas. : 1833 - 1834)</td><td>0.473009</td><td>0.695007</td></tr>\n",
 "    <tr><th>335</th><td>The Cumberland Free Press (Parramatta, NSW : 1...</td><td>0.470899</td><td>0.879293</td></tr>\n",
 "    <tr><th>269</th><td>Logan Witness (Beenleigh, Qld. : 1878 - 1893)</td><td>0.467108</td><td>0.849188</td></tr>\n",
 "    <tr><th>125</th><td>Adelaide Chronicle and South Australian Litera...</td><td>0.465152</td><td>0.681982</td></tr>\n",
 "    <tr><th>266</th><td>The Darling Downs Gazette and General Advertis...</td><td>0.452197</td><td>0.846740</td></tr>\n",
 "    <tr><th>1372</th><td>The News, Shoalhaven and Southern Coast Distri...</td><td>0.450045</td><td>0.985623</td></tr>\n",
 "    <tr><th>229</th><td>The Cornwall Chronicle (Launceston, Tas. : 183...</td><td>0.444041</td><td>0.823372</td></tr>\n",
 "    <tr><th>922</th><td>Chronicle, South Yarra Gazette, Toorak Times a...</td><td>0.440591</td><td>0.958602</td></tr>\n",
 "    <tr><th>1362</th><td>The Mount Lyell Standard and Strahan Gazette (...</td><td>0.437244</td><td>0.985113</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "
" ], "text/plain": [ " title \\\n", "1185 The Weekly Advance (Granville, NSW : 1892 - 1893) \n", "599 Dunolly and Betbetshire Express and County of ... \n", "381 Hamilton Spectator and Grange District Adverti... \n", "440 Wagga Wagga Express and Murrumbidgee District ... \n", "179 The North Australian, Ipswich and General Adve... \n", "255 The North Australian (Brisbane, Qld. : 1863 - ... \n", "999 The Hay Standard and Advertiser for Balranald,... \n", "820 Robertson Advocate (NSW : 1894 - 1923) \n", "588 Temora Herald and Mining Journal (NSW : 1882 -... \n", "474 Tasmanian Morning Herald (Hobart, Tas. : 1865 ... \n", "336 Sydney Mail (NSW : 1860 - 1871) \n", "226 Port Phillip Gazette and Settler's Journal (Vi... \n", "146 Morning Star and Commercial Advertiser (Hobart... \n", "649 Molong Argus (NSW : 1896 - 1921) \n", "247 Port Phillip Gazette (Vic. : 1851) \n", "180 Telegraph (Hobart Town, Tas. : 1867) \n", "134 Trumpeter General (Hobart, Tas. : 1833 - 1834) \n", "335 The Cumberland Free Press (Parramatta, NSW : 1... \n", "269 Logan Witness (Beenleigh, Qld. : 1878 - 1893) \n", "125 Adelaide Chronicle and South Australian Litera... \n", "266 The Darling Downs Gazette and General Advertis... \n", "1372 The News, Shoalhaven and Southern Coast Distri... \n", "229 The Cornwall Chronicle (Launceston, Tas. : 183... \n", "922 Chronicle, South Yarra Gazette, Toorak Times a... \n", "1362 The Mount Lyell Standard and Strahan Gazette (... 
\n", "\n", " proportion_with_errors proportion_uncorrected \n", "1185 0.678741 0.974272 \n", "599 0.649913 0.933028 \n", "381 0.548557 0.893655 \n", "440 0.546610 0.907250 \n", "179 0.544721 0.767936 \n", "255 0.541024 0.835341 \n", "999 0.515784 0.962941 \n", "820 0.511316 0.952553 \n", "588 0.510774 0.931365 \n", "474 0.508108 0.911915 \n", "336 0.506707 0.879922 \n", "226 0.504329 0.821225 \n", "146 0.502055 0.724016 \n", "649 0.496371 0.937495 \n", "247 0.494908 0.830957 \n", "180 0.485714 0.771429 \n", "134 0.473009 0.695007 \n", "335 0.470899 0.879293 \n", "269 0.467108 0.849188 \n", "125 0.465152 0.681982 \n", "266 0.452197 0.846740 \n", "1372 0.450045 0.985623 \n", "229 0.444041 0.823372 \n", "922 0.440591 0.958602 \n", "1362 0.437244 0.985113 " ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corrections_errors_merged_df[['title', 'proportion_with_errors', 'proportion_uncorrected']][:25]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/). \n", "Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }