{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Corrections of OCRd text in Trove's newspapers\n",
"\n",
"The full text of newspaper articles in Trove is extracted from page images using Optical Character Recognition (OCR). The accuracy of the OCR process is influenced by a range of factors including the font and the quality of the images. Many errors slip through. Volunteers have done a remarkable job in correcting these errors, but it's a huge task. This notebook explores the scale of OCR correction in Trove.\n",
"\n",
"There are two ways of getting data about OCR corrections using the Trove API. To get aggregate data you can include `has:corrections` in your query to limit the results to articles that have at least one OCR correction.\n",
"\n",
"To get information about the number of corrections made to the articles in your results, you can add the `reclevel=full` parameter to include the number of corrections and details of the most recent correction to the article record. For example, note the `correctionCount` and `lastCorrection` values in the record below:\n",
"\n",
"``` json\n",
"{\n",
" \"article\": {\n",
" \"id\": \"41697877\",\n",
" \"url\": \"/newspaper/41697877\",\n",
" \"heading\": \"WRAGGE AND WEATHER CYCLES.\",\n",
" \"category\": \"Article\",\n",
" \"title\": {\n",
" \"id\": \"101\",\n",
" \"value\": \"Western Mail (Perth, WA : 1885 - 1954)\"\n",
" },\n",
" \"date\": \"1922-11-23\",\n",
" \"page\": 4,\n",
" \"pageSequence\": 4,\n",
" \"troveUrl\": \"https://trove.nla.gov.au/ndp/del/article/41697877\",\n",
" \"illustrated\": \"N\",\n",
" \"wordCount\": 1054,\n",
" \"correctionCount\": 1,\n",
" \"listCount\": 0,\n",
" \"tagCount\": 0,\n",
" \"commentCount\": 0,\n",
" \"lastCorrection\": {\n",
" \"by\": \"*anon*\",\n",
" \"lastupdated\": \"2016-09-12T07:08:57Z\"\n",
" },\n",
" \"identifier\": \"https://nla.gov.au/nla.news-article41697877\",\n",
" \"trovePageUrl\": \"https://trove.nla.gov.au/ndp/del/page/3522839\",\n",
" \"pdf\": \"https://trove.nla.gov.au/ndp/imageservice/nla.news-page3522839/print\"\n",
" }\n",
"}\n",
"```"
]
},
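{
"cell_type": "markdown",
"metadata": {},
"source": [
"A record like the one above can be requested by adding `reclevel=full` to a search. The following is a minimal sketch rather than part of the analysis in this notebook. It assumes your API key is saved in the `TROVE_API_KEY` environment variable, and that the article records are returned as a list under `records` and then `article`; the limit of five records is arbitrary.\n",
"\n",
"``` python\n",
"import os\n",
"\n",
"import requests\n",
"\n",
"# Assumption: the API key is stored in the TROVE_API_KEY environment variable\n",
"params = {\n",
"    \"q\": \"has:corrections\",\n",
"    \"zone\": \"newspaper\",\n",
"    \"key\": os.getenv(\"TROVE_API_KEY\"),\n",
"    \"encoding\": \"json\",\n",
"    \"reclevel\": \"full\",  # Adds correctionCount and lastCorrection to each record\n",
"    \"n\": 5,  # A handful of records is enough for a demonstration\n",
"}\n",
"response = requests.get(\n",
"    \"https://api.trove.nla.gov.au/v2/result\", params=params, timeout=30\n",
")\n",
"response.raise_for_status()\n",
"data = response.json()\n",
"\n",
"# Assumption: article records are listed under records -> article\n",
"for article in data[\"response\"][\"zone\"][0][\"records\"].get(\"article\", []):\n",
"    last = article.get(\"lastCorrection\", {})\n",
"    print(article[\"id\"], article.get(\"correctionCount\"), last.get(\"lastupdated\"))\n",
"```"
]
},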
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Setting things up"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from operator import itemgetter # used for sorting\n",
"\n",
"import altair as alt\n",
"import pandas as pd # makes manipulating the data easier\n",
"import requests\n",
"from IPython.display import FileLink, clear_output, display\n",
"from requests.adapters import HTTPAdapter\n",
"from requests.packages.urllib3.util.retry import Retry\n",
"from tqdm.auto import tqdm\n",
"\n",
"# Make sure data directory exists\n",
"os.makedirs(\"data\", exist_ok=True)\n",
"\n",
"# Create a session that will automatically retry on server errors\n",
"s = requests.Session()\n",
"retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
"s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n",
"s.mount(\"https://\", HTTPAdapter(max_retries=retries))"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"# Load variables from the .env file if it exists\n",
"# Use %%capture to suppress messages\n",
"%load_ext dotenv\n",
"%dotenv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Insert your Trove API key\n",
"API_KEY = \"YOUR API KEY\"\n",
"\n",
"# Use api key value from environment variables if it is available\n",
"if os.getenv(\"TROVE_API_KEY\"):\n",
" API_KEY = os.getenv(\"TROVE_API_KEY\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Basic parameters for Trove API\n",
"params = {\n",
" \"facet\": \"year\", # Get the data aggregated by year.\n",
" \"zone\": \"newspaper\",\n",
" \"key\": API_KEY,\n",
" \"encoding\": \"json\",\n",
" \"n\": 0, # We don't need any records, just the facets!\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def get_results(params):\n",
" \"\"\"\n",
" Get JSON response data from the Trove API.\n",
" Parameters:\n",
" params\n",
" Returns:\n",
" JSON formatted response data from Trove API\n",
" \"\"\"\n",
" response = s.get(\n",
" \"https://api.trove.nla.gov.au/v2/result\", params=params, timeout=30\n",
" )\n",
" response.raise_for_status()\n",
" # print(response.url) # This shows us the url that's sent to the API\n",
" data = response.json()\n",
" return data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How many newspaper articles have corrections?\n",
"\n",
"Let's find out what proportion of newspaper articles have at least one OCR correction.\n",
"\n",
"First we'll get to the total number of newspaper articles in Trove."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"233,635,348\n"
]
}
],
"source": [
"# Set the q parameter to a single space to get everything\n",
"params[\"q\"] = \" \"\n",
"\n",
"# Get the data from the API\n",
"data = get_results(params)\n",
"\n",
"# Extract the total number of results\n",
"total = int(data[\"response\"][\"zone\"][0][\"records\"][\"total\"])\n",
"print(\"{:,}\".format(total))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we'll set the `q` parameter to `has:corrections` to limit the results to newspaper articles that have at least one correction."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"13,585,979\n"
]
}
],
"source": [
"# Set the q parameter to 'has:corrections' to limit results to articles with corrections\n",
"params[\"q\"] = \"has:corrections\"\n",
"\n",
"# Get the data from the API\n",
"data = get_results(params)\n",
"\n",
"# Extract the total number of results\n",
"corrected = int(data[\"response\"][\"zone\"][0][\"records\"][\"total\"])\n",
"print(\"{:,}\".format(corrected))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate the proportion of articles with corrections."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5.82% of articles have at least one correction\n"
]
}
],
"source": [
"print(\"{:.2%} of articles have at least one correction\".format(corrected / total))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You might be thinking that these figures don't seem to match the number of corrections by individuals displayed on the digitised newspapers home page. Remember that these figures show the **number of articles** that include corrections, while the individual scores show the **number of lines** corrected by each volunteer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Number of corrections by year"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def get_facets(data):\n",
" \"\"\"\n",
" Loop through facets in Trove API response, saving terms and counts.\n",
" Parameters:\n",
" data - JSON formatted response data from Trove API\n",
" Returns:\n",
" A list of dictionaries containing: 'term', 'total_results'\n",
" \"\"\"\n",
" facets = []\n",
" try:\n",
" # The facets are buried a fair way down in the results\n",
" # Note that if you ask for more than one facet, you'll have use the facet['name'] param to find the one you want\n",
" # In this case there's only one facet, so we can just grab the list of terms (which are in fact the results by year)\n",
" for term in data[\"response\"][\"zone\"][0][\"facets\"][\"facet\"][\"term\"]:\n",
"\n",
" # Get the year and the number of results, and convert them to integers, before adding to our results\n",
" facets.append({\"term\": term[\"search\"], \"total_results\": int(term[\"count\"])})\n",
"\n",
" # Sort facets by year\n",
" facets.sort(key=itemgetter(\"term\"))\n",
" except TypeError:\n",
" pass\n",
" return facets\n",
"\n",
"\n",
"def get_facet_data(params, start_decade=180, end_decade=201):\n",
" \"\"\"\n",
" Loop throught the decades from 'start_decade' to 'end_decade',\n",
" getting the number of search results for each year from the year facet.\n",
" Combine all the results into a single list.\n",
" Parameters:\n",
" params - parameters to send to the API\n",
" start_decade\n",
" end_decade\n",
" Returns:\n",
" A list of dictionaries containing 'year', 'total_results' for the complete\n",
" period between the start and end decades.\n",
" \"\"\"\n",
" # Create a list to hold the facets data\n",
" facet_data = []\n",
"\n",
" # Loop through the decades\n",
" for decade in tqdm(range(start_decade, end_decade + 1)):\n",
"\n",
" # print(params)\n",
" # Avoid confusion by copying the params before we change anything.\n",
" search_params = params.copy()\n",
"\n",
" # Add decade value to params\n",
" search_params[\"l-decade\"] = decade\n",
"\n",
" # Get the data from the API\n",
" data = get_results(search_params)\n",
"\n",
" # Get the facets from the data and add to facets_data\n",
" facet_data += get_facets(data)\n",
"\n",
" # Reomve the progress bar (you can also set leave=False in tqdm, but that still leaves white space in Jupyter Lab)\n",
" clear_output()\n",
" return facet_data"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"facet_data = get_facet_data(params)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Convert our data to a dataframe called df\n",
"df = pd.DataFrame(facet_data)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" term | \n",
" total_results | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1803 | \n",
" 526 | \n",
"
\n",
" \n",
" 1 | \n",
" 1804 | \n",
" 619 | \n",
"
\n",
" \n",
" 2 | \n",
" 1805 | \n",
" 430 | \n",
"
\n",
" \n",
" 3 | \n",
" 1806 | \n",
" 367 | \n",
"
\n",
" \n",
" 4 | \n",
" 1807 | \n",
" 134 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" term total_results\n",
"0 1803 526\n",
"1 1804 619\n",
"2 1805 430\n",
"3 1806 367\n",
"4 1807 134"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So which year has the most corrections?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"term 1915\n",
"total_results 270277\n",
"Name: 112, dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[df[\"total_results\"].idxmax()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fact that there's more corrections in newspaper articles from 1915, might make you think that people have been more motivated to correct articles relating to WWI. But if you look at the [total number of articles per year](visualise-total-newspaper-articles-by-state-year.ipynb), you'll see that there's been more articles digitised from 1915! The raw number of corrections is probably not very useful, so let's look instead at the *proportion* of articles each year that have at least one correction.\n",
"\n",
"To do that we'll re-harvest the facet data, but this time with a blank, or empty search, to get the total number of articles available from each year."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# Reset the 'q' parameter\n",
"# Use a an empty search (a single space) to get ALL THE ARTICLES\n",
"params[\"q\"] = \" \"\n",
"\n",
"# Get facet data for all articles\n",
"all_facet_data = get_facet_data(params)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# Convert the results to a dataframe\n",
"df_total = pd.DataFrame(all_facet_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No we'll merge the number of articles by year with corrections with the total number of articles. Then we'll calculate the proportion with corrections."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def merge_df_with_total(df, df_total, how=\"left\"):\n",
" \"\"\"\n",
" Merge dataframes containing search results with the total number of articles by year.\n",
" This is a left join on the year column. The total number of articles will be added as a column to\n",
" the existing results.\n",
" Once merged, do some reorganisation and calculate the proportion of search results.\n",
" Parameters:\n",
" df - the search results in a dataframe\n",
" df_total - total number of articles per year in a dataframe\n",
" Returns:\n",
" A dataframe with the following columns - 'year', 'total_results', 'total_articles', 'proportion'\n",
" (plus any other columns that are in the search results dataframe).\n",
" \"\"\"\n",
" # Merge the two dataframes on year\n",
" # Note that we're joining the two dataframes on the year column\n",
" df_merged = pd.merge(df, df_total, how=how, on=\"term\")\n",
"\n",
" # Rename the columns for convenience\n",
" df_merged.rename(\n",
" {\"total_results_y\": \"total_articles\"}, inplace=True, axis=\"columns\"\n",
" )\n",
" df_merged.rename({\"total_results_x\": \"total_results\"}, inplace=True, axis=\"columns\")\n",
"\n",
" # Set blank values to zero to avoid problems\n",
" df_merged[\"total_results\"] = df_merged[\"total_results\"].fillna(0).astype(int)\n",
"\n",
" # Calculate proportion by dividing the search results by the total articles\n",
" df_merged[\"proportion\"] = df_merged[\"total_results\"] / df_merged[\"total_articles\"]\n",
" return df_merged"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" term | \n",
" total_results | \n",
" total_articles | \n",
" proportion | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1803 | \n",
" 526 | \n",
" 526 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1804 | \n",
" 619 | \n",
" 619 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1805 | \n",
" 430 | \n",
" 430 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1806 | \n",
" 367 | \n",
" 367 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1807 | \n",
" 134 | \n",
" 134 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" term total_results total_articles proportion\n",
"0 1803 526 526 1.0\n",
"1 1804 619 619 1.0\n",
"2 1805 430 430 1.0\n",
"3 1806 367 367 1.0\n",
"4 1807 134 134 1.0"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Merge the search results with the total articles\n",
"df_merged = merge_df_with_total(df, df_total)\n",
"df_merged.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's visualise the results, showing both the number of articles with corrections each year, and the proportion of articles each year with corrections."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
""
],
"text/plain": [
"alt.VConcatChart(...)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Number of articles with corrections\n",
"chart1 = (\n",
" alt.Chart(df_merged)\n",
" .mark_line(point=True)\n",
" .encode(\n",
" x=alt.X(\"term:Q\", axis=alt.Axis(format=\"c\", title=\"Year\")),\n",
" y=alt.Y(\n",
" \"total_results:Q\",\n",
" axis=alt.Axis(format=\",d\", title=\"Number of articles with corrections\"),\n",
" ),\n",
" tooltip=[\n",
" alt.Tooltip(\"term:Q\", title=\"Year\"),\n",
" alt.Tooltip(\"total_results:Q\", title=\"Articles\", format=\",\"),\n",
" ],\n",
" )\n",
" .properties(width=700, height=250)\n",
")\n",
"\n",
"# Proportion of articles with corrections\n",
"chart2 = (\n",
" alt.Chart(df_merged)\n",
" .mark_line(point=True, color=\"red\")\n",
" .encode(\n",
" x=alt.X(\"term:Q\", axis=alt.Axis(format=\"c\", title=\"Year\")),\n",
" # This time we're showing the proportion (formatted as a percentage) on the Y axis\n",
" y=alt.Y(\n",
" \"proportion:Q\",\n",
" axis=alt.Axis(format=\"%\", title=\"Proportion of articles with corrections\"),\n",
" ),\n",
" tooltip=[\n",
" alt.Tooltip(\"term:Q\", title=\"Year\"),\n",
" alt.Tooltip(\"proportion:Q\", title=\"Proportion\", format=\"%\"),\n",
" ],\n",
" # Make the charts different colors\n",
" color=alt.value(\"orange\"),\n",
" )\n",
" .properties(width=700, height=250)\n",
")\n",
"\n",
"# This is a shorthand way of stacking the charts on top of each other\n",
"chart1 & chart2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is really interesting – it seems there's been a deliberate effort to get the earliest newspapers corrected."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Number of corrections by category\n",
"\n",
"Let's see how the number of corrections varies across categories. This time we'll use the `category` facet instead of `year`."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"params[\"q\"] = \"has:corrections\"\n",
"params[\"facet\"] = \"category\""
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"data = get_results(params)\n",
"facets = []\n",
"for term in data[\"response\"][\"zone\"][0][\"facets\"][\"facet\"][\"term\"]:\n",
" # Get the state and the number of results, and convert it to integers, before adding to our results\n",
" facets.append({\"term\": term[\"search\"], \"total_results\": int(term[\"count\"])})\n",
"df_categories = pd.DataFrame(facets)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" term | \n",
" total_results | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Article | \n",
" 10385409 | \n",
"
\n",
" \n",
" 1 | \n",
" Family Notices | \n",
" 1362744 | \n",
"
\n",
" \n",
" 2 | \n",
" Advertising | \n",
" 1319419 | \n",
"
\n",
" \n",
" 3 | \n",
" Detailed Lists, Results, Guides | \n",
" 526437 | \n",
"
\n",
" \n",
" 4 | \n",
" Literature | \n",
" 11335 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" term total_results\n",
"0 Article 10385409\n",
"1 Family Notices 1362744\n",
"2 Advertising 1319419\n",
"3 Detailed Lists, Results, Guides 526437\n",
"4 Literature 11335"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_categories.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once again, the raw numbers are probably not all that useful, so let's get the total number of articles in each category and calculate the proportion that have at least one correction."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Blank query\n",
"params[\"q\"] = \" \"\n",
"data = get_results(params)\n",
"facets = []\n",
"for term in data[\"response\"][\"zone\"][0][\"facets\"][\"facet\"][\"term\"]:\n",
" # Get the state and the number of results, and convert it to integers, before adding to our results\n",
" facets.append({\"term\": term[\"search\"], \"total_results\": int(term[\"count\"])})\n",
"df_total_categories = pd.DataFrame(facets)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll merge the two corrections by category data with the total articles per category and calculate the proportion."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" term | \n",
" total_results | \n",
" total_articles | \n",
" proportion | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Article | \n",
" 10385409 | \n",
" 162296409 | \n",
" 0.063990 | \n",
"
\n",
" \n",
" 1 | \n",
" Family Notices | \n",
" 1362744 | \n",
" 1926852 | \n",
" 0.707239 | \n",
"
\n",
" \n",
" 2 | \n",
" Advertising | \n",
" 1319419 | \n",
" 43238969 | \n",
" 0.030515 | \n",
"
\n",
" \n",
" 3 | \n",
" Detailed Lists, Results, Guides | \n",
" 526437 | \n",
" 26177949 | \n",
" 0.020110 | \n",
"
\n",
" \n",
" 4 | \n",
" Literature | \n",
" 11335 | \n",
" 34669 | \n",
" 0.326949 | \n",
"
\n",
" \n",
" 5 | \n",
" Obituaries | \n",
" 9493 | \n",
" 10459 | \n",
" 0.907639 | \n",
"
\n",
" \n",
" 6 | \n",
" Humour | \n",
" 8698 | \n",
" 26571 | \n",
" 0.327349 | \n",
"
\n",
" \n",
" 7 | \n",
" News | \n",
" 7983 | \n",
" 9928 | \n",
" 0.804089 | \n",
"
\n",
" \n",
" 8 | \n",
" Law, Courts, And Crime | \n",
" 6615 | \n",
" 9047 | \n",
" 0.731182 | \n",
"
\n",
" \n",
" 9 | \n",
" Sport And Games | \n",
" 6029 | \n",
" 13682 | \n",
" 0.440652 | \n",
"
\n",
" \n",
" 10 | \n",
" Letters | \n",
" 3372 | \n",
" 11674 | \n",
" 0.288847 | \n",
"
\n",
" \n",
" 11 | \n",
" Editorial | \n",
" 2228 | \n",
" 12908 | \n",
" 0.172606 | \n",
"
\n",
" \n",
" 12 | \n",
" Arts And Culture | \n",
" 2212 | \n",
" 3054 | \n",
" 0.724296 | \n",
"
\n",
" \n",
" 13 | \n",
" Puzzles | \n",
" 1616 | \n",
" 38716 | \n",
" 0.041740 | \n",
"
\n",
" \n",
" 14 | \n",
" Reviews | \n",
" 1366 | \n",
" 1877 | \n",
" 0.727757 | \n",
"
\n",
" \n",
" 15 | \n",
" Shipping Notices | \n",
" 1332 | \n",
" 1796 | \n",
" 0.741648 | \n",
"
\n",
" \n",
" 16 | \n",
" Classified Advertisements And Notices | \n",
" 1222 | \n",
" 1410 | \n",
" 0.866667 | \n",
"
\n",
" \n",
" 17 | \n",
" Official Appointments And Notices | \n",
" 1019 | \n",
" 1046 | \n",
" 0.974187 | \n",
"
\n",
" \n",
" 18 | \n",
" Weather | \n",
" 1011 | \n",
" 8625 | \n",
" 0.117217 | \n",
"
\n",
" \n",
" 19 | \n",
" Commerce And Business | \n",
" 1005 | \n",
" 2566 | \n",
" 0.391660 | \n",
"
\n",
" \n",
" 20 | \n",
" Display Advertisement | \n",
" 415 | \n",
" 456 | \n",
" 0.910088 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" term total_results total_articles \\\n",
"0 Article 10385409 162296409 \n",
"1 Family Notices 1362744 1926852 \n",
"2 Advertising 1319419 43238969 \n",
"3 Detailed Lists, Results, Guides 526437 26177949 \n",
"4 Literature 11335 34669 \n",
"5 Obituaries 9493 10459 \n",
"6 Humour 8698 26571 \n",
"7 News 7983 9928 \n",
"8 Law, Courts, And Crime 6615 9047 \n",
"9 Sport And Games 6029 13682 \n",
"10 Letters 3372 11674 \n",
"11 Editorial 2228 12908 \n",
"12 Arts And Culture 2212 3054 \n",
"13 Puzzles 1616 38716 \n",
"14 Reviews 1366 1877 \n",
"15 Shipping Notices 1332 1796 \n",
"16 Classified Advertisements And Notices 1222 1410 \n",
"17 Official Appointments And Notices 1019 1046 \n",
"18 Weather 1011 8625 \n",
"19 Commerce And Business 1005 2566 \n",
"20 Display Advertisement 415 456 \n",
"\n",
" proportion \n",
"0 0.063990 \n",
"1 0.707239 \n",
"2 0.030515 \n",
"3 0.020110 \n",
"4 0.326949 \n",
"5 0.907639 \n",
"6 0.327349 \n",
"7 0.804089 \n",
"8 0.731182 \n",
"9 0.440652 \n",
"10 0.288847 \n",
"11 0.172606 \n",
"12 0.724296 \n",
"13 0.041740 \n",
"14 0.727757 \n",
"15 0.741648 \n",
"16 0.866667 \n",
"17 0.974187 \n",
"18 0.117217 \n",
"19 0.391660 \n",
"20 0.910088 "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_categories_merged = merge_df_with_total(df_categories, df_total_categories)\n",
"df_categories_merged"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A lot of the categories have been added recently and don't contain a lot of articles. Some of these have a very high proportion of articles with corrections – 'Obituaries' for example. This suggests users are systematically categorising and correcting certain types of article.\n",
"\n",
"Let's focus on the main categories by filtering out those with less than 30,000 articles."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" term | \n",
" total_results | \n",
" total_articles | \n",
" proportion | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Article | \n",
" 10385409 | \n",
" 162296409 | \n",
" 0.063990 | \n",
"
\n",
" \n",
" 1 | \n",
" Family Notices | \n",
" 1362744 | \n",
" 1926852 | \n",
" 0.707239 | \n",
"
\n",
" \n",
" 2 | \n",
" Advertising | \n",
" 1319419 | \n",
" 43238969 | \n",
" 0.030515 | \n",
"
\n",
" \n",
" 3 | \n",
" Detailed Lists, Results, Guides | \n",
" 526437 | \n",
" 26177949 | \n",
" 0.020110 | \n",
"
\n",
" \n",
" 4 | \n",
" Literature | \n",
" 11335 | \n",
" 34669 | \n",
" 0.326949 | \n",
"
\n",
" \n",
" 13 | \n",
" Puzzles | \n",
" 1616 | \n",
" 38716 | \n",
" 0.041740 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" term total_results total_articles proportion\n",
"0 Article 10385409 162296409 0.063990\n",
"1 Family Notices 1362744 1926852 0.707239\n",
"2 Advertising 1319419 43238969 0.030515\n",
"3 Detailed Lists, Results, Guides 526437 26177949 0.020110\n",
"4 Literature 11335 34669 0.326949\n",
"13 Puzzles 1616 38716 0.041740"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_categories_filtered = df_categories_merged.loc[\n",
" df_categories_merged[\"total_articles\"] > 30000\n",
"]\n",
"df_categories_filtered"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now we can visualise the results."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
""
],
"text/plain": [
"alt.HConcatChart(...)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_chart1 = (\n",
" alt.Chart(df_categories_filtered)\n",
" .mark_bar()\n",
" .encode(\n",
" x=alt.X(\"term:N\", title=\"Category\"),\n",
" y=alt.Y(\"total_results:Q\", title=\"Articles with corrections\"),\n",
" )\n",
")\n",
"\n",
"cat_chart2 = (\n",
" alt.Chart(df_categories_filtered)\n",
" .mark_bar()\n",
" .encode(\n",
" x=alt.X(\"term:N\", title=\"Category\"),\n",
" y=alt.Y(\n",
" \"proportion:Q\",\n",
" axis=alt.Axis(format=\"%\", title=\"Proportion of articles with corrections\"),\n",
" ),\n",
" color=alt.value(\"orange\"),\n",
" )\n",
")\n",
"\n",
"cat_chart1 | cat_chart2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, the rate of corrections is much higher in the 'Family Notices' category than any other. This probably reflects the work of family historians and others searching for, and correcting, articles containing particular names."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Number of corrections by newspaper\n",
"\n",
"How do rates of correction vary across newspapers? We can use the `title` facet to find out."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"params[\"q\"] = \"has:corrections\"\n",
"params[\"facet\"] = \"title\""
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"data = get_results(params)\n",
"facets = []\n",
"for term in data[\"response\"][\"zone\"][0][\"facets\"][\"facet\"][\"term\"]:\n",
" # Get the state and the number of results, and convert it to integers, before adding to our results\n",
" facets.append({\"term\": term[\"search\"], \"total_results\": int(term[\"count\"])})\n",
"df_newspapers = pd.DataFrame(facets)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" term | \n",
" total_results | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 35 | \n",
" 827605 | \n",
"
\n",
" \n",
" 1 | \n",
" 13 | \n",
" 780708 | \n",
"
\n",
" \n",
" 2 | \n",
" 11 | \n",
" 373062 | \n",
"
\n",
" \n",
" 3 | \n",
" 16 | \n",
" 347336 | \n",
"
\n",
" \n",
" 4 | \n",
" 30 | \n",
" 317381 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" term total_results\n",
"0 35 827605\n",
"1 13 780708\n",
"2 11 373062\n",
"3 16 347336\n",
"4 30 317381"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_newspapers.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once again we'll calculate the proportion of articles corrected for each newspaper by getting the total number of articles for each newspaper on Trove."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"params[\"q\"] = \" \""
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"data = get_results(params)\n",
"facets = []\n",
"for term in data[\"response\"][\"zone\"][0][\"facets\"][\"facet\"][\"term\"]:\n",
" # Get the state and the number of results, and convert it to integers, before adding to our results\n",
" facets.append({\"term\": term[\"search\"], \"total_results\": int(term[\"count\"])})\n",
"df_newspapers_total = pd.DataFrame(facets)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"df_newspapers_merged = merge_df_with_total(\n",
" df_newspapers, df_newspapers_total, how=\"right\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"df_newspapers_merged.sort_values(by=\"proportion\", ascending=False, inplace=True)\n",
"df_newspapers_merged.rename(columns={\"term\": \"id\"}, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" total_results | \n",
" total_articles | \n",
" proportion | \n",
"
\n",
" \n",
" \n",
" \n",
" 1662 | \n",
" 729 | \n",
" 3 | \n",
" 3 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1548 | \n",
" 1028 | \n",
" 286 | \n",
" 286 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1629 | \n",
" 1047 | \n",
" 38 | \n",
" 38 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1617 | \n",
" 686 | \n",
" 53 | \n",
" 53 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1614 | \n",
" 118 | \n",
" 56 | \n",
" 56 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id total_results total_articles proportion\n",
"1662 729 3 3 1.0\n",
"1548 1028 286 286 1.0\n",
"1629 1047 38 38 1.0\n",
"1617 686 53 53 1.0\n",
"1614 118 56 56 1.0"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_newspapers_merged.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `title` facet only gives us the `id` number for each newspaper, not its title. Let's get all the titles and then merge them with the facet data."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"# Get all the newspaper titles\n",
"title_params = {\n",
" \"key\": API_KEY,\n",
" \"encoding\": \"json\",\n",
"}\n",
"\n",
"title_data = s.get(\n",
" \"https://api.trove.nla.gov.au/v2/newspaper/titles\", params=params\n",
").json()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"titles = []\n",
"for newspaper in title_data[\"response\"][\"records\"][\"newspaper\"]:\n",
" titles.append({\"title\": newspaper[\"title\"], \"id\": int(newspaper[\"id\"])})\n",
"df_titles = pd.DataFrame(titles)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" title | \n",
" id | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Canberra Community News (ACT : 1925 - 1927) | \n",
" 166 | \n",
"
\n",
" \n",
" 1 | \n",
" Canberra Illustrated: A Quarterly Magazine (AC... | \n",
" 165 | \n",
"
\n",
" \n",
" 2 | \n",
" Federal Capital Pioneer (Canberra, ACT : 1924 ... | \n",
" 69 | \n",
"
\n",
" \n",
" 3 | \n",
" Good Neighbour (ACT : 1950 - 1969) | \n",
" 871 | \n",
"
\n",
" \n",
" 4 | \n",
" Student Notes/Canberra University College Stud... | \n",
" 665 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" title id\n",
"0 Canberra Community News (ACT : 1925 - 1927) 166\n",
"1 Canberra Illustrated: A Quarterly Magazine (AC... 165\n",
"2 Federal Capital Pioneer (Canberra, ACT : 1924 ... 69\n",
"3 Good Neighbour (ACT : 1950 - 1969) 871\n",
"4 Student Notes/Canberra University College Stud... 665"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_titles.head()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1698, 2)"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_titles.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One problem with this list is that it also includes the titles of the Government Gazettes (this seems to be a bug in the API). Let's get the gazette titles and then subtract them from the complete list."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"# Get gazette titles\n",
"gazette_data = s.get(\n",
"    \"https://api.trove.nla.gov.au/v2/gazette/titles\", params=title_params\n",
").json()\n",
"gazettes = []\n",
"# The gazette records are read from the same \"newspaper\" key as the newspaper titles\n",
"for gaz in gazette_data[\"response\"][\"records\"][\"newspaper\"]:\n",
"    gazettes.append({\"title\": gaz[\"title\"], \"id\": int(gaz[\"id\"])})\n",
"df_gazettes = pd.DataFrame(gazettes)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(38, 2)"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_gazettes.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Subtract the gazettes from the list of titles."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"df_titles_not_gazettes = df_titles[~df_titles[\"id\"].isin(df_gazettes[\"id\"])]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can merge the newspaper titles with the facet data using the `id` to link the two datasets."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"df_newspapers_with_titles = (\n",
" pd.merge(df_titles_not_gazettes, df_newspapers_merged, how=\"left\", on=\"id\")\n",
" .fillna(0)\n",
" .sort_values(by=\"proportion\", ascending=False)\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# Convert the totals back to integers\n",
"df_newspapers_with_titles[\n",
" [\"total_results\", \"total_articles\"]\n",
"] = df_newspapers_with_titles[[\"total_results\", \"total_articles\"]].astype(int)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can display the newspapers with the highest rates of correction. Remember that a `proportion` of 1.0 means that every available article has at least one correction."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" title id total_results \\\n",
"468 The Temora Telegraph and Mining Advocate (NSW ... 729 3 \n",
"824 Hobart Town Gazette and Van Diemen's Land Adve... 5 1556 \n",
"421 The Satirist and Sporting Chronicle (Sydney, N... 1028 286 \n",
"149 Justice (Narrabri, NSW : 1891) 885 45 \n",
"921 Alexandra and Yea Standard, Thornton, Gobur an... 154 21 \n",
"739 Suedaustralische Zeitung (Adelaide, SA : 1850 ... 314 47 \n",
"472 The True Sun and New South Wales Independent P... 1038 20 \n",
"284 The Branxton Advocate: Greta and Rothbury Reco... 686 53 \n",
"24 The Australian Abo Call (National : 1938) 51 78 \n",
"838 Tasmanian and Port Dalrymple Advertiser (Launc... 273 193 \n",
"538 Moonta Herald and Northern Territory Gazette (... 118 56 \n",
"215 Society (Sydney, NSW : 1887) 1042 21 \n",
"195 Party (Sydney, NSW : 1942) 1000 6 \n",
"979 Elsternwick Leader and East Brighton, ... (Vic... 201 17 \n",
"1472 Swan River Guardian (WA : 1836 - 1838) 1142 437 \n",
"862 The Derwent Star and Van Diemen's Land Intelli... 1046 12 \n",
"563 Logan and Albert Advocate (Qld. : 1893 - 1900) 842 84 \n",
"905 The Van Diemen's Land Gazette and General Adve... 1047 38 \n",
"871 The Hobart Town Gazette and Southern Reporter ... 4 1922 \n",
"2 Federal Capital Pioneer (Canberra, ACT : 1924 ... 69 542 \n",
"1231 The Melbourne Advertiser (Vic. : 1838) 935 120 \n",
"728 South Australian Gazette and Colonial Register... 40 1051 \n",
"143 Intelligence (Bowral, NSW : 1884) 624 117 \n",
"892 The People's Horn Boy (Hobart Town, Tas. : 1834) 1240 99 \n",
"1657 York Advocate (WA : 1915) 1131 236 \n",
"\n",
" total_articles proportion \n",
"468 3 1.000000 \n",
"824 1556 1.000000 \n",
"421 286 1.000000 \n",
"149 45 1.000000 \n",
"921 21 1.000000 \n",
"739 47 1.000000 \n",
"472 20 1.000000 \n",
"284 53 1.000000 \n",
"24 78 1.000000 \n",
"838 193 1.000000 \n",
"538 56 1.000000 \n",
"215 21 1.000000 \n",
"195 6 1.000000 \n",
"979 17 1.000000 \n",
"1472 437 1.000000 \n",
"862 12 1.000000 \n",
"563 84 1.000000 \n",
"905 38 1.000000 \n",
"871 1923 0.999480 \n",
"2 545 0.994495 \n",
"1231 121 0.991736 \n",
"728 1065 0.986854 \n",
"143 119 0.983193 \n",
"892 101 0.980198 \n",
"1657 241 0.979253 "
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_newspapers_with_titles[:25]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At the other end, we can see the newspapers with the lowest rates of correction. Note that some newspapers have no corrections at all."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" title id total_results \\\n",
"1247 The Morwell Advocate and Narracan, Boolara and... 1734 0 \n",
"1559 The Mount Margaret Mercury (WA : 1897) 1641 0 \n",
"1553 The Miner's Right (Perth, WA : 1894) 1729 0 \n",
"1508 The Derby News (WA : 1887) 1617 0 \n",
"864 The Herald of Tasmania (Hobart, Tas. : 1845) 1741 0 \n",
"1246 The Morwell Advocate and Boolara and Mirboo Ch... 1733 0 \n",
"874 The Hobart Town Herald and abstinence advocate... 1743 0 \n",
"1144 Seamen's Strike Bulletin (Melbourne, Vic. : 1919) 1043 0 \n",
"872 The Hobart Town Herald (Tas. : 1845) 1740 0 \n",
"875 The Hobart Town Herald, or, Southern reporter ... 1742 0 \n",
"837 Saturday Evening Express (Launceston, Tas. : 1... 1747 7 \n",
"514 Vil'na Dumka = Free Thought (Sydney, NSW : 194... 1593 2 \n",
"926 Australier Leben = Australian Life (Melbourne,... 1686 1 \n",
"500 To Ethnico Vema = Greek National Tribune (Arnc... 1592 27 \n",
"685 Kimba Dispatch (SA. : 1927 - 1941) 1731 17 \n",
"746 The Challenger (Port Lincoln, SA. : 1932 - 1934) 1732 3 \n",
"1273 The Western Port Times and Phillip Island and ... 1365 7 \n",
"155 L'Italo-Australiano = The Italo-Australian (Sy... 1597 5 \n",
"178 Mu̇sų Pastogė = Our Haven (Sydney, NSW : 195... 1594 9 \n",
"764 The Northern Districts Courier (North Adelaide... 1711 1 \n",
"1250 The Narracan Shire Advocate and Yallourn Brown... 1735 49 \n",
"147 Italo-Australian (Sydney, NSW : 1927 - 1940) 1595 56 \n",
"1535 The Inland Watch (Leonora, WA : 1937 - 1943) 1630 50 \n",
"873 The Hobart Town Herald (Tas. : 1880) 1744 1 \n",
"192 Oceania (Sydney, NSW : 1913 - 1915) 1598 4 \n",
"\n",
" total_articles proportion \n",
"1247 208 0.000000 \n",
"1559 24 0.000000 \n",
"1553 426 0.000000 \n",
"1508 9 0.000000 \n",
"864 50 0.000000 \n",
"1246 33 0.000000 \n",
"874 434 0.000000 \n",
"1144 14 0.000000 \n",
"872 57 0.000000 \n",
"875 103 0.000000 \n",
"837 44312 0.000158 \n",
"514 11607 0.000172 \n",
"926 3816 0.000262 \n",
"500 62861 0.000430 \n",
"685 35136 0.000484 \n",
"746 5599 0.000536 \n",
"1273 12684 0.000552 \n",
"155 6106 0.000819 \n",
"178 9060 0.000993 \n",
"764 885 0.001130 \n",
"1250 42777 0.001145 \n",
"147 38986 0.001436 \n",
"1535 28725 0.001741 \n",
"873 560 0.001786 \n",
"192 2167 0.001846 "
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_newspapers_with_titles.sort_values(by=\"proportion\")[:25]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll save the full list of newspapers as a CSV file, but first we'll fix up the column headings and add URLs for each title."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"df_newspapers_with_titles_csv = df_newspapers_with_titles.copy()\n",
"# Give the count of corrected articles a more descriptive column name\n",
"df_newspapers_with_titles_csv.rename(\n",
"    {\"total_results\": \"articles_with_corrections\"}, axis=1, inplace=True\n",
")\n",
"# Express the proportion as a percentage\n",
"df_newspapers_with_titles_csv[\"percentage_with_corrections\"] = (\n",
"    df_newspapers_with_titles_csv[\"proportion\"] * 100\n",
")\n",
"df_newspapers_with_titles_csv.sort_values(\n",
"    by=[\"percentage_with_corrections\"], inplace=True\n",
")\n",
"# Add a link to each title; the CSV itself is written in the next cell\n",
"df_newspapers_with_titles_csv[\"title_url\"] = df_newspapers_with_titles_csv[\"id\"].apply(\n",
"    lambda x: f\"http://nla.gov.au/nla.news-title{x}\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we'll save the data as a CSV file and display a link."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"tags": [
"nbval-skip"
]
},
"outputs": [
{
"data": {
"text/plain": [
"/home/tim/Workspace/mycode/glam-workbench/trove-newspapers/notebooks/titles_corrected.csv"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_newspapers_with_titles_csv.to_csv(\"titles_corrected.csv\", index=False)\n",
"display(FileLink(\"titles_corrected.csv\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Neediest newspapers\n",
"\n",
"Let's see if we can combine some guesses about OCR error rates with the correction data to find the newspapers most in need of help.\n",
"\n",
"To make a guesstimate of error rates, we'll use the occurrence of 'tbe', a common OCR error for 'the'. I don't know how valid this is, but it's a place to start!"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"# Search for 'tbe' to get an indication of errors by newspaper\n",
"params[\"q\"] = 'text:\"tbe\"~0'\n",
"params[\"facet\"] = \"title\""
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"data = get_results(params)\n",
"facets = []\n",
"for term in data[\"response\"][\"zone\"][0][\"facets\"][\"facet\"][\"term\"]:\n",
" # Get the state and the number of results, and convert it to integers, before adding to our results\n",
" facets.append({\"term\": term[\"search\"], \"total_results\": int(term[\"count\"])})\n",
"df_errors = pd.DataFrame(facets)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Merge the error data with the total articles per newspaper to calculate the proportion."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"df_errors_merged = merge_df_with_total(df_errors, df_newspapers_total, how=\"right\")\n",
"df_errors_merged.sort_values(by=\"proportion\", ascending=False, inplace=True)\n",
"df_errors_merged.rename(columns={\"term\": \"id\"}, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" id total_results total_articles proportion\n",
"1253 1316 2004 2954 0.678402\n",
"1046 758 5245 8078 0.649294\n",
"821 927 9438 17227 0.547861\n",
"912 382 6956 12744 0.545825\n",
"938 262 6256 11527 0.542726"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_errors_merged.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Add the title names."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"df_errors_with_titles = (\n",
" pd.merge(df_titles_not_gazettes, df_errors_merged, how=\"left\", on=\"id\")\n",
" .fillna(0)\n",
" .sort_values(by=\"proportion\", ascending=False)\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So this is a list of the newspapers with the highest rate of OCR error (by our rather dodgy measure)."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" title id total_results \\\n",
"487 The Weekly Advance (Granville, NSW : 1892 - 1893) 1316 2004 \n",
"977 Dunolly and Betbetshire Express and County of ... 758 5245 \n",
"1021 Hamilton Spectator and Grange District Adverti... 927 9438 \n",
"519 Wagga Wagga Express and Murrumbidgee District ... 382 6956 \n",
"619 The North Australian, Ipswich and General Adve... 262 6256 \n",
"618 The North Australian (Brisbane, Qld. : 1863 - ... 264 2868 \n",
"864 The Herald of Tasmania (Hobart, Tas. : 1845) 1741 26 \n",
"342 The Hay Standard and Advertiser for Balranald,... 725 21671 \n",
"209 Robertson Advocate (NSW : 1894 - 1923) 530 36946 \n",
"235 Temora Herald and Mining Journal (NSW : 1882 -... 728 639 \n",
"229 Sydney Mail (NSW : 1860 - 1871) 697 24539 \n",
"840 Tasmanian Morning Herald (Hobart, Tas. : 1865 ... 865 5178 \n",
"835 Morning Star and Commercial Advertiser (Hobart... 1242 850 \n",
"169 Molong Argus (NSW : 1896 - 1921) 424 52061 \n",
"1114 Port Phillip Gazette (Vic. : 1851) 1139 241 \n",
"1115 Port Phillip Gazette and Settler's Journal (Vi... 1138 5947 \n",
"846 Telegraph (Hobart Town, Tas. : 1867) 1250 68 \n",
"872 The Hobart Town Herald (Tas. : 1845) 1740 27 \n",
"310 The Cumberland Free Press (Parramatta, NSW : 1... 724 6231 \n",
"908 Trumpeter General (Hobart, Tas. : 1833 - 1834) 869 693 \n",
"649 Adelaide Chronicle and South Australian Litera... 986 892 \n",
"565 Logan Witness (Beenleigh, Qld. : 1878 - 1893) 850 6681 \n",
"392 The News, Shoalhaven and Southern Coast Distri... 1588 2466 \n",
"611 The Darling Downs Gazette and General Advertis... 257 29213 \n",
"859 The Cornwall Chronicle (Launceston, Tas. : 183... 170 72532 \n",
"\n",
" total_articles proportion \n",
"487 2954 0.678402 \n",
"977 8078 0.649294 \n",
"1021 17227 0.547861 \n",
"519 12744 0.545825 \n",
"619 11527 0.542726 \n",
"618 5314 0.539706 \n",
"864 50 0.520000 \n",
"342 42068 0.515142 \n",
"209 72383 0.510424 \n",
"235 1253 0.509976 \n",
"229 48535 0.505594 \n",
"840 10290 0.503207 \n",
"835 1703 0.499119 \n",
"169 104984 0.495895 \n",
"1114 491 0.490835 \n",
"1115 12127 0.490393 \n",
"846 140 0.485714 \n",
"872 57 0.473684 \n",
"310 13247 0.470371 \n",
"908 1482 0.467611 \n",
"649 1937 0.460506 \n",
"565 14654 0.455916 \n",
"392 5495 0.448772 \n",
"611 65268 0.447585 \n",
"859 163791 0.442833 "
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_errors_with_titles[:25]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And those with the lowest rate of errors. Note the number of non-English newspapers in this list – of course our measure of accuracy fails completely in newspapers that don't use the word 'the'!"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" title id total_results \\\n",
"669 Deutsche Zeitung für Sud-Australien = German ... 1577 0 \n",
"774 The Progressive Times (Largs North, SA : 1949 ... 1307 0 \n",
"1506 The Dawn Newsletter (Perth, WA : 1952 - 1954) 1773 0 \n",
"1359 Eco Italiano (Perth, WA : 1958 - 1959) 1387 0 \n",
"1208 The Elsternwick Leader and Caulfield and Balac... 200 0 \n",
"1358 Echo : Polski Tygodnik Niezalezny (Perth, WA :... 1384 0 \n",
"34 Auburn and District News (NSW : 1929) 1320 0 \n",
"1354 Der Australische Spiegel = The Australian Mirr... 1385 0 \n",
"468 The Temora Telegraph and Mining Advocate (NSW ... 729 0 \n",
"921 Alexandra and Yea Standard, Thornton, Gobur an... 154 0 \n",
"1348 Dampier Despatch (Broome, WA : 1904 - 1905) 1407 0 \n",
"472 The True Sun and New South Wales Independent P... 1038 0 \n",
"49 Blayney West Macquarie (NSW : 1949) 802 0 \n",
"195 Party (Sydney, NSW : 1942) 1000 0 \n",
"1597 The Southern Cross (Perth, WA : 1893) 1660 0 \n",
"905 The Van Diemen's Land Gazette and General Adve... 1047 0 \n",
"193 Out Of Work (Sydney, NSW : 1922) 1008 0 \n",
"862 The Derwent Star and Van Diemen's Land Intelli... 1046 0 \n",
"1195 The Chinese Advertiser (Ballarat, Vic. : 1856) 706 0 \n",
"1 Canberra Illustrated: A Quarterly Magazine (AC... 165 0 \n",
"1505 The Dawn News-Sheet (Perth, WA : 1950 - 1952) 1772 0 \n",
"1589 The Possum (Fremantle, WA : 1890) 1201 0 \n",
"69 Citizen Soldier (Sydney, NSW : 1942) 996 0 \n",
"70 Clarence and Richmond Examiner (Grafton, NSW :... 104 0 \n",
"196 Party Builder (Sydney, NSW : 1942) 1001 0 \n",
"\n",
" total_articles proportion \n",
"669 14 0.0 \n",
"774 1446 0.0 \n",
"1506 40 0.0 \n",
"1359 1579 0.0 \n",
"1208 47 0.0 \n",
"1358 2601 0.0 \n",
"34 25 0.0 \n",
"1354 1455 0.0 \n",
"468 3 0.0 \n",
"921 21 0.0 \n",
"1348 871 0.0 \n",
"472 20 0.0 \n",
"49 110 0.0 \n",
"195 6 0.0 \n",
"1597 59 0.0 \n",
"905 38 0.0 \n",
"193 32 0.0 \n",
"862 12 0.0 \n",
"1195 15 0.0 \n",
"1 57 0.0 \n",
"1505 29 0.0 \n",
"1589 105 0.0 \n",
"69 60 0.0 \n",
"70 111 0.0 \n",
"196 160 0.0 "
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_errors_with_titles[-25:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's merge the error data with the correction data, matching the two datasets on each newspaper's `id`."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
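"# Left join on the newspaper id. Columns present in both dataframes get\n",
"# an `_x` suffix (correction data) or a `_y` suffix (error data).\n",
"corrections_errors_merged_df = pd.merge(\n",
"    df_newspapers_with_titles, df_errors_with_titles, how=\"left\", on=\"id\"\n",
")"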
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>title_x</th>\n",
"      <th>id</th>\n",
"      <th>total_results_x</th>\n",
"      <th>total_articles_x</th>\n",
"      <th>proportion_x</th>\n",
"      <th>title_y</th>\n",
"      <th>total_results_y</th>\n",
"      <th>total_articles_y</th>\n",
"      <th>proportion_y</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>The Temora Telegraph and Mining Advocate (NSW ...</td>\n",
"      <td>729</td>\n",
"      <td>3</td>\n",
"      <td>3</td>\n",
"      <td>1.0</td>\n",
"      <td>The Temora Telegraph and Mining Advocate (NSW ...</td>\n",
"      <td>0</td>\n",
"      <td>3</td>\n",
"      <td>0.000000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>Hobart Town Gazette and Van Diemen's Land Adve...</td>\n",
"      <td>5</td>\n",
"      <td>1556</td>\n",
"      <td>1556</td>\n",
"      <td>1.0</td>\n",
"      <td>Hobart Town Gazette and Van Diemen's Land Adve...</td>\n",
"      <td>40</td>\n",
"      <td>1556</td>\n",
"      <td>0.025707</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>The Satirist and Sporting Chronicle (Sydney, N...</td>\n",
"      <td>1028</td>\n",
"      <td>286</td>\n",
"      <td>286</td>\n",
"      <td>1.0</td>\n",
"      <td>The Satirist and Sporting Chronicle (Sydney, N...</td>\n",
"      <td>0</td>\n",
"      <td>286</td>\n",
"      <td>0.000000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>Justice (Narrabri, NSW : 1891)</td>\n",
"      <td>885</td>\n",
"      <td>45</td>\n",
"      <td>45</td>\n",
"      <td>1.0</td>\n",
"      <td>Justice (Narrabri, NSW : 1891)</td>\n",
"      <td>1</td>\n",
"      <td>45</td>\n",
"      <td>0.022222</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>Alexandra and Yea Standard, Thornton, Gobur an...</td>\n",
"      <td>154</td>\n",
"      <td>21</td>\n",
"      <td>21</td>\n",
"      <td>1.0</td>\n",
"      <td>Alexandra and Yea Standard, Thornton, Gobur an...</td>\n",
"      <td>0</td>\n",
"      <td>21</td>\n",
"      <td>0.000000</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" title_x id total_results_x \\\n",
"0 The Temora Telegraph and Mining Advocate (NSW ... 729 3 \n",
"1 Hobart Town Gazette and Van Diemen's Land Adve... 5 1556 \n",
"2 The Satirist and Sporting Chronicle (Sydney, N... 1028 286 \n",
"3 Justice (Narrabri, NSW : 1891) 885 45 \n",
"4 Alexandra and Yea Standard, Thornton, Gobur an... 154 21 \n",
"\n",
" total_articles_x proportion_x \\\n",
"0 3 1.0 \n",
"1 1556 1.0 \n",
"2 286 1.0 \n",
"3 45 1.0 \n",
"4 21 1.0 \n",
"\n",
" title_y total_results_y \\\n",
"0 The Temora Telegraph and Mining Advocate (NSW ... 0 \n",
"1 Hobart Town Gazette and Van Diemen's Land Adve... 40 \n",
"2 The Satirist and Sporting Chronicle (Sydney, N... 0 \n",
"3 Justice (Narrabri, NSW : 1891) 1 \n",
"4 Alexandra and Yea Standard, Thornton, Gobur an... 0 \n",
"\n",
" total_articles_y proportion_y \n",
"0 3 0.000000 \n",
"1 1556 0.025707 \n",
"2 286 0.000000 \n",
"3 45 0.022222 \n",
"4 21 0.000000 "
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corrections_errors_merged_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"# Share of each title's articles that have no corrections\n",
"corrections_errors_merged_df[\"proportion_uncorrected\"] = (\n",
"    1 - corrections_errors_merged_df[\"proportion_x\"]\n",
")\n",
"# Give the merged columns descriptive names\n",
"corrections_errors_merged_df.rename(\n",
"    columns={\n",
"        \"title_x\": \"title\",\n",
"        \"proportion_x\": \"proportion_corrected\",\n",
"        \"proportion_y\": \"proportion_with_errors\",\n",
"    },\n",
"    inplace=True,\n",
")\n",
"# Highest error rates first, ties broken by the share left uncorrected\n",
"corrections_errors_merged_df.sort_values(\n",
"    by=[\"proportion_with_errors\", \"proportion_uncorrected\"],\n",
"    ascending=False,\n",
"    inplace=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
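"So, for what it's worth, here's a list of the neediest newspapers: those with high error rates and low correction rates! The list is sorted by the proportion of articles with errors, then by the proportion left uncorrected. As I've said, this is a pretty dodgy method, but interesting nonetheless."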
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>title</th>\n",
"      <th>proportion_with_errors</th>\n",
"      <th>proportion_uncorrected</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>1193</th>\n",
"      <td>The Weekly Advance (Granville, NSW : 1892 - 1893)</td>\n",
"      <td>0.678402</td>\n",
"      <td>0.970887</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>630</th>\n",
"      <td>Dunolly and Betbetshire Express and County of ...</td>\n",
"      <td>0.649294</td>\n",
"      <td>0.929809</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>393</th>\n",
"      <td>Hamilton Spectator and Grange District Adverti...</td>\n",
"      <td>0.547861</td>\n",
"      <td>0.886631</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>419</th>\n",
"      <td>Wagga Wagga Express and Murrumbidgee District ...</td>\n",
"      <td>0.545825</td>\n",
"      <td>0.893597</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>189</th>\n",
"      <td>The North Australian, Ipswich and General Adve...</td>\n",
"      <td>0.542726</td>\n",
"      <td>0.758393</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>259</th>\n",
"      <td>The North Australian (Brisbane, Qld. : 1863 - ...</td>\n",
"      <td>0.539706</td>\n",
"      <td>0.827249</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1653</th>\n",
"      <td>The Herald of Tasmania (Hobart, Tas. : 1845)</td>\n",
"      <td>0.520000</td>\n",
"      <td>1.000000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1030</th>\n",
"      <td>The Hay Standard and Advertiser for Balranald,...</td>\n",
"      <td>0.515142</td>\n",
"      <td>0.960397</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>857</th>\n",
"      <td>Robertson Advocate (NSW : 1894 - 1923)</td>\n",
"      <td>0.510424</td>\n",
"      <td>0.949864</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>615</th>\n",
"      <td>Temora Herald and Mining Journal (NSW : 1882 -...</td>\n",
"      <td>0.509976</td>\n",
"      <td>0.927374</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>349</th>\n",
"      <td>Sydney Mail (NSW : 1860 - 1871)</td>\n",
"      <td>0.505594</td>\n",
"      <td>0.871639</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>509</th>\n",
"      <td>Tasmanian Morning Herald (Hobart, Tas. : 1865 ...</td>\n",
"      <td>0.503207</td>\n",
"      <td>0.910690</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>144</th>\n",
"      <td>Morning Star and Commercial Advertiser (Hobart...</td>\n",
"      <td>0.499119</td>\n",
"      <td>0.707575</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>691</th>\n",
"      <td>Molong Argus (NSW : 1896 - 1921)</td>\n",
"      <td>0.495895</td>\n",
"      <td>0.935047</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>220</th>\n",
"      <td>Port Phillip Gazette (Vic. : 1851)</td>\n",
"      <td>0.490835</td>\n",
"      <td>0.798371</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>204</th>\n",
"      <td>Port Phillip Gazette and Settler's Journal (Vi...</td>\n",
"      <td>0.490393</td>\n",
"      <td>0.786757</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>197</th>\n",
"      <td>Telegraph (Hobart Town, Tas. : 1867)</td>\n",
"      <td>0.485714</td>\n",
"      <td>0.771429</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1654</th>\n",
"      <td>The Hobart Town Herald (Tas. : 1845)</td>\n",
"      <td>0.473684</td>\n",
"      <td>1.000000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>361</th>\n",
"      <td>The Cumberland Free Press (Parramatta, NSW : 1...</td>\n",
"      <td>0.470371</td>\n",
"      <td>0.876349</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>134</th>\n",
"      <td>Trumpeter General (Hobart, Tas. : 1833 - 1834)</td>\n",
"      <td>0.467611</td>\n",
"      <td>0.671390</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>127</th>\n",
"      <td>Adelaide Chronicle and South Australian Litera...</td>\n",
"      <td>0.460506</td>\n",
"      <td>0.665978</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>261</th>\n",
"      <td>Logan Witness (Beenleigh, Qld. : 1878 - 1893)</td>\n",
"      <td>0.455916</td>\n",
"      <td>0.828102</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1139</th>\n",
"      <td>The News, Shoalhaven and Southern Coast Distri...</td>\n",
"      <td>0.448772</td>\n",
"      <td>0.967789</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>278</th>\n",
"      <td>The Darling Downs Gazette and General Advertis...</td>\n",
"      <td>0.447585</td>\n",
"      <td>0.836168</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>243</th>\n",
"      <td>The Cornwall Chronicle (Launceston, Tas. : 183...</td>\n",
"      <td>0.442833</td>\n",
"      <td>0.816089</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" title \\\n",
"1193 The Weekly Advance (Granville, NSW : 1892 - 1893) \n",
"630 Dunolly and Betbetshire Express and County of ... \n",
"393 Hamilton Spectator and Grange District Adverti... \n",
"419 Wagga Wagga Express and Murrumbidgee District ... \n",
"189 The North Australian, Ipswich and General Adve... \n",
"259 The North Australian (Brisbane, Qld. : 1863 - ... \n",
"1653 The Herald of Tasmania (Hobart, Tas. : 1845) \n",
"1030 The Hay Standard and Advertiser for Balranald,... \n",
"857 Robertson Advocate (NSW : 1894 - 1923) \n",
"615 Temora Herald and Mining Journal (NSW : 1882 -... \n",
"349 Sydney Mail (NSW : 1860 - 1871) \n",
"509 Tasmanian Morning Herald (Hobart, Tas. : 1865 ... \n",
"144 Morning Star and Commercial Advertiser (Hobart... \n",
"691 Molong Argus (NSW : 1896 - 1921) \n",
"220 Port Phillip Gazette (Vic. : 1851) \n",
"204 Port Phillip Gazette and Settler's Journal (Vi... \n",
"197 Telegraph (Hobart Town, Tas. : 1867) \n",
"1654 The Hobart Town Herald (Tas. : 1845) \n",
"361 The Cumberland Free Press (Parramatta, NSW : 1... \n",
"134 Trumpeter General (Hobart, Tas. : 1833 - 1834) \n",
"127 Adelaide Chronicle and South Australian Litera... \n",
"261 Logan Witness (Beenleigh, Qld. : 1878 - 1893) \n",
"1139 The News, Shoalhaven and Southern Coast Distri... \n",
"278 The Darling Downs Gazette and General Advertis... \n",
"243 The Cornwall Chronicle (Launceston, Tas. : 183... \n",
"\n",
" proportion_with_errors proportion_uncorrected \n",
"1193 0.678402 0.970887 \n",
"630 0.649294 0.929809 \n",
"393 0.547861 0.886631 \n",
"419 0.545825 0.893597 \n",
"189 0.542726 0.758393 \n",
"259 0.539706 0.827249 \n",
"1653 0.520000 1.000000 \n",
"1030 0.515142 0.960397 \n",
"857 0.510424 0.949864 \n",
"615 0.509976 0.927374 \n",
"349 0.505594 0.871639 \n",
"509 0.503207 0.910690 \n",
"144 0.499119 0.707575 \n",
"691 0.495895 0.935047 \n",
"220 0.490835 0.798371 \n",
"204 0.490393 0.786757 \n",
"197 0.485714 0.771429 \n",
"1654 0.473684 1.000000 \n",
"361 0.470371 0.876349 \n",
"134 0.467611 0.671390 \n",
"127 0.460506 0.665978 \n",
"261 0.455916 0.828102 \n",
"1139 0.448772 0.967789 \n",
"278 0.447585 0.836168 \n",
"243 0.442833 0.816089 "
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corrections_errors_merged_df[\n",
" [\"title\", \"proportion_with_errors\", \"proportion_uncorrected\"]\n",
"][:25]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"\n",
"Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/). \n",
"Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}