{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Download the OCRd text for ALL the digitised periodicals in Trove!\n", "\n", "Finding which periodical issues in Trove have OCRd text you can download is not as easy as it should be. The `fullTextInd` index doesn't seem to distinguish between digitised works (with OCR) and born-digital publications (like PDFs) without downloadable text. You can use `has:correctabletext` to find articles with OCR, but you can't get a full list of the periodicals the articles come from using the `title` facet. As [this notebook explains](Create-digitised-journals-list.ipynb), you can search for `nla.obj`, but this returns both digitised works and publications supplied through edeposit. In previous harvests of OCRd text I processed all of the titles returned by the `nla.obj` search, finding out whether there was any OCRd text by just requesting it and seeing what came back. But the number of non-digitised works on the list of periodicals in digital form has skyrocketed through the edeposit scheme, and this approach is no longer practical. It just means you waste a lot of time asking for things that don't exist.\n", "\n", "For the latest harvest I took a different approach. I only processed periodicals in digital form that *weren't* identified as coming through edeposit. These are the publications with a `fulltext_url_type` value of either 'digitised' or 'other' in my [dataset of digital periodicals](digital-journals-20220831.csv). Is it possible that there's some downloadable text in edeposit works that's now missing from the harvest? Yep, but I think this is a much more sensible, straightforward, and reproducible approach.\n", "\n", "That's not the only problem. As I noted when creating the list of periodicals in digital form, there are duplicates in the list, so they have to be removed. You then have to find information about the issues available for each title. This is not provided by the Trove API, but there is an internal API used in the web interface that you can access – see [this notebook for details](Get-text-from-a-Trove-journal.ipynb). I also noticed that sometimes, where there's a single issue of a title, it's presented as if each page is an issue. I think I've found a workaround for that as well.\n", "\n", "All these doubts, inconsistencies and workarounds mean that I'm fairly certain I don't have *everything*. But I do think I have *most* of the OCRd text available from digitised periodicals, and I do have a methodology, documented in this notebook, that at least provides a starting point for further investigation. As I noted in my comments on the plan for a Trove Researcher Platform, it would be great if more metadata for digitised works, other than newspapers, was made available through the API.\n", "\n", "So this notebook puts together the [list of periodicals](digital-journals-20220831.csv) [created by this notebook](Create-digitised-journals-list.ipynb) with the [code in this notebook](Get-text-from-a-Trove-journal.ipynb), and downloads as much OCRd text from digitised periodicals in the `journals` zone as it can find. If you're going to try this, you'll need lots of patience and lots of disk space. Needless to say, don't try this on a cloud service like Binder.\n", "\n", "Fortunately you don't have to do it yourself, as I've already run the harvest and made all the text files available. See below for details.\n", "\n", "I repeat, **you probably don't want to do this yourself**. 
The point of this notebook is really to document the methodology used to create the repository.\n", "\n", "If you really, really do want to do it yourself, you should first [generate an updated list of digitised periodicals](Create-digitised-journals-list.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Here's a harvest I prepared earlier...\n", "\n", "I last ran this harvest in August 2022. Here are the results:\n", "\n", "* 1,430 periodicals had OCRd text available for download\n", "* OCRd text was downloaded from 41,645 periodical issues\n", "* About 9GB of text was downloaded\n", "\n", "Note that the previous harvest excluded periodicals with the format 'government publication'; this time I've kept them in, even though there's overlap with the 'Book' zone. I'm planning a new database to bring together the book and periodical harvests anyway. Also, the harvested collection last time included some details of periodicals without OCRd text, but they've all now been excluded. So the full collection on CloudStor *only* includes periodicals with OCRd text. This makes it more compact and manageable. Finally, I think some duplicates slipped into the last harvest – I hope I've removed them this time.\n", "\n", "The list of digital journals with OCRd text is available both as a [human-readable list](digital-journals-with-text.md) and a [CSV formatted spreadsheet](digital-journals-with-text-20220831.csv).\n", "\n", "The complete collection of text files for all the journals can be downloaded [from this repository on CloudStor](https://cloudstor.aarnet.edu.au/plus/s/QOmnqpGQCNCSC2h)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting things up" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Let's import the libraries we need.\n", "import json\n", "import os\n", "import re\n", "import time\n", "from datetime import datetime\n", "\n", "import arrow\n", "import pandas as pd\n", "import requests\n", "import requests_cache\n", "from bs4 import BeautifulSoup\n", "from IPython.display import HTML, FileLink, display\n", "from requests.adapters import HTTPAdapter\n", "from requests.exceptions import ConnectionError, Timeout\n", "from requests.packages.urllib3.util.retry import Retry\n", "from slugify import slugify\n", "from tqdm.auto import tqdm\n", "\n", "# Use a cached session so an interrupted harvest can be resumed, and retry automatically on server errors\n", "s = requests_cache.CachedSession()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "def get_work_metadata(obj_id):\n", "    \"\"\"\n", "    Extracts metadata embedded as a JSON string in a work's HTML page.\n", "    See: https://glam-workbench.net/trove-books/metadata-for-digital-works/\n", "    \"\"\"\n", "    # Get the HTML page\n", "    response = requests.get(f\"https://nla.gov.au/{obj_id}\")\n", "    # Search for the JSON string using regex\n", "    try:\n", "        work_data = re.search(\n", "            r\"var work = JSON\\.parse\\(JSON\\.stringify\\((\\{.*\\})\", response.text\n", "        ).group(1)\n", "    except AttributeError:\n", "        # Just in case it's not there...\n", "        work_data = \"{}\"\n", "        print(\"No data found!\")\n", "    # Return the JSON data\n", "    return json.loads(work_data)\n", "\n", "\n", "def harvest_metadata(obj_id):\n", "    \"\"\"\n", "    This calls an internal API from a journal landing page to extract a list of available issues.\n", "    
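For example:\n", "    https://nla.gov.au/nla.obj-320790312/browse?startIdx=0&rows=20&op=c\n", "    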
\"\"\"\n", "    start_url = \"https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c\"\n", "    # The initial startIdx value\n", "    start = 0\n", "    # Number of results per page\n", "    n = 20\n", "    issues = []\n", "    with tqdm(desc=\"Issues\", leave=False) as pbar:\n", "        # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.\n", "        while n == 20:\n", "            # Get the browse page\n", "            response = s.get(start_url.format(obj_id, start), timeout=60)\n", "            # BeautifulSoup turns the HTML into an easily navigable structure\n", "            soup = BeautifulSoup(response.text, \"lxml\")\n", "            # Find all the divs containing issue details and loop through them\n", "            details = soup.find_all(class_=\"l-item-info\")\n", "            for detail in details:\n", "                issue = {}\n", "                title = detail.find(\"h3\")\n", "                if title:\n", "                    issue[\"title\"] = title.text\n", "                    issue[\"id\"] = title.parent[\"href\"].strip(\"/\")\n", "                else:\n", "                    issue[\"title\"] = \"No title\"\n", "                    issue[\"id\"] = detail.find(\"a\")[\"href\"].strip(\"/\")\n", "                try:\n", "                    # Get the issue details\n", "                    issue[\"details\"] = detail.find(\n", "                        class_=\"obj-reference content\"\n", "                    ).string.strip()\n", "                except (AttributeError, IndexError):\n", "                    issue[\"details\"] = \"issue\"\n", "                # Get the number of pages\n", "                try:\n", "                    issue[\"pages\"] = int(\n", "                        re.search(\n", "                            r\"^(\\d+)\",\n", "                            detail.find(\"a\", attrs={\"data-pid\": issue[\"id\"]}).text,\n", "                            flags=re.MULTILINE,\n", "                        ).group(1)\n", "                    )\n", "                except AttributeError:\n", "                    # A number of 'periodicals' are actually single issues in which pages are split and treated as individual issues; they don't have a number of pages\n", "                    # Doesn't seem to be an easy way of capturing these\n", "                    issue[\"pages\"] = 0\n", "                issues.append(issue)\n", "                # print(issue)\n", "            if not response.from_cache:\n", "                time.sleep(0.5)\n", "            # Increment the startIdx\n", "            start += n\n", "            # Set n to the number of results on the current page\n", "            n = len(details)\n", "            pbar.update(n)\n", "    return issues\n", "\n", "\n", "def issues_have_pages(issues):\n", "    \"\"\"Check whether any of the issues have a known number of pages.\"\"\"\n", "    for issue in issues:\n", "        if issue[\"pages\"] > 0:\n", "            return True\n", "    return False\n", "\n", "\n", "def download_issue(issue_id, last_page):\n", "    \"\"\"Download the OCRd text of an issue by requesting its full page range.\"\"\"\n", "    url = f\"https://trove.nla.gov.au/{issue_id}/download?downloadOption=ocr&firstPage=0&lastPage={last_page}\"\n", "    # print(url)\n", "    # Get the file\n", "    try:\n", "        r = s.get(url, timeout=180)\n", "    except (Timeout, ConnectionError) as err:\n", "        print(f\"{type(err).__name__}: {url}\")\n", "    else:\n", "        # Check there was no error\n", "        if r.status_code == requests.codes.ok:\n", "            # Check that the file's not empty\n", "            r.encoding = \"utf-8\"\n", "            # Check that the file isn't HTML (some not found pages don't return 404s)\n", "            # BS is too lax and will pass text files that happen to have html tags in them\n", "            # if BeautifulSoup(r.text, \"html.parser\").find(\"html\") is None:\n", "            if (\n", "                len(r.text) > 0\n", "                and not r.text.isspace()\n", "                and not re.search(r\"<html\", r.text, re.IGNORECASE)\n", "            ):\n", "                return r\n", "\n", "\n", "def save_ocr(issues, obj_id, title=None, output_dir=\"journals\"):\n", "    \"\"\"\n", "    Download the OCRd text for each issue.\n", "    \"\"\"\n", "    processed_issues = []\n", "    if not title:\n", "        title = issues[0][\"title\"]\n", "    output_path = os.path.join(output_dir, \"{}-{}\".format(slugify(title)[:50], obj_id))\n", "    texts_path = os.path.join(output_path, \"texts\")\n", "    os.makedirs(texts_path, exist_ok=True)\n", "\n", "    if not issues_have_pages(issues):\n", 
"        # Some things that look like collections of issues are actually just a single issue\n", "        # Let's try and get text from the parent id\n", "        work_data = get_work_metadata(obj_id)\n", "        try:\n", "            issue_pages = len(work_data[\"children\"][\"page\"])\n", "        except KeyError:\n", "            processed_issues = issues\n", "        else:\n", "            last_page = issue_pages - 1\n", "            r = download_issue(obj_id, last_page)\n", "            if r:\n", "                file_name = f\"{slugify(title)[:50]}-{obj_id}.txt\"\n", "                file_path = os.path.join(texts_path, file_name)\n", "                with open(file_path, \"w\", encoding=\"utf-8\") as text_file:\n", "                    text_file.write(r.text)\n", "                issue = {\n", "                    \"title\": title,\n", "                    \"id\": obj_id,\n", "                    \"details\": \"\",\n", "                    \"pages\": issue_pages,\n", "                    \"text_file\": file_name,\n", "                }\n", "                if not r.from_cache:\n", "                    time.sleep(1)\n", "                processed_issues.append(issue)\n", "            else:\n", "                processed_issues = issues\n", "    else:\n", "        for issue in tqdm(issues, desc=\"Texts\", leave=False):\n", "            # Default values\n", "            issue[\"text_file\"] = \"\"\n", "            if issue[\"pages\"] != 0:\n", "                # print(book['title'])\n", "                # The index value for the last page of an issue will be the total pages - 1\n", "                last_page = issue[\"pages\"] - 1\n", "                file_name = \"{}-{}-{}.txt\".format(\n", "                    slugify(issue[\"title\"])[:50],\n", "                    slugify(issue[\"details\"])[:50],\n", "                    issue[\"id\"],\n", "                )\n", "                file_path = os.path.join(texts_path, file_name)\n", "                # Check to see if the file has already been harvested\n", "                if os.path.exists(file_path) and os.path.getsize(file_path) > 0:\n", "                    # print('Already saved')\n", "                    issue[\"text_file\"] = file_name\n", "                else:\n", "                    r = download_issue(issue[\"id\"], last_page)\n", "                    # In case the page number is wrong, try going back a few\n", "                    while r is None and last_page > max(0, issue[\"pages\"] - 5):\n", "                        last_page = last_page - 1\n", "                        r = download_issue(issue[\"id\"], last_page)\n", "                    if r:\n", "                        with open(file_path, \"w\", encoding=\"utf-8\") as text_file:\n", "                            text_file.write(r.text)\n", "                        issue[\"text_file\"] = file_name\n", "                        if not r.from_cache:\n", "                            time.sleep(1)\n", "            processed_issues.append(issue)\n", "    df = pd.DataFrame(processed_issues)\n", "    # Remove empty directories\n", "    \"\"\"\n", "    try:\n", "        os.rmdir(texts_path)\n", "        os.rmdir(output_path)\n", "    except OSError:\n", "        #It's not empty, so add list of issues\n", "        df.to_csv(os.path.join(output_path, '{}-issues.csv'.format(obj_id)), index=False)\n", "    \"\"\"\n", "    try:\n", "        os.rmdir(texts_path)\n", "    except OSError:\n", "        pass\n", "    df.to_csv(os.path.join(output_path, \"{}-issues.csv\".format(obj_id)), index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Process all the journals!\n", "\n", "As already mentioned, this takes a long time. It will also probably fail at various points and you'll have to run it again. If you do restart, the script will start at the beginning, but won't redownload any text files that have already been harvested.\n", "\n", "Results for each journal are saved in a separate directory in the output directory (which defaults to `journals`). The name of the journal directory is created using the journal title and journal id. Inside this directory is a CSV formatted file containing details of all the available issues, and a `texts` sub-directory to contain the downloaded text files.\n", "\n", "The individual file names are created using the journal title, issue details, and issue identifier. 
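As a rough sketch of the naming scheme (the values below are examples only), the names are built with the same `slugify` helper imported above:\n", "\n", "```python\n", "from slugify import slugify\n", "\n", "# Example values only (in the harvest these come from the title list and the issue browse list)\n", "journal_title = \"Angry Penguins\"\n", "journal_id = \"nla.obj-320790312\"\n", "issue_details = \"Broadsheet no. 1\"\n", "issue_id = \"nla.obj-320791009\"\n", "\n", "# Journal directory: slugified title (trimmed to 50 characters) plus the journal's nla.obj identifier\n", "journal_dir = f\"{slugify(journal_title)[:50]}-{journal_id}\"\n", "\n", "# Issue text file: slugified title and issue details (both trimmed) plus the issue's nla.obj identifier\n", "file_name = f\"{slugify(journal_title)[:50]}-{slugify(issue_details)[:50]}-{issue_id}.txt\"\n", "\n", "print(journal_dir)  # angry-penguins-nla.obj-320790312\n", "print(file_name)  # angry-penguins-broadsheet-no-1-nla.obj-320791009.txt\n", "```\n", "\n", "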
So the resulting hierarchy might look something like this:\n", "\n", "```\n", "journals\n", " - angry-penguins-nla.obj-320790312\n", " - nla.obj-320790312-issues.csv\n", " - texts\n", " - angry-penguins-broadsheet-no-1-nla.obj-320791009.txt\n", "```\n", "\n", "The CSV list of issues includes the following fields:\n", "\n", "* `details` – string with issue details, might include dates, issue numbers etc.\n", "* `id` – issue identifier\n", "* `pages` – number of pages in this issue\n", "* `text_file` – file name of any downloaded OCRd text\n", "* `title` – journal title (if not supplied this will be extracted from issue browse list and so might differ from original journal title)\n", "\n", "Note that if the `text_file` field is empty, it means that no OCRd text could be extracted for that particular issue. Note also that if no OCRd text is available, no `texts` directory will be created." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "# You can provide a different output_dir if you want\n", "def process_titles(output_dir=\"journals\"):\n", " df = pd.read_csv(\"digital-journals-20220831.csv\")\n", " # df = pd.read_csv('government-publications-periodicals.csv')\n", " # Filter out edeposit titles, and drop any duplicates\n", " journals = (\n", " df.loc[df[\"fulltext_url_type\"] != \"edeposit\"]\n", " .sort_values(by=[\"trove_id\", \"fulltext_url_type\"])\n", " .drop_duplicates(subset=\"trove_id\", keep=\"first\")\n", " .to_dict(\"records\")\n", " )\n", " for journal in tqdm(journals, desc=\"Journals\"):\n", " issues = harvest_metadata(journal[\"trove_id\"])\n", " if issues:\n", " save_ocr(\n", " issues,\n", " journal[\"trove_id\"],\n", " title=journal[\"title\"],\n", " output_dir=output_dir,\n", " )\n", " else:\n", " print(journal[\"trove_id\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Start harvesting!!!!\n", "process_titles(\"/home/tim/Workspace/workingdata/Trove/journals/\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gather data about the harvest\n", "\n", "Because the harvesting takes a long time and is prone to failure, it seemed wise to gather data at the end, rather than keeping a running total.\n", "\n", "The cells below create a list of journals that have OCRd text. 
The list has the following fields:\n", "\n", "* `fulltext_url` – the URL of the landing page of the digital version of the journal\n", "* `title` – the title of the journal\n", "* `trove_id` – the 'nla.obj' part of the fulltext_url, a unique identifier for the digital journal\n", "* `trove_url` – the URL of the journal's metadata record in Trove\n", "* `issues` – the number of available issues\n", "* `issues_with_text` – the number of issues that OCRd text could be downloaded from\n", "* `directory` – the directory in which the files from this journal have been saved (relative to the output directory)\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def collect_issue_data(output_path=\"journals\"):\n", "    titles_with_text = []\n", "    df = pd.read_csv(\"digital-journals-20220831.csv\", keep_default_na=False)\n", "    # df = pd.read_csv('government-publications-periodicals-20210802.csv')\n", "    journals = (\n", "        df.loc[df[\"fulltext_url_type\"] != \"edeposit\"]\n", "        .sort_values(by=[\"trove_id\", \"fulltext_url_type\"])\n", "        .drop_duplicates(subset=\"trove_id\", keep=\"first\")\n", "        .to_dict(\"records\")\n", "    )\n", "    for j in journals:\n", "        j_dir = os.path.join(\n", "            output_path, \"{}-{}\".format(slugify(j[\"title\"])[:50], j[\"trove_id\"])\n", "        )\n", "        if os.path.exists(j_dir):\n", "            csv_file = os.path.join(j_dir, \"{}-issues.csv\".format(j[\"trove_id\"]))\n", "            try:\n", "                issues_df = pd.read_csv(csv_file, keep_default_na=False)\n", "            except pd.errors.EmptyDataError:\n", "                # The issues CSV is empty, so skip this journal\n", "                print(j_dir)\n", "                continue\n", "\n", "            try:\n", "                num_text = issues_df.loc[issues_df[\"text_file\"] != \"\"].shape[0]\n", "            except KeyError:\n", "                pass\n", "            else:\n", "                if num_text > 0:\n", "                    j[\"issues_with_text\"] = num_text\n", "                    j[\"issues\"] = issues_df.shape[0]\n", "                    j[\"directory\"] = \"{}-{}\".format(\n", "                        slugify(j[\"title\"])[:50], j[\"trove_id\"]\n", "                    )\n", "                    titles_with_text.append(j)\n", "    return titles_with_text" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Gather the data\n", "# titles_with_text = collect_issue_data()\n", "titles_with_text = collect_issue_data(\"/home/tim/Workspace/workingdata/Trove/journals/\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert to a dataframe." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title \\\n", "0 Science and technology statement (Canberra, A.... \n", "1 Research report / Australian National University \n", "2 Airline activities of Ansett Transport Industr... \n", "3 Anzac bulletin : issued to members of the Aust... \n", "4 The 23rd : the voice of the battalion \n", "\n", " contributor issued \\\n", "0 Australia. Department of Industry, Technology ... 1900-2022 \n", "1 Australian National University 1990-2022 \n", "2 Ansett Transport Industries 1900-1973 \n", "3 Australia. High Commission (Great Britain) 1916-1919 \n", "4 Australia. Army. Battalion, 23rd 1917-1919 \n", "\n", " format \\\n", "0 Periodical | Periodical/Journal, magazine, oth... \n", "1 Periodical | Periodical/Journal, magazine, other \n", "2 Periodical | Periodical/Journal, magazine, other \n", "3 Periodical | Periodical/Journal, magazine, other \n", "4 Periodical | Periodical/Journal, magazine, other \n", "\n", " fulltext_url \\\n", "0 https://nla.gov.au/nla.obj-1006986858 \n", "1 https://nla.gov.au/nla.obj-1018350073 \n", "2 https://nla.gov.au/nla.obj-1036685302 \n", "3 http://nla.gov.au/nla.obj-1037567 \n", "4 http://nla.gov.au/nla.obj-10414986 \n", "\n", " trove_url trove_id \\\n", "0 https://trove.nla.gov.au/work/5555883 nla.obj-1006986858 \n", "1 https://trove.nla.gov.au/work/8712197 nla.obj-1018350073 \n", "2 https://trove.nla.gov.au/work/34691176 nla.obj-1036685302 \n", "3 https://trove.nla.gov.au/work/12653430 nla.obj-1037567 \n", "4 https://trove.nla.gov.au/work/31637350 nla.obj-10414986 \n", "\n", " fulltext_url_type issues_with_text issues \\\n", "0 digitised 6 6 \n", "1 digitised 2 2 \n", "2 digitised 1 1 \n", "3 digitised 205 205 \n", "4 digitised 30 30 \n", "\n", " directory \n", "0 science-and-technology-statement-canberra-a-c-... \n", "1 research-report-australian-national-university... \n", "2 airline-activities-of-ansett-transport-industr... \n", "3 anzac-bulletin-issued-to-members-of-the-austra... \n", "4 the-23rd-the-voice-of-the-battalion-nla.obj-10... " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "try:\n", " df_text = pd.DataFrame(titles_with_text)\n", "except NameError:\n", " df_text = pd.read_csv(\"digital-journals-with-text-20220831.csv\")\n", "df_text.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save as a CSV file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a peek inside..." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1430, 11)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of journals with OCRd text\n", "df_text.shape" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "41904" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Total number of issues\n", "df_text[\"issues\"].sum()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "41645" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of issues with OCRd text\n", "df_text[\"issues_with_text\"].sum()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [ { "data": { "text/html": [ "digital-journals-with-text-20220831.csv" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "csv_file = f'digital-journals-with-text-{datetime.now().strftime(\"%Y%m%d\")}.csv'\n", "df_text.to_csv(csv_file, index=False)\n", "display(HTML(f'{csv_file}'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a markdown-formatted list" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_text.sort_values(by=[\"title\"], inplace=True)\n", "with open(\"digital-journals-with-text.md\", \"w\") as md_file:\n", " md_file.write(\"# Digitised journals from Trove with OCRd text\")\n", " md_file.write(\n", " f'\\n\\nThis harvest was completed on {arrow.now(\"Australia/Canberra\").format(\"D MMMM YYYY\")}.'\n", " )\n", " md_file.write(f\"\\n\\nNumber of journals harvested: {df_text.shape[0]:,}\")\n", " md_file.write(\n", " f'\\n\\nNumber of issues with OCRd text: {df_text[\"issues_with_text\"].sum():,}'\n", " )\n", " md_file.write(\"\\n\\n----\\n\\n\")\n", " for row in df_text.itertuples():\n", " md_file.write(f\"\\n### {row.title}\")\n", " if row.contributor:\n", " md_file.write(f\"\\n**{row.contributor}, {row.issued}**\")\n", " else:\n", " md_file.write(f\"\\n**{row.issued}**\")\n", " md_file.write(f\" \\n{row.format}\")\n", " md_file.write(\n", " f\"\\n\\n{row.issues_with_text} of {row.issues} issues have OCRd text available for download.\"\n", " )\n", " md_file.write(f\"\\n\\n* [Details on Trove]({row.trove_url})\\n\")\n", " md_file.write(f\"* [Browse issues on Trove]({row.fulltext_url})\\n\")\n", " md_file.write(\n", " f\"* [Download issue data as CSV from CloudStor](https://cloudstor.aarnet.edu.au/plus/s/QOmnqpGQCNCSC2h/download?path=%2F{row.directory}&files={row.trove_id}-issues.csv)\\n\"\n", " )\n", " md_file.write(\n", " f\"* [Download all OCRd text from CloudStor](https://cloudstor.aarnet.edu.au/plus/s/QOmnqpGQCNCSC2h/download?path=%2F{row.directory})\\n\"\n", " )\n", "\n", "display(FileLink(\"digital-journals-with-text.md\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).\n", "\n", "Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/)." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }