{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvest recently digitised files from RecordSearch\n", "\n", "This notebook scrapes data from the 'Newly scanned records' section of [RecordSearch](https://recordsearch.naa.gov.au/), creating a list of recently digitised files. I ran this code on 27 March 2021 to generate [a dataset](data/recently-digitised-20210327.csv) containing files that had been digitised in the previous month.\n", "\n", "The 'Newly scanned records' only display a month's worth of additions. However, I've modified the code below to create a 'git scraper' that uses GitHub actions to run the harvester every Sunday, saving a list of the files digitised in the previous week into a [public repository](https://github.com/wragge/naa-recently-digitised). Over time, this should build up a more complete record of the digitisation process.\n", "\n", "It took me a while to figure out how the pagination worked in the 'Newly scanned records' site. As you can see below, it's a matter of adding inputs to the main navigation form that mimic a click on the page navigation buttons. Screen scraping is such fun... 
😬" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import what we need" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import re\n", "import time\n", "from pathlib import Path\n", "\n", "import altair as alt\n", "import arrow\n", "import mechanicalsoup\n", "import pandas as pd\n", "from recordsearch_data_scraper.scrapers import RSSeries\n", "from tqdm.auto import tqdm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define some functions" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def initialise_browser():\n", "    \"\"\"\n", "    This is necessary to get an active session in RS.\n", "    \"\"\"\n", "    browser = mechanicalsoup.StatefulBrowser()\n", "    browser.open(\"https://recordsearch.naa.gov.au/scripts/Logon.asp?N=guest\")\n", "    # As of Jan 2023 these lines don't seem necessary and cause a LinkNotFound error\n", "    # browser.select_form('form[id=\"t\"]')\n", "    # browser.submit_selected()\n", "    return browser\n", "\n", "\n", "def get_date_digitised(result):\n", "    \"\"\"\n", "    Generate a formatted date from the date digitised string (eg 'Digitised 1 days ago').\n", "    It does this by getting today's date then subtracting the interval.\n", "    It's possible this might not always be accurate...\n", "    \"\"\"\n", "    # Get the string describing when the record was digitised\n", "    when_digitised = result.find(\n", "        \"div\", class_=\"card-footer card-footer-list\"\n", "    ).span.string.strip()\n", "\n", "    # Extract out the time interval and unit\n", "    interval, unit = re.search(\n", "        r\"^Digitised (\\d+) (minutes|hours|days) ago\", when_digitised\n", "    ).groups()\n", "\n", "    # Subtract interval from today's date\n", "    if unit == \"minutes\":\n", "        date_digitised = arrow.now(\"Australia/Sydney\").shift(minutes=-(int(interval)))\n", "    elif unit == \"days\":\n", "        date_digitised = arrow.now(\"Australia/Sydney\").shift(days=-(int(interval)))\n", "    elif unit == 
\"hours\":\n", " date_digitised = arrow.now(\"Australia/Sydney\").shift(hours=-(int(interval)))\n", "\n", " # ISO format the result\n", " return date_digitised.format(\"YYYY-MM-DD\")\n", "\n", "\n", "def get_records_from_page(page, pbar):\n", " \"\"\"\n", " Scrapes item metadata from the list of results.\n", " \"\"\"\n", " records = []\n", "\n", " # Get the list of results\n", " results = page.find_all(\"li\", class_=\"soda_list\")\n", "\n", " # Loop through the results, extracting the metadata\n", " for result in results:\n", " record = {}\n", " record[\"title\"] = result.img[\"title\"]\n", " record[\"item_id\"] = (\n", " result.find(\"dt\", string=\"Item ID:\")\n", " .find_next_sibling(\"dd\")\n", " .a.string.strip()\n", " )\n", " record[\"series\"] = (\n", " result.find(\"dt\", string=\"Series:\").find_next_sibling(\"dd\").a.string.strip()\n", " )\n", " record[\"control_symbol\"] = (\n", " result.find(\"dt\", string=re.compile(\"Control symbol:\"))\n", " .find_next_sibling(\"dd\")\n", " .string.strip()\n", " )\n", " record[\"date_range\"] = re.sub(\n", " r\"\\s+\",\n", " \" \",\n", " result.find(\"dt\", string=re.compile(\"Date range:\"))\n", " .find_next_sibling(\"dd\")\n", " .string.strip(),\n", " )\n", " record[\"date_digitised\"] = get_date_digitised(result)\n", " records.append(record)\n", " pbar.update(len(records))\n", " return records\n", "\n", "\n", "def get_number_of_results(page):\n", " \"\"\"\n", " Get the start, end, and total number of results from the current page of results.\n", " \"\"\"\n", " result_summary = page.find(\n", " \"label\", id=\"ContentPlaceHolderSNR_lblTopPaging\"\n", " ).string.strip()\n", " start, end, total = re.search(r\"(\\d+) to (\\d+) of (\\d+)\", result_summary).groups()\n", " return (start, end, total)\n", "\n", "\n", "def harvest_recently_digitised():\n", " records = []\n", "\n", " # Get a browser with all RecordSearch's session stuff ready\n", " browser = initialise_browser()\n", "\n", " # Open the recently digitised 
page\n", "    browser.open(\n", "        \"https://recordsearch.naa.gov.au/SearchNRetrieve/Interface/ListingReports/NewlyScannedList.aspx\"\n", "    )\n", "\n", "    # CONFIGURE THE RESULTS FORM\n", "    browser.select_form('form[id=\"formSNRMaster\"]')\n", "\n", "    # Get 200 results per page\n", "    browser[\"ctl00$ContentPlaceHolderSNR$ddlResultsPerPage\"] = \"200\"\n", "\n", "    # Get results from the past month. Other options are 'w' (week) and 'f' (fortnight).\n", "    browser[\"ctl00$ContentPlaceHolderSNR$ddlDateAdded\"] = \"m\"\n", "\n", "    # Set display to list view\n", "    # Setting these mimics a click on the List View button\n", "    browser.form.set(\"ctl00$ContentPlaceHolderSNR$btn_viewList.x\", \"11\", force=True)\n", "    browser.form.set(\"ctl00$ContentPlaceHolderSNR$btn_viewList.y\", \"9\", force=True)\n", "    browser.submit_selected()\n", "\n", "    # PROCESS RESULTS\n", "    # Get the total number of results\n", "    start, end, total = get_number_of_results(browser.page)\n", "\n", "    with tqdm(total=int(total)) as pbar:\n", "\n", "        # Process first page of results\n", "        records += get_records_from_page(browser.page, pbar)\n", "\n", "        # Loop through the rest of the results set\n", "        while end != total:\n", "            browser.select_form('form[id=\"formSNRMaster\"]')\n", "\n", "            # Setting these and submitting the form retrieves the next page of results\n", "            # Basically they mimic a click on the page navigation buttons\n", "            browser.form.set(\n", "                \"ctl00$ContentPlaceHolderSNR$listPagerTop$ctl00$ctl02.x\",\n", "                \"10\",\n", "                force=True,\n", "            )\n", "            browser.form.set(\n", "                \"ctl00$ContentPlaceHolderSNR$listPagerTop$ctl00$ctl02.y\",\n", "                \"10\",\n", "                force=True,\n", "            )\n", "            browser.submit_selected()\n", "\n", "            start, end, total = get_number_of_results(browser.page)\n", "            records += get_records_from_page(browser.page, pbar)\n", "            time.sleep(1)\n", "    return records" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run the harvest" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "records = harvest_recently_digitised()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert the results to a DataFrame" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_records = pd.DataFrame(records)\n", "df_records.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find titles of series listed in the dataset\n", "\n", "The dataset only includes the series identifiers. To make it a bit more useful, we can retrieve the title of each series and add this to the dataset.\n", "\n", "First we extract a list of unique series identifiers from the dataset, then loop through it, grabbing the series details using my RecordSearch tools library." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "series_titles = []\n", "\n", "# Loop through the list of series ids\n", "for s in tqdm(list(df_records[\"series\"].unique())):\n", "\n", " # Get the summary details from each series\n", " # Note that this includes more information than the title which could be added into the dataset if you wanted (eg location)\n", " details = RSSeries(\n", " s, include_number_digitised=False, include_access_status=False\n", " ).data\n", "\n", " # Add the titles and ids to a new list\n", " try:\n", " series_titles.append({\"series\": s, \"series_title\": details[\"title\"]})\n", " except KeyError:\n", " print(details)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we can convert the series titles into a dataframe and merge it with the records dataframe to create a new dataframe that includes the titles." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_series = pd.DataFrame(series_titles)\n", "\n", "# Merge the dataframes on the `series` column\n", "df = df_records.merge(df_series, on=\"series\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save the results to a CSV file" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df.to_csv(\n", "    Path(\"data\", f'recently-digitised-{arrow.now().format(\"YYYYMMDD\")}.csv'),\n", "    index=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Which series do the digitised files come from?\n", "\n", "Let's get a list of the series that appear most often in the dataset." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Reload previously harvested file if necessary\n", "df = pd.read_csv(\"data/recently-digitised-20210327.csv\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | series | series_title | count |
|---|---|---|---|
| 0 | B884 | Citizen Military Forces Personnel Dossiers, 19... | 20382 |
| 1 | A9301 | RAAF Personnel files of Non-Commissioned Offic... | 1207 |
| 2 | B883 | Second Australian Imperial Force Personnel Dos... | 515 |
| 3 | A10605 | Personnel Occurrence Reports | 396 |
| 4 | A6135 | Photographic colour transparencies positives, ... | 226 |
| 5 | D4881 | Alien registration cards, alphabetical series | 66 |
| 6 | MP367/1 | General correspondence files | 48 |
| 7 | A9300 | RAAF Officers Personnel files, 1921-1948 | 47 |
| 8 | A12372 | RAAF Personnel files - All Ranks [Main corresp... | 40 |
| 9 | BP5/2 | Drawings of inventions for letters patent, sin... | 37 |
| 10 | B78 | Alien registration documents | 36 |
| 11 | A2478 | Non-British European migrant selection documents | 34 |
| 12 | MP84/1 | Correspondence files, multiple number series | 30 |
| 13 | BP371/1 | Correspondence registration booklets and cards | 26 |
| 14 | A705 | Correspondence files, multiple number (Melbour... | 25 |
| 15 | A471 | Courts-Martial files [including war crimes tri... | 23 |
| 16 | J3111 | Queensland post office history files, alphabet... | 23 |
| 17 | J3109 | Historic photographic collection assembled by ... | 19 |
| 18 | BP8/1 | Mail service (contract) files, either annual s... | 19 |
| 19 | A13860 | Medical Documents - Army (Department of Defenc... | 19 |
| 20 | SP908/1 | Application for Registration of Aliens (other ... | 18 |
| 21 | A446 | Correspondence files, annual single number ser... | 18 |
| 22 | J539 | Correspondence files, multiple number series. | 16 |
| 23 | A1877 | British migrants - Selection documents for fre... | 14 |
| 24 | J26 | Medical case files, single number series with ... | 13 |

| | series | series_title | count |
|---|---|---|---|
| 17 | J3109 | Historic photographic collection assembled by ... | 19 |
| 18 | BP8/1 | Mail service (contract) files, either annual s... | 19 |
| 19 | A13860 | Medical Documents - Army (Department of Defenc... | 19 |
| 20 | SP908/1 | Application for Registration of Aliens (other ... | 18 |
| 21 | A446 | Correspondence files, annual single number ser... | 18 |
| ... | ... | ... | ... |
| 369 | BP190/4 | 'RT' series rifle range tenure correspondence ... | 1 |
| 370 | BP242/1 | Correspondence files relating to national secu... | 1 |
| 371 | BP25/1 | Alien registration papers, alphabetical series... | 1 |
| 372 | BP460/3 | Main Trust files annual single number series | 1 |
| 373 | C424 | General correspondence files, annual single nu... | 1 |

357 rows × 3 columns

| | series | series_title | count |
|---|---|---|---|
| 164 | ST1233/1 | Investigation files, single number series with... | 1 |
| 165 | K26 | Personal case files, single number series with... | 1 |
| 166 | J992 | Mail Service files, North Queensland, single n... | 1 |
| 167 | K60 | Personal case files, single number with 'M' an... | 1 |
| 168 | K269 | Inward passenger manifests for ships and aircr... | 1 |
| ... | ... | ... | ... |
| 369 | BP190/4 | 'RT' series rifle range tenure correspondence ... | 1 |
| 370 | BP242/1 | Correspondence files relating to national secu... | 1 |
| 371 | BP25/1 | Alien registration papers, alphabetical series... | 1 |
| 372 | BP460/3 | Main Trust files annual single number series | 1 |
| 373 | C424 | General correspondence files, annual single nu... | 1 |

210 rows × 3 columns
\n", "