{ "cells": [ { "cell_type": "markdown", "id": "ec606daa-7dfb-4c57-ab5f-25ef806303ce", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Harvest details of Commonwealth Parliamentary Papers digitised in Trove\n", "\n", "Trove includes thousands of digitised papers and reports presented to the Commonwealth Parliament. The *Trove Data Guide* [provides an overview of the Parliamentary Papers digitised in Trove](https://tdg.glam-workbench.net/other-digitised-resources/parliamentary-papers/overview.html).\n", "\n", "However, [finding all the Parliamentary Papers](https://tdg.glam-workbench.net/other-digitised-resources/parliamentary-papers/finding-pp.html) is not straightforward because of inconsistencies in the way they've been arranged and described. This notebook attempts to work around these problems and harvest as complete as possible data about Parliamentary Papers in Trove.\n", "\n", "The basic strategy is to harvest as many records as possible, and then merge any duplicates at the end. There are 4 main steps:\n", "\n", "- search for digitised Parliamentary Papers using the `/result` API endpoint and work through all the grouped versions in each work record, saving all that are relevant – this will expand and separate wrongly-grouped records so they can be individually harvested\n", "- enrich and expand the version records by extracting embedded metadata from the digitised item viewer – this gets the number of pages for publications, and extracts individual publication details from nested collections\n", "- check for missing parent publications – some records, such as sections extracted from a Parliamentary Paper, will have parent publications, this step makes sure we've got a record for each parent\n", "- merge duplicate and semi-duplicate records – de-duplicate based on unique fields (ie the link to the digitised item), and merge the values of other fields so that no metadata is lost\n", "\n", "It should be noted that this method takes a long time and is very inefficient. The main reason for this is that a search for Parliamentary Papers returns more than 250,000 records. Most of these records are sections (or 'articles') extracted from Parliamentary Papers and delivered through the **Magazines & Newsletters** category. However, there doesn't seem to be a reliable way of distinguishing between these 'articles' and complete publications based on the API metadata alone. The 'articles' are identified in the code below using the embedded metadata extracted from the digitised file viewer, and excluded at the merge stage.\n", "\n", "If you you don't want to harvest all the metadata yourself, see [Digitised Parliamentary Papers in Trove](https://glam-workbench.net/trove-government/trove-parliamentary-papers-data/) for a pre-harvested dataset." 
] }, { "cell_type": "markdown", "id": "dc02b73a-fe02-4f1e-8186-70ee2bdb0646", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Import what we need" ] }, { "cell_type": "code", "execution_count": null, "id": "a5e1169b-d40d-4f26-82a2-f31e0a983e08", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import json\n", "import os\n", "import re\n", "import time\n", "from functools import reduce\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "import requests\n", "import requests_cache\n", "from bs4 import BeautifulSoup\n", "from dotenv import load_dotenv\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from tqdm.auto import tqdm\n", "\n", "load_dotenv()" ] }, { "cell_type": "code", "execution_count": null, "id": "75246ced-ae39-4b0b-bf99-0688a1a7a1ba", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "s = requests_cache.CachedSession()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": null, "id": "752c4e71-5740-4787-acc6-d267ff86655d", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Insert your Trove API key between the quotes\n", "API_KEY = \"YOUR API KEY\"\n", "\n", "# Use api key value from environment variables if it is available\n", "if os.getenv(\"TROVE_API_KEY\"):\n", " API_KEY = os.getenv(\"TROVE_API_KEY\")" ] }, { "cell_type": "markdown", "id": "ce270068-c11d-45dc-b3fe-123e52b443e5", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Save search results and extract versions\n", "\n", "First we search for Parliamentary Papers using the basic query `\"nla.obj\" series:\"Parliamentary paper (Australia. Parliament)\"`. Instead of just saving each work record, we check each version grouped within the work to see if it points to a digitised resource – if it does, we add it to the dataset." 
] }, { "cell_type": "code", "execution_count": null, "id": "a34d0a48-709f-49fd-a681-6452ec5406cd", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def get_total_results(params, headers):\n", " \"\"\"\n", " Get the total number of results for a search.\n", " \"\"\"\n", " these_params = params.copy()\n", " these_params[\"n\"] = 0\n", " response = s.get(\n", " \"https://api.trove.nla.gov.au/v3/result\", params=these_params, headers=headers\n", " )\n", " data = response.json()\n", " return int(data[\"category\"][0][\"records\"][\"total\"])\n", "\n", "\n", "def get_value(record, field, keys=[\"value\"]):\n", " \"\"\"\n", " Get the values of a field.\n", " Some fields are lists of dicts, if so use the `key` to get the value.\n", " \"\"\"\n", " value = record.get(field, [])\n", " if value and isinstance(value[0], dict):\n", " for key in keys:\n", " try:\n", " return [re.sub(r\"\\s+\", \" \", v[key]) for v in value]\n", " except KeyError:\n", " pass\n", " else:\n", " return value\n", "\n", "\n", "def merge_values(record, fields, keys=[\"value\"]):\n", " \"\"\"\n", " Merges values from multiple fields, removing any duplicates.\n", " \"\"\"\n", " values = []\n", " for field in fields:\n", " values += get_value(record, field, keys)\n", " # Remove duplicates and None value\n", " return list(set([v for v in values if v is not None]))\n", "\n", "\n", "def flatten_values(record, field, key=\"type\"):\n", " \"\"\"\n", " If a field has a value and type, return the values as strings with this format: 'type: value'\n", " \"\"\"\n", " flattened = []\n", " values = record.get(field, [])\n", " for value in values:\n", " if key in value:\n", " flattened.append(f\"{value[key]}: {value['value']}\")\n", " else:\n", " flattened.append(value[\"value\"])\n", " return flattened\n", "\n", "\n", "def flatten_identifiers(record):\n", " \"\"\"\n", " Get a list of control numbers from the identifier field and flatten the values.\n", " \"\"\"\n", " ids = {\n", " \"identifier\": [\n", " v\n", " for v in record.get(\"identifier\", [])\n", " if \"type\" in v and v[\"type\"] == \"control number\"\n", " ]\n", " }\n", " return flatten_values(ids, \"identifier\", \"source\")\n", "\n", "\n", "def get_fulltext_url(links):\n", " \"\"\"\n", " Loop through the identifiers to find a link to the full text version of the book.\n", " \"\"\"\n", " urls = []\n", " for link in links:\n", " if (\n", " \"linktype\" in link\n", " and link[\"linktype\"] == \"fulltext\"\n", " and \"nla.obj\" in link[\"value\"]\n", " ):\n", " url = re.sub(r\"^http\\b\", \"https\", link[\"value\"])\n", " link_text = link.get(\"linktext\", \"\")\n", " urls.append({\"url\": url, \"link_text\": link_text})\n", " return urls\n", "\n", "\n", "def get_catalogue_url(links):\n", " \"\"\"\n", " Loop through the identifiers to find a link to the NLA catalogue.\n", " \"\"\"\n", " for link in links:\n", " if (\n", " \"linktype\" in link\n", " and link[\"linktype\"] == \"notonline\"\n", " and \"nla.cat\" in link[\"value\"]\n", " ):\n", " return link[\"value\"]\n", " return \"\"\n", "\n", "\n", "def has_fulltext_link(links):\n", " \"\"\"\n", " Check if a list of identifiers includes a fulltext url pointing to an NLA resource.\n", " \"\"\"\n", " for link in links:\n", " if (\n", " \"linktype\" in link\n", " and link[\"linktype\"] == \"fulltext\"\n", " and \"nla.obj\" in link[\"value\"]\n", " ):\n", " return True\n", "\n", "\n", "def get_digitised_versions(work):\n", " \"\"\"\n", " Get the versions from the given work that 
have a fulltext url pointing to an NLA resource\n", " in the `identifier` field.\n", " \"\"\"\n", " versions = []\n", " for version in work[\"version\"]:\n", " if \"identifier\" in version and has_fulltext_link(version[\"identifier\"]):\n", " versions.append(version)\n", " return versions\n", "\n", "\n", "def harvest_works(params, output=\"pp-metadata.ndjson\", max=None):\n", " \"\"\"\n", " Harvest metadata relating to digitised works.\n", " \"\"\"\n", " harvested = 0\n", " default_params = {\n", " \"category\": \"all\",\n", " \"bulkHarvest\": \"true\",\n", " \"n\": 100,\n", " \"encoding\": \"json\",\n", " \"include\": [\"links\", \"workversions\"],\n", " }\n", " params.update(default_params)\n", " headers = {\"X-API-KEY\": API_KEY}\n", " total = max if max else get_total_results(params, headers)\n", " start = \"*\"\n", " with Path(output).open(\"w\") as ndjson_file:\n", " with tqdm(total=total) as pbar:\n", " while start:\n", " params[\"s\"] = start\n", " response = s.get(\n", " \"https://api.trove.nla.gov.au/v3/result\",\n", " params=params,\n", " headers=headers,\n", " )\n", " data = response.json()\n", " items = data[\"category\"][0][\"records\"][\"item\"]\n", " for item in items:\n", " for category, record in item.items():\n", " # See if there's a link to the full text version.\n", " if category == \"work\" and \"identifier\" in record:\n", " versions = get_digitised_versions(record)\n", " for version in versions:\n", " for sub_version in version[\"record\"]:\n", " metadata = sub_version[\"metadata\"][\"dc\"]\n", " # Sometimes fulltext identifiers are only available on the\n", " # version rather than the sub version. So we'll look in the\n", " # sub version first, and if they're not there use the url from\n", " # the version.\n", " # Sometimes there are multiple fulltext urls associated with a version:\n", " # eg a collection page and a publication. If so add records for both urls.\n", " # They could end up pointing to the same digitised publication, but\n", " # we can sort that out later. 
Aim here is to try and not miss any possible\n", " # routes to digitised publications!\n", " urls = get_fulltext_url(metadata[\"identifier\"])\n", " if len(urls) == 0:\n", " urls = get_fulltext_url(version[\"identifier\"])\n", " for url in urls:\n", " work = {\n", " # This is not the full set of available fields,\n", " # adjust as necessary.\n", " \"title\": get_value(metadata, \"title\"),\n", " \"work_url\": record.get(\"troveUrl\"),\n", " \"work_type\": record.get(\"type\", []),\n", " \"contributor\": merge_values(\n", " metadata,\n", " [\"creator\", \"contributor\"],\n", " [\"value\", \"name\"],\n", " ),\n", " \"publisher\": get_value(\n", " metadata, \"publisher\"\n", " ),\n", " \"date\": merge_values(\n", " metadata, [\"date\", \"issued\"]\n", " ),\n", " # Using merge here because I've noticed some duplicate values\n", " \"type\": merge_values(metadata, [\"type\"]),\n", " \"format\": get_value(metadata, \"format\"),\n", " \"rights\": merge_values(\n", " metadata, [\"rights\", \"licenseRef\"]\n", " ),\n", " \"language\": get_value(metadata, \"language\"),\n", " \"extent\": get_value(metadata, \"extent\"),\n", " \"subject\": merge_values(\n", " metadata, [\"subject\"]\n", " ),\n", " # Flattened type/value\n", " \"is_part_of\": flatten_values(\n", " metadata, \"isPartOf\"\n", " ),\n", " # Only get control numbers and flatten\n", " \"identifier\": flatten_identifiers(metadata),\n", " \"fulltext_url\": url[\"url\"],\n", " \"fulltext_url_text\": url[\"link_text\"],\n", " \"catalogue_url\": get_catalogue_url(\n", " metadata[\"identifier\"]\n", " ),\n", " # Could also add in data from bibliographicCitation\n", " # Although the types used in citations seem to vary by work and format.\n", " }\n", " ndjson_file.write(f\"{json.dumps(work)}\\n\")\n", " # The nextStart parameter is used to get the next page of results.\n", " # If there's no nextStart then it means we're on the last page of results.\n", " harvested += len(items)\n", " if max and harvested >= max:\n", " start = None\n", " else:\n", " try:\n", " start = data[\"category\"][0][\"records\"][\"nextStart\"]\n", " except KeyError:\n", " start = None\n", " pbar.update(len(items))" ] }, { "cell_type": "code", "execution_count": null, "id": "de9c8605-d052-490e-8dcc-942a9dc62b05", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "params = {\n", " \"q\": '\"nla.obj\" series:\"Parliamentary paper (Australia. Parliament)\"',\n", " \"l-availability\": \"y\",\n", "}\n", "\n", "harvest_works(params)" ] }, { "cell_type": "markdown", "id": "0f89a0d6-beb5-4611-98b8-79065970e5e7", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "How many records have we harvested so far? Note that the number harvested is greater than the number of search results. This is because we've unpacked versions that had been grouped into works and saved them as separate records." 
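] }, { "cell_type": "markdown", "id": "b7f3d2a1-1111-4f00-9e2a-000000000005", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Before counting, it's worth inspecting a single harvested record to check the field structure. This assumes the harvest above has completed and written `pp-metadata.ndjson`." ] }, { "cell_type": "code", "execution_count": null, "id": "b7f3d2a1-1111-4f00-9e2a-000000000006", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Display the first harvested record to check the field structure\n", "with Path(\"pp-metadata.ndjson\").open() as ndjson:\n", "    first_record = json.loads(ndjson.readline())\n", "print(json.dumps(first_record, indent=2))"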
] }, { "cell_type": "code", "execution_count": null, "id": "a9c6abb1-b3be-41d5-884a-a807546b60b0", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "count = 0\n", "with Path(\"pp-metadata.ndjson\").open() as ndjson:\n", " for line in ndjson:\n", " count += 1\n", "count" ] }, { "cell_type": "markdown", "id": "33a50f59-4311-4ebc-a624-338e43368afb", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Enrich and expand records using metadata from the digitised file viewer\n", "\n", "The digitised file viewer usually embeds some additional metadata, including the publication's MARC record from the NLA catalogue! If the file viewer link points to a publication, rather than a collection, the metadata will include details of individual pages. This code saves the number of pages in a publication and adds some extra metadata. If the file viewer link points to a collection, this code will unpack the individual publications from the collection and add them to the dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "e9e24d17-f873-4332-91db-49d2e0caa540", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def get_work_data(url):\n", " \"\"\"\n", " Extract work data in a JSON string from the work's HTML page.\n", " \"\"\"\n", " try:\n", " response = s.get(url)\n", " except requests.exceptions.InvalidURL:\n", " response = s.get(url.replace(\"\\\\\\\\\", \"//\"))\n", " try:\n", " work_data = re.search(\n", " r\"var work = JSON\\.parse\\(JSON\\.stringify\\((\\{.*\\})\", response.text\n", " ).group(1)\n", " except AttributeError:\n", " work_data = \"{}\"\n", " if not response.from_cache:\n", " time.sleep(0.2)\n", " return json.loads(work_data)\n", "\n", "\n", "def get_pages(work):\n", " \"\"\"\n", " Get the number of pages from the work metadata.\n", " \"\"\"\n", " try:\n", " pages = len(work[\"children\"][\"page\"])\n", " except KeyError:\n", " pages = 0\n", " return pages\n", "\n", "\n", "def get_page_ids(work):\n", " \"\"\"\n", " Get a list of page identifiers from the work metadata.\n", " \"\"\"\n", " try:\n", " page_ids = [p[\"pid\"] for p in work[\"children\"][\"page\"]]\n", " except KeyError:\n", " page_ids = []\n", " return page_ids\n", "\n", "\n", "def get_volumes(parent_id):\n", " \"\"\"\n", " Get the ids of volumes that are children of the current record.\n", " \"\"\"\n", " start_url = \"https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c\"\n", " # The initial startIdx value\n", " start = 0\n", " # Number of results per page\n", " n = 20\n", " parts = []\n", " # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.\n", " while n == 20:\n", " # Get the browse page\n", " response = s.get(start_url.format(parent_id, start))\n", " # Beautifulsoup turns the HTML into an easily navigable structure\n", " soup = BeautifulSoup(response.text, \"lxml\")\n", " # Find all the divs containing issue details and loop through them\n", " details = soup.find_all(class_=\"l-item-info\")\n", " for detail in details:\n", " title = detail.find(\"h3\")\n", " if title:\n", " issue_id = title.parent[\"href\"].strip(\"/\")\n", " else:\n", " issue_id = detail.find(\"a\")[\"href\"].strip(\"/\")\n", " # Get the issue id\n", " parts.append(issue_id)\n", " if not response.from_cache:\n", " time.sleep(0.2)\n", " # Increment the startIdx\n", " start += n\n", " # Set n to the number of 
results on the current page\n", " n = len(details)\n", " return parts\n", "\n", "\n", "def add_metadata(work, metadata, pages, include_page_ids=False):\n", " \"\"\"\n", " Add embedded metadata to existing record.\n", " New values will be appended to existing list.\n", " \"\"\"\n", " fields = [\n", " {\"to\": \"title\", \"from\": \"title\"},\n", " {\"to\": \"contributor\", \"from\": \"creator\"},\n", " {\"to\": \"publisher\", \"from\": \"publisherName\"},\n", " {\"to\": \"format\", \"from\": \"form\"},\n", " {\"to\": \"rights\", \"from\": \"copyrightPolicy\"},\n", " {\"to\": \"extent\", \"from\": \"extent\"},\n", " {\"to\": \"identifier\", \"from\": \"holdingNumber\"},\n", " ]\n", " for field in fields:\n", " value_from = metadata.get(field[\"from\"])\n", " if value_from:\n", " try:\n", " if value_from not in work[field[\"to\"]]:\n", " work[field[\"to\"]].append(metadata.get(field[\"from\"]))\n", " except KeyError:\n", " work[field[\"to\"]] = [metadata.get(field[\"from\"])]\n", " except AttributeError:\n", " if value_from != work[field[\"to\"]]:\n", " work[field[\"to\"]] = [work[field[\"to\"]], metadata.get(field[\"from\"])]\n", " work[\"alternative_title\"] = \" \".join(\n", " [\n", " metadata.get(\"subUnitType\", \"\"),\n", " metadata.get(\"subUnitNo\", \"\"),\n", " ]\n", " ).strip()\n", " if date := re.search(r\"\\b(\\d{4})$\", metadata.get(\"issueDate\", \"\")):\n", " work[\"date\"] = date.group(1)\n", " work[\"pages\"] = pages\n", " if include_page_ids:\n", " work[\"page_ids\"] = get_page_ids(metadata)\n", " return work\n", "\n", "\n", "def enrich_records(\n", " input=\"pp-metadata.ndjson\",\n", " output=\"pp-metadata-pages.ndjson\",\n", " include_page_ids=False,\n", "):\n", " \"\"\"\n", " Add the number of pages to the metadata for each work.\n", " Add volumes from multi-volume publications.\n", " \"\"\"\n", " total = sum(1 for _ in open(input))\n", " with Path(input).open(\"r\") as ndjson_in:\n", " with Path(output).open(\"w\") as ndjson_out:\n", " for line in tqdm(ndjson_in, total=total):\n", " work = json.loads(line)\n", " metadata = get_work_data(work[\"fulltext_url\"])\n", " # Some ids are for sections (articles) rather than complete publications,\n", " # so ignore these as we should already have the complete publication.\n", " trove_id = re.search(r\"(nla\\.obj\\-\\d+)\", work[\"fulltext_url\"]).group(1)\n", " if trove_id == metadata.get(\"pid\"):\n", " form = metadata.get(\"form\")\n", " pages = get_pages(metadata)\n", " work = add_metadata(work.copy(), metadata, pages, include_page_ids)\n", " parent = metadata.get(\"parent\", {})\n", " if ppid := parent.get(\"pid\"):\n", " work[\"parent\"] = ppid\n", " work[\"parent_url\"] = \"https://nla.gov.au/\" + ppid\n", " work[\"children\"] = \"\"\n", " # If there are no pages it's probably a collection,\n", " # so we have to get the ids of each individual publication in the collection and process them\n", " if pages == 0 and form in [\"Multi Volume Book\", \"Journal\"]:\n", " # Get child volumes\n", " volumes = get_volumes(trove_id)\n", " # For each volume get details and add as a new entry\n", " for volume_id in volumes:\n", " volume = {\n", " # Use values from parent\n", " # If there are additional values in embedded metadata,\n", " # they'll be added by add_metadata() below\n", " \"format\": work[\"format\"],\n", " \"subject\": work[\"subject\"],\n", " \"language\": work[\"language\"],\n", " \"is_part_of\": work[\"is_part_of\"],\n", " \"identifier\": work[\"identifier\"],\n", " # Add link up to parent\n", " \"parent\": trove_id,\n", " \"parent_url\": work[\"work_url\"],\n", " # Because this is a collection child it has no work url.\n", " # If there's an individual record for this publication\n", " # it'll be separately harvested and merged later.\n", " \"work_url\": \"\",\n", " \"fulltext_url\": \"https://nla.gov.au/{}\".format(\n", " volume_id\n", " ),\n", " }\n", " metadata = get_work_data(volume[\"fulltext_url\"])\n", " pages = get_pages(metadata)\n", " volume = add_metadata(\n", " volume, metadata, pages, include_page_ids\n", " )\n", " ndjson_out.write(f\"{json.dumps(volume)}\\n\")\n", " # Add links from container to volumes\n", " work[\"children\"] = \"|\".join(volumes)\n", " else:\n", " work[\"parent\"] = metadata.get(\"pid\", \"\")\n", " work[\"parent_url\"] = \"https://nla.gov.au/\" + metadata.get(\"pid\", \"\")\n", " ndjson_out.write(f\"{json.dumps(work)}\\n\")" ] }, { "cell_type": "code", "execution_count": null, "id": "d63eba95-6a9d-40bf-9ed0-463d95608598", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "enrich_records()" ] }, { "cell_type": "code", "execution_count": null, "id": "294474c7-9ee1-40f0-8214-d7bbe7026e61", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "count = 0\n", "with Path(\"pp-metadata-pages.ndjson\").open() as ndjson:\n", " for line in ndjson:\n", " count += 1\n", "count" ] }, { "cell_type": "markdown", "id": "74bf637d-4f25-495f-89da-4d03d51c73c9", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Check for missing parent records\n", "\n", "As noted, this method harvests sections extracted from Parliamentary Papers as well as full publications. The parent publication of a section should have been identified in the previous processing step. Here we make sure that we have individual publication records for all of the parents (yes, sometimes they're missing)."
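] }, { "cell_type": "markdown", "id": "b7f3d2a1-1111-4f00-9e2a-000000000007", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "As a quick check, the sketch below counts how many parent identifiers are not yet present as records in their own right – the same set difference that the function in the next cell works through." ] }, { "cell_type": "code", "execution_count": null, "id": "b7f3d2a1-1111-4f00-9e2a-000000000008", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Compare parent ids against the ids of records we've already harvested\n", "df_check = pd.read_json(\"pp-metadata-pages.ndjson\", lines=True, convert_dates=False)\n", "parent_ids = set(df_check[\"parent\"].dropna().unique()) - {\"\"}\n", "harvested_ids = set(url.split(\"/\")[-1] for url in df_check[\"fulltext_url\"].unique())\n", "print(f\"Missing parent records: {len(parent_ids - harvested_ids)}\")"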
] }, { "cell_type": "code", "execution_count": null, "id": "c01ff3ce-2af9-445e-98a8-bc8f5b264c06", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def get_missing_metadata(input=\"pp-metadata-pages.ndjson\", include_page_ids=False):\n", " df = pd.read_json(input, lines=True, convert_dates=False)\n", " parent_ids = list(df[\"parent\"].unique())\n", " fulltext_urls = list(df[\"fulltext_url\"].unique())\n", " fulltext_ids = [f.split(\"/\")[-1] for f in fulltext_urls]\n", " # Find parent ids that we don't have as individual records\n", " missing_ids = [m for m in list(set(parent_ids) - set(fulltext_ids)) if m != \"\"]\n", " with Path(input).open(\"a\") as ndjson_out:\n", " for mid in tqdm(missing_ids):\n", " fulltext_url = f\"https://nla.gov.au/{mid}\"\n", " metadata = get_work_data(fulltext_url)\n", " work = {\n", " \"fulltext_url\": fulltext_url,\n", " }\n", " pages = get_pages(metadata)\n", " work = add_metadata(work, metadata, pages, include_page_ids)\n", " ndjson_out.write(f\"{json.dumps(work)}\\n\")" ] }, { "cell_type": "code", "execution_count": null, "id": "b55dabc1-cd2f-44f3-a36e-598851d346c0", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "get_missing_metadata()" ] }, { "cell_type": "code", "execution_count": null, "id": "e3e0e243-be0e-43d4-b91c-d28b35b70291", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "count = 0\n", "with Path(\"pp-metadata-pages.ndjson\").open() as ndjson:\n", " for line in ndjson:\n", " count += 1\n", "count" ] }, { "cell_type": "markdown", "id": "33785322-44f2-486e-881b-546aa015994f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Merge duplicate records\n", "\n", "Because of the way we've unpacked grouped versions, it's possible we might have created duplicate records. In any case, Trove itself includes near-duplicate records for many digitised resources ­– they often point to the same digitised resource, but with slightly different metadata. To make sure we get as many of the Parliamentary Papers as possible, we've left the duplicates in the dataset until now. In this step we exclude collections and 'articles' by leaving out records without pages, and then merge the rest. The merge process de-duplicates records based on fields that only have a single unique value: `fulltext_url`, `pages`, `alternative_title`. Other fields can contain multiple values, so these are merged and separated by a `|` character." 
] }, { "cell_type": "code", "execution_count": null, "id": "f8bab27e-1f01-482b-a1d7-b01643d8d625", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def merge_column(columns):\n", " values = []\n", " for value in columns:\n", " if isinstance(value, list):\n", " values += [str(v) for v in value if v]\n", " elif value:\n", " values.append(str(value))\n", " return \"|\".join(sorted(set(values)))\n", "\n", "\n", "def merge_records(df):\n", " df[\"pages\"].fillna(0, inplace=True)\n", " df.fillna(\"\", inplace=True)\n", " df[\"pages\"] = df[\"pages\"].astype(\"Int64\")\n", "\n", " # Add base dataset with columns that will always have only one value\n", " # Only include records with pages (excludes sections of publications and collections)\n", " dfs = [\n", " df.loc[df[\"pages\"] > 0][\n", " [\"fulltext_url\", \"pages\", \"alternative_title\"]\n", " ].drop_duplicates()\n", " ]\n", "\n", " # Columns that potentially have multiple values which will be merged\n", " columns = [\n", " \"title\",\n", " \"work_url\",\n", " \"work_type\",\n", " \"contributor\",\n", " \"publisher\",\n", " \"date\",\n", " \"type\",\n", " \"format\",\n", " \"extent\",\n", " \"language\",\n", " \"subject\",\n", " \"is_part_of\",\n", " \"identifier\",\n", " \"rights\",\n", " \"fulltext_url_text\",\n", " \"catalogue_url\",\n", " \"parent\",\n", " \"parent_url\",\n", " \"children\",\n", " ]\n", "\n", " # Merge values from each column in turn, creating a new dataframe from each\n", " for column in columns:\n", " dfs.append(\n", " df.groupby([\"fulltext_url\"])[column].apply(merge_column).reset_index()\n", " )\n", "\n", " # Merge all the individual dataframes into one, linking on `fulltext_url` value\n", " df_merged = reduce(\n", " lambda left, right: pd.merge(left, right, on=[\"fulltext_url\"], how=\"left\"), dfs\n", " )\n", " return df_merged" ] }, { "cell_type": "code", "execution_count": null, "id": "6741849a-89c1-4f15-af19-7181695bae91", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df = pd.read_json(\"pp-metadata-pages.ndjson\", lines=True, convert_dates=False)\n", "\n", "df_merged = merge_records(df)" ] }, { "cell_type": "markdown", "id": "1f15b9b4-c766-43fb-88b3-488abace61eb", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "How many records are there now?" ] }, { "cell_type": "code", "execution_count": null, "id": "77d45d5a-2a2d-4c71-bb8b-24372ea07601", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_merged.shape[0]" ] }, { "cell_type": "markdown", "id": "4847df77-bec8-48eb-a1cf-db7db44c628a", "metadata": {}, "source": [ "Add a column that provides a link to download the OCRd text of the book." 
] }, { "cell_type": "code", "execution_count": null, "id": "32ea10b3-7c64-4860-8e19-13540c45666f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "def add_download_link(row):\n", " trove_id = re.search(r\"(nla\\.obj\\-\\d+)\", row[\"fulltext_url\"]).group(1)\n", " last_page = row[\"pages\"] - 1\n", " return f\"https://trove.nla.gov.au/{trove_id}/download?downloadOption=ocr&firstPage=0&lastPage={last_page}\"\n", "\n", "\n", "df_merged[\"text_download_url\"] = df_merged.apply(add_download_link, axis=1)" ] }, { "cell_type": "markdown", "id": "b8f37ad2-921f-43d2-ba8d-2951e337a38c", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Save the final dataset as CSV and Parquet files." ] }, { "cell_type": "code", "execution_count": null, "id": "b26c72d5-8dbd-4028-a175-f7120b7e5499", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "dataset_columns = [\n", " \"title\",\n", " \"alternative_title\",\n", " \"contributor\",\n", " \"publisher\",\n", " \"date\",\n", " \"type\",\n", " \"format\",\n", " \"extent\",\n", " \"language\",\n", " \"subject\",\n", " \"is_part_of\",\n", " \"identifier\",\n", " \"rights\",\n", " \"pages\",\n", " \"fulltext_url\",\n", " \"fulltext_url_text\",\n", " \"text_download_url\",\n", " \"catalogue_url\",\n", " \"work_url\",\n", " \"work_type\",\n", " \"parent\",\n", " \"parent_url\",\n", " \"children\",\n", "]\n", "\n", "df_merged[dataset_columns].to_csv(\"trove-parliamentary-papers.csv\", index=False)\n", "df_merged[dataset_columns].to_parquet(\"trove-parliamentary-papers.parquet\", index=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "71e181fc-ce15-4965-9a09-49bee5c71cac", "metadata": { "editable": true, "jupyter": { "source_hidden": true }, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# TESTING ONLY -- PLEASE IGNORE THIS CELL\n", "if os.getenv(\"GW_STATUS\") == \"dev\":\n", " params = {\n", " \"q\": '\"nla.obj\" series:\"Parliamentary paper (Australia. Parliament)\"',\n", " \"l-availability\": \"y\",\n", " }\n", "\n", " harvest_works(params, output=\"test.ndjson\", max=100)\n", "\n", " enrich_records(input=\"test.ndjson\", output=\"test-pages.ndjson\")\n", "\n", " get_missing_metadata(input=\"test-pages.ndjson\")\n", "\n", " df = pd.read_json(\"test-pages.ndjson\", lines=True, convert_dates=False)\n", "\n", " df_merged = merge_records(df)\n", "\n", " assert not df_merged.empty\n", "\n", " Path(\"test.ndjson\").unlink()\n", " Path(\"test-pages.ndjson\").unlink()" ] }, { "cell_type": "markdown", "id": "728c9e18-c237-49e3-b63a-5dc9e83e781a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/)." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "rocrate": { "action": [ { "description": "This dataset contains metadata describing Commonwealth Parliamentary Papers that have been digitised and are made available through Trove.", "isPartOf": "https://github.com/GLAM-Workbench/trove-parliamentary-papers-data/", "mainEntityOfPage": "https://glam-workbench.net/trove-government/trove-parliamentary-papers-data/", "name": "trove-parliamentary-papers-data", "result": [ { "license": "https://creativecommons.org/publicdomain/zero/1.0/", "url": "https://github.com/GLAM-Workbench/trove-parliamentary-papers-data/raw/main/trove-parliamentary-papers.csv" } ], "workExample": [ { "name": "Explore using Datasette", "url": "https://glam-workbench.net/datasette-lite/?csv=https://raw.githubusercontent.com/GLAM-Workbench/trove-parliamentary-papers-data/main/trove-parliamentary-papers.csv&fts=title,alternative_title,contributor&drop=work_type,fulltext_url_text,parent,parent_url,children" }, { "name": "Visualised in the Trove Data Guide", "url": "https://tdg.glam-workbench.net/other-digitised-resources/parliamentary-papers/overview.html" } ] } ], "author": [ { "mainEntityOfPage": "https://timsherratt.au", "name": "Sherratt, Tim", "orcid": "https://orcid.org/0000-0001-7956-4498" } ], "description": "Trove includes thousands of digitised papers and reports presented to the Commonwealth Parliament. However, finding all the Parliamentary Papers is not straightforward because of inconsistencies in the way they've been arranged and described. This notebook attempts to work around these problems and harvest as complete as possible data about Parliamentary Papers in Trove.", "mainEntityOfPage": "https://glam-workbench.net/trove-government/harvest-parliamentary-papers/", "name": "Harvest details of Commonwealth Parliamentary Papers digitised in Trove" } }, "nbformat": 4, "nbformat_minor": 5 }