{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "source": [
    "# Create a list of Trove's digital periodicals\n",
    "\n",
    "This notebook creates a list of digitised periodicals in Trove by searching for the digital identifier string `nla.obj` and limiting the results to periodicals. Before the Trove API introduced the `/magazine/titles` endpoint, this was the only way to generate such a list. This method produces slightly different results to the new API endpoint, and it might be useful to compare the two to see what each method misses. [Get details of periodicals from the /magazine/titles API endpoint](periodicals-from-api.ipynb) and [Enrich the list of periodicals from the Trove API](periodicals-enrich-for-datasette.ipynb) demonstrate how to compile a list of periodicals from the `/magazine/titles` endpoint.\n",
    "\n",
    "The harvesting strategy used in this notebook is similar to that described in the Trove Data Guides' [HOW TO: Harvest data relating to digitised resources](https://tdg.glam-workbench.net/other-digitised-resources/how-to/harvest-digitised-resources.html). Because of variations in the way digitised resources are described and organised, it seems best to harvest all available version records individually, and then merge duplicates at a later step.\n",
    "\n",
    "The full search query used is `\"nla.obj\" NOT series:\"Parliamentary paper (Australia. Parliament)\" NOT nuc:\"ANL:NED\"`. This attempts to exclude Parliamentary Papers and periodicals submitted through the National edeposit (NED) scheme."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "outputs": [
    {
     "data": { "text/plain": [ "True" ] },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Let's import the libraries we need.\n",
    "import json\n",
    "import os\n",
    "import re\n",
    "from datetime import timedelta\n",
    "from functools import reduce\n",
    "from pathlib import Path\n",
    "\n",
    "import pandas as pd\n",
    "import requests_cache\n",
    "from dotenv import load_dotenv\n",
    "from requests.adapters import HTTPAdapter\n",
    "from tqdm.auto import tqdm\n",
    "from urllib3.util.retry import Retry\n",
    "\n",
    "s = requests_cache.CachedSession(expire_after=timedelta(days=30))\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
    "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n",
    "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n",
    "\n",
    "load_dotenv()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "source": [
    "## Add your Trove API key\n",
    "\n",
    "You can get a Trove API key by [following these instructions](https://trove.nla.gov.au/about/create-something/using-api)."
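,
    "\n",
    "If you're running this notebook yourself, you can also store your key in a `.env` file in the same directory as the notebook -- the `load_dotenv()` call above reads that file, and the next cell picks the key up from the environment. A minimal sketch of such a file (`your-api-key-here` is a placeholder; `TROVE_API_KEY` is the variable name the next cell checks for):\n",
    "\n",
    "```\n",
    "TROVE_API_KEY=your-api-key-here\n",
    "```"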
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "outputs": [],
   "source": [
    "# Insert your Trove API key\n",
    "API_KEY = \"YOUR API KEY\"\n",
    "\n",
    "# Use the api key value from environment variables if it is available\n",
    "if os.getenv(\"TROVE_API_KEY\"):\n",
    "    API_KEY = os.getenv(\"TROVE_API_KEY\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "source": [
    "## Define some functions to do the work"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "outputs": [],
   "source": [
    "def get_total_results(params, headers):\n",
    "    \"\"\"\n",
    "    Get the total number of results for a search.\n",
    "    \"\"\"\n",
    "    these_params = params.copy()\n",
    "    these_params[\"n\"] = 0\n",
    "    response = s.get(\n",
    "        \"https://api.trove.nla.gov.au/v3/result\", params=these_params, headers=headers\n",
    "    )\n",
    "    data = response.json()\n",
    "    return int(data[\"category\"][0][\"records\"][\"total\"])\n",
    "\n",
    "\n",
    "def get_value(record, field, keys=[\"value\"]):\n",
    "    \"\"\"\n",
    "    Get the values of a field.\n",
    "    Some fields are lists of dicts, if so use the `keys` to get the values.\n",
    "    \"\"\"\n",
    "    value = record.get(field, [])\n",
    "    if value and isinstance(value[0], dict):\n",
    "        for key in keys:\n",
    "            try:\n",
    "                return [re.sub(r\"\\s+\", \" \", v[key]) for v in value]\n",
    "            except KeyError:\n",
    "                pass\n",
    "        return []\n",
    "    return value\n",
    "\n",
    "\n",
    "def merge_values(record, fields, keys=[\"value\"]):\n",
    "    \"\"\"\n",
    "    Merges values from multiple fields, removing any duplicates.\n",
    "    \"\"\"\n",
    "    values = []\n",
    "    for field in fields:\n",
    "        values += get_value(record, field, keys)\n",
    "    # Remove duplicates and None values\n",
    "    return list(set([v for v in values if v is not None]))\n",
    "\n",
    "\n",
    "def flatten_values(record, field, key=\"type\"):\n",
    "    \"\"\"\n",
    "    If a field has a value and type, return the values as strings with this format: 'type: value'\n",
    "    \"\"\"\n",
    "    flattened = []\n",
    "    values = record.get(field, [])\n",
    "    for value in values:\n",
    "        if key in value:\n",
    "            flattened.append(f\"{value[key]}: {value['value']}\")\n",
    "        else:\n",
    "            flattened.append(value[\"value\"])\n",
    "    return flattened\n",
    "\n",
    "\n",
    "def flatten_identifiers(record):\n",
    "    \"\"\"\n",
    "    Get a list of control numbers from the identifier field and flatten the values.\n",
    "    \"\"\"\n",
    "    ids = {\n",
    "        \"identifier\": [\n",
    "            v\n",
    "            for v in record.get(\"identifier\", [])\n",
    "            if \"type\" in v and v[\"type\"] == \"control number\"\n",
    "        ]\n",
    "    }\n",
    "    return flatten_values(ids, \"identifier\", \"source\")\n",
    "\n",
    "\n",
    "def get_fulltext_url(links):\n",
    "    \"\"\"\n",
    "    Loop through the identifiers to find a link to the full text version of the book.\n",
    "    \"\"\"\n",
    "    urls = []\n",
    "    for link in links:\n",
    "        if (\n",
    "            \"linktype\" in link\n",
    "            and link[\"linktype\"] == \"fulltext\"\n",
    "            and \"nla.obj\" in link[\"value\"]\n",
    "        ):\n",
    "            url = re.sub(r\"^http\\b\", \"https\", link[\"value\"])\n",
    "            link_text = link.get(\"linktext\", \"\")\n",
    "            urls.append({\"url\": url, \"link_text\": link_text})\n",
    "    return urls\n",
    "\n",
    "\n",
    "def get_catalogue_url(links):\n",
    "    \"\"\"\n",
    "    Loop through the identifiers to find a link to the NLA catalogue.\n",
    "    \"\"\"\n",
    "    for link in links:\n",
    "        if (\n",
    "            \"linktype\" in link\n",
    "            and link[\"linktype\"] == \"notonline\"\n",
    "            and \"nla.cat\" in link[\"value\"]\n",
    "        ):\n",
    "            return link[\"value\"]\n",
    "    return \"\"\n",
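    "\n",
    "\n",
    "# For example, given a hypothetical links list (for illustration only):\n",
    "#\n",
    "#   links = [{\"linktype\": \"fulltext\", \"value\": \"http://nla.obj-123\", \"linktext\": \"View\"}]\n",
    "#   get_fulltext_url(links) -> [{\"url\": \"https://nla.obj-123\", \"link_text\": \"View\"}]\n",
    "#\n",
    "# Note that get_fulltext_url() only keeps 'fulltext' links containing 'nla.obj',\n",
    "# and rewrites http urls to https.\n",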
    "\n",
    "\n",
    "def has_fulltext_link(links):\n",
    "    \"\"\"\n",
    "    Check if a list of identifiers includes a fulltext url pointing to an NLA resource.\n",
    "    \"\"\"\n",
    "    for link in links:\n",
    "        if (\n",
    "            \"linktype\" in link\n",
    "            and link[\"linktype\"] == \"fulltext\"\n",
    "            and \"nla.obj\" in link[\"value\"]\n",
    "        ):\n",
    "            return True\n",
    "    return False\n",
    "\n",
    "\n",
    "def has_holding(holdings, nucs):\n",
    "    \"\"\"\n",
    "    Check if a list of holdings includes one of the supplied nucs.\n",
    "    \"\"\"\n",
    "    for holding in holdings:\n",
    "        if holding.get(\"nuc\") in nucs:\n",
    "            return True\n",
    "    return False\n",
    "\n",
    "\n",
    "def get_digitised_versions(work):\n",
    "    \"\"\"\n",
    "    Get the versions from the given work that have a fulltext url pointing to an NLA resource\n",
    "    in the `identifier` field.\n",
    "    \"\"\"\n",
    "    versions = []\n",
    "    for version in work[\"version\"]:\n",
    "        if \"identifier\" in version and has_fulltext_link(version[\"identifier\"]):\n",
    "            versions.append(version)\n",
    "    return versions\n",
    "\n",
    "\n",
    "def get_nuc_versions(work, nucs=[\"ANL\", \"ANL:DL\"]):\n",
    "    \"\"\"\n",
    "    Get the versions from the given work that are held by the NLA.\n",
    "    \"\"\"\n",
    "    versions = []\n",
    "    for version in work[\"version\"]:\n",
    "        if \"holding\" in version and has_holding(version[\"holding\"], nucs):\n",
    "            versions.append(version)\n",
    "    return versions\n",
    "\n",
    "\n",
    "def harvest_works(\n",
    "    params,\n",
    "    filter_by=\"url\",\n",
    "    nucs=[\"ANL\", \"ANL:DL\"],\n",
    "    output_file=\"harvested-metadata.ndjson\",\n",
    "):\n",
    "    \"\"\"\n",
    "    Harvest metadata relating to digitised works.\n",
    "    The filter_by parameter selects records for inclusion in the dataset, options:\n",
    "    * url -- only include versions that have an NLA fulltext url\n",
    "    * nuc -- only include versions that have an NLA nuc (ANL or ANL:DL)\n",
    "    \"\"\"\n",
    "    default_params = {\n",
    "        \"category\": \"all\",\n",
    "        \"bulkHarvest\": \"true\",\n",
    "        \"n\": 100,\n",
    "        \"encoding\": \"json\",\n",
    "        \"include\": [\"links\", \"workversions\", \"holdings\"],\n",
    "    }\n",
    "    params.update(default_params)\n",
    "    headers = {\"X-API-KEY\": API_KEY}\n",
    "    total = get_total_results(params, headers)\n",
    "    start = \"*\"\n",
    "    with Path(output_file).open(\"w\") as ndjson_file:\n",
    "        with tqdm(total=total) as pbar:\n",
    "            while start:\n",
    "                params[\"s\"] = start\n",
    "                response = s.get(\n",
    "                    \"https://api.trove.nla.gov.au/v3/result\",\n",
    "                    params=params,\n",
    "                    headers=headers,\n",
    "                )\n",
    "                data = response.json()\n",
    "                items = data[\"category\"][0][\"records\"][\"item\"]\n",
    "                for item in items:\n",
    "                    for category, record in item.items():\n",
    "                        if category == \"work\":\n",
    "                            if filter_by == \"nuc\":\n",
    "                                versions = get_nuc_versions(record, nucs)\n",
    "                            else:\n",
    "                                versions = get_digitised_versions(record)\n",
    "                            # Sometimes there are fulltext links on the work but not the versions\n",
    "                            if len(versions) == 0 and has_fulltext_link(\n",
    "                                record.get(\"identifier\", [])\n",
    "                            ):\n",
    "                                versions = record[\"version\"]\n",
    "                            for version in versions:\n",
    "                                for sub_version in version[\"record\"]:\n",
    "                                    metadata = sub_version[\"metadata\"][\"dc\"]\n",
    "                                    # Sometimes fulltext identifiers are only available on the\n",
    "                                    # version rather than the sub version. So we'll look in the\n",
    "                                    # sub version first, and if they're not there use the url from\n",
    "                                    # the version.\n",
    "                                    # Sometimes there are multiple fulltext urls associated with a version:\n",
    "                                    # eg a collection page and a publication. If so, add records for both urls.\n",
    "                                    # They could end up pointing to the same digitised publication, but\n",
    "                                    # we can sort that out later. The aim here is to try not to miss any\n",
    "                                    # possible routes to digitised publications!\n",
    "                                    urls = get_fulltext_url(\n",
    "                                        metadata.get(\"identifier\", [])\n",
    "                                    )\n",
    "                                    if len(urls) == 0:\n",
    "                                        urls = get_fulltext_url(\n",
    "                                            version.get(\"identifier\", [])\n",
    "                                        )\n",
    "                                    # Sometimes there are fulltext links on the work but not the versions\n",
    "                                    if len(urls) == 0:\n",
    "                                        urls = get_fulltext_url(\n",
    "                                            record.get(\"identifier\", [])\n",
    "                                        )\n",
    "                                    if len(urls) == 0 and filter_by == \"nuc\":\n",
    "                                        urls = [{\"url\": \"\", \"link_text\": \"\"}]\n",
    "                                    for url in urls:\n",
    "                                        work = {\n",
    "                                            # This is not the full set of available fields,\n",
    "                                            # adjust as necessary.\n",
    "                                            \"title\": get_value(metadata, \"title\"),\n",
    "                                            \"work_url\": record.get(\"troveUrl\"),\n",
    "                                            \"work_type\": record.get(\"type\", []),\n",
    "                                            \"contributor\": merge_values(\n",
    "                                                metadata,\n",
    "                                                [\"creator\", \"contributor\"],\n",
    "                                                [\"value\", \"name\"],\n",
    "                                            ),\n",
    "                                            \"publisher\": get_value(\n",
    "                                                metadata, \"publisher\"\n",
    "                                            ),\n",
    "                                            \"date\": merge_values(\n",
    "                                                metadata, [\"date\", \"issued\"]\n",
    "                                            ),\n",
    "                                            # Using merge here because I've noticed some duplicate values\n",
    "                                            \"type\": merge_values(metadata, [\"type\"]),\n",
    "                                            \"format\": get_value(metadata, \"format\"),\n",
    "                                            \"rights\": merge_values(\n",
    "                                                metadata, [\"rights\", \"licenseRef\"]\n",
    "                                            ),\n",
    "                                            \"language\": get_value(metadata, \"language\"),\n",
    "                                            \"extent\": get_value(metadata, \"extent\"),\n",
    "                                            \"subject\": merge_values(\n",
    "                                                metadata, [\"subject\"]\n",
    "                                            ),\n",
    "                                            \"spatial\": get_value(metadata, \"spatial\"),\n",
    "                                            # Flattened type/value\n",
    "                                            \"is_part_of\": flatten_values(\n",
    "                                                metadata, \"isPartOf\"\n",
    "                                            ),\n",
    "                                            # Only get control numbers and flatten\n",
    "                                            \"identifier\": flatten_identifiers(metadata),\n",
    "                                            \"fulltext_url\": url[\"url\"],\n",
    "                                            \"fulltext_url_text\": url[\"link_text\"],\n",
    "                                            \"catalogue_url\": get_catalogue_url(\n",
    "                                                metadata.get(\"identifier\", [])\n",
    "                                            ),\n",
    "                                            # Could also add in data from bibliographicCitation,\n",
    "                                            # although the types used in citations seem to vary by work and format.\n",
    "                                        }\n",
    "                                        ndjson_file.write(f\"{json.dumps(work)}\\n\")\n",
    "                # The nextStart parameter is used to get the next page of results.\n",
    "                # If there's no nextStart then it means we're on the last page of results.\n",
    "                try:\n",
    "                    start = data[\"category\"][0][\"records\"][\"nextStart\"]\n",
    "                except KeyError:\n",
    "                    start = None\n",
    "                pbar.update(len(items))"
   ]
  },
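  {
   "cell_type": "markdown",
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "source": [
    "Before you start a long harvest, it can be useful to check how many version records the search will return. The cell below is a minimal sketch that feeds the same query and facets as the harvest into `get_total_results()`, using the `API_KEY` you set above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "outputs": [],
   "source": [
    "# Preview the number of matching records before harvesting.\n",
    "# This uses the same query and facets as the harvest below.\n",
    "preview_params = {\n",
    "    \"q\": '\"nla.obj\" NOT series:\"Parliamentary paper (Australia. Parliament)\" NOT nuc:\"ANL:NED\"',\n",
    "    \"l-format\": \"Periodical\",\n",
    "    \"l-availability\": \"y\",\n",
    "    \"category\": \"all\",\n",
    "    \"encoding\": \"json\",\n",
    "}\n",
    "print(get_total_results(preview_params, {\"X-API-KEY\": API_KEY}))"
   ]
  },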
  {
   "cell_type": "markdown",
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "source": [
    "## Run the harvest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "outputs": [],
   "source": [
    "params = {\n",
    "    \"q\": '\"nla.obj\" NOT series:\"Parliamentary paper (Australia. Parliament)\" NOT nuc:\"ANL:NED\"',\n",
    "    \"l-format\": \"Periodical\",  # Journals only\n",
    "    \"l-availability\": \"y\",\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "afedf879ef5e454d8dcfb54d21f3855a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       " 0%| | 0/1078 [00:00