{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Harvest details of periodicals submitted to Trove through the National edeposit scheme (NED)\n", "\n", "This notebook harvests details of periodicals submitted to Trove through the National edeposit scheme (NED). It creates two datasets, one containing details of the periodical titles, and the other listing all the available issues. \n", "\n", "There are two main harvesting steps. The first is to search for periodicals using the API's `/result` endpoint with the following parameters:\n", "\n", "- `q` set to `\"nla.obj\" nuc:\"ANL:NED\"`\n", "- the `format` facet set to `Periodical`\n", "- and, optionally, `l-availability` set to `y` to limit results to titles that are available online (the search parameters below leave this facet commented out, so restricted titles are captured as well)\n", "\n", "The work records returned by this search are unpacked, and the individual versions saved, to make sure we get everything. Once this is complete, any duplicate records are merged.\n", "\n", "The second step harvests details of issues by extracting a list of issues for each title from the collection viewer. It then supplements the issue metadata by extracting information for each issue from the journal viewer." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Let's import the libraries we need.\n", "import json\n", "import os\n", "import re\n", "import time\n", "from datetime import timedelta\n", "from functools import reduce\n", "from pathlib import Path\n", "\n", "import arrow\n", "import pandas as pd\n", "import requests\n", "import requests_cache\n", "from bs4 import BeautifulSoup\n", "from dotenv import load_dotenv\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from sqlite_utils import Database\n", "from tqdm.auto import tqdm\n", "\n", "r = requests.Session()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "r.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "r.mount(\"http://\", HTTPAdapter(max_retries=retries))\n", "\n", "s = requests_cache.CachedSession(expire_after=timedelta(days=30))\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n", "\n", "load_dotenv()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Add your Trove API key\n", "\n", "You can get a Trove API key by [following these instructions](https://trove.nla.gov.au/about/create-something/using-api)."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Insert your Trove API key\n", "API_KEY = \"YOUR API KEY\"\n", "\n", "# Use api key value from environment variables if it is available\n", "if os.getenv(\"TROVE_API_KEY\"):\n", " API_KEY = os.getenv(\"TROVE_API_KEY\")" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Define some functions to do the work" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def get_total_results(params, headers):\n", " \"\"\"\n", " Get the total number of results for a search.\n", " \"\"\"\n", " these_params = params.copy()\n", " these_params[\"n\"] = 0\n", " response = s.get(\n", " \"https://api.trove.nla.gov.au/v3/result\", params=these_params, headers=headers\n", " )\n", " data = response.json()\n", " return int(data[\"category\"][0][\"records\"][\"total\"])\n", "\n", "\n", "def get_value(record, field, keys=[\"value\"]):\n", " \"\"\"\n", " Get the values of a field.\n", " Some fields are lists of dicts, if so use the `key` to get the value.\n", " \"\"\"\n", " value = record.get(field, [])\n", " if value and isinstance(value[0], dict):\n", " for key in keys:\n", " try:\n", " return [re.sub(r\"\\s+\", \" \", v[key]) for v in value]\n", " except KeyError:\n", " pass\n", " else:\n", " return value\n", "\n", "\n", "def merge_values(record, fields, keys=[\"value\"]):\n", " \"\"\"\n", " Merges values from multiple fields, removing any duplicates.\n", " \"\"\"\n", " values = []\n", " for field in fields:\n", " values += get_value(record, field, keys)\n", " # Remove duplicates and None value\n", " return list(set([v for v in values if v is not None]))\n", "\n", "\n", "def flatten_values(record, field, key=\"type\"):\n", " \"\"\"\n", " If a field has a value and type, return the values as strings with this format: 'type: value'\n", " \"\"\"\n", " flattened = []\n", " values = record.get(field, [])\n", " for value in values:\n", " if key in value:\n", " flattened.append(f\"{value[key]}: {value['value']}\")\n", " else:\n", " flattened.append(value[\"value\"])\n", " return flattened\n", "\n", "\n", "def flatten_identifiers(record):\n", " \"\"\"\n", " Get a list of control numbers from the identifier field and flatten the values.\n", " \"\"\"\n", " ids = {\n", " \"identifier\": [\n", " v\n", " for v in record.get(\"identifier\", [])\n", " if \"type\" in v and v[\"type\"] == \"control number\"\n", " ]\n", " }\n", " return flatten_values(ids, \"identifier\", \"source\")\n", "\n", "\n", "def get_fulltext_url(links):\n", " \"\"\"\n", " Loop through the identifiers to find a link to the full text version of the book.\n", " \"\"\"\n", " urls = []\n", " for link in links:\n", " if (\n", " \"linktype\" in link\n", " and link[\"linktype\"] == \"fulltext\"\n", " and \"nla.obj\" in link[\"value\"]\n", " and \"edeposit\" in link.get(\"linktext\", \"\")\n", " ):\n", " url = re.sub(r\"^http\\b\", \"https\", link[\"value\"])\n", " url = re.sub(r\"^https://www\\.\", \"https://\", url)\n", " link_text = link.get(\"linktext\", \"\")\n", " urls.append({\"url\": url, \"link_text\": link_text})\n", " return urls\n", "\n", "\n", "def get_catalogue_url(links):\n", " \"\"\"\n", " Loop through the identifiers to find a link to the NLA catalogue.\n", " \"\"\"\n", " for link in links:\n", " if 
(\n", "            \"linktype\" in link\n", "            and link[\"linktype\"] == \"notonline\"\n", "            and \"nla.cat\" in link[\"value\"]\n", "        ):\n", "            return link[\"value\"]\n", "    return \"\"\n", "\n", "\n", "def has_fulltext_link(links):\n", "    \"\"\"\n", "    Check if a list of identifiers includes a fulltext url pointing to an NLA resource.\n", "    \"\"\"\n", "    for link in links:\n", "        if (\n", "            \"linktype\" in link\n", "            and link[\"linktype\"] == \"fulltext\"\n", "            and \"nla.obj\" in link[\"value\"]\n", "            and \"edeposit\" in link.get(\"linktext\", \"\")\n", "        ):\n", "            return True\n", "\n", "\n", "def has_holding(holdings, nucs):\n", "    \"\"\"\n", "    Check if a list of holdings includes one of the supplied nucs.\n", "    \"\"\"\n", "    for holding in holdings:\n", "        if holding.get(\"nuc\") in nucs:\n", "            return True\n", "\n", "\n", "def get_digitised_versions(work):\n", "    \"\"\"\n", "    Get the versions from the given work that have a fulltext url pointing to an NLA resource\n", "    in the `identifier` field.\n", "    \"\"\"\n", "    versions = []\n", "    for version in work[\"version\"]:\n", "        if \"identifier\" in version and has_fulltext_link(version[\"identifier\"]):\n", "            versions.append(version)\n", "    return versions\n", "\n", "\n", "def get_nuc_versions(work, nucs=[\"ANL\", \"ANL:DL\"]):\n", "    \"\"\"\n", "    Get the versions from the given work that are held by the NLA.\n", "    \"\"\"\n", "    versions = []\n", "    for version in work[\"version\"]:\n", "        if \"holding\" in version and has_holding(version[\"holding\"], nucs):\n", "            versions.append(version)\n", "    return versions\n", "\n", "\n", "def harvest_works(\n", "    params,\n", "    filter_by=\"url\",\n", "    nucs=[\"ANL\", \"ANL:DL\"],\n", "    output_file=\"harvested-metadata.ndjson\",\n", "):\n", "    \"\"\"\n", "    Harvest metadata relating to digitised works.\n", "    The filter_by parameter selects records for inclusion in the dataset, options:\n", "    * url -- only include versions that have an NLA fulltext url\n", "    * nuc -- only include versions that have an NLA nuc (ANL or ANL:DL)\n", "    \"\"\"\n", "    default_params = {\n", "        \"category\": \"all\",\n", "        \"bulkHarvest\": \"true\",\n", "        \"n\": 100,\n", "        \"encoding\": \"json\",\n", "        \"include\": [\"links\", \"workversions\", \"holdings\"],\n", "    }\n", "    params.update(default_params)\n", "    headers = {\"X-API-KEY\": API_KEY}\n", "    total = get_total_results(params, headers)\n", "    start = \"*\"\n", "    with Path(output_file).open(\"w\") as ndjson_file:\n", "        with tqdm(total=total) as pbar:\n", "            while start:\n", "                params[\"s\"] = start\n", "                response = r.get(\n", "                    \"https://api.trove.nla.gov.au/v3/result\",\n", "                    params=params,\n", "                    headers=headers,\n", "                )\n", "                data = response.json()\n", "                items = data[\"category\"][0][\"records\"][\"item\"]\n", "                for item in items:\n", "                    for category, record in item.items():\n", "                        if category == \"work\":\n", "                            if filter_by == \"nuc\":\n", "                                versions = get_nuc_versions(record, nucs)\n", "                            else:\n", "                                versions = get_digitised_versions(record)\n", "                            # Sometimes there are fulltext links on work but not versions\n", "                            if len(versions) == 0 and has_fulltext_link(\n", "                                record[\"identifier\"]\n", "                            ):\n", "                                versions = record[\"version\"]\n", "                            for version in versions:\n", "                                for sub_version in version[\"record\"]:\n", "                                    metadata = sub_version[\"metadata\"][\"dc\"]\n", "                                    # Sometimes fulltext identifiers are only available on the\n", "                                    # version rather than the sub version. 
So we'll look in the\n", " # sub version first, and if they're not there use the url from\n", " # the version.\n", " # Sometimes there are multiple fulltext urls associated with a version:\n", " # eg a collection page and a publication. If so add records for both urls.\n", " # They could end up pointing to the same digitised publication, but\n", " # we can sort that out later. Aim here is to try and not miss any possible\n", " # routes to digitised publications!\n", " urls = get_fulltext_url(\n", " metadata.get(\"identifier\", [])\n", " )\n", " if len(urls) == 0:\n", " urls = get_fulltext_url(\n", " version.get(\"identifier\", [])\n", " )\n", " # Sometimes there are fulltext links on work but not versions\n", " if len(urls) == 0:\n", " urls = get_fulltext_url(\n", " record.get(\"identifier\", [])\n", " )\n", " if len(urls) == 0 and filter_by == \"nuc\":\n", " urls = [{\"url\": \"\", \"link_text\": \"\"}]\n", " for url in urls:\n", " work = {\n", " # This is not the full set of available fields,\n", " # adjust as necessary.\n", " \"title\": get_value(metadata, \"title\"),\n", " \"work_url\": record.get(\"troveUrl\"),\n", " \"work_type\": record.get(\"type\", []),\n", " \"contributor\": merge_values(\n", " metadata,\n", " [\"creator\", \"contributor\"],\n", " [\"value\", \"name\"],\n", " ),\n", " \"publisher\": get_value(\n", " metadata, \"publisher\"\n", " ),\n", " \"date\": merge_values(\n", " metadata, [\"date\", \"issued\"]\n", " ),\n", " # Using merge here because I've noticed some duplicate values\n", " \"type\": merge_values(metadata, [\"type\"]),\n", " \"format\": get_value(metadata, \"format\"),\n", " \"rights\": merge_values(\n", " metadata, [\"rights\", \"licenseRef\"]\n", " ),\n", " \"language\": get_value(metadata, \"language\"),\n", " \"extent\": get_value(metadata, \"extent\"),\n", " \"subject\": merge_values(\n", " metadata, [\"subject\"]\n", " ),\n", " \"spatial\": get_value(metadata, \"spatial\"),\n", " # Flattened type/value\n", " \"is_part_of\": flatten_values(\n", " metadata, \"isPartOf\"\n", " ),\n", " # Only get control numbers and flatten\n", " \"identifier\": flatten_identifiers(metadata),\n", " \"fulltext_url\": url[\"url\"],\n", " \"fulltext_url_text\": url[\"link_text\"],\n", " \"catalogue_url\": get_catalogue_url(\n", " metadata[\"identifier\"]\n", " ),\n", " # Could also add in data from bibliographicCitation\n", " # Although the types used in citations seem to vary by work and format.\n", " }\n", " ndjson_file.write(f\"{json.dumps(work)}\\n\")\n", " # The nextStart parameter is used to get the next page of results.\n", " # If there's no nextStart then it means we're on the last page of results.\n", " try:\n", " start = data[\"category\"][0][\"records\"][\"nextStart\"]\n", " except KeyError:\n", " start = None\n", " pbar.update(len(items))" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Harvest periodical titles\n", "\n", "The first step is to search for NED periodical titles and harvest all the version records." 
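] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Before running the full harvest, it can be useful to check how many records the search will return. The cell below is just a sketch: it defines its own copy of the query parameters (mirroring the values set in the next cell) and passes them to the `get_total_results()` function defined above." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# An optional sanity check -- how many records will the search match?\n", "# These parameters mirror the search parameters set in the next cell.\n", "test_params = {\n", "    \"q\": '\"nla.obj\" nuc:\"ANL:NED\"',\n", "    \"l-format\": \"Periodical\",\n", "    \"category\": \"all\",\n", "    \"encoding\": \"json\",\n", "}\n", "get_total_results(test_params, {\"X-API-KEY\": API_KEY})"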
] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "params = {\n", " \"q\": '\"nla.obj\" nuc:\"ANL:NED\"',\n", " \"l-format\": \"Periodical\", # Journals only\n", " # \"l-availability\": \"y\",\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "harvest_works(params, output_file=\"ned-periodicals.ndjson\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check that they're not really issues\n", "\n", "I've found that there are some work records that point to an individual issue of a periodical, rather than to a collection of issues. These might be periodicals that only have a single issue, or they might be anomalies. In this step we'll do some checking to try and separate titles and issues." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# get the current list of ids for comparison\n", "# loop through titles\n", "# get page type\n", "# if page type is pdf, check if id == parent_id\n", "# if it's an issue with a parent, check that the parent is in the set of titles\n", "# if not try to get some details of the parent and add to title dataset\n", "\n", "\n", "def get_metadata(id):\n", " \"\"\"\n", " Extract work data in a JSON string from the work's HTML page.\n", " \"\"\"\n", " if not id.startswith(\"http\"):\n", " id = \"https://nla.gov.au/\" + id\n", " response = s.get(id)\n", " try:\n", " work_data = re.search(\n", " r\"var work = JSON\\.parse\\(JSON\\.stringify\\((\\{.*\\})\", response.text\n", " ).group(1)\n", " except AttributeError:\n", " work_data = \"{}\"\n", " if not response.from_cache:\n", " time.sleep(0.2)\n", " return json.loads(work_data)\n", "\n", "\n", "def get_iso_date(date):\n", " if date:\n", " iso_date = arrow.get(date, \"ddd, D MMM YYYY\").format(\"YYYY\")\n", " else:\n", " iso_date = None\n", " return iso_date\n", "\n", "\n", "def create_title_from_metadata(id):\n", " if not id.startswith(\"http\"):\n", " id = \"https://nla.gov.au/\" + id\n", " metadata = get_metadata(id)\n", " title = {\n", " \"title\": metadata.get(\"title\", \"\"),\n", " \"contributor\": [metadata.get(\"creator\", \"\")],\n", " \"publisher\": metadata.get(\"publisherName\", \"\"),\n", " \"date\": [get_iso_date(metadata.get(\"issueDate\", None))],\n", " \"extent\": metadata.get(\"extent\", \"\"),\n", " \"rights\": metadata.get(\"copyrightPolicy\", \"\"),\n", " \"identifier\": metadata.get(\"standardIds\", []),\n", " \"fulltext_url\": id,\n", " \"type\": [],\n", " \"format\": [],\n", " \"language\": [],\n", " \"subject\": [],\n", " \"spatial\": [],\n", " \"is_part_of\": [],\n", " \"work_url\": \"\",\n", " \"work_type\": \"\",\n", " \"fulltext_url_text\": \"\",\n", " \"catalogue_url\": \"\",\n", " }\n", " return title\n", "\n", "\n", "def get_page_type(url):\n", " response = s.get(url)\n", " soup = BeautifulSoup(response.text)\n", " page_type = soup.find(\"meta\", attrs={\"data-screen-id\": True})[\"data-screen-id\"]\n", " return page_type\n", "\n", "\n", "def check_titles(\n", " input=\"ned-periodicals.ndjson\", output=\"ned-periodicals-checked.ndjson\"\n", "):\n", " df = pd.read_json(input, lines=True)\n", " df[\"id\"] = df[\"fulltext_url\"].apply(lambda x: x.strip(\"/\").split(\"/\")[-1])\n", " # df.fillna(\"\", inplace=True)\n", " 
with Path(output).open(\"w\") as ndjson_file:\n", "        for title in tqdm(df.to_dict(orient=\"records\"), total=df.shape[0]):\n", "            url = title[\"fulltext_url\"]\n", "            page_type = get_page_type(url)\n", "            # Keep title landing pages\n", "            if page_type in [\"Preview Landing Page\", \"Onsite Landing Page\"]:\n", "                # keep this in titles\n", "                ndjson_file.write(f\"{json.dumps(title)}\\n\")\n", "            # Drop not found pages\n", "            elif page_type != \"Page Not Found\":\n", "                metadata = get_metadata(url)\n", "                parent_id = metadata[\"topLevelCollection\"]\n", "                pid = metadata[\"pid\"]\n", "                # This page has a parent, so it's not a title\n", "                if parent_id != pid:\n", "                    # Its parent isn't in the current dataset\n", "                    if df.loc[df[\"id\"] == parent_id].empty:\n", "                        # add a record for the parent\n", "                        new_title = create_title_from_metadata(parent_id)\n", "                        ndjson_file.write(f\"{json.dumps(new_title)}\\n\")\n", "\n", "                else:\n", "                    if page_type == \"Ebook Page\":\n", "                        # keep this in titles\n", "                        # need to do another check when getting issues\n", "                        ndjson_file.write(f\"{json.dumps(title)}\\n\")\n", "                    elif page_type == \"Picture Viewer Page\":\n", "                        # ignore pictures\n", "                        pass\n", "                    else:\n", "                        print(url, \"not found\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "check_titles()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Remove duplicates\n", "\n", "Because we've unpacked the work records and saved individual versions, there are likely to be some duplicates. Here we'll merge the duplicate records."
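] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The merging strategy is sketched below with made-up values: rows that share a `fulltext_url` have each of their other columns reduced to a single ` | `-separated string of unique values. The functions in the next cell apply this approach to the full dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# A minimal sketch of the merge step using hypothetical values --\n", "# duplicate rows for the same fulltext_url are collapsed into one.\n", "demo = pd.DataFrame(\n", "    [\n", "        {\"fulltext_url\": \"https://nla.gov.au/nla.obj-0000000000\", \"title\": \"Example news\"},\n", "        {\"fulltext_url\": \"https://nla.gov.au/nla.obj-0000000000\", \"title\": \"Example news [electronic resource]\"},\n", "    ]\n", ")\n", "demo.groupby(\"fulltext_url\")[\"title\"].apply(lambda v: \" | \".join(sorted(set(v)))).reset_index()"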
] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def merge_column(columns):\n", "    values = []\n", "    for value in columns:\n", "        if isinstance(value, list):\n", "            values += [str(v) for v in value if v]\n", "        elif value:\n", "            values.append(str(value))\n", "    return \" | \".join(sorted(set(values)))\n", "\n", "\n", "def merge_records(df):\n", "    # Add base dataset with columns that will always have only one value\n", "    dfs = [df[[\"fulltext_url\"]].drop_duplicates()]\n", "\n", "    # Columns that potentially have multiple values which will be merged\n", "    columns = [\n", "        \"title\",\n", "        \"work_url\",\n", "        \"work_type\",\n", "        \"contributor\",\n", "        \"publisher\",\n", "        \"date\",\n", "        \"type\",\n", "        \"format\",\n", "        \"extent\",\n", "        \"language\",\n", "        \"subject\",\n", "        \"spatial\",\n", "        \"is_part_of\",\n", "        \"identifier\",\n", "        \"rights\",\n", "        \"fulltext_url_text\",\n", "        \"catalogue_url\",\n", "    ]\n", "\n", "    # Merge values from each column in turn, creating a new dataframe from each\n", "    for column in columns:\n", "        dfs.append(\n", "            df.groupby([\"fulltext_url\"])[column].apply(merge_column).reset_index()\n", "        )\n", "\n", "    # Merge all the individual dataframes into one, linking on the `fulltext_url` value\n", "    df_merged = reduce(\n", "        lambda left, right: pd.merge(left, right, on=[\"fulltext_url\"], how=\"left\"), dfs\n", "    )\n", "    return df_merged" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Load the harvested data." ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "df = pd.read_json(\"ned-periodicals-checked.ndjson\", lines=True)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "How many records are there?" ] }, { "cell_type": "code", "execution_count": 147, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "(9474, 19)" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Now we'll merge the duplicates." ] }, { "cell_type": "code", "execution_count": 148, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "df_merged = merge_records(df)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "How many records are there now?" ] }, { "cell_type": "code", "execution_count": 149, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "8572" ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How many journals are there?\n", "df_merged.shape[0]" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Do some reorganisation of the dataset and save it as a CSV file."
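] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Before saving, it can be worth eyeballing a few rows of the merged dataset to check that the multi-value fields have been combined as expected." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Preview the first few merged records\n", "df_merged.head()"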
] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def save_ned_titles(df, output=\"ned-periodicals.csv\"):\n", " df[\"id\"] = df[\"fulltext_url\"].apply(lambda x: x.strip(\"/\").split(\"/\")[-1])\n", " df_titles = df[\n", " [\n", " \"id\",\n", " \"title\",\n", " \"contributor\",\n", " \"publisher\",\n", " \"date\",\n", " \"fulltext_url\",\n", " \"work_url\",\n", " \"work_type\",\n", " \"type\",\n", " \"format\",\n", " \"extent\",\n", " \"language\",\n", " \"subject\",\n", " \"spatial\",\n", " \"is_part_of\",\n", " \"identifier\",\n", " \"rights\",\n", " \"catalogue_url\",\n", " ]\n", " ]\n", "\n", " df_titles.to_csv(output, index=False)\n", " return df_titles" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_titles = save_ned_titles(df_merged)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Get details of issues" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "def get_issues(parent_id):\n", " \"\"\"\n", " Get the ids of issues that are children of the current record.\n", " \"\"\"\n", " start_url = \"https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c\"\n", " # The initial startIdx value\n", " start = 0\n", " # Number of results per page\n", " n = 20\n", " parts = []\n", " # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.\n", " while n == 20:\n", " # Get the browse page\n", " response = s.get(start_url.format(parent_id, start))\n", " # Beautifulsoup turns the HTML into an easily navigable structure\n", " soup = BeautifulSoup(response.text, \"lxml\")\n", " # Find all the divs containing issue details and loop through them\n", " details = soup.find_all(class_=\"l-item-info\")\n", " for detail in details:\n", " title = detail.find(\"h3\")\n", " if title:\n", " issue_id = title.parent[\"href\"].strip(\"/\")\n", " else:\n", " issue_id = detail.find(\"a\")[\"href\"].strip(\"/\")\n", " # Get the issue id\n", " parts.append(issue_id)\n", " if not response.from_cache:\n", " time.sleep(0.2)\n", " # Increment the startIdx\n", " start += n\n", " # Set n to the number of results on the current page\n", " n = len(details)\n", " return parts\n", "\n", "\n", "def harvest_all_issues(input=\"ned-periodicals.csv\", output=\"ned-issues.ndjson\"):\n", " df = pd.read_csv(input)\n", " with Path(output).open(\"w\") as ndjson_file:\n", " for title in tqdm(df.itertuples(), total=df.shape[0]):\n", " # title_id = title.fulltext_url.strip(\"/\").split(\"/\")[-1]\n", " title_id = title.id\n", " page_type = get_page_type(title.fulltext_url)\n", " if page_type == \"Ebook Page\":\n", " issues = [title.fulltext_url]\n", " else:\n", " issues = get_issues(title_id)\n", " for issue_id in issues:\n", " metadata = get_metadata(issue_id)\n", " try:\n", " issue = {\n", " \"id\": metadata[\"pid\"],\n", " \"title_id\": title_id,\n", " \"title\": metadata[\"title\"],\n", " \"description\": metadata.get(\"subUnitNo\", \"\"),\n", " \"date\": get_iso_date(metadata.get(\"issueDate\", None)),\n", " \"url\": f\"https://nla.gov.au/{metadata['pid']}\",\n", " \"ebook_type\": metadata.get(\"ebookType\", \"\"),\n", " \"access_conditions\": 
metadata.get(\"accessConditions\", \"\"),\n", " \"copyright_policy\": metadata.get(\"copyrightPolicy\", \"\"),\n", " }\n", " except KeyError:\n", " print(title_id)\n", " else:\n", " ndjson_file.write(f\"{json.dumps(issue)}\\n\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "harvest_all_issues()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explore the data" ] }, { "cell_type": "code", "execution_count": 139, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "df_issues = pd.read_json(\n", " \"ned-issues.ndjson\", convert_dates=False, dtype={\"date\": \"Int64\"}, lines=True\n", ")" ] }, { "cell_type": "code", "execution_count": 140, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_issues.to_csv(\"ned-periodical-issues.csv\", index=False)" ] }, { "cell_type": "code", "execution_count": 141, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "(179510, 9)" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_issues.shape" ] }, { "cell_type": "code", "execution_count": 142, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "df_totals = (\n", " df_issues.loc[df_issues[\"access_conditions\"] == \"Unrestricted\"]\n", " .groupby([\"title_id\", \"title\"])\n", " .size()\n", " .to_frame()\n", " .reset_index()\n", ")" ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>title_id</th>\n", " <th>title</th>\n", " <th>0</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1737</th>\n", " <td>nla.obj-1916881555</td>\n", " <td>Western Australian government gazette.</td>\n", " <td>2021</td>\n", " </tr>\n", " <tr>\n", " <th>2598</th>\n", " <td>nla.obj-2692666983</td>\n", " <td>APSjobs-vacancies daily ... 
daily gazette.</td>\n", " <td>1255</td>\n", " </tr>\n", " <tr>\n", " <th>4424</th>\n", " <td>nla.obj-2940864261</td>\n", " <td>The Australian Jewish News.</td>\n", " <td>1067</td>\n", " </tr>\n", " <tr>\n", " <th>4448</th>\n", " <td>nla.obj-2945379691</td>\n", " <td>Tweed link</td>\n", " <td>880</td>\n", " </tr>\n", " <tr>\n", " <th>2201</th>\n", " <td>nla.obj-2541626239</td>\n", " <td>Weekly notice</td>\n", " <td>798</td>\n", " </tr>\n", " <tr>\n", " <th>34</th>\n", " <td>nla.obj-1252109725</td>\n", " <td>Queensland Health services bulletin</td>\n", " <td>745</td>\n", " </tr>\n", " <tr>\n", " <th>4423</th>\n", " <td>nla.obj-2940863963</td>\n", " <td>The Australian Jewish News.</td>\n", " <td>726</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>nla.obj-1247944368</td>\n", " <td>Hyden Karlgarin Householder News.</td>\n", " <td>680</td>\n", " </tr>\n", " <tr>\n", " <th>752</th>\n", " <td>nla.obj-1775015332</td>\n", " <td>E-record : your news from across the Archdioce...</td>\n", " <td>679</td>\n", " </tr>\n", " <tr>\n", " <th>7761</th>\n", " <td>nla.obj-638303044</td>\n", " <td>Class ruling</td>\n", " <td>648</td>\n", " </tr>\n", " <tr>\n", " <th>2191</th>\n", " <td>nla.obj-2536144595</td>\n", " <td>Plantagenet news.</td>\n", " <td>594</td>\n", " </tr>\n", " <tr>\n", " <th>3383</th>\n", " <td>nla.obj-2815835489</td>\n", " <td>The Apollo Bay news.</td>\n", " <td>560</td>\n", " </tr>\n", " <tr>\n", " <th>5642</th>\n", " <td>nla.obj-3125539859</td>\n", " <td>The Peninsula community access news.</td>\n", " <td>528</td>\n", " </tr>\n", " <tr>\n", " <th>3939</th>\n", " <td>nla.obj-2859788676</td>\n", " <td>Council news : weekly information from us to you</td>\n", " <td>520</td>\n", " </tr>\n", " <tr>\n", " <th>184</th>\n", " <td>nla.obj-1252305285</td>\n", " <td>Clermont rag : Community newspaper.</td>\n", " <td>514</td>\n", " </tr>\n", " <tr>\n", " <th>1710</th>\n", " <td>nla.obj-1908935587</td>\n", " <td>Assessment reports and exam papers</td>\n", " <td>512</td>\n", " </tr>\n", " <tr>\n", " <th>42</th>\n", " <td>nla.obj-1252119874</td>\n", " <td>Rot-Ayr-Ian [electronic resource] : the offici...</td>\n", " <td>467</td>\n", " </tr>\n", " <tr>\n", " <th>140</th>\n", " <td>nla.obj-1252246096</td>\n", " <td>Palm Island Voice.</td>\n", " <td>454</td>\n", " </tr>\n", " <tr>\n", " <th>4886</th>\n", " <td>nla.obj-2994765231</td>\n", " <td>Townsville Orchid Society Inc. bulletin.</td>\n", " <td>452</td>\n", " </tr>\n", " <tr>\n", " <th>4459</th>\n", " <td>nla.obj-2949797877</td>\n", " <td>Short list</td>\n", " <td>431</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " title_id title \\\n", "1737 nla.obj-1916881555 Western Australian government gazette. \n", "2598 nla.obj-2692666983 APSjobs-vacancies daily ... daily gazette. \n", "4424 nla.obj-2940864261 The Australian Jewish News. \n", "4448 nla.obj-2945379691 Tweed link \n", "2201 nla.obj-2541626239 Weekly notice \n", "34 nla.obj-1252109725 Queensland Health services bulletin \n", "4423 nla.obj-2940863963 The Australian Jewish News. \n", "16 nla.obj-1247944368 Hyden Karlgarin Householder News. \n", "752 nla.obj-1775015332 E-record : your news from across the Archdioce... \n", "7761 nla.obj-638303044 Class ruling \n", "2191 nla.obj-2536144595 Plantagenet news. \n", "3383 nla.obj-2815835489 The Apollo Bay news. \n", "5642 nla.obj-3125539859 The Peninsula community access news. \n", "3939 nla.obj-2859788676 Council news : weekly information from us to you \n", "184 nla.obj-1252305285 Clermont rag : Community newspaper. 
\n", "1710 nla.obj-1908935587 Assessment reports and exam papers \n", "42 nla.obj-1252119874 Rot-Ayr-Ian [electronic resource] : the offici... \n", "140 nla.obj-1252246096 Palm Island Voice. \n", "4886 nla.obj-2994765231 Townsville Orchid Society Inc. bulletin. \n", "4459 nla.obj-2949797877 Short list \n", "\n", " 0 \n", "1737 2021 \n", "2598 1255 \n", "4424 1067 \n", "4448 880 \n", "2201 798 \n", "34 745 \n", "4423 726 \n", "16 680 \n", "752 679 \n", "7761 648 \n", "2191 594 \n", "3383 560 \n", "5642 528 \n", "3939 520 \n", "184 514 \n", "1710 512 \n", "42 467 \n", "140 454 \n", "4886 452 \n", "4459 431 " ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_totals.sort_values(0, ascending=False)[:20]" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "access_conditions\n", "Unrestricted 155783\n", "View Only 15118\n", "Onsite Only 8609\n", "Name: count, dtype: int64" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_issues[\"access_conditions\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 145, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "ebook_type\n", "application/pdf 178553\n", " 838\n", "application/epub+zip 119\n", "Name: count, dtype: int64" ] }, "execution_count": 145, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_issues[\"ebook_type\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create an SQLite database\n", "\n", "I'm creating an SQLite database that can be used with Datasette to make it easier to explore the NED periodicals. The code creates a database with two linked tables, `titles` and `issues`. You can [view the result here](https://glam-workbench.net/datasette-lite/?url=https://github.com/GLAM-Workbench/trove-ned-periodicals-data/blob/main/ned-periodicals.db&install=datasette-json-html&install=datasette-template-sql&metadata=https://github.com/GLAM-Workbench/trove-ned-periodicals-data/blob/main/metadata.json)." 
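] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Once the database has been built by the cells below, you can also query it directly with `sqlite-utils`. The sketch that follows lists a few issues linked to a title through the `title_id` foreign key -- the `title_id` value used here is just a placeholder, so substitute a real one from your data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# A sketch of querying the finished database -- run this after the\n", "# database has been created below. The title_id is a placeholder.\n", "if Path(\"ned-periodicals.db\").exists():\n", "    ned_db = Database(\"ned-periodicals.db\")\n", "    for row in ned_db[\"issues\"].rows_where(\n", "        \"title_id = ?\", [\"nla.obj-0000000000\"], limit=5\n", "    ):\n", "        print(row[\"id\"], row[\"url\"])"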
] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "def add_download_link(row):\n", " url = \"\"\n", " if row[\"access_conditions\"] == \"Unrestricted\":\n", " url = f\"https://nla.gov.au/{row['id']}/download?downloadOption=eBook&firstPage=-1&lastPage=-1\"\n", " return url\n", "\n", "\n", "df_issues[\"download_link\"] = df_issues.apply(add_download_link, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "db = Database(\"ned-periodicals.db\", recreate=True)\n", "df_titles.insert(\n", " 0,\n", " \"thumbnail\",\n", " df_titles[\"fulltext_url\"].apply(\n", " lambda x: f'{{\"img_src\": \"{x + \"-t\"}\"}}' if not pd.isnull(x) else \"\"\n", " ),\n", ")\n", "db[\"titles\"].insert_all(df_titles.to_dict(orient=\"records\"), pk=\"id\")\n", "db[\"titles\"].enable_fts([\"title\", \"contributor\", \"publisher\", \"subject\"])\n", "\n", "\n", "df_issues.insert(\n", " 0,\n", " \"thumbnail\",\n", " df_issues[\"url\"].apply(\n", " lambda x: f'{{\"img_src\": \"{x + \"-t\"}\"}}' if not pd.isnull(x) else \"\"\n", " ),\n", ")\n", "df_issues = df_issues.drop(\"title\", axis=1)\n", "db[\"issues\"].insert_all(df_issues.to_dict(orient=\"records\"), pk=\"id\")\n", "db[\"issues\"].add_foreign_key(\"title_id\", \"titles\", \"id\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# IGNORE THIS CELL -- FOR TESTING ONLY\n", "if os.getenv(\"GW_STATUS\") == \"dev\":\n", " df_test = pd.read_json(\"ned-periodicals.ndjson\", lines=True)[:20]\n", " df_merged_test = merge_records(df_test)\n", " df_titles_test = save_ned_titles(df_merged_test, \"ned-periodicals-test.csv\")\n", " harvest_all_issues(\n", " input=\"ned-periodicals-test.csv\", output=\"ned-periodicals-issues-test.ndjson\"\n", " )\n", "\n", " Path(\"ned-periodicals-test.csv\").unlink()\n", " Path(\"ned-periodicals-issues-test.ndjson\").unlink()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "rocrate": { "action": [ { "description": "This dataset contains details of periodical titles and issues submitted to the Trove through the NLA's National edeposit scheme. 
It includes CSV-formatted lists of titles and issues, and an SQLite database created for use with Datasette-Lite.", "isPartOf": "https://github.com/GLAM-Workbench/trove-ned-periodicals-data", "mainEntityOfPage": "https://glam-workbench.net/trove-journals/trove-ned-periodicals-data/", "result": [ { "url": "https://github.com/GLAM-Workbench/trove-ned-periodicals-data/raw/main/ned-periodicals.csv" }, { "url": "https://github.com/GLAM-Workbench/trove-ned-periodicals-data/raw/main/ned-periodical-issues.csv" }, { "description": "This SQLite database contains data relating to digitised periodical titles and issues from Trove. It was created for use with Datasette-Lite. There is a foreign key link between the issues and the titles, making it easy to find the issues from any title. Some extra columns have been added to include thumbnails and download links.", "url": "https://github.com/GLAM-Workbench/trove-ned-periodicals-data/raw/main/ned-periodicals.db" } ], "workExample": [ { "name": "Explore in Datasette", "url": "https://glam-workbench.net/datasette-lite/?url=https://github.com/GLAM-Workbench/trove-ned-periodicals-data/blob/main/ned-periodicals.db&install=datasette-json-html&install=datasette-template-sql&metadata=https://github.com/GLAM-Workbench/trove-ned-periodicals-data/blob/main/metadata.json" } ] } ], "author": [ { "mainEntityOfPage": "https://timsherratt.au", "name": "Sherratt, Tim", "orcid": "https://orcid.org/0000-0001-7956-4498" } ], "category": "Harvesting metadata", "description": "This notebook harvests details of periodicals submitted to Trove through the National edeposit scheme (NED). It creates two datasets, one containing details of the periodical titles, and the other listing all the available issues.\n\nThere are two main harvesting steps. The first is to search for periodicals using the API's /result endpoint using the following parameters:\n\n* `q` set to `\"nla.obj\" nuc:\"ANL:NED\"`\n* `format` facet to `Periodical`\n* and `l-availability` set to `y`\n\nThe work records returned by this search are unpacked, and individual versions saved to make sure we get everything. Once this is complete, any duplicate records are merged.\n\nThe second step harvests details of issues by extracting a list of issues for each title from the collection viewer. It then supplements the issue metadata by extracting information for each issue from the journal viewer.", "mainEntityOfPage": "https://glam-workbench.net/trove-journals/harvest-ned-periodicals/", "name": "Harvest details of periodicals submitted to Trove through the National edeposit scheme (NED)", "position": 4 } }, "nbformat": 4, "nbformat_minor": 4 }