{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "source": [
    "# Create a list of Trove's digital periodicals\n",
    "\n",
    "This notebook creates a list of digitised periodicals in Trove by searching for the digital identifier string `nla.obj` and limiting the results to periodicals. Before the Trove API introduced the `/magazine/titles` endpoint, this was the only way to generate such a list. This method produces slightly different results to the new API endpoint, and it might be useful to compare the two to see what each method misses. [Get details of periodicals from the /magazine/titles API endpoint](periodicals-from-api.ipynb) and [Enrich the list of periodicals from the Trove API](periodicals-enrich-for-datasette.ipynb) demonstrate how to compile a list of periodicals from the `/magazine/titles` endpoint.\n",
    "\n",
    "The harvesting strategy used in this notebook is similar to that described in the Trove Data Guides' [HOW TO: Harvest data relating to digitised resources](https://tdg.glam-workbench.net/other-digitised-resources/how-to/harvest-digitised-resources.html). Because of variations in the way digitised resources are described and organised, it seems best to harvest all available version records individually, and then merge duplicates at a later step.\n",
    "\n",
    "The full search query used is `\"nla.obj\" NOT series:\"Parliamentary paper (Australia. Parliament)\" NOT nuc:\"ANL:NED\"`. This attempts to exclude Parliamentary Papers and periodicals submitted through the National edeposit (NED) scheme."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "outputs": [
    {
     "data": { "text/plain": [ "True" ] },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Let's import the libraries we need.\n",
    "import json\n",
    "import os\n",
    "import re\n",
    "from datetime import timedelta\n",
    "from functools import reduce\n",
    "from pathlib import Path\n",
    "\n",
    "import pandas as pd\n",
    "import requests_cache\n",
    "from dotenv import load_dotenv\n",
    "from requests.adapters import HTTPAdapter\n",
    "from tqdm.auto import tqdm\n",
    "from urllib3.util.retry import Retry\n",
    "\n",
    "s = requests_cache.CachedSession(expire_after=timedelta(days=30))\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
    "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n",
    "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n",
    "\n",
    "load_dotenv()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "source": [
    "## Add your Trove API key\n",
    "\n",
    "You can get a Trove API key by [following these instructions](https://trove.nla.gov.au/about/create-something/using-api)."
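,
    "\n",
    "If you're running this notebook yourself, you can also store your key in a `.env` file in the same directory as the notebook -- the `load_dotenv()` call above reads that file, and the next cell picks the key up from the environment. A minimal sketch of such a file (`your-api-key-here` is a placeholder; `TROVE_API_KEY` is the variable name the next cell checks for):\n",
    "\n",
    "```\n",
    "TROVE_API_KEY=your-api-key-here\n",
    "```"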
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "outputs": [],
   "source": [
    "# Insert your Trove API key\n",
    "API_KEY = \"YOUR API KEY\"\n",
    "\n",
    "# Use the api key value from environment variables if it is available\n",
    "if os.getenv(\"TROVE_API_KEY\"):\n",
    "    API_KEY = os.getenv(\"TROVE_API_KEY\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "source": [
    "## Define some functions to do the work"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "outputs": [],
   "source": [
    "def get_total_results(params, headers):\n",
    "    \"\"\"\n",
    "    Get the total number of results for a search.\n",
    "    \"\"\"\n",
    "    these_params = params.copy()\n",
    "    these_params[\"n\"] = 0\n",
    "    response = s.get(\n",
    "        \"https://api.trove.nla.gov.au/v3/result\", params=these_params, headers=headers\n",
    "    )\n",
    "    data = response.json()\n",
    "    return int(data[\"category\"][0][\"records\"][\"total\"])\n",
    "\n",
    "\n",
    "def get_value(record, field, keys=[\"value\"]):\n",
    "    \"\"\"\n",
    "    Get the values of a field.\n",
    "    Some fields are lists of dicts, if so use the `keys` to get the values.\n",
    "    \"\"\"\n",
    "    value = record.get(field, [])\n",
    "    if value and isinstance(value[0], dict):\n",
    "        for key in keys:\n",
    "            try:\n",
    "                return [re.sub(r\"\\s+\", \" \", v[key]) for v in value]\n",
    "            except KeyError:\n",
    "                pass\n",
    "        return []\n",
    "    return value\n",
    "\n",
    "\n",
    "def merge_values(record, fields, keys=[\"value\"]):\n",
    "    \"\"\"\n",
    "    Merges values from multiple fields, removing any duplicates.\n",
    "    \"\"\"\n",
    "    values = []\n",
    "    for field in fields:\n",
    "        values += get_value(record, field, keys)\n",
    "    # Remove duplicates and None values\n",
    "    return list(set([v for v in values if v is not None]))\n",
    "\n",
    "\n",
    "def flatten_values(record, field, key=\"type\"):\n",
    "    \"\"\"\n",
    "    If a field has a value and type, return the values as strings with this format: 'type: value'\n",
    "    \"\"\"\n",
    "    flattened = []\n",
    "    values = record.get(field, [])\n",
    "    for value in values:\n",
    "        if key in value:\n",
    "            flattened.append(f\"{value[key]}: {value['value']}\")\n",
    "        else:\n",
    "            flattened.append(value[\"value\"])\n",
    "    return flattened\n",
    "\n",
    "\n",
    "def flatten_identifiers(record):\n",
    "    \"\"\"\n",
    "    Get a list of control numbers from the identifier field and flatten the values.\n",
    "    \"\"\"\n",
    "    ids = {\n",
    "        \"identifier\": [\n",
    "            v\n",
    "            for v in record.get(\"identifier\", [])\n",
    "            if \"type\" in v and v[\"type\"] == \"control number\"\n",
    "        ]\n",
    "    }\n",
    "    return flatten_values(ids, \"identifier\", \"source\")\n",
    "\n",
    "\n",
    "def get_fulltext_url(links):\n",
    "    \"\"\"\n",
    "    Loop through the identifiers to find a link to the full text version of the book.\n",
    "    \"\"\"\n",
    "    urls = []\n",
    "    for link in links:\n",
    "        if (\n",
    "            \"linktype\" in link\n",
    "            and link[\"linktype\"] == \"fulltext\"\n",
    "            and \"nla.obj\" in link[\"value\"]\n",
    "        ):\n",
    "            url = re.sub(r\"^http\\b\", \"https\", link[\"value\"])\n",
    "            link_text = link.get(\"linktext\", \"\")\n",
    "            urls.append({\"url\": url, \"link_text\": link_text})\n",
    "    return urls\n",
    "\n",
    "\n",
    "def get_catalogue_url(links):\n",
    "    \"\"\"\n",
    "    Loop through the identifiers to find a link to the NLA catalogue.\n",
    "    \"\"\"\n",
    "    for link in links:\n",
    "        if (\n",
    "            \"linktype\" in link\n",
    "            and link[\"linktype\"] == \"notonline\"\n",
    "            and \"nla.cat\" in link[\"value\"]\n",
    "        ):\n",
    "            return link[\"value\"]\n",
    "    return \"\"\n",
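    "\n",
    "\n",
    "# For example, given a hypothetical links list (for illustration only):\n",
    "#\n",
    "#   links = [{\"linktype\": \"fulltext\", \"value\": \"http://nla.obj-123\", \"linktext\": \"View\"}]\n",
    "#   get_fulltext_url(links) -> [{\"url\": \"https://nla.obj-123\", \"link_text\": \"View\"}]\n",
    "#\n",
    "# Note that get_fulltext_url() only keeps 'fulltext' links containing 'nla.obj',\n",
    "# and rewrites http urls to https.\n",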
    "\n",
    "\n",
    "def has_fulltext_link(links):\n",
    "    \"\"\"\n",
    "    Check if a list of identifiers includes a fulltext url pointing to an NLA resource.\n",
    "    \"\"\"\n",
    "    for link in links:\n",
    "        if (\n",
    "            \"linktype\" in link\n",
    "            and link[\"linktype\"] == \"fulltext\"\n",
    "            and \"nla.obj\" in link[\"value\"]\n",
    "        ):\n",
    "            return True\n",
    "    return False\n",
    "\n",
    "\n",
    "def has_holding(holdings, nucs):\n",
    "    \"\"\"\n",
    "    Check if a list of holdings includes one of the supplied nucs.\n",
    "    \"\"\"\n",
    "    for holding in holdings:\n",
    "        if holding.get(\"nuc\") in nucs:\n",
    "            return True\n",
    "    return False\n",
    "\n",
    "\n",
    "def get_digitised_versions(work):\n",
    "    \"\"\"\n",
    "    Get the versions from the given work that have a fulltext url pointing to an NLA resource\n",
    "    in the `identifier` field.\n",
    "    \"\"\"\n",
    "    versions = []\n",
    "    for version in work[\"version\"]:\n",
    "        if \"identifier\" in version and has_fulltext_link(version[\"identifier\"]):\n",
    "            versions.append(version)\n",
    "    return versions\n",
    "\n",
    "\n",
    "def get_nuc_versions(work, nucs=[\"ANL\", \"ANL:DL\"]):\n",
    "    \"\"\"\n",
    "    Get the versions from the given work that are held by the NLA.\n",
    "    \"\"\"\n",
    "    versions = []\n",
    "    for version in work[\"version\"]:\n",
    "        if \"holding\" in version and has_holding(version[\"holding\"], nucs):\n",
    "            versions.append(version)\n",
    "    return versions\n",
    "\n",
    "\n",
    "def harvest_works(\n",
    "    params,\n",
    "    filter_by=\"url\",\n",
    "    nucs=[\"ANL\", \"ANL:DL\"],\n",
    "    output_file=\"harvested-metadata.ndjson\",\n",
    "):\n",
    "    \"\"\"\n",
    "    Harvest metadata relating to digitised works.\n",
    "    The filter_by parameter selects records for inclusion in the dataset, options:\n",
    "    * url -- only include versions that have an NLA fulltext url\n",
    "    * nuc -- only include versions that have an NLA nuc (ANL or ANL:DL)\n",
    "    \"\"\"\n",
    "    default_params = {\n",
    "        \"category\": \"all\",\n",
    "        \"bulkHarvest\": \"true\",\n",
    "        \"n\": 100,\n",
    "        \"encoding\": \"json\",\n",
    "        \"include\": [\"links\", \"workversions\", \"holdings\"],\n",
    "    }\n",
    "    params.update(default_params)\n",
    "    headers = {\"X-API-KEY\": API_KEY}\n",
    "    total = get_total_results(params, headers)\n",
    "    start = \"*\"\n",
    "    with Path(output_file).open(\"w\") as ndjson_file:\n",
    "        with tqdm(total=total) as pbar:\n",
    "            while start:\n",
    "                params[\"s\"] = start\n",
    "                response = s.get(\n",
    "                    \"https://api.trove.nla.gov.au/v3/result\",\n",
    "                    params=params,\n",
    "                    headers=headers,\n",
    "                )\n",
    "                data = response.json()\n",
    "                items = data[\"category\"][0][\"records\"][\"item\"]\n",
    "                for item in items:\n",
    "                    for category, record in item.items():\n",
    "                        if category == \"work\":\n",
    "                            if filter_by == \"nuc\":\n",
    "                                versions = get_nuc_versions(record, nucs)\n",
    "                            else:\n",
    "                                versions = get_digitised_versions(record)\n",
    "                            # Sometimes there are fulltext links on the work but not the versions\n",
    "                            if len(versions) == 0 and has_fulltext_link(\n",
    "                                record.get(\"identifier\", [])\n",
    "                            ):\n",
    "                                versions = record[\"version\"]\n",
    "                            for version in versions:\n",
    "                                for sub_version in version[\"record\"]:\n",
    "                                    metadata = sub_version[\"metadata\"][\"dc\"]\n",
    "                                    # Sometimes fulltext identifiers are only available on the\n",
    "                                    # version rather than the sub version. So we'll look in the\n",
    "                                    # sub version first, and if they're not there use the url from\n",
    "                                    # the version.\n",
    "                                    # Sometimes there are multiple fulltext urls associated with a version:\n",
    "                                    # eg a collection page and a publication. If so, add records for both urls.\n",
    "                                    # They could end up pointing to the same digitised publication, but\n",
    "                                    # we can sort that out later. The aim here is to try not to miss any\n",
    "                                    # possible routes to digitised publications!\n",
    "                                    urls = get_fulltext_url(\n",
    "                                        metadata.get(\"identifier\", [])\n",
    "                                    )\n",
    "                                    if len(urls) == 0:\n",
    "                                        urls = get_fulltext_url(\n",
    "                                            version.get(\"identifier\", [])\n",
    "                                        )\n",
    "                                    # Sometimes there are fulltext links on the work but not the versions\n",
    "                                    if len(urls) == 0:\n",
    "                                        urls = get_fulltext_url(\n",
    "                                            record.get(\"identifier\", [])\n",
    "                                        )\n",
    "                                    if len(urls) == 0 and filter_by == \"nuc\":\n",
    "                                        urls = [{\"url\": \"\", \"link_text\": \"\"}]\n",
    "                                    for url in urls:\n",
    "                                        work = {\n",
    "                                            # This is not the full set of available fields,\n",
    "                                            # adjust as necessary.\n",
    "                                            \"title\": get_value(metadata, \"title\"),\n",
    "                                            \"work_url\": record.get(\"troveUrl\"),\n",
    "                                            \"work_type\": record.get(\"type\", []),\n",
    "                                            \"contributor\": merge_values(\n",
    "                                                metadata,\n",
    "                                                [\"creator\", \"contributor\"],\n",
    "                                                [\"value\", \"name\"],\n",
    "                                            ),\n",
    "                                            \"publisher\": get_value(\n",
    "                                                metadata, \"publisher\"\n",
    "                                            ),\n",
    "                                            \"date\": merge_values(\n",
    "                                                metadata, [\"date\", \"issued\"]\n",
    "                                            ),\n",
    "                                            # Using merge here because I've noticed some duplicate values\n",
    "                                            \"type\": merge_values(metadata, [\"type\"]),\n",
    "                                            \"format\": get_value(metadata, \"format\"),\n",
    "                                            \"rights\": merge_values(\n",
    "                                                metadata, [\"rights\", \"licenseRef\"]\n",
    "                                            ),\n",
    "                                            \"language\": get_value(metadata, \"language\"),\n",
    "                                            \"extent\": get_value(metadata, \"extent\"),\n",
    "                                            \"subject\": merge_values(\n",
    "                                                metadata, [\"subject\"]\n",
    "                                            ),\n",
    "                                            \"spatial\": get_value(metadata, \"spatial\"),\n",
    "                                            # Flattened type/value\n",
    "                                            \"is_part_of\": flatten_values(\n",
    "                                                metadata, \"isPartOf\"\n",
    "                                            ),\n",
    "                                            # Only get control numbers and flatten\n",
    "                                            \"identifier\": flatten_identifiers(metadata),\n",
    "                                            \"fulltext_url\": url[\"url\"],\n",
    "                                            \"fulltext_url_text\": url[\"link_text\"],\n",
    "                                            \"catalogue_url\": get_catalogue_url(\n",
    "                                                metadata.get(\"identifier\", [])\n",
    "                                            ),\n",
    "                                            # Could also add in data from bibliographicCitation,\n",
    "                                            # although the types used in citations seem to vary by work and format.\n",
    "                                        }\n",
    "                                        ndjson_file.write(f\"{json.dumps(work)}\\n\")\n",
    "                # The nextStart parameter is used to get the next page of results.\n",
    "                # If there's no nextStart then it means we're on the last page of results.\n",
    "                try:\n",
    "                    start = data[\"category\"][0][\"records\"][\"nextStart\"]\n",
    "                except KeyError:\n",
    "                    start = None\n",
    "                pbar.update(len(items))"
   ]
  },
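  {
   "cell_type": "markdown",
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "source": [
    "Before you start a long harvest, it can be useful to check how many version records the search will return. The cell below is a minimal sketch that feeds the same query and facets as the harvest into `get_total_results()`, using the `API_KEY` you set above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "outputs": [],
   "source": [
    "# Preview the number of matching records before harvesting.\n",
    "# This uses the same query and facets as the harvest below.\n",
    "preview_params = {\n",
    "    \"q\": '\"nla.obj\" NOT series:\"Parliamentary paper (Australia. Parliament)\" NOT nuc:\"ANL:NED\"',\n",
    "    \"l-format\": \"Periodical\",\n",
    "    \"l-availability\": \"y\",\n",
    "    \"category\": \"all\",\n",
    "    \"encoding\": \"json\",\n",
    "}\n",
    "print(get_total_results(preview_params, {\"X-API-KEY\": API_KEY}))"
   ]
  },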
  {
   "cell_type": "markdown",
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "source": [
    "## Run the harvest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "outputs": [],
   "source": [
    "params = {\n",
    "    \"q\": '\"nla.obj\" NOT series:\"Parliamentary paper (Australia. Parliament)\" NOT nuc:\"ANL:NED\"',\n",
    "    \"l-format\": \"Periodical\",  # Journals only\n",
    "    \"l-availability\": \"y\",\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "afedf879ef5e454d8dcfb54d21f3855a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       " 0%| | 0/1078 [00:00