{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Convert a Trove list into a CSV file\n",
    "\n",
    "This notebook converts [Trove lists](https://trove.nla.gov.au/list/result?q=) into CSV files (spreadsheets). Separate CSV files are created for newspaper articles and works from Trove's other zones. You can also save the OCRd text, a PDF, and an image of each newspaper article."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!.</p>\n",
    "\n",
    "<p>\n",
    "    Some tips:\n",
    "    <ul>\n",
    "        <li>Code cells have boxes around them.</li>\n",
    "        <li>To run a code cell either click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>\n",
    "        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterix will be replaced with a number.</li>\n",
    "        <li>In most cases you'll want to start from the top of notebook and work your way down running each cell in turn. Later cells might depend on the results of earlier ones.</li>\n",
    "        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>\n",
    "    </ul>\n",
    "</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set things up\n",
    "\n",
    "Run the cell below to load the necessary libraries and set up some directories to store the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import re\n",
    "import shutil\n",
    "import time\n",
    "from pathlib import Path\n",
    "\n",
    "import pandas as pd\n",
    "import requests\n",
    "from dotenv import load_dotenv\n",
    "from IPython.display import HTML\n",
    "from PIL import UnidentifiedImageError\n",
    "from requests.adapters import HTTPAdapter\n",
    "from requests.exceptions import HTTPError\n",
    "from requests.packages.urllib3.util.retry import Retry\n",
    "from tqdm.auto import tqdm\n",
    "from trove_newspaper_images.articles import download_images\n",
    "\n",
    "s = requests.Session()\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])\n",
    "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n",
    "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n",
    "\n",
    "load_dotenv()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add your values to these two cells\n",
    "\n",
    "This is the only section that you'll need to edit. Paste your API key and list id in the cells below as indicated.\n",
    "\n",
    "If necessary, follow the instructions in the Trove Help to [obtain your own Trove API Key](http://help.nla.gov.au/trove/building-with-trove/api).\n",
    "\n",
    "The list id is the number in the url of your Trove list. So [the list](https://trove.nla.gov.au/list/83774) with this url `https://trove.nla.gov.au/list/83774` has an id of `83774`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Insert your Trove API key between the quotes\n",
    "API_KEY = \"YOUR API KEY\"\n",
    "\n",
    "# Use api key value from environment variables if it is available\n",
    "if os.getenv(\"TROVE_API_KEY\"):\n",
    "    API_KEY = os.getenv(\"TROVE_API_KEY\")\n",
    "\n",
    "headers = {\"X-API-KEY\": API_KEY}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Paste your list id below, and set your preferences for saving newspaper articles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Paste your list id between the quotes, and then run the cell\n",
    "list_id = \"83777\"\n",
    "\n",
    "# If you don't want to save all the OCRd text, change True to False below\n",
    "save_texts = True\n",
    "\n",
    "# Change this to True if you want to save PDFs of newspaper articles\n",
    "save_pdfs = False\n",
    "\n",
    "# Change this to False if you don't want to save images of newspaper articles\n",
    "save_images = False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define some functions\n",
    "\n",
    "Run the cell below to set up all the functions we'll need for the conversion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_url(identifiers, linktype):\n",
    "    \"\"\"\n",
    "    Loop through the identifiers to find the requested url.\n",
    "    \"\"\"\n",
    "    url = \"\"\n",
    "    for identifier in identifiers:\n",
    "        if identifier[\"linktype\"] == linktype:\n",
    "            url = identifier[\"value\"]\n",
    "            break\n",
    "    return url\n",
    "\n",
    "\n",
    "def save_as_csv(list_dir, data, data_type):\n",
    "    df = pd.DataFrame(data)\n",
    "    df.to_csv(\"{}/{}-{}.csv\".format(list_dir, list_id, data_type), index=False)\n",
    "\n",
    "\n",
    "def make_filename(article):\n",
    "    \"\"\"\n",
    "    Create a filename for a text file or PDF.\n",
    "    For easy sorting/aggregation the filename has the format:\n",
    "        PUBLICATIONDATE-NEWSPAPERID-ARTICLEID\n",
    "    \"\"\"\n",
    "    date = article[\"date\"]\n",
    "    date = date.replace(\"-\", \"\")\n",
    "    newspaper_id = article[\"newspaper_id\"]\n",
    "    article_id = article[\"id\"]\n",
    "    return \"{}-{}-{}\".format(date, newspaper_id, article_id)\n",
    "\n",
    "\n",
    "def get_list(list_id):\n",
    "    list_url = f\"https://api.trove.nla.gov.au/v3/list/{list_id}?encoding=json&reclevel=full&include=listItems\"\n",
    "    response = s.get(list_url, headers=headers)\n",
    "    return response.json()\n",
    "\n",
    "\n",
    "def get_article(id):\n",
    "    article_api_url = f\"https://api.trove.nla.gov.au/v3/newspaper/{id}?encoding=json&reclevel=full&include=articletext\"\n",
    "    response = s.get(article_api_url, headers=headers)\n",
    "    return response.json()\n",
    "\n",
    "\n",
    "def make_dirs(list_id):\n",
    "    list_dir = Path(\"data\", \"converted-lists\", list_id)\n",
    "    list_dir.mkdir(parents=True, exist_ok=True)\n",
    "    Path(list_dir, \"text\").mkdir(exist_ok=True)\n",
    "    Path(list_dir, \"image\").mkdir(exist_ok=True)\n",
    "    Path(list_dir, \"pdf\").mkdir(exist_ok=True)\n",
    "    return list_dir\n",
    "\n",
    "\n",
    "def ping_pdf(ping_url):\n",
    "    \"\"\"\n",
    "    Check to see if a PDF is ready for download.\n",
    "    If a 200 status code is received, return True.\n",
    "    \"\"\"\n",
    "    ready = False\n",
    "    # req = Request(ping_url)\n",
    "    try:\n",
    "        # urlopen(req)\n",
    "        response = s.get(ping_url, timeout=30)\n",
    "        response.raise_for_status()\n",
    "    except HTTPError:\n",
    "        if response.status_code == 423:\n",
    "            ready = False\n",
    "        else:\n",
    "            raise\n",
    "    else:\n",
    "        ready = True\n",
    "    return ready\n",
    "\n",
    "\n",
    "def get_pdf_url(article_id, zoom=3):\n",
    "    \"\"\"\n",
    "    Download the PDF version of an article.\n",
    "    These can take a while to generate, so we need to ping the server to see if it's ready before we download.\n",
    "    \"\"\"\n",
    "    pdf_url = None\n",
    "    # Ask for the PDF to be created\n",
    "    prep_url = f\"https://trove.nla.gov.au/newspaper/rendition/nla.news-article{article_id}/level/{zoom}/prep\"\n",
    "    response = s.get(prep_url)\n",
    "    # Get the hash\n",
    "    prep_id = response.text\n",
    "    # Url to check if the PDF is ready\n",
    "    ping_url = f\"https://trove.nla.gov.au/newspaper/rendition/nla.news-article{article_id}.{zoom}.ping?followup={prep_id}\"\n",
    "    tries = 0\n",
    "    ready = False\n",
    "    time.sleep(2)  # Give some time to generate pdf\n",
    "    # Are you ready yet?\n",
    "    while ready is False and tries < 5:\n",
    "        ready = ping_pdf(ping_url)\n",
    "        if not ready:\n",
    "            tries += 1\n",
    "            time.sleep(2)\n",
    "    # Download if ready\n",
    "    if ready:\n",
    "        pdf_url = f\"https://trove.nla.gov.au/newspaper/rendition/nla.news-article{article_id}.{zoom}.pdf?followup={prep_id}\"\n",
    "    return pdf_url\n",
    "\n",
    "\n",
    "def harvest_list(list_id, save_text=True, save_pdfs=False, save_images=False):\n",
    "    list_dir = make_dirs(list_id)\n",
    "    data = get_list(list_id)\n",
    "    works = []\n",
    "    articles = []\n",
    "    for item in tqdm(data[\"listItem\"]):\n",
    "        for zone, record in item.items():\n",
    "            if zone == \"work\":\n",
    "                work = {\n",
    "                    \"id\": record.get(\"id\", \"\"),\n",
    "                    \"title\": record.get(\"title\", \"\"),\n",
    "                    \"type\": \"|\".join(record.get(\"type\", [])),\n",
    "                    \"issued\": record.get(\"issued\", \"\"),\n",
    "                    \"contributor\": \"|\".join(record.get(\"contributor\", [])),\n",
    "                    \"trove_url\": record.get(\"troveUrl\", \"\"),\n",
    "                    \"fulltext_url\": get_url(record.get(\"identifier\", \"\"), \"fulltext\"),\n",
    "                    \"thumbnail_url\": get_url(record.get(\"identifier\", \"\"), \"thumbnail\"),\n",
    "                }\n",
    "                works.append(work)\n",
    "            elif zone == \"article\":\n",
    "                article = {\n",
    "                    \"id\": record.get(\"id\"),\n",
    "                    \"title\": record.get(\"heading\", \"\"),\n",
    "                    \"category\": record.get(\"category\", \"\"),\n",
    "                    \"date\": record.get(\"date\", \"\"),\n",
    "                    \"newspaper_id\": record.get(\"title\", {}).get(\"id\"),\n",
    "                    \"newspaper_title\": record.get(\"title\", {}).get(\"title\"),\n",
    "                    \"page\": record.get(\"page\", \"\"),\n",
    "                    \"page_sequence\": record.get(\"pageSequence\", \"\"),\n",
    "                    \"trove_url\": f'http://nla.gov.au/nla.news-article{record.get(\"id\")}',\n",
    "                }\n",
    "                full_details = get_article(record.get(\"id\"))\n",
    "                article[\"words\"] = full_details.get(\"wordCount\", \"\")\n",
    "                article[\"illustrated\"] = full_details.get(\"illustrated\", \"\")\n",
    "                article[\"corrections\"] = full_details.get(\"correctionCount\", \"\")\n",
    "                if \"trovePageUrl\" in full_details:\n",
    "                    page_id = re.search(\n",
    "                        r\"page(\\d+)\", full_details[\"trovePageUrl\"]\n",
    "                    ).group(1)\n",
    "                    article[\"page_url\"] = (\n",
    "                        f\"http://trove.nla.gov.au/newspaper/page/{page_id}\"\n",
    "                    )\n",
    "                else:\n",
    "                    article[\"page_url\"] = \"\"\n",
    "                filename = make_filename(article)\n",
    "                if save_texts:\n",
    "                    text = full_details.get(\"articleText\")\n",
    "                    text_file = Path(list_dir, \"text\", f\"{filename}.txt\")\n",
    "                    if text:\n",
    "                        text = re.sub(r\"<[^<]+?>\", \"\", text)\n",
    "                        text = re.sub(r\"\\s\\s+\", \" \", text)\n",
    "                        text_file = Path(list_dir, \"text\", f\"{filename}.txt\")\n",
    "                        with open(text_file, \"wb\") as text_output:\n",
    "                            text_output.write(text.encode(\"utf-8\"))\n",
    "                if save_pdfs:\n",
    "                    pdf_url = get_pdf_url(record[\"id\"])\n",
    "                    if pdf_url:\n",
    "                        pdf_file = Path(list_dir, \"pdf\", f\"{filename}.pdf\")\n",
    "                        response = s.get(pdf_url, stream=True)\n",
    "                        with open(pdf_file, \"wb\") as pf:\n",
    "                            for chunk in response.iter_content(chunk_size=128):\n",
    "                                pf.write(chunk)\n",
    "                if save_images:\n",
    "                    images = []\n",
    "                    tries = 0\n",
    "                    # Trove has had some issues loading newspaper images lately\n",
    "                    # This is an attempted workaround\n",
    "                    while not images and tries < 2:\n",
    "                        try:\n",
    "                            images = download_images(\n",
    "                                article[\"id\"], Path(list_dir, \"image\"), masked=True\n",
    "                            )\n",
    "                        except UnidentifiedImageError:\n",
    "                            time.sleep(5)\n",
    "                            tries += 1\n",
    "\n",
    "                articles.append(article)\n",
    "    if articles:\n",
    "        save_as_csv(list_dir, articles, \"articles\")\n",
    "    if works:\n",
    "        save_as_csv(list_dir, works, \"works\")\n",
    "    return works, articles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's do it!\n",
    "\n",
    "Run the cell below to start the conversion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "works, articles = harvest_list(list_id, save_texts, save_pdfs, save_images)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## View the results\n",
    "\n",
    "You can browse the harvested files in the `data/converted-lists/[your list id]` directory.\n",
    "\n",
    "Run the cells below for a preview of the CSV files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Preview newspaper articles CSV\n",
    "df_articles = pd.DataFrame(articles)\n",
    "df_articles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Preview works CSV\n",
    "df_works = pd.DataFrame(works)\n",
    "df_works"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the results\n",
    "\n",
    "Run the cell below to zip up all the harvested files and create a download link."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "list_dir = Path(\"data\", \"converted-lists\", list_id)\n",
    "shutil.make_archive(list_dir, \"zip\", list_dir)\n",
    "HTML(f'<a download=\"{list_id}.zip\" href=\"{list_dir}.zip\">Download your harvest</a>')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "rocrate": {
   "author": [
    {
     "mainEntityOfPage": "https://timsherratt.au",
     "name": "Sherratt, Tim",
     "orcid": "https://orcid.org/0000-0001-7956-4498"
    }
   ],
   "category": "Lists",
   "description": "This notebook converts Trove lists into CSV files (spreadsheets). Separate CSV files are created for newspaper articles and works from Trove's other zones. You can also save the OCRd text, a PDF, and an image of each newspaper article.",
   "mainEntityOfPage": "https://glam-workbench.net/trove-lists/convert-a-trove-list-into-a-csv-file/",
   "name": "Convert a Trove list into a CSV file",
   "position": 0
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}