{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Harvesting the text of digitised books (and ephemera)\n",
    "\n",
    "This notebook harvests metadata and OCRd text from digitised books in Trove. There's three main steps:\n",
    "\n",
    "* Harvest metadata of digitised books using the Trove API\n",
    "* Extract the number of pages for each book from the Trove web interface (the number of pages is necessary to download the OCRd text)\n",
    "* Download the OCRd text for each book\n",
    "\n",
    "It's not easy to identify all the digitised books with OCRd text in Trove. I'm starting with a [search in the book zone](https://trove.nla.gov.au/search/category/books?keyword=%22nla.obj%22&l-availability=y&l-format=Book) for books that include the phrase `\"nla.obj\"` and are available online. This currently returns 65,050 results in the web interface. In amongst these are a range of government papers and reports, as well as publications where access to the digital copy is restricted. I think these are mostly recent books submitted in digital form under legal deposit. Things are made even more confusing by the fact that the contents of the 'Books & libraries' category in the web interface is not the same as the API's books zone. Anyway, I've used the new `fullTextInd` index to try and filter out works without any OCRd text. This reduces the total results to 40,751 results.\n",
    "\n",
    "But some of those 40,751 results are actually parent records that contain multiple volumes or parts. When I find the number of pages in each book, I'm also checking to see if the record is a 'Multi volume book' and has child works. If it does, I add the child works to the list of books. After this stage there are 42,174 works. However, not all of these 42,174 records have OCRd text. Parent records of multi volume works, and ebook formats like PDFs or MOBI, don't have individual pages, and therefore don't have any text to download. If we exclude works without pages, there are 31,402 works that might have some OCRd text to download.\n",
    "\n",
    "After downloading all the OCRd text files (ignoring any that were empty) I ended up with a grand total of **26,762 files**.\n",
    "\n",
    "If you compare the number of downloaded files to the number in the CSV file that are identified as having OCRd text you'll notice a difference – 26,762 compared to 29,652. After a bit more poking around I realised that there are some duplicates in the list of works. This seems to be because more than one Trove metadata record can point to the same digitised work. For example, both [this record](https://trove.nla.gov.au/work/192090169) and [this record](https://trove.nla.gov.au/work/31771096) point to [this digitised work](http://nla.gov.au/nla.obj-1874683). As they're not exact duplicates, I've left them in the results.\n",
    "\n",
    "Looking through the downloaded text files, it's clear that we're getting ephemera (particularly pamphlets and posters) as well as books. There doesn't seem to be an obvious way to filter these out up front, but of course you could filter later by the number of pages. \n",
    "\n",
    "Here's the metadata I've harvested in CSV format:\n",
    "\n",
    "* [CSV formatted file with details of digitised books](trove_digitised_books_with_ocr.csv)\n",
    "\n",
    "This file includes the following columns:\n",
    "\n",
    "* `title` – title of the work\n",
    "* `url` – link to the metadata record in Trove\n",
    "* `contributors` – pipe-separated names of contributors\n",
    "* `date` – publication date\n",
    "* `format` – the type of work, eg 'Book' or 'Government publication', can have multiple values (pipe-separated)\n",
    "* `fulltext_url` – link to the digital version\n",
    "* `trove_id` – unique identifier of the digital version\n",
    "* `language` – main language of the work\n",
    "* `rights` – copyright status\n",
    "* `pages` – number of pages\n",
    "* `form` – work format, generally one of 'Book', 'Multi volume book', or 'Digital publication'\n",
    "* `volume` – volume/part number\n",
    "* `children` – pipe-separated ids of any child works\n",
    "* `parent` – id of parent work (if any)\n",
    "* `text_downloaded` – file name of the downloaded OCR text\n",
    "* `text_file` – True/False is there any OCRd text\n",
    "\n",
    "Browse and download text files from Cloudstor:\n",
    "\n",
    "* **[26,762 text files](https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL) (about 3.6gb in total) downloaded from the books zone in August 2021.** \n",
    "\n",
    "The full list of books in digital format is also available as a [**searchable database running on Glitch**](https://trove-digital-books.glitch.me/data/trove-digital-books). It includes links to download OCRd text from CloudStor. You can use this database to filter the titles and create your own list of books. Search results can be downloaded as in CSV or JSON format.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting things up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from requests.adapters import HTTPAdapter\n",
    "from requests.packages.urllib3.util.retry import Retry\n",
    "from tqdm.auto import tqdm\n",
    "from IPython.display import display, FileLink\n",
    "import pandas as pd\n",
    "import json\n",
    "import re\n",
    "import time\n",
    "import os\n",
    "import arrow\n",
    "from copy import deepcopy\n",
    "from bs4 import BeautifulSoup\n",
    "from slugify import slugify\n",
    "import requests_cache"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = requests_cache.CachedSession()\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])\n",
    "s.mount('https://', HTTPAdapter(max_retries=retries))\n",
    "s.mount('http://', HTTPAdapter(max_retries=retries))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add your Trove API key below\n",
    "api_key = 'YOUR API KEY'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "params = {\n",
    "    'key': api_key,\n",
    "    'zone': 'book',\n",
    "    'q': '\"nla.obj\" fullTextInd:y', # API v 2.1 added the full text indicator\n",
    "    'bulkHarvest': 'true',\n",
    "    'n': 100,\n",
    "    'encoding': 'json',\n",
    "    'l-availability': 'y',\n",
    "    'l-format': 'Book',\n",
    "    'include': 'links,workversions'\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Harvest metadata using the API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_total_results():\n",
    "    '''\n",
    "    Get the total number of results for a search.\n",
    "    '''\n",
    "    these_params = params.copy()\n",
    "    these_params['n'] = 0\n",
    "    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)\n",
    "    data = response.json()\n",
    "    return int(data['response']['zone'][0]['records']['total'])\n",
    "\n",
    "\n",
    "def get_fulltext_url(links):\n",
    "    '''\n",
    "    Loop through the identifiers to find a link to the full text version of the book.\n",
    "    '''\n",
    "    url = None\n",
    "    for link in links:\n",
    "        if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:\n",
    "            url = link['value']\n",
    "            break\n",
    "    return url\n",
    "\n",
    "def get_version_record(record):\n",
    "    for version in record.get('version'):\n",
    "        for record in version['record']:\n",
    "            try:\n",
    "                if record['metadataSource'].get('value') == 'ANL:DL':\n",
    "                    return record\n",
    "            except (AttributeError, TypeError, KeyError):\n",
    "                pass\n",
    "                \n",
    "def join_list(record, key):\n",
    "    # A field may have a single value or an array.\n",
    "    # If it's an array, join the values into a string.\n",
    "    string_list = ''\n",
    "    if record:\n",
    "        value = record.get(key, [])\n",
    "        if not isinstance(value, list):\n",
    "            value = [value]\n",
    "        string_list = '|'.join(value)\n",
    "    return string_list\n",
    "\n",
    "\n",
    "def harvest_books():\n",
    "    '''\n",
    "    Harvest metadata relating to digitised books.\n",
    "    '''\n",
    "    books = []\n",
    "    total = get_total_results()\n",
    "    start = '*'\n",
    "    these_params = params.copy()\n",
    "    with tqdm(total=total) as pbar:\n",
    "        while start:\n",
    "            these_params['s'] = start\n",
    "            response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)\n",
    "            data = response.json()\n",
    "            # The nextStart parameter is used to get the next page of results.\n",
    "            # If there's no nextStart then it means we're on the last page of results.\n",
    "            try:\n",
    "                start = data['response']['zone'][0]['records']['nextStart']\n",
    "            except KeyError:\n",
    "                start = None\n",
    "            for record in data['response']['zone'][0]['records']['work']:\n",
    "                # See if there's a link to the full text version.\n",
    "                if 'identifier' in record:\n",
    "                    fulltext_url = get_fulltext_url(record['identifier'])\n",
    "                    # I'm making the assumption that if this is a booky book (not a map or music etc),\n",
    "                    # then 'Book' will appear first in the list of types.\n",
    "                    # This might not be a valid assumption.\n",
    "                    # try:\n",
    "                    #    format_type = record.get('type')[0]\n",
    "                    # except (IndexError, TypeError):\n",
    "                    #    format_type = None\n",
    "                    # Save the record if there's a full text link and it's a booky book.\n",
    "                    if fulltext_url:\n",
    "                        trove_id = re.search(r'(nla\\.obj\\-\\d+)', fulltext_url).group(1)\n",
    "                        # Get the basic metadata.\n",
    "                        book = {\n",
    "                            'title': record.get('title'),\n",
    "                            'url': record.get('troveUrl'),\n",
    "                            'contributors': join_list(record, 'contributor'),\n",
    "                            'date': record.get('issued'),\n",
    "                            'format': join_list(record, 'type'),\n",
    "                            'fulltext_url': fulltext_url,\n",
    "                            'trove_id': trove_id\n",
    "                        }\n",
    "                        # Add some extra info if avaliable\n",
    "                        version = get_version_record(record)\n",
    "                        book['language'] = join_list(version, 'language')\n",
    "                        book['rights'] = join_list(version, 'rights')\n",
    "                        books.append(book)\n",
    "                        # print(book)\n",
    "            if not response.from_cache:\n",
    "                time.sleep(0.2)\n",
    "            pbar.update(100)\n",
    "    return books"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Do the harvest!\n",
    "books = harvest_books()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "40751"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(books)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get the number of pages in each book\n",
    "\n",
    "In order to download the OCRd text we need to know the number of pages in a work. This information is not available via the API, so we have to scrape it from the work's HTML page."
   ]
  },
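  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the scraping step a bit more concrete, here's a minimal sketch of the approach using a made-up HTML fragment rather than a live page. It assumes the page embeds the work data in a `var work = JSON.parse(JSON.stringify({...}))` statement, which is the pattern the `get_work_data()` function below searches for. The `sample_work` dictionary and its `pid` values are invented for illustration; real work data contains many more fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import re\n",
    "\n",
    "# Invented work data standing in for what a digitised work's page might embed (an assumption).\n",
    "sample_work = {'form': 'Book', 'children': {'page': [{'pid': 'nla.obj-1'}, {'pid': 'nla.obj-2'}]}}\n",
    "sample_html = '<script>var work = JSON.parse(JSON.stringify(' + json.dumps(sample_work) + '));</script>'\n",
    "\n",
    "# Capture the JSON object passed to JSON.stringify().\n",
    "match = re.search(r'var work = JSON\\.parse\\(JSON\\.stringify\\((\\{.*\\})', sample_html)\n",
    "work = json.loads(match.group(1)) if match else {}\n",
    "\n",
    "# Each entry in work['children']['page'] represents one page, so the list length is the page count.\n",
    "pages = len(work.get('children', {}).get('page', []))\n",
    "print(work.get('form'), pages)"
   ]
  },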
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_work_data(url):\n",
    "    '''\n",
    "    Extract work data in a JSON string from the work's HTML page.\n",
    "    '''\n",
    "    response = s.get(url)\n",
    "    try:\n",
    "        work_data = re.search(r'var work = JSON\\.parse\\(JSON\\.stringify\\((\\{.*\\})', response.text).group(1)\n",
    "    except AttributeError:\n",
    "        work_data = '{}'\n",
    "    if not response.from_cache:\n",
    "        time.sleep(0.2)\n",
    "    return json.loads(work_data)\n",
    "\n",
    "\n",
    "def get_pages(work):\n",
    "    '''\n",
    "    Get the number of pages from the work data.\n",
    "    '''\n",
    "    try:\n",
    "        pages = len(work['children']['page'])\n",
    "    except KeyError:\n",
    "        pages = 0\n",
    "    return pages\n",
    "\n",
    "\n",
    "def get_volumes(parent_id):\n",
    "    '''\n",
    "    Get the ids of volumes that are children of the current record.\n",
    "    '''\n",
    "    start_url = 'https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c'\n",
    "    # The initial startIdx value\n",
    "    start = 0\n",
    "    # Number of results per page\n",
    "    n = 20\n",
    "    parts = []\n",
    "    # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.\n",
    "    while n == 20:\n",
    "        # Get the browse page\n",
    "        response = s.get(start_url.format(parent_id, start))\n",
    "        # Beautifulsoup turns the HTML into an easily navigable structure\n",
    "        soup = BeautifulSoup(response.text, 'lxml')\n",
    "        # Find all the divs containing issue details and loop through them\n",
    "        details = soup.find_all(class_='l-item-info')\n",
    "        for detail in details:\n",
    "            title = detail.find('h3')\n",
    "            if title:\n",
    "                issue_id = title.parent['href'].strip('/')\n",
    "            else:\n",
    "                issue_id = detail.find('a')['href'].strip('/')\n",
    "            # Get the issue id\n",
    "            parts.append(issue_id)\n",
    "        if not response.from_cache:\n",
    "            time.sleep(0.2)\n",
    "        # Increment the startIdx\n",
    "        start += n\n",
    "        # Set n to the number of results on the current page\n",
    "        n = len(details)\n",
    "    return parts\n",
    "\n",
    "\n",
    "def add_pages(books):\n",
    "    '''\n",
    "    Add the number of pages to the metadata for each book.\n",
    "    Add volumes from multi volume books.\n",
    "    '''\n",
    "    books_with_pages = []\n",
    "    for book in tqdm(books):\n",
    "        # print(book['fulltext_url'])\n",
    "        work = get_work_data(book['fulltext_url'])\n",
    "        form = work.get('form')\n",
    "        pages = get_pages(work)\n",
    "        book['pages'] = pages\n",
    "        book['form'] = form\n",
    "        book['volume'] = ''\n",
    "        book['parent'] = ''\n",
    "        book['children'] = ''\n",
    "        # Multi volume books are containers with child volumes\n",
    "        # so we have to get the ids of each individual volume and process them\n",
    "        if pages == 0 and form == 'Multi Volume Book':\n",
    "            # Get child volumes\n",
    "            volumes = get_volumes(book['trove_id'])\n",
    "            # For each volume get details and add as a new book entry\n",
    "            for index, volume_id in enumerate(volumes):\n",
    "                volume = book.copy()\n",
    "                # Add link up to the container\n",
    "                volume['parent'] = book['trove_id']\n",
    "                volume['fulltext_url'] = 'http://nla.gov.au/{}'.format(volume_id)\n",
    "                volume['trove_id'] = volume_id\n",
    "                work = get_work_data(volume['fulltext_url'])\n",
    "                form = work.get('form')\n",
    "                pages = get_pages(work)\n",
    "                volume['form'] = form\n",
    "                volume['pages'] = pages\n",
    "                volume['volume'] = str(index + 1)\n",
    "                # print(volume)\n",
    "                books_with_pages.append(volume)\n",
    "            # Add links from container to volumes\n",
    "            book['children'] = '|'.join(volumes)\n",
    "        # print(book)\n",
    "        books_with_pages.append(book)\n",
    "    return books_with_pages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add number of pages to the book metadata\n",
    "books_with_pages = add_pages(deepcopy(books))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert and save results\n",
    "\n",
    "Getting the page numbers takes quite a while, so it's a good idea to save the results to a CSV file before proceeding. That way, you won't have to repeat the process if something goes wrong and you lose the data that's sitting in memory."
   ]
  },
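  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One thing to watch if you reload the saved file: fields like `volume`, `parent`, and `children` are empty strings for most records, and by default pandas reads empty CSV fields back in as `NaN`. The sketch below uses a couple of made-up rows (not real harvest data) to show the difference that `keep_default_na=False` makes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up rows using the same empty-string default as the 'parent' column.\n",
    "sample = pd.DataFrame([\n",
    "    {'trove_id': 'nla.obj-1', 'pages': 10, 'parent': ''},\n",
    "    {'trove_id': 'nla.obj-2', 'pages': 0, 'parent': 'nla.obj-1'},\n",
    "])\n",
    "\n",
    "# Write to an in-memory CSV instead of a file on disk.\n",
    "csv_text = sample.to_csv(index=False)\n",
    "\n",
    "# With the defaults, the empty 'parent' field comes back as NaN.\n",
    "default_read = pd.read_csv(io.StringIO(csv_text))\n",
    "# With keep_default_na=False, it stays an empty string.\n",
    "strings_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)\n",
    "\n",
    "print(default_read['parent'].isna().sum())\n",
    "print((strings_read['parent'] == '').sum())"
   ]
  },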
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(books_with_pages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "      <th>contributors</th>\n",
       "      <th>date</th>\n",
       "      <th>format</th>\n",
       "      <th>fulltext_url</th>\n",
       "      <th>trove_id</th>\n",
       "      <th>language</th>\n",
       "      <th>rights</th>\n",
       "      <th>pages</th>\n",
       "      <th>form</th>\n",
       "      <th>volume</th>\n",
       "      <th>parent</th>\n",
       "      <th>children</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Goliath Joe, fisherman / by Charles Thackeray ...</td>\n",
       "      <td>https://trove.nla.gov.au/work/10013347</td>\n",
       "      <td>Thackeray, Charles</td>\n",
       "      <td>1900-1919</td>\n",
       "      <td>Book|Book/Illustrated</td>\n",
       "      <td>https://nla.gov.au/nla.obj-2831231419</td>\n",
       "      <td>nla.obj-2831231419</td>\n",
       "      <td>English</td>\n",
       "      <td>Out of Copyright|http://rightsstatements.org/v...</td>\n",
       "      <td>130</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Grammar of the Narrinyeri tribe of Australian ...</td>\n",
       "      <td>https://trove.nla.gov.au/work/10029401</td>\n",
       "      <td>Taplin, George</td>\n",
       "      <td>1878-1880</td>\n",
       "      <td>Book|Government publication</td>\n",
       "      <td>http://nla.gov.au/nla.obj-688657424</td>\n",
       "      <td>nla.obj-688657424</td>\n",
       "      <td>English</td>\n",
       "      <td>Out of Copyright|http://rightsstatements.org/v...</td>\n",
       "      <td>24</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>The works of the Rev. Sydney Smith</td>\n",
       "      <td>https://trove.nla.gov.au/work/1004403</td>\n",
       "      <td>Smith, Sydney, 1771-1845</td>\n",
       "      <td>1839-1900</td>\n",
       "      <td>Book|Book/Illustrated|Microform</td>\n",
       "      <td>https://nla.gov.au/nla.obj-630176596</td>\n",
       "      <td>nla.obj-630176596</td>\n",
       "      <td>English</td>\n",
       "      <td>No known copyright restrictions|http://rightss...</td>\n",
       "      <td>65</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Nellie Doran : a story of Australian home and ...</td>\n",
       "      <td>https://trove.nla.gov.au/work/10049667</td>\n",
       "      <td>Miriam Agatha</td>\n",
       "      <td>1914-1923</td>\n",
       "      <td>Book</td>\n",
       "      <td>http://nla.gov.au/nla.obj-24357566</td>\n",
       "      <td>nla.obj-24357566</td>\n",
       "      <td>English</td>\n",
       "      <td>Out of Copyright|http://rightsstatements.org/v...</td>\n",
       "      <td>246</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Lastkraftwagen 3 t Ford : Baumuster V 3000 S :...</td>\n",
       "      <td>https://trove.nla.gov.au/work/10053234</td>\n",
       "      <td>Germany. Heer. Heereswaffenamt</td>\n",
       "      <td>1942</td>\n",
       "      <td>Book|Book/Illustrated|Government publication</td>\n",
       "      <td>https://nla.gov.au/nla.obj-51530748</td>\n",
       "      <td>nla.obj-51530748</td>\n",
       "      <td>German</td>\n",
       "      <td>Out of Copyright|http://rightsstatements.org/v...</td>\n",
       "      <td>80</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               title  \\\n",
       "0  Goliath Joe, fisherman / by Charles Thackeray ...   \n",
       "1  Grammar of the Narrinyeri tribe of Australian ...   \n",
       "2                 The works of the Rev. Sydney Smith   \n",
       "3  Nellie Doran : a story of Australian home and ...   \n",
       "4  Lastkraftwagen 3 t Ford : Baumuster V 3000 S :...   \n",
       "\n",
       "                                      url                    contributors  \\\n",
       "0  https://trove.nla.gov.au/work/10013347              Thackeray, Charles   \n",
       "1  https://trove.nla.gov.au/work/10029401                  Taplin, George   \n",
       "2   https://trove.nla.gov.au/work/1004403        Smith, Sydney, 1771-1845   \n",
       "3  https://trove.nla.gov.au/work/10049667                   Miriam Agatha   \n",
       "4  https://trove.nla.gov.au/work/10053234  Germany. Heer. Heereswaffenamt   \n",
       "\n",
       "        date                                        format  \\\n",
       "0  1900-1919                         Book|Book/Illustrated   \n",
       "1  1878-1880                   Book|Government publication   \n",
       "2  1839-1900               Book|Book/Illustrated|Microform   \n",
       "3  1914-1923                                          Book   \n",
       "4       1942  Book|Book/Illustrated|Government publication   \n",
       "\n",
       "                            fulltext_url            trove_id language  \\\n",
       "0  https://nla.gov.au/nla.obj-2831231419  nla.obj-2831231419  English   \n",
       "1    http://nla.gov.au/nla.obj-688657424   nla.obj-688657424  English   \n",
       "2   https://nla.gov.au/nla.obj-630176596   nla.obj-630176596  English   \n",
       "3     http://nla.gov.au/nla.obj-24357566    nla.obj-24357566  English   \n",
       "4    https://nla.gov.au/nla.obj-51530748    nla.obj-51530748   German   \n",
       "\n",
       "                                              rights  pages  form volume  \\\n",
       "0  Out of Copyright|http://rightsstatements.org/v...    130  Book          \n",
       "1  Out of Copyright|http://rightsstatements.org/v...     24  Book          \n",
       "2  No known copyright restrictions|http://rightss...     65  Book          \n",
       "3  Out of Copyright|http://rightsstatements.org/v...    246  Book          \n",
       "4  Out of Copyright|http://rightsstatements.org/v...     80  Book          \n",
       "\n",
       "  parent children  \n",
       "0                  \n",
       "1                  \n",
       "2                  \n",
       "3                  \n",
       "4                  "
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(42174, 14)"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# How many records?\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(31402, 14)"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# How many have pages?\n",
    "df.loc[df['pages'] != 0].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Book                   29069\n",
       "Digital Publication     9808\n",
       "Multi Volume Book       2348\n",
       "Picture                  523\n",
       "Journal                  357\n",
       "Manuscript                36\n",
       "Other - General           14\n",
       "Map                        2\n",
       "Other - Australian         1\n",
       "Name: form, dtype: int64"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# How many of each format?\n",
    "df['form'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "English                                       25674\n",
       "                                              14284\n",
       "Chinese                                        1219\n",
       "French                                          210\n",
       "Undetermined                                    193\n",
       "German                                           92\n",
       "Japanese                                         63\n",
       "Dutch                                            56\n",
       "Australian languages                             55\n",
       "Austronesian (Other)                             55\n",
       "Italian                                          31\n",
       "Latin                                            31\n",
       "Spanish                                          22\n",
       "Maori                                            20\n",
       "Swedish                                          19\n",
       "Portuguese                                       16\n",
       "Korean                                           15\n",
       "Tahitian                                         13\n",
       "Indonesian                                       12\n",
       "Danish                                           11\n",
       "Multiple languages                                8\n",
       "Tongan                                            7\n",
       "Greek, Modern (1453- )                            7\n",
       "Finnish                                           7\n",
       "Russian                                           6\n",
       "Norwegian                                         5\n",
       "Czech                                             4\n",
       "Samoan                                            4\n",
       "Thai                                              4\n",
       "Polish                                            3\n",
       "Fijian                                            2\n",
       "Miscellaneous languages                           2\n",
       "Papiamento                                        2\n",
       "Malay                                             2\n",
       "Welsh                                             2\n",
       "Papuan (Other)                                    2\n",
       "No linguistic content                             2\n",
       "Tagalog                                           1\n",
       "Niger-Kordofanian (Other)                         1\n",
       "Sanskrit                                          1\n",
       "Javanese                                          1\n",
       "pol                                               1\n",
       "Philippine (Other)                                1\n",
       "Scottish Gaelic                                   1\n",
       "Vietnamese                                        1\n",
       "Yiddish                                           1\n",
       "Hawaiian                                          1\n",
       "Creoles and Pidgins, English-based (Other)        1\n",
       "Irish                                             1\n",
       "Gã                                                1\n",
       "Nauru                                             1\n",
       "Name: language, dtype: int64"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Breakdown by language\n",
    "df['language'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<a href='trove_digitised_books.csv' target='_blank'>trove_digitised_books.csv</a><br>"
      ],
      "text/plain": [
       "/Volumes/Workspace/mycode/glam-workbench/trove-books/notebooks/trove_digitised_books.csv"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Save as CSV\n",
    "df.to_csv('trove_digitised_books.csv', index=False)\n",
    "display(FileLink('trove_digitised_books.csv'))"
   ]
  },
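  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As noted in the introduction, more than one Trove record can point to the same digitised work, and ephemera such as pamphlets and posters are mixed in with the books. Here's one possible way of handling both with pandas, shown on a few invented rows. The `min_pages` threshold of 20 is an arbitrary assumption for illustration, not a tested cutoff. To try this on the harvested data, you could replace `sample` with `df`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Invented rows: two records point to the same digitised work (nla.obj-100).\n",
    "sample = pd.DataFrame([\n",
    "    {'title': 'Example A', 'trove_id': 'nla.obj-100', 'pages': 120},\n",
    "    {'title': 'Example A (another record)', 'trove_id': 'nla.obj-100', 'pages': 120},\n",
    "    {'title': 'Example poster', 'trove_id': 'nla.obj-200', 'pages': 2},\n",
    "    {'title': 'Example parent record', 'trove_id': 'nla.obj-300', 'pages': 0},\n",
    "])\n",
    "\n",
    "# Keep one row per digitised work (the first record encountered wins).\n",
    "unique_works = sample.drop_duplicates(subset='trove_id')\n",
    "\n",
    "# Keep only works with at least min_pages pages (an arbitrary, hypothetical cutoff).\n",
    "min_pages = 20\n",
    "likely_books = unique_works.loc[unique_works['pages'] >= min_pages]\n",
    "\n",
    "print(len(unique_works), len(likely_books))"
   ]
  },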
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the OCRd texts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run this cell if you need to reload the books data from the CSV\n",
    "df = pd.read_csv('trove_digitised_books.csv', keep_default_na=False)\n",
    "books_with_pages = df.to_dict('records')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "def save_ocr(books, output_dir='text'):\n",
    "    '''\n",
    "    Download the OCRd text for each book.\n",
    "    '''\n",
    "    os.makedirs(output_dir, exist_ok=True)\n",
    "    for book in tqdm(books):\n",
    "        # Default values\n",
    "        book['text_downloaded'] = False\n",
    "        book['text_file'] = ''\n",
    "        if book['pages'] != 0:       \n",
    "            # print(book['title'])\n",
    "            # The index value for the last page of an issue will be the total pages - 1\n",
    "            last_page = book['pages'] - 1\n",
    "            file_name = '{}-{}.txt'.format(slugify(str(book['title'])[:50]), book['trove_id'])\n",
    "            file_path = os.path.join(output_dir, file_name)\n",
    "            # Check to see if the file has already been harvested\n",
    "            if os.path.exists(file_path) and os.path.getsize(file_path) > 0:\n",
    "                # print('Already saved')\n",
    "                book['text_file'] = file_name\n",
    "                book['text_downloaded'] = True\n",
    "            else:\n",
    "                url = 'https://trove.nla.gov.au/{}/download?downloadOption=ocr&firstPage=0&lastPage={}'.format(book['trove_id'], last_page)\n",
    "                # print(url)\n",
    "                # Get the file\n",
    "                r = s.get(url)\n",
    "                # Check there was no error\n",
    "                if r.status_code == requests.codes.ok:\n",
    "                    # Check that the file's not empty\n",
    "                    r.encoding = 'utf-8'\n",
    "                    if len(r.text) > 0 and not r.text.isspace():\n",
    "                        # Check that the file isn't HTML (some not found pages don't return 404s)\n",
    "                        if BeautifulSoup(r.text, 'html.parser').find('html') is None:\n",
    "                            # If everything's ok, save the file\n",
    "                            with open(file_path, 'w', encoding='utf-8') as text_file:\n",
    "                                text_file.write(r.text)\n",
    "                            # print('Saved')\n",
    "                            book['text_file'] = file_name\n",
    "                            book['text_downloaded'] = True\n",
    "                if not r.from_cache:\n",
    "                    time.sleep(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "save_ocr(books_with_pages, '/Volumes/bigdata/mydata/Trove/books')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert and save updated results\n",
    "\n",
    "The new books list includes the file name of the downloaded text file (if there is one),\n",
    "and a boolean field indicating if the text has been downloaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert this to df\n",
    "df_downloaded = pd.DataFrame(books_with_pages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "      <th>contributors</th>\n",
       "      <th>date</th>\n",
       "      <th>format</th>\n",
       "      <th>fulltext_url</th>\n",
       "      <th>trove_id</th>\n",
       "      <th>language</th>\n",
       "      <th>rights</th>\n",
       "      <th>pages</th>\n",
       "      <th>form</th>\n",
       "      <th>volume</th>\n",
       "      <th>parent</th>\n",
       "      <th>children</th>\n",
       "      <th>text_downloaded</th>\n",
       "      <th>text_file</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Goliath Joe, fisherman / by Charles Thackeray ...</td>\n",
       "      <td>https://trove.nla.gov.au/work/10013347</td>\n",
       "      <td>Thackeray, Charles</td>\n",
       "      <td>1900-1919</td>\n",
       "      <td>Book|Book/Illustrated</td>\n",
       "      <td>https://nla.gov.au/nla.obj-2831231419</td>\n",
       "      <td>nla.obj-2831231419</td>\n",
       "      <td>English</td>\n",
       "      <td>Out of Copyright|http://rightsstatements.org/v...</td>\n",
       "      <td>130</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>goliath-joe-fisherman-by-charles-thackeray-wob...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Grammar of the Narrinyeri tribe of Australian ...</td>\n",
       "      <td>https://trove.nla.gov.au/work/10029401</td>\n",
       "      <td>Taplin, George</td>\n",
       "      <td>1878-1880</td>\n",
       "      <td>Book|Government publication</td>\n",
       "      <td>http://nla.gov.au/nla.obj-688657424</td>\n",
       "      <td>nla.obj-688657424</td>\n",
       "      <td>English</td>\n",
       "      <td>Out of Copyright|http://rightsstatements.org/v...</td>\n",
       "      <td>24</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>grammar-of-the-narrinyeri-tribe-of-australian-...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>The works of the Rev. Sydney Smith</td>\n",
       "      <td>https://trove.nla.gov.au/work/1004403</td>\n",
       "      <td>Smith, Sydney, 1771-1845</td>\n",
       "      <td>1839-1900</td>\n",
       "      <td>Book|Book/Illustrated|Microform</td>\n",
       "      <td>https://nla.gov.au/nla.obj-630176596</td>\n",
       "      <td>nla.obj-630176596</td>\n",
       "      <td>English</td>\n",
       "      <td>No known copyright restrictions|http://rightss...</td>\n",
       "      <td>65</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>the-works-of-the-rev-sydney-smith-nla.obj-6301...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Nellie Doran : a story of Australian home and ...</td>\n",
       "      <td>https://trove.nla.gov.au/work/10049667</td>\n",
       "      <td>Miriam Agatha</td>\n",
       "      <td>1914-1923</td>\n",
       "      <td>Book</td>\n",
       "      <td>http://nla.gov.au/nla.obj-24357566</td>\n",
       "      <td>nla.obj-24357566</td>\n",
       "      <td>English</td>\n",
       "      <td>Out of Copyright|http://rightsstatements.org/v...</td>\n",
       "      <td>246</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>nellie-doran-a-story-of-australian-home-and-sc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Lastkraftwagen 3 t Ford : Baumuster V 3000 S :...</td>\n",
       "      <td>https://trove.nla.gov.au/work/10053234</td>\n",
       "      <td>Germany. Heer. Heereswaffenamt</td>\n",
       "      <td>1942</td>\n",
       "      <td>Book|Book/Illustrated|Government publication</td>\n",
       "      <td>https://nla.gov.au/nla.obj-51530748</td>\n",
       "      <td>nla.obj-51530748</td>\n",
       "      <td>German</td>\n",
       "      <td>Out of Copyright|http://rightsstatements.org/v...</td>\n",
       "      <td>80</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>lastkraftwagen-3-t-ford-baumuster-v-3000-s-ger...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               title  \\\n",
       "0  Goliath Joe, fisherman / by Charles Thackeray ...   \n",
       "1  Grammar of the Narrinyeri tribe of Australian ...   \n",
       "2                 The works of the Rev. Sydney Smith   \n",
       "3  Nellie Doran : a story of Australian home and ...   \n",
       "4  Lastkraftwagen 3 t Ford : Baumuster V 3000 S :...   \n",
       "\n",
       "                                      url                    contributors  \\\n",
       "0  https://trove.nla.gov.au/work/10013347              Thackeray, Charles   \n",
       "1  https://trove.nla.gov.au/work/10029401                  Taplin, George   \n",
       "2   https://trove.nla.gov.au/work/1004403        Smith, Sydney, 1771-1845   \n",
       "3  https://trove.nla.gov.au/work/10049667                   Miriam Agatha   \n",
       "4  https://trove.nla.gov.au/work/10053234  Germany. Heer. Heereswaffenamt   \n",
       "\n",
       "        date                                        format  \\\n",
       "0  1900-1919                         Book|Book/Illustrated   \n",
       "1  1878-1880                   Book|Government publication   \n",
       "2  1839-1900               Book|Book/Illustrated|Microform   \n",
       "3  1914-1923                                          Book   \n",
       "4       1942  Book|Book/Illustrated|Government publication   \n",
       "\n",
       "                            fulltext_url            trove_id language  \\\n",
       "0  https://nla.gov.au/nla.obj-2831231419  nla.obj-2831231419  English   \n",
       "1    http://nla.gov.au/nla.obj-688657424   nla.obj-688657424  English   \n",
       "2   https://nla.gov.au/nla.obj-630176596   nla.obj-630176596  English   \n",
       "3     http://nla.gov.au/nla.obj-24357566    nla.obj-24357566  English   \n",
       "4    https://nla.gov.au/nla.obj-51530748    nla.obj-51530748   German   \n",
       "\n",
       "                                              rights  pages  form volume  \\\n",
       "0  Out of Copyright|http://rightsstatements.org/v...    130  Book          \n",
       "1  Out of Copyright|http://rightsstatements.org/v...     24  Book          \n",
       "2  No known copyright restrictions|http://rightss...     65  Book          \n",
       "3  Out of Copyright|http://rightsstatements.org/v...    246  Book          \n",
       "4  Out of Copyright|http://rightsstatements.org/v...     80  Book          \n",
       "\n",
       "  parent children  text_downloaded  \\\n",
       "0                             True   \n",
       "1                             True   \n",
       "2                             True   \n",
       "3                             True   \n",
       "4                             True   \n",
       "\n",
       "                                           text_file  \n",
       "0  goliath-joe-fisherman-by-charles-thackeray-wob...  \n",
       "1  grammar-of-the-narrinyeri-tribe-of-australian-...  \n",
       "2  the-works-of-the-rev-sydney-smith-nla.obj-6301...  \n",
       "3  nellie-doran-a-story-of-australian-home-and-sc...  \n",
       "4  lastkraftwagen-3-t-ford-baumuster-v-3000-s-ger...  "
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_downloaded.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(29652, 16)"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# How many have been downloaded?\n",
    "df_downloaded.loc[df_downloaded['text_downloaded'] == True].shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why is the number above different to the number of files actually downloaded? Let's have a look for duplicates.\n",
    "\n",
    "As you can see below, some digitised works are linked to from multiple metadata records. Hence there are duplicates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "      <th>contributors</th>\n",
       "      <th>date</th>\n",
       "      <th>format</th>\n",
       "      <th>fulltext_url</th>\n",
       "      <th>trove_id</th>\n",
       "      <th>language</th>\n",
       "      <th>rights</th>\n",
       "      <th>pages</th>\n",
       "      <th>form</th>\n",
       "      <th>volume</th>\n",
       "      <th>parent</th>\n",
       "      <th>children</th>\n",
       "      <th>text_downloaded</th>\n",
       "      <th>text_file</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>25788</th>\n",
       "      <td>Three weeks in Southland : being the account o...</td>\n",
       "      <td>https://trove.nla.gov.au/work/237350529</td>\n",
       "      <td>Reid, Stuart, active 1884-1885</td>\n",
       "      <td>1885</td>\n",
       "      <td>Book</td>\n",
       "      <td>https://nla.gov.au/nla.obj-101207695</td>\n",
       "      <td>nla.obj-101207695</td>\n",
       "      <td>English</td>\n",
       "      <td>Out of Copyright|http://rightsstatements.org/v...</td>\n",
       "      <td>66</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>three-weeks-in-southland-being-the-account-of-...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7469</th>\n",
       "      <td>Three weeks in Southland : being the account o...</td>\n",
       "      <td>https://trove.nla.gov.au/work/19178390</td>\n",
       "      <td>Reid, Stuart, active 1884-1885</td>\n",
       "      <td>1885</td>\n",
       "      <td>Book</td>\n",
       "      <td>http://nla.gov.au/nla.obj-101207695</td>\n",
       "      <td>nla.obj-101207695</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>66</td>\n",
       "      <td>Book</td>\n",
       "      <td>2</td>\n",
       "      <td>nla.obj-477008239</td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>three-weeks-in-southland-being-the-account-of-...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25790</th>\n",
       "      <td>A recent visit to several of the Polynesian is...</td>\n",
       "      <td>https://trove.nla.gov.au/work/237350531</td>\n",
       "      <td>Bennett, George, active 1830-1831</td>\n",
       "      <td>1831</td>\n",
       "      <td>Book</td>\n",
       "      <td>https://nla.gov.au/nla.obj-101212925</td>\n",
       "      <td>nla.obj-101212925</td>\n",
       "      <td>English</td>\n",
       "      <td>No known copyright restrictions|http://rightss...</td>\n",
       "      <td>8</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>a-recent-visit-to-several-of-the-polynesian-is...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7771</th>\n",
       "      <td>A recent visit to several of the Polynesian is...</td>\n",
       "      <td>https://trove.nla.gov.au/work/19241288</td>\n",
       "      <td>Bennett, George, active 1830-1831</td>\n",
       "      <td>1831-1832</td>\n",
       "      <td>Book/Illustrated|Book</td>\n",
       "      <td>http://nla.gov.au/nla.obj-101212925</td>\n",
       "      <td>nla.obj-101212925</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>8</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>a-recent-visit-to-several-of-the-polynesian-is...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25807</th>\n",
       "      <td>How Capt. Cook died : new light from an old book</td>\n",
       "      <td>https://trove.nla.gov.au/work/237350548</td>\n",
       "      <td></td>\n",
       "      <td>1908</td>\n",
       "      <td>Book</td>\n",
       "      <td>https://nla.gov.au/nla.obj-101227721</td>\n",
       "      <td>nla.obj-101227721</td>\n",
       "      <td>English</td>\n",
       "      <td>No known copyright restrictions|http://rightss...</td>\n",
       "      <td>10</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>how-capt-cook-died-new-light-from-an-old-book-...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37508</th>\n",
       "      <td>A Wonderful Illawarra waterfall : a rare beaut...</td>\n",
       "      <td>https://trove.nla.gov.au/work/24063846</td>\n",
       "      <td></td>\n",
       "      <td>1895</td>\n",
       "      <td>Book</td>\n",
       "      <td>http://nla.gov.au/nla.obj-99671695</td>\n",
       "      <td>nla.obj-99671695</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>1</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>a-wonderful-illawarra-waterfall-a-rare-beauty-...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25811</th>\n",
       "      <td>The Results of the census of 1871 : supplement...</td>\n",
       "      <td>https://trove.nla.gov.au/work/237350552</td>\n",
       "      <td></td>\n",
       "      <td>1873</td>\n",
       "      <td>Book</td>\n",
       "      <td>https://nla.gov.au/nla.obj-99716940</td>\n",
       "      <td>nla.obj-99716940</td>\n",
       "      <td>English</td>\n",
       "      <td>No known copyright restrictions|http://rightss...</td>\n",
       "      <td>2</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>the-results-of-the-census-of-1871-supplement-t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4099</th>\n",
       "      <td>The Results of the census of 1871 : supplement...</td>\n",
       "      <td>https://trove.nla.gov.au/work/17856108</td>\n",
       "      <td></td>\n",
       "      <td>1873</td>\n",
       "      <td>Book</td>\n",
       "      <td>http://nla.gov.au/nla.obj-99716940</td>\n",
       "      <td>nla.obj-99716940</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>2</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>the-results-of-the-census-of-1871-supplement-t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25795</th>\n",
       "      <td>Regular packets for Australia : emigration to ...</td>\n",
       "      <td>https://trove.nla.gov.au/work/237350536</td>\n",
       "      <td></td>\n",
       "      <td>1850</td>\n",
       "      <td>Book</td>\n",
       "      <td>https://nla.gov.au/nla.obj-99727992</td>\n",
       "      <td>nla.obj-99727992</td>\n",
       "      <td>English</td>\n",
       "      <td>No known copyright restrictions|http://rightss...</td>\n",
       "      <td>1</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>regular-packets-for-australia-emigration-to-po...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>909</th>\n",
       "      <td>Regular packets for Australia : emigration to ...</td>\n",
       "      <td>https://trove.nla.gov.au/work/12328620</td>\n",
       "      <td></td>\n",
       "      <td>1850</td>\n",
       "      <td>Book</td>\n",
       "      <td>http://nla.gov.au/nla.obj-99727992</td>\n",
       "      <td>nla.obj-99727992</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>1</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>True</td>\n",
       "      <td>regular-packets-for-australia-emigration-to-po...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6234 rows × 16 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                   title  \\\n",
       "25788  Three weeks in Southland : being the account o...   \n",
       "7469   Three weeks in Southland : being the account o...   \n",
       "25790  A recent visit to several of the Polynesian is...   \n",
       "7771   A recent visit to several of the Polynesian is...   \n",
       "25807   How Capt. Cook died : new light from an old book   \n",
       "...                                                  ...   \n",
       "37508  A Wonderful Illawarra waterfall : a rare beaut...   \n",
       "25811  The Results of the census of 1871 : supplement...   \n",
       "4099   The Results of the census of 1871 : supplement...   \n",
       "25795  Regular packets for Australia : emigration to ...   \n",
       "909    Regular packets for Australia : emigration to ...   \n",
       "\n",
       "                                           url  \\\n",
       "25788  https://trove.nla.gov.au/work/237350529   \n",
       "7469    https://trove.nla.gov.au/work/19178390   \n",
       "25790  https://trove.nla.gov.au/work/237350531   \n",
       "7771    https://trove.nla.gov.au/work/19241288   \n",
       "25807  https://trove.nla.gov.au/work/237350548   \n",
       "...                                        ...   \n",
       "37508   https://trove.nla.gov.au/work/24063846   \n",
       "25811  https://trove.nla.gov.au/work/237350552   \n",
       "4099    https://trove.nla.gov.au/work/17856108   \n",
       "25795  https://trove.nla.gov.au/work/237350536   \n",
       "909     https://trove.nla.gov.au/work/12328620   \n",
       "\n",
       "                            contributors       date                 format  \\\n",
       "25788     Reid, Stuart, active 1884-1885       1885                   Book   \n",
       "7469      Reid, Stuart, active 1884-1885       1885                   Book   \n",
       "25790  Bennett, George, active 1830-1831       1831                   Book   \n",
       "7771   Bennett, George, active 1830-1831  1831-1832  Book/Illustrated|Book   \n",
       "25807                                          1908                   Book   \n",
       "...                                  ...        ...                    ...   \n",
       "37508                                          1895                   Book   \n",
       "25811                                          1873                   Book   \n",
       "4099                                           1873                   Book   \n",
       "25795                                          1850                   Book   \n",
       "909                                            1850                   Book   \n",
       "\n",
       "                               fulltext_url           trove_id language  \\\n",
       "25788  https://nla.gov.au/nla.obj-101207695  nla.obj-101207695  English   \n",
       "7469    http://nla.gov.au/nla.obj-101207695  nla.obj-101207695            \n",
       "25790  https://nla.gov.au/nla.obj-101212925  nla.obj-101212925  English   \n",
       "7771    http://nla.gov.au/nla.obj-101212925  nla.obj-101212925            \n",
       "25807  https://nla.gov.au/nla.obj-101227721  nla.obj-101227721  English   \n",
       "...                                     ...                ...      ...   \n",
       "37508    http://nla.gov.au/nla.obj-99671695   nla.obj-99671695            \n",
       "25811   https://nla.gov.au/nla.obj-99716940   nla.obj-99716940  English   \n",
       "4099     http://nla.gov.au/nla.obj-99716940   nla.obj-99716940            \n",
       "25795   https://nla.gov.au/nla.obj-99727992   nla.obj-99727992  English   \n",
       "909      http://nla.gov.au/nla.obj-99727992   nla.obj-99727992            \n",
       "\n",
       "                                                  rights  pages  form volume  \\\n",
       "25788  Out of Copyright|http://rightsstatements.org/v...     66  Book          \n",
       "7469                                                         66  Book      2   \n",
       "25790  No known copyright restrictions|http://rightss...      8  Book          \n",
       "7771                                                          8  Book          \n",
       "25807  No known copyright restrictions|http://rightss...     10  Book          \n",
       "...                                                  ...    ...   ...    ...   \n",
       "37508                                                         1  Book          \n",
       "25811  No known copyright restrictions|http://rightss...      2  Book          \n",
       "4099                                                          2  Book          \n",
       "25795  No known copyright restrictions|http://rightss...      1  Book          \n",
       "909                                                           1  Book          \n",
       "\n",
       "                  parent children  text_downloaded  \\\n",
       "25788                                         True   \n",
       "7469   nla.obj-477008239                      True   \n",
       "25790                                         True   \n",
       "7771                                          True   \n",
       "25807                                         True   \n",
       "...                  ...      ...              ...   \n",
       "37508                                         True   \n",
       "25811                                         True   \n",
       "4099                                          True   \n",
       "25795                                         True   \n",
       "909                                           True   \n",
       "\n",
       "                                               text_file  \n",
       "25788  three-weeks-in-southland-being-the-account-of-...  \n",
       "7469   three-weeks-in-southland-being-the-account-of-...  \n",
       "25790  a-recent-visit-to-several-of-the-polynesian-is...  \n",
       "7771   a-recent-visit-to-several-of-the-polynesian-is...  \n",
       "25807  how-capt-cook-died-new-light-from-an-old-book-...  \n",
       "...                                                  ...  \n",
       "37508  a-wonderful-illawarra-waterfall-a-rare-beauty-...  \n",
       "25811  the-results-of-the-census-of-1871-supplement-t...  \n",
       "4099   the-results-of-the-census-of-1871-supplement-t...  \n",
       "25795  regular-packets-for-australia-emigration-to-po...  \n",
       "909    regular-packets-for-australia-emigration-to-po...  \n",
       "\n",
       "[6234 rows x 16 columns]"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_downloaded.loc[df_downloaded.duplicated('trove_id', keep=False) == True].sort_values('trove_id')"
   ]
  },
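  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The gap between records and files can be checked directly. The cell below is a minimal sketch, assuming `df_downloaded` has been built as above: it compares the number of records flagged as downloaded with the number of distinct `trove_id` and `text_file` values among them. The variable name `downloaded` is introduced here for illustration only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: compare flagged records with distinct digitised works and file names\n",
    "downloaded = df_downloaded.loc[df_downloaded['text_downloaded'] == True]\n",
    "print('Records flagged as downloaded:', downloaded.shape[0])\n",
    "print('Distinct trove_id values:', downloaded['trove_id'].nunique())\n",
    "# File names combine a title slug with the trove_id, so this is the expected number of files in the output directory\n",
    "print('Distinct text_file values:', downloaded['text_file'].nunique())"
   ]
  },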
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<a href='trove_digitised_books_with_ocr.csv' target='_blank'>trove_digitised_books_with_ocr.csv</a><br>"
      ],
      "text/plain": [
       "/Volumes/Workspace/mycode/glam-workbench/trove-books/notebooks/trove_digitised_books_with_ocr.csv"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Save as CSV\n",
    "df_downloaded.to_csv('trove_digitised_books_with_ocr.csv', index=False)\n",
    "display(FileLink('trove_digitised_books_with_ocr.csv'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a searchable database using Datasette\n",
    "\n",
    "To make it easy to explore the list of books, let's load the CSV file into Datasette. First we'll drop some columns, do some reordering, and add links to the downloaded text files stored on CloudStor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_datasette = df_downloaded.copy()\n",
    "\n",
    "# Add link to Cloudstor\n",
    "df_datasette['cloudstor_url'] = df_datasette.loc[df_datasette['text_downloaded'] == True]['text_file'].apply(lambda x: f'https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL/download?path={x}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remove some columns that aren't going to be useful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_datasette = df_datasette[['title', 'contributors', 'date', 'format', 'language', 'rights', 'pages', 'url', 'fulltext_url', 'cloudstor_url', 'form', 'volume', 'parent', 'children']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rename columns for clarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>contributors</th>\n",
       "      <th>date</th>\n",
       "      <th>format</th>\n",
       "      <th>language</th>\n",
       "      <th>copyright</th>\n",
       "      <th>pages</th>\n",
       "      <th>view_details_url</th>\n",
       "      <th>view_book_url</th>\n",
       "      <th>download_text_url</th>\n",
       "      <th>form</th>\n",
       "      <th>volume</th>\n",
       "      <th>parent</th>\n",
       "      <th>children</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Goliath Joe, fisherman / by Charles Thackeray ...</td>\n",
       "      <td>Thackeray, Charles</td>\n",
       "      <td>1900-1919</td>\n",
       "      <td>Book|Book/Illustrated</td>\n",
       "      <td>English</td>\n",
       "      <td>Out of Copyright|http://rightsstatements.org/v...</td>\n",
       "      <td>130</td>\n",
       "      <td>https://trove.nla.gov.au/work/10013347</td>\n",
       "      <td>https://nla.gov.au/nla.obj-2831231419</td>\n",
       "      <td>https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd...</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Grammar of the Narrinyeri tribe of Australian ...</td>\n",
       "      <td>Taplin, George</td>\n",
       "      <td>1878-1880</td>\n",
       "      <td>Book|Government publication</td>\n",
       "      <td>English</td>\n",
       "      <td>Out of Copyright|http://rightsstatements.org/v...</td>\n",
       "      <td>24</td>\n",
       "      <td>https://trove.nla.gov.au/work/10029401</td>\n",
       "      <td>http://nla.gov.au/nla.obj-688657424</td>\n",
       "      <td>https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd...</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>The works of the Rev. Sydney Smith</td>\n",
       "      <td>Smith, Sydney, 1771-1845</td>\n",
       "      <td>1839-1900</td>\n",
       "      <td>Book|Book/Illustrated|Microform</td>\n",
       "      <td>English</td>\n",
       "      <td>No known copyright restrictions|http://rightss...</td>\n",
       "      <td>65</td>\n",
       "      <td>https://trove.nla.gov.au/work/1004403</td>\n",
       "      <td>https://nla.gov.au/nla.obj-630176596</td>\n",
       "      <td>https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd...</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Nellie Doran : a story of Australian home and ...</td>\n",
       "      <td>Miriam Agatha</td>\n",
       "      <td>1914-1923</td>\n",
       "      <td>Book</td>\n",
       "      <td>English</td>\n",
       "      <td>Out of Copyright|http://rightsstatements.org/v...</td>\n",
       "      <td>246</td>\n",
       "      <td>https://trove.nla.gov.au/work/10049667</td>\n",
       "      <td>http://nla.gov.au/nla.obj-24357566</td>\n",
       "      <td>https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd...</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Lastkraftwagen 3 t Ford : Baumuster V 3000 S :...</td>\n",
       "      <td>Germany. Heer. Heereswaffenamt</td>\n",
       "      <td>1942</td>\n",
       "      <td>Book|Book/Illustrated|Government publication</td>\n",
       "      <td>German</td>\n",
       "      <td>Out of Copyright|http://rightsstatements.org/v...</td>\n",
       "      <td>80</td>\n",
       "      <td>https://trove.nla.gov.au/work/10053234</td>\n",
       "      <td>https://nla.gov.au/nla.obj-51530748</td>\n",
       "      <td>https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd...</td>\n",
       "      <td>Book</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               title  \\\n",
       "0  Goliath Joe, fisherman / by Charles Thackeray ...   \n",
       "1  Grammar of the Narrinyeri tribe of Australian ...   \n",
       "2                 The works of the Rev. Sydney Smith   \n",
       "3  Nellie Doran : a story of Australian home and ...   \n",
       "4  Lastkraftwagen 3 t Ford : Baumuster V 3000 S :...   \n",
       "\n",
       "                     contributors       date  \\\n",
       "0              Thackeray, Charles  1900-1919   \n",
       "1                  Taplin, George  1878-1880   \n",
       "2        Smith, Sydney, 1771-1845  1839-1900   \n",
       "3                   Miriam Agatha  1914-1923   \n",
       "4  Germany. Heer. Heereswaffenamt       1942   \n",
       "\n",
       "                                         format language  \\\n",
       "0                         Book|Book/Illustrated  English   \n",
       "1                   Book|Government publication  English   \n",
       "2               Book|Book/Illustrated|Microform  English   \n",
       "3                                          Book  English   \n",
       "4  Book|Book/Illustrated|Government publication   German   \n",
       "\n",
       "                                           copyright  pages  \\\n",
       "0  Out of Copyright|http://rightsstatements.org/v...    130   \n",
       "1  Out of Copyright|http://rightsstatements.org/v...     24   \n",
       "2  No known copyright restrictions|http://rightss...     65   \n",
       "3  Out of Copyright|http://rightsstatements.org/v...    246   \n",
       "4  Out of Copyright|http://rightsstatements.org/v...     80   \n",
       "\n",
       "                         view_details_url  \\\n",
       "0  https://trove.nla.gov.au/work/10013347   \n",
       "1  https://trove.nla.gov.au/work/10029401   \n",
       "2   https://trove.nla.gov.au/work/1004403   \n",
       "3  https://trove.nla.gov.au/work/10049667   \n",
       "4  https://trove.nla.gov.au/work/10053234   \n",
       "\n",
       "                           view_book_url  \\\n",
       "0  https://nla.gov.au/nla.obj-2831231419   \n",
       "1    http://nla.gov.au/nla.obj-688657424   \n",
       "2   https://nla.gov.au/nla.obj-630176596   \n",
       "3     http://nla.gov.au/nla.obj-24357566   \n",
       "4    https://nla.gov.au/nla.obj-51530748   \n",
       "\n",
       "                                   download_text_url  form volume parent  \\\n",
       "0  https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd...  Book                 \n",
       "1  https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd...  Book                 \n",
       "2  https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd...  Book                 \n",
       "3  https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd...  Book                 \n",
       "4  https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd...  Book                 \n",
       "\n",
       "  children  \n",
       "0           \n",
       "1           \n",
       "2           \n",
       "3           \n",
       "4           "
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_datasette.columns = ['title', 'contributors', 'date', 'format', 'language', 'copyright', 'pages', 'view_details_url', 'view_book_url', 'download_text_url', 'form', 'volume', 'parent', 'children']\n",
    "df_datasette.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_datasette.to_csv('trove-digital-books-datasette.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[This post](https://101dhhacks.net/share-searchable-csvs/) describes how you can load your CSV files into Datasette using Glitch. Here's the result – [a searchable database of Trove books available in digital form](https://trove-digital-books.glitch.me/data/trove-digital-books)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Some leftover bits used for renaming the text files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rename files to include truncated title of book\n",
    "for row in df.itertuples():\n",
    "    try:\n",
    "        os.rename(os.path.join('text', '{}.txt'.format(row.book_id)), os.path.join('text', '{}-{}.txt'.format(slugify(row.title[:50]), row.book_id)))\n",
    "    except FileNotFoundError:\n",
    "        pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert all filenames back to just nla.obj- form\n",
    "for filename in [f for f in os.listdir('text') if f.endswith('.txt')]:\n",
    "    try:\n",
    "        objname = re.search(r'.*(nla\\.obj.*)', filename).group(1)\n",
    "    except AttributeError:\n",
    "        # No nla.obj identifier in this filename, so report it and skip the rename\n",
    "        print(filename)\n",
    "        continue\n",
    "    os.rename(os.path.join('text', filename), os.path.join('text', objname))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io/).\n",
    "\n",
    "Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}