{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvesting the text of digitised books (and ephemera)\n", "\n", "This notebook harvests metadata and OCRd text from digitised books in Trove. There are three main steps:\n", "\n", "* Harvest metadata of digitised books using the Trove API\n", "* Extract the number of pages for each book from the Trove web interface (the number of pages is necessary to download the OCRd text)\n", "* Download the OCRd text for each book\n", "\n", "It's not easy to identify all the digitised books with OCRd text in Trove. I'm starting with a [search in the book zone](https://trove.nla.gov.au/search/category/books?keyword=%22nla.obj%22&l-availability=y&l-format=Book) for books that include the phrase `\"nla.obj\"` and are available online. This currently returns 65,050 results in the web interface. In amongst these are a range of government papers and reports, as well as publications where access to the digital copy is restricted. I think these are mostly recent books submitted in digital form under legal deposit. Things are made even more confusing by the fact that the contents of the 'Books & libraries' category in the web interface are not the same as the contents of the API's book zone. Anyway, I've used the new `fullTextInd` index to try to filter out works without any OCRd text. This reduces the total to 40,751 results.\n", "\n", "But some of those 40,751 results are actually parent records that contain multiple volumes or parts. When I find the number of pages in each book, I'm also checking to see if the record is a 'Multi Volume Book' and has child works. If it does, I add the child works to the list of books. After this stage there are 42,174 works. However, not all of these 42,174 records have OCRd text. Parent records of multi volume works, and ebook formats like PDFs or MOBI, don't have individual pages, and therefore don't have any text to download. 
If we exclude works without pages, there are 31,402 works that might have some OCRd text to download.\n", "\n", "After downloading all the OCRd text files (ignoring any that were empty) I ended up with a grand total of **26,762 files**.\n", "\n", "If you compare the number of downloaded files to the number in the CSV file that are identified as having OCRd text you'll notice a difference – 26,762 compared to 29,652. After a bit more poking around I realised that there are some duplicates in the list of works. This seems to be because more than one Trove metadata record can point to the same digitised work. For example, both [this record](https://trove.nla.gov.au/work/192090169) and [this record](https://trove.nla.gov.au/work/31771096) point to [this digitised work](http://nla.gov.au/nla.obj-1874683). As they're not exact duplicates, I've left them in the results.\n", "\n", "Looking through the downloaded text files, it's clear that we're getting ephemera (particularly pamphlets and posters) as well as books. There doesn't seem to be an obvious way to filter these out up front, but of course you could filter later by the number of pages. 
\n", "\n", "Here's the metadata I've harvested in CSV format:\n", "\n", "* [CSV formatted file with details of digitised books](trove_digitised_books_with_ocr.csv)\n", "\n", "This file includes the following columns:\n", "\n", "* `title` – title of the work\n", "* `url` – link to the metadata record in Trove\n", "* `contributors` – pipe-separated names of contributors\n", "* `date` – publication date\n", "* `format` – the type of work, e.g. 'Book' or 'Government publication', can have multiple values (pipe-separated)\n", "* `fulltext_url` – link to the digital version\n", "* `trove_id` – unique identifier of the digital version\n", "* `language` – main language of the work\n", "* `rights` – copyright status\n", "* `pages` – number of pages\n", "* `form` – work format, generally one of 'Book', 'Multi Volume Book', or 'Digital Publication'\n", "* `volume` – volume/part number\n", "* `children` – pipe-separated ids of any child works\n", "* `parent` – id of parent work (if any)\n", "* `text_downloaded` – True/False, has the OCRd text been downloaded?\n", "* `text_file` – file name of the downloaded OCRd text (if any)\n", "\n", "Browse and download text files from CloudStor:\n", "\n", "* **[26,762 text files](https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL) (about 3.6 GB in total) downloaded from the book zone in August 2021.** \n", "\n", "The full list of books in digital format is also available as a [**searchable database running on Glitch**](https://trove-digital-books.glitch.me/data/trove-digital-books). It includes links to download OCRd text from CloudStor. You can use this database to filter the titles and create your own list of books. 
Search results can be downloaded in CSV or JSON format.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting things up" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from requests.adapters import HTTPAdapter\n", "from urllib3.util.retry import Retry\n", "from tqdm.auto import tqdm\n", "from IPython.display import display, FileLink\n", "import pandas as pd\n", "import json\n", "import re\n", "import time\n", "import os\n", "import arrow\n", "from copy import deepcopy\n", "from bs4 import BeautifulSoup\n", "from slugify import slugify\n", "import requests_cache" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Cache responses and retry automatically on gateway errors\n", "s = requests_cache.CachedSession()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount('https://', HTTPAdapter(max_retries=retries))\n", "s.mount('http://', HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "# Add your Trove API key below\n", "api_key = 'YOUR API KEY'" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "params = {\n", "    'key': api_key,\n", "    'zone': 'book',\n", "    'q': '\"nla.obj\" fullTextInd:y', # API v 2.1 added the full text indicator\n", "    'bulkHarvest': 'true',\n", "    'n': 100,\n", "    'encoding': 'json',\n", "    'l-availability': 'y',\n", "    'l-format': 'Book',\n", "    'include': 'links,workversions'\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Harvest metadata using the API" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "def get_total_results():\n", "    '''\n", "    Get the total number of results for a search.\n", "    '''\n", "    these_params = params.copy()\n", "    these_params['n'] = 0\n",
 "    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)\n", "    data = response.json()\n", "    return int(data['response']['zone'][0]['records']['total'])\n", "\n", "\n", "def get_fulltext_url(links):\n", "    '''\n", "    Loop through the identifiers to find a link to the full text version of the book.\n", "    '''\n", "    url = None\n", "    for link in links:\n", "        if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:\n", "            url = link['value']\n", "            break\n", "    return url\n", "\n", "\n", "def get_version_record(record):\n", "    '''\n", "    Return the first version record whose metadataSource is 'ANL:DL',\n", "    or None if there isn't one.\n", "    '''\n", "    for version in record.get('version', []):\n", "        for version_record in version['record']:\n", "            try:\n", "                if version_record['metadataSource'].get('value') == 'ANL:DL':\n", "                    return version_record\n", "            except (AttributeError, TypeError, KeyError):\n", "                pass\n", "\n", "\n", "def join_list(record, key):\n", "    # A field may have a single value or an array.\n", "    # If it's an array, join the values into a string.\n", "    string_list = ''\n", "    if record:\n", "        value = record.get(key, [])\n", "        if not isinstance(value, list):\n", "            value = [value]\n", "        string_list = '|'.join(value)\n", "    return string_list\n", "\n", "\n", "def harvest_books():\n", "    '''\n", "    Harvest metadata relating to digitised books.\n", "    '''\n", "    books = []\n", "    total = get_total_results()\n", "    start = '*'\n", "    these_params = params.copy()\n", "    with tqdm(total=total) as pbar:\n", "        while start:\n", "            these_params['s'] = start\n", "            response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)\n", "            data = response.json()\n", "            # The nextStart parameter is used to get the next page of results.\n", "            # If there's no nextStart then it means we're on the last page of results.\n", "            try:\n", "                start = data['response']['zone'][0]['records']['nextStart']\n", "            except KeyError:\n", "                start = None\n", "            for record in data['response']['zone'][0]['records']['work']:\n", "                # See if there's a link to the full text version.\n", "                if 'identifier' in record:\n", "                    fulltext_url = get_fulltext_url(record['identifier'])\n",
 "                    # I'm making the assumption that if this is a booky book (not a map or music etc),\n", "                    # then 'Book' will appear first in the list of types.\n", "                    # This might not be a valid assumption, so the check is commented out.\n", "                    # try:\n", "                    #     format_type = record.get('type')[0]\n", "                    # except (IndexError, TypeError):\n", "                    #     format_type = None\n", "                    # Save the record if there's a full text link.\n", "                    if fulltext_url:\n", "                        trove_id = re.search(r'(nla\\.obj\\-\\d+)', fulltext_url).group(1)\n", "                        # Get the basic metadata.\n", "                        book = {\n", "                            'title': record.get('title'),\n", "                            'url': record.get('troveUrl'),\n", "                            'contributors': join_list(record, 'contributor'),\n", "                            'date': record.get('issued'),\n", "                            'format': join_list(record, 'type'),\n", "                            'fulltext_url': fulltext_url,\n", "                            'trove_id': trove_id\n", "                        }\n", "                        # Add some extra info if available\n", "                        version = get_version_record(record)\n", "                        book['language'] = join_list(version, 'language')\n", "                        book['rights'] = join_list(version, 'rights')\n", "                        books.append(book)\n", "                        # print(book)\n", "            if not response.from_cache:\n", "                time.sleep(0.2)\n", "            pbar.update(100)\n", "    return books" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Do the harvest!\n", "books = harvest_books()" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "40751" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(books)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the number of pages in each book\n", "\n", "In order to download the OCRd text we need to know the number of pages in a work. This information is not available via the API, so we have to scrape it from the work's HTML page."
 ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "def get_work_data(url):\n", "    '''\n", "    Extract the work data embedded as a JSON string in the work's HTML page.\n", "    '''\n", "    response = s.get(url)\n", "    try:\n", "        work_data = re.search(r'var work = JSON\\.parse\\(JSON\\.stringify\\((\\{.*\\})', response.text).group(1)\n", "    except AttributeError:\n", "        # No embedded work data found\n", "        work_data = '{}'\n", "    if not response.from_cache:\n", "        time.sleep(0.2)\n", "    return json.loads(work_data)\n", "\n", "\n", "def get_pages(work):\n", "    '''\n", "    Get the number of pages from the work data.\n", "    '''\n", "    try:\n", "        pages = len(work['children']['page'])\n", "    except (KeyError, TypeError):\n", "        pages = 0\n", "    return pages\n", "\n", "\n", "def get_volumes(parent_id):\n", "    '''\n", "    Get the ids of volumes that are children of the current record.\n", "    '''\n", "    start_url = 'https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c'\n", "    # The initial startIdx value\n", "    start = 0\n", "    # Number of results per page\n", "    n = 20\n", "    parts = []\n", "    # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.\n", "    while n == 20:\n", "        # Get the browse page\n", "        response = s.get(start_url.format(parent_id, start))\n", "        # Beautifulsoup turns the HTML into an easily navigable structure\n", "        soup = BeautifulSoup(response.text, 'lxml')\n", "        # Find all the divs containing volume details and loop through them\n", "        details = soup.find_all(class_='l-item-info')\n", "        for detail in details:\n", "            # Get the volume id from the link\n", "            title = detail.find('h3')\n", "            if title:\n", "                issue_id = title.parent['href'].strip('/')\n", "            else:\n", "                issue_id = detail.find('a')['href'].strip('/')\n", "            parts.append(issue_id)\n", "        if not response.from_cache:\n", "            time.sleep(0.2)\n", "        # Increment the startIdx\n", "        start += n\n", "        # Set n to the number of results on the current page\n", "        n = len(details)\n", "    return parts\n", "\n", "\n",
 "def add_pages(books):\n", "    '''\n", "    Add the number of pages to the metadata for each book.\n", "    Add volumes from multi volume books.\n", "    '''\n", "    books_with_pages = []\n", "    for book in tqdm(books):\n", "        # print(book['fulltext_url'])\n", "        work = get_work_data(book['fulltext_url'])\n", "        form = work.get('form')\n", "        pages = get_pages(work)\n", "        book['pages'] = pages\n", "        book['form'] = form\n", "        book['volume'] = ''\n", "        book['parent'] = ''\n", "        book['children'] = ''\n", "        # Multi volume books are containers with child volumes\n", "        # so we have to get the ids of each individual volume and process them\n", "        if pages == 0 and form == 'Multi Volume Book':\n", "            # Get child volumes\n", "            volumes = get_volumes(book['trove_id'])\n", "            # For each volume get details and add as a new book entry\n", "            for index, volume_id in enumerate(volumes):\n", "                volume = book.copy()\n", "                # Add link up to the container\n", "                volume['parent'] = book['trove_id']\n", "                volume['fulltext_url'] = 'http://nla.gov.au/{}'.format(volume_id)\n", "                volume['trove_id'] = volume_id\n", "                work = get_work_data(volume['fulltext_url'])\n", "                form = work.get('form')\n", "                pages = get_pages(work)\n", "                volume['form'] = form\n", "                volume['pages'] = pages\n", "                volume['volume'] = str(index + 1)\n", "                # print(volume)\n", "                books_with_pages.append(volume)\n", "            # Add links from container to volumes\n", "            book['children'] = '|'.join(volumes)\n", "        # print(book)\n", "        books_with_pages.append(book)\n", "    return books_with_pages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add number of pages to the book metadata\n", "books_with_pages = add_pages(deepcopy(books))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert and save results\n", "\n", "Getting the number of pages takes quite a while, so it's a good idea to save the results to a CSV file before proceeding. 
That way, you won't have to repeat the process if something goes wrong and you lose the data that's sitting in memory." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(books_with_pages)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title \\\n", "0 Goliath Joe, fisherman / by Charles Thackeray ... \n", "1 Grammar of the Narrinyeri tribe of Australian ... \n", "2 The works of the Rev. Sydney Smith \n", "3 Nellie Doran : a story of Australian home and ... \n", "4 Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... \n", "\n", " url contributors \\\n", "0 https://trove.nla.gov.au/work/10013347 Thackeray, Charles \n", "1 https://trove.nla.gov.au/work/10029401 Taplin, George \n", "2 https://trove.nla.gov.au/work/1004403 Smith, Sydney, 1771-1845 \n", "3 https://trove.nla.gov.au/work/10049667 Miriam Agatha \n", "4 https://trove.nla.gov.au/work/10053234 Germany. Heer. Heereswaffenamt \n", "\n", " date format \\\n", "0 1900-1919 Book|Book/Illustrated \n", "1 1878-1880 Book|Government publication \n", "2 1839-1900 Book|Book/Illustrated|Microform \n", "3 1914-1923 Book \n", "4 1942 Book|Book/Illustrated|Government publication \n", "\n", " fulltext_url trove_id language \\\n", "0 https://nla.gov.au/nla.obj-2831231419 nla.obj-2831231419 English \n", "1 http://nla.gov.au/nla.obj-688657424 nla.obj-688657424 English \n", "2 https://nla.gov.au/nla.obj-630176596 nla.obj-630176596 English \n", "3 http://nla.gov.au/nla.obj-24357566 nla.obj-24357566 English \n", "4 https://nla.gov.au/nla.obj-51530748 nla.obj-51530748 German \n", "\n", " rights pages form volume \\\n", "0 Out of Copyright|http://rightsstatements.org/v... 130 Book \n", "1 Out of Copyright|http://rightsstatements.org/v... 24 Book \n", "2 No known copyright restrictions|http://rightss... 65 Book \n", "3 Out of Copyright|http://rightsstatements.org/v... 246 Book \n", "4 Out of Copyright|http://rightsstatements.org/v... 
80 Book \n", "\n", " parent children \n", "0 \n", "1 \n", "2 \n", "3 \n", "4 " ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(42174, 14)" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How many records?\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(31402, 14)" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How many have pages?\n", "df.loc[df['pages'] != 0].shape" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Book 29069\n", "Digital Publication 9808\n", "Multi Volume Book 2348\n", "Picture 523\n", "Journal 357\n", "Manuscript 36\n", "Other - General 14\n", "Map 2\n", "Other - Australian 1\n", "Name: form, dtype: int64" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How many of each format?\n", "df['form'].value_counts()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "English 25674\n", " 14284\n", "Chinese 1219\n", "French 210\n", "Undetermined 193\n", "German 92\n", "Japanese 63\n", "Dutch 56\n", "Australian languages 55\n", "Austronesian (Other) 55\n", "Italian 31\n", "Latin 31\n", "Spanish 22\n", "Maori 20\n", "Swedish 19\n", "Portuguese 16\n", "Korean 15\n", "Tahitian 13\n", "Indonesian 12\n", "Danish 11\n", "Multiple languages 8\n", "Tongan 7\n", "Greek, Modern (1453- ) 7\n", "Finnish 7\n", "Russian 6\n", "Norwegian 5\n", "Czech 4\n", "Samoan 4\n", "Thai 4\n", "Polish 3\n", "Fijian 2\n", "Miscellaneous languages 2\n", "Papiamento 2\n", "Malay 2\n", "Welsh 2\n", "Papuan (Other) 2\n", "No linguistic content 2\n", "Tagalog 1\n", "Niger-Kordofanian (Other) 
1\n", "Sanskrit 1\n", "Javanese 1\n", "pol 1\n", "Philippine (Other) 1\n", "Scottish Gaelic 1\n", "Vietnamese 1\n", "Yiddish 1\n", "Hawaiian 1\n", "Creoles and Pidgins, English-based (Other) 1\n", "Irish 1\n", "Gã 1\n", "Nauru 1\n", "Name: language, dtype: int64" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Breakdown by language\n", "df['language'].value_counts()" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "trove_digitised_books.csv" ], "text/plain": [ "/Volumes/Workspace/mycode/glam-workbench/trove-books/notebooks/trove_digitised_books.csv" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Save as CSV\n", "df.to_csv('trove_digitised_books.csv', index=False)\n", "display(FileLink('trove_digitised_books.csv'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download the OCRd texts" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# Run this cell if you need to reload the books data from the CSV\n", "df = pd.read_csv('trove_digitised_books.csv', keep_default_na=False)\n", "books_with_pages = df.to_dict('records')" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "def save_ocr(books, output_dir='text'):\n", "    '''\n", "    Download the OCRd text for each book.\n", "    '''\n", "    os.makedirs(output_dir, exist_ok=True)\n", "    for book in tqdm(books):\n", "        # Default values\n", "        book['text_downloaded'] = False\n", "        book['text_file'] = ''\n", "        if book['pages'] != 0:\n", "            # print(book['title'])\n", "            # The index value for the last page of a book will be the total pages - 1\n", "            last_page = book['pages'] - 1\n", "            file_name = '{}-{}.txt'.format(slugify(str(book['title'])[:50]), book['trove_id'])\n", "            file_path = os.path.join(output_dir, file_name)\n", "            # Check to see if the file has already been harvested\n", "            if os.path.exists(file_path) and os.path.getsize(file_path) > 0:\n", "                # print('Already saved')\n", "                book['text_file'] = file_name\n", "                book['text_downloaded'] = True\n", "            else:\n", "                url = 'https://trove.nla.gov.au/{}/download?downloadOption=ocr&firstPage=0&lastPage={}'.format(book['trove_id'], last_page)\n", "                # print(url)\n", "                # Get the file\n", "                r = s.get(url)\n", "                # Check there was no error\n", "                if r.status_code == requests.codes.ok:\n", "                    # Check that the file's not empty\n", "                    r.encoding = 'utf-8'\n", "                    if len(r.text) > 0 and not r.text.isspace():\n",
 "                        # Check that the file isn't HTML (some not found pages don't return 404s)\n", "                        if BeautifulSoup(r.text, 'html.parser').find('html') is None:\n", "                            # If everything's ok, save the file\n", "                            with open(file_path, 'w', encoding='utf-8') as text_file:\n", "                                text_file.write(r.text)\n", "                            # print('Saved')\n", "                            book['text_file'] = file_name\n", "                            book['text_downloaded'] = True\n", "                if not r.from_cache:\n", "                    time.sleep(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "save_ocr(books_with_pages, '/Volumes/bigdata/mydata/Trove/books')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert and save updated results\n", "\n", "The new books list includes the file name of the downloaded text file (if there is one),\n", "and a boolean field indicating if the text has been downloaded." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "# Convert the updated list to a DataFrame\n", "df_downloaded = pd.DataFrame(books_with_pages)" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title \\\n", "0 Goliath Joe, fisherman / by Charles Thackeray ... \n", "1 Grammar of the Narrinyeri tribe of Australian ... \n", "2 The works of the Rev. Sydney Smith \n", "3 Nellie Doran : a story of Australian home and ... \n", "4 Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... \n", "\n", " url contributors \\\n", "0 https://trove.nla.gov.au/work/10013347 Thackeray, Charles \n", "1 https://trove.nla.gov.au/work/10029401 Taplin, George \n", "2 https://trove.nla.gov.au/work/1004403 Smith, Sydney, 1771-1845 \n", "3 https://trove.nla.gov.au/work/10049667 Miriam Agatha \n", "4 https://trove.nla.gov.au/work/10053234 Germany. Heer. Heereswaffenamt \n", "\n", " date format \\\n", "0 1900-1919 Book|Book/Illustrated \n", "1 1878-1880 Book|Government publication \n", "2 1839-1900 Book|Book/Illustrated|Microform \n", "3 1914-1923 Book \n", "4 1942 Book|Book/Illustrated|Government publication \n", "\n", " fulltext_url trove_id language \\\n", "0 https://nla.gov.au/nla.obj-2831231419 nla.obj-2831231419 English \n", "1 http://nla.gov.au/nla.obj-688657424 nla.obj-688657424 English \n", "2 https://nla.gov.au/nla.obj-630176596 nla.obj-630176596 English \n", "3 http://nla.gov.au/nla.obj-24357566 nla.obj-24357566 English \n", "4 https://nla.gov.au/nla.obj-51530748 nla.obj-51530748 German \n", "\n", " rights pages form volume \\\n", "0 Out of Copyright|http://rightsstatements.org/v... 130 Book \n", "1 Out of Copyright|http://rightsstatements.org/v... 24 Book \n", "2 No known copyright restrictions|http://rightss... 65 Book \n", "3 Out of Copyright|http://rightsstatements.org/v... 246 Book \n", "4 Out of Copyright|http://rightsstatements.org/v... 80 Book \n", "\n", " parent children text_downloaded \\\n", "0 True \n", "1 True \n", "2 True \n", "3 True \n", "4 True \n", "\n", " text_file \n", "0 goliath-joe-fisherman-by-charles-thackeray-wob... \n", "1 grammar-of-the-narrinyeri-tribe-of-australian-... \n", "2 the-works-of-the-rev-sydney-smith-nla.obj-6301... 
\n", "3 nellie-doran-a-story-of-australian-home-and-sc... \n", "4 lastkraftwagen-3-t-ford-baumuster-v-3000-s-ger... " ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_downloaded.head()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(29652, 16)" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How many have been downloaded?\n", "df_downloaded.loc[df_downloaded['text_downloaded'] == True].shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why is the number above different to the number of files actually downloaded? Let's have a look for duplicates.\n", "\n", "As you can see below, some digitised works are linked to from multiple metadata records. Hence there are duplicates." ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
\n", "

6234 rows × 16 columns

\n", "
" ], "text/plain": [ " title \\\n", "25788 Three weeks in Southland : being the account o... \n", "7469 Three weeks in Southland : being the account o... \n", "25790 A recent visit to several of the Polynesian is... \n", "7771 A recent visit to several of the Polynesian is... \n", "25807 How Capt. Cook died : new light from an old book \n", "... ... \n", "37508 A Wonderful Illawarra waterfall : a rare beaut... \n", "25811 The Results of the census of 1871 : supplement... \n", "4099 The Results of the census of 1871 : supplement... \n", "25795 Regular packets for Australia : emigration to ... \n", "909 Regular packets for Australia : emigration to ... \n", "\n", " url \\\n", "25788 https://trove.nla.gov.au/work/237350529 \n", "7469 https://trove.nla.gov.au/work/19178390 \n", "25790 https://trove.nla.gov.au/work/237350531 \n", "7771 https://trove.nla.gov.au/work/19241288 \n", "25807 https://trove.nla.gov.au/work/237350548 \n", "... ... \n", "37508 https://trove.nla.gov.au/work/24063846 \n", "25811 https://trove.nla.gov.au/work/237350552 \n", "4099 https://trove.nla.gov.au/work/17856108 \n", "25795 https://trove.nla.gov.au/work/237350536 \n", "909 https://trove.nla.gov.au/work/12328620 \n", "\n", " contributors date format \\\n", "25788 Reid, Stuart, active 1884-1885 1885 Book \n", "7469 Reid, Stuart, active 1884-1885 1885 Book \n", "25790 Bennett, George, active 1830-1831 1831 Book \n", "7771 Bennett, George, active 1830-1831 1831-1832 Book/Illustrated|Book \n", "25807 1908 Book \n", "... ... ... ... 
\n", "37508 1895 Book \n", "25811 1873 Book \n", "4099 1873 Book \n", "25795 1850 Book \n", "909 1850 Book \n", "\n", " fulltext_url trove_id language \\\n", "25788 https://nla.gov.au/nla.obj-101207695 nla.obj-101207695 English \n", "7469 http://nla.gov.au/nla.obj-101207695 nla.obj-101207695 \n", "25790 https://nla.gov.au/nla.obj-101212925 nla.obj-101212925 English \n", "7771 http://nla.gov.au/nla.obj-101212925 nla.obj-101212925 \n", "25807 https://nla.gov.au/nla.obj-101227721 nla.obj-101227721 English \n", "... ... ... ... \n", "37508 http://nla.gov.au/nla.obj-99671695 nla.obj-99671695 \n", "25811 https://nla.gov.au/nla.obj-99716940 nla.obj-99716940 English \n", "4099 http://nla.gov.au/nla.obj-99716940 nla.obj-99716940 \n", "25795 https://nla.gov.au/nla.obj-99727992 nla.obj-99727992 English \n", "909 http://nla.gov.au/nla.obj-99727992 nla.obj-99727992 \n", "\n", " rights pages form volume \\\n", "25788 Out of Copyright|http://rightsstatements.org/v... 66 Book \n", "7469 66 Book 2 \n", "25790 No known copyright restrictions|http://rightss... 8 Book \n", "7771 8 Book \n", "25807 No known copyright restrictions|http://rightss... 10 Book \n", "... ... ... ... ... \n", "37508 1 Book \n", "25811 No known copyright restrictions|http://rightss... 2 Book \n", "4099 2 Book \n", "25795 No known copyright restrictions|http://rightss... 1 Book \n", "909 1 Book \n", "\n", " parent children text_downloaded \\\n", "25788 True \n", "7469 nla.obj-477008239 True \n", "25790 True \n", "7771 True \n", "25807 True \n", "... ... ... ... \n", "37508 True \n", "25811 True \n", "4099 True \n", "25795 True \n", "909 True \n", "\n", " text_file \n", "25788 three-weeks-in-southland-being-the-account-of-... \n", "7469 three-weeks-in-southland-being-the-account-of-... \n", "25790 a-recent-visit-to-several-of-the-polynesian-is... \n", "7771 a-recent-visit-to-several-of-the-polynesian-is... \n", "25807 how-capt-cook-died-new-light-from-an-old-book-... \n", "... ... 
\n", "37508 a-wonderful-illawarra-waterfall-a-rare-beauty-... \n", "25811 the-results-of-the-census-of-1871-supplement-t... \n", "4099 the-results-of-the-census-of-1871-supplement-t... \n", "25795 regular-packets-for-australia-emigration-to-po... \n", "909 regular-packets-for-australia-emigration-to-po... \n", "\n", "[6234 rows x 16 columns]" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_downloaded.loc[df_downloaded.duplicated('trove_id', keep=False)].sort_values('trove_id')" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "trove_digitised_books_with_ocr.csv
" ], "text/plain": [ "/Volumes/Workspace/mycode/glam-workbench/trove-books/notebooks/trove_digitised_books_with_ocr.csv" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Save as CSV\n", "df_downloaded.to_csv('trove_digitised_books_with_ocr.csv', index=False)\n", "display(FileLink('trove_digitised_books_with_ocr.csv'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a searchable database using Datasette\n", "\n", "To make it easy to explore the list of books, let's load the CSV file into Datasette. First we'll drop some columns, do some reordering, and add links to the downloaded text files stored on CloudStor." ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "df_datasette = df_downloaded.copy()\n", "\n", "# Add link to Cloudstor\n", "df_datasette['cloudstor_url'] = df_datasette.loc[df_datasette['text_downloaded'] == True]['text_file'].apply(lambda x: f'https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL/download?path={x}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove some columns that aren't going to be useful." ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "df_datasette = df_datasette[['title', 'contributors', 'date', 'format', 'language', 'rights', 'pages', 'url', 'fulltext_url', 'cloudstor_url', 'form', 'volume', 'parent', 'children']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rename columns for clarity." ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
\n", "
" ], "text/plain": [ " title \\\n", "0 Goliath Joe, fisherman / by Charles Thackeray ... \n", "1 Grammar of the Narrinyeri tribe of Australian ... \n", "2 The works of the Rev. Sydney Smith \n", "3 Nellie Doran : a story of Australian home and ... \n", "4 Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... \n", "\n", " contributors date \\\n", "0 Thackeray, Charles 1900-1919 \n", "1 Taplin, George 1878-1880 \n", "2 Smith, Sydney, 1771-1845 1839-1900 \n", "3 Miriam Agatha 1914-1923 \n", "4 Germany. Heer. Heereswaffenamt 1942 \n", "\n", " format language \\\n", "0 Book|Book/Illustrated English \n", "1 Book|Government publication English \n", "2 Book|Book/Illustrated|Microform English \n", "3 Book English \n", "4 Book|Book/Illustrated|Government publication German \n", "\n", " copyright pages \\\n", "0 Out of Copyright|http://rightsstatements.org/v... 130 \n", "1 Out of Copyright|http://rightsstatements.org/v... 24 \n", "2 No known copyright restrictions|http://rightss... 65 \n", "3 Out of Copyright|http://rightsstatements.org/v... 246 \n", "4 Out of Copyright|http://rightsstatements.org/v... 80 \n", "\n", " view_details_url \\\n", "0 https://trove.nla.gov.au/work/10013347 \n", "1 https://trove.nla.gov.au/work/10029401 \n", "2 https://trove.nla.gov.au/work/1004403 \n", "3 https://trove.nla.gov.au/work/10049667 \n", "4 https://trove.nla.gov.au/work/10053234 \n", "\n", " view_book_url \\\n", "0 https://nla.gov.au/nla.obj-2831231419 \n", "1 http://nla.gov.au/nla.obj-688657424 \n", "2 https://nla.gov.au/nla.obj-630176596 \n", "3 http://nla.gov.au/nla.obj-24357566 \n", "4 https://nla.gov.au/nla.obj-51530748 \n", "\n", " download_text_url form volume parent \\\n", "0 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book \n", "1 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book \n", "2 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book \n", "3 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book \n", "4 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... 
Book \n", "\n", " children \n", "0 \n", "1 \n", "2 \n", "3 \n", "4 " ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_datasette.columns = ['title', 'contributors', 'date', 'format', 'language', 'copyright', 'pages', 'view_details_url', 'view_book_url', 'download_text_url', 'form', 'volume', 'parent', 'children']\n", "df_datasette.head()" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "df_datasette.to_csv('trove-digital-books-datasette.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[This post](https://101dhhacks.net/share-searchable-csvs/) describes how you can load your CSV files into Datasette using Glitch. Here's the result – [a searchable database of Trove books available in digital form](https://trove-digital-books.glitch.me/data/trove-digital-books)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some leftover bits used for renaming the text files" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rename files to include truncated title of book\n", "for row in df.itertuples():\n", "    try:\n", "        os.rename(os.path.join('text', '{}.txt'.format(row.book_id)), os.path.join('text', '{}-{}.txt'.format(slugify(row.title[:50]), row.book_id)))\n", "    except FileNotFoundError:\n", "        # No text file was downloaded for this book, so there's nothing to rename\n", "        pass" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert all filenames back to just nla.obj- form\n", "for filename in [f for f in os.listdir('text') if f.endswith('.txt')]:\n", "    try:\n", "        objname = re.search(r'.*(nla\\.obj.*)', filename).group(1)\n", "    except AttributeError:\n", "        # No nla.obj identifier in the filename: report it and leave the file alone\n", "        print(filename)\n", "        continue\n", "    os.rename(os.path.join('text', filename), os.path.join('text', objname))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM
Workbench](https://glam-workbench.github.io/).\n", "\n", "Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }