{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting the text of Trove books from the Internet Archive\n", "\n", "Previously I've [harvested the text of books digitised](Harvesting-digitised-books.ipynb) by the National Library of Australia and made available through Trove. But it occurred to me it might be possible to get the full text of other books in Trove by making use of the links to the [Open Library](https://openlibrary.org/). \n", "\n", "There are lots of links to the Open Library in Trove. A [search for `\"http://openlibrary.org/\"`](https://trove.nla.gov.au/book/result?q=%22http%3A%2F%2Fopenlibrary.org%2F%22) in the books zone currently returns almost a million results. Many of the linked Open Library records themselves point to digital copies in the [Internet Archive](https://archive.org/). However, this is less useful than it seems as many of the digital copies have access restrictions. Nonetheless, at least *some* of the books in Trove will have freely accessible versions in the Internet Archive.\n", "\n", "At first I thought finding them might require three steps – get the Open Library identifier from Trove, query the Open Library API to get the Internet Archive identifier, then download the text from the Internet Archive. But then I realised that you can query the Internet Archive API with an Open Library identifier, so that cut out a step. This is the basic method:\n", "\n", "* Search the Trove API for Australian books that include `\"http://openlibrary.org/\"`\n", "* Extract metadata, including the Open Library identifier, from these records\n", "* Work through the results, retrieving item metadata from the Internet Archive using the Open Library identifier\n", "* If the item has a freely available text version of the book, download it and save the metadata\n", "\n", "To talk to the Internet Archive API I made use of the [internetarchive Python package](https://github.com/jjjake/internetarchive). 
Before you can use this, you need to have an account at the Internet Archive, and then [run `ia configure` on the command line](https://archive.org/services/docs/api/internetarchive/quickstart.html#configuring). This will prompt you for your login details and save them in a config file.\n", "\n", "The results:\n", "\n", "* ['Australian' books in Trove with an Open Library identifier](books_with_olids.csv) (CSV)\n", "* ['Australian' books in Trove with full text available from the Internet Archive](trove-books-in-ia.csv) (CSV)\n", "\n", "The list of books with full text includes the following fields:\n", "\n", "* `creators` – pipe-separated list of creators\n", "* `date` – publication date\n", "* `ia_formats` – pipe-separated list of file formats available from the Internet Archive (these can be downloaded from the IA)\n", "* `ia_id` – Internet Archive identifier\n", "* `ia_url` – link to more information in the Internet Archive\n", "* `ol_id` – Open Library identifier\n", "* `publisher` – publisher\n", "* `text_filename` – name of the downloaded text file\n", "* `title` – title of the book\n", "* `trove_url` – link to more information in Trove\n", "* `version_id` – Trove version identifier\n", "* `work_id` – Trove work identifier\n", "\n", "I ended up downloading 1,513 text files. However, despite the fact that I used the Australian content filter in Trove, it's clear that some of them have nothing to do with Australia. 
Nonetheless, there are many interesting books amongst the results, and it's a useful example of how you can make use of cross-links between resources.\n", "\n", "You can [download the harvested text files](https://cloudstor.aarnet.edu.au/plus/s/3h3GHfS3tQTDLaX) from CloudStor.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set things up" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [
"import requests\n",
"from requests.adapters import HTTPAdapter\n",
"from requests.packages.urllib3.util.retry import Retry\n",
"from tqdm import tqdm_notebook\n",
"import pandas as pd\n",
"import time\n",
"import os\n",
"import urllib\n",
"# Remember to run ia configure at the command line first\n",
"import internetarchive as ia" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [
"api_key = '[YOUR TROVE API KEY GOES HERE]'\n",
"\n",
"# Note that we're excluding periodicals even though there seem to be quite a few in the IA.\n",
"# I thought it would be best to stick to books for now & do the journals later.\n",
"# I'm using the 'Australian content' filter to try & limit to books published in or about Australia.\n",
"# The filter is not always accurate as you can see in some of the results...\n",
"params = {\n",
"    'q': '\"http://openlibrary.org/\" NOT format:Periodical',\n",
"    'zone': 'book',\n",
"    'l-australian': 'y', # Australian content --> yes please\n",
"    'include': 'workVersions', # We want all versions to make sure we find the OL record\n",
"    'key': api_key,\n",
"    'encoding': 'json',\n",
"    'bulkHarvest': 'true',\n",
"    'n': 100\n",
"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define some functions to do the work" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [
"s = requests.Session()\n",
"retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])\n",
"s.mount('https://', HTTPAdapter(max_retries=retries))\n",
"s.mount('http://', HTTPAdapter(max_retries=retries))\n",
"\n",
"def get_total_results():\n",
"    '''\n",
"    Get the total number of results for a search.\n",
"    '''\n",
"    these_params = params.copy()\n",
"    these_params['n'] = 0\n",
"    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)\n",
"    data = response.json()\n",
"    return int(data['response']['zone'][0]['records']['total'])\n",
"\n",
"def get_ol_id(record):\n",
"    '''\n",
"    Extract the Open Library identifier from a record.\n",
"    '''\n",
"    ol_id = None\n",
"    for link in record['identifier']:\n",
"        if link['type'] == 'control number' and link['value'][:2] == 'OL':\n",
"            ol_id = link['value']\n",
"    return ol_id\n",
"\n",
"def get_details(record):\n",
"    '''\n",
"    Get basic metadata from a record.\n",
"    '''\n",
"    if isinstance(record.get('creator'), list):\n",
"        creators = '|'.join(record.get('creator'))\n",
"    else:\n",
"        creators = record.get('creator')\n",
"    book = {\n",
"        'title': record.get('title'),\n",
"        'creators': creators,\n",
"        'date': record.get('issued'),\n",
"        'publisher': record.get('publisher')\n",
"    }\n",
"    return book\n",
"\n",
"def process_record(record, work_id, version_id):\n",
"    '''\n",
"    Check to see if a version record comes from the Open Library.\n",
"    If it does, extract the OL identifier and prepare basic metadata.\n",
"    '''\n",
"    book = None\n",
"    source = record.get('metadataSource')\n",
"    if source and source == 'Open Library':\n",
"        ol_id = get_ol_id(record)\n",
"        if ol_id:\n",
"            book = get_details(record)\n",
"            book['ol_id'] = ol_id\n",
"            book['work_id'] = work_id\n",
"            book['version_id'] = version_id\n",
"            book['trove_url'] = 'https://trove.nla.gov.au/version/{}'.format(version_id)\n",
"    return book\n",
"\n",
"def harvest_books():\n",
"    '''\n",
"    Get records from Trove with Open Library links.\n",
"    Extract and save the OL identifier and basic book metadata.\n",
"    '''\n",
"    books = []\n",
"    total = get_total_results()\n",
"    start = '*'\n",
"    these_params = params.copy()\n",
"    with tqdm_notebook(total=total) as pbar:\n",
"        while start:\n",
"            these_params['s'] = start\n",
"            response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)\n",
"            data = response.json()\n",
"            # The nextStart parameter is used to get the next page of results.\n",
"            # If there's no nextStart then it means we're on the last page of results.\n",
"            try:\n",
"                start = data['response']['zone'][0]['records']['nextStart']\n",
"            except KeyError:\n",
"                start = None\n",
"            for work in data['response']['zone'][0]['records']['work']:\n",
"                # Sometimes there's a single version, other times a list\n",
"                # Make sure we process all of them to get different editions\n",
"                for version in work['version']:\n",
"                    if isinstance(version['record'], list):\n",
"                        for record in version['record']:\n",
"                            book = process_record(record, work['id'], version['id'])\n",
"                            if book:\n",
"                                books.append(book)\n",
"                    else:\n",
"                        book = process_record(version['record'], work['id'], version['id'])\n",
"                        if book:\n",
"                            books.append(book)\n",
"            pbar.update(100)\n",
"    return books\n",
"\n",
"def get_ia_details(books):\n",
"    '''\n",
"    Process a list of books with OL identifiers.\n",
"    Retrieve metadata from Internet Archive.\n",
"    If there's a freely available text file, download it.\n",
"    Remember to run ia configure at the command line first or else you'll get authentication errors!\n",
"    '''\n",
"    ia_books = []\n",
"    for book in tqdm_notebook(books):\n",
"        # Search for items with a specific OL identifier\n",
"        for item in ia.search_items('openlibrary_edition:{}'.format(book['ol_id'])).iter_as_items():\n",
"            formats = []\n",
"            # Check to see if there are digital files available\n",
"            if 'files' in item.item_metadata:\n",
"                # Loop through the files and save the format names\n",
"                for file in item.item_metadata['files']:\n",
"                    if file.get('private') != 'true':\n",
"                        formats.append(file['format'])\n",
"                        # If there's a text version, grab the filename\n",
"                        if file['format'] == 'DjVuTXT':\n",
"                            text_file = file['name']\n",
"                # If there's a text version we'll download it\n",
"                if 'DjVuTXT' in formats:\n",
"                    # Check to see if we've already got it\n",
"                    if not os.path.exists(os.path.join('ia_texts', text_file)):\n",
"                        try:\n",
"                            dl = ia.download(item.identifier, formats='DjVuTXT', destdir='ia_texts', no_directory=True)\n",
"                        except requests.exceptions.HTTPError as err:\n",
"                            # Even though I tried to exclude 'private' files above, I still got some authentication errors\n",
"                            if err.response.status_code == 403:\n",
"                                dl = None\n",
"                            else:\n",
"                                raise\n",
"                    else:\n",
"                        dl = True\n",
"                    # If we've successfully downloaded a text file, save the book details\n",
"                    if dl is not None:\n",
"                        ia_book = book.copy()\n",
"                        ia_book['ia_formats'] = '|'.join(formats)\n",
"                        ia_book['ia_id'] = item.identifier\n",
"                        ia_book['text_filename'] = text_file\n",
"                        ia_book['ia_url'] = 'https://archive.org/details/{}'.format(item.identifier)\n",
"                        ia_books.append(ia_book)\n",
"    return ia_books" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Trove books with Open Library links" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "books = harvest_books()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>creators</th><th>date</th><th>ol_id</th><th>publisher</th><th>title</th><th>trove_url</th><th>version_id</th><th>work_id</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Steve Parker 1952-</td><td>1993</td><td>OL1404493M</td><td>London Dorling Kindersley</td><td>Rocks and minerals written by Steve Parker.</td><td>https://trove.nla.gov.au/version/257741553</td><td>257741553</td><td>10007961</td></tr>\n",
"<tr><th>1</th><td>South Australia. Premier's Dept. Publicity and...</td><td>1980</td><td>OL24656253M</td><td>Adelaide Publicity, Premier's Department, Sout...</td><td>South Australia</td><td>https://trove.nla.gov.au/version/166795880</td><td>166795880</td><td>10015487</td></tr>\n",
"<tr><th>2</th><td>Thomas Chastain</td><td>1981</td><td>OL4092928M</td><td>Garden City, N.Y Doubleday</td><td>The diamond exchange Thomas Chastain</td><td>https://trove.nla.gov.au/version/49089270</td><td>49089270</td><td>10020752</td></tr>\n",
"<tr><th>3</th><td>Baker, Richard W.|East-West Center.</td><td>1994</td><td>OL1405863M</td><td>Westport, Conn Praeger</td><td>The ANZUS states and their region regional pol...</td><td>https://trove.nla.gov.au/version/186090623</td><td>186090623</td><td>10022258</td></tr>\n",
"<tr><th>4</th><td>Arthur, Elizabeth 1953-</td><td>1995</td><td>OL1413756M</td><td>New York Knopf</td><td>Antarctic navigation a novel Elizabeth Arthur.</td><td>https://trove.nla.gov.au/version/171444544</td><td>171444544</td><td>10025636</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " creators date ol_id \\\n", "0 Steve Parker 1952- 1993 OL1404493M \n", "1 South Australia. Premier's Dept. Publicity and... 1980 OL24656253M \n", "2 Thomas Chastain 1981 OL4092928M \n", "3 Baker, Richard W.|East-West Center. 1994 OL1405863M \n", "4 Arthur, Elizabeth 1953- 1995 OL1413756M \n", "\n", " publisher \\\n", "0 London Dorling Kindersley \n", "1 Adelaide Publicity, Premier's Department, Sout... \n", "2 Garden City, N.Y Doubleday \n", "3 Westport, Conn Praeger \n", "4 New York Knopf \n", "\n", " title \\\n", "0 Rocks and minerals written by Steve Parker. \n", "1 South Australia \n", "2 The diamond exchange Thomas Chastain \n", "3 The ANZUS states and their region regional pol... \n", "4 Antarctic navigation a novel Elizabeth Arthur. \n", "\n", " trove_url version_id work_id \n", "0 https://trove.nla.gov.au/version/257741553 257741553 10007961 \n", "1 https://trove.nla.gov.au/version/166795880 166795880 10015487 \n", "2 https://trove.nla.gov.au/version/49089270 49089270 10020752 \n", "3 https://trove.nla.gov.au/version/186090623 186090623 10022258 \n", "4 https://trove.nla.gov.au/version/171444544 171444544 10025636 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert to a dataframe\n", "df = pd.DataFrame(books)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8273, 8)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How many results?\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Save to CSV\n", "df.to_csv('books_with_olids.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get metadata and download text files from the Internet Archive " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ia_books = get_ia_details(books)" ] 
}, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>creators</th><th>date</th><th>ia_formats</th><th>ia_id</th><th>ia_url</th><th>ol_id</th><th>publisher</th><th>text_filename</th><th>title</th><th>trove_url</th><th>version_id</th><th>work_id</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>George Jeffrey|Perkins, Arthur J. (Arthur Jame...</td><td>1907</td><td>Item Tile|DjVu|Animated GIF|Text PDF|Abbyy GZ|...</td><td>cu31924003182643</td><td>https://archive.org/details/cu31924003182643</td><td>OL24169301M</td><td>Adelaide Printed by Vardon &amp; sons, ltd.</td><td>cu31924003182643_djvu.txt</td><td>A practical handbook on sheep and wool for the...</td><td>https://trove.nla.gov.au/version/49129017</td><td>49129017</td><td>10051865</td></tr>\n",
"<tr><th>1</th><td>Rolf Boldrewood 1826-1915</td><td>1969</td><td>DjVu|Animated GIF|Image Container PDF|Abbyy GZ...</td><td>oldmelbournemem00boldgoog</td><td>https://archive.org/details/oldmelbournemem00b...</td><td>OL5729660M</td><td>Melbourne Heinemann</td><td>oldmelbournemem00boldgoog_djvu.txt</td><td>Old Melbourne memories [by] Rolf Boldrewood. I...</td><td>https://trove.nla.gov.au/version/544219</td><td>544219</td><td>10070727</td></tr>\n",
"<tr><th>2</th><td>Rolf Boldrewood 1826-1915</td><td>1896</td><td>Item Tile|DjVu|Animated GIF|Text PDF|Abbyy GZ|...</td><td>oldmelbournememo00bold</td><td>https://archive.org/details/oldmelbournememo00...</td><td>OL7114646M</td><td>London, New York Macmillan and Co.</td><td>oldmelbournememo00bold_djvu.txt</td><td>Old Melbourne memories [by] Rolf Boldrewood.</td><td>https://trove.nla.gov.au/version/1560640</td><td>1560640</td><td>10070727</td></tr>\n",
"<tr><th>3</th><td>Rolf Boldrewood 1826-1915</td><td>1884</td><td>Item Tile|DjVu|Animated GIF|Image Container PD...</td><td>oldmelbournemem01boldgoog</td><td>https://archive.org/details/oldmelbournemem01b...</td><td>OL23448373M</td><td>George Robertson</td><td>oldmelbournemem01boldgoog_djvu.txt</td><td>Old Melbourne Memories</td><td>https://trove.nla.gov.au/version/3575747</td><td>3575747</td><td>10070727</td></tr>\n",
"<tr><th>4</th><td>Athel D'Ombrain 1901-|Swan, Wendy.</td><td>1981</td><td>Item Tile|DjVu|Animated GIF|Text PDF|Abbyy GZ|...</td><td>religionbusiness00stim</td><td>https://archive.org/details/religionbusiness00...</td><td>OL3518420M</td><td>Sydney, N.S.W Reed</td><td>religionbusiness00stim_djvu.txt</td><td>Historic buildings of Maitland District Maitla...</td><td>https://trove.nla.gov.au/version/183601587</td><td>183601587</td><td>10077826</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " creators date \\\n", "0 George Jeffrey|Perkins, Arthur J. (Arthur Jame... 1907 \n", "1 Rolf Boldrewood 1826-1915 1969 \n", "2 Rolf Boldrewood 1826-1915 1896 \n", "3 Rolf Boldrewood 1826-1915 1884 \n", "4 Athel D'Ombrain 1901-|Swan, Wendy. 1981 \n", "\n", " ia_formats \\\n", "0 Item Tile|DjVu|Animated GIF|Text PDF|Abbyy GZ|... \n", "1 DjVu|Animated GIF|Image Container PDF|Abbyy GZ... \n", "2 Item Tile|DjVu|Animated GIF|Text PDF|Abbyy GZ|... \n", "3 Item Tile|DjVu|Animated GIF|Image Container PD... \n", "4 Item Tile|DjVu|Animated GIF|Text PDF|Abbyy GZ|... \n", "\n", " ia_id \\\n", "0 cu31924003182643 \n", "1 oldmelbournemem00boldgoog \n", "2 oldmelbournememo00bold \n", "3 oldmelbournemem01boldgoog \n", "4 religionbusiness00stim \n", "\n", " ia_url ol_id \\\n", "0 https://archive.org/details/cu31924003182643 OL24169301M \n", "1 https://archive.org/details/oldmelbournemem00b... OL5729660M \n", "2 https://archive.org/details/oldmelbournememo00... OL7114646M \n", "3 https://archive.org/details/oldmelbournemem01b... OL23448373M \n", "4 https://archive.org/details/religionbusiness00... OL3518420M \n", "\n", " publisher \\\n", "0 Adelaide Printed by Vardon & sons, ltd. \n", "1 Melbourne Heinemann \n", "2 London, New York Macmillan and Co. \n", "3 George Robertson \n", "4 Sydney, N.S.W Reed \n", "\n", " text_filename \\\n", "0 cu31924003182643_djvu.txt \n", "1 oldmelbournemem00boldgoog_djvu.txt \n", "2 oldmelbournememo00bold_djvu.txt \n", "3 oldmelbournemem01boldgoog_djvu.txt \n", "4 religionbusiness00stim_djvu.txt \n", "\n", " title \\\n", "0 A practical handbook on sheep and wool for the... \n", "1 Old Melbourne memories [by] Rolf Boldrewood. I... \n", "2 Old Melbourne memories [by] Rolf Boldrewood. \n", "3 Old Melbourne Memories \n", "4 Historic buildings of Maitland District Maitla... 
\n", "\n", " trove_url version_id work_id \n", "0 https://trove.nla.gov.au/version/49129017 49129017 10051865 \n", "1 https://trove.nla.gov.au/version/544219 544219 10070727 \n", "2 https://trove.nla.gov.au/version/1560640 1560640 10070727 \n", "3 https://trove.nla.gov.au/version/3575747 3575747 10070727 \n", "4 https://trove.nla.gov.au/version/183601587 183601587 10077826 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert to dataframe\n", "df_ia = pd.DataFrame(ia_books)\n", "df_ia.head()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1511, 12)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the number of records\n", "df_ia.shape" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Save as a CSV\n", "df_ia.to_csv('trove-books-in-ia.csv', index=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }