{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring digitised maps in Trove\n",
"\n",
"If you've ever poked around in Trove's 'map' zone, you might have noticed the beautiful deep-zoomable images available for many of the NLA's digitised maps. Even better, in many cases the high-resolution TIFF versions of the digitised maps are available for download.\n",
"\n",
"I knew there were lots of great maps you could download from Trove, but how many? And how big were the files? I thought I'd try to quantify this a bit by harvesting and analysing the metadata.\n",
"\n",
"The size of the downloadable files (both in bytes and pixels) are [embedded within the landing pages](https://nbviewer.jupyter.org/github/GLAM-Workbench/trove-books/blob/master/Metadata-for-Trove-digitised-works.ipynb) for the digitised maps. So harvesting the metadata involves a number of steps:\n",
"\n",
"* Use the Trove API to search for maps that include the phrase \"nla.obj\" – this will filter the results to maps that have been digitised and are available through Trove\n",
"* Work through the results, checking to see if the record includes a link to a digital copy.\n",
"* If there is a digital copy, extract the embedded work data from the landing page.\n",
"* Sometimes the work data doesn't include the copyright status, if it doesn't then I scrape it from the page.\n",
"\n",
"Here's the [downloaded metadata as a CSV formatted file](single_maps.csv). You can also [browse the results](https://docs.google.com/spreadsheets/d/1yBPcCk9wIRovRacKbfrlyThWrzGXLF79Lr0GIQbaO9Y/edit?usp=sharing) using Google Sheets."
]
},
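{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 'extract the embedded work data' step works because each digitised work's landing page embeds its metadata in a `var work = JSON.parse(JSON.stringify({...}))` statement. Here's a sketch of the extraction against a made-up HTML fragment (the real pages are fetched with `requests` below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import re\n",
"\n",
"# A made-up fragment of a landing page, for illustration only\n",
"html = 'var work = JSON.parse(JSON.stringify({\"pid\": \"nla.obj-123\", \"copyrightPolicy\": \"Out of Copyright\"}));'\n",
"\n",
"# Pull the JSON literal out of the JavaScript statement\n",
"match = re.search(r'var work = JSON\\.parse\\(JSON\\.stringify\\((\\{.*\\})', html)\n",
"work = json.loads(match.group(1))\n",
"work['copyrightPolicy']"
]
},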
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting things up"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DataTransformerRegistry.enable('json')"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import requests\n",
"from tqdm import tqdm_notebook\n",
"from requests.adapters import HTTPAdapter\n",
"from requests.packages.urllib3.util.retry import Retry\n",
"from IPython.display import display, FileLink\n",
"import re\n",
"import json\n",
"import time\n",
"import pandas as pd\n",
"from bs4 import BeautifulSoup\n",
"import altair as alt\n",
"\n",
"s = requests.Session()\n",
"retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])\n",
"s.mount('https://', HTTPAdapter(max_retries=retries))\n",
"s.mount('http://', HTTPAdapter(max_retries=retries))\n",
"\n",
"alt.renderers.enable('notebook')\n",
"alt.data_transformers.enable('json')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## You'll need a Trove API key to harvest the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"api_key = ''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define some functions to do the work"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def get_total_results(params):\n",
" '''\n",
" Get the total number of results for a search.\n",
" '''\n",
" these_params = params.copy()\n",
" these_params['n'] = 0\n",
" response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)\n",
" data = response.json()\n",
" return int(data['response']['zone'][0]['records']['total'])\n",
"\n",
"\n",
"def get_fulltext_url(links):\n",
" '''\n",
" Loop through the identifiers to find a link to the digital version of the journal.\n",
" '''\n",
" url = None\n",
" for link in links:\n",
" if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:\n",
" url = link['value']\n",
" break\n",
" return url\n",
"\n",
"def get_copyright_status(response):\n",
" '''\n",
" Scrape copyright information from a digital work page.\n",
" '''\n",
" soup = BeautifulSoup(response.text, 'lxml')\n",
" copyright_status = soup.find('div', id='tab-access').strong.string\n",
" return copyright_status\n",
"\n",
"def get_work_data(url):\n",
" '''\n",
" Extract work data in a JSON string from the work's HTML page.\n",
" '''\n",
" response = s.get(url)\n",
" try:\n",
" work_data = json.loads(re.search(r'var work = JSON\\.parse\\(JSON\\.stringify\\((\\{.*\\})', response.text).group(1))\n",
" except (AttributeError, TypeError):\n",
" work_data = '{}'\n",
" else:\n",
" # If there's no copyright info in the work data, then scrape it\n",
" if 'copyrightPolicy' not in work_data:\n",
" work_data['copyrightPolicy'] = get_copyright_status(response)\n",
" return work_data\n",
"\n",
"def format_bytes(size):\n",
" # 2**10 = 1024\n",
" power = 2**10\n",
" n = 0\n",
" power_labels = {0 : '', 1: 'K', 2: 'M', 3: 'G', 4: 'T'}\n",
" while size > power:\n",
" size /= power\n",
" n += 1\n",
" return size, power_labels[n]+'B'\n",
"\n",
"def get_map_data(work_data):\n",
" '''\n",
" Look for file size information in the embedded data\n",
" '''\n",
" map_data = {}\n",
" width = None\n",
" height = None\n",
" num_bytes = None\n",
" try:\n",
" # Make sure there's a downloadable version\n",
" if work_data.get('accessConditions') == 'Unrestricted' and 'copies' in work_data:\n",
" for copy in work_data['copies']:\n",
" # Get the pixel dimensions\n",
" if 'technicalmetadata' in copy:\n",
" width = copy['technicalmetadata'].get('width')\n",
" height = copy['technicalmetadata'].get('height')\n",
" # Get filesize in bytes\n",
" elif copy['copyrole'] in ['m', 'o', 'i', 'fd'] and copy['access'] == 'true':\n",
" num_bytes = copy.get('filesize')\n",
" if width and height and num_bytes:\n",
" size, unit = format_bytes(num_bytes)\n",
" # Convert bytes to something human friendly\n",
" map_data['filesize_string'] = '{:.2f}{}'.format(size, unit)\n",
" map_data['filesize'] = num_bytes\n",
" map_data['width'] = width\n",
" map_data['height'] = height\n",
" map_data['copyright_status'] = work_data.get('copyrightPolicy')\n",
" except AttributeError:\n",
" pass\n",
" return map_data\n",
" \n",
"\n",
"def get_maps():\n",
" '''\n",
" Harvest metadata about maps.\n",
" '''\n",
" url = 'http://api.trove.nla.gov.au/v2/result'\n",
" maps = []\n",
" params = {\n",
" 'q': '\"nla.obj-\"',\n",
" 'zone': 'map',\n",
" 'l-availability': 'y',\n",
" 'l-format': 'Map/Single map',\n",
" 'bulkHarvest': 'true', # Needed to maintain a consistent order across requests\n",
" 'key': api_key,\n",
" 'n': 100,\n",
" 'encoding': 'json'\n",
" }\n",
" start = '*'\n",
" total = get_total_results(params)\n",
" with tqdm_notebook(total=total) as pbar:\n",
" while start:\n",
" params['s'] = start\n",
" response = s.get(url, params=params)\n",
" data = response.json()\n",
" # If there's a startNext value then we get it to request the next page of results\n",
" try:\n",
" start = data['response']['zone'][0]['records']['nextStart']\n",
" except KeyError:\n",
" start = None\n",
" for work in tqdm_notebook(data['response']['zone'][0]['records']['work'], leave=False):\n",
" # Check to see if there's a link to a digital version\n",
" try:\n",
" fulltext_url = get_fulltext_url(work['identifier'])\n",
" except KeyError:\n",
" pass\n",
" else:\n",
" if fulltext_url:\n",
" work_data = get_work_data(fulltext_url)\n",
" map_data = get_map_data(work_data)\n",
" if 'filesize' in map_data:\n",
" trove_id = re.search(r'(nla\\.obj\\-\\d+)', fulltext_url).group(1)\n",
" try:\n",
" contributors = '|'.join(work.get('contributor'))\n",
" except TypeError:\n",
" contributors = work.get('contributor')\n",
" # Get basic metadata\n",
" # You could add more work data here\n",
" # Check the Trove API docs for work record structure\n",
" map_data['title'] = work['title']\n",
" map_data['fulltext_url'] = fulltext_url\n",
" map_data['trove_url'] = work.get('troveUrl')\n",
" map_data['trove_id'] = trove_id\n",
" map_data['date'] = work.get('issued')\n",
" map_data['creators'] = contributors\n",
" maps.append(map_data)\n",
" time.sleep(0.2)\n",
" time.sleep(0.2)\n",
" pbar.update(100)\n",
" return maps"
]
},
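{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of `format_bytes`: it keeps dividing by 1024 and stepping up through the unit labels, so 2048 bytes should come back as 2KB and one of the multi-gigabyte TIFFs as a value in GB:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# format_bytes is defined in the cell above\n",
"print(format_bytes(2048))  # (2.0, 'KB')\n",
"print(format_bytes(3388748804))  # roughly (3.16, 'GB')"
]
},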
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download map data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"maps = get_maps()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Convert to dataframe and save to CSV"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert to dataframe\n",
"df = pd.DataFrame(maps)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"single_maps.csv
"
],
"text/plain": [
"/Users/tim/mycode/glam-workbench/trove-maps/notebooks/single_maps.csv"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Save to CSV\n",
"df.to_csv('single_maps.csv', index=False)\n",
"display(FileLink('single_maps.csv'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Let's explore the results"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# Reload data from CSV if necessary\n",
"df = pd.read_csv('single_maps.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many single maps have high-resolution downloads?"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"20,158 maps\n"
]
}
],
"source": [
"print('{:,} maps'.format(df.shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How much map data is available for download?"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7.07TB\n"
]
}
],
"source": [
"size, unit = format_bytes(df['filesize'].sum())\n",
"print('{:.2f}{}'.format(size, unit))"
]
},
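{
"cell_type": "markdown",
"metadata": {},
"source": [
"It can also be handy to rank the maps by size. Here's a sketch using a tiny made-up DataFrame; with the harvested `df` the same two lines add an `mb` column (2**20 bytes per MB) and list the largest downloads:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# A tiny stand-in for the harvested data, for illustration only\n",
"sample = pd.DataFrame([\n",
"    {'trove_id': 'nla.obj-1', 'filesize': 3388748804},\n",
"    {'trove_id': 'nla.obj-2', 'filesize': 1048576},\n",
"])\n",
"\n",
"# Convert bytes to megabytes and show the biggest files first\n",
"sample['mb'] = sample['filesize'] / 2**20\n",
"sample.nlargest(2, 'filesize')"
]
},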
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What's the copyright status of the maps?"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Out of Copyright 14967\n",
"In Copyright 3271\n",
"No known copyright restrictions 1506\n",
"Edition Out of Copyright 245\n",
"Copyright Undetermined 148\n",
"Edition In Copyright 12\n",
"Unknown 6\n",
"Perpetual 2\n",
"Copyright Uncertain 1\n",
"Name: copyright_status, dtype: int64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['copyright_status'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's show the copyright status as a chart..."
]
},
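{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to do this with Altair is to reshape the value counts into a small DataFrame and feed it to a bar chart. This is a sketch (the `status_counts` column names are my own choice, not anything Trove-specific):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reshape the value counts for charting\n",
"status_counts = df['copyright_status'].value_counts().reset_index()\n",
"status_counts.columns = ['status', 'count']\n",
"\n",
"alt.Chart(status_counts).mark_bar().encode(\n",
"    x='count:Q',\n",
"    y='status:N'\n",
")"
]
},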
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | copyright_status | \n", "creators | \n", "date | \n", "filesize | \n", "filesize_string | \n", "fulltext_url | \n", "height | \n", "title | \n", "trove_id | \n", "trove_url | \n", "width | \n", "mb | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|
1218 | \n", "Out of Copyright | \n", "Imray, James F. (James Frederick), 1829?-1891 | \n", "1853-1863 | \n", "3388748804 | \n", "3.16GB | \n", "http://nla.gov.au/nla.obj-390032889 | \n", "24785 | \n", "Chart of the west, south and east coasts of Au... | \n", "nla.obj-390032889 | \n", "https://trove.nla.gov.au/work/13684619 | \n", "45575 | \n", "3231.762699 | \n", "
3017 | \n", "No known copyright restrictions | \n", "Geological Survey of India | \n", "1932 | \n", "3623879488 | \n", "3.38GB | \n", "http://nla.gov.au/nla.obj-591001246 | \n", "38023 | \n", "Map of the City of Rangoon and suburbs 1928-29... | \n", "nla.obj-591001246 | \n", "https://trove.nla.gov.au/work/182743876 | \n", "31769 | \n", "3456.000793 | \n", "
4578 | \n", "In Copyright | \n", "Indonesia. Direktorat Geologi | \n", "1970 | \n", "3279210576 | \n", "3.05GB | \n", "http://nla.gov.au/nla.obj-568387103 | \n", "41429 | \n", "Peta geologi teknik daerah Jakarta - Bogor : E... | \n", "nla.obj-568387103 | \n", "https://trove.nla.gov.au/work/20208553 | \n", "26384 | \n", "3127.298904 | \n", "
4830 | \n", "No known copyright restrictions | \n", "Taiwan | \n", "1942 | \n", "3264456500 | \n", "3.04GB | \n", "http://nla.gov.au/nla.obj-400826638 | \n", "25508 | \n", "Nyūginia-tō zenzu / Taiwan Sōtokufu Gaijibu... | \n", "nla.obj-400826638 | \n", "https://trove.nla.gov.au/work/205481810 | \n", "42659 | \n", "3113.228321 | \n", "
7237 | \n", "In Copyright | \n", "Indonesia. Direktorat Geologi | \n", "1963 | \n", "3311801600 | \n", "3.08GB | \n", "http://nla.gov.au/nla.obj-568387099 | \n", "20990 | \n", "Geological map of Djawa and Madura / compiled ... | \n", "nla.obj-568387099 | \n", "https://trove.nla.gov.au/work/218208895 | \n", "52593 | \n", "3158.380127 | \n", "
19858 | \n", "Out of Copyright | \n", "South Australia. Surveyor-General's Office | \n", "1885-1950 | \n", "3308608288 | \n", "3.08GB | \n", "http://nla.gov.au/nla.obj-230705067 | \n", "43121 | \n", "Plan shewing pastoral leases and claims in the... | \n", "nla.obj-230705067 | \n", "https://trove.nla.gov.au/work/8818311 | \n", "25576 | \n", "3155.334747 | \n", "