{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Download an image using the IIIF server and a Handle url\n", "\n", "[IIIF](https://iiif.io/) (the International Image Interoperability Framework) has defined a set of standards for publishing and using image collections. The State Library of Victoria makes many of its images available from an IIIF-compliant server. This means you can access and manipulate the images in standard ways set out by the [IIIF Image API](https://iiif.io/api/image/2.1/).\n", "\n", "Many images in the State Library's collection also have a permanent url, created using the Handle system. These are displayed in Trove. But there's no obvious way of getting an IIIF image from a Handle. Of course, if you're a human you can load the image page in your browser and click on the download button. But what if you want to build a processing pipeline, or create a dataset of images? Wouldn't it be good if you could just supply a Handle url and get back an image in whatever format or size you wanted? That's what this notebook does.\n", "\n", "One odd thing I discovered when developing this notebook is that if you try to download an image from the IIIF server using the `@id` in the image manifest, you get a 403 ('Forbidden') error. This seems to be because the server is expecting a session cookie in the request headers. In order to set this cookie, you have to first download the image manifest itself and submit the saved cookie with the image request (fortunately `requests.Session()` makes this easy)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

\n", "\n", "

\n", " Some tips:\n", "

\n", "

\n", "\n", "

Is this thing on? If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to load a live version running on Binder.

\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting things up" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import re\n", "from pathlib import Path\n", "from IPython.display import display, HTML" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def get_pid(handle_url):\n", " '''\n", " Extract a pid (image identifier) from the image viewer url.\n", " '''\n", " # The handle url will get redirected to a system url that includes a pid\n", " response = requests.get(handle_url)\n", " \n", " # Get the pid from the redirected url\n", " match = re.search(r'entity=(IE\\d+)', response.url)\n", " if match:\n", " return match.group(1)\n", " \n", "def get_image_ids(manifest):\n", " '''\n", " Extract a list of image @ids from an IIIF manifest\n", " '''\n", " image_ids = []\n", " # There can be multiple images in a record\n", " # So we loop through the canvases to get each one.\n", " for canvas in manifest['sequences'][0]['canvases']:\n", " image_ids.append(canvas['images'][0]['resource']['service']['@id'])\n", " return image_ids\n", "\n", "def construct_image_url(image_id, image_type, max_width, max_height):\n", " '''\n", " Construct a url to download the image according to the IIIF standard.\n", " '''\n", " # Create a string with the size information -- either 'w,h', 'w,', ',h', or 'max'\n", " if max_width and max_height:\n", " size = f'!{max_width},{max_height}'\n", " elif max_width:\n", " size = f'{max_width},'\n", " elif max_height:\n", " size = f',{max_height}'\n", " else:\n", " size = 'max'\n", " # Construct the url\n", " return f'{image_id}/full/{size}/0/default.{image_type}'\n", " \n", "def download_image(handle_url, image_type='jpg', max_width=None, max_height=None):\n", " '''\n", " Downloads a derivative image from the IIIF server using its Handle, and saves\n", " it to the 'images' folder.\n", " Params:\n", " handle_url: the Handle link for this image (required)\n", " image_type: one of 'jpg', 'tif', 'png'\n", " max_width: maximum width in pixels\n", " max_height: maximum height in pixels\n", " '''\n", " pid = get_pid(handle_url)\n", " manifest_url = f'https://rosetta.slv.vic.gov.au/delivery/iiif/presentation/2.1/{pid}/manifest'\n", " \n", " # We need to use a session to save cookies\n", " s = requests.Session()\n", " \n", " # Get the IIIF manifest\n", " # Requesting the manifest also sets a cookie\n", " response = s.get(manifest_url)\n", " manifest = response.json()\n", " \n", " # Extract the image ids from the manifest\n", " image_ids = get_image_ids(manifest)\n", " \n", " # Make sure there's somewhere to save the images\n", " Path('images').mkdir(parents=True, exist_ok=True)\n", " \n", " # Loop through the image ids, downloading each image\n", " for index, image_id in enumerate(image_ids):\n", " \n", " # Construct a filename using the image pid and a numeric index\n", " filename = Path(f'images/slv-{pid}-{index}.{image_type}')\n", " \n", " # Construct an IIIF compliant url\n", " image_url = construct_image_url(image_id, image_type, max_width, max_height)\n", " \n", " # Download and save the image\n", " response = s.get(image_url)\n", " filename.write_bytes(response.content)\n", " \n", " # Display the image if possible\n", " if image_type in ['jpg', 'png']:\n", " display(HTML(f''))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Downloading an image\n", "\n", "To download an image (or images) from a Handle url, just copy and paste the url into the cell below. By default, this will download the largest available version of the image in jpeg format. You can modify this behaviour by supplying any of the following parameters:\n", "\n", "* `image_type`: one of 'jpg', 'tif', 'png'\n", "* `max_width`: maximum width of the image in pixels\n", "* `max_height`: maximum height of the image in pixels\n", "\n", "For example to get a fullsize TIFF version:\n", "\n", "```\n", "download_image('http://handle.slv.vic.gov.au/10381/282282', image_type='tif')\n", "```\n", "\n", "To get a PNG file that's 200 pixels wide:\n", "\n", "```\n", "download_image('http://handle.slv.vic.gov.au/10381/282282', image_type='png', max_width=200)\n", "```\n", "\n", "You'll find the downloaded image(s) in the `images` directory." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Paste a Handle url between the quotes\n", "download_image('http://handle.slv.vic.gov.au/10381/25896')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course there's a lot more fun things you can do with IIIF – we'll explore that in [another notebook](more_fun_with_iiif.ipynb)..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io/).\n", "\n", "If you find this useful, please consider supporting my work on [Patreon](https://www.patreon.com/timsherratt)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }