{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Accessing Europeana IIIF APIs\n",
"[Europeana IIIF APIs](https://pro.europeana.eu/page/iiif), allows us to download, share, and reuse images and text of Europeana newspapers.\n",
"\n",
"This notebook introduces how to explore the repository, search, read a record, obtain the fulltext and create a CSV dataset.\n",
"\n",
"Europeana IIIF APIs requires an API key to access the endpoints. Please register at https://pro.europeana.eu/page/get-api to get a key.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up things"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import requests, csv\n",
"import json\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Glogal configuration\n",
"In this section, we can add our api_key, the text that we want to use to search and retrieve the elements, and the number of records to retrieve."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"api_key = 'add_your_api' #J6W44jvPV\n",
"query = 'paris'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Performing a search using the API\n",
"The API allows us to search on text and retrieve the hits highlighted, as traditional systems (e.g. Lucene and Solr)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = 'https://newspapers.eanadev.org/api/v2/search.json'\n",
"r = requests.get(url, params = {'query': query, 'profile': 'hits', 'wskey': api_key })\n",
"print(r.url)\n",
"response = r.text\n",
"#print(response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Displaying the mentions in the transcribed text where the search keyword was found"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results = json.loads(response)\n",
"\n",
"for r in results['hits']:\n",
" print('id:' + r['scope'])\n",
" for s in r['selectors']:\n",
" \n",
" print(s.get('prefix', '') + s.get('exact', '') + s.get('suffix', ''))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a CSV file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"csv_out = csv.writer(open('eu_records.csv', 'w'), delimiter = ',', quotechar = '\"', quoting = csv.QUOTE_MINIMAL)\n",
"csv_out.writerow(['title', 'thumbnail', 'date', 'license', 'typem', 'language', 'fulltextUrl', 'manifestUrl', 'fulltext'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieving the manifests\n",
"A manifest describes the information needed for a viewer to present a digital object to the user, such as the title and the sequence of views/images. We can also retrieve the manifest of each item. According to the Europeana documentation, the request follows the pattern https://iiif.europeana.eu/presentation/[RECORD_ID]/manifest\n",
"\n",
"The manifest includes the metadata, some of the attribues are multivalued.\n",
"\n",
"The full text is available at \n",
"https://www.europeana.eu/api/fulltext/9200303/BibliographicResource_3000059898023/472ef0641de5cce2ba8eb26d67110ed6#char=0,10o"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results = json.loads(response)\n",
"\n",
"for r in results['hits']:\n",
" \n",
" title = thumbnail = date = license = typem = language = fulltextUrl = manifestUrl = fulltext =''\n",
" \n",
" manifestUrl = 'https://iiif.europeana.eu/presentation/' + r['scope'] + '/manifest'\n",
" responseManifest = requests.get(manifestUrl, params = {'wskey': api_key })\n",
" print(responseManifest.url)\n",
" \n",
" # retrieving the metadata\n",
" m = json.loads(responseManifest.text)\n",
" \n",
" # retrieving metadata\n",
" title = m['label'][0]['@value']\n",
" thumbnail = m['thumbnail']['@id']\n",
" if 'navDate' in m:\n",
" date = m['navDate']\n",
" license = m['license']\n",
"\n",
" for i in m['metadata']:\n",
" if i['label'] == 'type':\n",
" typem = i['value'][0]['@value']\n",
" elif i['label'] == 'language':\n",
" language = i['value'][0]['@value']\n",
" else: pass\n",
" \n",
" ## getting the full text\n",
" annopageUrl = 'https://iiif.europeana.eu/presentation/' + r['scope'] + '/annopage/1'\n",
" responseAnnopage = requests.get(annopageUrl, params = {'wskey': api_key })\n",
" print(responseAnnopage.url)\n",
" \n",
" a = json.loads(responseAnnopage.text)\n",
" fulltextUrl = ''\n",
" if 'resources' in a:\n",
" fulltextUrl = a['resources'][0]['resource']['@id']\n",
" print(fulltextUrl)\n",
" \n",
" responseFulltext = ''\n",
" if fulltextUrl != '':\n",
" responseFulltext = requests.get(fulltextUrl, params = {'wskey': api_key })\n",
" \n",
" # retrieving the metadata\n",
" f = json.loads(responseFulltext.text)\n",
" # TODO check encoding\n",
" fulltext = f['value']\n",
" \n",
" print('-------')\n",
" \n",
" csv_out.writerow([title, thumbnail, date, license, typem, language, fulltextUrl, manifestUrl, fulltext])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the CSV file from GitHub.\n",
"# This puts the data in a Pandas DataFrame\n",
"df = pd.read_csv('eu_records.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Have a peek"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Showing the thumbnails as a gallery\n",
"\n",
"Once we have queried the repository and we have the metadata as a CSV file, let's show the results as a thumbnail gallery."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import HTML, Image\n",
"\n",
"def _src_from_data(data):\n",
" \"\"\"Base64 encodes image bytes for inclusion in an HTML img element\"\"\"\n",
" img_obj = Image(data=data)\n",
" for bundle in img_obj._repr_mimebundle_():\n",
" for mimetype, b64value in bundle.items():\n",
" if mimetype.startswith('image/'):\n",
" return f'data:{mimetype};base64,{b64value}'\n",
"\n",
"def gallery(images, row_height='auto'):\n",
" \"\"\"Shows a set of images in a gallery that flexes with the width of the notebook.\n",
" \n",
" Parameters\n",
" ----------\n",
" images: list of str or bytes\n",
" URLs or bytes of images to display\n",
"\n",
" row_height: str\n",
" CSS height value to assign to all images. Set to 'auto' by default to show images\n",
" with their native dimensions. Set to a value like '250px' to make all rows\n",
" in the gallery equal height.\n",
" \"\"\"\n",
" figures = []\n",
" for image in images:\n",
" if isinstance(image, bytes):\n",
" src = _src_from_data(image)\n",
" caption = ''\n",
" else:\n",
" src = image\n",
" caption = f'