{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "The [Portable Antiquities Scheme](http://finds.org.uk) (PAS) is a programme run by the British Museum and the National Museum of Wales. Small artefacts are often found in the course of gardening, metal detecting and other activities; the Scheme allows those finds to be recorded and those objects to become known. The Scheme has a database at [finds.org.uk/database](http://finds.org.uk/database) containing well over 1 million objects. The database exposes its records in a variety of ways to encourage scholarly re-use. Daniel Pett, who [designed and built](https://finds.org.uk/info) the database and web API, wrote the following R code as a demonstration of how to query its API and retrieve photographic records." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading required package: bitops\n" ] } ], "source": [ "# jsonlite provides fromJSON(), used to fetch and parse the search results\n", "library(jsonlite)\n", "# RCurl provides url.exists(), used to test image URLs before downloading\n", "library(RCurl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we set the base URL for PAS because we'll need it later. We're going to run a search of the database, which will return results to us in JSON format. Some of the key:value pairs will be things like 'filename' and 'imagedir'; to get the images we want, we'll grab that information and string it together with the base URL to create a download path." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# The base URL for PAS\n", "base <- 'https://finds.org.uk/'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we set up our query. Open a new browser tab, go to the PAS website and do some simple searches to see the kind of information available. When you get the search results, scroll down to see the options for how you can get the data returned to you. Click on 'json', and note the URL in the address bar. 
That's what we're about to set up in the next cell:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "## Set up your query\n", "# The important parameters for you to include in a search are:\n", "# q/{queryString} - which has your free text or parameterised search e.g. q/gold/broadperiod/BRONZE+AGE\n", "# /thumbnail/1 - ask for records with images\n", "# /format/json - ask for json response\n", "##\n", "url <- \"https://finds.org.uk/database/search/results/q/gold/broadperiod/BRONZE+AGE/thumbnail/1/format/json\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next line, which uses a function from the 'jsonlite' package, goes to the URL we set up above and gets the data." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Get your JSON and parse\n", "json <- fromJSON(url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you look at the results of your search in the other browser window, where you clicked on 'json' at the bottom of the page, you'll see a long list of key:value pairs. Below, we're going to grab some of the values that describe the metadata for our search." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# The total number of results available\n", "total <- json$meta$totalResults\n", "\n", "# Results returned per page\n", "results <- json$meta$resultsPerPage\n", "\n", "# Number of pages needed to fetch every result\n", "pagination <- ceiling(total/results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're now going to set up a variable that specifies which fields we wish to keep and pass to a csv file at the end of this process."
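, "\n", "\n", "As a rough sketch of the idea, using a made-up data frame (`toy`, `wanted` and the column names are invented for illustration and are not PAS fields), keeping only named columns looks something like this:\n", "\n", "```r\n", "# Hypothetical data, not from the PAS database\n", "toy <- data.frame(id = 1:2, colour = c('red', 'blue'), weight = c(3.5, 4.1))\n", "# Names of the columns we want to keep\n", "wanted <- c('id', 'weight')\n", "# A logical test on the column names selects the matching columns\n", "toy[, names(toy) %in% wanted]\n", "```\n", "\n", "Columns are kept in the order they appear in the data frame, not the order listed in `wanted`. That matters later in this notebook, where columns are picked out by position."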
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Set which fields to keep \n", "keeps <- c(\n", " \"id\", \"objecttype\", \"old_findID\",\n", " \"broadperiod\", \"institution\", \"imagedir\", \n", " \"filename\"\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "data <- json$results\n", "# Keep the columns you want\n", "data <- data[,(names(data) %in% keeps)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at what we've got. We could just call `data` and see everything. We'll use `head(data)` instead to see just the first few lines. Any guesses as to what you'd type to see the _last_ few lines?" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
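<table><thead><tr><th>id</th><th>old_findID</th><th>objecttype</th><th>broadperiod</th><th>institution</th><th>filename</th><th>imagedir</th></tr></thead><tbody><tr><td>904260</td><td>PAS-011055</td><td>PENANNULAR RING</td><td>BRONZE AGE</td><td>PAS</td><td>2016T920b.jpg</td><td>images/ianr/</td></tr><tr><td>899611</td><td>CORN-3237B8</td><td>FLAT AXEHEAD</td><td>BRONZE AGE</td><td>CORN</td><td>DSCN0203.JPG</td><td>images/atyacke/</td></tr><tr><td>899113</td><td>YORYM-057F37</td><td>AXEHEAD</td><td>BRONZE AGE</td><td>YORYM</td><td>SWW0001.jpg</td><td>images/bmorris/</td></tr><tr><td>878840</td><td>HESH-83416C</td><td>FLAT AXEHEAD</td><td>BRONZE AGE</td><td>HESH</td><td>HESH83416C.jpg</td><td>images/preavill/</td></tr><tr><td>878015</td><td>HAMP-1248F2</td><td>PENANNULAR RING</td><td>BRONZE AGE</td><td>HAMP</td><td>HAMP1248F2.jpg</td><td>images/khindshamp/</td></tr><tr><td>871932</td><td>SUSS-1CBAB0</td><td>PENANNULAR RING</td><td>BRONZE AGE</td><td>SUSS</td><td>RingSUSS1CBAB0.jpg</td><td>images/EdwinWood/</td></tr></tbody></table>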
\n" ], "text/latex": [ "\\begin{tabular}{r|lllllll}\n", " id & old\\_findID & objecttype & broadperiod & institution & filename & imagedir\\\\\n", "\\hline\n", "\t 904260 & PAS-011055 & PENANNULAR RING & BRONZE AGE & PAS & 2016T920b.jpg & images/ianr/ \\\\\n", "\t 899611 & CORN-3237B8 & FLAT AXEHEAD & BRONZE AGE & CORN & DSCN0203.JPG & images/atyacke/ \\\\\n", "\t 899113 & YORYM-057F37 & AXEHEAD & BRONZE AGE & YORYM & SWW0001.jpg & images/bmorris/ \\\\\n", "\t 878840 & HESH-83416C & FLAT AXEHEAD & BRONZE AGE & HESH & HESH83416C.jpg & images/preavill/ \\\\\n", "\t 878015 & HAMP-1248F2 & PENANNULAR RING & BRONZE AGE & HAMP & HAMP1248F2.jpg & images/khindshamp/\\\\\n", "\t 871932 & SUSS-1CBAB0 & PENANNULAR RING & BRONZE AGE & SUSS & RingSUSS1CBAB0.jpg & images/EdwinWood/ \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "id | old_findID | objecttype | broadperiod | institution | filename | imagedir | \n", "|---|---|---|---|---|---|\n", "| 904260 | PAS-011055 | PENANNULAR RING | BRONZE AGE | PAS | 2016T920b.jpg | images/ianr/ | \n", "| 899611 | CORN-3237B8 | FLAT AXEHEAD | BRONZE AGE | CORN | DSCN0203.JPG | images/atyacke/ | \n", "| 899113 | YORYM-057F37 | AXEHEAD | BRONZE AGE | YORYM | SWW0001.jpg | images/bmorris/ | \n", "| 878840 | HESH-83416C | FLAT AXEHEAD | BRONZE AGE | HESH | HESH83416C.jpg | images/preavill/ | \n", "| 878015 | HAMP-1248F2 | PENANNULAR RING | BRONZE AGE | HAMP | HAMP1248F2.jpg | images/khindshamp/ | \n", "| 871932 | SUSS-1CBAB0 | PENANNULAR RING | BRONZE AGE | SUSS | RingSUSS1CBAB0.jpg | images/EdwinWood/ | \n", "\n", "\n" ], "text/plain": [ " id old_findID objecttype broadperiod institution\n", "1 904260 PAS-011055 PENANNULAR RING BRONZE AGE PAS \n", "2 899611 CORN-3237B8 FLAT AXEHEAD BRONZE AGE CORN \n", "3 899113 YORYM-057F37 AXEHEAD BRONZE AGE YORYM \n", "4 878840 HESH-83416C FLAT AXEHEAD BRONZE AGE HESH \n", "5 878015 HAMP-1248F2 PENANNULAR RING BRONZE AGE HAMP \n", "6 871932 SUSS-1CBAB0 PENANNULAR RING BRONZE AGE SUSS \n", " filename 
imagedir \n", "1 2016T920b.jpg images/ianr/ \n", "2 DSCN0203.JPG images/atyacke/ \n", "3 SWW0001.jpg images/bmorris/ \n", "4 HESH83416C.jpg images/preavill/ \n", "5 HAMP1248F2.jpg images/khindshamp/\n", "6 RingSUSS1CBAB0.jpg images/EdwinWood/ " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "head(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next bit is tricky. We already have the first page of results in `data`; now we're going to loop through the remaining pages and bind them all together into a single table. This might take a bit of time, depending on how you framed your query. Be careful: there is _a lot_ of data available. If you're running this notebook through Binder, know that it might time out on you if you're grabbing a vast amount." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Loop through the remaining pages (2 onwards) and bind the results\n", "# seq_len(pagination)[-1] is empty when there is only one page, so the loop is skipped\n", "for (i in seq_len(pagination)[-1]){\n", "  urlDownload <- paste0(url, '/page/', i)\n", "  pagedJson <- fromJSON(urlDownload)\n", "  records <- pagedJson$results\n", "  records <- records[,(names(records) %in% keeps)]\n", "  data <- rbind(data,records)\n", "}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll write that to csv:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Write a csv file of the data you want\n", "write.csv(data, file='data.csv',row.names=FALSE, na=\"\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a good place to stop and consider what you've done. You've queried an archaeological database, selected the fields you want, and written them to a csv! 
Make sure to save a copy of your data locally (click the Jupyter icon at the top left, then download the csv from the file manager to your computer).\n", "\n", "### Download images into folders defined by the image categories\n", "\n", "The last part of this notebook iterates through your data, builds the download path for each image and saves the images into tidy folders, so that anything categorized as a 'BROOCH' is in a brooch folder, anything categorized as a 'TORQUE' is in a torque folder, and so on." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Name a log file, just in case of troubles or missing files.\n", "# The download function below appends to this file by name.\n", "failures <- \"failures.log\"\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're going to make a 'function', a mini-program that our code can use over and over again, to download the materials." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Download function with test for URL\n", "download <- function(data){\n", "  # The object type, taken from column 3\n", "  object = data[3]\n", "  # The record's old find ID, taken from column 2\n", "  record = data[2]\n", "  \n", "  # Create a folder for that object type if it does not already exist\n", "  if (!dir.exists(object)){\n", "    dir.create(object)\n", "  }\n", "  \n", "  # Create image URL - image path is in column 7 and filename is in column 6\n", "  URL = paste0(base,data[7],data[6])\n", "  \n", "  # Test that the file exists\n", "  exist <- url.exists(URL)\n", "  \n", "  # If it does, download. If not, report a 404 and log it\n", "  if (exist){\n", "    download.file(URLencode(URL), destfile = paste(object,basename(URL), sep = '/'))\n", "  } else {\n", "    print(\"That file is a 404\")\n", "    # Log the errors for sending back to PAS to fix - probably better than csv as you\n", "    # can tail -f and watch the errors come in\n", "    message <- paste0(record,\"|\",URL,\"|\",\"404 \\n\")\n", "    # Append to the error file\n", "    cat(message, file = failures, append = TRUE)\n", "  }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now call that function to download our data. When you run the code block below, it might seem as if nothing is happening, but keep an eye on the file manager! Right-click on the Jupyter logo above and open the link in a new tab. You'll see your new folders appearing in the list. Below, you'll also get a list of errors. These are also written to a text file called 'failures.log', which lists the items that couldn't be downloaded. You can try to track these down manually on the PAS website, or contact them with the information.\n", "\n", "To stop the download, just make sure the code block below is highlighted, and then hit the stop button (the big square beside the Run button)." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"That file is a 404\"\n", "[1] \"That file is a 404\"\n", "[1] \"That file is a 404\"\n", "[1] \"That file is a 404\"\n" ] } ], "source": [ "# Apply the function to every row of the data\n", "apply(data, 1, download)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.4.4" } }, "nbformat": 4, "nbformat_minor": 2 }