{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvesting a records search from Archway\n", "\n", "This notebook includes code that will enable you to harvest individual record details from a search in the National Archives of New Zealand's online database, [Archway](https://www.archway.archives.govt.nz/).\n", "\n", "If you search for keywords only there's a limit of 1,000 results returned. But it seems that if you add extra parameters, such as a date range, the maximum number of results returned is 10,000.\n", "\n", "If you want to harvest more records than this, you'll need to break your search up into chunks of less than 10,000. I'll give some possible strategies for this below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using this notebook\n", "\n", "If you're not familiar with [Jupyter notebooks](http://jupyter.org/) like this one, here's a few basic tips.\n", "\n", "* You can run real live code. Look below for code cells – they have boxes around them and code inside (d'uh).\n", "* Click on a code cell to edit it.\n", "* Once you've clicked on a code cell, it's ready to run. Just hit `Shift+Enter` to execute the code.\n", "* Start at the top of the page, running each code cell in turn – this will ensure that all the necessary modules, functions, and variables are setup and ready to use.\n", "* While a code cell is running it'll display an asterix in the square brackets next to the cell. Once it's finished, the asterix will turn into a number.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the modules we'll need\n", "# Yes this is a code cell, hit Shift+Enter to run it!\n", "import requests\n", "import pandas as pd\n", "from bs4 import BeautifulSoup\n", "import re\n", "import time\n", "from IPython.display import display, HTML, FileLink\n", "from tqdm import tqdm, tqdm_notebook\n", "\n", "s = requests.Session()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set some defaults" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# These are the default search parameters for the Advanced search page\n", "# Don't change this here. We'll add values below.\n", "# You still need to run it though -- so Shift+Enter again!\n", "params = { \n", " 'accessionNumber': '',\n", " 'agencyCode': '',\n", " 'alternativeRecordNumber': '',\n", " 'boxNumber': '',\n", " 'code': '',\n", " 'endYear': '',\n", " 'exclude': '',\n", " 'excludeSearchTypeID': '2', #2=any, 3=exact\n", " 'format': 'All', # Options: All, Artwork, map/plan, Moving Image, Not determined, Object, Photograph, Sound Recording, text\n", " 'formerArchivesRef': '',\n", " 'heldauckland': 'on',\n", " 'heldchristchurch': 'on',\n", " 'helddigitalrepository': 'on',\n", " 'helddunedin': 'on',\n", " 'heldother': 'on',\n", " 'heldwellington': 'on',\n", " 'includeUnknown': 'on',\n", " 'keyword': '',\n", " 'keywordSearchTypeID': '1', #1=all, 2=any, 3=exact\n", " 'performSearchImageButton.x': '53',\n", " 'performSearchImageButton.y': '8',\n", " 'recordNumber': '',\n", " 'sepNumber': '',\n", " 'seriesNumber': '',\n", " 'startYear': ''\n", "}\n", "\n", "# These are the default fields for an item.\n", "# I'm assuming they're consistent!\n", "details_fields = [\n", " 'Item ID',\n", " 'Agency',\n", " 'Series',\n", " 'Accession',\n", " 'Record group',\n", " 'Box/Item',\n", " 'Sep',\n", " 'Record no.',\n", " 'Part',\n", " 'Alternative no.',\n", " 'Record type'\n", "]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The harvesting code\n", "You shouldn't have to edit anything here. Just run the code cell to load everything up." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Yep, you guessed it, hit Shift+Enter again! Seeing a pattern here?\n", "\n", "def strip_string(cell):\n", " '''\n", " If there's a string in a cell, strip all the whitespace.\n", " '''\n", " if cell.string:\n", " return cell.string.strip()\n", " else:\n", " return ''\n", "\n", "def process_item_page(response):\n", " '''\n", " Extract details from an individual record page.\n", " '''\n", " details = {}\n", " soup = BeautifulSoup(response.text, 'lxml')\n", " title_row = soup.find('td', string='Title').parent.find_next_sibling('tr')\n", " title_cells = title_row.find_all('td')\n", " details['Title'] = title_cells[0].get_text().strip()\n", " details['Date'] = strip_string(title_cells[1])\n", " details_row = soup.find('td', string='Item ID').parent.find_next_sibling('tr')\n", " details_cells = details_row.find_all('td')\n", " for index, field in enumerate(details_fields):\n", " details[field] = strip_string(details_cells[index])\n", " former_row = soup.find('td', string='Former archives ref').parent.find_next_sibling('tr')\n", " details['Former archives ref'] = strip_string(former_row.td)\n", " details['Access status'] = strip_string(soup.find(class_='restriction-text').strong)\n", " return details\n", "\n", "def process_page(soup):\n", " '''\n", " Work through a page of search results, getting the details of each individual record/\n", " '''\n", " results = []\n", " links = soup.find_all('a', href=re.compile('ViewFullItem.do'))\n", " for link in tqdm_notebook(links, leave=False):\n", " id = re.search(r'ViewFullItem\\.do\\?code=(\\d+)', link['href']).group(1)\n", " item_url = 'https://www.archway.archives.govt.nz/ViewFullItem.do?code=' + id\n", " response = s.get(item_url)\n", " results.append(process_item_page(response))\n", " time.sleep(0.2)\n", " return results\n", "\n", "def prepare_search(params):\n", " '''\n", " Gathering the cookies and session details...\n", " '''\n", " # It's probably not necessary to step through the search pages like this.\n", " # I was just getting paranoid about sessions and cookies...\n", " r1 = s.get('https://www.archway.archives.govt.nz/')\n", " r2 = s.get('https://www.archway.archives.govt.nz/CallAdvancedSearch.do')\n", " r3 = s.get('https://www.archway.archives.govt.nz/CallItemAdvancedSearch.do')\n", " r4 = s.post('https://www.archway.archives.govt.nz/ItemAdvancedSearch.do', data=params)\n", " soup = BeautifulSoup(r4.text, 'lxml')\n", " params = get_page_params(soup, 1)\n", " try:\n", " total_results = int(params['searchResultsContainer.totalResultSize'])\n", " except KeyError:\n", " total_results = 0\n", " return total_results\n", "\n", "def get_page_params(soup, page):\n", " '''\n", " Get the embedded search details in a results page to feed to the next page request.\n", " '''\n", " params = {}\n", " elements = soup.find_all('input', {'name': re.compile('searchResultsContainer'), 'type': 'hidden'})\n", " for element in elements:\n", " # print(element)\n", " params[element['name']] = element['value']\n", " params['searchResultsContainer.page'] = page\n", " return params\n", "\n", "def harvest_results(params):\n", " '''\n", " Harvest results using the supplied parameters.\n", " '''\n", " total_results = prepare_search(params)\n", " # Set up some defaults\n", " page = 1\n", " results = []\n", " # Loop through the results pages, extracting details of individual records\n", " with tqdm_notebook(total=total_results, leave=False, unit='record') as pbar:\n", " while len(results) < total_results:\n", " search_response = s.post('https://www.archway.archives.govt.nz/ItemAdvancedSearchResults.do', data=params)\n", " soup = BeautifulSoup(search_response.text, 'lxml')\n", " new_results = process_page(soup)\n", " results += new_results\n", " page += 1\n", " # Need these params to get the next page of results\n", " params = get_page_params(soup, page)\n", " time.sleep(0.5)\n", " pbar.update(len(new_results))\n", " return results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Construct your search\n", "\n", "You need to feed your search terms into the parameters defined above. For example, to search for the keyword `Chinese`, you'd enter the code:\n", "\n", "```\n", "query_params['keyword'] = 'Chinese'\n", "```\n", "\n", "To search for items in a particular series you'd enter:\n", "\n", "```\n", "query_params['seriesNumber'] = '8333'\n", "```\n", "\n", "To search for an item with a particular record number you'd enter:\n", "\n", "```\n", "query_params['recordNumber'] = '1883/3052'\n", "```\n", "\n", "You can set multiple parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make a copy of the default params\n", "query_params = params.copy()\n", "# Let's set some parameters -- edit as you see fit\n", "# Once you've finished editing, hit Shift-Enter\n", "# query_params['keyword'] = 'Chinese'\n", "query_params['seriesNumber'] = '8333'\n", "query_params['keyword'] = 'naturalisation naturalization'\n", "# Make it an 'any' search\n", "query_params['keywordSearchTypeID'] = 2\n", "query_params['startYear'] = '1840'\n", "query_params['endYear'] = '1905'\n", "# This excludes records without a date\n", "query_params['includeUnknown'] = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start the harvest!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run this cell (Shift+Enter) to kick things off\n", "# When the asterix turns to a number in the square brackets, your harvest will have finished\n", "results = harvest_results(query_params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More than 10,000?\n", "\n", "So what do you do if you want to harvest more than the limit of 10,000 records? Basically, you need to think about ways of breaking the search up into smaller chunks. From my brief explorations it seems that the `record number` field supports wildcard searches, so if the records were numbered with prefixes between 1 and 100, for example, you could try something like this:\n", "\n", "```\n", "results = []\n", "for prefix in range(1, 101):\n", " params['recordNumber'] = '{}/*'.format(prefix)\n", " results += harvest_results(params)\n", "```\n", "\n", "Alternatively, you could break your date span up into individual years. Note that this will likely mean that you'll have duplicate records in your dataset, but that can be easily fixed in Pandas using `.drop_duplicates()`. Also note that I've compared the results of searches broken down by year with searches for a continuous data span and, for some reason, the year by year searches return fewer results. I don't know why." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make a copy of the default params\n", "query_params = params.copy()\n", "query_params['seriesNumber'] = '8333'\n", "query_params['keyword'] = 'naturalisation naturalization'\n", "# Make it an 'any' search\n", "query_params['keywordSearchTypeID'] = 2\n", "\n", "results = []\n", "for start_year in tqdm_notebook(range(1840, 1906), unit='year'):\n", " query_params['startYear'] = start_year\n", " query_params['endYear'] = start_year\n", " # This excludes records without a date\n", " query_params['includeUnknown'] = False\n", " results += harvest_results(query_params)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save the results as a CSV file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pandas makes it stupidly easy to save data as a CSV\n", "# First convert the results into a DataFrame\n", "df = pd.DataFrame(results)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# If you think there might be duplicates run this to get rid of them\n", "df.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save as a CSV with good old Shift+Enter\n", "csv_filename = 'results.csv' # change this to whatever you want\n", "df.to_csv(csv_filename, index=False)\n", "display(FileLink(csv_filename))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }