{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvest parliament press releases from Trove" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Trove includes more than 370,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. You can view them all in Trove by searching for [nuc:\"APAR:PR\" in the journals zone](http://trove.nla.gov.au/article/result?q=nuc%3A%22APAR%3APR%22).\n", "\n", "This notebook shows you how to harvest both metadata and full text from a search of the parliamentary press releases. The metadata is available from Trove, but to get the full text we have to go back to the Parliamentary Library's database, ParlInfo. The code in this notebook updates [my original GitHub repository](https://github.com/wragge/trove-parliament-pressreleases).\n", "\n", "There are two main steps:\n", "\n", "* Use the Trove API to search for specific keywords within the press releases and harvest metadata from the results. This gives us urls that we can use to get the text of the press releases from ParlInfo.\n", "* Use the harvested urls to retrieve the press release from ParlInfo. The text of each release is extracted from the HTML page and saved as a plain text file.\n", "\n", "Sometimes multiple press releases can be grouped together as 'works' in Trove. This is because Trove thinks that they're versions of the same thing. Indeed, there *are* multiple versions of some press releases. For example, sometimes the office of a Minister and the Minister's department both issue a copy of the same press release or transcript. But these versions are not always identical, and sometimes Trove has grouped press releases together incorrectly. To make sure that we harvest as many individual press releases as possible, the code below unpacks any versions contained within a 'work' and turns them into individual records. This means there will be more duplicates, but it also means you can explore how the versions might differ.\n", "\n", "It looks like the earlier documents have been OCRd and the results are quite variable. If you follow the `fulltext_url` link you should be able to view a PDF version for comparison.\n", "\n", "It also seems that some documents only have a PDF version and not any OCRd text. These documents will be ignored by the `save_texts()` function, so you might end up with fewer texts than records.\n", "\n", "The copyright statement attached to each record in Trove reads:\n", "\n", "> Copyright remains with the copyright holder. Contact the Australian Copyright Council for further information on your rights and responsibilities.\n", "\n", "So depending on what you want to do with them, you might need to contact individual copyright holders for permission." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## An example – politicians talking about 'immigrants' and 'refugees'\n", "\n", "I've used this notebook to update an example dataset relating to refugees that I first generated in December 2017. It's been created by searching for the terms 'immigrant', 'asylum seeker', 'boat people', 'illegal arrivals', and 'boat arrivals' amongst the press releases. 
The exact query used is:\n", "\n", "```\n", "nuc:\"APAR:PR\" AND (\"illegal arrival\" OR text:\"immigrant\" OR text:\"immigrants\" OR \"asylum seeker\" OR \"boat people\" OR refugee OR \"boat arrivals\")\n", "```\n", "\n", "You can view the [results of this query on Trove](http://trove.nla.gov.au/article/result?q=nuc%3A%22APAR%3APR%22+AND+%28%22illegal+arrival%22+OR+text%3A%22immigrant%22+OR+text%3A%22immigrants%22+OR+%22asylum+seeker%22+OR+%22boat+people%22+OR+refugee+OR+%22boat+arrivals%22%29).\n", "\n", "After unpacking the versions and harvesting available texts I ended up with 12,619 text files. You can [browse the files on CloudStor](https://cloudstor.aarnet.edu.au/plus/s/Msoj978Zlrud40g), or download the complete dataset as a [zip file (43MB)](https://cloudstor.aarnet.edu.au/plus/s/diwB0tnpv1X1ORN)." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Set your options\n", "\n", "In the cell below you need to insert your search query and your Trove API key.\n", "\n", "The search query can be anything you would enter in the Trove search box. As you can see from the examples below, it can include keywords, exact phrases, and Boolean operators (`AND`, `OR`, and `NOT`).\n", "\n", "You can get a Trove API key by [following these instructions](https://help.nla.gov.au/trove/building-with-trove/api).\n", "\n", "You can change `output_dir` to save the results to a specific directory on your machine." ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Insert your query between the single quotes.\n", "# query = '\"illegal arrival\" OR text:\"immigrant\" OR text:\"immigrants\" OR \"asylum seeker\" OR \"boat people\" OR refugee OR \"boat arrivals\"'\n", "query = 'atomic'\n", "# Insert your Trove API key between the single quotes\n", "api_key = 'YOUR API KEY GOES HERE'\n", "# You don't have to change this\n", "output_dir = 'press-releases'" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Import the libraries we'll need" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import time\n", "from requests.exceptions import HTTPError\n", "from bs4 import BeautifulSoup\n", "from slugify import slugify\n", "import pandas as pd\n", "from datetime import datetime\n", "import os\n", "import shutil\n", "from IPython.display import display, HTML, FileLink\n", "from tqdm import tqdm_notebook\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "\n", "# Set up a session that retries failed requests on server errors\n", "s = requests.Session()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount('https://', HTTPAdapter(max_retries=retries))\n", "s.mount('http://', HTTPAdapter(max_retries=retries))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Define some functions to do the work" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def get_total_results(params):\n", "    '''\n", "    Get the total number of results for a search.\n", "    '''\n", "    these_params = params.copy()\n", "    these_params['n'] = 0\n", "    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)\n", "    data = response.json()\n", "    return int(data['response']['zone'][0]['records']['total'])\n", "\n", "def get_fulltext_url(links):\n", "    '''\n", "    Loop through the identifiers to find a link to the digital version of the document.\n", "    '''\n", "    url = None\n", "    for link in links:\n", "        if link['linktype'] == 
'fulltext':\n", "            url = link['value']\n", "            break\n", "    return url\n", "\n", "def get_source(version):\n", "    '''\n", "    Get the metadata source of a version.\n", "    '''\n", "    if 'metadataSource' in version:\n", "        try:\n", "            source = version['metadataSource']['value']\n", "        except TypeError:\n", "            # Sometimes metadataSource is a plain string rather than a dict\n", "            try:\n", "                source = version['metadataSource']\n", "            except TypeError:\n", "                print(version)\n", "                source = None\n", "        except KeyError:\n", "            source = None\n", "    else:\n", "        source = None\n", "    return source\n", "\n",
"def harvest_prs(query, api_key):\n", "    '''\n", "    Harvest details of parliamentary press releases using the Trove API.\n", "    This function saves the 'version' level records individually (these are grouped under 'works').\n", "    '''\n", "    # Define parameters for the search -- you could change this of course\n", "    # The nuc:\"APAR:PR\" limits the results to the Parliamentary Press Releases\n", "    params = {\n", "        'q': 'nuc:\"APAR:PR\" AND ({})'.format(query),\n", "        'zone': 'article',\n", "        'n': 100,\n", "        'key': api_key,\n", "        'bulkHarvest': 'true',\n", "        'encoding': 'json',\n", "        'include': 'workVersions',\n", "        'l-availability': 'y'\n", "    }\n", "    start = '*'\n", "    total = get_total_results(params)\n", "    records = []\n", "    url = 'https://api.trove.nla.gov.au/v2/result'\n", "    with tqdm_notebook(total=total) as pbar:\n", "        while start:\n", "            params['s'] = start\n", "            response = s.get(url, params=params)\n", "            data = response.json()\n", "            # If there's a nextStart value then we use it to request the next page of results\n", "            try:\n", "                start = data['response']['zone'][0]['records']['nextStart']\n", "            except KeyError:\n", "                start = None\n", "            for work in data['response']['zone'][0]['records']['work']:\n", "                # Different records can be grouped within works as versions.\n", "                # So we're going to extract each version as a separate record.\n", "                for version in work['version']:\n", "                    # Sometimes there are even versions grouped together in a version... 
¯\\_(ツ)_/¯\n", "                    # We need to extract their ids from a single string\n", "                    ids = version['id'].split()\n", "                    # This may or may not be a list...\n", "                    if isinstance(version['record'], list):\n", "                        version_records = version['record']\n", "                    else:\n", "                        version_records = [version['record']]\n", "                    # Loop through versions in versions.\n", "                    for index, record in enumerate(version_records):\n", "                        source = get_source(record)\n", "                        if source == 'APAR:PR':\n", "                            # Add the id to the version record\n", "                            record['version_id'] = ids[index]\n", "                            record = clean_metadata(record)\n", "                            records.append(record)\n", "            # Try to avoid hitting the API request limit\n", "            pbar.update(100)\n", "            time.sleep(0.2)\n", "    return records\n", "\n",
"def stringify_values(version, field):\n", "    '''\n", "    If a value is a list, join it into a pipe-separated string.\n", "    Otherwise just return the string value.\n", "    '''\n", "    try:\n", "        if isinstance(version[field], list):\n", "            values = [str(v) for v in version[field]]\n", "            value = '|'.join(values)\n", "        else:\n", "            value = version.get(field, '')\n", "    except KeyError:\n", "        value = ''\n", "    return value\n", "\n",
"def clean_metadata(version):\n", "    '''\n", "    Standardises, cleans, and stringifies record metadata.\n", "    '''\n", "    record = {}\n", "    record['version_id'] = version['version_id']\n", "    record['title'] = version.get('title')\n", "    record['date'] = version.get('date')\n", "    # Flatten any list values into pipe-separated strings\n", "    record['creators'] = stringify_values(version, 'creator')\n", "    record['subjects'] = stringify_values(version, 'subject')\n", "    record['source'] = stringify_values(version, 'source')\n", "    # Get the fulltext url from the list of identifiers\n", "    try:\n", "        record['fulltext_url'] = get_fulltext_url(version['identifier'])\n", "    except KeyError:\n", "        record['fulltext_url'] = ''\n", "    record['trove_url'] = 'https://trove.nla.gov.au/version/{}'.format(version['version_id'])\n", "    return record\n", "\n",
"def save_texts(records, output_dir, query):\n", "    '''\n", "    Get the text of press releases from the ParlInfo db.\n", "    This function uses urls harvested from Trove to request press releases from ParlInfo.\n", "    Text is extracted from the HTML files and saved as individual text files.\n", "    '''\n", "    # Loop through all the previously harvested records\n", "    for record in tqdm_notebook(records):\n", "        # Skip records that don't include a link to the full text\n", "        if not record['fulltext_url']:\n", "            continue\n", "        output_path = os.path.join(output_dir, 'press-releases-{}'.format(slugify(query)), 'texts')\n", "        os.makedirs(output_path, exist_ok=True)\n", "        filename = '{}-{}-{}.txt'.format(record['date'], slugify(record['creators']), record['version_id'])\n", "        file_path = os.path.join(output_path, filename)\n", "        # Only save files we haven't saved before\n", "        if not os.path.exists(file_path):\n", "            # Get the ParlInfo web page\n", "            response = s.get(record['fulltext_url'])\n", "            # Parse web page in Beautiful Soup\n", "            soup = BeautifulSoup(response.text, 'lxml')\n", "            content = soup.find('div', class_='box')\n", "            # If we find some text on the web page then save it.\n", "            if content:\n", "                with open(file_path, 'w', encoding='utf-8') as text_file:\n", "                    # Get the contents of each paragraph and write it to the file\n", "                    for para in content.find_all('p'):\n", "                        text_file.write('{}\\n\\n'.format(para.get_text().strip()))\n", "            time.sleep(0.5)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Harvest the metadata!\n", "\n", "Running the cell below will harvest details of all the press releases matching our query 
using the Trove API. The results will be saved in the `records` variable for further use." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2e4c9212bbdb4254847e6e8a9822eb45", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=1419), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "records = harvest_prs(query, api_key)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save the harvested metadata\n", "\n", "The cells below convert the `records` variable into a Pandas DataFrame, have a little peek inside, and then save all the harvested metadata as a CSV formatted text file. This file provides an index to the harvested press releases." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
creatorsdatefulltext_urlsourcesubjectstitletrove_urlversion_id
0Evans, Gareth1991-09-16http://parlinfo.aph.gov.au/parlInfo/search/dis...Minister for Foreign Affairs and TradeThe International Atomic Energy Agency and the...https://trove.nla.gov.au/version/214098272214098272
1ALP1902-01-01http://parlinfo.aph.gov.au/parlInfo/search/dis...AUSTRALIAN LABOR PARTYHistory of the Federal Capital and Parliament ...Australian Labor Party: 2nd Commonwealth Confe...https://trove.nla.gov.au/version/211168619211168619
2ALP1964-04-27http://parlinfo.aph.gov.au/parlInfo/search/dis...LEADER OF THE OPPOSITIONOutside control of the liberal partyhttps://trove.nla.gov.au/version/211168681211168681
3ALP1965-06-10http://parlinfo.aph.gov.au/parlInfo/search/dis...LEADER OF THE OPPOSITIONDecisions of the federal executivehttps://trove.nla.gov.au/version/211168736211168736
4ALP1964-11-09http://parlinfo.aph.gov.au/parlInfo/search/dis...LEADER OF THE OPPOSITIONDecisions of the federal executivehttps://trove.nla.gov.au/version/211168720211168720
\n", "
" ], "text/plain": [ " creators date \\\n", "0 Evans, Gareth 1991-09-16 \n", "1 ALP 1902-01-01 \n", "2 ALP 1964-04-27 \n", "3 ALP 1965-06-10 \n", "4 ALP 1964-11-09 \n", "\n", " fulltext_url \\\n", "0 http://parlinfo.aph.gov.au/parlInfo/search/dis... \n", "1 http://parlinfo.aph.gov.au/parlInfo/search/dis... \n", "2 http://parlinfo.aph.gov.au/parlInfo/search/dis... \n", "3 http://parlinfo.aph.gov.au/parlInfo/search/dis... \n", "4 http://parlinfo.aph.gov.au/parlInfo/search/dis... \n", "\n", " source \\\n", "0 Minister for Foreign Affairs and Trade \n", "1 AUSTRALIAN LABOR PARTY \n", "2 LEADER OF THE OPPOSITION \n", "3 LEADER OF THE OPPOSITION \n", "4 LEADER OF THE OPPOSITION \n", "\n", " subjects \\\n", "0 \n", "1 History of the Federal Capital and Parliament ... \n", "2 \n", "3 \n", "4 \n", "\n", " title \\\n", "0 The International Atomic Energy Agency and the... \n", "1 Australian Labor Party: 2nd Commonwealth Confe... \n", "2 Outside control of the liberal party \n", "3 Decisions of the federal executive \n", "4 Decisions of the federal executive \n", "\n", " trove_url version_id \n", "0 https://trove.nla.gov.au/version/214098272 214098272 \n", "1 https://trove.nla.gov.au/version/211168619 211168619 \n", "2 https://trove.nla.gov.au/version/211168681 211168681 \n", "3 https://trove.nla.gov.au/version/211168736 211168736 \n", "4 https://trove.nla.gov.au/version/211168720 211168720 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(records)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the number of records in the harvested data might be different to the number of search results. This is because we've unpacked versions that had been combined into a single work." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1771, 8)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How many records\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Save the data as a CSV file\n", "os.makedirs(os.path.join(output_dir, 'press-releases-{}'.format(slugify(query))), exist_ok=True)\n", "df.to_csv(os.path.join(output_dir, 'press-releases-{}'.format(slugify(query)), 'press-releases-{}.csv'.format(slugify(query))), index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download the text files\n", "\n", "The details we've harvested from the Trove API include a url that points to the full text of the press release in the ParlInfo database. Now we can loop through all those urls, saving the text of the press releases." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Only run this cell if you need to reload the harvested metadata from the CSV\n", "df = pd.read_csv(os.path.join(output_dir, 'press-releases-{}'.format(slugify(query)), 'press-releases-{}.csv'.format(slugify(query))), keep_default_na=False)\n", "records = df.to_dict('records')" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "03c94f11d0184b0baae73988eb46558a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=1771), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "save_texts(records, output_dir, query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Zip the results for easy download\n", "\n", "The metadata and text files we've harvested are all sitting in a directory named using the `query` value. If you're running this notebook on a cloud service, like Binder, you probably want to download it all. Running the cell below will zip up the whole directory and provide a convenient download link." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "Download results" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "press-releases/press-releases-atomic.zip
" ], "text/plain": [ "/Users/tim/mycode/glam-workbench/trove-journals/notebooks/press-releases/press-releases-atomic.zip" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "output_path = os.path.join(output_dir, 'press-releases-{}'.format(slugify(query)))\n", "shutil.make_archive(output_path, 'zip', output_path)\n", "display(HTML('Download results'))\n", "display(FileLink('{}.zip'.format(output_path)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/).\n", "\n", "Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/).\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }