{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvesting Commonwealth Hansard\n", "\n", "The proceedings of Australia's Commonwealth Parliament are recorded in Hansard, which is available online through the Parliamentary Library's ParlInfo database. [Results in ParlInfo](https://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;adv=yes;orderBy=_fragment_number,doc_date-rev;query=Dataset:hansardr,hansardr80;resCount=Default) are generated from well-structured XML files which can be downloaded individually from the web interface – one XML file for each sitting day. This notebook shows you how to download the XML files for large-scale analysis. It's an updated version of the code I used to harvest Hansard in 2016.\n", "\n", "**If you just want the data, a full harvest of the XML files for both houses between 1901–1980 and 1998–2005 [is available in this repository](https://github.com/wragge/hansard-xml). XML files are not currently available for 1981 to 1997. Open Australia provides access to [Hansard XML files from 2006 onwards](http://data.openaustralia.org.au/).**\n", "\n", "The XML files are published on the Australian Parliament website [under a CC-BY-NC-ND licence](https://www.aph.gov.au/Help/Disclaimer_Privacy_Copyright#c).\n", "\n", "## Method\n", "\n", "When you search in ParlInfo, your results point to fragments within a day's proceedings. Multiple fragments will be drawn from a single XML file, so there are many more results than there are files. The first step in harvesting the XML files is to work through the results for each year, scraping links to the XML files from the HTML pages and discarding any duplicates. The `harvest_year()` function below does this. These lists of links are saved as CSV files – one for each house and year. You can view the CSV files in the `data` directory.\n", "\n", "Once you have a list of XML urls for both houses across all years, you can simply use the urls to download the XML files.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import what we need" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "import os\n", "import time\n", "import math\n", "import requests\n", "import requests_cache\n", "import arrow\n", "import pandas as pd\n", "from tqdm.auto import tqdm\n", "from bs4 import BeautifulSoup\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "\n", "\n", "s = requests_cache.CachedSession()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])\n", "s.mount('https://', HTTPAdapter(max_retries=retries))\n", "s.mount('http://', HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set your output directory\n", "\n", "This is where all the harvested data will go." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "output_dir = 'data'\n", "os.makedirs(output_dir, exist_ok=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define the base ParlInfo urls\n", "\n", "These are the basic templates for searches in ParlInfo. Later on we'll insert a date range in the `query` slot to filter by year, and increment the `page` value to work through the complete set of results."
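] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the templates a little more concrete, here's a rough sketch of how the `query` and `page` slots get filled in. The template below is just a copy of the 'hofreps' url defined in the next cell, and the 1901 date range is only an example – the `harvest_year()` function further down builds these values automatically for each year it processes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch only -- not part of the harvest itself.\n", "# The template is a copy of the 'hofreps' url defined in the next cell,\n", "# and the query is the url-encoded form of 'Date:01/01/1901 >> 31/12/1901'.\n", "sample_template = (\n", "    'http://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;'\n", "    'adv=yes;orderBy=date-eLast;page={page};'\n", "    'query={query}%20Dataset%3Ahansardr,hansardr80;resCount=100')\n", "sample_query = 'Date%3A01%2F01%2F1901%20>>%2031%2F12%2F1901'\n", "# Insert the query and the first page number into the template\n", "print(sample_template.format(query=sample_query, page=0))"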
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# Years you want to harvest\n", "# Note that no XML files are available for the years 1981 to 1997, so harvests of this period will fail\n", "START_YEAR = 1901\n", "END_YEAR = 2005\n", "\n", "URLS = {\n", "    'hofreps': (\n", "        'http://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;'\n", "        'adv=yes;orderBy=date-eLast;page={page};'\n", "        'query={query}%20Dataset%3Ahansardr,hansardr80;resCount=100'),\n", "    'senate': (\n", "        'http://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;'\n", "        'adv=yes;orderBy=date-eLast;page={page};'\n", "        'query={query}%20Dataset%3Ahansards,hansards80;resCount=100')\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define some functions to do the work" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_total_results(house, query):\n", "    '''\n", "    Get the total number of results in the search.\n", "    '''\n", "    # Insert query and page values into the ParlInfo url\n", "    url = URLS[house].format(query=query, page=0)\n", "    # Get the results page\n", "    response = s.get(url)\n", "    # Parse the HTML\n", "    soup = BeautifulSoup(response.text)\n", "    try:\n", "        # Find where the total results are given in the HTML\n", "        summary = soup.find('div', 'resultsSummary').contents[1].string\n", "        # Extract the number of results from the string\n", "        total = re.search(r'of (\\d+)', summary).group(1)\n", "    except AttributeError:\n", "        total = 0\n", "    return int(total)\n", "\n", "def get_xml_url(url):\n", "    '''\n", "    Extract the XML file url from an individual result.\n", "    '''\n", "    # Load the page for an individual result\n", "    response = s.get(url)\n", "    # Parse the HTML\n", "    soup = BeautifulSoup(response.text)\n", "    # Find the XML url by looking for a pattern in the href\n", "    try:\n", "        xml_url = soup.find('a', href=re.compile('toc_unixml'))['href']\n", "    except TypeError:\n", "        xml_url = None\n", "    if not response.from_cache:\n", "        time.sleep(1)\n", "    return xml_url\n", "\n", "def harvest_year(house, year):\n", "    '''\n", "    Loop through a search by house and year, finding all the urls for XML files.\n", "    '''\n", "    # Format the start and end dates\n", "    start_date = '01%2F01%2F{}'.format(year)\n", "    end_date = '31%2F12%2F{}'.format(year)\n", "    # Prepare the query value using the start and end dates\n", "    query = 'Date%3A{}%20>>%20{}'.format(start_date, end_date)\n", "    # Get the total results\n", "    total_results = get_total_results(house, query)\n", "    xml_urls = []\n", "    dates = []\n", "    found_dates = []\n", "    if total_results > 0:\n", "        # Calculate the number of pages in the results set\n", "        num_pages = int(math.ceil(total_results / 100))\n", "        # Loop through the page range\n", "        for page in tqdm(range(0, num_pages + 1), desc=str(year), leave=False):\n", "            # Get the next page of results\n", "            url = URLS[house].format(query=query, page=page)\n", "            response = s.get(url)\n", "            # Parse the HTML\n", "            soup = BeautifulSoup(response.text)\n", "            # Find the list of results and loop through them\n", "            for result in tqdm(soup.find_all('div', 'resultContent'), leave=False):\n", "                # Try to identify the date\n", "                try:\n", "                    date = re.search(r'Date: (\\d{2}\\/\\d{2}\\/\\d{4})', result.find('div', 'sumMeta').get_text()).group(1)\n", "                    date = arrow.get(date, 'DD/MM/YYYY').format('YYYY-MM-DD')\n", "                except AttributeError:\n", "                    # There are some dodgy dates -- we'll just ignore them\n", "                    date = None\n", "                # If 
there's a date, and we haven't seen it already, we'll grab the details\n", "                if date and date not in dates:\n", "                    found_dates.append(date)\n", "                    # Get the link to the individual result page\n", "                    # This is where the XML file links live\n", "                    result_link = result.find('div', 'sumLink').a['href']\n", "                    # Get the XML file link from the individual record page\n", "                    xml_url = get_xml_url(result_link)\n", "                    if xml_url:\n", "                        dates.append(date)\n", "                        # Save dates and links\n", "                        xml_urls.append({'date': date, 'url': 'https://parlinfo.aph.gov.au{}'.format(xml_url)})\n", "            if not response.from_cache:\n", "                time.sleep(1)\n", "    for f_date in list(set(found_dates)):\n", "        if f_date not in dates:\n", "            xml_urls.append({'date': f_date, 'url': ''})\n", "    return xml_urls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Harvest all the XML file links\n", "\n", "This loops through the full range of years for both houses, saving the XML file urls for each house and year to a CSV file in the output directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for house in ['hofreps', 'senate']:\n", "    for year in range(START_YEAR, END_YEAR + 1):\n", "        xml_urls = harvest_year(house, year)\n", "        df = pd.DataFrame(xml_urls)\n", "        df.to_csv(os.path.join(output_dir, '{}-{}-files.csv'.format(house, year)), index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download all the XML files\n", "\n", "This opens up each house/year list of file links and downloads the XML files. The directory structure is simple:\n", "\n", "```\n", "-- output directory ('data' by default)\n", "   -- hofreps\n", "      -- 1901\n", "         -- XML files...\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for house in ['hofreps', 'senate']:\n", "    for year in range(START_YEAR, END_YEAR + 1):\n", "        output_path = os.path.join(output_dir, house, str(year))\n", "        os.makedirs(output_path, exist_ok=True)\n", "        df = pd.read_csv(os.path.join(output_dir, '{}-{}-files.csv'.format(house, year)))\n", "        for row in tqdm(df.itertuples(), desc=str(year), leave=False):\n", "            if pd.notnull(row.url):\n", "                filename = re.search(r'(?:%20)*([\\w\\(\\)-]+?\\.xml)', row.url).group(1)\n", "                # Some of the later files don't include the date in the filename so we'll add it.\n", "                if filename[:4] != str(year):\n", "                    filename = f'{row.date}_{filename}'\n", "                filepath = os.path.join(output_path, filename)\n", "                if not os.path.exists(filepath):\n", "                    response = s.get(row.url)\n", "                    with open(filepath, 'w') as xml_file:\n", "                        xml_file.write(response.text)\n", "                    if not response.from_cache:\n", "                        time.sleep(1)" ] }
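, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick optional check that the downloads have worked, the cell below counts the XML files saved for each house. It's just a rough sketch that assumes the directory layout described above, and it isn't needed for the rest of the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import glob\n", "\n", "# Optional sanity check -- count the downloaded XML files for each house.\n", "# Assumes the layout described above: output directory / house / year / XML files\n", "for house in ['hofreps', 'senate']:\n", "    xml_files = glob.glob(os.path.join(output_dir, house, '*', '*.xml'))\n", "    print('{}: {} XML files'.format(house, len(xml_files)))" ] }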
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarise the data\n", "\n", "This just merges all the house/year lists into one big list, adding columns for house and year. It saves the results as a CSV file. This will be useful to analyse things like the number of sitting days per year.\n", "\n", "The fields in the CSV file are:\n", "\n", "* `date` – date of sitting day in YYYY-MM-DD format\n", "* `url` – url for XML file (where available)\n", "* `year` – year of the sitting day\n", "* `house` – 'hofreps' or 'senate'\n", "\n", "Here are the results of my harvest from 1901 to 2005: [all-sitting-days.csv](data/all-sitting-days.csv)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "dfs = []\n", "for house in ['hofreps', 'senate']:\n", "    for year in range(START_YEAR, END_YEAR + 1):\n", "        year_df = pd.read_csv(os.path.join(output_dir, '{}-{}-files.csv'.format(house, year)))\n", "        year_df['year'] = year\n", "        year_df['house'] = house\n", "        dfs.append(year_df)\n", "# Use concat() rather than the deprecated DataFrame.append()\n", "df = pd.concat(dfs, ignore_index=True)\n", "df.sort_values(by=['house', 'date'], inplace=True)\n", "df.to_csv(os.path.join(output_dir, 'all-sitting-days.csv'), index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Zip up each year individually\n", "\n", "For convenience, you can zip up each year individually." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from shutil import make_archive\n", "\n", "for house in ['hofreps', 'senate']:\n", "    xml_path = os.path.join(output_dir, house)\n", "    for year in [d for d in os.listdir(xml_path) if d.isnumeric()]:\n", "        year_path = os.path.join(xml_path, year)\n", "        make_archive(year_path, 'zip', year_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }