{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# How many fact sheets survived the NAA website migration in 2019?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the most recent version of the fact sheet index from the Internet Archive\n", "\n", "First we'll load the page." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Note the 'id_' after the timestamp in the URL: it returns the original page without the Internet Archive navigation.\n", "response = requests.get('https://web.archive.org/web/20190716210347id_/http://www.naa.gov.au/collection/fact-sheets/by-number/index.aspx')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Name the parser explicitly so the result doesn't depend on which parsers are installed.\n", "soup = BeautifulSoup(response.content, 'html.parser')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we'll extract the rows from the index table." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Find the index table by its title attribute and drop its first row.\n", "fs_list = soup.find('table', title='Numerical list of fact sheets').find_all('tr')[1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Look for the fact sheets\n", "\n", "Let's loop through all the rows in the fact sheet index, extracting each fact sheet's number, title and URL. Then we'll try to load each URL on the current NAA site, saving the details and the HTTP status code for further exploration." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading room addresses and hours of opening: 200\n", "Using our collection: 404\n", "Addresses of Australian archival institutions: 404\n", "Reading room rules: 404\n", "What are archives?: 404\n", "Archival terms: 200\n", "The Commonwealth Record Series (CRS) system: 200\n", "Citing archival records: 200\n", "Copyright: 200\n", "Searching for records: 404\n", "Access to records under the Archives Act: 200\n", "Viewing records in the reading room: 404\n", "What to do if we refuse you access: 200\n", "RecordSearch: an overview: 404\n", "Keyword searching in RecordSearch Advanced search screens: 404\n", "Release of records containing personal information: 200\n", "Service guidelines for the National Reference Service: 404\n", "NameSearch: 200\n", "PhotoSearch: 404\n", "Parliamentary Papers: 404\n", "Commonwealth of Australia Gazettes: 404\n", "Customs House, Sydney: 404\n", "Coastal fortifications in New South Wales: 404\n", "Commonwealth Film Unit: 404\n", "The wine industry in South Australia: 404\n", "Tasmanian railways: 404\n", "Australia First Movement: 404\n", "Commonwealth banking policy: 404\n", "Navy service records: 404\n", "Navy crew and ships records: 404\n", "RAAF service records: 404\n", "Security intelligence records held in Canberra: 404\n", "Cabinet records: 404\n", "Administration of the Australian Capital Territory: 404\n", "Military records held in Hobart: 404\n", "Maritime records held in Hobart: 404\n", "Passenger records held in Canberra: 404\n", "Civilian service in World War II: 404\n", "Research agents – Canberra: 404\n", "Research agents – Sydney: 404\n", "Research agents – Brisbane: 404\n", "Research agents – Adelaide and Darwin: 404\n", "Research agents – Melbourne and Hobart: 404\n", "Research agents – Perth: 404\n", "Why we refuse access: 200\n", "Australian Overseas Information Service photographs: 
404\n", "Papua New Guinea patrol reports: 404\n", "D Notices: 404\n", "Post Office records: 404\n", "Copying charges: 200\n", "Exempt information in ASIO records: 404\n", "Personal information in ASIO records: 404\n", "Veterans' case files: 404\n", "Fremantle Harbour: 404\n", "Passenger records held in Perth: 404\n", "Melbourne Olympics, 1956: 404\n", "World War I internee, alien and POW records held in Canberra: 404\n", "World War II internee, alien and POW records held in Canberra: 404\n", "Design and development of the national capital: 404\n", "World War II war crimes: 404\n", "Indonesian independence: 404\n", "War service information: 404\n", "Passenger records held in Sydney: 404\n", "Customs shipping records held in Sydney: 404\n", "Migrant selection documents held in Canberra: 404\n", "Boer War records: 404\n", "Naturalisation records held in Canberra: 404\n", "ASIO files on writers and literary groups: 404\n", "Prime ministers of Australia: 404\n", "Prime Minister Joseph Cook: 404\n", "Prime Minister William Morris Hughes: 404\n", "Prime Minister Stanley Melbourne Bruce: 404\n", "Prime Minister James Henry Scullin: 404\n", "Prime Minister Joseph Aloysius Lyons: 404\n", "Prime Minister Earle Christmas Grafton Page: 404\n", "Prime Minister Robert Gordon Menzies: 404\n", "Prime Minister Arthur William Fadden: 404\n", "Prime Minister John Joseph Ambrose Curtin: 404\n", "Prime Minister Francis Michael Forde: 404\n", "Prime Minister Joseph Benedict Chifley: 404\n", "Prime Minister Harold Edward Holt: 404\n", "Prime Minister John McEwen: 404\n", "Prime Minister John Grey Gorton: 404\n", "Family history sources held in Canberra: 404\n", "Family history sources held in Adelaide: 404\n", "Australia and the United Nations: 404\n", "Births, deaths and marriages: 404\n", "Cyclones and the Northern Territory: 404\n", "Coastal fortifications in South Australia: 404\n", "Customs houses in South Australia: 404\n", "Customs House, Port Adelaide, South Australia: 404\n", 
"Excise control of distilled products in South Australia: 404\n", "Walter Burley Griffin and the design of Canberra: 404\n", "J T Lang and Lang Labor: 404\n", "Regulation of beer and brewing in South Australia: 404\n", "Sir Frederick Shedden and the Shedden collection: 404\n", "Records relating to Italian migration held in Sydney: 404\n", "World War II internee, alien and POW records held in Sydney: 404\n", "The Australian flag: 404\n", "The Cocos (Keeling) Islands: 404\n", "Commonwealth electoral rolls held in Perth: 404\n", "Copyright records: 404\n", "World War I internee, alien and POW records held in Adelaide: 404\n", "World War II internee, alien and POW records held in Adelaide: 404\n", "The Pastoral industry in the Northern Territory: 404\n", "Building the provisional Parliament House: 404\n", "When to use the Freedom of Information, Archives and Privacy Acts: 200\n", "The sinking of HMAS Sydney, November 1941: 404\n", "Royal Commission into Aboriginal Deaths in Custody: 404\n", "Aboriginal and Torres Strait Islander people: 404\n", "Memorandum of Understanding with Northern Territory Aboriginal people: 404\n", "Introducing television to Australia, 1956: 404\n", "Guides to the collection: 404\n", "Australia's involvement in the Vietnam War: 404\n", "Computer resources in reading rooms: 404\n", "Commonwealth electoral rolls held in Brisbane: 404\n", "Bankruptcy records held in Sydney: 404\n", "General Sir John Monash: 404\n", "Lighthouse records held in Hobart: 404\n", "Records of British migrants held in Canberra: 404\n", "Child migration to Australia: 404\n", "Radar research in Australia during World War II: 404\n", "Radar production and use during World War II: 404\n", "War Cabinet records: 404\n", "Cabinet notebooks: 404\n", "British nuclear tests at Maralinga: 404\n", "The Royal Commission on Espionage, 1954–55: 404\n", "Posters: 404\n", "World War ll Army pay files held in Adelaide: 404\n", "Defence and service records held in Melbourne: 404\n", 
"Colonial defence personnel records held in Melbourne: 404\n", "Army administrative records held in Melbourne: 404\n", "Army service records: 404\n", "Navy administrative records held in Melbourne: 404\n", "Navy service records held in Melbourne: 404\n", "Royalty and Australian society: 404\n", "Cockatoo Island Dockyard: 404\n" ] } ], "source": [ "fact_sheets = []\n", "for row in fs_list:\n", " num = row.td.text\n", " fs = row.find('a')\n", " title = fs.text\n", " url = f'http://naa.gov.au{fs[\"href\"]}'\n", " response = requests.get(url)\n", " status = response.status_code\n", " print(f'{title}: {status}')\n", " fact_sheets.append({'number': num, 'title': title, 'url': url, 'status': status})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examine the results" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(fact_sheets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's break down the results by HTTP status code." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "404 251\n", "200 15\n", "Name: status, dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['status'].value_counts()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "94.36% of fact sheets are kaput!\n" ] } ], "source": [ "print(f'{251 / (251+15):.2%} of fact sheets are kaput!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Which fact sheets have survived?" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>number</th><th>title</th><th>url</th><th>status</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>Reading room addresses and hours of opening</td><td>http://naa.gov.au/collection/fact-sheets/fs01....</td><td>200</td></tr>\n",
"    <tr><th>5</th><td>5</td><td>Archival terms</td><td>http://naa.gov.au/collection/fact-sheets/fs05....</td><td>200</td></tr>\n",
"    <tr><th>6</th><td>6</td><td>The Commonwealth Record Series (CRS) system</td><td>http://naa.gov.au/collection/fact-sheets/fs06....</td><td>200</td></tr>\n",
"    <tr><th>7</th><td>7</td><td>Citing archival records</td><td>http://naa.gov.au/collection/fact-sheets/fs07....</td><td>200</td></tr>\n",
"    <tr><th>8</th><td>8</td><td>Copyright</td><td>http://naa.gov.au/collection/fact-sheets/fs08....</td><td>200</td></tr>\n",
"    <tr><th>10</th><td>10</td><td>Access to records under the Archives Act</td><td>http://naa.gov.au/collection/fact-sheets/fs10....</td><td>200</td></tr>\n",
"    <tr><th>12</th><td>12</td><td>What to do if we refuse you access</td><td>http://naa.gov.au/collection/fact-sheets/fs12....</td><td>200</td></tr>\n",
"    <tr><th>15</th><td>15</td><td>Release of records containing personal informa...</td><td>http://naa.gov.au/collection/fact-sheets/fs15....</td><td>200</td></tr>\n",
"    <tr><th>17</th><td>18</td><td>NameSearch</td><td>http://naa.gov.au/collection/fact-sheets/fs18....</td><td>200</td></tr>\n",
"    <tr><th>44</th><td>46</td><td>Why we refuse access</td><td>http://naa.gov.au/collection/fact-sheets/fs46....</td><td>200</td></tr>\n",
"    <tr><th>49</th><td>51</td><td>Copying charges</td><td>http://naa.gov.au/collection/fact-sheets/fs51....</td><td>200</td></tr>\n",
"    <tr><th>106</th><td>110</td><td>When to use the Freedom of Information, Archiv...</td><td>http://naa.gov.au/collection/fact-sheets/fs110...</td><td>200</td></tr>\n",
"    <tr><th>170</th><td>175</td><td>Bringing Them Home name index</td><td>http://naa.gov.au/collection/fact-sheets/fs175...</td><td>200</td></tr>\n",
"    <tr><th>190</th><td>195</td><td>The bombing of Darwin</td><td>http://naa.gov.au/collection/fact-sheets/fs195...</td><td>200</td></tr>\n",
"    <tr><th>214</th><td>220</td><td>Passenger arrivals index</td><td>http://naa.gov.au/collection/fact-sheets/fs220...</td><td>200</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>\n", "
" ], "text/plain": [ " number title \\\n", "0 1 Reading room addresses and hours of opening \n", "5 5 Archival terms \n", "6 6 The Commonwealth Record Series (CRS) system \n", "7 7 Citing archival records \n", "8 8 Copyright \n", "10 10 Access to records under the Archives Act \n", "12 12 What to do if we refuse you access \n", "15 15 Release of records containing personal informa... \n", "17 18 NameSearch \n", "44 46 Why we refuse access \n", "49 51 Copying charges \n", "106 110 When to use the Freedom of Information, Archiv... \n", "170 175 Bringing Them Home name index \n", "190 195 The bombing of Darwin \n", "214 220 Passenger arrivals index \n", "\n", " url status \n", "0 http://naa.gov.au/collection/fact-sheets/fs01.... 200 \n", "5 http://naa.gov.au/collection/fact-sheets/fs05.... 200 \n", "6 http://naa.gov.au/collection/fact-sheets/fs06.... 200 \n", "7 http://naa.gov.au/collection/fact-sheets/fs07.... 200 \n", "8 http://naa.gov.au/collection/fact-sheets/fs08.... 200 \n", "10 http://naa.gov.au/collection/fact-sheets/fs10.... 200 \n", "12 http://naa.gov.au/collection/fact-sheets/fs12.... 200 \n", "15 http://naa.gov.au/collection/fact-sheets/fs15.... 200 \n", "17 http://naa.gov.au/collection/fact-sheets/fs18.... 200 \n", "44 http://naa.gov.au/collection/fact-sheets/fs46.... 200 \n", "49 http://naa.gov.au/collection/fact-sheets/fs51.... 200 \n", "106 http://naa.gov.au/collection/fact-sheets/fs110... 200 \n", "170 http://naa.gov.au/collection/fact-sheets/fs175... 200 \n", "190 http://naa.gov.au/collection/fact-sheets/fs195... 200 \n", "214 http://naa.gov.au/collection/fact-sheets/fs220... 
200 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[df['status'] == 200]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save the results as a CSV" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "df.to_csv('data/fact_sheets.csv', index=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }