# How many fact sheets survived the NAA website migration in 2019?

In [2]:
import requests
from bs4 import BeautifulSoup

## Get the most recent version of the fact sheet index from the Internet Archive

First we'll load the page.

In [3]:
# Note the 'id_' in the url to get the original page without the IA navigation.
response = requests.get('https://web.archive.org/web/20190716210347id_/http://www.naa.gov.au/collection/fact-sheets/by-number/index.aspx')
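As a small illustration of the `id_` modifier noted in the comment above, the following sketch assembles a Wayback Machine URL from a capture timestamp and the original address. The helper name `wayback_raw_url` is hypothetical and not part of this notebook; it only formats a string and makes no network request.

```python
def wayback_raw_url(timestamp: str, original_url: str) -> str:
    """Build a Wayback Machine URL for a capture without the IA navigation.

    Appending 'id_' to the timestamp requests the original page content.
    (Hypothetical helper, for illustration only.)
    """
    return f'https://web.archive.org/web/{timestamp}id_/{original_url}'


# Timestamp and address taken from the request used in this notebook.
index_url = wayback_raw_url(
    '20190716210347',
    'http://www.naa.gov.au/collection/fact-sheets/by-number/index.aspx',
)
print(index_url)
```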

In [4]:
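# Name the parser explicitly so results don't depend on which parsers are installed.
soup = BeautifulSoup(response.content, 'html.parser')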

Then we'll extract the rows from the index table.

In [5]:
fs_list = soup.find('table', title='Numerical list of fact sheets').find_all('tr')[1:]

## Look for the fact sheets

Let's loop through all the rows in the fact sheet index, extracting each fact sheet's number, title and URL. Then we'll try loading each URL from the current NAA site, saving the details and the HTTP status code for further exploration.

In [None]:
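fact_sheets = []
for row in fs_list:
    # The first cell of each row holds the fact sheet number.
    num = row.td.text
    # The link in the row gives the title and the path to the fact sheet.
    fs = row.find('a')
    title = fs.text
    url = f'http://naa.gov.au{fs["href"]}'
    # requests follows redirects by default, so this is the final status code.
    response = requests.get(url)
    status = response.status_code
    print(f'{title}: {status}')
    fact_sheets.append({'number': num, 'title': title, 'url': url, 'status': status})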

Reading room addresses and hours of opening: 200
Using our collection: 404
Addresses of Australian archival institutions: 404
Reading room rules: 404
What are archives?: 404
Archival terms: 200
The Commonwealth Record Series (CRS) system: 200
Citing archival records: 200
Copyright: 200
Searching for records: 404
Access to records under the Archives Act: 200
Viewing records in the reading room: 404
What to do if we refuse you access: 200
RecordSearch: an overview: 404
Keyword searching in RecordSearch Advanced search screens: 404
Release of records containing personal information: 200
Service guidelines for the National Reference Service: 404
NameSearch: 200
PhotoSearch: 404
Parliamentary Papers: 404
Commonwealth of Australia Gazettes: 404
Customs House, Sydney: 404
Coastal fortifications in New South Wales: 404
Commonwealth Film Unit: 404
The wine industry in South Australia: 404
Tasmanian railways: 404
Australia First Movement: 404
Commonwealth banking policy: 404
Navy service records
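
The loop above builds each address by string concatenation, which assumes every `href` in the index is a path starting with `/`. A slightly more defensive sketch uses `urljoin` from the standard library, which also copes with absolute links. The function name `build_fact_sheet_url` and the example paths are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'http://naa.gov.au'


def build_fact_sheet_url(href: str) -> str:
    """Resolve an href from the index against the site root.

    (Hypothetical helper; the paths below are made-up examples.)
    """
    return urljoin(BASE_URL, href)


# A root-relative path is appended to the site root.
relative = build_fact_sheet_url('/collection/fact-sheets/example.aspx')
# An absolute URL is returned unchanged.
absolute = build_fact_sheet_url('http://example.org/elsewhere.aspx')
print(relative)
print(absolute)
```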

## Examine the results

In [21]:
import pandas as pd

In [28]:
df = pd.DataFrame(fact_sheets)

Let's break down the results by HTTP status code.

In [29]:
df['status'].value_counts()

404    251
200     15
Name: status, dtype: int64
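
For readers without pandas to hand, the same kind of tally can be sketched with `collections.Counter`. The list of status codes below is made up for illustration and is not the notebook's data.

```python
from collections import Counter

# Hypothetical status codes, standing in for the 'status' values collected above.
sample_statuses = [200, 404, 404, 404, 200]

# Count how often each status code occurs, most frequent first.
status_counts = Counter(sample_statuses).most_common()
print(status_counts)
```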

In [42]:
# Calculate the share of 404s from the data rather than hard-coding the counts.
print(f'{(df["status"] == 404).mean():.2%} of fact sheets are kaput!')

94.36% of fact sheets are kaput!


## Which fact sheets have survived?

In [30]:
df.loc[df['status'] == 200]

| | number | title | url | status |
|---|---|---|---|---|
| 0 | 1 | Reading room addresses and hours of opening | http://naa.gov.au/collection/fact-sheets/fs01.... | 200 |
| 5 | 5 | Archival terms | http://naa.gov.au/collection/fact-sheets/fs05.... | 200 |
| 6 | 6 | The Commonwealth Record Series (CRS) system | http://naa.gov.au/collection/fact-sheets/fs06.... | 200 |
| 7 | 7 | Citing archival records | http://naa.gov.au/collection/fact-sheets/fs07.... | 200 |
| 8 | 8 | Copyright | http://naa.gov.au/collection/fact-sheets/fs08.... | 200 |
| 10 | 10 | Access to records under the Archives Act | http://naa.gov.au/collection/fact-sheets/fs10.... | 200 |
| 12 | 12 | What to do if we refuse you access | http://naa.gov.au/collection/fact-sheets/fs12.... | 200 |
| 15 | 15 | Release of records containing personal informa... | http://naa.gov.au/collection/fact-sheets/fs15.... | 200 |
| 17 | 18 | NameSearch | http://naa.gov.au/collection/fact-sheets/fs18.... | 200 |
| 44 | 46 | Why we refuse access | http://naa.gov.au/collection/fact-sheets/fs46.... | 200 |


## Save the results as a CSV

In [31]:
df.to_csv('data/fact_sheets.csv', index=False)
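
Note that `df.to_csv` does not create missing folders, so the call above assumes a `data/` directory already exists. The sketch below shows one way to make sure the folder is there before writing, using only the standard library. The helper name `save_rows` and the sample rows are hypothetical, and it writes to a temporary folder so it can be run safely anywhere.

```python
import csv
import tempfile
from pathlib import Path


def save_rows(rows, path):
    """Write a list of dicts to CSV, creating parent folders first.

    (Hypothetical helper, for illustration only.)
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['number', 'title', 'url', 'status'])
        writer.writeheader()
        writer.writerows(rows)


# Made-up rows in the same shape as the fact_sheets list.
sample_rows = [
    {'number': '1', 'title': 'Example A', 'url': 'http://example.org/a', 'status': 200},
    {'number': '2', 'title': 'Example B', 'url': 'http://example.org/b', 'status': 404},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'data' / 'fact_sheets.csv'
    save_rows(sample_rows, out_path)
    # Read the file back to confirm what was written.
    with out_path.open(newline='', encoding='utf-8') as f:
        rows_read = list(csv.DictReader(f))

print(len(rows_read))
```

When the file is read back with `csv.DictReader`, every value is a string, so the status codes would need converting to integers before any numeric comparison.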