{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get details of indexes\n",
    "\n",
    "This notebook scrapes details of the available indexes from the NSW State Archives [Subjects A to Z](https://mhnsw.au/archive/subjects/?filter=indexes) page and saves the results to a CSV file.\n",
    "\n",
    "Once you've harvested the index details, you can use them to [harvest the content](harvest-indexes.ipynb) of all the individual indexes.\n",
    "\n",
    "Here's the [indexes.csv](indexes.csv) I harvested in May 2023.\n",
    "\n",
    "The fields in the CSV file are:\n",
    "\n",
    "* `title` – index title\n",
    "* `url` – link to the index's web page\n",
    "* `description` – brief description of the index\n",
    "* `category` – subject category this index belongs to (e.g. 'Convicts')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import what we need"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "import pandas as pd\n",
    "import requests_cache\n",
    "from bs4 import BeautifulSoup\n",
    "from requests.adapters import HTTPAdapter\n",
    "from urllib3.util.retry import Retry\n",
    "\n",
    "# Cache responses and retry requests that fail with a 502, 503, or 504 error\n",
    "s = requests_cache.CachedSession()\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
    "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n",
    "s.mount(\"https://\", HTTPAdapter(max_retries=retries))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define our functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def get_categories():\n",
    "    \"\"\"\n",
    "    Scrape a list of subject categories containing indexes from the Subjects A-Z page.\n",
    "    Returns a list of dicts with keys:\n",
    "        - `category` -- name of category\n",
    "        - `url` -- link to category page\n",
    "    \"\"\"\n",
    "    categories = []\n",
    "    # Get the Subjects A-Z page filtered to categories containing indexes\n",
    "    response = s.get(\"https://mhnsw.au/archive/subjects/?filter=indexes\")\n",
    "    # Name the parser explicitly so results don't depend on what's installed\n",
    "    soup = BeautifulSoup(response.text, \"html.parser\")\n",
    "    # Get the div containing the category list\n",
    "    category_list = soup.find(\"div\", class_=re.compile(\"^styles_rows__\"))\n",
    "    # Loop through each category div saving details\n",
    "    for row in category_list.find_all(\"div\", id=re.compile(\"^row-\")):\n",
    "        link = row.find(\"a\")\n",
    "        categories.append(\n",
    "            {\"category\": link.string, \"url\": f\"https://mhnsw.au{link['href']}\"}\n",
    "        )\n",
    "    return categories\n",
    "\n",
    "\n",
    "def get_indexes(categories):\n",
    "    \"\"\"\n",
    "    Scrape a list of indexes for each category in the list of categories.\n",
    "    Parameters: `categories` -- list of categories\n",
    "    Returns a list of dicts with keys:\n",
    "        - `title` -- title of index\n",
    "        - `url` -- link to index page\n",
    "        - `description` -- brief description of index\n",
    "        - `category` -- name of category\n",
    "    \"\"\"\n",
    "    indexes = []\n",
    "    # Loop through list of categories\n",
    "    for category in categories:\n",
    "        # Get the category page\n",
    "        response = s.get(category[\"url\"])\n",
    "        soup = BeautifulSoup(response.text, \"html.parser\")\n",
    "        # Find the div containing the list of indexes\n",
    "        index_list = soup.find(\"div\", class_=re.compile(\"^styles_rows__\"))\n",
    "        # Loop through divs containing index info\n",
    "        for row in index_list.find_all(\"div\", id=re.compile(\"^row-undefined\")):\n",
    "            link = row.find(\"a\")\n",
    "            # Get description\n",
    "            description = row.find(\n",
    "                \"div\", class_=re.compile(\"^styles_content__\")\n",
    "            ).get_text()\n",
    "            indexes.append(\n",
    "                {\n",
    "                    \"title\": link.string,\n",
    "                    \"url\": f\"https://mhnsw.au{link['href']}\",\n",
    "                    \"description\": description,\n",
    "                    \"category\": category[\"category\"],\n",
    "                }\n",
    "            )\n",
    "    return indexes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Harvest the index details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Harvest list of categories\n",
    "categories = get_categories()\n",
    "# Harvest list of indexes from categories\n",
    "indexes = get_indexes(categories)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert to a dataframe and save as a CSV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "
<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "      <th>description</th>\n",
       "      <th>category</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Colonial (Government) Architect index 1837-1970</td>\n",
       "      <td>https://mhnsw.au/indexes/architecture-and-desi...</td>\n",
       "      <td>Designed for researching the history of public...</td>\n",
       "      <td>Architecture &amp; design</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Infirm &amp; destitute (Government) asylums index ...</td>\n",
       "      <td>https://mhnsw.au/indexes/asylums/infirm-destit...</td>\n",
       "      <td>This index relates to persons admitted to Gove...</td>\n",
       "      <td>Asylums</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Bankruptcy index 1888-1929</td>\n",
       "      <td>https://mhnsw.au/indexes/bankruptcy-and-insolv...</td>\n",
       "      <td>Bankruptcy is a state in which a person is una...</td>\n",
       "      <td>Bankruptcy &amp; insolvency</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Insolvency index 1842-1887</td>\n",
       "      <td>https://mhnsw.au/indexes/bankruptcy-and-insolv...</td>\n",
       "      <td>Insolvency is the inability to pay debts or me...</td>\n",
       "      <td>Bankruptcy &amp; insolvency</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Bubonic plague index 1900-1908</td>\n",
       "      <td>https://mhnsw.au/indexes/bubonic-plague/buboni...</td>\n",
       "      <td>The Register of Cases of Bubonic Plague 1900-1...</td>\n",
       "      <td>Bubonic plague</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>
" ], "text/plain": [ " title \n", "0 Colonial (Government) Architect index 1837-1970 \\\n", "1 Infirm & destitute (Government) asylums index ... \n", "2 Bankruptcy index 1888-1929 \n", "3 Insolvency index 1842-1887 \n", "4 Bubonic plague index 1900-1908 \n", "\n", " url \n", "0 https://mhnsw.au/indexes/architecture-and-desi... \\\n", "1 https://mhnsw.au/indexes/asylums/infirm-destit... \n", "2 https://mhnsw.au/indexes/bankruptcy-and-insolv... \n", "3 https://mhnsw.au/indexes/bankruptcy-and-insolv... \n", "4 https://mhnsw.au/indexes/bubonic-plague/buboni... \n", "\n", " description category \n", "0 Designed for researching the history of public... Architecture & design \n", "1 This index relates to persons admitted to Gove... Asylums \n", "2 Bankruptcy is a state in which a person is una... Bankruptcy & insolvency \n", "3 Insolvency is the inability to pay debts or me... Bankruptcy & insolvency \n", "4 The Register of Cases of Bubonic Plague 1900-1... Bubonic plague " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert to a Pandas dataframe\n", "df = pd.DataFrame(indexes)\n", "\n", "# Peek inside\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "# Save as a CSV file\n", "df.to_csv(\"indexes.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" } }, "nbformat": 4, "nbformat_minor": 4 }