{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get details of indexes\n",
    "\n",
    "This notebook scrapes details of available indexes from the NSW State Archives [Subjects A to Z](https://mhnsw.au/archive/subjects/?filter=indexes) page. It saves the results as a CSV formatted file.\n",
    "\n",
    "Once you've harvested the index details, you can use them to [harvest the content](harvest-indexes.ipynb) of all the individual indexes.\n",
    "\n",
    "Here's the [indexes.csv](indexes.csv) I harvested in May 2023.\n",
    "\n",
    "The fields in the CSV file are:\n",
    "\n",
    "* `title` – index title\n",
    "* `url` – link to the index's web page\n",
    "* `description` – brief description of the index\n",
    "* `category` – subject category this index belongs to (eg: 'Convicts')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import what we need"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "import pandas as pd\n",
    "import requests_cache\n",
    "from bs4 import BeautifulSoup\n",
    "from requests.adapters import HTTPAdapter\n",
    "from requests.packages.urllib3.util.retry import Retry\n",
    "\n",
    "s = requests_cache.CachedSession()\n",
    "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n",
    "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n",
    "s.mount(\"https://\", HTTPAdapter(max_retries=retries))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define our functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def get_categories():\n",
    "    \"\"\"\n",
    "    Scrape a list of subject categories containing indexes from the Subjects A-Z page.\n",
    "    Returns a list of dicts with keys:\n",
    "      - `category` -- name of category\n",
    "      - `url` -- link to category page\n",
    "    \"\"\"\n",
    "    categories = []\n",
    "    # Get the Subjects A-Z page filtered to categories containing indexes\n",
    "    response = s.get(\"https://mhnsw.au/archive/subjects/?filter=indexes\")\n",
    "    soup = BeautifulSoup(response.text)\n",
    "    # Get the div containing the category list\n",
    "    category_list = soup.find(\"div\", class_=re.compile(\"^styles_rows__\"))\n",
    "    # Loop through each category div saving details\n",
    "    for row in category_list.find_all(\"div\", id=re.compile(\"^row-\")):\n",
    "        link = row.find(\"a\")\n",
    "        categories.append(\n",
    "            {\"category\": link.string, \"url\": f\"https://mhnsw.au{link['href']}\"}\n",
    "        )\n",
    "    return categories\n",
    "\n",
    "\n",
    "def get_indexes(categories):\n",
    "    \"\"\"\n",
    "    Scrape a list of indexes for each category in the list of categories.\n",
    "    Parameters: `categories` -- list of categories\n",
    "    Returns a list of dicts with keys:\n",
    "      - `title` -- title of index\n",
    "      - `url` -- link to index page\n",
    "      - `description` -- brief description of index\n",
    "      - `category` -- name of category\n",
    "    \"\"\"\n",
    "    indexes = []\n",
    "    # Loop through list of categories\n",
    "    for category in categories:\n",
    "        # Get the category page\n",
    "        response = s.get(category[\"url\"])\n",
    "        soup = BeautifulSoup(response.text)\n",
    "        # Find the div containing the list of indexes\n",
    "        index_list = soup.find(\"div\", class_=re.compile(\"^styles_rows__\"))\n",
    "        # Loop through divs containing index info\n",
    "        for row in index_list.find_all(\"div\", id=re.compile(\"^row-undefined\")):\n",
    "            link = row.find(\"a\")\n",
    "            # Get description\n",
    "            description = row.find(\n",
    "                \"div\", class_=re.compile(\"^styles_content__\")\n",
    "            ).get_text()\n",
    "            indexes.append(\n",
    "                {\n",
    "                    \"title\": link.string,\n",
    "                    \"url\": f\"https://mhnsw.au{link['href']}\",\n",
    "                    \"description\": description,\n",
    "                    \"category\": category[\"category\"],\n",
    "                }\n",
    "            )\n",
    "    return indexes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Harvest the index details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Harvest list of categories\n",
    "categories = get_categories()\n",
    "# Harvest list of indexes from categories\n",
    "indexes = get_indexes(categories)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert to a dataframe and save as a CSV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "      <th>description</th>\n",
       "      <th>category</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Colonial (Government) Architect index 1837-1970</td>\n",
       "      <td>https://mhnsw.au/indexes/architecture-and-desi...</td>\n",
       "      <td>Designed for researching the history of public...</td>\n",
       "      <td>Architecture &amp; design</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Infirm &amp; destitute (Government) asylums index ...</td>\n",
       "      <td>https://mhnsw.au/indexes/asylums/infirm-destit...</td>\n",
       "      <td>This index relates to persons admitted to Gove...</td>\n",
       "      <td>Asylums</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Bankruptcy index 1888-1929</td>\n",
       "      <td>https://mhnsw.au/indexes/bankruptcy-and-insolv...</td>\n",
       "      <td>Bankruptcy is a state in which a person is una...</td>\n",
       "      <td>Bankruptcy &amp; insolvency</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Insolvency index 1842-1887</td>\n",
       "      <td>https://mhnsw.au/indexes/bankruptcy-and-insolv...</td>\n",
       "      <td>Insolvency is the inability to pay debts or me...</td>\n",
       "      <td>Bankruptcy &amp; insolvency</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Bubonic plague index 1900-1908</td>\n",
       "      <td>https://mhnsw.au/indexes/bubonic-plague/buboni...</td>\n",
       "      <td>The Register of Cases of Bubonic Plague 1900-1...</td>\n",
       "      <td>Bubonic plague</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               title   \n",
       "0    Colonial (Government) Architect index 1837-1970  \\\n",
       "1  Infirm & destitute (Government) asylums index ...   \n",
       "2                         Bankruptcy index 1888-1929   \n",
       "3                         Insolvency index 1842-1887   \n",
       "4                     Bubonic plague index 1900-1908   \n",
       "\n",
       "                                                 url   \n",
       "0  https://mhnsw.au/indexes/architecture-and-desi...  \\\n",
       "1  https://mhnsw.au/indexes/asylums/infirm-destit...   \n",
       "2  https://mhnsw.au/indexes/bankruptcy-and-insolv...   \n",
       "3  https://mhnsw.au/indexes/bankruptcy-and-insolv...   \n",
       "4  https://mhnsw.au/indexes/bubonic-plague/buboni...   \n",
       "\n",
       "                                         description                 category  \n",
       "0  Designed for researching the history of public...    Architecture & design  \n",
       "1  This index relates to persons admitted to Gove...                  Asylums  \n",
       "2  Bankruptcy is a state in which a person is una...  Bankruptcy & insolvency  \n",
       "3  Insolvency is the inability to pay debts or me...  Bankruptcy & insolvency  \n",
       "4  The Register of Cases of Bubonic Plague 1900-1...           Bubonic plague  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Convert to a Pandas dataframe\n",
    "df = pd.DataFrame(indexes)\n",
    "\n",
    "# Peek inside\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "tags": [
     "nbval-skip"
    ]
   },
   "outputs": [],
   "source": [
    "# Save as a CSV file\n",
    "df.to_csv(\"indexes.csv\", index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}