{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvest files with the access status of 'closed'\n", "\n", "The National Archives of Australia's RecordSearch database includes some information about files that we're not allowed to see. These files have been through the access examination process and ended up with an access status of 'closed'. You can read about my efforts to extract and interpret this data in [Inside Story](http://insidestory.org.au/withheld-pending-advice/).\n", "\n", "While you can search by access status in RecordSearch, you can't explore the reasons, so if you want to dig any deeper you need to harvest the data. This notebook shows you how.\n", "\n", "The code used in this notebook is similar to that in [harvesting items from a search](harvesting_items_from_a_search.ipynb). The only real difference is that full item records are harvested by default, and the access reasons are processed to separate and normalise munged-together values.\n", "\n", "This notebook uses the [RecordSearch Data Scraper](https://wragge.github.io/recordsearch_data_scraper/) to do most of the work. Note that the RecordSearch Data Scraper caches results to improve efficiency. This also makes it easy to resume a failed harvest. If you want to completely refresh a harvest, then delete the `cache_db.sqlite` file to start from scratch."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting things up" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import re\n", "import time\n", "from datetime import datetime\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "from IPython.display import FileLink, display\n", "from recordsearch_data_scraper.scrapers import RSItemSearch\n", "from tqdm.auto import tqdm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Regular expressions to match against the reasons in RS to normalise them\n", "EXCEPTIONS = [\n", " [\"33(1)(a)\", r\"33\\(1\\)\\(a\\)\"],\n", " [\"33(1)(b)\", r\"33\\(1\\)[a\\(\\)]*\\(b\\)\"],\n", " [\"33(1)(c)\", r\"33\\(1\\)[ab\\(\\)]*\\(c\\)\"],\n", " [\"33(1)(d)\", r\"33\\(1\\)[abc\\(\\)]*\\(d\\)\"],\n", " [\"33(1)(e)(i)\", r\"33\\(1\\)[abcd\\(\\)]*\\(e\\)\\(i\\)\"],\n", " [\"33(1)(e)(ii)\", r\"33\\(1\\)[abcd\\(\\)]*\\(e\\)\\(ii\\)\"],\n", " [\"33(1)(e)(iii)\", r\"33\\(1\\)[abcd\\(\\)]*\\(e\\)\\(iii\\)\"],\n", " [\"33(1)(f)(i)\", r\"33\\(1\\)[abcdei\\(\\)]*\\(f\\)\\(i\\)\"],\n", " [\"33(1)(f)(ii)\", r\"33\\(1\\)[abcdei\\(\\)]*\\(f\\)\\(ii\\)\"],\n", " [\"33(1)(f)(iii)\", r\"33\\(1\\)[abcdei\\(\\)]*\\(f\\)\\(iii\\)\"],\n", " [\"33(1)(g)\", r\"33\\(1\\)[abcdefi\\(\\)]*\\(g\\)*\"],\n", " [\"33(1)(h)\", r\"33\\(1\\)[abcdefgi\\(\\)]*\\(h\\)\"],\n", " [\"33(1)(j)\", r\"33\\(1\\)[abcdefghi\\(\\)]*\\(j\\)\"],\n", " [\"33(2)(a)\", r\"33\\(2\\)\\(a\\)\"],\n", " [\"33(2)(b)\", r\"33\\(2\\)[a\\(\\)]*\\(b\\)\"],\n", " [\"33(3)(a)(i)\", r\"33\\(3\\)\\(a\\)\\(i\\)\"],\n", " [\"33(3)(a)(ii)\", r\"33\\(3\\)\\(a\\)(\\(i\\))?\\(ii\\)\"],\n", " [\"33(3)(b)\", r\"33\\(3\\)[ai\\(\\) &]*\\(b\\)\"],\n", " [\"Closed period\", r\"Closed period.*\"],\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Harvest the records" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" 
] }, "outputs": [], "source": [ "def normalise_reasons(items):\n", "    \"\"\"\n", "    Uses a set of regex patterns to try to extract a set of individual reasons from the reasons values,\n", "    which can sometimes be munged together.\n", "    \"\"\"\n", "    for item in items:\n", "        item[\"reasons\"] = []\n", "        try:\n", "            # The access reason field can munge together multiple reasons, so we need to separate & normalise\n", "            for reason in item[\"access_decision_reasons\"]:\n", "                matched = False\n", "                # Loop through the regexp patterns to see what we can find in the access reason field, save any matches\n", "                for exception, pattern in EXCEPTIONS:\n", "                    if re.match(pattern, reason):\n", "                        item[\"reasons\"].append(exception)\n", "                        matched = True\n", "                if not matched:\n", "                    # If nothing matches, just save the original\n", "                    item[\"reasons\"].append(reason)\n", "        except KeyError:\n", "            print(item)\n", "            raise\n", "\n", "    return items\n", "\n", "\n", "items = []\n", "search = RSItemSearch(record_detail=\"full\", access=\"Closed\")\n", "with tqdm(total=search.total_results) as pbar:\n", "    more = True\n", "    while more:\n", "        data = search.get_results()\n", "        if data[\"results\"]:\n", "            items += normalise_reasons(data[\"results\"])\n", "            pbar.update(len(data[\"results\"]))\n", "            time.sleep(0.5)\n", "        else:\n", "            more = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save the results for download" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "def save_harvest(search, items):\n", "    \"\"\"Save the harvest as JSONL and CSV files, plus a metadata file, in a dated directory under harvests.\"\"\"\n", "    params = search.params.copy()\n", "    params.update(search.kwargs)\n", "    today = datetime.now()\n", "    search_param_str = \"_\".join(\n", "        sorted(\n", "            [\n", "                f\"{k}_{v}\"\n", "                for k, v in params.items()\n", "                if v is not None and k not in [\"results_per_page\", \"sort\"]\n", "            ]\n", "        )\n", "    )\n", "    data_dir = Path(\"harvests\", f'{today.strftime(\"%Y%m%d\")}_{search_param_str}')\n", "    data_dir.mkdir(exist_ok=True, 
parents=True)\n", "    metadata = {\n", "        \"date_harvested\": today.isoformat(),\n", "        \"search_params\": search.params,\n", "        \"search_kwargs\": search.kwargs,\n", "        \"total_results\": search.total_results,\n", "        \"total_harvested\": len(items),\n", "    }\n", "\n", "    with Path(data_dir, \"metadata.json\").open(\"w\") as md_file:\n", "        json.dump(metadata, md_file)\n", "\n", "    with Path(data_dir, \"results.jsonl\").open(\"w\") as data_file:\n", "        for item in items:\n", "            data_file.write(json.dumps(item) + \"\\n\")\n", "\n", "    df = pd.json_normalize(items)\n", "    df.to_csv(Path(data_dir, \"results.csv\"), index=False)\n", "    display(FileLink(Path(data_dir, \"metadata.json\")))\n", "    display(FileLink(Path(data_dir, \"results.jsonl\")))\n", "    display(FileLink(Path(data_dir, \"results.csv\")))\n", "    return str(data_dir)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "save_harvest(search, items)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }