{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparing harvests of closed files\n", "\n", "This notebook brings together annual harvests of files with an access status of 'closed', scraped from the National Archives of Australia's (NAA) RecordSearch database. The data files are here:\n", "\n", "* [2015](data/closed-20160101.csv) (harvested 1 January 2016)\n", "* [2016](data/closed-20170109.csv) (harvested 9 January 2017)\n", "* [2017](data/closed-20180101.csv) (harvested 1 January 2018)\n", "* [2018](data/closed-20190101.csv) (harvested 1 January 2019)\n", "* [2019](data/closed-20200101.csv) (harvested 1 January 2020)\n", "* [2020](data/closed-20210101.csv) (harvested 1 January 2021)\n", "* [2021](data/closed-20220101.csv) (harvested 1 January 2022)\n", "\n", "The current code used to harvest 'closed' files is in [this notebook](harvest_closed_files.ipynb). Previous versions can be found in [this repository](https://github.com/wragge/closed_access)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "import altair as alt\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "harvests = {\n", " \"2015\": \"closed-20160101.csv\",\n", " \"2016\": \"closed-20170109.csv\",\n", " \"2017\": \"closed-20180101.csv\",\n", " \"2018\": \"closed-20190101.csv\",\n", " \"2019\": \"closed-20200101.csv\",\n", " \"2020\": \"closed-20210101.csv\",\n", " \"2021\": \"closed-20220101.csv\",\n", "}" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " identifier series control_symbol \\\n", "0 12332 A1 1911/21007 \n", "1 15403 A1 1913/6809 \n", "2 33093 A1 1915/11532 \n", "3 46663 A2 1907/554 \n", "4 47046 A2 1915/346 \n", "\n", " title \\\n", "0 Salvatore Pagano Naturalization Issued to Immi... \n", "1 Meeting of Commonwealth Literary Fund (Missing... \n", "2 Wilhelm CA Simonsen - Naturalization Issued to... \n", "3 Report of Conference of Statisticians (File Co... \n", "4 Rossino - Mario \n", "\n", " series_title contents_date_str \\\n", "0 Correspondence files, annual single number ser... 1961 - 1961 \n", "1 Correspondence files, annual single number ser... 1913 - 1913 \n", "2 Correspondence files, annual single number ser... 1961 - 1961 \n", "3 Correspondence files, annual single number series 1904 - 1920 \n", "4 Correspondence files, annual single number series 1915 - 1915 \n", "\n", " contents_start_date contents_end_date access_status \\\n", "0 1961-01-01 00:00:00 1961-01-01 00:00:00 Closed \n", "1 1913-01-01 00:00:00 1913-01-01 00:00:00 Closed \n", "2 1961-01-01 00:00:00 1961-01-01 00:00:00 Closed \n", "3 1904-01-01 00:00:00 1920-01-01 00:00:00 Closed \n", "4 1915-01-01 00:00:00 1915-01-01 00:00:00 Closed \n", "\n", " access_decision_date reasons harvested_year location \\\n", "0 1981-07-28 00:00:00 Pre Access Recorder 2015 NaN \n", "1 1981-09-28 00:00:00 Pre Access Recorder 2015 NaN \n", "2 1981-12-03 00:00:00 Pre Access Recorder 2015 NaN \n", "3 1973-06-20 00:00:00 Pre Access Recorder 2015 NaN \n", "4 1973-06-28 00:00:00 Pre Access Recorder 2015 NaN \n", "\n", " access_decision_date_str digitised_status digitised_pages \\\n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN \n", "\n", " access_decision_reasons retrieved \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load all the data into a single 
dataframe\n", "dfs = []\n", "for year, data_file in harvests.items():\n", " df_year = pd.read_csv(\n", " Path(\"data\", data_file),\n", " parse_dates=[\n", " \"contents_start_date\",\n", " \"contents_end_date\",\n", " \"access_decision_date\",\n", " ],\n", " keep_default_na=False,\n", " )\n", " df_year[\"harvested_year\"] = year\n", " dfs.append(df_year)\n", "df = pd.concat(dfs)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number of closed files in each harvest" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " year count\n", "0 2015 14370\n", "6 2016 10750\n", "4 2017 11189\n", "1 2018 11953\n", "2 2019 11867\n", "5 2020 11140\n", "3 2021 11377" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "year_counts = df[\"harvested_year\"].value_counts().to_frame().reset_index()\n", "year_counts.columns = [\"year\", \"count\"]\n", "year_counts.sort_values(by=\"year\")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(year_counts).mark_bar().encode(\n", " x=alt.X(\"year:O\", title=\"Year end\"),\n", " y=alt.Y(\"count:Q\", title=\"Number of closed files\"),\n", " color=alt.Color(\"year:N\", legend=None),\n", " tooltip=[\"year:O\", \"count:Q\"],\n", ").properties(width=300)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find the number of times each reason is cited in the annual harvests" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Split the pipe-delimited reasons and give each reason its own row\n", "df_reasons = df.copy()\n", "df_reasons[\"reason\"] = df_reasons[\"reasons\"].str.split(\"|\")\n", "df_reasons = df_reasons.explode(\"reason\")\n", "# Label files that have no recorded reason\n", "df_reasons[\"reason\"] = df_reasons[\"reason\"].replace(\"\", \"No reason\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['33(1)(a)',\n", " '33(1)(b)',\n", " '33(1)(c)',\n", " '33(1)(d)',\n", " '33(1)(e)(i)',\n", " '33(1)(e)(ii)',\n", " '33(1)(e)(iii)',\n", " '33(1)(f)(i)',\n", " '33(1)(f)(ii)',\n", " '33(1)(f)(iii)',\n", " '33(1)(g)',\n", " '33(1)(h)',\n", " '33(1)(j)',\n", " '33(2)(a)',\n", " '33(2)(b)',\n", " '33(3)(a)(i)',\n", " '33(3)(a)(ii)',\n", " '33(3)(b)',\n", " 'Cabinet notebooks',\n", " 'Closed period',\n", " 'Court records',\n", " 'Destroyed',\n", " 'MAKE YOUR SELECTION',\n", " 'NRF',\n", " 'No reason',\n", " 'Non Cwlth-depositor',\n", " 'Non Cwlth-no appeal',\n", " 'Parliament Class A',\n", " 'Pre Access Recorder',\n", " 'Withheld pending adv',\n", " \"['33(1)(a)', '33(1)(b)', '33(1)(c)', 'Withheld pending adv']\",\n", " \"['33(1)(a)', '33(1)(b)', '33(1)(d)', '33(1)(g)', 'Withheld pending adv']\",\n", " \"['33(1)(a)', '33(1)(b)', '33(1)(d)', '33(1)(g)']\",\n", " \"['33(1)(a)', '33(1)(b)', '33(1)(d)', 'Withheld pending adv']\",\n", " \"['33(1)(a)', '33(1)(b)', '33(1)(d)']\",\n", " \"['33(1)(a)', '33(1)(b)', 
'33(1)(e)(ii)', '33(1)(g)']\",\n", " \"['33(1)(a)', '33(1)(b)', '33(1)(e)(ii)']\",\n", " \"['33(1)(a)', '33(1)(b)', '33(1)(e)(iii)']\",\n", " \"['33(1)(a)', '33(1)(b)', '33(1)(g)', 'Withheld pending adv']\",\n", " \"['33(1)(a)', '33(1)(b)', 'Closed period', 'Withheld pending adv']\",\n", " \"['33(1)(a)', '33(1)(b)', 'Withheld pending adv']\",\n", " \"['33(1)(a)', '33(1)(b)']\",\n", " \"['33(1)(a)', '33(1)(c)']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(e)(i)', '33(1)(g)', 'Withheld pending adv']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(e)(i)', '33(1)(g)']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(e)(i)']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(e)(ii)', '33(1)(e)(iii)', 'NRF']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(e)(ii)', '33(1)(e)(iii)', 'Withheld pending adv']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(e)(ii)', '33(1)(e)(iii)']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(e)(ii)', '33(1)(g)', 'Closed period', 'Withheld pending adv']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(e)(ii)', '33(1)(g)']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(e)(iii)', '33(1)(g)']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(e)(iii)']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(g)', '33(1)(e)(i)']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(g)', '33(1)(e)(ii)']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(g)', 'Withheld pending adv']\",\n", " \"['33(1)(a)', '33(1)(d)', '33(1)(g)']\",\n", " \"['33(1)(a)', '33(1)(d)']\",\n", " \"['33(1)(a)', '33(1)(e)(ii)', '33(1)(g)']\",\n", " \"['33(1)(a)', '33(1)(e)(ii)']\",\n", " \"['33(1)(a)', '33(1)(e)(iii)']\",\n", " \"['33(1)(a)', '33(1)(f)(iii)']\",\n", " \"['33(1)(a)', '33(1)(g)', 'Withheld pending adv']\",\n", " \"['33(1)(a)', '33(1)(g)']\",\n", " \"['33(1)(a)', '33(1)(j)', '33(2)(a)', '33(2)(b)']\",\n", " \"['33(1)(a)', 'Closed period', 'Withheld pending adv']\",\n", " \"['33(1)(a)', 'Withheld pending adv']\",\n", " \"['33(1)(a)']\",\n", " \"['33(1)(b)', '33(1)(d)', '33(1)(e)(ii)']\",\n", " \"['33(1)(b)', '33(1)(e)(ii)']\",\n", " 
\"['33(1)(b)', '33(1)(e)(iii)']\",\n", " \"['33(1)(b)', '33(1)(g)']\",\n", " \"['33(1)(b)', 'Withheld pending adv']\",\n", " \"['33(1)(b)']\",\n", " \"['33(1)(c)']\",\n", " \"['33(1)(d)', '33(1)(e)(ii)', '33(1)(g)']\",\n", " \"['33(1)(d)', '33(1)(e)(ii)']\",\n", " \"['33(1)(d)', '33(1)(e)(iii)', '33(1)(g)']\",\n", " \"['33(1)(d)', '33(1)(g)', '33(1)(j)', '33(1)(f)(ii)']\",\n", " \"['33(1)(d)', '33(1)(g)', '33(1)(j)', 'Closed period']\",\n", " \"['33(1)(d)', '33(1)(g)', '33(1)(j)']\",\n", " \"['33(1)(d)', '33(1)(g)', 'Closed period']\",\n", " \"['33(1)(d)', '33(1)(g)']\",\n", " \"['33(1)(d)']\",\n", " \"['33(1)(e)(i)', '33(1)(e)(ii)', '33(1)(g)']\",\n", " \"['33(1)(e)(i)', '33(1)(f)(i)', '33(1)(g)']\",\n", " \"['33(1)(e)(i)', '33(1)(g)']\",\n", " \"['33(1)(e)(ii)', '33(1)(g)', '33(1)(j)']\",\n", " \"['33(1)(e)(ii)', '33(1)(g)']\",\n", " \"['33(1)(e)(ii)']\",\n", " \"['33(1)(e)(iii)', '33(1)(f)(iii)']\",\n", " \"['33(1)(e)(iii)', '33(1)(g)']\",\n", " \"['33(1)(e)(iii)', 'Withheld pending adv']\",\n", " \"['33(1)(e)(iii)']\",\n", " \"['33(1)(f)(i)']\",\n", " \"['33(1)(f)(ii)']\",\n", " \"['33(1)(f)(iii)', '33(1)(g)']\",\n", " \"['33(1)(f)(iii)']\",\n", " \"['33(1)(g)', '33(1)(j)', 'Closed period']\",\n", " \"['33(1)(g)', '33(1)(j)']\",\n", " \"['33(1)(g)', 'Closed period']\",\n", " \"['33(1)(g)', 'MAKE YOUR SELECTION']\",\n", " \"['33(1)(g)', 'Non Cwlth-depositor']\",\n", " \"['33(1)(g)', 'Non Cwlth-no appeal']\",\n", " \"['33(1)(g)', 'Parliament Class A']\",\n", " \"['33(1)(g)']\",\n", " \"['33(1)(h)']\",\n", " \"['33(1)(j)']\",\n", " \"['33(2)(a)', '33(2)(b)', '33(3)(a)(i)', '33(3)(a)(ii)', '33(3)(b)']\",\n", " \"['33(2)(a)', '33(2)(b)', 'Withheld pending adv']\",\n", " \"['33(2)(a)', '33(2)(b)']\",\n", " \"['33(3)(a)(i)', '33(3)(a)(ii)', '33(3)(b)']\",\n", " \"['33(3)(a)(i)', '33(3)(b)', '33(3)(a)(ii)', '33(3)(b)']\",\n", " \"['33(3)(a)(ii)', '33(3)(b)', 'MAKE YOUR SELECTION']\",\n", " \"['33(3)(a)(ii)', '33(3)(b)']\",\n", " \"['Closed period', 'Non Cwlth-no 
appeal']\",\n", " \"['Closed period', 'Withheld pending adv']\",\n", " \"['Closed period']\",\n", " \"['Court records']\",\n", " \"['Destroyed']\",\n", " \"['MAKE YOUR SELECTION', 'Withheld pending adv']\",\n", " \"['MAKE YOUR SELECTION']\",\n", " \"['NRF']\",\n", " \"['Non Cwlth-depositor']\",\n", " \"['Non Cwlth-no appeal']\",\n", " \"['Parliament Class A']\",\n", " \"['Withheld pending adv']\",\n", " '[]']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unique_reasons = sorted(list(df_reasons[\"reason\"].unique()))\n", "unique_reasons" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "harvest_reasons_counts = (\n", " df_reasons.groupby(by=[\"harvested_year\", \"reason\"]).size().reset_index()\n", ")\n", "harvest_reasons_counts.columns = [\"year\", \"reason\", \"count\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualise the number of times each reason is cited" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(harvest_reasons_counts).mark_bar().encode(\n", " x=alt.X(\"year:O\", title=None),\n", " y=alt.Y(\"count:Q\", title=\"Number of files\"),\n", " color=alt.Color(\"year:N\", legend=None),\n", " facet=alt.Facet(\n", " \"reason:O\", align=\"each\", columns=5, title=\"Reason for being closed\"\n", " ),\n", " tooltip=[\"year:O\", \"reason:N\", \"count:Q\"],\n", ").properties(height=200).resolve_scale(x=\"independent\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Focus on a specific reason\n", "\n", "Select a reason from the dropdown list to examine change over time." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "input_dropdown = alt.binding_select(\n", " options=[None] + unique_reasons, labels=[\"All\"] + unique_reasons\n", ")\n", "selection = alt.selection_single(fields=[\"reason\"], bind=input_dropdown, name=\"Select\")\n", "\n", "alt.Chart(harvest_reasons_counts).mark_bar().encode(\n", " x=alt.X(\"year:O\", title=None),\n", " y=alt.Y(\"count:Q\", title=\"Number of files\"),\n", " color=alt.Color(\"year:N\", legend=None),\n", " column=alt.Column(\"reason:N\", title=\"Reason for being closed\"),\n", " tooltip=[\"year:O\", \"reason:N\", \"count:Q\"],\n", ").add_selection(selection).transform_filter(selection).properties(\n", " height=200\n", ").resolve_scale(\n", " x=\"independent\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }