{ "cells": [ { "cell_type": "markdown", "id": "f1048302-134e-47a4-b66e-6353bfb36302", "metadata": {}, "source": [ "# Exploring harvested series data, April 2022\n", "\n", "This notebook examines data from a complete harvest of series publicly available through RecordSearch in April 2022. It also compares the results to an [earlier harvest](series_collection_stats.ipynb) in May 2021. See [this notebook](harvest_series_data.ipynb) for the harvesting method." ] }, { "cell_type": "code", "execution_count": 1, "id": "01e2eb57-2648-4752-9252-f77a3e7c5de2", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 21, "id": "541285ae-2e10-4f9a-94ed-9df0e5a83750", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\n", " \"series_totals_April_2022.csv\",\n", " dtype={\n", " \"described_total\": \"Int64\",\n", " \"digitised_total\": \"Int64\",\n", " \"access_open_total\": \"Int64\",\n", " \"access_owe_total\": \"Int64\",\n", " \"access_closed_total\": \"Int64\",\n", " \"access_nye_total\": \"Int64\",\n", " },\n", ")" ] }, { "cell_type": "code", "execution_count": 22, "id": "1a0f91b9-b1eb-4c40-9129-785133992481", "metadata": {}, "outputs": [], "source": [ "df_prev = pd.read_csv(\n", " \"series_totals_May_2021.csv\",\n", " dtype={\n", " \"described_total\": \"Int64\",\n", " \"digitised_total\": \"Int64\",\n", " \"access_open_total\": \"Int64\",\n", " \"access_owe_total\": \"Int64\",\n", " \"access_closed_total\": \"Int64\",\n", " \"access_nye_total\": \"Int64\",\n", " },\n", ")" ] }, { "cell_type": "code", "execution_count": 3, "id": "c2f788f8-3026-49b2-b3cf-f6cf98efb454", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identifiertitlecontents_date_strcontents_start_datecontents_end_datequantity_totaldescribed_notedescribed_totaldigitised_totalaccess_open_totalaccess_owe_totalaccess_closed_totalaccess_nye_total
0A1Correspondence files, annual single number ser...01 Jan 1890 - 31 Dec 19691890-01-011969-12-31477.41All items from this series are entered on Reco...64444.056666.064445.03.00.08.0
1A2Correspondence files, annual single number series01 Jan 1895 - 31 Dec 19261895-01-011926-12-3135.74All items from this series are entered on Reco...3409.0365.03403.00.00.06.0
2A3Correspondence files, annual single number ser...01 Jan 1839 - 23 May 19631839-01-011963-05-2326.48All items from this series are entered on Reco...1382.0264.01374.08.00.00.0
3A4Correspondence files, single number series wit...NaNNaNNaN0.36Click to see items listed on RecordSearch. Ple...45.02.045.00.00.00.0
4A5Correspondence files, annual single number ser...01 Jan 1923 - 31 Dec 19241923-01-011924-12-311.80Click to see items listed on RecordSearch. Ple...200.014.0198.00.00.02.0
\n", "
" ], "text/plain": [ " identifier title \\\n", "0 A1 Correspondence files, annual single number ser... \n", "1 A2 Correspondence files, annual single number series \n", "2 A3 Correspondence files, annual single number ser... \n", "3 A4 Correspondence files, single number series wit... \n", "4 A5 Correspondence files, annual single number ser... \n", "\n", " contents_date_str contents_start_date contents_end_date \\\n", "0 01 Jan 1890 - 31 Dec 1969 1890-01-01 1969-12-31 \n", "1 01 Jan 1895 - 31 Dec 1926 1895-01-01 1926-12-31 \n", "2 01 Jan 1839 - 23 May 1963 1839-01-01 1963-05-23 \n", "3 NaN NaN NaN \n", "4 01 Jan 1923 - 31 Dec 1924 1923-01-01 1924-12-31 \n", "\n", " quantity_total described_note \\\n", "0 477.41 All items from this series are entered on Reco... \n", "1 35.74 All items from this series are entered on Reco... \n", "2 26.48 All items from this series are entered on Reco... \n", "3 0.36 Click to see items listed on RecordSearch. Ple... \n", "4 1.80 Click to see items listed on RecordSearch. Ple... \n", "\n", " described_total digitised_total access_open_total access_owe_total \\\n", "0 64444.0 56666.0 64445.0 3.0 \n", "1 3409.0 365.0 3403.0 0.0 \n", "2 1382.0 264.0 1374.0 8.0 \n", "3 45.0 2.0 45.0 0.0 \n", "4 200.0 14.0 198.0 0.0 \n", "\n", " access_closed_total access_nye_total \n", "0 0.0 8.0 \n", "1 0.0 6.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 2.0 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "id": "5ede6b8d-e246-4d0c-a5a2-402820bc833f", "metadata": {}, "source": [ "## Some basic statistics\n", "\n", "Note that these numbers might not be exact. To work around the 20,000 search result limit, some totals have been calculated by aggregating a series of searches. In most cases this will be accurate, but some items have multiple control symbols and may be duplicated in the results. 
I think any errors will be small.\n", "\n", "The numbers in brackets indicate the change since the last harvest in May 2021.\n", "\n", "### Number of series" ] }, { "cell_type": "code", "execution_count": 38, "id": "e75ece98-db39-4c53-9da1-a707967c3988", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "65,747 series (+28)\n" ] } ], "source": [ "print(f\"{df.shape[0]:,} series ({df.shape[0] - df_prev.shape[0]:+,})\")" ] }, { "cell_type": "markdown", "id": "d2075295-6443-4eb4-abad-53409c0b6c33", "metadata": {}, "source": [ "### Quantity of records in linear metres" ] }, { "cell_type": "code", "execution_count": 39, "id": "05857d6f-2676-473e-a4f2-176bf99d991e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "341,963.73 metres of records (+4,293.04)\n" ] } ], "source": [ "print(\n", " f'{round(df[\"quantity_total\"].sum(), 2):,} metres of records ({round((df[\"quantity_total\"].sum() - df_prev[\"quantity_total\"].sum()), 2):+,})'\n", ")" ] }, { "cell_type": "markdown", "id": "c161e90d-a6bd-4a09-9c44-aa034cdd2665", "metadata": {}, "source": [ "### Number of items described in RecordSearch" ] }, { "cell_type": "code", "execution_count": 40, "id": "ad158cd4-f6c5-46d3-88ea-f720dd729c0f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "14,624,131 items described (+713,641)\n" ] } ], "source": [ "print(\n", " f'{df[\"described_total\"].sum():,} items described ({df[\"described_total\"].sum() - df_prev[\"described_total\"].sum():+,})'\n", ")" ] }, { "cell_type": "markdown", "id": "48cf9b72-d656-4cf6-a1b6-1c5bc5071d39", "metadata": {}, "source": [ "### Number of items digitised" ] }, { "cell_type": "code", "execution_count": 41, "id": "db424681-c63c-47c6-9394-6917913e3c61", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2,366,101 items digitised (+131,869)\n" ] } ], "source": [ "print(\n", " f'{df[\"digitised_total\"].sum():,} 
items digitised ({df[\"digitised_total\"].sum()- df_prev[\"digitised_total\"].sum():+,})'\n", ")" ] }, { "cell_type": "code", "execution_count": 42, "id": "59286900-352d-4cb2-b269-4d77bc26fb28", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "16.18% of described items are digitised (+0.12%)\n" ] } ], "source": [ "prev_percent = df_prev[\"digitised_total\"].sum() / df_prev[\"described_total\"].sum()\n", "current_percent = df[\"digitised_total\"].sum() / df[\"described_total\"].sum()\n", "print(\n", " f\"{current_percent:0.2%} of described items are digitised ({current_percent - prev_percent:+0.2%})\"\n", ")" ] }, { "cell_type": "markdown", "id": "b9d3b00a-83d8-4a25-9582-b12eb8421204", "metadata": {}, "source": [ "### Access status of items described" ] }, { "cell_type": "code", "execution_count": 44, "id": "212c689b-17b5-4eb9-ba17-15695cec826a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
access statustotalchangepercent
Open8,026,169+243,34854.06%
Open with exceptions110,688+1,5530.75%
Closed10,917-1530.07%
Not yet examined6,698,686+512,03345.12%
\n" ], "text/plain": [ "" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "access_totals = [\n", " {\n", " \"access status\": \"Open\",\n", " \"total\": df[\"access_open_total\"].sum(),\n", " \"change\": df[\"access_open_total\"].sum() - df_prev[\"access_open_total\"].sum(),\n", " },\n", " {\n", " \"access status\": \"Open with exceptions\",\n", " \"total\": df[\"access_owe_total\"].sum(),\n", " \"change\": df[\"access_owe_total\"].sum() - df_prev[\"access_owe_total\"].sum(),\n", " },\n", " {\n", " \"access status\": \"Closed\",\n", " \"total\": df[\"access_closed_total\"].sum(),\n", " \"change\": df[\"access_closed_total\"].sum()\n", " - df_prev[\"access_closed_total\"].sum(),\n", " },\n", " {\n", " \"access status\": \"Not yet examined\",\n", " \"total\": df[\"access_nye_total\"].sum(),\n", " \"change\": df[\"access_nye_total\"].sum() - df_prev[\"access_nye_total\"].sum(),\n", " },\n", "]\n", "\n", "df_access = pd.DataFrame(access_totals)\n", "df_access[\"percent\"] = df_access[\"total\"] / df_access[\"total\"].sum()\n", "\n", "df_access.style.format(\n", " {\"total\": \"{:,.0f}\", \"change\": \"{:+,}\", \"percent\": \"{:0.2%}\"}\n", ").hide()" ] }, { "cell_type": "markdown", "id": "371c6dee-6f69-4979-87f4-26fb87ae6de9", "metadata": {}, "source": [ "## Digging deeper\n", "\n", "### How many items are there in total?\n", "\n", "There's no way of knowing this from the harvested data. However, the recently-released [Tune Review](https://www.ag.gov.au/sites/default/files/2021-03/functional-efficiency-review-national-archives-of-australia.PDF) says that 37% of the NAA's holdings are described. So as we know the number described, we should be able to calculate an approximate number of total items." 
] }, { "cell_type": "code", "execution_count": 51, "id": "e83bd579-5e24-48f3-841d-8f6ae12a9ce1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Approximately 39,524,678 items in total\n" ] } ], "source": [ "print(f'Approximately {int(df[\"described_total\"].sum() / 0.37):,} items in total')" ] }, { "cell_type": "markdown", "id": "8aa6c3f6-eb80-46cb-b402-858bbbe85432", "metadata": {}, "source": [ "To put that another way, this is the approximate number of items **not listed** on RecordSearch:" ] }, { "cell_type": "code", "execution_count": 52, "id": "1c16f13e-901f-4461-af4b-52595ffbc97b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Approximately 24,900,547 items **are not** listed on RecordSearch\n" ] } ], "source": [ "print(\n", " f'Approximately {int(df[\"described_total\"].sum() / 0.37) - df[\"described_total\"].sum():,} items **are not** listed on RecordSearch'\n", ")" ] }, { "cell_type": "markdown", "id": "f6332e56-afa4-44a7-893f-fa9830b88725", "metadata": {}, "source": [ "That's something to keep in mind if you're just relying on item keyword searches to find relevant content. " ] }, { "cell_type": "markdown", "id": "37a4e8f9-9a9b-4ce2-8014-9dc341d43211", "metadata": {}, "source": [ "### How much of each series is described at item level?\n", "\n", "The note that accompanies the number of items listed in RecordSearch indicates how much of the series has been described at item level. By looking at the frequency of each of the values for this note, we can get a sense of the level of description across the collection." ] }, { "cell_type": "code", "execution_count": 47, "id": "bd42c15b-7ea6-4a79-a040-8a7781e63fe9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 totalpercent
No items from the series are on RecordSearch. Please contact the National Reference Service if you need assistance.41,58463.25%
Click to see items listed on RecordSearch. Please contact the National Reference Service if you can't find the record you want as not all items from the series may be on RecordSearch.12,96619.72%
All items from this series are entered on RecordSearch.11,16616.98%
No items from the series are on RecordSearch. Please contact the Australian War Memorial if you need assistance.300.05%
\n" ], "text/plain": [ "" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_described = df[\"described_note\"].value_counts().to_frame()\n", "df_described.columns = [\"total\"]\n", "df_described[\"percent\"] = df_described[\"total\"] / df_described[\"total\"].sum()\n", "df_described.style.format({\"total\": \"{:,.0f}\", \"percent\": \"{:0.2%}\"})" ] }, { "cell_type": "markdown", "id": "c59db247-a3c6-43ef-b09e-1f3a16ee7359", "metadata": {}, "source": [ "The numbers above might be a bit misleading because sometimes series are registered on RecordSearch before any items are actually transferred to the NAA. So the reason there are no items listed might be that there are no items currently in Archives custody. To try and get a more accurate picture, we can filter out series where the quantity held by the NAA is equal to zero metres." ] }, { "cell_type": "code", "execution_count": 48, "id": "2590ac58-3ea6-4a69-af9d-9c36c4cbe91d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 totalpercent
No items from the series are on RecordSearch. Please contact the National Reference Service if you need assistance.19,22747.29%
All items from this series are entered on RecordSearch.11,04427.16%
Click to see items listed on RecordSearch. Please contact the National Reference Service if you can't find the record you want as not all items from the series may be on RecordSearch.10,38425.54%
No items from the series are on RecordSearch. Please contact the Australian War Memorial if you need assistance.10.00%
\n" ], "text/plain": [ "" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_described_held = (\n", " df.loc[df[\"quantity_total\"] != 0][\"described_note\"].value_counts().to_frame()\n", ")\n", "df_described_held.columns = [\"total\"]\n", "df_described_held[\"percent\"] = (\n", " df_described_held[\"total\"] / df_described_held[\"total\"].sum()\n", ")\n", "df_described_held.style.format({\"total\": \"{:,.0f}\", \"percent\": \"{:0.2%}\"})" ] }, { "cell_type": "markdown", "id": "5975fe27-9595-4200-b1bc-932b7764d231", "metadata": {}, "source": [ "This brings down the 'undescribed' proportion, though strangely this seems to indicate that there are zero shelf metres of some series which are fully described." ] }, { "cell_type": "code", "execution_count": 14, "id": "2390b8cd-4c45-426a-94c7-fed56e54542a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "122" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[\n", " (df[\"described_note\"].str.startswith(\"All\")) & (df[\"quantity_total\"] == 0)\n", "].shape[0]" ] }, { "cell_type": "markdown", "id": "079da0d5-5aa1-4c7b-ab32-60625fc4bfe8", "metadata": {}, "source": [ "For example:" ] }, { "cell_type": "code", "execution_count": 15, "id": "321fa0ff-cb06-4cb1-baa7-eb31a178c5f3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
identifiertitlecontents_date_strcontents_start_datecontents_end_datequantity_totaldescribed_notedescribed_totaldigitised_totalaccess_open_totalaccess_owe_totalaccess_closed_totalaccess_nye_total
121A123Name index cards (Departments), 'G' seriesNaNNaNNaN0.0All items from this series are entered on Reco...2.00.02.00.00.00.0
742A749Volume of Circulars of Public Service Commissi...NaNNaNNaN0.0All items from this series are entered on Reco...1.00.00.00.00.01.0
\n", "
" ], "text/plain": [ " identifier title \\\n", "121 A123 Name index cards (Departments), 'G' series \n", "742 A749 Volume of Circulars of Public Service Commissi... \n", "\n", " contents_date_str contents_start_date contents_end_date quantity_total \\\n", "121 NaN NaN NaN 0.0 \n", "742 NaN NaN NaN 0.0 \n", "\n", " described_note described_total \\\n", "121 All items from this series are entered on Reco... 2.0 \n", "742 All items from this series are entered on Reco... 1.0 \n", "\n", " digitised_total access_open_total access_owe_total \\\n", "121 0.0 2.0 0.0 \n", "742 0.0 0.0 0.0 \n", "\n", " access_closed_total access_nye_total \n", "121 0.0 0.0 \n", "742 0.0 1.0 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[(df[\"described_note\"].str.startswith(\"All\")) & (df[\"quantity_total\"] == 0)].head(\n", " 2\n", ")" ] }, { "cell_type": "markdown", "id": "9e82d2ae-1085-4f33-835f-98f2483e424b", "metadata": {}, "source": [ "So perhaps in some cases locations and quantities are not reliably recorded on RecordSearch." ] }, { "cell_type": "markdown", "id": "20390135-b46a-4d33-84e8-5192db3b5ab0", "metadata": {}, "source": [ "### Series with no item descriptions\n", "\n", "From the items described note it seems that 19,227 series held by the NAA or AWM have no item level descriptions. We can check that by simply looking for series where the `described_total` value is zero." ] }, { "cell_type": "code", "execution_count": 49, "id": "3a268249-caf8-4c26-a890-98a44f32a3aa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "19,228 series held by NAA have no item descriptions\n" ] } ], "source": [ "print(\n", " f'{df.loc[(df[\"quantity_total\"] > 0) & (df[\"described_total\"] == 0)].shape[0]:,} series held by NAA have no item descriptions'\n", ")" ] }, { "cell_type": "markdown", "id": "6d00223b-dc10-4def-8e9d-b72539d5d5c8", "metadata": {}, "source": [ "Yay! 
That (almost) matches.\n", "\n", "Boo! That's a pretty significant black hole. Let's look at the quantity of records that represents." ] }, { "cell_type": "code", "execution_count": 50, "id": "8b1b731f-8a9a-406e-945b-f4f9f4ce7d20", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "51,733.74 linear metres in series held by NAA with no item descriptions\n" ] } ], "source": [ "print(\n", " f'{df.loc[(df[\"quantity_total\"] > 0) & (df[\"described_total\"] == 0)][\"quantity_total\"].sum():,} linear metres in series held by NAA with no item descriptions'\n", ")" ] }, { "cell_type": "markdown", "id": "cfbe3593-36fe-43e3-9568-0f8745feeb0d", "metadata": {}, "source": [ "Of course, this doesn't include the quantities of series that are partially described." ] }, { "cell_type": "markdown", "id": "e520a44b-14eb-4e03-a11d-5e866911131a", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }