{ "cells": [ { "cell_type": "markdown", "id": "18d8226c-16a1-44a5-985c-aceb52f722cf", "metadata": {}, "source": [ "# Harvest details of all series in RecordSearch" ] }, { "cell_type": "code", "execution_count": null, "id": "232bc32d-d4bd-49a0-bb8c-8bdda591cd0c", "metadata": {}, "outputs": [], "source": [ "import json\n", "import string\n", "import time\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "from recordsearch_data_scraper.scrapers import RSItemSearch, RSSeriesSearch\n", "from tqdm.auto import tqdm" ] }, { "cell_type": "markdown", "id": "333991be-73b2-4d9f-afee-85ef543da97a", "metadata": {}, "source": [ "## Harvest series data\n", "\n", "To harvest all of the series data we'll simply loop through the alphabet and search for series identifiers starting with each letter. Note that the `recordsearch_data_scraper` caches results. This means that if the harvest fails for some reason, you can just re-run the cell below to start it again and it'll pick up where it left off.\n", "\n", "However, this also means that if you want a completely fresh harvest, you should delete the `cache_db.sqlite` file before you start." ] }, { "cell_type": "code", "execution_count": null, "id": "b0381d81-ae33-416f-af35-63e56573ae65", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "series = []\n", "for letter in string.ascii_uppercase:\n", " series_results = RSSeriesSearch(record_detail=\"full\", series_id=f\"{letter}*\")\n", " if series_results.total_results > 0:\n", " with tqdm(total=series_results.total_results, desc=letter) as pbar:\n", " more = True\n", " while more:\n", " data = series_results.get_results()\n", " if data[\"results\"]:\n", " series += data[\"results\"]\n", " pbar.update(len(data[\"results\"]))\n", " time.sleep(0.5)\n", " else:\n", " more = False\n", " time.sleep(1)" ] }, { "cell_type": "markdown", "id": "1c7d5018-022c-4b4d-ae25-7698fafc56e4", "metadata": {}, "source": [ "How many series did we harvest?" ] }, { "cell_type": "code", "execution_count": null, "id": "64ec61e8-68fc-4b3b-a350-785fe298db6f", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "len(series)" ] }, { "cell_type": "markdown", "id": "662fa44d-68e5-46ab-83c0-a880e6eec6e3", "metadata": {}, "source": [ "## Save the harvested data\n", "\n", "The series details include complex relationships which can't be easily saved into a flat file format like CSV. So we'll write all the harvested data to a JSON lines file, where each series description is a JSON object on a single line." ] }, { "cell_type": "code", "execution_count": null, "id": "065b90ab-a4ea-4a6a-b330-6a4490ddb60a", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "with Path(\"all_series_april_2022.ndjson\").open(\"w\") as series_file:\n", " for s in series:\n", " series_file.write(json.dumps(s) + \"\\n\")" ] }, { "cell_type": "markdown", "id": "09a2b731-2192-45c0-916e-5fb6d5b1e016", "metadata": {}, "source": [ "## Get totals greater than 20,000\n", "\n", "To find the numbers of items digitised, and the access status of items within each series, the scraper fires off an item search. However, RecordSearch returns a maximum of 20,000 results from a search. This means that some values in the harvested data will be '20,000+'. To replace that with something a bit more useful, we have to break the result set into a series of smaller chunks by filtering on the control symbol.\n", "\n", "Here we just loop through a list of letters and numbers, adding them to the control symbol symbol search until the results are below 20,000. Then we add the results for all the chunks together to get the total." ] }, { "cell_type": "markdown", "id": "9016c197-8ced-4af8-adba-f9a06d276f56", "metadata": {}, "source": [ "First convert the data into a dataframe." ] }, { "cell_type": "code", "execution_count": null, "id": "2108cff9-de23-4827-bb62-5b067ff27ec7", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df = pd.json_normalize(series)" ] }, { "cell_type": "code", "execution_count": null, "id": "e745f805-5f99-4118-bab4-9618c1746973", "metadata": { "tags": [] }, "outputs": [], "source": [ "control_range = (\n", " [str(number) for number in range(0, 10)]\n", " + [letter for letter in string.ascii_uppercase]\n", " + [\"/\"]\n", ")\n", "\n", "\n", "def get_results(**kwargs):\n", " s = RSItemSearch(**kwargs)\n", " if s.total_results == \"20,000+\":\n", " return False\n", " else:\n", " return s.total_results\n", "\n", "\n", "def refine_controls(current_control, **kwargs):\n", " total_digitised = 0\n", " for control in control_range:\n", " new_control = current_control.strip(\"*\") + control + \"*\"\n", " # print(new_control)\n", " kwargs[\"control\"] = new_control\n", " total = get_results(**kwargs)\n", " # print(total)\n", " if total is False:\n", " total_digitised += refine_controls(new_control, **kwargs)\n", " else:\n", " total_digitised += total\n", " return total_digitised\n", "\n", "\n", "def get_large_series_totals(series, digital=None, access=None):\n", " total_digitised = 0\n", " kwargs = {\"series\": series}\n", " if digital:\n", " kwargs[\"digital\"] = True\n", " if access:\n", " kwargs[\"access\"] = access\n", " for control in control_range:\n", " kwargs[\"control\"] = control + \"*\"\n", " # print(control1)\n", " total = get_results(**kwargs)\n", " # print(total)\n", " if total is False:\n", " total_digitised += refine_controls(control, **kwargs)\n", " else:\n", " total_digitised += total\n", " return total_digitised\n", "\n", "\n", "def get_digitised_total(row):\n", " if row[\"items_digitised\"] != \"20,000+\":\n", " return row[\"items_digitised\"]\n", " else:\n", " print(row[\"identifier\"])\n", " return get_large_series_totals(row[\"identifier\"], digital=True)\n", "\n", "\n", "def get_open_total(row):\n", " if row[\"access_status_totals.OPEN\"] != \"20,000+\":\n", " return row[\"access_status_totals.OPEN\"]\n", " else:\n", " print(row[\"identifier\"])\n", " return get_large_series_totals(row[\"identifier\"], access=\"OPEN\")\n", "\n", "\n", "def get_nye_total(row):\n", " if row[\"access_status_totals.NYE\"] != \"20,000+\":\n", " return row[\"access_status_totals.NYE\"]\n", " else:\n", " print(row[\"identifier\"])\n", " return get_large_series_totals(row[\"identifier\"], access=\"NYE\")\n", "\n", "\n", "def get_owe_total(row):\n", " if row[\"access_status_totals.OWE\"] != \"20,000+\":\n", " return row[\"access_status_totals.OWE\"]\n", " else:\n", " print(row[\"identifier\"])\n", " return get_large_series_totals(row[\"identifier\"], access=\"OWE\")" ] }, { "cell_type": "markdown", "id": "99816864-591e-4262-a446-c5c72e7e090d", "metadata": {}, "source": [ "Now we'll process the fields that have the '20,000+' values." ] }, { "cell_type": "code", "execution_count": null, "id": "7507f599-5e37-4c7c-9991-a579a546d3cc", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df[\"digitised_total\"] = df.apply(get_digitised_total, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "8a3088ee-1dc5-4460-bc0f-84865092131f", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df[\"access_open_total\"] = df.apply(get_open_total, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "8334447c-66c5-44a1-9a31-479ab888fc0b", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df[\"access_nye_total\"] = df.apply(get_nye_total, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "9a33d4e5-e996-428d-987d-1a5d17e09898", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df[\"access_owe_total\"] = df.apply(get_owe_total, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "a85f2375-fafb-4069-afd9-4b2558bac937", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df.loc[df[\"digitised_total\"] == \"20,000+\"]" ] }, { "cell_type": "markdown", "id": "c46433e9-df48-4ed1-9677-6714c651a812", "metadata": {}, "source": [ "## Flatten the data and save the totals" ] }, { "cell_type": "markdown", "id": "c8d36fed-40c9-4fa3-9422-1d0c41d2e532", "metadata": {}, "source": [ "Quantities and locations are stored in a list. Here we'll add up the quantities in the list to get a total quantity for each series." ] }, { "cell_type": "code", "execution_count": null, "id": "f63ca207-d71e-4124-a59a-9bd66c696646", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df[\"quantity_total\"] = (\n", " df[\"locations\"].dropna().apply(lambda l: round(sum([x[\"quantity\"] for x in l]), 2))\n", ")" ] }, { "cell_type": "markdown", "id": "2a5d0bf5-3a49-4ccc-9f1d-02e1f8af7e77", "metadata": {}, "source": [ "To save the totals as a CSV, we'll leave out the series relationships with other series and agencies." ] }, { "cell_type": "code", "execution_count": null, "id": "8009a66c-4961-4262-9dc2-2784d784be0a", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_totals = df.copy()[\n", " [\n", " \"identifier\",\n", " \"title\",\n", " \"contents_date_str\",\n", " \"contents_start_date\",\n", " \"contents_end_date\",\n", " \"quantity_total\",\n", " \"items_described_note\",\n", " \"items_described\",\n", " \"digitised_total\",\n", " \"access_open_total\",\n", " \"access_owe_total\",\n", " \"access_status_totals.CLOSED\",\n", " \"access_nye_total\",\n", " ]\n", "]" ] }, { "cell_type": "markdown", "id": "6382e7f4-7f34-41fc-b4b6-83f42d5187ae", "metadata": {}, "source": [ "Simplify some of the column headings." ] }, { "cell_type": "code", "execution_count": null, "id": "f6919442-b425-4517-a135-c4e081146c49", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_totals.columns = [\n", " \"identifier\",\n", " \"title\",\n", " \"contents_date_str\",\n", " \"contents_start_date\",\n", " \"contents_end_date\",\n", " \"quantity_total\",\n", " \"described_note\",\n", " \"described_total\",\n", " \"digitised_total\",\n", " \"access_open_total\",\n", " \"access_owe_total\",\n", " \"access_closed_total\",\n", " \"access_nye_total\",\n", "]" ] }, { "cell_type": "markdown", "id": "accd3d5f-9fca-4eeb-b3a8-998c0ef99fa2", "metadata": {}, "source": [ "Make sure the totals are stored as integers." ] }, { "cell_type": "code", "execution_count": null, "id": "0283421c-f9f5-4ce9-9e90-c12e0cccfbe4", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_totals[\"described_total\"] = df_totals[\"described_total\"].astype(\"Int64\")\n", "df_totals[\"digitised_total\"] = df_totals[\"digitised_total\"].astype(\"Int64\")\n", "df_totals[\"access_open_total\"] = df_totals[\"access_open_total\"].astype(\"Int64\")\n", "df_totals[\"access_nye_total\"] = df_totals[\"access_nye_total\"].astype(\"Int64\")\n", "df_totals[\"access_owe_total\"] = df_totals[\"access_owe_total\"].astype(\"Int64\")\n", "df_totals[\"access_closed_total\"] = df_totals[\"access_closed_total\"].astype(\"Int64\")" ] }, { "cell_type": "markdown", "id": "11b0d678-a052-465a-b868-1a0aa24c0644", "metadata": {}, "source": [ "Save as a CSV file." ] }, { "cell_type": "code", "execution_count": null, "id": "df34475e-0a1a-4db3-bf48-05967496b188", "metadata": { "tags": [ "nbval-skip" ] }, "outputs": [], "source": [ "df_totals.to_csv(\"series_totals_April_2022.csv\", index=False)" ] }, { "cell_type": "markdown", "id": "82354e65-c4b3-47ab-accc-0cf6ba979597", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }