{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Get a list of agencies associated with a function\n", "\n", "RecordSearch describes the business of government in terms of 'functions'. A function is an area of responsibility assigned to a particular government agency. Over time, functions change and move between agencies.\n", "\n", "If you want to track particular areas of government activity, such as 'migration' or 'meteorology', it can be useful to start with functions, then follow the trail through agencies, series created by those agencies, and finally items contained within those series.\n", "\n", "Functions are also organised into a hierarchy, so moving up or down the hierarchy can help you refine or broaden your search.\n", "\n", "This notebook helps you create a list of all agencies associated with a particular function. Click on the 'Appmode' button to hide all the code.\n", "\n", "### Some limitations...\n", "\n", "The function selector in the form below uses the hierarchy of functions currently built into the RecordSearch interface. However, there are numerous inconsistencies in RecordSearch, and most of the terms are not assigned to any agencies."
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "import json\n",
 "import os\n",
 "import time\n",
 "\n",
 "import arrow\n",
 "import ipywidgets as widgets\n",
 "import pandas as pd\n",
 "from IPython.display import FileLink, display\n",
 "from recordsearch_data_scraper.scrapers import RSAgencySearch\n",
 "from slugify import slugify\n",
 "from tinydb import Query, TinyDB\n",
 "from tqdm.auto import tqdm"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "def harvest_agencies(function):\n",
 "    \"\"\"\n",
 "    Harvests the agencies associated with a function term,\n",
 "    one page of results at a time.\n",
 "    \"\"\"\n",
 "    agencies = []\n",
 "    search = RSAgencySearch(function=function, record_detail=\"full\")\n",
 "    with tqdm(total=search.total_results) as pbar:\n",
 "        more = True\n",
 "        while more:\n",
 "            data = search.get_results()\n",
 "            if data[\"results\"]:\n",
 "                agencies += data[\"results\"]\n",
 "                pbar.update(len(data[\"results\"]))\n",
 "                # Pause between requests\n",
 "                time.sleep(0.5)\n",
 "            else:\n",
 "                more = False\n",
 "    return agencies"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "def get_terms(function):\n",
 "    \"\"\"\n",
 "    Recursively gets all the terms below a given term in the hierarchy.\n",
 "    \"\"\"\n",
 "    terms = []\n",
 "    if \"narrower\" in function:\n",
 "        for subf in function[\"narrower\"]:\n",
 "            terms.append(subf[\"term\"])\n",
 "            terms += get_terms(subf)\n",
 "    return terms\n",
 "\n",
 "\n",
 "def get_db():\n",
 "    \"\"\"\n",
 "    Opens the TinyDB database for the selected function.\n",
 "    \"\"\"\n",
 "    function = rsfunction.value\n",
 "    if children.value:\n",
 "        db_name = \"data/db_agencies_{}_with_children.json\".format(\n",
 "            slugify(function[\"term\"])\n",
 "        )\n",
 "    else:\n",
 "        db_name = \"data/db_agencies_{}.json\".format(slugify(function[\"term\"]))\n",
 "    db = TinyDB(db_name)\n",
 "    return db\n",
 "\n",
 "\n",
 "def get_agencies(b):\n",
 "    \"\"\"\n",
 "    Sends function terms off to the harvester to get related agencies.\n",
 "    If you've selected the 'include children' option, it includes all\n",
 "    the function terms below the selected one in the hierarchy.\n",
 "    \"\"\"\n",
 "    harvest_status.clear_output()\n",
 "    Record = Query()\n",
 "    function = rsfunction.value\n",
 "    db = get_db()\n",
 "    terms = [function[\"term\"]]\n",
 "    if children.value:\n",
 "        terms += get_terms(function)\n",
 "    with harvest_status:\n",
 "        for term in terms:\n",
 "            print('\\nHarvesting \"{}\"'.format(term))\n",
 "            agencies = harvest_agencies(term)\n",
 "            for agency in agencies:\n",
 "                db.upsert(agency, Record.agency_id == agency[\"identifier\"])"
] },
{ "cell_type": "markdown", "metadata": {}, "source": [
 "## Select the function you want to harvest\n",
 "\n",
 "* Select a function from the dropdown list\n",
 "* Check 'include children' to add all the terms below the selected function in the hierarchy to the search\n",
 "* Click 'Get agencies' to start!"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "# Set up the interface\n",
 "\n",
 "\n",
 "def get_children(function, level):\n",
 "    \"\"\"\n",
 "    Gets the children of the supplied term.\n",
 "    Formats/indents the terms for the dropdown.\n",
 "    \"\"\"\n",
 "    f_list = []\n",
 "    if \"narrower\" in function:\n",
 "        level += 1\n",
 "        for subf in function[\"narrower\"]:\n",
 "            f_list.append(\n",
 "                (\"{}{} {}\".format(level * \" \", level * \"-\", subf[\"term\"]), subf)\n",
 "            )\n",
 "            f_list += get_children(subf, level=level)\n",
 "    return f_list\n",
 "\n",
 "\n",
 "def get_functions():\n",
 "    # Load the JSON file of functions we've previously harvested\n",
 "    with open(\"data/functions.json\", \"r\") as json_file:\n",
 "        functions = json.load(json_file)\n",
 "\n",
 "    # Make the list of options for the dropdown\n",
 "    functions_list = []\n",
 "    for function in functions:\n",
 "        functions_list.append((function[\"term\"], function))\n",
 "        functions_list += get_children(function, level=0)\n",
 "    return functions_list\n",
 "\n",
 "\n",
 "# Make the dropdown selector\n",
 "rsfunction = widgets.Dropdown(\n",
 "    options=get_functions(), description=\"Function:\", disabled=False\n",
 ")\n",
 "\n",
 "# Make a checkbox to include children\n",
 "children = widgets.Checkbox(value=False, description=\"include children\", disabled=False)\n",
 "\n",
 "# A button to start the harvest\n",
 "start = widgets.Button(\n",
 "    description=\"Get agencies\",\n",
 "    disabled=False,\n",
 "    button_style=\"primary\",  # 'success', 'info', 'warning', 'danger' or ''\n",
 "    tooltip=\"Start harvest\",\n",
 ")\n",
 "\n",
 "# Add function to the button\n",
 "start.on_click(get_agencies)\n",
 "\n",
 "display(\n",
 "    widgets.HBox(\n",
 "        [rsfunction, children, start],\n",
 "        layout=widgets.Layout(\n",
 "            padding=\"50px\", margin=\"20px 0 0 0\", border=\"1px solid #999999\"\n",
 "        ),\n",
 "    )\n",
 ")\n",
 "harvest_status = widgets.Output()\n",
 "harvest_status"
] },
{ "cell_type": "markdown", "metadata": {}, "source": [
 "## Save and download the harvested data\n",
 "\n",
 "RecordSearch data includes a rich set of relationships. In the case of agencies, there are links to functions, people, and to previous, subsequent, controlled, and controlling agencies. It's hard to present this complex, nested data in a flat format, such as a CSV file. For convenience, the CSV file created for download doesn't include related agencies, people, and functions. It does, however, include `start_function` and `end_function` fields that indicate when the agency had responsibility for the selected function. If you've included child functions, the `start_function` and `end_function` fields contain the earliest and latest dates from any of the harvested terms."
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "def parse_date(date):\n",
 "    \"\"\"\n",
 "    Parses a full date or a year-only string, returning None if it can't be parsed.\n",
 "    \"\"\"\n",
 "    try:\n",
 "        if \"-\" in date:\n",
 "            parsed_date = arrow.get(date)\n",
 "        else:\n",
 "            parsed_date = arrow.get(date, \"YYYY\")\n",
 "    # TypeError covers missing (None) dates; arrow's ParserError is a ValueError\n",
 "    except (TypeError, ValueError):\n",
 "        parsed_date = None\n",
 "    return parsed_date\n",
 "\n",
 "\n",
 "def make_filename():\n",
 "    function = rsfunction.value\n",
 "    if children.value:\n",
 "        filename = \"data/agencies_{}_with_children\".format(slugify(function[\"term\"]))\n",
 "    else:\n",
 "        filename = \"data/agencies_{}\".format(slugify(function[\"term\"]))\n",
 "    return filename\n",
 "\n",
 "\n",
 "def save_csv():\n",
 "    db = get_db()\n",
 "    agencies = db.all()\n",
 "    function = rsfunction.value\n",
 "    terms = [function[\"term\"]]\n",
 "    if children.value:\n",
 "        terms += get_terms(function)\n",
 "    rows = []\n",
 "    for agency in agencies:\n",
 "        # Find the earliest start and latest end dates across the harvested terms\n",
 "        earliest_date = None\n",
 "        latest_date = None\n",
 "        for agency_function in agency[\"functions\"]:\n",
 "            if agency_function[\"identifier\"].lower() in terms:\n",
 "                start = parse_date(agency_function[\"start_date\"])\n",
 "                end = parse_date(agency_function[\"end_date\"])\n",
 "                if start and earliest_date and start < parse_date(earliest_date):\n",
 "                    earliest_date = agency_function[\"start_date\"]\n",
 "                elif start and not earliest_date:\n",
 "                    earliest_date = agency_function[\"start_date\"]\n",
 "                if end and latest_date and end > parse_date(latest_date):\n",
 "                    latest_date = agency_function[\"end_date\"]\n",
 "                elif end and not latest_date:\n",
 "                    latest_date = agency_function[\"end_date\"]\n",
 "        row = {\n",
 "            k: agency[k]\n",
 "            for k in [\"agency_id\", \"title\", \"dates\", \"agency_status\", \"location\"]\n",
 "        }\n",
 "        row[\"start_function\"] = earliest_date\n",
 "        row[\"end_function\"] = latest_date\n",
 "        rows.append(row)\n",
 "    df = pd.DataFrame(rows)\n",
 "    # The 'dates' column is a dictionary, we need to flatten this out so we can easily work with the values\n",
 "    # (Series.iteritems() was removed in pandas 2.0, so use items() instead)\n",
 "    df = pd.concat(\n",
 "        [df, pd.DataFrame((d for idx, d in df[\"dates\"].items()))], axis=1\n",
 "    )\n",
 "    # Delete the old date field\n",
 "    del df[\"dates\"]\n",
 "    # Rename column\n",
 "    df.rename({\"date_str\": \"dates\"}, axis=1, inplace=True)\n",
 "    df = df[\n",
 "        [\n",
 "            \"agency_id\",\n",
 "            \"title\",\n",
 "            \"agency_status\",\n",
 "            \"dates\",\n",
 "            \"start_date\",\n",
 "            \"end_date\",\n",
 "            \"location\",\n",
 "            \"start_function\",\n",
 "            \"end_function\",\n",
 "        ]\n",
 "    ]\n",
 "    filename = \"{}.csv\".format(make_filename())\n",
 "    df.to_csv(filename, index=False)\n",
 "\n",
 "\n",
 "def save_json():\n",
 "    db = get_db()\n",
 "    agencies = db.all()\n",
 "    filename = \"{}.json\".format(make_filename())\n",
 "    with open(filename, \"w\") as json_file:\n",
 "        json.dump(agencies, json_file, indent=4)\n",
 "\n",
 "\n",
 "def save_data(b):\n",
 "    out.clear_output()\n",
 "    try:\n",
 "        save_csv()\n",
 "        save_json()\n",
 "        filename = make_filename()\n",
 "    except KeyError:\n",
 "        with out:\n",
 "            print(\"You need to harvest some data first!\")\n",
 "    else:\n",
 "        with out:\n",
 "            display(FileLink(\"{}.json\".format(filename)))\n",
 "            display(FileLink(\"{}.csv\".format(filename)))\n",
 "\n",
 "\n",
 "# A button to save the harvested data\n",
 "download = widgets.Button(\n",
 "    description=\"Save data\",\n",
 "    disabled=False,\n",
 "    button_style=\"primary\",  # 'success', 'info', 'warning', 'danger' or ''\n",
 "    tooltip=\"Save data for download\",\n",
 ")\n",
 "\n",
 "# Add function to the button\n",
 "download.on_click(save_data)\n",
 "\n",
 "out = widgets.Output()\n",
 "\n",
 "display(download)\n",
 "display(out)"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "%%capture\n",
 "# Load environment variables if available\n",
 "%load_ext dotenv\n",
 "%dotenv"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "# TESTING\n",
 "if os.getenv(\"GW_STATUS\") == \"dev\":\n",
 "    rsfunction.value = {\"term\": \"arts\", \"narrower\": []}\n",
 "    start.click()"
] },
{ "cell_type": "markdown", "metadata":
{}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) as part of the [GLAM Workbench](https://glam-workbench.github.io/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }