# How many of the functions are actually used?

In this notebook we'll import data about functions that [we've harvested earlier](harvesting_functions_from_recordsearch.ipynb) and search for each of these functions in RecordSearch to see how many are actually used.

In [1]:
import json

import altair as alt
import pandas as pd
from recordsearch_data_scraper.scrapers import RSAgencySearch
from tqdm.auto import tqdm

## Load and prepare the data

In [2]:
# Load the JSON file we've already harvested
with open("data/functions.json", "r") as json_file:
    functions = json.load(json_file)

In [3]:
def get_children(function):
    f_list = []
    if "narrower" in function:
        for subf in function["narrower"]:
            f_list.append(subf["term"])
            f_list += get_children(subf)
    return f_list


functions_list = []
for function in functions:
    functions_list.append(function["term"])
    functions_list += get_children(function)
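To make the flattening step concrete, here is a minimal sketch that runs the same recursion over a tiny hierarchy. The terms are borrowed from the list, but the nesting in `sample` is invented purely for illustration and does not reflect the real structure of `functions.json`.

```python
def get_children(function):
    """Return the terms of all descendants of a function, depth-first."""
    f_list = []
    for subf in function.get("narrower", []):
        f_list.append(subf["term"])
        f_list += get_children(subf)
    return f_list


# Hypothetical sample data: the nesting here is made up for the example
sample = [
    {
        "term": "arts",
        "narrower": [
            {"term": "arts funding"},
            {
                "term": "arts promotion",
                "narrower": [{"term": "arts incentive schemes"}],
            },
        ],
    },
    {"term": "audit"},
]

flat = []
for function in sample:
    flat.append(function["term"])
    flat += get_children(function)

print(flat)
```

Each parent term is followed by all of its descendants before the next top-level term appears, so the hierarchy is flattened depth-first.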

In [4]:
# Get rid of duplicates
functions_list = set(functions_list)
# Sort terms
sorted(functions_list)

['accommodation services',
 'acquisition',
 'administrative decision appeal',
 'administrative decision review',
 'administrative law',
 'administrative services',
 'advertising standards',
 'aged persons services',
 'agricultural sciences',
 'agriculture',
 'air force',
 'air force administration',
 'air force commands',
 'air operations',
 'air safety',
 'air transport',
 'air transport safety',
 'aircraft standards',
 'airport services',
 'airports',
 'ambulance services',
 'analytical services',
 'animal and veterinary sciences',
 'applications for native title',
 'applied sciences',
 'arbitration',
 'archives administration',
 'army',
 'army administration',
 'army commands',
 'artifact export regulation',
 'arts',
 'arts development',
 'arts funding',
 'arts incentive schemes',
 'arts promotion',
 'associations and corporate law',
 'atmospheric sciences',
 'audit',
 'australian capital territory',
 'australian defence forces (adf)',
 'banking',
 'bankruptcy',
 'biological science

## Search for agencies associated with each function

In RecordSearch, functions are performed by agencies. So when you search for a function you get back a list of agencies. Here we'll loop through the list of functions and search for associated agencies.

In [None]:
function_totals = []
for function in tqdm(functions_list):
    agencies = RSAgencySearch(function=function)
    # Get the total results from each search (replace None with 0)
    total = agencies.total_results or 0
    function_totals.append({"function": function, "total": total})
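The loop above depends on live requests to RecordSearch. As a rough sketch of the same pattern that can be run offline, the version below takes the search as a parameter, so a stand-in can be swapped in for testing. The names `count_agencies` and `fake_search`, and the canned numbers, are assumptions made up for this example; they are not part of `recordsearch_data_scraper`.

```python
def count_agencies(functions, search):
    """Run `search` for each function and record the number of results."""
    totals = []
    for function in sorted(functions):
        result = search(function)
        # Treat a missing total (None) as zero associated agencies
        total = result if result is not None else 0
        totals.append({"function": function, "total": total})
    return totals


def fake_search(function):
    """Hypothetical stand-in for a RecordSearch query, with invented totals."""
    canned = {"arts": 12, "audit": 3}
    return canned.get(function)  # None when there are no results


totals = count_agencies({"arts", "audit", "banking"}, fake_search)
print(totals)
```

With the real scraper, the `search` argument could be something like `lambda f: RSAgencySearch(function=f).total_results`, which mirrors the call used in the cell above.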

## Explore the results

In [8]:
# Create a DataFrame with the results
df = pd.DataFrame(function_totals)

In [9]:
df.describe()

| | total |
|---|---|
| count | 472.0 |
| mean | 27.118644 |
| std | 52.554882 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 1.0 |
| 75% | 28.25 |
| max | 419.0 |


So 75% of all functions have 28 or fewer associated agencies, and half have no more than one.
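The `75%` row reported by `describe()` is the 75th percentile of the `total` column. A small sketch of how that figure can be checked directly, using a handful of invented totals rather than the harvested data:

```python
import pandas as pd

# Invented totals, used only to illustrate the calculation
totals = pd.Series([0, 0, 0, 1, 2, 28, 30, 419], name="total")

# The 75th percentile, as reported in the "75%" row of describe()
q75 = totals.quantile(0.75)

# Share of values at or below that percentile
share_at_or_below = (totals <= q75).mean()

print(q75, share_at_or_below)
```

Running the same two lines against `df["total"]` would show what share of functions sit at or below the reported percentile.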

How many are actually used?

In [10]:
# How many functions are actually used
used = df.loc[df["total"] > 0].count()
print(used["total"])

243


In [11]:
percent_used = used["function"] / len(functions_list)
print("{:.1%} of the functions are used".format(percent_used))

51.5% of the functions are used
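An alternative way to get the same two numbers, sketched here with invented data, is to build a boolean mask and take its sum and mean. This is only one possible approach, not the method used in the cells above; the `df` below is a hypothetical stand-in for the real results.

```python
import pandas as pd

# Hypothetical results, standing in for the real search totals
df = pd.DataFrame(
    {
        "function": ["arts", "audit", "banking", "bankruptcy"],
        "total": [12, 0, 3, 0],
    }
)

# True where a function has at least one associated agency
used_mask = df["total"] > 0

n_used = int(used_mask.sum())  # number of used functions
share_used = used_mask.mean()  # proportion of used functions
unused = df.loc[~used_mask, "function"].tolist()

print(f"{n_used} of {len(df)} functions ({share_used:.1%}) are used")
print("Unused:", unused)
```

The mask can also be inverted, as in the last step, to list the functions that returned no agencies.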


In [12]:
# Most used function
df.loc[df["total"] == df["total"].max()]

| | function | total |
|---|---|---|
| 72 | employment | 419 |


In [13]:
# Top 20 by number of agencies
df.sort_values(by="total", ascending=False)[:20]

| | function | total |
|---|---|---|
| 72 | employment | 419 |
| 340 | education | 294 |
| 226 | army commands | 286 |
| 214 | social welfare | 270 |
| 275 | indigenous affairs | 268 |
| 163 | training | 232 |
| 354 | housing | 220 |
| 203 | scientific research | 216 |
| 68 | migration | 199 |
| 417 | goods and services | 195 |


## Show how agencies are distributed across functions

In [14]:
# Bin the agency counts to make the chart easier to read
alt.Chart(df).mark_bar().encode(
    x=alt.X("total:Q", bin=alt.Bin(step=10), title="Number of associated agencies"),
    y=alt.Y("count()", title="Number of functions"),
    tooltip=[
        alt.Tooltip("total:Q", bin=alt.Bin(step=10), title="Agencies"),
        alt.Tooltip("count()", title="Functions"),
    ],
)
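If a table is more convenient than a chart, the same binning can be approximated in pandas. The sketch below assumes bins of width 10 starting at zero, which should correspond to `alt.Bin(step=10)`, and uses invented totals rather than the harvested data.

```python
import pandas as pd

# Invented totals, used only to illustrate the binning
df = pd.DataFrame({"total": [0, 0, 1, 9, 10, 28, 419]})

step = 10
# Lower edge of the bin each value falls into: 0-9 -> 0, 10-19 -> 10, ...
bin_start = (df["total"] // step) * step

# Number of functions in each bin, ordered by bin
counts = bin_start.value_counts().sort_index()

print(counts)
```

Only bins that contain at least one function appear in `counts`, so long empty stretches between the low and high totals are skipped.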

----

Created by [Tim Sherratt](https://timsherratt.org/) as part of the [GLAM Workbench](https://glam-workbench.github.io/).