{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Facets in DigitalNZ\n", "\n", "This notebook examines what data is available via facets in DigitalNZ." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import what we need" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import requests_cache\n", "import pandas as pd\n", "from tqdm.auto import tqdm\n", "from IPython.display import FileLinks, display\n", "from pathlib import Path\n", "\n", "s = requests_cache.CachedSession()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "API_KEY = '[YOUR API KEY]'\n", "API_URL = 'http://api.digitalnz.org/v3/records.json'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define some functions" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def get_records(params):\n", " '''\n", " Get records from a search using the supplied parameters.\n", " '''\n", " response = s.get(API_URL, params=params)\n", " return response.json()\n", "\n", "def check_facet(facet):\n", " '''\n", " Get values for the specified facet, return the total number of values & records,\n", " and save the complete set of values and counts as a CSV.\n", " '''\n", " facet_data = []\n", " params = {\n", " 'facets': [facet],\n", " 'api_key': API_KEY,\n", " 'per_page': 0,\n", " 'facets_per_page': 350\n", " }\n", " data = get_records(params)\n", " try:\n", " facets = data['search']['facets'][facet]\n", " except KeyError:\n", " print('Not a facet!')\n", " facet_data = {'facet': facet}\n", " else:\n", " # If there are more than 350 facet values, harvest them all\n", " if len(facets) == 350:\n", " facets = harvest_facet_values(facet)\n", " \n", " # Convert the facet data to a dataframe\n", " df = pd.DataFrame.from_dict(facets, orient='index').reset_index()\n", " df.columns = ['value', 'count']\n", " \n", " # Save all 
the values and counts as a CSV\n", " df.to_csv(Path('facets', f'{facet}.csv'), index=False)\n", " \n", " # Display summary details\n", " print(f'Number of values: {df.shape[0]:,}')\n", " print(f'Number of records: {df[\"count\"].sum():,}')\n", " \n", " # Return summary details\n", " facet_data = {'facet': facet, 'num_values': df.shape[0], 'num_records': df['count'].sum()}\n", " return facet_data\n", " \n", "def harvest_facet_values(facet, **kwargs):\n", " '''\n", " Harvest all the available values for the given facet.\n", " '''\n", " facets = {}\n", " more = True\n", " page = 1\n", " params = {\n", " 'api_key': API_KEY,\n", " 'per_page': 0,\n", " 'facets': facet,\n", " 'facets_per_page': 350,\n", " }\n", " for k, v in kwargs.items():\n", " if k == 'text':\n", " params[k] = v\n", " else:\n", " params[f'and[{k}][]'] = v\n", " with tqdm(leave=False) as pbar:\n", " while more:\n", " params['facets_page'] = page\n", " data = get_records(params)\n", " if data['search']['facets'][facet]:\n", " facets.update(data['search']['facets'][facet])\n", " pbar.update(350)\n", " page += 1\n", " else:\n", " more = False\n", " return facets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Collect facet data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The API docs say that the following facets are available via the API: `category`, `display_collection`, `creator`, `placename`, `year`, `decade`, `century`, `language`, `content_partner`, `rights`, `collection`. However, `display_collection` isn't available. It's also worth noting that the `collection` facet corresponds to the `collection_title` field.\n", "\n", "After a bit of poking around, I found that facets are also available for `usage`, `copyright`, `dc_type`, `format`, `subject`, and `primary_collection`.\n", "\n", "Let's gather values for each of the available facets." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "category\n", "Number of values: 19\n", "Number of records: 32,126,494\n", "\n", "display_collection\n", "Not a facet!\n", "\n", "creator\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 333,529\n", "Number of records: 3,554,778\n", "\n", "placename\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 216,151\n", "Number of records: 26,365,644\n", "\n", "year\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 974\n", "Number of records: 31,241,963\n", "\n", "decade\n", "Number of values: 279\n", "Number of records: 30,953,700\n", "\n", "century\n", "Number of values: 78\n", "Number of records: 30,866,245\n", "\n", "language\n", "Number of values: 235\n", "Number of records: 24,649,319\n", "\n", "content_partner\n", "Number of values: 215\n", "Number of records: 32,114,054\n", "\n", "rights\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 45,413\n", "Number of records: 29,978,417\n", "\n", 
"collection\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 25,015\n", "Number of records: 60,325,343\n", "\n", "usage\n", "Number of values: 5\n", "Number of records: 81,707,236\n", "\n", "copyright\n", "Number of values: 33\n", "Number of records: 31,990,162\n", "\n", "dc_type\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 3,237\n", "Number of records: 2,099,860\n", "\n", "format\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 80,228\n", "Number of records: 2,770,227\n", "\n", "subject\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 1,019,286\n", "Number of records: 13,659,843\n", "\n", "primary_collection\n", "Number of values: 315\n", "Number of records: 32,113,360\n" ] } ], "source": [ "facets = [\n", " 'category', \n", " 'display_collection', \n", " 'creator', \n", " 'placename', \n", " 'year', \n", " 'decade', \n", " 'century', \n", " 'language', \n", " 'content_partner', \n", " 'rights', \n", " 'collection', \n", " 'usage',\n", " 'copyright',\n", " 'dc_type',\n", " 'format',\n", " 'subject',\n", " 
'primary_collection'\n", "]\n", "\n", "facet_data = []\n", "for facet in facets:\n", " print(f'\\n{facet}')\n", " facet_data.append(check_facet(facet))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a dataset that summarises the contents of each facet. If you look in the `facets` directory, you'll also find a CSV file containing all the values and counts for each facet.\n", "\n", "Let's look at the summary data." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>facet</th><th>num_values</th><th>num_records</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>category</td><td>19</td><td>32126494</td></tr>\n",
"    <tr><th>1</th><td>display_collection</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>creator</td><td>333529</td><td>3554778</td></tr>\n",
"    <tr><th>3</th><td>placename</td><td>216151</td><td>26365644</td></tr>\n",
"    <tr><th>4</th><td>year</td><td>974</td><td>31241963</td></tr>\n",
"    <tr><th>5</th><td>decade</td><td>279</td><td>30953700</td></tr>\n",
"    <tr><th>6</th><td>century</td><td>78</td><td>30866245</td></tr>\n",
"    <tr><th>7</th><td>language</td><td>235</td><td>24649319</td></tr>\n",
"    <tr><th>8</th><td>content_partner</td><td>215</td><td>32114054</td></tr>\n",
"    <tr><th>9</th><td>rights</td><td>45413</td><td>29978417</td></tr>\n",
"    <tr><th>10</th><td>collection</td><td>25015</td><td>60325343</td></tr>\n",
"    <tr><th>11</th><td>usage</td><td>5</td><td>81707236</td></tr>\n",
"    <tr><th>12</th><td>copyright</td><td>33</td><td>31990162</td></tr>\n",
"    <tr><th>13</th><td>dc_type</td><td>3237</td><td>2099860</td></tr>\n",
"    <tr><th>14</th><td>format</td><td>80228</td><td>2770227</td></tr>\n",
"    <tr><th>15</th><td>subject</td><td>1019286</td><td>13659843</td></tr>\n",
"    <tr><th>16</th><td>primary_collection</td><td>315</td><td>32113360</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " facet num_values num_records\n", "0 category 19 32126494\n", "1 display_collection 0 0\n", "2 creator 333529 3554778\n", "3 placename 216151 26365644\n", "4 year 974 31241963\n", "5 decade 279 30953700\n", "6 century 78 30866245\n", "7 language 235 24649319\n", "8 content_partner 215 32114054\n", "9 rights 45413 29978417\n", "10 collection 25015 60325343\n", "11 usage 5 81707236\n", "12 copyright 33 31990162\n", "13 dc_type 3237 2099860\n", "14 format 80228 2770227\n", "15 subject 1019286 13659843\n", "16 primary_collection 315 32113360" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert to a dataframe\n", "df = pd.DataFrame(facet_data)\n", "\n", "# Make sure counts are integers\n", "df['num_values'] = df['num_values'].fillna(0.0).astype('int64')\n", "df['num_records'] = df['num_records'].fillna(0.0).astype('int64')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's save this dataset as a CSV." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "df.to_csv(Path('facets', 'facets.csv'), index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's list all the CSV files we've saved! " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "facets/
\n", "  collection.csv
\n", "  creator.csv
\n", "  subject.csv
\n", "  collections_by_partner.csv
\n", "  format.csv
\n", "  placename.csv
\n", "  decade.csv
\n", "  content_partner.csv
\n", "  language.csv
\n", "  century.csv
\n", "  usage.csv
\n", "  rights.csv
\n", "  usage_by_collection_and_partner.csv
\n", "  year.csv
\n", "  facets.csv
\n", "  copyright.csv
\n", "  dc_type.csv
\n", "  category.csv
\n", "  primary_collection.csv
" ], "text/plain": [ "facets/\n", " collection.csv\n", " creator.csv\n", " subject.csv\n", " collections_by_partner.csv\n", " format.csv\n", " placename.csv\n", " decade.csv\n", " content_partner.csv\n", " language.csv\n", " century.csv\n", " usage.csv\n", " rights.csv\n", " usage_by_collection_and_partner.csv\n", " year.csv\n", " facets.csv\n", " copyright.csv\n", " dc_type.csv\n", " category.csv\n", " primary_collection.csv" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(FileLinks('facets', included_suffixes='.csv', recursive=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Primary collections by Content Partner\n", "\n", "I'm not sure how strict the hierarchies are, but I'm assuming we should be able to connect content partners to collections.\n", "\n", "I've used the results of this to [visualise open collections](visualise_open_collections.ipynb) in DigitalNZ." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "partners = pd.read_csv(Path('facets', 'content_partner.csv'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfs = []\n", "for row in partners.itertuples():\n", " partner = row.value\n", " facets = harvest_facet_values('primary_collection', content_partner=partner)\n", " df = pd.DataFrame.from_dict(facets, orient='index').reset_index()\n", " df.columns = ['primary_collection', 'count']\n", " df['content_partner'] = partner\n", " dfs.append(df)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "df_collections = pd.concat(dfs)\n", "df_collections = df_collections[['content_partner', 'primary_collection', 'count']].sort_values(by=['content_partner', 'primary_collection'])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "df_collections.to_csv(Path('facets', 'collections_by_partner.csv'), index=False)" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/). Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }