{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [
"# Facets in DigitalNZ\n",
"\n",
"This notebook examines what data is available via facets in DigitalNZ." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Import what we need" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [
"import requests\n",
"import requests_cache\n",
"import pandas as pd\n",
"from tqdm.auto import tqdm\n",
"from IPython.display import FileLinks, display\n",
"from pathlib import Path\n",
"\n",
"s = requests_cache.CachedSession()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"API_KEY = '[YOUR API KEY]'\n",
"API_URL = 'http://api.digitalnz.org/v3/records.json'" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Define some functions" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [
"def get_records(params):\n",
"    '''\n",
"    Get records from a search using the supplied parameters.\n",
"    '''\n",
"    response = s.get(API_URL, params=params)\n",
"    return response.json()\n",
"\n",
"def check_facet(facet):\n",
"    '''\n",
"    Get values for the specified facet, return the total number of values & records,\n",
"    and save the complete set of values and counts as a CSV.\n",
"    '''\n",
"    facet_data = []\n",
"    params = {\n",
"        'facets': [facet],\n",
"        'api_key': API_KEY,\n",
"        'per_page': 0,\n",
"        'facets_per_page': 350\n",
"    }\n",
"    data = get_records(params)\n",
"    try:\n",
"        facets = data['search']['facets'][facet]\n",
"    except KeyError:\n",
"        print('Not a facet!')\n",
"        facet_data = {'facet': facet}\n",
"    else:\n",
"        # A full page of 350 values means there may be more, so harvest them all\n",
"        if len(facets) == 350:\n",
"            facets = harvest_facet_values(facet)\n",
"\n",
"        # Convert the facet data to a dataframe\n",
"        df = pd.DataFrame.from_dict(facets, orient='index').reset_index()\n",
"        df.columns = ['value', 'count']\n",
"\n",
"        # Save all the values and counts as a CSV, creating the output directory if needed\n",
"        Path('facets').mkdir(exist_ok=True)\n",
"        df.to_csv(Path('facets', f'{facet}.csv'), index=False)\n",
"\n",
"        # Display summary details\n",
"        print(f'Number of values: {df.shape[0]:,}')\n",
"        print(f'Number of records: {df[\"count\"].sum():,}')\n",
"\n",
"        # Return summary details\n",
"        facet_data = {'facet': facet, 'num_values': df.shape[0], 'num_records': df['count'].sum()}\n",
"    return facet_data\n",
"\n",
"def harvest_facet_values(facet, **kwargs):\n",
"    '''\n",
"    Harvest all the available values for the given facet.\n",
"    '''\n",
"    facets = {}\n",
"    more = True\n",
"    page = 1\n",
"    params = {\n",
"        'api_key': API_KEY,\n",
"        'per_page': 0,\n",
"        'facets': facet,\n",
"        'facets_per_page': 350,\n",
"    }\n",
"    # Pass 'text' through as the search query; treat any other keyword as an 'and' filter\n",
"    for k, v in kwargs.items():\n",
"        if k == 'text':\n",
"            params[k] = v\n",
"        else:\n",
"            params[f'and[{k}][]'] = v\n",
"    with tqdm(leave=False) as pbar:\n",
"        # Request successive pages of facet values until an empty page comes back\n",
"        while more:\n",
"            params['facets_page'] = page\n",
"            data = get_records(params)\n",
"            if data['search']['facets'][facet]:\n",
"                facets.update(data['search']['facets'][facet])\n",
"                pbar.update(350)\n",
"                page += 1\n",
"            else:\n",
"                more = False\n",
"    return facets" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Collect facet data" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"The API docs list the following facets as available: `category`, `display_collection`, `creator`, `placename`, `year`, `decade`, `century`, `language`, `content_partner`, `rights`, `collection`. However, `display_collection` isn't actually available. It's also worth noting that the `collection` facet corresponds to the `collection_title` field.\n",
"\n",
"After a bit of poking around, I found that facets are also available for `usage`, `copyright`, `dc_type`, `format`, `subject`, and `primary_collection`.\n",
"\n",
"Let's gather values for each of the available facets."
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "category\n", "Number of values: 19\n", "Number of records: 32,126,494\n", "\n", "display_collection\n", "Not a facet!\n", "\n", "creator\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 333,529\n", "Number of records: 3,554,778\n", "\n", "placename\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 216,151\n", "Number of records: 26,365,644\n", "\n", "year\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 974\n", "Number of records: 31,241,963\n", "\n", "decade\n", "Number of values: 279\n", "Number of records: 30,953,700\n", "\n", "century\n", "Number of values: 78\n", "Number of records: 30,866,245\n", "\n", "language\n", "Number of values: 235\n", "Number of records: 24,649,319\n", "\n", "content_partner\n", "Number of values: 215\n", "Number of records: 32,114,054\n", "\n", "rights\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 45,413\n", "Number of records: 29,978,417\n", "\n", 
"collection\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 25,015\n", "Number of records: 60,325,343\n", "\n", "usage\n", "Number of values: 5\n", "Number of records: 81,707,236\n", "\n", "copyright\n", "Number of values: 33\n", "Number of records: 31,990,162\n", "\n", "dc_type\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 3,237\n", "Number of records: 2,099,860\n", "\n", "format\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 80,228\n", "Number of records: 2,770,227\n", "\n", "subject\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of values: 1,019,286\n", "Number of records: 13,659,843\n", "\n", "primary_collection\n", "Number of values: 315\n", "Number of records: 32,113,360\n" ] } ], "source": [ "facets = [\n", " 'category', \n", " 'display_collection', \n", " 'creator', \n", " 'placename', \n", " 'year', \n", " 'decade', \n", " 'century', \n", " 'language', \n", " 'content_partner', \n", " 'rights', \n", " 'collection', \n", " 'usage',\n", " 'copyright',\n", " 'dc_type',\n", " 'format',\n", " 'subject',\n", " 
'primary_collection'\n", "]\n", "\n", "facet_data = []\n", "for facet in facets:\n", "    print(f'\\n{facet}')\n", "    facet_data.append(check_facet(facet))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a dataset that summarises the contents of each facet. The `facets` directory also contains a CSV file for each facet, listing all of its values and counts.\n", "\n", "Let's look at the summary data." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | facet | num_values | num_records\n",
"---|---|---|---\n",
"0 | category | 19 | 32126494\n",
"1 | display_collection | 0 | 0\n",
"2 | creator | 333529 | 3554778\n",
"3 | placename | 216151 | 26365644\n",
"4 | year | 974 | 31241963\n",
"5 | decade | 279 | 30953700\n",
"6 | century | 78 | 30866245\n",
"7 | language | 235 | 24649319\n",
"8 | content_partner | 215 | 32114054\n",
"9 | rights | 45413 | 29978417\n",
"10 | collection | 25015 | 60325343\n",
"11 | usage | 5 | 81707236\n",
"12 | copyright | 33 | 31990162\n",
"13 | dc_type | 3237 | 2099860\n",
"14 | format | 80228 | 2770227\n",
"15 | subject | 1019286 | 13659843\n",
"16 | primary_collection | 315 | 32113360\n", "