{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Select a random(ish) record from DigitalNZ\n", "\n", "The DigitalNZ API doesn't provide a random sort option. You can jump to a randomly selected page of results, but you can't do any deeper than 100,000 pages into a results set (that's 1,000,000 records if you set the `per_page` value to 100). So we need to find some way of filtering the results until there's less than 1,000,000, then we can grab a random page and record.\n", "\n", "We can use facets to filter the results. As you can see at the bottom of this notebook, I did a bit of examination of the facets to understand their coverage. If only 50% of records have a value for a particular facet and we use it to filter the results, then 50% of the records will be missing from the pool we make our random selection from. So we want to use facets that have been applied to as many records as possible.\n", "\n", "A blank search returns 31,640,164 results.\n", "\n", "I extracted facets for `category`, `display_collection`, `creator`, `placename`, `year`, `decade`, `century`, `language`, `content_partner`, `rights`, `collection`, and `usage`. The facets that seem to have the best coverage are:\n", "\n", "* `category`: 31,653,142 records\n", "* `content_partner`: 31,642,453 records\n", "* `year`: 30,867,103 records\n", "\n", "I don't know why `category` and `content_partner` have more records than a blank search – I suppose either the blank search is filtering out records, or some records have multiple values for these facets. Note, too, that `year` has 918 values! The maximum number of facet values that can be retrieved in a single request is 350, so this makes it tricky to filter the results using just the `year` facet. By applying `category` and `content_partner` before `year`, I should limit the number of year values, and hopefully avoid overlooking too many records. (I could analyse all the combinations of these facets to see how many records might be missed, but I don't think it's worth it at this stage.)\n", "\n", "So for now, I've decided to apply a randomly selected value from each of these facets in the following order – `category`, `content_partner`, and `year`. After applying each filter I'll check to see if we were under 1,000,000 results, if so we'll grab a record by jumping to a random page, and selecting a random result!\n", "\n", "As you can see from the examples below, you can also supply your own filters if you want to limit the selection pool." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import what we need" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import random\n", "import math\n", "import pandas as pd\n", "from tqdm.auto import tqdm\n", "from IPython.display import Image, display, HTML" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "API_KEY = '[YOUR API KEY]'\n", "API_URL = 'http://api.digitalnz.org/v3/records.json'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define some functions" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def get_total(**kwargs):\n", " '''\n", " Get the total number of results from a query built using supplied kwargs as parameters.\n", " '''\n", " params = {\n", " 'api_key': API_KEY,\n", " 'per_page': 0\n", " }\n", " params = add_kwargs_to_params(params, kwargs)\n", " data = get_records(params)\n", " return data['search']['result_count']\n", " \n", "def get_records(params):\n", " '''\n", " Get records from a search using the supplied parameters.\n", " '''\n", " response = requests.get(API_URL, params=params)\n", " return response.json()\n", "\n", "def add_kwargs_to_params(params, kwargs):\n", " '''\n", " Add kwargs to query parameters.\n", " '''\n", " for k, v in kwargs.items():\n", " if k == 'text':\n", " params[k] = v\n", " else:\n", " params[f'and[{k}][]'] = v\n", " return params\n", "\n", "def get_random_result(**kwargs):\n", " '''\n", " Select a random result from a query built using supplied kwargs as parameters.\n", " '''\n", " total = get_total(**kwargs)\n", " pages = math.ceil(total / 100)\n", " page = random.choice(list(range(1,pages + 1)))\n", " params = {\n", " 'api_key': API_KEY,\n", " 'per_page': 100,\n", " 'page': page\n", " }\n", " params = add_kwargs_to_params(params, kwargs)\n", " data = get_records(params)\n", " try:\n", " record = random.choice(data['search']['results'])\n", " except KeyError:\n", " record = None\n", " return record\n", "\n", "def get_facets(facet, **kwargs):\n", " '''\n", " Get values for the specified facet.\n", " '''\n", " params = {\n", " 'facets': [facet],\n", " 'api_key': API_KEY,\n", " 'per_page': 0,\n", " 'facets_per_page': 350 # 350 is the max\n", " }\n", " params = add_kwargs_to_params(params, kwargs)\n", " data = get_records(params)\n", " total = data['search']['result_count']\n", " facets = data['search']['facets'][facet]\n", " return (total, facets)\n", "\n", "def get_random_facet(facets):\n", " '''\n", " Select a facet value from a list of facets, using the facet counts as weights.\n", " '''\n", " values = [{k:v} for k,v in facets.items()]\n", " weights = list(facets.values())\n", " facet = random.choices(values, weights=weights, k=1)[0]\n", " return list(facet.items())[0]\n", "\n", "def select_facet(facet, **kwargs):\n", " '''\n", " Apply the specified facet to a query, if the total results are less than 1,000,000 then get a random result.\n", " '''\n", " _, facets = get_facets(facet, **kwargs)\n", " value = get_random_facet(facets)\n", " print(f' * {facet.title()}: {value[0]}')\n", " kwargs[facet] = value[0]\n", " if value[1] < 1000000:\n", " record = get_random_result(**kwargs)\n", " else:\n", " record = None\n", " return (record, kwargs)\n", " \n", "def get_random_record(**kwargs):\n", " print('Additional filters:')\n", " if kwargs:\n", " total = get_total(**kwargs)\n", " if total < 1000000:\n", " print(' * None')\n", " return get_random_result(**kwargs)\n", " for facet in ['category', 'content_partner', 'year']:\n", " if facet not in kwargs:\n", " record, kwargs = select_facet(facet, **kwargs)\n", " if record:\n", " return record\n", " return 'Too many'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A random record" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Additional filters:\n", " * Category: Newspapers\n", " * Content_Partner: National Library of New Zealand\n", " * Year: 1905\n" ] }, { "data": { "text/html": [ "\n", "

FURTHER JAPANESE SUCCESSES ON DAND. (Colonist, 12 June 1905)

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "More..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Get a record\n", "record = get_random_record()\n", "\n", "# Display the results\n", "display(HTML(f'\\n

{record[\"title\"]}

'))\n", "if record['description']:\n", " display(HTML(f'

{record[\"description\"]}

'))\n", "display(HTML(f'More...'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A random newspaper article" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Additional filters:\n", " * Content_Partner: National Library of New Zealand\n", " * Year: 1885\n" ] }, { "data": { "text/html": [ "\n", "

THE FATE OF THE ALABAMA AWARD. (Evening Post, 14 November 1885)

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

the fate of the alabama award at the time when the geneva award on the alabama claims was i made it waß a matter of remark that not only were the claims of the united states government grossly excessive but the actual amount awarded three millions sterling was far in exasss of any legitimate demands an article which now nppears in the new york press headed a court of swindlers is a con elusive justification of that contention after the three millions were awarded it was determined that the money...

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "More..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Get a record\n", "record = get_random_record(category='Newspapers')\n", "\n", "# Display the results\n", "display(HTML(f'\\n

{record[\"title\"]}

'))\n", "display(HTML(f'

{record[\"fulltext\"][:500]}...

'))\n", "display(HTML(f'More...'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A random newspaper article from a specific decade" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Additional filters:\n", " * Content_Partner: National Library of New Zealand\n", " * Year: 1920\n" ] }, { "data": { "text/html": [ "

THE TURF. (Marlborough Express, 05 February 1920)

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

the turf taranaki results press association new plymouth feb fi the taranaki jockey club meeting conoluded in fine weather results hurdles explorer 1112 1 fair paul 90 2 cheddar 90 3 scratched master moutoa won l-y 1 lengths time 3min 1 l-ssec munster fell kawa-j handicap starland 98 1 self alliance 810 2 haversnck 67 3 all started won by four lengths time imin slsec...

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "More..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Get a record\n", "record = get_random_record(category='Newspapers', decade='1920')\n", "\n", "# Display the results\n", "display(HTML(f'

{record[\"title\"]}

'))\n", "display(HTML(f'

{record[\"fulltext\"][:500]}...

'))\n", "display(HTML(f'More...'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A random article from a specific newspaper\n", "\n", "The newspaper title is stored in `collection_title` and `publisher`, but you don't seem to be able to filter using either of these, so we'll just do a `text` search for the title instead. This may mean we get results that are not actually from this newspaper..." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Additional filters:\n", " * Content_Partner: National Library of New Zealand\n", " * Year: 1936\n" ] }, { "data": { "text/html": [ "

Page 1 Advertisements Column 4 (Evening Post, 29 January 1936)

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

newman bros ltd regular services picton-blenheim-christchurch nelson-motueka-takaka west coast glaciers full particulars from a ll government tourist offices thos cook and son t and w young wellington wanted to sell wanted sell piano 14 10s or make offer owner removing iron frame overstrung write 1007 evg post yxiunxiiid to sell girl technical uni form only few months wear cheap apply 01 graftou road roseueath vyanted sell singer g6k4 droplieads bargains singer dropheads 5 hand machines cheap gu...

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "More..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Get a record\n", "record = get_random_record(category='Newspapers', text='Evening Post')\n", "\n", "# Display the results\n", "display(HTML(f'

{record[\"title\"]}

'))\n", "display(HTML(f'

{record[\"fulltext\"][:500]}...

'))\n", "display(HTML(f'More...'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A random item from a specific content partner" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Additional filters:\n", " * None\n" ] }, { "data": { "text/html": [ "

McCarthy, Woman

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

Woman seated on bench, portrait in background scratched out.; Black and White Sheet Film; Black and White Negative

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "More..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Get a record\n", "record = get_random_record(content_partner='Puke Ariki')\n", "\n", "# Display the results\n", "display(HTML(f'

{record[\"title\"]}

'))\n", "if 'thumbnail_url' in record and record['thumbnail_url']:\n", " display(Image(url=record['thumbnail_url'], format='jpg'))\n", "display(HTML(f'

{record[\"description\"]}

'))\n", "display(HTML(f'More...'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A random open image" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Additional filters:\n", " * Content_Partner: Museum of New Zealand Te Papa Tongarewa\n" ] }, { "data": { "text/html": [ "

Avenue

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "More..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Get a record\n", "record = get_random_record(category='Images', usage='Use commercially')\n", "\n", "# Display the results\n", "display(HTML(f'

{record[\"title\"]}

'))\n", "try:\n", " if 'large_thumbnail_url' in record:\n", " display(Image(url=record['large_thumbnail_url'], format='jpg'))\n", " else:\n", " display(Image(url=record['thumbnail_url'], format='jpg'))\n", "except:\n", " pass\n", "if record['description']:\n", " display(HTML(f'

{record[\"description\"]}

'))\n", "display(HTML(f'More...'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/). Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }