# Select a random(ish) record from DigitalNZ

The DigitalNZ API doesn't provide a random sort option. You can jump to a randomly selected page of results, but you can't go any deeper than 100,000 pages into a results set (that's 1,000,000 records if you set the `per_page` value to 100). So we need some way of filtering the results until there are fewer than 1,000,000. Then we can grab a random page, and a random record from that page.
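
To make the paging arithmetic concrete, here's a minimal sketch of choosing a random page once a result set is small enough. It assumes 100 records per page and uses the 1,000,000-record ceiling mentioned above. `pick_random_page` and `MAX_RESULTS` are names invented for this example; they aren't part of the DigitalNZ API.

```python
import math
import random

PER_PAGE = 100           # page size used throughout this notebook
MAX_RESULTS = 1_000_000  # ceiling taken from the text above (an assumption, not an API constant)

def pick_random_page(total, per_page=PER_PAGE, rng=random):
    """Return a random 1-based page number, or None if the result set is empty or too big."""
    if total <= 0 or total >= MAX_RESULTS:
        return None
    pages = math.ceil(total / per_page)
    return rng.randint(1, pages)

print(pick_random_page(250))         # a page number between 1 and 3
print(pick_random_page(31_640_164))  # None, because the set needs filtering first
```

Note that the last page of a result set will usually hold fewer than 100 records, so records on that page have a slightly better chance of being selected.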

We can use facets to filter the results. As you can see at the bottom of this notebook, I did a bit of examination of the facets to understand their coverage. If only 50% of records have a value for a particular facet and we use it to filter the results, then 50% of the records will be missing from the pool we make our random selection from. So we want to use facets that have been applied to as many records as possible.

A blank search returns 31,640,164 results.

I extracted facets for `category`, `display_collection`, `creator`, `placename`, `year`, `decade`, `century`, `language`, `content_partner`, `rights`, `collection`, and `usage`. The facets that seem to have the best coverage are:

* `category`: 31,653,142 records
* `content_partner`: 31,642,453 records
* `year`: 30,867,103 records

I don't know why `category` and `content_partner` have more records than a blank search – I suppose either the blank search is filtering out records, or some records have multiple values for these facets. Note, too, that `year` has 918 values! The maximum number of facet values that can be retrieved in a single request is 350, so this makes it tricky to filter the results using just the `year` facet. By applying `category` and `content_partner` before `year`, I should limit the number of year values, and hopefully avoid overlooking too many records. (I could analyse all the combinations of these facets to see how many records might be missed, but I don't think it's worth it at this stage.)
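
As a rough check on coverage, the sketch below divides each facet's summed counts (the figures listed above) by the blank search total. The `coverage` helper is just an illustration. A ratio above 1 suggests either multi-valued records or a blank search that hides some records, so treat the result as a rough indicator only.

```python
blank_search_total = 31_640_164

# Summed facet counts reported above
facet_totals = {
    'category': 31_653_142,
    'content_partner': 31_642_453,
    'year': 30_867_103,
}

def coverage(facet_total, search_total):
    """Ratio of a facet's summed counts to the total number of results."""
    return facet_total / search_total

for facet, total in facet_totals.items():
    print(f'{facet}: {coverage(total, blank_search_total):.2%}')
```

On these figures, `year` covers a little under 98% of the blank search total, so filtering by year leaves roughly 770,000 records out of the selection pool.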

So for now, I've decided to apply a randomly selected value from each of these facets in the following order: `category`, `content_partner`, and `year`. After applying each filter, I'll check whether we're under 1,000,000 results. If so, we'll grab a record by jumping to a random page and selecting a random result!
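
The narrowing process can be sketched offline, without calling the API. Everything in `FAKE_INDEX` is invented for illustration (the partner names and counts are not real DigitalNZ data), and only two facet levels are modelled, but the logic follows the same pattern: pick a facet value weighted by its count, then stop as soon as the pool drops under the limit.

```python
import random

LIMIT = 1_000_000

# Invented counts, for illustration only: {category: {content_partner: count}}
FAKE_INDEX = {
    'Newspapers': {'Partner A': 4_000_000, 'Partner B': 600_000},
    'Images': {'Partner A': 300_000, 'Partner C': 100_000},
}

def weighted_pick(counts, rng):
    """Pick a key from {value: count} with probability proportional to its count."""
    values = list(counts)
    return rng.choices(values, weights=[counts[v] for v in values], k=1)[0]

def narrow(rng=random):
    """Apply category, then content partner, stopping once the pool is under LIMIT."""
    applied = {}
    category_counts = {c: sum(p.values()) for c, p in FAKE_INDEX.items()}
    category = weighted_pick(category_counts, rng)
    applied['category'] = category
    remaining = category_counts[category]
    if remaining < LIMIT:
        return applied, remaining
    partner = weighted_pick(FAKE_INDEX[category], rng)
    applied['content_partner'] = partner
    return applied, FAKE_INDEX[category][partner]

filters, remaining = narrow()
print(filters, remaining)
```

Weighting by count matters: if every record had exactly one value for a facet, picking a value with probability proportional to its count and then a record uniformly within it would give each record the same overall chance. Picking facet values uniformly would instead favour records in small groups.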

As you can see from the examples below, you can also supply your own filters if you want to limit the selection pool.

## Import what we need

In [1]:
import requests
import random
import math
import pandas as pd
from tqdm.auto import tqdm
from IPython.display import Image, display, HTML

In [None]:
API_KEY = '[YOUR API KEY]'
API_URL = 'https://api.digitalnz.org/v3/records.json'

## Define some functions

In [3]:
def get_total(**kwargs):
    '''
    Get the total number of results from a query built using supplied kwargs as parameters.
    '''
    params = {
        'api_key': API_KEY,
        'per_page': 0
    }
    params = add_kwargs_to_params(params, kwargs)
    data = get_records(params)
    return data['search']['result_count']
    
def get_records(params):
    '''
    Get records from a search using the supplied parameters.
    '''
    response = requests.get(API_URL, params=params)
    return response.json()

def add_kwargs_to_params(params, kwargs):
    '''
    Add kwargs to query parameters.
    '''
    for k, v in kwargs.items():
        if k == 'text':
            params[k] = v
        else:
            params[f'and[{k}][]'] = v
    return params

def get_random_result(**kwargs):
    '''
    Select a random result from a query built using supplied kwargs as parameters.
    Returns None if the query has no results.
    '''
    total = get_total(**kwargs)
    if not total:
        # Nothing to choose from
        return None
    pages = math.ceil(total / 100)
    page = random.randint(1, pages)
    params = {
        'api_key': API_KEY,
        'per_page': 100,
        'page': page
    }
    params = add_kwargs_to_params(params, kwargs)
    data = get_records(params)
    try:
        record = random.choice(data['search']['results'])
    except (KeyError, IndexError):
        # Either there's no results list, or the page came back empty
        record = None
    return record

def get_facets(facet, **kwargs):
    '''
    Get values for the specified facet.
    '''
    params = {
        'facets': [facet],
        'api_key': API_KEY,
        'per_page': 0,
        'facets_per_page': 350 # 350 is the max
    }
    params = add_kwargs_to_params(params, kwargs)
    data = get_records(params)
    total = data['search']['result_count']
    facets = data['search']['facets'][facet]
    return (total, facets)

def get_random_facet(facets):
    '''
    Select a facet value from a dict of facets, using the facet counts as weights.
    Returns a (value, count) tuple.
    '''
    values = list(facets.items())
    weights = list(facets.values())
    return random.choices(values, weights=weights, k=1)[0]

def select_facet(facet, **kwargs):
    '''
    Apply a randomly selected value of the specified facet to a query.
    If the selected value has fewer than 1,000,000 results, get a random result.
    '''
    _, facets = get_facets(facet, **kwargs)
    value = get_random_facet(facets)
    print(f'  * {facet.title()}: {value[0]}')
    kwargs[facet] = value[0]
    if value[1] < 1000000:
        record = get_random_result(**kwargs)
    else:
        record = None
    return (record, kwargs)
    
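def get_random_record(**kwargs):
    '''
    Get a random record, using any supplied kwargs as filters.
    Randomly selected facet values are added until there are fewer than 1,000,000 results.
    Returns the string 'Too many' if the results can't be narrowed far enough.
    '''
    print('Additional filters:')
    if kwargs:
        # The supplied filters might already narrow things down enough
        total = get_total(**kwargs)
        if total < 1000000:
            print('  * None')
            return get_random_result(**kwargs)
    # Otherwise apply each facet in turn until a record is returned
    for facet in ['category', 'content_partner', 'year']:
        if facet not in kwargs:
            record, kwargs = select_facet(facet, **kwargs)
            if record:
                return record
    return 'Too many'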

## A random record

In [4]:
# Get a record
record = get_random_record()

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
if record['description']:
    display(HTML(f'<p>{record["description"]}</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))

Additional filters:
  * Category: Newspapers
  * Content_Partner: National Library of New Zealand
  * Year: 1905


## A random newspaper article

In [5]:
# Get a record
record = get_random_record(category='Newspapers')

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
display(HTML(f'<p>{record["fulltext"][:500]}...</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))

Additional filters:
  * Content_Partner: National Library of New Zealand
  * Year: 1885


## A random newspaper article from a specific decade

In [6]:
# Get a record
record = get_random_record(category='Newspapers', decade='1920')

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
display(HTML(f'<p>{record["fulltext"][:500]}...</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))

Additional filters:
  * Content_Partner: National Library of New Zealand
  * Year: 1920


## A random article from a specific newspaper

The newspaper title is stored in `collection_title` and `publisher`, but you don't seem to be able to filter using either of these, so we'll just do a `text` search for the title instead. This may mean we get results that are not actually from this newspaper...
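
To show how a `text` search differs from a facet filter in the request, here's a small standalone sketch that mirrors this notebook's `add_kwargs_to_params()` function, minus the API key. `build_params` is a name invented for this example.

```python
from urllib.parse import urlencode

def build_params(**kwargs):
    """Build query parameters the same way add_kwargs_to_params() does."""
    params = {'per_page': 100}
    for key, value in kwargs.items():
        if key == 'text':
            # Free-text search: passed through as a plain parameter
            params['text'] = value
        else:
            # Facet filter: wrapped in the and[...][] form
            params[f'and[{key}][]'] = value
    return params

params = build_params(category='Newspapers', text='Evening Post')
print(params)
print(urlencode(params))
```

The `category` value is applied as an exact filter, while `text` matches the words anywhere they're indexed, which is why some results may come from other newspapers.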

In [7]:
# Get a record
record = get_random_record(category='Newspapers', text='Evening Post')

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
display(HTML(f'<p>{record["fulltext"][:500]}...</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))

Additional filters:
  * Content_Partner: National Library of New Zealand
  * Year: 1936


## A random item from a specific content partner

In [8]:
# Get a record
record = get_random_record(content_partner='Puke Ariki')

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
if 'thumbnail_url' in record and record['thumbnail_url']:
    display(Image(url=record['thumbnail_url'], format='jpg'))
display(HTML(f'<p>{record["description"]}</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))

Additional filters:
  * None


## A random open image

In [9]:
# Get a record
record = get_random_record(category='Images', usage='Use commercially')

# Display the results
display(HTML(f'<h4>{record["title"]}</h4>'))
# Prefer the large thumbnail, fall back to the small one, skip if neither has a value
image_url = record.get('large_thumbnail_url') or record.get('thumbnail_url')
if image_url:
    display(Image(url=image_url, format='jpg'))
if record['description']:
    display(HTML(f'<p>{record["description"]}</p>'))
display(HTML(f'<a href="{record["landing_url"]}">More...</a>'))

Additional filters:
  * Content_Partner: Museum of New Zealand Te Papa Tongarewa


----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/). Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb).