# Exploring object records

In this notebook we'll have a preliminary poke around in the `object` data harvested from the [NMA Collection API](https://www.nma.gov.au/about/our-collection/our-apis). I'll focus here on the basic shape and statistics of the data; other notebooks explore the object data over [time](explore_collection_object_over_time.ipynb) and [space](explore_objects_and_places.ipynb).

If you haven't already, you'll either need to [harvest the `object` data](harvest_records.ipynb), or [unzip a pre-harvested dataset](unzip_preharvested_data.ipynb).

* [The shape of the data](#The-shape-of-the-data)
* [Nested data](#Nested-data)
* [The `additionalType` field](#The-additionalType-field)
* [The `extent` field](#The-extent-field)
* [How big is the collection?](#How-big-is-the-collection?)
* [The biggest object?](#The-biggest-object?)


<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them.</li>
        <li>To run a code cell click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>

<p><b>Is this thing on?</b> If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to <a href="https://mybinder.org/v2/gh/GLAM-Workbench/national-museum-australia/master?urlpath=lab%2Ftree%2Fexploring_object_records.ipynb">load a <b>live</b> version</a> running on Binder.</p>

</div>

## Import what we need

In [34]:
import pandas as pd
import math
from IPython.display import display, HTML, FileLink
from tinydb import TinyDB, Query
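try:
    from pandas import json_normalize  # pandas 1.0 and above
except ImportError:
    from pandas.io.json import json_normalize  # older versions of pandas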

## Load the harvested data

In [35]:
# Load the harvested data from the json db
db = TinyDB('nma_object_db.json')
records = db.all()
Object = Query()

In [36]:
# Convert to a dataframe
df = pd.DataFrame(records)
df.head()

Unnamed: 0,id,type,title,_meta,additionalType,collection,identifier,medium,extent,physicalDescription,...,isPartOf,seeAlso,description,hasVersion,temporal,relation,hasPart,location,acknowledgement,educationalSignificance
0,145400,object,"Wahlo and Tribal law by Kevin Gilbert, reprint...","{'modified': '2018-07-09', 'issued': '2011-10-...",,,,,,,...,,,,,,,,,,
1,251390,object,Pair of woven shoes made from feathers and hair,"{'modified': '2019-01-17', 'issued': '2018-04-...",[Shoes],"{'id': '5244', 'type': 'Collection', 'title': ...",2000.0014.0495,"[{'type': 'Material', 'title': 'Feather'}, {'t...","{'type': 'Measurement', 'length': 260, 'width'...","Shoes, the soles of which are made from woven ...",...,,,,,,,,,,
2,124081,object,Pair of ceremonial shoes,"{'modified': '2018-12-04', 'issued': '2006-10-...",,"{'id': '1892', 'type': 'Collection', 'title': ...",1992.0089.0165,"[{'type': 'Material', 'title': 'Feather'}]","{'type': 'Measurement', 'length': 246, 'width'...",A pair of ceremonial shoes made with several m...,...,,,,,,,,,,
3,21507,object,Grinding stone,"{'modified': '2018-06-19', 'issued': '2014-12-...",[Grinding stones],"{'id': '2229', 'type': 'Collection', 'title': ...",1985.0288.0109,,,,...,,,,,,,,,,
4,142308,object,'time CHange' [sic],"{'modified': '2019-04-15', 'issued': '2012-06-...",[Compact discs],"{'id': '3893', 'type': 'Collection', 'title': ...",AR00213.012,,,"A compact disc, housed within a clear and blac...",...,,,,,,,,,,


## The shape of the data

How many objects are there?

In [37]:
print('There are {:,} objects in the collection'.format(df.shape[0]))

There are 86,679 objects in the collection


Obviously not every record has a value for every field, so let's create a quick count of the number of values in each field.

In [38]:
df.count()

id                         86679
type                       86679
title                      86463
_meta                      86679
additionalType             86652
collection                 84256
identifier                 86654
medium                     73743
extent                     64077
physicalDescription        86359
significanceStatement      32437
creator                    25076
spatial                    46658
contributor                40796
isAggregatedBy              4353
isPartOf                   10718
seeAlso                      467
description                 9097
hasVersion                 19845
temporal                   29399
relation                    3066
hasPart                     2345
location                    1364
acknowledgement              785
educationalSignificance      201
dtype: int64

Let's express those counts as a percentage of the total number of records, and display them as a bar chart using Pandas.

In [39]:
# Get field counts and convert to dataframe
field_counts = df.count().to_frame().reset_index()

# Change column headings
field_counts.columns = ['field', 'count']

# Calculate proportion of the total
field_counts['proportion'] = field_counts['count'].apply(lambda x: x / df.shape[0])

# Style the results as a barchart
field_counts.style.bar(subset=['proportion'], color='#d65f5f').format({'proportion': '{:.2%}'.format})

Unnamed: 0,field,count,proportion
0,id,86679,100.00%
1,type,86679,100.00%
2,title,86463,99.75%
3,_meta,86679,100.00%
4,additionalType,86652,99.97%
5,collection,84256,97.20%
6,identifier,86654,99.97%
7,medium,73743,85.08%
8,extent,64077,73.92%
9,physicalDescription,86359,99.63%
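
The same proportions can also be calculated in one step, because the mean of a boolean column is the share of `True` values. Here's a minimal sketch using a made-up dataframe (the `toy` data is hypothetical, not from the NMA harvest):

```python
import pandas as pd

# Hypothetical toy data standing in for the harvested records
toy = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'title': ['Shoes', None, 'Grinding stone', 'Booklet'],
    'extent': [{'length': 260}, None, None, {'height': 214}],
})

# notna() marks non-missing values as True;
# the mean of those booleans is the proportion of records with a value
proportions = toy.notna().mean()
print(proportions.map('{:.2%}'.format))
```

This gives the proportions only, so the `count()` approach above is still useful if you want the raw numbers alongside.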


## Nested data

One thing you might note is that some of the fields contain nested JSON arrays or objects. For example `additionalType` contains a list of object types, while `extent` is a dictionary with keys and values. Let's unpack these columns for the second row (index of 1).

In [40]:
df['additionalType'][1][0]

'Shoes'

In [41]:
df['extent'][1]

{'type': 'Measurement',
 'length': 260,
 'width': 120,
 'depth': 40,
 'unitText': 'mm'}

In [42]:
df['extent'][1]['length']

260
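
Indexing straight into a nested value like this will raise an error for rows where `extent` is empty, such as the grinding stone in row 3 above. A small helper can guard against that. This is only a sketch: `get_extent_value()` is a hypothetical name, and it assumes missing values appear as `None` or `NaN`.

```python
import math

def get_extent_value(extent, key):
    """Return extent[key], or None if the extent is missing or lacks the key."""
    # Missing values in a dataframe usually show up as None or a float NaN
    if extent is None or (isinstance(extent, float) and math.isnan(extent)):
        return None
    return extent.get(key)

shoes = {'type': 'Measurement', 'length': 260, 'width': 120, 'depth': 40, 'unitText': 'mm'}

print(get_extent_value(shoes, 'length'))         # value is present
print(get_extent_value(float('nan'), 'length'))  # extent is missing
print(get_extent_value(shoes, 'height'))         # key is missing
```

With the real data you could try something like `df['extent'].apply(get_extent_value, key='length')`.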

## The `additionalType` field

How many objects have values in the `additionalType` column?

In [43]:
df.loc[df['additionalType'].notnull()].shape

(86652, 25)

In [44]:
print('{:%} of objects have an additionalType value'.format(df.loc[df['additionalType'].notnull()].shape[0] / df.shape[0]))

99.968851% of objects have an additionalType value


So which ones don't have an `additionalType`?

In [45]:
# Just show the first 5 rows
df.loc[df['additionalType'].isnull()].head()

Unnamed: 0,id,type,title,_meta,additionalType,collection,identifier,medium,extent,physicalDescription,...,isPartOf,seeAlso,description,hasVersion,temporal,relation,hasPart,location,acknowledgement,educationalSignificance
0,145400,object,"Wahlo and Tribal law by Kevin Gilbert, reprint...","{'modified': '2018-07-09', 'issued': '2011-10-...",,,,,,,...,,,,,,,,,,
2,124081,object,Pair of ceremonial shoes,"{'modified': '2018-12-04', 'issued': '2006-10-...",,"{'id': '1892', 'type': 'Collection', 'title': ...",1992.0089.0165,"[{'type': 'Material', 'title': 'Feather'}]","{'type': 'Measurement', 'length': 246, 'width'...",A pair of ceremonial shoes made with several m...,...,,,,,,,,,,
1054,224632,object,Glass plate negative of family and horse stand...,"{'copyright': '', 'licence': ''}",,,,,,,...,,,,,,,,,,
1276,180161,object,Awelye- panel 1 by Lily Kngwarreye,"{'copyright': '', 'licence': ''}",,,,,,,...,,,,,,,,,,
2333,180168,object,Awelye- panel 5 by Lily Kngwarreye,"{'copyright': '', 'licence': ''}",,,,,,,...,,,,,,,,,,


How many rows have more than one `additionalType`?

In [46]:
df.loc[df['additionalType'].str.len() > 1].shape[0]

1037

Let's have a look at a sample.

In [47]:
df.loc[df['additionalType'].str.len() > 1].head()

Unnamed: 0,id,type,title,_meta,additionalType,collection,identifier,medium,extent,physicalDescription,...,isPartOf,seeAlso,description,hasVersion,temporal,relation,hasPart,location,acknowledgement,educationalSignificance
45,202601,object,Album of Newspaper clippings,"{'modified': '2019-04-22', 'issued': '2010-11-...","[Albums, Newspaper clippings]","{'id': '4760', 'type': 'Collection', 'title': ...",1989.0009.0108,"[{'type': 'Material', 'title': 'Cardboard'}, {...","{'type': 'Measurement', 'height': 345, 'width'...",A brown textured hardback album with gold colo...,...,,,,,"[{'type': 'Event', 'title': '1935', 'startDate...",,,,,
118,223557,object,"Receipt issued to Tirranna Race Club, 1878","{'modified': '2019-04-23', 'issued': '2017-11-...","[Invoices, Receipts]","{'id': '6139', 'type': 'Collection', 'title': ...",2012.0019.0170,"[{'type': 'Material', 'title': 'Ink'}, {'type'...","{'type': 'Measurement', 'height': 114, 'width'...",A receipt handwritten on a piece of grey paper...,...,,,,,"[{'type': 'Event', 'title': '1878', 'startDate...",,,,,
155,227915,object,Two toned ceramic toy tea set,"{'modified': '2019-05-17', 'issued': '2018-08-...","[Tea sets, Toy tea sets]","{'id': '6773', 'type': 'Collection', 'title': ...",2013.0038.0255,"[{'type': 'Material', 'title': 'Ceramic'}, {'t...","{'type': 'Measurement', 'height': 15, 'diamete...",A hand-painted ceramic toy tea set with a blue...,...,,,,,"[{'type': 'Event', 'title': '1925 - 1935', 'st...",,,,Donated through the Australian Government’s Cu...,
173,256766,object,Handmade wolf figurine in yellow dress likely ...,"{'modified': '2018-12-13', 'issued': '2018-10-...","[Novelty toys, Toys]","{'id': '6773', 'type': 'Collection', 'title': ...",2013.0038.0556.005,"[{'type': 'Material', 'title': 'Cotton thread'...","{'type': 'Measurement', 'height': 88, 'width':...",A handmade wolf figurine robed in a yellow dre...,...,,,,,"[{'type': 'Event', 'title': '1925 - 1935', 'st...",,,,,
564,224635,object,Photograph of'Freda Mitchell',"{'modified': '2019-07-01', 'issued': '2018-11-...","[Photographs, Sepia photographs]","{'id': '6339', 'type': 'Collection', 'title': ...",2013.0062.0017.002,"[{'type': 'Material', 'title': 'Card'}, {'type...","{'type': 'Measurement', 'height': 147, 'width'...",A sepia photograph showing a young woman posin...,...,,,,,,,,,,


The `additionalType` field contains a nested list of values. Using `json_normalize()` or `explode()` we can explode these lists, creating a row for each separate value.

In [48]:
# Use json_normalize to expand 'additionalType' into separate rows, adding the id and title from the parent record
# df_types = json_normalize(df.loc[df['additionalType'].notnull()].to_dict('records'), record_path='additionalType', meta=['id', 'title'], errors='ignore').rename({0: 'additionalType'}, axis=1)

# In pandas v0.25 and above you can just use explode -- this produces the same result as above
df_types = df.loc[df['additionalType'].notnull()][['id', 'title', 'additionalType']].explode('additionalType')

df_types.head()

Unnamed: 0,id,title,additionalType
1,251390,Pair of woven shoes made from feathers and hair,Shoes
3,21507,Grinding stone,Grinding stones
4,142308,'time CHange' [sic],Compact discs
5,20174,Ten Days To Live - A supposed sorcery painting.,Bark paintings
6,144359,'The Dance of Life (1898-1902)' by Diana Boyer...,Booklets
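
To see what `explode()` is doing, here's a minimal sketch on three made-up rows (the `toy` data is hypothetical). Note that the exploded rows keep the index of the record they came from, so a record with two types appears twice with the same index.

```python
import pandas as pd

# Hypothetical records: one with two types, one with one type, one with none
toy = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'additionalType': [['Albums', 'Newspaper clippings'], ['Shoes'], None],
})

# Drop rows without a type, then create one row per list element
exploded = toy.loc[toy['additionalType'].notnull()].explode('additionalType')
print(exploded)

# .str.len() also works on list values, giving the number of types per record
print(toy['additionalType'].str.len())
```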


Now that we've exploded the type values, we can aggregate them in different ways. Let's look at the 25 most common object types!

In [49]:
df_types['additionalType'].value_counts()[:25]

Mineral samples                   6000
Photographs                       4747
Stone artefacts                   4364
Photographic postcards            4250
Drawings                          3759
Postcards                         3697
Zoological specimens              2168
Bark paintings                    2110
Geological specimens              1993
Cartoons                          1535
Engravings                        1495
Negatives                         1124
Boomerangs                        1025
Spears                            1012
Percussion and abrading stones     982
Paintings                          840
Clubs                              747
Mounts                             745
Cards                              709
Armbands                           649
Shells                             563
Letters                            542
Documents                          517
Geophysical survey equipment       509
Posters                            495
Name: additionalType, dtype: int64

How many object types only appear once?

In [50]:
# Name the index and values explicitly so the column names don't depend on the pandas version
type_counts = df_types['additionalType'].value_counts().rename_axis('type').reset_index(name='count')
unique_types = type_counts.loc[type_counts['count'] == 1]
unique_types.shape[0]

639

In [51]:
unique_types.head()

Unnamed: 0,type,count
1852,Genealogical charts,1
1853,Skivvies,1
1854,Shopping bags,1
1855,Jam spoons,1
1856,Architectural models,1
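
If you'd like to check the logic without pandas, the same tally can be sketched with `collections.Counter` from the standard library. The `type_lists` values below are made up for illustration.

```python
from collections import Counter

# Hypothetical type lists, one per object record
type_lists = [['Toys', 'Vegetables'], ['Toys'], ['Shoes'], ['Toys', 'Novelty toys']]

# Flatten the lists and count how often each type occurs
counts = Counter(t for types in type_lists for t in types)

# Types that occur exactly once
singletons = sorted(t for t, n in counts.items() if n == 1)

print(counts.most_common(1))
print(singletons)
```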


Let's save the complete list of types as a CSV file.

In [52]:
type_counts.to_csv('nma_object_type_counts.csv', index=False)
display(FileLink('nma_object_type_counts.csv'))

Browsing the CSV I noticed that there was one item with the type `Vegetables`. Let's find out some more about it.

In [53]:
# Find in the complete data set
mask = df.loc[df['additionalType'].notnull()]['additionalType'].apply(lambda x: 'Vegetables' in x)
veggie = df.loc[df['additionalType'].notnull()][mask]
veggie

Unnamed: 0,id,type,title,_meta,additionalType,collection,identifier,medium,extent,physicalDescription,...,isPartOf,seeAlso,description,hasVersion,temporal,relation,hasPart,location,acknowledgement,educationalSignificance
63775,256742,object,Wooden toy toad stalk,"{'modified': '2019-04-24', 'issued': '2018-10-...","[Toys, Vegetables]","{'id': '6773', 'type': 'Collection', 'title': ...",2013.0038.0540,"[{'type': 'Material', 'title': 'Paint - non sp...","{'type': 'Measurement', 'height': 65, 'diamete...",A painted wooden toy toad stalk with a red cap...,...,,,,,"[{'type': 'Event', 'title': '1925 - 1935', 'st...",,,,,


We can create a link into the NMA Collections Explorer using the object `id`.

In [54]:
display(HTML('<a href="http://collectionsearch.nma.gov.au/?object={}">{}</a>'.format(veggie.iloc[0]['id'], veggie.iloc[0]['title'])))

Does a toadstool count as a vegetable?

## The `extent` field

The `extent` field is a nested object rather than a list, so this time we'll use `json_normalize()` to expand it out into separate columns.

In [55]:
# Without reset_index() the rows are misaligned
df_extent = df.loc[df['extent'].notnull()].reset_index().join(json_normalize(df.loc[df['extent'].notnull()]['extent'].tolist()).add_prefix("extent_"))
df_extent.head()

Unnamed: 0,index,id,type,title,_meta,additionalType,collection,identifier,medium,extent,...,educationalSignificance,extent_type,extent_length,extent_width,extent_depth,extent_unitText,extent_height,extent_diameter,extent_weight,extent_unitTextWeight
0,1,251390,object,Pair of woven shoes made from feathers and hair,"{'modified': '2019-01-17', 'issued': '2018-04-...",[Shoes],"{'id': '5244', 'type': 'Collection', 'title': ...",2000.0014.0495,"[{'type': 'Material', 'title': 'Feather'}, {'t...","{'type': 'Measurement', 'length': 260, 'width'...",...,,Measurement,260.0,120.0,40.0,mm,,,,
1,2,124081,object,Pair of ceremonial shoes,"{'modified': '2018-12-04', 'issued': '2006-10-...",,"{'id': '1892', 'type': 'Collection', 'title': ...",1992.0089.0165,"[{'type': 'Material', 'title': 'Feather'}]","{'type': 'Measurement', 'length': 246, 'width'...",...,,Measurement,246.0,190.0,45.0,mm,,,,
2,5,20174,object,Ten Days To Live - A supposed sorcery painting.,"{'modified': '2019-04-21', 'issued': '2013-06-...",[Bark paintings],"{'id': '2202', 'type': 'Collection', 'title': ...",1985.0246.0077,"[{'type': 'Material', 'title': 'Bark'}, {'type...","{'type': 'Measurement', 'length': 574, 'width'...",...,,Measurement,574.0,185.0,,mm,,,,
3,6,144359,object,'The Dance of Life (1898-1902)' by Diana Boyer...,"{'modified': '2018-06-18', 'issued': '2012-06-...",[Booklets],"{'id': '3893', 'type': 'Collection', 'title': ...",2008.0043.0022.001,"[{'type': 'Material', 'title': 'Paper'}, {'typ...","{'type': 'Measurement', 'height': 214, 'width'...",...,,Measurement,,150.0,5.0,mm,214.0,,,
4,8,42084,object,"Child's drawing by Lester Moran, Cabbage Tree ...","{'modified': '2019-04-07', 'issued': '2016-10-...",[Drawings],"{'id': '2261', 'type': 'Collection', 'title': ...",1991.0024.0027,"[{'type': 'Material', 'title': 'Paint - non sp...","{'type': 'Measurement', 'length': 560, 'width'...",...,,Measurement,560.0,380.0,0.5,mm,,,,
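
The comment about `reset_index()` is easier to follow with a small example. `json_normalize()` returns a dataframe with a fresh index starting at 0, while the filtered rows keep their original index, so the two only line up after the index is reset. This sketch uses made-up data and assumes pandas 1.0 or above, where `json_normalize` is available as `pd.json_normalize`.

```python
import pandas as pd

# Hypothetical records: the middle one has no extent
toy = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'extent': [{'length': 260, 'unitText': 'mm'}, None, {'height': 214, 'unitText': 'mm'}],
})

has_extent = toy.loc[toy['extent'].notnull()]  # index is now 0 and 2

# Each key in the nested dictionaries becomes a column; index is 0 and 1
flat = pd.json_normalize(has_extent['extent'].tolist()).add_prefix('extent_')

# Reset the index on the filtered rows so they line up with the flattened values
joined = has_extent.reset_index(drop=True).join(flat)
print(joined[['id', 'extent_length', 'extent_height']])
```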


Let's check to see what types of things are in the `extent` field.

In [56]:
df_extent['extent_type'].value_counts()

Measurement    64077
Name: extent_type, dtype: int64

So they're all measurements. Let's have a look at the units being used.

In [57]:
df_extent['extent_unitText'].value_counts()

mm    63382
MM       10
cm        9
m         5
Name: extent_unitText, dtype: int64

In [58]:
df_extent['extent_unitTextWeight'].value_counts()

g        1473
kg        209
lb          5
oz          4
tonne       1
Name: extent_unitTextWeight, dtype: int64

Hmmm, are those measurements really in metres, or might they be meant to be 'mm'? Let's have a look at them.

In [59]:
df_extent.loc[df_extent['extent_unitText'] == 'm'][['id', 'title', 'extent_length', 'extent_width', 'extent_unitText']]

Unnamed: 0,id,title,extent_length,extent_width,extent_unitText
16781,202783,"The Percival Project, Gull Twelve, in a manill...",,230.0,m
18291,214193,Extension tube,55.0,,m
41612,123962,Gunter's chain,20.1168,,m
47232,171768,Fair Breeze,,138.0,m
56789,257184,Fishing line inside envelope,137.0,110.0,m


Other than 'Gunter's chain' it looks like the unit should indeed be 'mm'. We'll need to take that into account in calculations.

Now let's convert all the measurements into a single unit – millimetre for lengths, and gram for weights.

In [60]:
def conversion_factor(unit):
    '''
    Get the factor required to convert current unit to either mm or g.
    '''
    factors = {
        'mm': 1,
        'cm': 10,
        'm': 1, # Most should in fact be mm (see above)
        'g': 1,
        'kg': 1000,
        'tonne': 1000000,
        'oz': 28.35,
        'lb': 453.592
    }
    try:
        factor = factors[unit.lower()]
    except KeyError:
        factor = 0 
    return factor

def normalise_measurements(row):
    '''
    Convert measurements to standard units.
    '''
    l_factor = conversion_factor(str(row['extent_unitText']))
    length = row['extent_length'] * l_factor
    width = row['extent_width'] * l_factor
    depth = row['extent_depth'] * l_factor
    height = row['extent_height'] * l_factor
    diameter = row['extent_diameter'] * l_factor
    w_factor = conversion_factor(str(row['extent_unitTextWeight']))
    weight = row['extent_weight'] * w_factor
    return pd.Series([length, width, depth, height, diameter, weight])

# Add normalised measurements to the dataframe
df_extent[['length_mm', 'width_mm', 'depth_mm', 'height_mm', 'diameter_mm', 'weight_g']] = df_extent.apply(normalise_measurements, axis=1)
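
Treating every 'm' as a mislabelled 'mm' means that Gunter's chain, which does seem to be measured in metres, ends up a thousand times too short. One possible refinement is to keep a list of records that have been checked by hand. This is only a sketch: `TRUE_METRE_IDS` and `length_factor()` are hypothetical names, and the id is the Gunter's chain record shown above.

```python
# Hypothetical override: ids whose 'm' unit is believed to be genuine metres
TRUE_METRE_IDS = {'123962'}  # Gunter's chain

def length_factor(unit, record_id):
    """Return the multiplier needed to convert a length in `unit` to millimetres."""
    unit = str(unit).lower()
    if unit == 'm':
        # Only trust 'm' for records checked by hand; otherwise assume a mislabelled 'mm'
        return 1000 if str(record_id) in TRUE_METRE_IDS else 1
    # Unknown or missing units get a factor of 0, as in conversion_factor()
    return {'mm': 1, 'cm': 10}.get(unit, 0)

print(length_factor('m', '123962') * 20.1168)  # Gunter's chain, converted from metres
print(length_factor('m', '214193') * 55.0)     # Extension tube, treated as mm
```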

In [61]:
df_extent.head()

Unnamed: 0,index,id,type,title,_meta,additionalType,collection,identifier,medium,extent,...,extent_height,extent_diameter,extent_weight,extent_unitTextWeight,length_mm,width_mm,depth_mm,height_mm,diameter_mm,weight_g
0,1,251390,object,Pair of woven shoes made from feathers and hair,"{'modified': '2019-01-17', 'issued': '2018-04-...",[Shoes],"{'id': '5244', 'type': 'Collection', 'title': ...",2000.0014.0495,"[{'type': 'Material', 'title': 'Feather'}, {'t...","{'type': 'Measurement', 'length': 260, 'width'...",...,,,,,260.0,120.0,40.0,,,
1,2,124081,object,Pair of ceremonial shoes,"{'modified': '2018-12-04', 'issued': '2006-10-...",,"{'id': '1892', 'type': 'Collection', 'title': ...",1992.0089.0165,"[{'type': 'Material', 'title': 'Feather'}]","{'type': 'Measurement', 'length': 246, 'width'...",...,,,,,246.0,190.0,45.0,,,
2,5,20174,object,Ten Days To Live - A supposed sorcery painting.,"{'modified': '2019-04-21', 'issued': '2013-06-...",[Bark paintings],"{'id': '2202', 'type': 'Collection', 'title': ...",1985.0246.0077,"[{'type': 'Material', 'title': 'Bark'}, {'type...","{'type': 'Measurement', 'length': 574, 'width'...",...,,,,,574.0,185.0,,,,
3,6,144359,object,'The Dance of Life (1898-1902)' by Diana Boyer...,"{'modified': '2018-06-18', 'issued': '2012-06-...",[Booklets],"{'id': '3893', 'type': 'Collection', 'title': ...",2008.0043.0022.001,"[{'type': 'Material', 'title': 'Paper'}, {'typ...","{'type': 'Measurement', 'height': 214, 'width'...",...,214.0,,,,,150.0,5.0,214.0,,
4,8,42084,object,"Child's drawing by Lester Moran, Cabbage Tree ...","{'modified': '2019-04-07', 'issued': '2016-10-...",[Drawings],"{'id': '2261', 'type': 'Collection', 'title': ...",1991.0024.0027,"[{'type': 'Material', 'title': 'Paint - non sp...","{'type': 'Measurement', 'length': 560, 'width'...",...,,,,,560.0,380.0,0.5,,,


## How big is the collection?

In [62]:
def calculate_volume(row):
    '''
    Look for 3 linear dimensions and multiply them to get a volume.
    '''
    # Create a list of valid linear measurements from the available fields
    dimensions = [d for d in [row['length_mm'], row['width_mm'], row['depth_mm'], row['height_mm'], row['diameter_mm']] if not math.isnan(d)]
    
    # If there's only 2 dimensions...
    if len(dimensions) == 2:
        # Set a default height of 1 for items with only 2 dimensions
        dimensions.append(1)
        
    # If there's 3 or more dimensions, multiply the first 3 together
    if len(dimensions) >= 3:
        volume = dimensions[0] * dimensions[1] * dimensions[2]
    else:
        volume = 0
    return volume

df_extent['volume'] = df_extent.apply(calculate_volume, axis=1)
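
To make the rules concrete, here's the same logic applied to a few sets of measurements. The first two use the values for the woven shoes and the bark painting shown earlier; the third is made up. `volume_mm3()` is a hypothetical standalone version of the function above. Note that a diameter is treated as just another linear dimension, so round objects are counted as boxes and the totals are rough estimates.

```python
import math

def volume_mm3(length, width, depth, height, diameter):
    """Multiply the first three available dimensions; pad two dimensions with 1; otherwise 0."""
    dims = [d for d in (length, width, depth, height, diameter) if not math.isnan(d)]
    if len(dims) == 2:
        dims.append(1)
    return dims[0] * dims[1] * dims[2] if len(dims) >= 3 else 0

nan = float('nan')
shoes = volume_mm3(260.0, 120.0, 40.0, nan, nan)  # three dimensions
bark = volume_mm3(574.0, 185.0, nan, nan, nan)    # two dimensions, padded with 1
single = volume_mm3(nan, nan, nan, nan, 50.0)     # one dimension, so no volume

print(shoes, bark, single)
```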

In [63]:
print('Total length of objects is {:.2f} km'.format(df_extent['length_mm'].sum() / 1000 / 1000))

Total length of objects is 15.36 km


In [64]:
print('Total weight of objects is {:.2f} tonnes'.format(df_extent['weight_g'].sum() / 1000000))

Total weight of objects is 194.30 tonnes


In [65]:
print('Total volume of objects is {:.2f} m\N{SUPERSCRIPT THREE}'.format(df_extent['volume'].sum() / 1000000000))

Total volume of objects is 2873.14 m³
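
The divisors used in the three cells above follow from the unit definitions, since the normalised values are in millimetres and grams:

$$
\begin{aligned}
1\ \text{km} &= 10^{3}\ \text{m} = 10^{6}\ \text{mm} \\
1\ \text{tonne} &= 10^{3}\ \text{kg} = 10^{6}\ \text{g} \\
1\ \text{m}^{3} &= (10^{3}\ \text{mm})^{3} = 10^{9}\ \text{mm}^{3}
\end{aligned}
$$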


## The biggest object?

What's the biggest thing?

In [66]:
# Get the object with the largest volume
biggest = df_extent.loc[df_extent['volume'].idxmax()]

# Create a link to Collection Explorer
display(HTML('<a href="http://collectionsearch.nma.gov.au/?object={}">{}</a>'.format(biggest['id'], biggest['title'])))
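
A single maximum can be the result of a data entry problem (remember the 'm' units above), so it might be worth looking at the top few objects rather than just one. Here's a minimal sketch using `nlargest()` on made-up data; with the real data you could try `df_extent.nlargest(5, 'volume')`.

```python
import pandas as pd

# Hypothetical objects with volumes in cubic millimetres
toy = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'title': ['Coin', 'Wagon', 'Boat', 'Postcard'],
    'volume': [120.0, 9.5e9, 2.1e10, 0.0],
})

# nlargest() keeps the n rows with the highest values in the given column, largest first
top = toy.nlargest(2, 'volume')[['id', 'title', 'volume']]
print(top)
```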

----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).

Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/).