{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring the Te Papa collection API\n", "\n", "Te Papa has a [new collection API](https://data.tepapa.govt.nz/docs/index.html), so I thought I should have a poke around. This notebook is just a preliminary exploration — it's not intended as a tutorial or a guide. There may well be mistakes and misinterpretations. Nonetheless, it might help you get a feel for what's possible.\n", "\n", "In the future I'll add notebooks focused on specific tasks, but for now we're just going to follow our noses and see where we end up." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

\n", "\n", "

\n", " Some tips:\n", "

\n", "

\n", "
" ] }, { "cell_type": "code", "execution_count": 254, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RendererRegistry.enable('notebook')" ] }, "execution_count": 254, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import requests\n", "import pandas as pd\n", "import altair as alt\n", "from tqdm import tnrange\n", "import re\n", "from six import iteritems\n", "from IPython.display import display, HTML\n", "alt.renderers.enable('notebook')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get an API key\n", "\n", "[Sign up here](https://data.tepapa.govt.nz/docs/register.html) for your very own API key." ] }, { "cell_type": "code", "execution_count": 358, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Your API key is: \n" ] } ], "source": [ "# Insert your API key between the quotes\n", "api_key = ''\n", "# If you don't have an API key yet, you can leave the above blank and we'll pick up a guest token below\n", "print('Your API key is: {}'.format(api_key))" ] }, { "cell_type": "code", "execution_count": 359, "metadata": {}, "outputs": [], "source": [ "search_endpoint = 'https://data.tepapa.govt.nz/collection/search'\n", "object_endpoint = 'https://data.tepapa.govt.nz/collection/object'\n", "endpoint = 'https://data.tepapa.govt.nz/collection/{}'\n", "\n", "headers = {\n", " 'x-api-key': api_key,\n", " 'Accept': 'application/json'\n", "}\n", "\n", "if not api_key:\n", " response = requests.get('https://data.tepapa.govt.nz/collection/search')\n", " data = response.json()\n", " guest_token = data['guestToken']\n", " headers['Authorization'] = 'Bearer {}'.format(guest_token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What will we search for?\n", "\n", "Here I'm going to set a keyword that I'll use in my searches throughout this notebook. Feel free to change it to explore your own results. You can also set it to '\\*' (an asterisk) to return everything." 
] }, { "cell_type": "code", "execution_count": 360, "metadata": {}, "outputs": [], "source": [ "keyword = 'Chinese'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's make our first API request!" ] }, { "cell_type": "code", "execution_count": 361, "metadata": {}, "outputs": [], "source": [ "# Set our search parameters for passing to Requests\n", "params = {\n", " 'q': keyword\n", "}" ] }, { "cell_type": "code", "execution_count": 362, "metadata": {}, "outputs": [], "source": [ "# Send off the API request\n", "# We need to supply the `headers` to authenticate our request with our key\n", "response = requests.get(search_endpoint, headers=headers, params=params)\n", "# Get the JSON result data\n", "data = response.json()" ] }, { "cell_type": "code", "execution_count": 363, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'count': 11852, 'from': 0, 'size': 100, 'truncated': False}" ] }, "execution_count": 363, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Drill down to get the summary data from our search\n", "data['_metadata']['resultset']" ] }, { "cell_type": "code", "execution_count": 261, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Your search for \"Chinese\" using the Te Papa collection API returned 11,852 results.\n" ] } ], "source": [ "print('Your search for \"{}\" using the Te Papa collection API returned {:,} results.'.format(keyword, data['_metadata']['resultset']['count']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What sorts of things are in our search results?\n", "\n", "A really useful feature of the API is that you can ask for facets on lots of different fields. You've probably used the facets on the [Te Papa collection search page](https://collections.tepapa.govt.nz/) to narrow down your results. 
Using the API, you can go even further, using the facets to summarise your results from a variety of different angles.\n", "\n", "Let's start by getting facets from the `type` field. To request facets you have to POST your query to the [search endpoint](https://data.tepapa.govt.nz/docs/resource_SearchResource.html). Fortunately the Python Requests library makes it really easy to create and submit POST requests. All you have to do is supply the name of the field you want facets for, and the number of facets to return. If you set `size` to `5`, you'll get the 5 most frequent values. According to some ElasticSearch docs I found, you should be able to get all the facets by setting `size` to `0` (ie zero), but I couldn't get this to work.\n", "\n", "#### Create POST request data" ] }, { "cell_type": "code", "execution_count": 262, "metadata": {}, "outputs": [], "source": [ "# This is the dictionary that provides the data for the POST request\n", "# Here we're saying we want the 5 'types' with the most results\n", "# You can change the size parameter as necessary.\n", "\n", "post_data = {\n", " 'query': 'chinese',\n", " 'facets': [\n", " {\n", " 'field': 'type',\n", " 'size': 5\n", " }\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Make the API request\n", "\n", "Note that we're using the `post` method rather than `get`. If we supply the POST data using the `json` parameter, Requests takes care of all the tricky encoding issues." ] }, { "cell_type": "code", "execution_count": 263, "metadata": {}, "outputs": [], "source": [ "response = requests.post(search_endpoint, json=post_data, headers=headers)\n", "data = response.json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Convert to a dataframe\n", "\n", "Let's convert the results to a Pandas dataframe because Pandas is awesome and it'll make it easier to create charts." 
] }, { "cell_type": "code", "execution_count": 264, "metadata": {}, "outputs": [], "source": [ "types_df = pd.DataFrame(list(data['facets']['type'].items()))\n", "# Set columns names\n", "types_df.columns = ['Type', 'Count']" ] }, { "cell_type": "code", "execution_count": 265, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TypeCount
0Specimen83
1Category121
2Object11451
3Topic26
4Place134
\n", "
" ], "text/plain": [ " Type Count\n", "0 Specimen 83\n", "1 Category 121\n", "2 Object 11451\n", "3 Topic 26\n", "4 Place 134" ] }, "execution_count": 265, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View the results\n", "types_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Display the results as a bar chart" ] }, { "cell_type": "code", "execution_count": 266, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "var spec = {\"config\": {\"view\": {\"width\": 400, \"height\": 300}}, \"data\": {\"name\": \"data-ebfca4b0bdcced03d9739362d7ef3d7d\"}, \"mark\": \"bar\", \"encoding\": {\"tooltip\": [{\"type\": \"ordinal\", \"field\": \"Type\"}, {\"type\": \"quantitative\", \"field\": \"Count\"}], \"x\": {\"type\": \"quantitative\", \"field\": \"Count\"}, \"y\": {\"type\": \"ordinal\", \"field\": \"Type\"}}, \"$schema\": \"https://vega.github.io/schema/vega-lite/v2.6.0.json\", \"datasets\": {\"data-ebfca4b0bdcced03d9739362d7ef3d7d\": [{\"Type\": \"Specimen\", \"Count\": 83}, {\"Type\": \"Category\", \"Count\": 121}, {\"Type\": \"Object\", \"Count\": 11451}, {\"Type\": \"Topic\", \"Count\": 26}, {\"Type\": \"Place\", \"Count\": 134}]}};\n", "var opt = {};\n", "var type = \"vega-lite\";\n", "var id = \"6dd1b2b7-7299-4b53-a9db-cb0a153e0565\";\n", "\n", "var output_area = this;\n", "\n", "require([\"nbextensions/jupyter-vega/index\"], function(vega) {\n", " var target = document.createElement(\"div\");\n", " target.id = id;\n", " target.className = \"vega-embed\";\n", "\n", " var style = document.createElement(\"style\");\n", " style.textContent = [\n", " \".vega-embed .error p {\",\n", " \" color: firebrick;\",\n", " \" font-size: 14px;\",\n", " \"}\",\n", " ].join(\"\\\\n\");\n", "\n", " // element is a jQuery wrapped DOM element inside the output area\n", " // see http://ipython.readthedocs.io/en/stable/api/generated/\\\n", " // IPython.display.html#IPython.display.Javascript.__init__\n", " 
element[0].appendChild(target);\n", " element[0].appendChild(style);\n", "\n", " vega.render(\"#\" + id, spec, type, opt, output_area);\n", "}, function (err) {\n", " if (err.requireType !== \"scripterror\") {\n", " throw(err);\n", " }\n", "});\n" ], "text/plain": [ "" ] }, "metadata": { "jupyter-vega": "#6dd1b2b7-7299-4b53-a9db-cb0a153e0565" }, "output_type": "display_data" }, { "data": { "image/png": "" }, "metadata": { "jupyter-vega": "#6dd1b2b7-7299-4b53-a9db-cb0a153e0565" }, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 266, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(types_df).mark_bar().encode(\n", " y='Type:O',\n", " x='Count',\n", " tooltip=[alt.Tooltip('Type:O'), alt.Tooltip('Count')]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What collections are the objects in?\n", "\n", "One of the great things about the Te Papa API is the richness of the data and all the interrelations between things, people, places, and subjects. But this also makes it a bit of a challenge to understand how everything fits together. On the GitHub site there's a useful summary of the record structures used to represent the different types of things. Here, for example, is what an [object record](https://github.com/te-papa/collections-api/wiki/Collections-API-Object-Model#object) looks like. Using this as a guide we can start to dig down through the data.\n", "\n", "Let's get an overview of the `objects` in our search results by using the `collection` facet.\n", "\n", "#### Create the POST request data\n", "\n", "This time we're using the `filters` parameter to limit our search to things that have the `type` of 'Object'. We're then getting facets on the `collection` field.\n", "\n", "Instead of using `filters` we could include something like `type: Object` in the query string. I think this changes the way the result set is constructed, but I don't know if it affects the results returned." 
] }, { "cell_type": "code", "execution_count": 267, "metadata": {}, "outputs": [], "source": [ "post_data = {\n", " 'query': 'chinese',\n", " 'filters': [{\n", " 'field': 'type',\n", " 'keyword': 'Object'\n", " }],\n", " 'facets': [\n", " {'field': 'collection',\n", " 'size': 20}\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Make the API request and convert the results to a dataframe" ] }, { "cell_type": "code", "execution_count": 268, "metadata": {}, "outputs": [], "source": [ "# Get the API response\n", "response = requests.post(search_endpoint, json=post_data, headers=headers)\n", "data = response.json()" ] }, { "cell_type": "code", "execution_count": 269, "metadata": {}, "outputs": [], "source": [ "# Convert to a dataframe\n", "objects_df = pd.DataFrame(list(data['facets']['collection'].items()))\n", "objects_df.columns = ['Collection', 'Count']" ] }, { "cell_type": "code", "execution_count": 270, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CollectionCount
0Photography8920
1MuseumArchives7
2Art86
3CollectedArchives109
4Philatelic73
5TaongaMāori4
6RareBooks8
7PacificCultures29
8History2215
\n", "
" ], "text/plain": [ " Collection Count\n", "0 Photography 8920\n", "1 MuseumArchives 7\n", "2 Art 86\n", "3 CollectedArchives 109\n", "4 Philatelic 73\n", "5 TaongaMāori 4\n", "6 RareBooks 8\n", "7 PacificCultures 29\n", "8 History 2215" ] }, "execution_count": 270, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View the results\n", "objects_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Display the results as a bar chart" ] }, { "cell_type": "code", "execution_count": 271, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "var spec = {\"config\": {\"view\": {\"width\": 400, \"height\": 300}}, \"data\": {\"name\": \"data-0227cb06d4ff92d6e2e2c5989a42481f\"}, \"mark\": \"bar\", \"encoding\": {\"tooltip\": [{\"type\": \"ordinal\", \"field\": \"Collection\"}, {\"type\": \"quantitative\", \"field\": \"Count\"}], \"x\": {\"type\": \"quantitative\", \"field\": \"Count\"}, \"y\": {\"type\": \"ordinal\", \"field\": \"Collection\"}}, \"$schema\": \"https://vega.github.io/schema/vega-lite/v2.6.0.json\", \"datasets\": {\"data-0227cb06d4ff92d6e2e2c5989a42481f\": [{\"Collection\": \"Photography\", \"Count\": 8920}, {\"Collection\": \"MuseumArchives\", \"Count\": 7}, {\"Collection\": \"Art\", \"Count\": 86}, {\"Collection\": \"CollectedArchives\", \"Count\": 109}, {\"Collection\": \"Philatelic\", \"Count\": 73}, {\"Collection\": \"TaongaM\\u0101ori\", \"Count\": 4}, {\"Collection\": \"RareBooks\", \"Count\": 8}, {\"Collection\": \"PacificCultures\", \"Count\": 29}, {\"Collection\": \"History\", \"Count\": 2215}]}};\n", "var opt = {};\n", "var type = \"vega-lite\";\n", "var id = \"ecb09175-ceba-4d8c-86a5-0f4ada1dad2f\";\n", "\n", "var output_area = this;\n", "\n", "require([\"nbextensions/jupyter-vega/index\"], function(vega) {\n", " var target = document.createElement(\"div\");\n", " target.id = id;\n", " target.className = \"vega-embed\";\n", "\n", " var style = document.createElement(\"style\");\n", " 
 style.textContent = [\n", " \".vega-embed .error p {\",\n", " \" color: firebrick;\",\n", " \" font-size: 14px;\",\n", " \"}\",\n", " ].join(\"\\\\n\");\n", "\n", " // element is a jQuery wrapped DOM element inside the output area\n", " // see http://ipython.readthedocs.io/en/stable/api/generated/\\\n", " // IPython.display.html#IPython.display.Javascript.__init__\n", " element[0].appendChild(target);\n", " element[0].appendChild(style);\n", "\n", " vega.render(\"#\" + id, spec, type, opt, output_area);\n", "}, function (err) {\n", " if (err.requireType !== \"scripterror\") {\n", " throw(err);\n", " }\n", "});\n" ], "text/plain": [ "" ] }, "metadata": { "jupyter-vega": "#ecb09175-ceba-4d8c-86a5-0f4ada1dad2f" }, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 271, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "" }, "metadata": { "jupyter-vega": "#ecb09175-ceba-4d8c-86a5-0f4ada1dad2f" }, "output_type": "display_data" } ], "source": [ "alt.Chart(objects_df).mark_bar().encode(\n", " y='Collection:O',\n", " x='Count',\n", " tooltip=[alt.Tooltip('Collection:O'), alt.Tooltip('Count')]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So there's **lots** of photos. Let's see what we can find out about them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## When were the photos taken?\n", "\n", "As well as examining categories, we can use facets to display the date range of the results. But what date, and where is it? If you look at the structure of an [object record](https://github.com/te-papa/collections-api/wiki/Collections-API-Object-Model#object), you'll see that the `production` field is actually a list of production 'events' which have a `createdDate` field. 
In order to get a list of facets for `createdDate` we have to use dot notation to move down through the record hierarchy, so the field for faceting is `production.createdDate`.\n", "\n", "#### The POST data" ] }, { "cell_type": "code", "execution_count": 309, "metadata": {}, "outputs": [], "source": [ "post_data = {\n", " 'query': 'chinese',\n", " 'filters': [{\n", " 'field': 'collection',\n", " 'keyword': 'Photography'\n", " }],\n", " 'facets': [\n", " {'field': 'production.createdDate',\n", " 'size': 100}\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The API request and response" ] }, { "cell_type": "code", "execution_count": 310, "metadata": {}, "outputs": [], "source": [ "# Get the API response\n", "response = requests.post(search_endpoint, json=post_data, headers=headers)\n", "data = response.json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When I tried to access the data from the `production.createdDate` facet I got an error. If we look at the fields returned in the facets we see why." ] }, { "cell_type": "code", "execution_count": 311, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "dict_keys(['production.createdDate.verbatim', 'production.createdDate.temporal'])" ] }, "execution_count": 311, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's see what the facet data looks like\n", "data['facets'].keys()\n", "# Note that the createdDate facet returns two sets of facets -- a 'verbatim' date, which is ISO-formatted, and a timestamp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are actually two versions of the date facets: `production.createdDate.verbatim` provides ISO formatted dates, while `production.createdDate.temporal` provides timestamps. We'll use the `verbatim` field." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Convert the results to a dataframe and do some cleaning" ] }, { "cell_type": "code", "execution_count": 312, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateCount
01948-01-012
11958-01-0129
2197625
31955-01-011
4197522
\n", "
" ], "text/plain": [ " Date Count\n", "0 1948-01-01 2\n", "1 1958-01-01 29\n", "2 1976 25\n", "3 1955-01-01 1\n", "4 1975 22" ] }, "execution_count": 312, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's use the 'verbatim' dates\n", "photos_df = pd.DataFrame(list(data['facets']['production.createdDate.verbatim'].items()))\n", "photos_df.columns = ['Date', 'Count']\n", "photos_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that the `Date` field is now a mix of days and years. Let's create a new `Year` column and use it to group together the totals." ] }, { "cell_type": "code", "execution_count": 315, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCount
018701
118904
218951
318983
41900140
\n", "
" ], "text/plain": [ " Year Count\n", "0 1870 1\n", "1 1890 4\n", "2 1895 1\n", "3 1898 3\n", "4 1900 140" ] }, "execution_count": 315, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a Year column by slicing the first four chars from the Date\n", "photos_df['Year'] = photos_df['Date'].str.slice(0, 4)\n", "# Group by Year, summing together the counts\n", "years = photos_df.groupby([photos_df['Year']], as_index=False).sum()\n", "years.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Make a chart" ] }, { "cell_type": "code", "execution_count": 316, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "var spec = {\"config\": {\"view\": {\"width\": 400, \"height\": 300}}, \"data\": {\"name\": \"data-8811f084e95408806ca5b03f37f6cc03\"}, \"mark\": \"bar\", \"encoding\": {\"tooltip\": [{\"type\": \"temporal\", \"field\": \"Year\", \"format\": \"%Y\"}, {\"type\": \"quantitative\", \"field\": \"Count\"}], \"x\": {\"type\": \"temporal\", \"field\": \"Year\"}, \"y\": {\"type\": \"quantitative\", \"field\": \"Count\"}}, \"selection\": {\"selector064\": {\"type\": \"interval\", \"bind\": \"scales\", \"encodings\": [\"x\", \"y\"]}}, \"$schema\": \"https://vega.github.io/schema/vega-lite/v2.6.0.json\", \"datasets\": {\"data-8811f084e95408806ca5b03f37f6cc03\": [{\"Year\": \"1870\", \"Count\": 1}, {\"Year\": \"1890\", \"Count\": 4}, {\"Year\": \"1895\", \"Count\": 1}, {\"Year\": \"1898\", \"Count\": 3}, {\"Year\": \"1900\", \"Count\": 140}, {\"Year\": \"1901\", \"Count\": 25}, {\"Year\": \"1902\", \"Count\": 7}, {\"Year\": \"1904\", \"Count\": 2}, {\"Year\": \"1905\", \"Count\": 3}, {\"Year\": \"1906\", \"Count\": 1}, {\"Year\": \"1908\", \"Count\": 4}, {\"Year\": \"1909\", \"Count\": 1}, {\"Year\": \"1910\", \"Count\": 44}, {\"Year\": \"1913\", \"Count\": 10}, {\"Year\": \"1914\", \"Count\": 2}, {\"Year\": \"1915\", \"Count\": 1}, {\"Year\": \"1916\", \"Count\": 1}, {\"Year\": \"1918\", \"Count\": 7}, {\"Year\": 
\"1919\", \"Count\": 2}, {\"Year\": \"1920\", \"Count\": 1}, {\"Year\": \"1923\", \"Count\": 1}, {\"Year\": \"1924\", \"Count\": 1}, {\"Year\": \"1925\", \"Count\": 1}, {\"Year\": \"1927\", \"Count\": 1}, {\"Year\": \"1928\", \"Count\": 12}, {\"Year\": \"1930\", \"Count\": 91}, {\"Year\": \"1932\", \"Count\": 5}, {\"Year\": \"1935\", \"Count\": 2}, {\"Year\": \"1936\", \"Count\": 3}, {\"Year\": \"1937\", \"Count\": 1}, {\"Year\": \"1938\", \"Count\": 8}, {\"Year\": \"1939\", \"Count\": 1}, {\"Year\": \"1940\", \"Count\": 1}, {\"Year\": \"1945\", \"Count\": 2}, {\"Year\": \"1946\", \"Count\": 19}, {\"Year\": \"1948\", \"Count\": 11}, {\"Year\": \"1949\", \"Count\": 8}, {\"Year\": \"1950\", \"Count\": 9}, {\"Year\": \"1951\", \"Count\": 36}, {\"Year\": \"1952\", \"Count\": 1}, {\"Year\": \"1953\", \"Count\": 68}, {\"Year\": \"1954\", \"Count\": 56}, {\"Year\": \"1955\", \"Count\": 1}, {\"Year\": \"1956\", \"Count\": 14}, {\"Year\": \"1957\", \"Count\": 2463}, {\"Year\": \"1958\", \"Count\": 33}, {\"Year\": \"1959\", \"Count\": 595}, {\"Year\": \"1960\", \"Count\": 576}, {\"Year\": \"1961\", \"Count\": 15}, {\"Year\": \"1962\", \"Count\": 10}, {\"Year\": \"1963\", \"Count\": 117}, {\"Year\": \"1964\", \"Count\": 31}, {\"Year\": \"1965\", \"Count\": 9}, {\"Year\": \"1966\", \"Count\": 7}, {\"Year\": \"1969\", \"Count\": 36}, {\"Year\": \"1970\", \"Count\": 3432}, {\"Year\": \"1971\", \"Count\": 3}, {\"Year\": \"1975\", \"Count\": 22}, {\"Year\": \"1976\", \"Count\": 94}, {\"Year\": \"1984\", \"Count\": 2}, {\"Year\": \"1985\", \"Count\": 2}, {\"Year\": \"1986\", \"Count\": 71}, {\"Year\": \"1989\", \"Count\": 3}, {\"Year\": \"1995\", \"Count\": 3}]}};\n", "var opt = {};\n", "var type = \"vega-lite\";\n", "var id = \"70d2d68d-314b-4528-b4bf-5624c9b83b6b\";\n", "\n", "var output_area = this;\n", "\n", "require([\"nbextensions/jupyter-vega/index\"], function(vega) {\n", " var target = document.createElement(\"div\");\n", " target.id = id;\n", " target.className = 
 \"vega-embed\";\n", "\n", " var style = document.createElement(\"style\");\n", " style.textContent = [\n", " \".vega-embed .error p {\",\n", " \" color: firebrick;\",\n", " \" font-size: 14px;\",\n", " \"}\",\n", " ].join(\"\\\\n\");\n", "\n", " // element is a jQuery wrapped DOM element inside the output area\n", " // see http://ipython.readthedocs.io/en/stable/api/generated/\\\n", " // IPython.display.html#IPython.display.Javascript.__init__\n", " element[0].appendChild(target);\n", " element[0].appendChild(style);\n", "\n", " vega.render(\"#\" + id, spec, type, opt, output_area);\n", "}, function (err) {\n", " if (err.requireType !== \"scripterror\") {\n", " throw(err);\n", " }\n", "});\n" ], "text/plain": [ "" ] }, "metadata": { "jupyter-vega": "#70d2d68d-314b-4528-b4bf-5624c9b83b6b" }, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 316, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "" }, "metadata": { "jupyter-vega": "#70d2d68d-314b-4528-b4bf-5624c9b83b6b" }, "output_type": "display_data" } ], "source": [ "c1 = alt.Chart(years).mark_bar().encode(\n", " x='Year:T',\n", " y='Count',\n", " tooltip=[alt.Tooltip('Year:T', format='%Y'), 'Count']\n", ").interactive()\n", "c1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmmm, the values for 1957 and 1970 are a bit extraordinary. I wonder what's going on...?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What happened in 1970?\n", "\n", "I'd hoped to try and find out what happened in 1970 by limiting the results to those with a `createdDate` of '1970-01-01'. I first tried this query using `filters` to set the value for `production.createdDate`. However, the request returned an error that said the field wasn't facetable. I then tried adding `production.createdDate:\"1970-01-01\"` to the query string but then I got no results at all. 
Eventually I found this [in the docs](https://github.com/te-papa/collections-api/wiki/Getting-started):\n", "\n", "> Field search is not possible against nested fields, for example collection:Art is possible, but not production:mccahon or production.contributor.title:mccahon (however all nested text is searchable in general searches)\n", "\n", "So I think I'll need to harvest all the photographs data and then explore offline. That'll have to wait..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What are the photos about?\n", "\n", "Let's try another approach. The `depicts` field provides a list of subjects (which I think could themselves be people, places, or categories). By asking for facets from the `depicts` field, we could get a picture of what the photos are about.\n", "\n", "At first I tried getting facets for `depicts.title`, but this didn't work as `title` is a text field. After a bit of trial and error, I realised that asking for facets on `depicts.href` produced useful results. The `href` field is the API link to the full record for the category, so not only does it give us facets, it provides a link to get more information." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The POST data" ] }, { "cell_type": "code", "execution_count": 278, "metadata": {}, "outputs": [], "source": [ "post_data = {\n", " 'query': 'chinese',\n", " 'filters': [{\n", " 'field': 'collection',\n", " 'keyword': 'Photography'\n", " }],\n", " 'facets': [\n", " {'field': 'depicts.href',\n", " 'size': 10}\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The API response" ] }, { "cell_type": "code", "execution_count": 279, "metadata": {}, "outputs": [], "source": [ "# Get the API response\n", "response = requests.post(search_endpoint, json=post_data, headers=headers)\n", "data = response.json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Convert the results to a dataframe" ] }, { "cell_type": "code", "execution_count": 280, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CategoryCount
0https://data.tepapa.govt.nz/collection/categor...167
1https://data.tepapa.govt.nz/collection/categor...176
2https://data.tepapa.govt.nz/collection/categor...345
3https://data.tepapa.govt.nz/collection/categor...161
4https://data.tepapa.govt.nz/collection/categor...241
\n", "
" ], "text/plain": [ " Category Count\n", "0 https://data.tepapa.govt.nz/collection/categor... 167\n", "1 https://data.tepapa.govt.nz/collection/categor... 176\n", "2 https://data.tepapa.govt.nz/collection/categor... 345\n", "3 https://data.tepapa.govt.nz/collection/categor... 161\n", "4 https://data.tepapa.govt.nz/collection/categor... 241" ] }, "execution_count": 280, "metadata": {}, "output_type": "execute_result" } ], "source": [ "depicts_df = pd.DataFrame(list(data['facets']['depicts.href'].items()))\n", "depicts_df.columns = ['Category', 'Count']\n", "depicts_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Add category titles to the dataframe\n", "\n", "So the `href` field by itself isn't very illuminating. But by looking it up using a simple GET request we get lots more data including the category title.\n", "\n", "I thought this could be useful later on so I created a simple function." ] }, { "cell_type": "code", "execution_count": 281, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0a83c769be994e6482a4993e92f35863", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=10), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CategoryCountTitle
0https://data.tepapa.govt.nz/collection/categor...167performing artists
1https://data.tepapa.govt.nz/collection/categor...176People
2https://data.tepapa.govt.nz/collection/categor...345Chinese
3https://data.tepapa.govt.nz/collection/categor...161men
4https://data.tepapa.govt.nz/collection/categor...241Motion picture industry
5https://data.tepapa.govt.nz/collection/categor...474Japanese
6https://data.tepapa.govt.nz/collection/categor...157actors
7https://data.tepapa.govt.nz/collection/categor...241Motion pictures
8https://data.tepapa.govt.nz/collection/categor...245Costumes
9https://data.tepapa.govt.nz/collection/categor...131Men
\n", "
" ], "text/plain": [ " Category Count \\\n", "0 https://data.tepapa.govt.nz/collection/categor... 167 \n", "1 https://data.tepapa.govt.nz/collection/categor... 176 \n", "2 https://data.tepapa.govt.nz/collection/categor... 345 \n", "3 https://data.tepapa.govt.nz/collection/categor... 161 \n", "4 https://data.tepapa.govt.nz/collection/categor... 241 \n", "5 https://data.tepapa.govt.nz/collection/categor... 474 \n", "6 https://data.tepapa.govt.nz/collection/categor... 157 \n", "7 https://data.tepapa.govt.nz/collection/categor... 241 \n", "8 https://data.tepapa.govt.nz/collection/categor... 245 \n", "9 https://data.tepapa.govt.nz/collection/categor... 131 \n", "\n", " Title \n", "0 performing artists \n", "1 People \n", "2 Chinese \n", "3 men \n", "4 Motion picture industry \n", "5 Japanese \n", "6 actors \n", "7 Motion pictures \n", "8 Costumes \n", "9 Men " ] }, "execution_count": 281, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def get_categories(df):\n", " '''\n", " Lookup category urls and get back the title to add to dataframe.\n", " '''\n", " for i in tnrange(len(df)):\n", " href = df.loc[i]['Category']\n", " response = requests.get(href, headers=headers)\n", " title = response.json()['title']\n", " df.at[i, 'Title'] = title\n", " return df\n", "\n", "depicts_df = get_categories(depicts_df)\n", "depicts_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Make a chart\n", "\n", "Now we have the titles, let's make a bar chart." 
] }, { "cell_type": "code", "execution_count": 282, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "var spec = {\"config\": {\"view\": {\"width\": 400, \"height\": 300}}, \"data\": {\"name\": \"data-d4c31248a4e03d7b41a1b2f1981a52da\"}, \"mark\": \"bar\", \"encoding\": {\"tooltip\": [{\"type\": \"quantitative\", \"field\": \"Count\"}], \"x\": {\"type\": \"quantitative\", \"field\": \"Count\"}, \"y\": {\"type\": \"ordinal\", \"field\": \"Title\"}}, \"$schema\": \"https://vega.github.io/schema/vega-lite/v2.6.0.json\", \"datasets\": {\"data-d4c31248a4e03d7b41a1b2f1981a52da\": [{\"Category\": \"https://data.tepapa.govt.nz/collection/category/327180\", \"Count\": 167, \"Title\": \"performing artists\"}, {\"Category\": \"https://data.tepapa.govt.nz/collection/category/427670\", \"Count\": 176, \"Title\": \"People\"}, {\"Category\": \"https://data.tepapa.govt.nz/collection/category/400\", \"Count\": 345, \"Title\": \"Chinese\"}, {\"Category\": \"https://data.tepapa.govt.nz/collection/category/319914\", \"Count\": 161, \"Title\": \"men\"}, {\"Category\": \"https://data.tepapa.govt.nz/collection/category/426854\", \"Count\": 241, \"Title\": \"Motion picture industry\"}, {\"Category\": \"https://data.tepapa.govt.nz/collection/category/417\", \"Count\": 474, \"Title\": \"Japanese\"}, {\"Category\": \"https://data.tepapa.govt.nz/collection/category/327024\", \"Count\": 157, \"Title\": \"actors\"}, {\"Category\": \"https://data.tepapa.govt.nz/collection/category/426864\", \"Count\": 241, \"Title\": \"Motion pictures\"}, {\"Category\": \"https://data.tepapa.govt.nz/collection/category/422662\", \"Count\": 245, \"Title\": \"Costumes\"}, {\"Category\": \"https://data.tepapa.govt.nz/collection/category/426494\", \"Count\": 131, \"Title\": \"Men\"}]}};\n", "var opt = {};\n", "var type = \"vega-lite\";\n", "var id = \"e045b672-c27d-4314-a9ca-56cf44f0f53c\";\n", "\n", "var output_area = this;\n", "\n", "require([\"nbextensions/jupyter-vega/index\"], function(vega) 
{\n", " var target = document.createElement(\"div\");\n", " target.id = id;\n", " target.className = \"vega-embed\";\n", "\n", " var style = document.createElement(\"style\");\n", " style.textContent = [\n", " \".vega-embed .error p {\",\n", " \" color: firebrick;\",\n", " \" font-size: 14px;\",\n", " \"}\",\n", " ].join(\"\\\\n\");\n", "\n", " // element is a jQuery wrapped DOM element inside the output area\n", " // see http://ipython.readthedocs.io/en/stable/api/generated/\\\n", " // IPython.display.html#IPython.display.Javascript.__init__\n", " element[0].appendChild(target);\n", " element[0].appendChild(style);\n", "\n", " vega.render(\"#\" + id, spec, type, opt, output_area);\n", "}, function (err) {\n", " if (err.requireType !== \"scripterror\") {\n", " throw(err);\n", " }\n", "});\n" ], "text/plain": [ "" ] }, "metadata": { "jupyter-vega": "#e045b672-c27d-4314-a9ca-56cf44f0f53c" }, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 282, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "" }, "metadata": { "jupyter-vega": "#e045b672-c27d-4314-a9ca-56cf44f0f53c" }, "output_type": "display_data" } ], "source": [ "alt.Chart(depicts_df).mark_bar().encode(\n", " y='Title:O',\n", " x='Count',\n", " tooltip=['Count']\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmmm... Of course we should remember that these are only the top 10 facets — we might want to expand the results. But already we can see a few oddities. For example, there's separate entries for 'Men' and 'men'!\n", "\n", "Perhaps more interestingly, the most cited category in our search for 'Chinese' amongst photos in the Te Papa collection is 'Japanese'. That's weird..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why do so many photographs in my search for 'Chinese' have the category 'Japanese'?\n", "\n", "Let's see if we can find out what's going on. 
First of all, let's try to limit our results to those that cite the 'Japanese' category. Filtering on the category `href` value seems to work." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get the 'Japanese' category href value" ] }, { "cell_type": "code", "execution_count": 283, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://data.tepapa.govt.nz/collection/category/417'" ] }, "execution_count": 283, "metadata": {}, "output_type": "execute_result" } ], "source": [ "href = depicts_df.loc[depicts_df['Title'] == 'Japanese']['Category'].values[0]\n", "href" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The POST data" ] }, { "cell_type": "code", "execution_count": 284, "metadata": {}, "outputs": [], "source": [ "post_data = {\n", " 'query': 'chinese',\n", " 'filters': [\n", " {\n", " 'field': 'collection',\n", " 'keyword': 'Photography'\n", " },\n", " {\n", " 'field': 'depicts.href',\n", " 'keyword': href\n", " }\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The API response" ] }, { "cell_type": "code", "execution_count": 285, "metadata": {}, "outputs": [], "source": [ "# Get the API response\n", "response = requests.post(search_endpoint, json=post_data, headers=headers)\n", "data = response.json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### View the titles\n", "\n", "Let's just loop through the results and list the titles." ] }, { "cell_type": "code", "execution_count": 286, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Japan series: boats\n", "Japan series: women mourning\n", "Japan series: Hokkaido, Furubira winter fishing port\n", "Japan Series: Honda Factory\n", "Girls cheer teenage idol Akira Mitsu, Noda, Japan. 
Taken for a series on Japan for ‘Life’\n", "Japan Series: Daiei Movie\n", "A typical Japanese farmer in Rain Coat and Hat\n", "Japanese Rice Planters at Dinner - Eating Rice, Japan\n", "On the Way to the Bridegroom's House\n", "One of Japan's Largest Modern Silk Weaving Plants - American Machinery and American Methods, Kirju, Japan\n" ] } ], "source": [ "for result in data['results']:\n", " print(result['title'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmmm... Ok, so we can see why they have the 'Japanese' category attached, but why do they come up in a search for 'Chinese'?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Finding the references to our keyword\n", "\n", "From the raw results it's pretty hard to see why these photos are appearing in our search. The function below loops through all the nested records in the data looking for occurrences of our keyword." ] }, { "cell_type": "code", "execution_count": 287, "metadata": {}, "outputs": [], "source": [ "def find_fields(doc, keyword):\n", " '''\n", " Find fields that contain the given keyword.\n", " Return the name of the field and the parent object.\n", " '''\n", " if isinstance(doc, list):\n", " for d in doc:\n", " for result in find_fields(d, keyword):\n", " yield result\n", " if isinstance(doc, dict):\n", " for k, v in iteritems(doc):\n", " if isinstance(v, str) and keyword in v:\n", " yield [doc, k]\n", " elif isinstance(v, dict):\n", " for result in find_fields(v, keyword):\n", " yield result\n", " elif isinstance(v, list):\n", " for d in v:\n", " for result in find_fields(d, keyword):\n", " yield result\n", "\n", "fields = list(find_fields(data['results'], keyword))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's list the results, displaying the type of record the keyword appears in (Object, Place, Category etc), the title of the record, and the context in which the keyword appears." 
] }, { "cell_type": "code", "execution_count": 288, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. 
The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n", "\n", "Place -- Japan\n", "ry BCE; adopted handwriting and much of \u001b[43mChinese\u001b[0m culture in the 6th-9th centuries. The e\n" ] } ], "source": [ "for field in fields:\n", " print('\\n{} -- {}'.format(field[0]['type'], field[0]['title']))\n", " context = re.search('(.{{0,40}}{}.{{0,40}})'.format(keyword), field[0][field[1]]).group(1)\n", " print(context.replace(keyword, '\\33[43m{}\\033[0m'.format(keyword)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So what's going on? From the list above you can see that the reference to 'Chinese' comes from a linked record for the `Place` 'Japan'. The default `query` search looks in all nested text fields, including the `scopeNote` of linked `Place` records, which is where the text above comes from.\n", "\n", "On the one hand it's great that the default search looks in all the nested records. But on the other hand it's a bit annoying, because if we want to do anything with the data we'll have to weed out the irrelevant photos. It's a familiar trade-off between discoverability and accuracy. 
In a web interface it's good to include as much as possible and then relevance rank it in a sensible way. This gives users their best chance of finding what they're after. But it's not so good if you're using an API to assemble a dataset for further analysis. In that case you want to be able to set fairly firm boundaries around your results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Can we exclude results citing the 'Japanese' category?\n", "\n", "Is there a way of excluding categories from our results? Well, sort of... If we try to filter by the `depicts.href` field we run into the same problem with nested field searching as we did with the dates. But unlike a date string, the category `href` value is pretty specific, so we could probably just throw it in the `query` string.\n", "\n", "#### Get the 'Japanese' category href value" ] }, { "cell_type": "code", "execution_count": 289, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://data.tepapa.govt.nz/collection/category/417'" ] }, "execution_count": 289, "metadata": {}, "output_type": "execute_result" } ], "source": [ "href = depicts_df.loc[depicts_df['Title'] == 'Japanese']['Category'].values[0]\n", "href" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The POST data" ] }, { "cell_type": "code", "execution_count": 290, "metadata": {}, "outputs": [], "source": [ "post_data = {\n", " 'query': 'chinese -\"{}\"'.format(href),\n", " 'filters': [\n", " {\n", " 'field': 'collection',\n", " 'keyword': 'Photography'\n", " }\n", " ],\n", " 'facets': [\n", " {'field': 'depicts.href',\n", " 'size': 10}\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The API response" ] }, { "cell_type": "code", "execution_count": 291, "metadata": {}, "outputs": [], "source": [ "# Get the API response\n", "response = requests.post(search_endpoint, json=post_data, headers=headers)\n", "data = response.json()" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "#### How many results do we have now?" ] }, { "cell_type": "code", "execution_count": 292, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'count': 8447, 'from': 0, 'size': 10, 'truncated': False}" ] }, "execution_count": 292, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['_metadata']['resultset']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we look above, we can see we started with 8,920 photos. The 'Japanese' category had 474 results. So I expected we'd have:\n", "\n", " 8,920 - 474 = 8,446 results\n", "\n", "One off..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's next?\n", "\n", "I've had a go at [making maps](Mapping-Te-Papa-collections.ipynb) from some other facets. I think next I want to try harvesting out significant amounts of data. Stay tuned..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }