{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Get an random newspaper article from Trove\n", "\n", "Changes to the Trove API mean that the techniques I've previously used to select resources at random [will no longer work](https://updates.timsherratt.org/2019/10/09/creators-and-users.html). This notebook provides one alternative.\n", "\n", "I wanted something that would work efficiently, but would also expose as much of the content as possible. Applying multiple facets together with a randomly-generated query seems to do a good job of getting the result set below 100 (the maximum available from a single API call). This should mean that *most* of the newspaper articles are reachable, but it's a bit hard to quantify.\n", "\n", "Thanks to Mitchell Harrop for [suggesting I could use randomly selected stopwords](https://twitter.com/mharropesquire/status/1182175315860213760) as queries. I've supplemented the stopwords with letters and digits, and together they seem to do a good job of applying an initial filter and mixing up the relevance ranking.\n", "\n", "As you can see from the examples below, you can supply any of the facets available in the newspapers zone – for example: `state`, `title`, `year`, `illType`, `category`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "import json\n", "import os\n", "import random\n", "\n", "import requests\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "\n", "s = requests.Session()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n", "\n", "with open(\"stopwords.json\", \"r\") as json_file:\n", " STOPWORDS = json.load(json_file)\n", "\n", "API_URL = \"http://api.trove.nla.gov.au/v2/result\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%capture\n", "# Load variables from the .env file if it exists\n", "# Use %%capture to suppress messages\n", "%load_ext dotenv\n", "%dotenv" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Insert your Trove API key\n", "API_KEY = \"YOUR API KEY\"\n", "\n", "# Use api key value from environment variables if it is available\n", "if os.getenv(\"TROVE_API_KEY\"):\n", " API_KEY = os.getenv(\"TROVE_API_KEY\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [], "source": [ "def get_random_facet_value(params, facet):\n", " \"\"\"\n", " Get values for the supplied facet and choose one at random.\n", " \"\"\"\n", " these_params = params.copy()\n", " these_params[\"facet\"] = facet\n", " response = s.get(API_URL, params=these_params)\n", " data = response.json()\n", " try:\n", " values = [\n", " t[\"search\"] for t in data[\"response\"][\"zone\"][0][\"facets\"][\"facet\"][\"term\"]\n", " ]\n", " except TypeError:\n", " return None\n", " return random.choice(values)\n", "\n", "\n", "def get_total_results(params):\n", " response = s.get(API_URL, params=params)\n", " data = response.json()\n", " total = int(data[\"response\"][\"zone\"][0][\"records\"][\"total\"])\n", " return total\n", "\n", "\n", "def get_random_article(query=None, **kwargs):\n", " \"\"\"\n", " Get a random article.\n", " The kwargs can be any of the available facets, such as 'state', 'title', 'illtype', 'year'.\n", " \"\"\"\n", " total = 0\n", " applied_facets = []\n", " facets = [\"month\", \"year\", \"decade\", \"word\", \"illustrated\", \"category\", \"title\"]\n", " tries = 0\n", " params = {\n", " \"zone\": \"newspaper\",\n", " \"encoding\": \"json\",\n", " # Note that keeping n at 0 until we've filtered the result set speeds things up considerably\n", " \"n\": \"0\",\n", " # Uncomment these if you need more than the basic data\n", " # \"reclevel\": \"full\",\n", " # 'include': 'articleText',\n", " \"key\": API_KEY,\n", " }\n", " if query:\n", " params[\"q\"] = query\n", " # If there's no query supplied then use a random stopword to mix up the results\n", " else:\n", " random_word = random.choice(STOPWORDS)\n", " params[\"q\"] = f'\"{random_word}\"'\n", " # Apply any supplied factes\n", " for key, value in kwargs.items():\n", " params[f\"l-{key}\"] = value\n", " applied_facets.append(key)\n", " # Remove any facets that have already been applied from the list of available facets\n", " facets[:] = [f for f in facets if f not in applied_facets]\n", " total = get_total_results(params)\n", " # If our randomly selected stopword has produced no results\n", " # keep trying with new queries until we get some (give up after 10 tries)\n", " while total == 0 and tries <= 10:\n", " if not query:\n", " random_word = random.choice(STOPWORDS)\n", " params[\"q\"] = f'\"{random_word}\"'\n", " tries += 1\n", " # Apply facets one at a time until we have less than 100 results, or we run out of facets\n", " while total > 100 and len(facets) > 0:\n", " # Get the next facet\n", " facet = facets.pop()\n", " # Set the facet to a randomly selected value\n", " params[f\"l-{facet}\"] = get_random_facet_value(params, facet)\n", " total = get_total_results(params)\n", " # print(total)\n", " # print(response.url)\n", " # If we've ended up with some results, then select one (of the first 100) at random\n", " if total > 0:\n", " params[\"n\"] = \"100\"\n", " response = s.get(API_URL, params=params)\n", " data = response.json()\n", " article = random.choice(data[\"response\"][\"zone\"][0][\"records\"][\"article\"])\n", " return article" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get any old article..." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'id': '153441823',\n", " 'url': '/newspaper/153441823',\n", " 'heading': 'Advertising',\n", " 'category': 'Advertising',\n", " 'title': {'id': '765',\n", " 'value': 'Every Week (Bairnsdale, Vic. : 1914 - 1918)'},\n", " 'date': '1917-02-15',\n", " 'page': 4,\n", " 'pageSequence': 4,\n", " 'relevance': {'score': '0.105540276', 'value': 'may have relevance'},\n", " 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/153441823?searchTerm=%22on%22'}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_random_article()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get a random article about pademelons" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'id': '67402600',\n", " 'url': '/newspaper/67402600',\n", " 'heading': 'THE BUCHAN CAVES. BEAUTIFUL NATURE. EQUAL TO JENOLAN.',\n", " 'category': 'Article',\n", " 'title': {'id': '209', 'value': 'The Maffra Spectator (Vic. : 1882 - 1920)'},\n", " 'date': '1907-12-30',\n", " 'page': 3,\n", " 'pageSequence': 3,\n", " 'relevance': {'score': '7.7346535', 'value': 'very relevant'},\n", " 'snippet': '\"Where will we go for the holidays?\" is the usual question that has to be answered by people at this time of the year, and in due course they proceed',\n", " 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/67402600?searchTerm=pademelon'}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_random_article(query=\"pademelon\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get a random article from Tasmania" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'id': '232486642',\n", " 'url': '/newspaper/232486642',\n", " 'heading': 'Advertising',\n", " 'category': 'Advertising',\n", " 'title': {'id': '1236',\n", " 'value': 'The Cornwall Press and Commercial Advertiser (Launceston, Tas. : 1829)'},\n", " 'date': '1829-03-17',\n", " 'page': 4,\n", " 'pageSequence': 4,\n", " 'relevance': {'score': '0.052769434', 'value': 'may have relevance'},\n", " 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/232486642?searchTerm=%22to%22'}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_random_article(state=\"Tasmania\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get a random article from the _Sydney Morning Herald_" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'id': '14222939',\n", " 'url': '/newspaper/14222939',\n", " 'heading': 'SEARCHING FOR THE PERTHSHIRE. RETURN OF THE TUGBOAT HERO. WAS WITHIN TWENTY MILES OF THE DISABLED SHIP.',\n", " 'category': 'Article',\n", " 'title': {'id': '35',\n", " 'value': 'The Sydney Morning Herald (NSW : 1842 - 1954)'},\n", " 'date': '1899-06-03',\n", " 'page': 9,\n", " 'pageSequence': 9,\n", " 'relevance': {'score': '6.375327', 'value': 'very relevant'},\n", " 'snippet': \"Anothor vessel reached port yesterday after a 12 days' cruise in search of tho Perthshire. She was the tug Hero, belonging to Messrs J Fenwick and Co, and as she lay at the Floating Jetty, Circular Quay,\",\n", " 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/14222939?searchTerm=%22hadn%22'}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_random_article(title=\"35\", category=\"Article\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get a random illustrated article" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'id': '258155681',\n", " 'url': '/newspaper/258155681',\n", " 'heading': 'SMARTLY RESCUED. A SYDNEY LADY IN PERIL.',\n", " 'category': 'Article',\n", " 'title': {'id': '1586', 'value': 'The Nowra Colonist (NSW : 1899 - 1904)'},\n", " 'date': '1899-09-06',\n", " 'page': 4,\n", " 'pageSequence': 4,\n", " 'relevance': {'score': '1.7316521', 'value': 'likely to be relevant'},\n", " 'snippet': 'The following particulars just to ha[?] of a smart rescue effected under difficulties should be of interest to must [?] Mrs David [?] of 257',\n", " 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/258155681?searchTerm=%22p%22'}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_random_article(illustrated=\"true\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get a random illustrated advertisement from the _Australian Womens Weekly_" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'id': '47119929',\n", " 'url': '/newspaper/47119929',\n", " 'heading': 'Advertising',\n", " 'category': 'Advertising',\n", " 'title': {'id': '112',\n", " 'value': \"The Australian Women's Weekly (1933 - 1982)\"},\n", " 'date': '1957-06-12',\n", " 'page': 11,\n", " 'pageSequence': 11,\n", " 'relevance': {'score': '0.43551445', 'value': 'likely to be relevant'},\n", " 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/47119929?searchTerm=%22own%22'}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_random_article(title=\"112\", illustrated=\"true\", category=\"Advertising\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get a random cartoon" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'id': '159624812',\n", " 'url': '/newspaper/159624812',\n", " 'heading': 'ALL DOUBTS REMOVED.',\n", " 'category': 'Article',\n", " 'title': {'id': '644',\n", " 'value': 'The Gloucester Advocate (NSW : 1905 - 1954)'},\n", " 'date': '1929-05-21',\n", " 'page': 4,\n", " 'pageSequence': 4,\n", " 'relevance': {'score': '15.173103', 'value': 'very relevant'},\n", " 'snippet': 'THE BOSS: \"Smoke cigars, Tompkins?\" TOMPKINS: \"Yes, sir. I\\'m very partial to a good cigar.\" THE BOSS: \"Umph—then I\\'ll lock \\'em up.\"',\n", " 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/159624812?searchTerm=%22up%22'}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_random_article(illtype=\"Cartoon\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get a random article from 1930" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'id': '143816816',\n", " 'url': '/newspaper/143816816',\n", " 'heading': 'Political Activities of the C.P. Campaign in Murray',\n", " 'category': 'Article',\n", " 'title': {'id': '655',\n", " 'value': 'Riverina Recorder (Balranald, Moulamein, NSW : 1887 - 1944)'},\n", " 'date': '1930-09-27',\n", " 'page': 3,\n", " 'pageSequence': 3,\n", " 'relevance': {'score': '1.1914315', 'value': 'likely to be relevant'},\n", " 'snippet': 'An early review reveals a substantial foundation for high hopes in the prospects of a Country Party representative being returned, but hard work',\n", " 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/143816816?searchTerm=%22g%22'}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_random_article(year=\"1930\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get a random article tagged 'poem'" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'id': '63224945',\n", " 'url': '/newspaper/63224945',\n", " 'heading': 'Sketches with Pen THE PREVISION OF JOHN BROWN.',\n", " 'category': 'Article',\n", " 'title': {'id': '49',\n", " 'value': 'The Australasian Sketcher with Pen and Pencil (Melbourne, Vic. : 1873 - 1889)'},\n", " 'date': '1889-05-16',\n", " 'page': 67,\n", " 'pageSequence': 67,\n", " 'relevance': {'score': '1.7888632', 'value': 'likely to be relevant'},\n", " 'snippet': 'Brown was weeping likewise cursing; and with amplitude of reason, For a letter had been handed him that very afternoon Which proved he had been cruelly begotten out of season',\n", " 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/63224945?searchTerm=%223%22'}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_random_article(publictag=\"poem\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Speed test" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 4.60 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "4.92 s ± 2.6 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "get_random_article()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" }, "vscode": { "interpreter": { "hash": "830a82eb7dea304d2620966a757085a27772f251eec0147d9ba49327e586e1bd" } } }, "nbformat": 4, "nbformat_minor": 4 }