{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualise the total number of newspaper articles in Trove by year and state\n", "\n", "Trove currently includes more than 200 million digitised newspaper articles published between 1803 and 2015. In this notebook we explore how those newspaper articles are distributed over time, and by state.\n", "\n", "1. [Setting things up](#1.-Setting-things-up)\n", "2. [Show the total number of articles per year](#2.-Show-the-total-number-of-articles-per-year)\n", "3. [Show the number of newspaper articles by state](#3.-Show-the-number-of-newspaper-articles-by-state)\n", "4. [Show the number of articles by state and year](#4.-Show-the-number-of-articles-by-state-and-year)\n", "5. [Combine everything and make it interactive!](#5.-Combine-everything-and-make-it-interactive!)\n", "6. [Further reading](#6.-Further-reading)\n", "\n", "If you want to skip to the final visualisation without running any of the notebook code, here's [an HTML version to play with](https://glam-workbench.github.io/examples/trove-newspaper-articles-by-state-and-year.html). But come back here to find out how to DIY!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

\n", "\n", "

\n", " Some tips:\n", "

\n", "

\n", "\n", "

Is this thing on? If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to load a live version running on Binder.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Setting things up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import what we need" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import time\n", "from operator import itemgetter # used for sorting\n", "\n", "import altair as alt\n", "import pandas as pd # makes manipulating the data easier\n", "import requests\n", "from IPython.display import clear_output\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "from tqdm.auto import tqdm\n", "\n", "# Make sure data directory exists\n", "os.makedirs(\"docs\", exist_ok=True)\n", "\n", "# Create a session that will automatically retry on server errors\n", "s = requests.Session()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])\n", "s.mount(\"http://\", HTTPAdapter(max_retries=retries))\n", "s.mount(\"https://\", HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# Load variables from the .env file if it exists\n", "# Use %%capture to suppress messages\n", "%load_ext dotenv\n", "%dotenv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Enter a Trove API key\n", "\n", "We're going to get our data from the Trove API. You'll need to get your own [Trove API key](http://help.nla.gov.au/trove/building-with-trove/api) and enter it below." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Insert your Trove API key\n", "API_KEY = \"YOUR API KEY\"\n", "\n", "# Use api key value from environment variables if it is available\n", "if os.getenv(\"TROVE_API_KEY\"):\n", " API_KEY = os.getenv(\"TROVE_API_KEY\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set some default parameters" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Basic parameters to send to the Trove API, we'll add more later.\n", "params = {\n", " \"zone\": \"newspaper\",\n", " \"key\": API_KEY,\n", " \"encoding\": \"json\",\n", " \"n\": 0, # We don't need any records, just the facets!\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define some functions" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def get_results(params):\n", " \"\"\"\n", " Get JSON response data from the Trove API.\n", " Parameters:\n", " params\n", " Returns:\n", " JSON formatted response data from Trove API\n", " \"\"\"\n", " response = s.get(\n", " \"https://api.trove.nla.gov.au/v2/result\", params=params, timeout=30\n", " )\n", " response.raise_for_status()\n", " # print(response.url) # This shows us the url that's sent to the API\n", " data = response.json()\n", " return data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Show the total number of articles per year\n", "\n", "In [another notebook](visualise-searches-over-time.ipynb), I look at different ways of visualising Trove newspaper searches over time. In this notebook we're going to focus on showing everything. To search for everything, we set the `q` parameter to a single space." 
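, "\n", "
To see what the single-space query looks like on the wire, here is a minimal sketch that builds the request URL with the standard library instead of sending anything. It assumes `urllib.parse.urlencode` encodes the parameters in much the same way `requests` will, and `YOUR_API_KEY` is only a placeholder.

```python
from urllib.parse import urlencode

# Placeholder values mirroring the default parameters above (the key is not real)
example_params = {
    'zone': 'newspaper',
    'key': 'YOUR_API_KEY',
    'encoding': 'json',
    'n': 0,
    'q': ' ',  # a single space asks for every article
}

# Assumption: this approximates the URL that requests builds from the same params
example_url = 'https://api.trove.nla.gov.au/v2/result?' + urlencode(example_params)
print(example_url)
```

The space should show up at the end of the URL as `q=+`, which is how an encoded space looks in a query string.
"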
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Set the q parameter to a single space to get ALL THE ARTICLES\n", "params[\"q\"] = \" \"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can find the total number of newspaper articles in Trove." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are currently 233,666,567 articles in Trove!\n" ] } ], "source": [ "# Get the JSON data from the Trove API using our parameters\n", "data = get_results(params)\n", "\n", "# Navigate down the JSON hierarchy to find the total results\n", "total = int(data[\"response\"][\"zone\"][0][\"records\"][\"total\"])\n", "\n", "# Print the results\n", "print(\"There are currently {:,} articles in Trove!\".format(total))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, that's not all that useful. What would be more interesting is to show the total number of articles published each year. To do this we use the `decade` and `year` facets. There are more details [in this notebook](visualise-searches-over-time.ipynb) but, in short, we have to loop through the decades from 1800 to 2010, getting the total number of articles for each year within that decade.\n", "\n", "These two functions do just that." 
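, "\n", "
The default decade values of 180 and 201 used by `get_facet_data()` appear to be the first three digits of a year, so each value stands for ten calendar years. A small sketch of that arithmetic follows; `decade_to_years` is a hypothetical helper for illustration and is not part of the notebook.

```python
def decade_to_years(decade):
    '''Return the ten calendar years covered by a decade value such as 180.'''
    first_year = decade * 10
    return list(range(first_year, first_year + 10))


# The same range that get_facet_data() loops over by default
decades = list(range(180, 201 + 1))

print(len(decades), 'requests, one per decade')
print('first year:', decade_to_years(decades[0])[0])
print('last year:', decade_to_years(decades[-1])[-1])
```

With the defaults, that works out to 22 API calls for one full pass through the decades, which is why the function pauses briefly between requests.
"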
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def get_facets(data):\n", "    \"\"\"\n", "    Loop through facets in Trove API response, saving terms and counts.\n", "    Parameters:\n", "        data - JSON formatted response data from Trove API\n", "    Returns:\n", "        A list of dictionaries containing: 'term', 'total_results'\n", "    \"\"\"\n", "    facets = []\n", "    try:\n", "        # The facets are buried a fair way down in the results\n", "        # Note that if you ask for more than one facet, you'll have to use the facet['name'] param to find the one you want\n", "        # In this case there's only one facet, so we can just grab the list of terms (which are in fact the results by year)\n", "        for term in data[\"response\"][\"zone\"][0][\"facets\"][\"facet\"][\"term\"]:\n", "\n", "            # Get the year and the number of results, converting the count to an integer, before adding to our results\n", "            facets.append({\"term\": term[\"search\"], \"total_results\": int(term[\"count\"])})\n", "\n", "        # Sort facets by year\n", "        facets.sort(key=itemgetter(\"term\"))\n", "    except TypeError:\n", "        pass\n", "    return facets\n", "\n", "\n", "def get_facet_data(params, start_decade=180, end_decade=201):\n", "    \"\"\"\n", "    Loop through the decades from 'start_decade' to 'end_decade',\n", "    getting the number of search results for each year from the year facet.\n", "    Combine all the results into a single list.\n", "    Parameters:\n", "        params - parameters to send to the API\n", "        start_decade\n", "        end_decade\n", "    Returns:\n", "        A list of dictionaries containing 'term' (the year), 'total_results' for the complete\n", "        period between the start and end decades.\n", "    \"\"\"\n", "    # Create a list to hold the facets data\n", "    facet_data = []\n", "\n", "    # Loop through the decades\n", "    for decade in tqdm(range(start_decade, end_decade + 1)):\n", "\n", "        # Avoid confusion by copying the params before we change anything.\n", "        search_params = params.copy()\n", "\n", "        # Add decade value to params\n", 
"        search_params[\"l-decade\"] = decade\n", "\n", "        # Get the data from the API\n", "        data = get_results(search_params)\n", "\n", "        # Get the facets from the data and add to facet_data\n", "        facet_data += get_facets(data)\n", "\n", "        # Try to avoid hitting the API rate limit - increase this if you get 403 errors\n", "        time.sleep(0.2)\n", "\n", "    # Remove the progress bar (you can also set leave=False in tqdm, but that still leaves white space in Jupyter Lab)\n", "    clear_output()\n", "    return facet_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we need to tell the API we want the `year` facet values. We do this by setting the `facet` value in our parameters." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "params[\"facet\"] = \"year\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it's just a matter of feeding our parameters to the `get_facet_data()` function." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "total_facets = get_facet_data(params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's convert the data to a Pandas dataframe so we can feed it to Altair, our charting program." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>term</th><th>total_results</th></tr></thead>\n", "<tbody>\n",
"<tr><th>0</th><td>1803</td><td>526</td></tr>\n",
"<tr><th>1</th><td>1804</td><td>619</td></tr>\n",
"<tr><th>2</th><td>1805</td><td>430</td></tr>\n",
"<tr><th>3</th><td>1806</td><td>367</td></tr>\n",
"<tr><th>4</th><td>1807</td><td>134</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " term total_results\n", "0 1803 526\n", "1 1804 619\n", "2 1805 430\n", "3 1806 367\n", "4 1807 134" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_total = pd.DataFrame(total_facets)\n", "df_total.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's display the results as a simple line chart using Altair" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Feed Altair our dataframe\n", "alt.Chart(df_total).mark_line(point=True).encode(\n", " # Years along the X axis\n", " x=alt.X(\"term:Q\", axis=alt.Axis(format=\"c\", title=\"Year\")),\n", " # Number of articles on the Y axis (formatted with thousands separators)\n", " y=alt.Y(\"total_results:Q\", axis=alt.Axis(format=\",d\", title=\"Number of articles\")),\n", " # Use tooltips to display the year and number of articles when you hover over a point\n", " tooltip=[\n", " alt.Tooltip(\"term:Q\", title=\"Year\"),\n", " alt.Tooltip(\"total_results:Q\", title=\"Articles\", format=\",\"),\n", " ],\n", ").properties(width=700, height=400)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmmm, that *is* interesting. There's a significant peak in the number of articles around 1915. Why might that be? Were there more newspapers? Were more articles written because of the war?\n", "\n", "Nope. It's because of funding and digitisation priorities. Not all Australian newspapers are in Trove. Some have been lost, and many are just waiting to be digitised. Funding is always limited, so priorities have to be set. In the lead up to the centenary of World War I, it was decided to focus on digitising newspapers from that period. This chart reflects those priorities. This is not the number of newspaper articles published in Australia, it's the number of newspaper articles that have been digitised and made available through Trove. It's important to remain aware of this as you use Trove.\n", "\n", "But what about the dramatic drop-off in the number of articles after 1954? Was it the impact of other media technologies? Nope. It's because of copyright. Australian copyright law is complex, but it's likely that much of the material published in newspapers before 1955 is out of copyright. 
Therefore, in undertaking its digitisation program, the National Library decided that it could best manage the risks associated with copyright by focusing on newspapers published before 1955. You'll see, however, that there are a few exceptions. Some newspapers published after 1954 have been digitised on the basis of agreements between the publisher and the National Library. If you'd like to see what newspapers are available post-1954, have a look at [Beyond the copyright cliff of death](Beyond_the_copyright_cliff_of_death.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Show the number of newspaper articles by state\n", "\n", "The setting of priorities for newspaper digitisation has, in the past, been a collaborative effort between the National and State libraries. Now the focus seems to be more on a 'contributor model' where local communities or organisations fund the digitisation of their own newspapers. These sorts of arrangements obviously affect what gets digitised and when, so let's see what Trove's newspapers look like when we divide them up by state.\n", "\n", "To do this, we'll change the `facet` parameter to 'state'." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "params[\"facet\"] = \"state\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we're not harvesting data for multiple decades, we only need to make one call to the API." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>state</th><th>total_results</th></tr></thead>\n", "<tbody>\n",
"<tr><th>0</th><td>New South Wales</td><td>82646819</td></tr>\n",
"<tr><th>1</th><td>Queensland</td><td>41089936</td></tr>\n",
"<tr><th>2</th><td>Victoria</td><td>40052916</td></tr>\n",
"<tr><th>3</th><td>Western Australia</td><td>24596770</td></tr>\n",
"<tr><th>4</th><td>South Australia</td><td>24562882</td></tr>\n",
"<tr><th>5</th><td>Tasmania</td><td>15706736</td></tr>\n",
"<tr><th>6</th><td>ACT</td><td>3296997</td></tr>\n",
"<tr><th>7</th><td>International</td><td>851152</td></tr>\n",
"<tr><th>8</th><td>National</td><td>452926</td></tr>\n",
"<tr><th>9</th><td>Northern Territory</td><td>409433</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " state total_results\n", "0 New South Wales 82646819\n", "1 Queensland 41089936\n", "2 Victoria 40052916\n", "3 Western Australia 24596770\n", "4 South Australia 24562882\n", "5 Tasmania 15706736\n", "6 ACT 3296997\n", "7 International 851152\n", "8 National 452926\n", "9 Northern Territory 409433" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the data from the API\n", "data = get_results(params)\n", "facets = []\n", "\n", "# Loop through the facet terms (each term will be a state)\n", "for term in data[\"response\"][\"zone\"][0][\"facets\"][\"facet\"][\"term\"]:\n", "\n", " # Get the state and the number of results, and convert it to integers, before adding to our results\n", " facets.append({\"state\": term[\"search\"], \"total_results\": int(term[\"count\"])})\n", "\n", "# Convert to a dataframe\n", "df_states = pd.DataFrame(facets)\n", "df_states" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thing that's obvious is that not all 'states' are states. This facet has been expanded to incorporate 'National', and 'International' newspapers. Let's look at how the numbers in each 'state' compare." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Chart the results\n", "alt.Chart(df_states).mark_bar().encode(\n", " x=alt.X(\"state:N\", title=\"State\"),\n", " y=alt.Y(\"total_results:Q\", axis=alt.Axis(format=\",d\", title=\"Number of articles\")),\n", " tooltip=[\n", " alt.Tooltip(\"state\", title=\"State\"),\n", " alt.Tooltip(\"total_results:Q\", title=\"Articles\", format=\",\"),\n", " ],\n", ").properties(width=700, height=400)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are almost twice as many newspaper articles from NSW as from anywhere else. Why? Again, we might be tempted to think that it's just because more newspapers were published in NSW. This could be true, but someone still has to pay to get them digitised. In this case, the State Library of NSW's Digital Excellence Program has supported large amounts of digitisation.\n", "\n", "Once again, it's a reminder that digital collections are constructed. Things like priorities and funding introduce biases that are not easily visible through a standard search interface." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Show the number of articles by state and year\n", "\n", "To look a bit further into this, let's combine the two charts above to show the number of newspaper articles for each state over time.\n", "\n", "We'll start by setting the `facet` value back to 'year', and the `q` parameter back to a single space (for everything)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "params[\"facet\"] = \"year\"\n", "params[\"q\"] = \" \"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the states data we just collected to get a list of possible values for the `state` facet." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['New South Wales',\n", " 'Queensland',\n", " 'Victoria',\n", " 'Western Australia',\n", " 'South Australia',\n", " 'Tasmania',\n", " 'ACT',\n", " 'International',\n", " 'National',\n", " 'Northern Territory']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "states = df_states[\"state\"].to_list()\n", "states" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll define a function to loop through the list of states getting the number of articles for each year." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def get_state_facets(params, states):\n", "    \"\"\"\n", "    Loop through the supplied list of states, getting the year-by-year results for each.\n", "    Parameters:\n", "        params - basic parameters to send to the API\n", "        states - a list of states to apply using the state facet\n", "    Returns:\n", "        A dataframe\n", "    \"\"\"\n", "    dfs = []\n", "    these_params = params.copy()\n", "\n", "    # Loop through the supplied list of states\n", "    for state in states:\n", "\n", "        # Set the state facet to the current state value\n", "        these_params[\"l-state\"] = state\n", "\n", "        # Get year facets for this state & query\n", "        facet_data = get_facet_data(these_params)\n", "\n", "        # Convert the results to a dataframe\n", "        df = pd.DataFrame(facet_data)\n", "\n", "        # Add a state column to the dataframe and set its value to the current state\n", "        df[\"state\"] = state\n", "\n", "        # Add this df to the list of dfs\n", "        dfs.append(df)\n", "\n", "        # Pause between states to go easy on the API\n", "        time.sleep(1)\n", "    # Concatenate all the dataframes and return the result\n", "    return pd.concat(dfs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're ready to get the data!" 
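, "\n", "
Before running the harvest, it can help to estimate how long it will take. This is a rough sketch, assuming the ten state values listed above, the default decade range, and the pauses written into the two functions. The real run will be slower because each API call also takes time.

```python
# Values taken from the notebook: ten state facet values and decades 180 to 201
state_count = 10
decade_count = len(range(180, 201 + 1))

# One API call per decade for every state
total_requests = state_count * decade_count

# Pauses only: 0.2 seconds after each request plus 1 second after each state
minimum_seconds = total_requests * 0.2 + state_count * 1

print(total_requests, 'requests')
print('at least', round(minimum_seconds), 'seconds of pauses')
```

So expect a couple of hundred requests and roughly a minute of waiting before any network time is counted.
"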
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# GET ALL THE DATA!\n", "df_states_years = get_state_facets(params, states)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmm, how are we going to visualise the results? Let's start with a stacked area chart." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make a chart\n", "alt.Chart(df_states_years).mark_area().encode(\n", " # Show years on the X axis\n", " x=alt.X(\"term:Q\", axis=alt.Axis(format=\"c\", title=\"Year\")),\n", " # Show the stacked number of articles on the Y axis\n", " y=alt.Y(\n", " \"total_results:Q\",\n", " axis=alt.Axis(format=\",d\", title=\"Number of articles\"),\n", " stack=True,\n", " ),\n", " # Use color to distinguish the states\n", " color=\"state:N\",\n", " # And show the state / year / and total details on hover\n", " tooltip=[\n", " \"state\",\n", " alt.Tooltip(\"term:Q\", title=\"Year\"),\n", " alt.Tooltip(\"total_results:Q\", title=\"Articles\", format=\",\"),\n", " ],\n", ").properties(width=700, height=400)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's interesting. You might notice that most of the post-1954 articles come from the ACT – through the special arrangements mentioned above, the *Canberra Times* has been digitised through to 1995. See [Beyond the copyright cliff of death](Beyond_the_copyright_cliff_of_death.ipynb) for details.\n", "\n", "However, it's a bit hard to see the individual contributions of each state in this chart. So let's separate them out." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# A new chart that puts states in separate facets\n", "alt.Chart(df_states_years).mark_area().encode(\n", " # Years on the X axis\n", " x=alt.X(\"term:Q\", axis=alt.Axis(format=\"c\", title=\"Year\")),\n", " # Number of articles on the Y axis\n", " y=alt.Y(\"total_results:Q\", axis=alt.Axis(format=\",d\", title=\"Number of articles\")),\n", " # Split the data up into sub-charts based on state\n", " facet=alt.Facet(\"state:N\", columns=3),\n", " # Give each state a different color\n", " color=\"state:N\",\n", " # Details on hover\n", " tooltip=[\n", " \"state\",\n", " alt.Tooltip(\"term:Q\", title=\"Year\"),\n", " alt.Tooltip(\"total_results:Q\", title=\"Articles\", format=\",\"),\n", " ]\n", " # Note the columns value to set the number of sub-charts in each row\n", ").properties(width=200, height=150)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again we can see the dominance of NSW, and the post-1954 tail of ACT articles. But what's going on in Victoria? Let's zoom in for a closer look." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Chart of Victoria only\n", "# Filter the dataframe to just show Victoria\n", "alt.Chart(\n", " df_states_years.loc[df_states_years[\"state\"] == \"Victoria\"]\n", ").mark_area().encode(\n", " # Years on the X axis\n", " x=alt.X(\"term:Q\", axis=alt.Axis(format=\"c\", title=\"Year\")),\n", " # Number of articles on the Y axis\n", " y=alt.Y(\n", " \"total_results:Q\",\n", " axis=alt.Axis(format=\",d\", title=\"Number of articles\"),\n", " stack=True,\n", " ),\n", " # Try to match the color in the chart above\n", " color=alt.value(\"#8c564b\"),\n", " # Details on hover\n", " tooltip=[\n", " alt.Tooltip(\"term:Q\", title=\"Year\"),\n", " alt.Tooltip(\"total_results:Q\", title=\"Articles\", format=\",\"),\n", " ],\n", ").properties(\n", " width=700, height=400\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yikes! This is, of course, more evidence of the World War I effect. In the lead up to the centenary of WWI, Victoria decided to focus quite narrowly on the wartime period, digitising only the period between 1914 and 1919 for a [number of newspapers](https://www.slv.vic.gov.au/digitised-wwi-victorian-newspapers).\n", "\n", "Priorities have to be set, decisions about funding have to be made. The point, once again, is that these sorts of decisions shape what you get back in your search results. To really understand what it is we're working with, we have to be prepared to look beyond the search box." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Combine everything and make it interactive!\n", "\n", "For the sake of completeness, let's try combining everything to make an interactive chart that shows both the total number of articles per year, and the contributions of each state." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.VConcatChart(...)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use color as the selector for filtering the chart\n", "selection = alt.selection_multi(encodings=[\"color\"])\n", "\n", "# Color is based on the state, or gray if another state is selected\n", "color = alt.condition(\n", " selection, alt.Color(\"state:N\", legend=None), alt.value(\"lightgray\")\n", ")\n", "\n", "# A basic area chart, starts stacked, but when filtered shows only the active state\n", "area = (\n", " alt.Chart(df_states_years)\n", " .mark_area()\n", " .encode(\n", " # Years on the X axis\n", " x=alt.X(\"term:Q\", axis=alt.Axis(format=\"c\", title=\"Year\")),\n", " # Number of articles on the Y axis\n", " y=alt.Y(\n", " \"total_results:Q\",\n", " axis=alt.Axis(format=\",d\", title=\"Number of articles\"),\n", " stack=True,\n", " ),\n", " # Color uses the settings defined above\n", " color=color,\n", " # Details on hover\n", " tooltip=[\n", " \"state\",\n", " alt.Tooltip(\"term:Q\", title=\"Year\"),\n", " alt.Tooltip(\"total_results:Q\", title=\"Articles\", format=\",\"),\n", " ],\n", " )\n", " .properties(width=700, height=400)\n", " .transform_filter(\n", " # Filter data by state when a state is selected\n", " selection\n", " )\n", " .add_selection(selection)\n", ")\n", "\n", "# Add a bar chart showing the number of articles per state\n", "bar = (\n", " alt.Chart(df_states)\n", " .mark_bar()\n", " .encode(\n", " # State on the X axis\n", " x=alt.X(\"state:N\", title=\"State\"),\n", " # Number of articles on the Y axis\n", " y=alt.Y(\n", " \"total_results:Q\", axis=alt.Axis(format=\",d\", title=\"Number of articles\")\n", " ),\n", " # Details on hover\n", " tooltip=[\n", " alt.Tooltip(\"state\", title=\"State\"),\n", " alt.Tooltip(\"total_results:Q\", title=\"Articles\", format=\",\"),\n", " ],\n", " # Color based on state as defined above\n", " color=color,\n", " )\n", " .properties(width=700, 
height=150)\n", " .add_selection(\n", " # Highlight state when selected\n", " selection\n", " )\n", ")\n", "\n", "# For good measure we'll add an interactive legend (which is really just a mini chart)\n", "# This makes it easier to select states that don't have many articles\n", "legend = (\n", " alt.Chart(df_states)\n", " .mark_rect()\n", " .encode(\n", " # Show the states\n", " y=alt.Y(\"state:N\", axis=alt.Axis(orient=\"right\", title=None)),\n", " # Color as above\n", " color=color,\n", " )\n", " .add_selection(\n", " # Highlight on selection\n", " selection\n", " )\n", ")\n", "\n", "# Concatenate the charts -- area & legend horizontal, then bar added vertically\n", "(area | legend) & bar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can select a state by clicking on a colour in any of the three connected charts. To unselect, just click on an area with no colour.\n", "\n", "Let's save this chart as an HTML page, so we can share and play with it outside of this notebook." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "combined_chart = (area | legend) & bar\n", "combined_chart.save(\"docs/trove-newspaper-articles-by-state-and-year.html\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Here's the result!](https://glam-workbench.github.io/examples/trove-newspaper-articles-by-state-and-year.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. 
Further reading" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Tim Sherratt, ['Asking better questions: history, Trove and the risks that count'](http://discontents.com.au/asking-better-questions-history-trove-and-the-risks-that-count/), in *CopyFight*, Phillipa McGuiness (ed.), NewSouth Books, 2015.\n", "\n", "* Tim Sherratt, ['Seams and edges: dreams of aggregation, access, and discovery in a broken world'](http://discontents.com.au/seams-and-edges-dreams-of-aggregation-access-discovery-in-a-broken-world/), ALIA Online, 2015.\n", "\n", "* Tim Sherratt, ['Hacking heritage: understanding the limits of online access'](https://hcommons.org/deposits/item/hc:18733/), preprint of a chapter submitted for publication as part of *The Routledge International Handbook of New Digital Practices in Galleries, Libraries, Archives, Museums and Heritage Sites*, forthcoming 2019." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/). \n", "Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb).\n", "\n", "Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }