{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get a list of Trove newspapers that doesn't include government gazettes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Trove API includes an option to retrieve details of digitised newspaper titles. Version 2 of the API added a separate option to get details of government gazettes. However the original `newspaper/titles` requests actually returns *both* the newspaper and gazette titles, so there's no way of getting just the newspaper titles. This notebook explains the problem and provides a simple workaround."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add your Trove API key below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "api_key = 'YOUR API KEY GOES HERE'\n",
    "print('Your API key is: {}'.format(api_key))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The problem\n",
    "\n",
    "Getting a list of digitised newspapers or gazettes in Trove is easy, you just fire off a request to one of these endpoints:\n",
    "\n",
    "* `https://api.trove.nla.gov.au/v2/newspaper/titles/`\n",
    "* `https://api.trove.nla.gov.au/v2/gazette/titles/`\n",
    "\n",
    "Let's create a function to get either the `newspaper` or `gazette` results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_titles_df(title_type):\n",
    "    # Set default params\n",
    "    params = {\n",
    "        'key': api_key,\n",
    "        'encoding': 'json',\n",
    "    }\n",
    "    \n",
    "    # Make the request to the titles endpoint and get the JSON data\n",
    "    data = requests.get('https://api.trove.nla.gov.au/v2/{}/titles'.format(title_type), params=params).json()\n",
    "    titles = []\n",
    "    \n",
    "    # Loop through the title records, saving the name and id\n",
    "    for title in data['response']['records']['newspaper']:\n",
    "        titles.append({'title': title['title'], 'id': int(title['id'])})\n",
    "        \n",
    "    # Convert to a dataframe\n",
    "    df = pd.DataFrame(titles)\n",
    "    return df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use the function to get all the newspaper titles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Canberra Community News (ACT : 1925 - 1927)</td>\n",
       "      <td>166</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Canberra Illustrated: A Quarterly Magazine (AC...</td>\n",
       "      <td>165</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Federal Capital Pioneer (Canberra, ACT : 1924 ...</td>\n",
       "      <td>69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Good Neighbour (ACT : 1950 - 1969)</td>\n",
       "      <td>871</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Student Notes/Canberra University College Stud...</td>\n",
       "      <td>665</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               title   id\n",
       "0        Canberra Community News (ACT : 1925 - 1927)  166\n",
       "1  Canberra Illustrated: A Quarterly Magazine (AC...  165\n",
       "2  Federal Capital Pioneer (Canberra, ACT : 1924 ...   69\n",
       "3                 Good Neighbour (ACT : 1950 - 1969)  871\n",
       "4  Student Notes/Canberra University College Stud...  665"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newspapers_df = get_titles_df('newspaper')\n",
    "newspapers_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many are there?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1567, 2)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newspapers_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Everything looks ok, but if we search inside the results for titles that include the word 'Gazette' we find that the government gazettes are all included."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Papua New Guinea Government Gazette (1971 - 1975)</td>\n",
       "      <td>1372</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Territory of Papua and New Guinea Government G...</td>\n",
       "      <td>1371</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Territory of Papua Government Gazette (Papua N...</td>\n",
       "      <td>1369</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Territory of Papua-New Guinea Government Gazet...</td>\n",
       "      <td>1370</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>Australian Government Gazette (National : 1973...</td>\n",
       "      <td>1288</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Australian Government Gazette. Chemical (Natio...</td>\n",
       "      <td>1355</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>Australian Government Gazette. General (Nation...</td>\n",
       "      <td>1289</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>Australian Government Gazette. Periodic (Natio...</td>\n",
       "      <td>1294</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>Australian Government Gazette. Public Service ...</td>\n",
       "      <td>1308</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>Australian Government Gazette. Special (Nation...</td>\n",
       "      <td>1286</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>Commonwealth of Australia Gazette (National : ...</td>\n",
       "      <td>1214</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>Commonwealth of Australia Gazette. Agricultura...</td>\n",
       "      <td>1363</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>Commonwealth of Australia Gazette. Australian ...</td>\n",
       "      <td>1358</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>Commonwealth of Australia Gazette. Australian ...</td>\n",
       "      <td>1360</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>Commonwealth of Australia Gazette. Australian ...</td>\n",
       "      <td>1361</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>Commonwealth of Australia Gazette. Australian ...</td>\n",
       "      <td>1351</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>Commonwealth of Australia Gazette. Australian ...</td>\n",
       "      <td>1350</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>Commonwealth of Australia Gazette. Australian ...</td>\n",
       "      <td>1356</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>Commonwealth of Australia Gazette. Australian ...</td>\n",
       "      <td>1357</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>Commonwealth of Australia Gazette. Business (N...</td>\n",
       "      <td>1343</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                title    id\n",
       "12  Papua New Guinea Government Gazette (1971 - 1975)  1372\n",
       "17  Territory of Papua and New Guinea Government G...  1371\n",
       "18  Territory of Papua Government Gazette (Papua N...  1369\n",
       "19  Territory of Papua-New Guinea Government Gazet...  1370\n",
       "21  Australian Government Gazette (National : 1973...  1288\n",
       "22  Australian Government Gazette. Chemical (Natio...  1355\n",
       "23  Australian Government Gazette. General (Nation...  1289\n",
       "24  Australian Government Gazette. Periodic (Natio...  1294\n",
       "25  Australian Government Gazette. Public Service ...  1308\n",
       "26  Australian Government Gazette. Special (Nation...  1286\n",
       "27  Commonwealth of Australia Gazette (National : ...  1214\n",
       "28  Commonwealth of Australia Gazette. Agricultura...  1363\n",
       "29  Commonwealth of Australia Gazette. Australian ...  1358\n",
       "30  Commonwealth of Australia Gazette. Australian ...  1360\n",
       "31  Commonwealth of Australia Gazette. Australian ...  1361\n",
       "32  Commonwealth of Australia Gazette. Australian ...  1351\n",
       "33  Commonwealth of Australia Gazette. Australian ...  1350\n",
       "34  Commonwealth of Australia Gazette. Australian ...  1356\n",
       "35  Commonwealth of Australia Gazette. Australian ...  1357\n",
       "36  Commonwealth of Australia Gazette. Business (N...  1343"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newspapers_df.loc[newspapers_df['title'].str.contains('Gazette')][:20]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The solution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can't just filter the results on the word 'Gazette' as a number of newspapers also include the word in their titles. Instead, we'll get a list of the gazettes using the `gazette/titles` endpoint and subtract these titles from the list of newspapers.\n",
    "\n",
    "Let's get the gazettes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Papua New Guinea Government Gazette (1971 - 1975)</td>\n",
       "      <td>1372</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Territory of Papua and New Guinea Government G...</td>\n",
       "      <td>1371</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Territory of Papua Government Gazette (Papua N...</td>\n",
       "      <td>1369</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Territory of Papua-New Guinea Government Gazet...</td>\n",
       "      <td>1370</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Australian Government Gazette (National : 1973...</td>\n",
       "      <td>1288</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               title    id\n",
       "0  Papua New Guinea Government Gazette (1971 - 1975)  1372\n",
       "1  Territory of Papua and New Guinea Government G...  1371\n",
       "2  Territory of Papua Government Gazette (Papua N...  1369\n",
       "3  Territory of Papua-New Guinea Government Gazet...  1370\n",
       "4  Australian Government Gazette (National : 1973...  1288"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gazettes_df = get_titles_df('gazette')\n",
    "gazettes_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(37, 2)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gazettes_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll create a new dataframe that only includes titles from `df_newspapers` if they **are not in** `df_gazettes`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "newspapers_not_gazettes_df = newspapers_df[~newspapers_df['id'].isin(gazettes_df['id'])]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1530, 2)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newspapers_not_gazettes_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If it worked properly the number of titles in the new dataframe should equal the number in the newspapers dataframe minus the number in the gazettes dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "newspapers_not_gazettes_df.shape[0] == newspapers_df.shape[0] - gazettes_df.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Yay!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}