{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvesting Australian Women's Weekly covers\n", "### (or all the front pages of any digitised newspaper)\n", "\n", "Somewhat confusingly, the *Australian Women's Weekly* is in with Trove's digitised newspapers and not the rest of the magazines. There are notebooks in the GLAM Workbench's journals section to help [harvest all of a journal's covers](https://glam-workbench.github.io/trove-journals/#get-covers-or-any-other-pages-from-a-digitised-journal-in-trove) as images, so I thought I should do the same for the Weekly. \n", "\n", "Just change the `TITLE_ID`, `START_DATE`, `END_DATE`, and `PREFIX`, to harvest all the front pages of any digitised newspaper.\n", "\n", "## Harvest summary\n", "\n", "* The list of issues harvested is available [in this CSV](https://github.com/GLAM-Workbench/trove-newspapers/blob/58307d3ccae4d2c939ecb6aff59944f27d213842/data/aww-issues.csv).\n", "* 2,566 images were downloaded.\n", "* The full set of images is [available from Cloudstor](https://cloudstor.aarnet.edu.au/plus/s/NaKjoKNFOGXXDNN).\n", "* For easy browsing, I've compiled the images into a set of PDF files, one for each decade, available from Dropbox:\n", " * [1933 to 1939](https://www.dropbox.com/s/0j6zpeuw6tbey5k/aww-1933-1939.pdf?dl=0)\n", " * [1940 to 1949](https://www.dropbox.com/s/y1he8dd6h655weu/aww-1940-1949.pdf?dl=0)\n", " * [1950 to 1959](https://www.dropbox.com/s/i9gp9i51nofmlqo/aww-1950-1959.pdf?dl=0)\n", " * [1960 to 1969](https://www.dropbox.com/s/2of63tovcnphijo/aww-1960-1969.pdf?dl=0)\n", " * [1970 to 1979](https://www.dropbox.com/s/f2yxpg8u4dx5uf2/aww-1970-1979.pdf?dl=0)\n", " * [1980 to 1982](https://www.dropbox.com/s/xanohtas1fi7eu4/aww-1980-1982.pdf?dl=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import what we need" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from requests.adapters import HTTPAdapter\n", "from requests.packages.urllib3.util.retry import Retry\n", "import re\n", "from pathlib import Path\n", "from tqdm.auto import tqdm\n", "import time\n", "import pandas as pd\n", "from IPython.display import display, FileLink\n", "\n", "s = requests.Session()\n", "retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])\n", "s.mount('https://', HTTPAdapter(max_retries=retries))\n", "s.mount('http://', HTTPAdapter(max_retries=retries))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set some options\n", "\n", "Modify the values below as required." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "API_KEY = 'YOUR API KEY'\n", "\n", "# The id of the newspaper you want to harvest\n", "TITLE_ID = '112' # 112 is the AWW\n", "\n", "# Range of years to harvest\n", "START_YEAR = 1933\n", "END_YEAR = 1983\n", "\n", "# A prefix to use in file names, if None then the title_id will be used\n", "PREFIX = 'aww'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define some functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TITLE_URL = f'https://api.trove.nla.gov.au/v2/newspaper/title/{TITLE_ID}'\n", "\n", "def get_current_year(years, year):\n", " '''\n", " Get data for the current year from the dictionary of years.\n", " '''\n", " for year_data in years:\n", " if year_data['date'] == str(year):\n", " return year_data\n", "\n", "def get_issues():\n", " '''\n", " Get all the issue details by looping through the range of years.\n", " Returns a list of issues.\n", " '''\n", " params = {\n", " 'encoding': 'json',\n", " 'include': 'years',\n", " 'key': API_KEY\n", " }\n", " issues = []\n", " for year in tqdm(range(START_YEAR, END_YEAR), desc='Issues'):\n", " # Setting 'range' tells the API to give us a list of issue dates & urls within that date range\n", " date_range = f'{year}0101-{year}1231'\n", " params['range'] = date_range\n", " # Get the data\n", " response = s.get(TITLE_URL, params=params)\n", " data = response.json()\n", " # Extract the details for the current year\n", " year_data = get_current_year(data['newspaper']['year'], year)\n", " # Save issue details\n", " for issue in year_data['issue']:\n", " issues.append(issue)\n", " time.sleep(0.2)\n", " return issues\n", "\n", "def get_file_prefix():\n", " '''\n", " Set the prefix to be used in filenames and data directory.\n", " Defaults to title id if prefix is not set\n", " '''\n", " if PREFIX:\n", " file_prefix = PREFIX\n", " else:\n", " file_prefix = TITLE_ID\n", " return file_prefix\n", "\n", "def create_output_dir(file_prefix):\n", " '''\n", " Create output directory.\n", " '''\n", " dir_path = Path('data', file_prefix)\n", " dir_path.mkdir(parents=True, exist_ok=True)\n", " return dir_path\n", "\n", "def download_page(page_id, size, file_path):\n", " '''\n", " Download page image using the supplied id.\n", " Size range is 1 to 7 (7 being the highest res)\n", " '''\n", " # Format the page url ising the page id\n", " page_url = f'http://trove.nla.gov.au/ndp/imageservice/nla.news-page{page_id}/level{size}'\n", " # Download the image\n", " response = s.get(page_url)\n", " file_path.write_bytes(response.content)\n", " time.sleep(0.5)\n", "\n", "def harvest_covers(size=5):\n", " '''\n", " Get a list of issues of the title.\n", " Loop through the issues downloading each front page/cover.\n", " Return issue metadata.\n", " '''\n", " # Get a list of issues\n", " issues = get_issues()\n", " # Loop through the issues\n", " for issue in tqdm(issues, desc='Pages'):\n", " # Request the issue url\n", " response = s.get(issue['url'])\n", " # The issue url will be redirected to a page url\n", " # Extract the page id from the page url\n", " page_id = re.search(r'(\\d+)$', response.url).group(1)\n", " # Save page id to metadata\n", " issue['page_id'] = page_id\n", " # Set up dirs and files\n", " file_prefix = get_file_prefix()\n", " dir_path = create_output_dir(file_prefix)\n", " file_path = Path(dir_path, f'{file_prefix}-{issue[\"date\"].replace(\"-\", \"\")}-page{page_id}.jpg')\n", " # If 
  { "cell_type": "markdown", "metadata": {}, "source": [
    "----\n",
    "\n",
    "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/). \n",
    "Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb)."
  ] }
 ],
 "metadata": {
  "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" },
  "language_info": {
   "codemirror_mode": { "name": "ipython", "version": 3 },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}